Extended Theory - Data Compression
Data compression is the process of reducing the size of our data by minimizing the number of bits needed to represent the same information. Data compression is widely used across various file types, including images, sound, video, and text. It serves as an excellent example of enhancing our understanding of string handling with C#.
Types of Data Compression
There are two main types of data compression:
-
Lossy Compression: In this method, some data is irreversibly removed from the original file during compression. As a result, the original file cannot be restored precisely when decompressed. Lossy compression is typically used for multimedia files where some loss of quality is acceptable, such as:
- Images: JPEG
- Audio: MP3
- Video: MP4
-
Lossless Compression: This method allows the original file to be restored exactly as it was before compression. Lossless compression is essential for text and program data where data integrity is crucial. Common formats include:
- File Archives: ZIP
- Images: PNG
- There are two types:
Run length encoding
Run Length Encoding is a form of lossless compression where runs of repeated data are replaced by a single data value and a count of repetitions. It works best with simple graphics, icons, and line drawings.
For example, if we have a string of 20 characters: ABBBBCCCCCCCAABDDDDA
, this could be encoded as 1A4B7C2A1B4D1A
, which uses only 14 bytes instead of 20.
To decompress the data, we read the string and interpret the digits as counts for the number of characters that follow.
In C# the algorithm for compressing data this way would be as below. It uses, by way of example, the StringBuilder
class to build a string, from the System.Collections
namespace, takes the string to compress as a parameter returning the compressed string for further processing:
To decompress the string, we reverse the process:
RLE is particularly effective when the data contains many repeated runs of text. However, it can be inefficient for non-repetitive data. An improvement can be made by introducing a flag to indicate a repeated run. For example, A4B7C2AB4DA could use the asterisk as a flag, adding an extra byte for each run but potentially reducing the overall file size.
Image Compression using RLE
RLE can also be applied to image files. For monochrome bitmaps, one bit can represent black or white, a byte for the pixel value, and another for the run length. The most significant bit (MSB) indicates color: a 0 for white and a 1 for black.
For example, the string 00000111111111111100000000000000000000 would be encoded as 00000101 10001101 00010100, representing runs of white and black pixels.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 |
|
A run of this code yields:
An alternative, would be to make the first digit in the compressed version to represent white pixels. If the string was 11111000000000000011111111111111111111
this would become 0,5,13,20
.
A similar approach could be taken for images with blocks of colour. Each colour being represented by a byte, providing \(256\) different colours (that could be stored in a lookup file, or embedded in the image file itself). The data being represented as a series of two byte pairs, the first holding the frequency of repetition, the second the colour. For example, the following slice from an image file:
Assuming the colour blue used a code of 16, yellow of 12 and green of 3 this data can be compressed as:
00001000 00010000 00001000 00001100 00001100 00010000 00000100 00000011
If we need more colour, add in additional bytes for the colour codes. \(24\)-bit colour uses a combination of the colours red, green and blue so a compressed image file could use \(4\) bytes to represent the colour of each pixel. As with the previous examples the amount of saving achieved through compression is dependent on the frequency of repeated data.
Alternative Compression Techniques
-
Dictionary Compression: A more complex algorithm involves building a dictionary of patterns for compression. One of the most well-known examples is the LZW (Lempel-Ziv-Welch) algorithm. This algorithm is used in formats like GIF. For further exploration, consider reading about it on Wikipedia.
-
Comparison with Other Techniques: It’s useful to compare RLE with other compression techniques like Huffman coding, which can be more efficient for certain types of data. Each method has its strengths and weaknesses, depending on the data being compressed.