I Ain’t No Huffman

In terms of compression efficiency, I knew there were some obvious places that could use improvement. In particular, my Huffman trees…weren’t even really Huffman trees. The intent was for them to be Huffman-like, in that the most frequently seen symbols would sit closest to the top of the tree and thus have the shortest bitstrings, but the construction and balancing method was completely different. Whenever a symbol’s count increased, I compared its count against that of its parent’s parent’s other child; if the current symbol’s count was now greater, the two nodes swapped places, with a new branch inserted where the updated node used to be and the other child pushed down a level.
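
Roughly, the check looked something like the sketch below. The node layout and names here are my reconstruction for illustration, not anything from the original source, and it only shows the swap-with-uncle comparison at the core of the heuristic:

    #include <utility>

    // Hypothetical node layout; -1 marks internal nodes.
    struct Node {
        Node* parent = nullptr;
        Node* child[2] = { nullptr, nullptr };
        int   count = 0;
        int   symbol = -1;
    };

    // Called after 'leaf' has had its count incremented: compare it
    // against its "uncle" (the parent's parent's other child) and swap
    // the two subtrees if the leaf now outweighs it, moving the hot
    // symbol one level closer to the root.
    void bubbleUp(Node* leaf) {
        Node* parent = leaf->parent;
        if (!parent || !parent->parent) return;
        Node* grand = parent->parent;
        Node* uncle = (grand->child[0] == parent) ? grand->child[1]
                                                  : grand->child[0];
        if (leaf->count > uncle->count) {
            Node** leafSlot  = (parent->child[0] == leaf) ? &parent->child[0]
                                                          : &parent->child[1];
            Node** uncleSlot = (grand->child[0] == uncle) ? &grand->child[0]
                                                          : &grand->child[1];
            std::swap(*leafSlot, *uncleSlot);        // swap the subtrees
            std::swap(leaf->parent, uncle->parent);  // fix the parent links
        }
        for (Node* n = parent; n; n = n->parent)     // refresh ancestor counts
            n->count = n->child[0]->count + n->child[1]->count;
    }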

Unfortunately, that method led to horribly imbalanced trees, since it only considered nearby nodes when rebalancing, even though changing the frequency of one symbol can affect the relationships of symbols on relatively distant parts of the tree as well. As an example, here’s what a 4-bit length tree wound up looking like with my original adaptive method:

Lengths tree:
    Leaf node 0: Count=2256 BitString=1
    Leaf node 1: Count=1731 BitString=001
    Leaf node 2: Count=1268 BitString=0001
    Leaf node 3: Count=853 BitString=00001
    Leaf node 4: Count=576 BitString=000001
    Leaf node 5: Count=405 BitString=0000001
    Leaf node 6: Count=313 BitString=00000001
    Leaf node 7: Count=215 BitString=000000000
    Leaf node 8: Count=108 BitString=0000000011
    Leaf node 9: Count=81 BitString=00000000101
    Leaf node 10: Count=47 BitString=000000001001
    Leaf node 11: Count=22 BitString=00000000100001
    Leaf node 12: Count=28 BitString=0000000010001
    Leaf node 13: Count=15 BitString=000000001000000
    Leaf node 14: Count=9 BitString=000000001000001
    Leaf node 15: Count=169 BitString=01
    Avg bits per symbol = 3.881052

If you take the same data and manually construct a Huffman tree the proper way, you get a much more balanced tree without the ludicrously long strings:

    Leaf node 0: Count=2256 BitString=10
    Leaf node 1: Count=1731 BitString=01
    Leaf node 2: Count=1268 BitString=111
    Leaf node 3: Count=853 BitString=001
    Leaf node 4: Count=576 BitString=1100
    Leaf node 5: Count=405 BitString=0001
    Leaf node 6: Count=313 BitString=11011
    Leaf node 7: Count=215 BitString=00001
    Leaf node 8: Count=108 BitString=000001
    Leaf node 9: Count=81 BitString=000000
    Leaf node 10: Count=47 BitString=1101000
    Leaf node 11: Count=22 BitString=110100110
    Leaf node 12: Count=28 BitString=11010010
    Leaf node 13: Count=15 BitString=1101001111
    Leaf node 14: Count=9 BitString=1101001110
    Leaf node 15: Count=169 BitString=110101
    Avg bits per symbol = 2.969368

That’s nearly a bit per symbol better, which may not sound like much, but with the original method there was barely any compression happening at all (3.88 bits on average versus a plain 4-bit encoding), whereas a proper tree achieves just over 25% compression.

So I dumped my original adaptive method and made it construct a Huffman tree in the traditional way, repeatedly pairing the two lowest-count nodes in a sorted list. To keep it adaptive, it still does the count check against the parent’s parent’s other child, and when a symbol crosses that threshold it simply rebuilds the entire Huffman tree from scratch based on the current symbol counts. This involves a lot more CPU work, but as we’ll see later, performance bottlenecks aren’t necessarily where you think they are…
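
For reference, the traditional bottom-up construction is straightforward with a min-heap: keep pairing the two lowest-count nodes under a new parent until a single root remains. This is just a minimal sketch with names of my own choosing, not code from the engine itself; note that pushing every symbol up front, zero counts included, also matches the prepopulation described below:

    #include <queue>
    #include <vector>

    struct HuffNode {
        int       count;
        int       symbol;                       // -1 for internal nodes
        HuffNode* child[2] = { nullptr, nullptr };
    };

    struct ByCount {
        bool operator()(const HuffNode* a, const HuffNode* b) const {
            return a->count > b->count;         // min-heap ordering
        }
    };

    HuffNode* buildTree(const std::vector<int>& counts) {
        std::priority_queue<HuffNode*, std::vector<HuffNode*>, ByCount> heap;
        for (int s = 0; s < (int)counts.size(); ++s)
            heap.push(new HuffNode{ counts[s], s });
        while (heap.size() > 1) {
            HuffNode* a = heap.top(); heap.pop();   // the two lowest counts...
            HuffNode* b = heap.top(); heap.pop();
            heap.push(new HuffNode{ a->count + b->count, -1, { a, b } }); // ...paired under a new parent
        }
        return heap.top();                          // the root
    }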

My trees also differ from traditional ones in that I prepopulate them with every possible symbol at a count of zero, whereas usually you only insert a node into a Huffman tree once its count is greater than zero. This is slightly suboptimal, but it avoids a chicken-and-egg problem: the first time a symbol is seen, the decoder would otherwise have no idea what bitstring corresponds to it, because the symbol wouldn’t exist in the tree yet.

Knowing that, and with the improved Huffman trees in place, another thing became clear: using Huffman trees for the offsets wasn’t doing much good at all. With most files the offset values are too evenly distributed, and many are never used at all; all those zero-count entries would get pushed down the tree and become longer strings, so the first time an offset got used, its string would often be longer than its plain bit length, causing file growth instead of compression. I ripped those trees out and just emitted plain old integer values for the offsets.

The way I was constructing my trees also had another limitation: the total number of symbols had to be a power of two. With the proper construction method, an arbitrary number of symbols can be specified, and that allowed another potential optimization: merging the identifier tree and the literals tree. The identifier token in the output stream guaranteed that there would always be at least one wasted non-data bit per token, and often two. Merging it with the literals would increase the size of literal symbols, but the expectation is that the larger literal symbols would still, on average, be smaller than the sum of an identifier symbol and a smaller literal symbol, especially as more ‘special case’ symbols are added. Instead of reading an identifier symbol and deciding what to do based on that, the decoder reads a ‘literal’ symbol; if it’s in the range 0-255, it really is a literal byte value and is interpreted that way, but if it’s 256 or above, it’s treated as introducing a following offset/length pair.
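
In code, the merged-alphabet decode loop might look something like this; the reader functions and MATCH_BASE are placeholders I’ve declared for illustration, not names from the actual decoder:

    #include <cstdint>
    #include <vector>

    int  readLiteralSymbol();       // walk the merged literal/match tree
    int  readOffset(int matchSym);  // plain integer offset; width may depend on the symbol
    int  readLengthSymbol();        // walk the lengths tree
    bool atEndOfStream();

    constexpr int MATCH_BASE = 256; // 0-255 = literal byte, 256+ = match

    void decode(std::vector<uint8_t>& out) {
        while (!atEndOfStream()) {
            int sym = readLiteralSymbol();
            if (sym < MATCH_BASE) {
                out.push_back(static_cast<uint8_t>(sym));  // literal byte
            } else {
                int offset = readOffset(sym);              // match: copy from history
                int length = readLengthSymbol();
                size_t start = out.size() - offset;
                for (int i = 0; i < length; ++i)   // byte-by-byte copy handles overlap
                    out.push_back(out[start + i]);
            }
        }
    }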

The range of offsets to handle would also have to change, but that’s for next time… With the Huffman tree improvements, my 195kB test file that compressed to 87.4kB before now comes out at 84.2kB. Still not as good as gzip, but getting there.

Compressing History

While sorting through some old files of mine, I happened upon the source code to a compression engine I’d written 18 years ago. It was one of the first things I’d ever written in C++, aside from some university coursework, and I worked on it in the evenings during the summer I was in Cold Lake on a work term, just for fun. Yes, I am truly a nerd, but there wasn’t really much else to do in a tiny town like that, especially when you only get 3 TV channels.

Looking at it now is kind of embarrassing, since of course it’s riddled with inexperience: no comments at all (leaving me mystified at what some of the code was even doing in the first place), unnecessary global variables, little error checking, poor header/module separation, unnecessary exposure of class internals, poor const correctness, and so on. It kind of irks my pride to leave something in such a poor state though, so I quickly resolved to at least clean it up a bit.

Of course, I have to understand it first, and I started to remember more about it as I looked over the code. It’s a fairly basic combination of LZ77 pattern matching and Huffman coding, like the ubiquitous Zip method, but the twist I wanted to explore was making the Huffman trees adaptive, so that symbols would shift around the tree to automatically adjust as their frequencies changed within the input stream. There were two parameters that controlled compression efficiency: the history buffer size and the maximum pattern length. The history size controlled how far back it would look for matching patterns, and the maximum length set the upper limit on how long a match could be.

Compression proceeded through the input file byte by byte, looking for the longest possible exact byte match between the data ahead of the current position and the data in the history buffer just behind it. If no match could be found, it emitted the current byte as a literal and moved one byte ahead; if a match was found, it emitted a token with the offset and length of the match in the history buffer. To differentiate between these cases, it first emitted an ‘identifier’ token with one of four possible values: one for a literal, which would then be followed by the 8-bit value of the literal, and three for offset/length pairs, with three different possible bit lengths for the offset so that closer matches took fewer bits. Only matches of length 3 or longer were considered, since a two-byte match would likely produce an identifier+offset+length string longer than just emitting the two bytes as literals. In summary, the four possible types of bit strings you’d see in the output were as follows (a rough sketch of the scanning loop appears after the list):

    | ident 0 | 8-bit literal |

    | ident 1 | 'x'-bit offset    | length |

    | ident 2 | 'y'-bit offset        | length |

    | ident 3 | 'z'-bit offset            | length |
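
Here is that scanning loop in rough form; the emit functions stand in for the identifier/Huffman output stage, and all names are illustrative rather than lifted from the original code:

    #include <cstdint>
    #include <vector>

    void emitLiteral(uint8_t byte);                // identifier 0 + literal byte
    void emitMatch(size_t offset, size_t length);  // identifier 1-3 + offset + length

    void compress(const std::vector<uint8_t>& data,
                  size_t historySize, size_t maxLength) {
        size_t pos = 0;
        while (pos < data.size()) {
            size_t bestLen = 0, bestOff = 0;
            size_t windowStart = pos > historySize ? pos - historySize : 0;
            // Brute-force longest-match search over the history window.
            for (size_t cand = windowStart; cand < pos; ++cand) {
                size_t len = 0;
                while (len < maxLength && pos + len < data.size() &&
                       data[cand + len] == data[pos + len])
                    ++len;
                if (len > bestLen) { bestLen = len; bestOff = pos - cand; }
            }
            if (bestLen >= 3) {        // shorter matches don't pay for themselves
                emitMatch(bestOff, bestLen);
                pos += bestLen;
            } else {
                emitLiteral(data[pos]);
                ++pos;
            }
        }
    }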

And then I used a lot of Huffman trees. Each of these values was then run through a Huffman tree to generate the actual symbol emitted to the output stream, with separate trees for the identifier token, the literals, the lengths, and each of the three offset types. HUFFMAN TREES EVERYWHERE! The compression parameters were also written to a header in the file, so the decoder would know what history buffer size to use and what maximum length to allow.
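
One plausible shape for such a header, with field names and widths that are guesses rather than the actual format:

    #include <cstdint>

    #pragma pack(push, 1)
    struct ArchiveHeader {
        char     magic[4];      // format identifier
        uint32_t historySize;   // how far back matches may reach
        uint16_t maxLength;     // longest match the decoder must produce
    };
    #pragma pack(pop)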

It worked…okay… I’ve lost my original test files, but on one example text file of 195kB, my method compresses it down to 87.4kB, while ‘gzip -9’ manages 81.1kB. Not really competitive, but not too bad for a completely amateur attempt either. There’s still plenty of room for improvement, which will come…next time.