The Settings Are A Bit Off

The offset sizes were another area I could experiment with a bit. Originally I had three different offset lengths: a short-range one, a medium-range one, and a long-range one, on the theory that shorter offsets occur more often than longer ones and so could be stored in fewer bits. If a buffer size of ‘n’ bits was specified, the long-range offset would be ‘n’ bits, the medium-range offset ‘n-1’ bits, and the short-range offset ‘n-2’ bits.
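As a rough sketch of that scheme (assuming a buffer of ‘n’ bits; the names and exact boundaries here are my own invention, not the actual code), picking a range for an offset would look something like this:

/* Rough sketch of the original three-range idea; names and
 * boundaries are illustrative only. */
#include <stdint.h>

enum offset_range { OFF_SHORT, OFF_MEDIUM, OFF_LONG };

static enum offset_range pick_offset_range(uint32_t offset, int n)
{
    if (offset < (1u << (n - 2)))
        return OFF_SHORT;    /* fits in n-2 bits */
    else if (offset < (1u << (n - 1)))
        return OFF_MEDIUM;   /* fits in n-1 bits */
    else
        return OFF_LONG;     /* needs the full n bits */
}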

Some experimentation showed that having these different ranges was indeed more efficient than having just a single offset length, but it was hard to tell just what the optimal sizes were for each range. I kept it to only three different ranges because initially I didn’t want the number of identifier symbols to be too large, but after merging the identifiers into the literals, I had a bit more leeway in how many more ranges I could add.

So…why not add a range for every bit length? I set it up so that symbol 256 corresponded to a 6-bit offset, 257 to a 7-bit offset, 258 to an 8-bit offset, and so on, all the way up to 24-bit offsets. This also had the property that, except for the bottom range, an ‘n’-bit offset could be stored in ‘n-1’ bits, since the uppermost bit would always be ‘1’ and could be thrown away (if it were ‘0’, the offset wouldn’t be in the ‘n’-bit range in the first place, since it would fit in a smaller one). Some testing against a set of data files showed that this did indeed improve the compression efficiency and produced smaller files.
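To make that concrete, here is a small sketch of how an offset could be split into a range symbol plus raw bits. The symbol numbering follows the description above, but the function names and packing details are my own assumptions, not the compressor’s actual code:

/* Sketch of the per-bit-length offset coding: symbol 256 means a 6-bit
 * offset, 257 a 7-bit offset, and so on up to 274 for 24 bits.  Above
 * the bottom range, the top bit of the offset is implied and doesn't
 * need to be stored. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MIN_OFFSET_BITS 6
#define MAX_OFFSET_BITS 24

static int offset_bit_length(uint32_t offset)
{
    int bits = 0;
    while (offset) { bits++; offset >>= 1; }
    return bits < MIN_OFFSET_BITS ? MIN_OFFSET_BITS : bits;
}

static void encode_offset(uint32_t offset, int *symbol,
                          uint32_t *extra, int *extra_bits)
{
    int n = offset_bit_length(offset);
    assert(n <= MAX_OFFSET_BITS);

    *symbol = 256 + (n - MIN_OFFSET_BITS);      /* 256 = 6-bit range, etc. */
    if (n == MIN_OFFSET_BITS) {
        *extra = offset;                        /* bottom range: store all 6 bits */
        *extra_bits = n;
    } else {
        *extra = offset & ~(1u << (n - 1));     /* drop the implied top bit */
        *extra_bits = n - 1;
    }
}

int main(void)
{
    int sym, nbits;
    uint32_t extra;

    encode_offset(544, &sym, &extra, &nbits);
    printf("offset 544 -> symbol %d, %d extra bits (value %u)\n",
           sym, nbits, extra);                  /* symbol 260, 9 extra bits */
    return 0;
}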

With all of these possible bit values and lengths, though, there was still the open question of what should be considered reasonable defaults for things like the history buffer size and maximum match length. Unfortunately, the answer is that it…depends. I used a shell script called ‘explode’ to run files through the compressor with every combination from a set of buffer sizes and match lengths to see which produced the smallest files, and the results varied a lot depending on the type and size of the input file. Increasing the maximum match length did not necessarily help, since it increased the average size of the length symbols and didn’t necessarily find enough long matches to cancel that out. Increasing the buffer size generally improved compression, but greatly increased memory usage and slowed down compression. After some more experimentation with the ‘explode’ script, I settled on defaults of 17 bits for the buffer size and a maximum match length of 130.

Another idea I remembered hearing about was that the best match at the current byte might not necessarily be the most efficient one: it can be better to emit the current byte as a literal instead if the next byte is the start of an even longer match. That was only an intuitive feeling though, so I implemented it and tested it, and it did indeed seem to give a consistent improvement in compression efficiency. As an example, in one text document the phrase ‘edge of the dock’ was compressed like so:

Literal: 'e' (101) (4 bits)
Literal: 'd' (100) (6 bits)
Literal: 'g' (103) (8 bits)
10-bit offset: 544   Length: 3 'e o' (16 bits)
 8-bit offset: 170   Length: 6 'f the ' (17 bits)
10-bit offset: 592   Length: 3 ' do' (16 bits)
Literal: 'c' (99) (6 bits)
Literal: 'k' (107) (7 bits)

but with the new test, it generated the following instead:

Literal: 'e' (101) (4 bits)
Literal: 'd' (100) (6 bits)
Literal: 'g' (103) (8 bits)
Literal: 'e' (101) (4 bits) (forced, match len=3)
 8-bit offset: 170   Length: 8 ' of the ' (19 bits)
10-bit offset: 592   Length: 3 ' do' (16 bits)
Literal: 'c' (99) (6 bits)
Literal: 'k' (107) (7 bits)

The ‘forced’ literal normally would have been part of the first match, but by emitting it as a literal instead, the compressor was able to find a more efficient match afterwards, and only two offset/length tokens were needed instead of three: 80 bits for the original encoding versus 70 bits for the improved one. Doing these extra tests does slow down compression a fair bit though, so I made it an optional feature, enabled on the command line.
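For illustration, here is a minimal, self-contained version of that one-byte lookahead. It uses a brute-force match search and made-up names, so it is only a demonstration of the decision itself, not the real compressor’s search or output code; it just prints the literal and match decisions it makes for a short string with a repeated phrase:

/* Demonstration of the one-byte lookahead: before taking a match,
 * check whether deferring by one byte would reveal a longer match. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MIN_MATCH 3

struct match { long offset; int length; };

/* Brute-force longest-match search over everything before 'pos'.  The
 * real compressor uses a proper history buffer; this is just enough to
 * show the lookahead decision. */
static struct match find_longest_match(const uint8_t *buf, long pos, long end)
{
    struct match best = { 0, 0 };
    for (long start = 0; start < pos; start++) {
        int len = 0;
        while (pos + len < end && buf[start + len] == buf[pos + len])
            len++;
        if (len > best.length) {
            best.length = len;
            best.offset = pos - start;
        }
    }
    return best;
}

int main(void)
{
    const uint8_t *buf = (const uint8_t *)
        "stood at the edge of the dock; we walked to the edge of the dock";
    long end = (long)strlen((const char *)buf);

    for (long pos = 0; pos < end; ) {
        struct match cur = find_longest_match(buf, pos, end);

        if (cur.length >= MIN_MATCH) {
            /* Lookahead: would deferring by one byte give a longer match? */
            struct match next = find_longest_match(buf, pos + 1, end);
            if (next.length > cur.length) {
                printf("literal '%c' (forced, match len=%d)\n",
                       buf[pos], cur.length);
                pos++;   /* the longer match gets taken next iteration */
                continue;
            }
            printf("offset %ld, length %d\n", cur.offset, cur.length);
            pos += cur.length;
        } else {
            printf("literal '%c'\n", buf[pos]);
            pos++;
        }
    }
    return 0;
}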

At this point though, it’s getting harder and harder to extract further gains in compression efficiency, as the work starts devolving into a whole bunch of special cases. For example, increasing the buffer size sometimes makes compression worse, as in the following case:

'diff' output between a smaller-buffer run (-) and a larger-buffer run (+):
 17-bit offset: 87005   Length: 10 'with the t' (26 bits)
 14-bit offset: 10812   Length: 3 'arp' (18 bits)
-13-bit offset: 7705   Length: 3 ', w' (17 bits)
-13-bit offset: 5544   Length: 8 'ould you' (19 bits)
+18-bit offset: 131750   Length: 4 ', wo' (41 bits)
+13-bit offset: 5544   Length: 7 'uld you' (19 bits)
 16-bit offset: 50860   Length: 7 '?  You ' (22 bits)
 17-bit offset: 73350   Length: 10 'take that ' (26 bits)

The compressor looks for the longest match, and in the ‘+’ run (the one with the larger buffer) it found a longer match, but at a much larger offset than in the ‘-’ run. In this case, 18-bit offsets are rare enough that their symbol has been pushed far down the Huffman tree and its bitstring is very long, making the long offset even less efficient to use, and in the end a whopping 24 bits are completely wasted. Detecting these kinds of cases requires a bunch of extra tests though, and this is just one example.
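Just to sketch what one such test might look like (this is purely hypothetical on my part, not something the compressor actually does): compare the extra bits the far match costs against what the extra bytes it covers could plausibly save, and keep the nearer match if the far one can’t pay for itself. The names, structure, and the cost model (a Huffman code-length lookup supplied by the caller) are all assumptions:

/* Hypothetical check: is the nearer-but-shorter match cheaper overall?
 * code_len() stands in for a lookup of the current Huffman code length
 * of an offset-range symbol (256 = 6-bit range, 257 = 7-bit, ...). */
struct match { long offset; int length; };

static int offset_bits(long offset)
{
    int n = 0;
    while (offset) { n++; offset >>= 1; }
    return n < 6 ? 6 : n;
}

static int prefer_nearer(struct match longest, struct match nearer,
                         int (*code_len)(int range_symbol))
{
    int n1 = offset_bits(longest.offset);
    int n2 = offset_bits(nearer.offset);

    /* Extra cost of the far match: rarer range symbol plus wider offset. */
    int extra_bits = (code_len(256 + n1 - 6) - code_len(256 + n2 - 6))
                     + (n1 - n2);
    /* Bytes the far match actually gains over the near one. */
    int gained_bytes = longest.length - nearer.length;

    /* If the extra offset cost outweighs what re-coding the gained bytes
     * could plausibly cost (at most roughly 8 bits per literal), keep
     * the nearer match. */
    return extra_bits > 8 * gained_bytes;
}

In the case above, the far match gains only one extra byte while wasting 24 bits, so a check along these lines would keep the nearer match.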

So, I think that’s about all I’m going to do to improve the compression efficiency. How does it do overall? Well, that 195kB text file that originally compressed to 87.4kB and then made it down to 84.2kB can now be compressed, with the harder match searching enabled and the buffer size and match length tuned for it, down to 77.9kB. That’s even lower than ‘gzip -9’ at 81.1kB!

It’s not all good news, though. If I test against the Canterbury Corpus, the original total size is 2810784 bytes, ‘gzip -9’ reduces them to a total of 730732 bytes (26.0%), and at the default settings my compressor gets…785421 bytes (27.9%). If I enable the extra searching and find optimal compression parameters for each file via ‘explode’, I can get it down to 719246 bytes (25.6%), but that takes a lot of effort. Otherwise, at the default settings, some of the files come out smaller than gzip’s and others larger. I typically do worse on the smaller files, where there hasn’t really been much of a chance for the Huffman trees to adapt yet, and the Excel spreadsheet in particular does really poorly with my compressor, for some reason I’d have to investigate further.

But I’m not going to. No, the main remaining problem was one of speed…