It Better Be Made Of Diamond

I’m having a bit of trouble with my cable modem and TV, so I figured I should try replacing the splitter, since this one’s kinda old and cable modems will need higher frequencies when DOCSIS 3.1 comes out anyway. And hey, there’s a The Source store (what used to be Radio Shack here) on the way back from lunch. I find a two-way splitter, and they want $15 for it. Ugh, kinda expensive, but I don’t want to wait for one to be shipped either… But then I notice that there’s no frequency range printed on it anywhere. And then I notice next to it a “high performance” splitter that does have an explicit 2.4GHz rating. Except that one is $35. It’s not even some unnecessarily-fancy brand like Monster, just their regular in-house junk. I swear, they exist solely to fleece businessmen wandering around downtown who don’t know any better.

It may be a bit out of my way, but I’ll just swing by Home Depot and grab an equivalently-rated one for $8.

Okay Fine Facebook, You Half-Win

Well, it’s been over a year since I closed my Facebook account, but I recently realized that I needed some information from some old private messages on there, so I had to reactivate it.

I’m still not really keen on using it as some kind of all-encompassing social hub, but I figured that I may as well leave it open for one thing: to add Facebook Chat to Adium. The ‘old-school’ IM services like AIM and Yahoo seem to be rapidly dying out, and this is probably the best supplement/replacement for them right now.

No Room For My Stuff

I need to do something.

It feels like my life has been stagnant for a long time now, and there’s really nobody to blame but myself. I’ve written here before about what I ought to do, but very little of it has come to pass. I’m nearly middle-aged now, and things seem barely any different than they were 10 years ago.

But where to start? Well, before I can worry about dealing with other people, I have to be happy with myself first. I’m certainly no psychiatrist, but one of the more obvious manifestations of my troubles is the state of my apartment. I admit it, I’m a packrat, and over the course of the years my place has become cluttered and disorganized, filled with empty boxes, parts and books and odds and ends scattered willy-nilly, things going uncleaned and unrepaired, and so on. When you live by yourself, it’s easy to just get used to living that way and letting those things slide, and one day you realize what a mess it’s become but, eh, it’s been like that for years now, what’s the big deal… It leads to a weird state of mind where I’m simultaneously comfortable with and used to it, but also somewhat ashamed of what other people would think of it (I certainly wouldn’t want to bring a date back to it) so I become even more entrenched in my loner-hood.

So, I need to clean this place up, and that means throwing a bunch of this junk out, not just shuffling it around like I usually wind up doing. The main things I need to get rid of are:

  • Big boxes. Should be easy enough, a lot of them are mostly empty and were kept around because there were still a few parts or cables in them or I just never got around to breaking them down before tossing them into a corner.
  • Old parts. I’ve probably got enough spare pieces around to rebuild PCs stretching all the way back to the 486 era, plus a dead TV, way too many old cables, etc. I’ve been kind of reluctant to throw these out not because I’d possibly need them (I occasionally toy with the idea of building old PCs for games that no longer work well on modern Windows, but never have the time anyway), but because they really should be disposed of properly and getting them to a recycling centre is a pain. I might have to look into seeing if I can pay someone to haul it away.
  • Books. I’ve got more books than bookshelves now, and I’m always reluctant to let go of books. Maybe I can at least start with the clearly obsolete technical manuals. I sincerely doubt I’ll suddenly need to refer to a book for Java 1.1 programming information…
  • Old magazines, mail, invoices, etc. I’m really bad at letting paperwork accumulate. I’m paranoid enough to think that I really ought to dispose of it securely, but never get around to actually shredding it, so I open a box from Amazon, take out the receipt, and toss it on top of a desk, where it stays with a pile of others for the next 5 years. Easy enough to take care of, just tedious.
  • Old game boxes. I’ve still got a ton of these, though at least I’m not accumulating many more now that most of my purchases are downloads these days. They’ve still got CDs and manuals inside, but removing them and getting rid of the boxes alone should still save a bunch of space.

That would probably take care of a big chunk of it, at least. Once that’s done, then I’ll be in a better position to evaluate what should be next. Tonight, I got rid of an old box that my original Dell widescreen monitor came in, 9 years ago, and sorted through another box full of game manuals. I don’t really want to throw the manuals out, but I did at least sort out the useless ones like the French versions or manuals for long-obsolete hardware and got rid of those.

The Boringest Maverick

Now that I had some free time (and a backup drive within arm’s reach), I finally upgraded my laptop to OS X Mavericks. And it’s… eh. Everything works fine and I haven’t really had any problems so far, but not many of the new features really excite me that much. Better social integration only really matters if you’re a social person. I’d still rather go to the Google Maps site than use an Apple Maps app. I don’t use Safari, so I haven’t even checked its new features. And it’ll take a while before I can really tell how much the battery life improves. The tabs in Finder are nice, though.

The one other thing that does look interesting is iBooks. Not so much for the Apple book store, but just as a convenient place to organize and read stuff like the books I’ve bought from O’Reilly or Gumroad, since you can import your own files. I’m sure there are a dozen third-party apps that could do the same thing, but I’m lazy and iBooks was just one click away… :) It supports PDF and EPUB, but it uses different apps to view each type and it’s not really clear which is preferable yet. I’ll have to play with it a bit more.

So, not a great upgrade, but not a bad one either, and it was free after all.

What Do You Do With 30GB?

And speaking of dead hard drives, now’s as good a chance as any to sort through some old drives left over from old systems, since I’ve now got a USB adapter (kind of like this one) that lets me hook up drives without having to open up the case.

Some of these are dead or dying drives that I didn’t want to immediately throw out because I wanted a chance to erase them first. My secret crab bisque recipe will never be yours, dumpster-divers! The drive I just replaced in this server actually crapped out completely while I was 3/4 of the way through erasing it, so it’s now relegated entirely to the role of ‘squeaky noise-maker’.

Some of them still work perfectly fine, but I’m not sure what to do with them yet. One is a 1TB drive, which is still a good amount of space, but it won’t fit into any of my current systems, and it’s too small to act as a backup drive to them. But it’s too (physically) big to use as an upgrade in my PS3. Probably the best thing to do would be to stick it in a USB enclosure and use it as a General Purpose Portable Drive ™. But then I’ve also got a 500GB drive left over. And a 320GB drive. And a 160GB…

And some of them are ooooold. In one box I found a dusty WD Caviar drive that’s 13 years old and is louder than all of my other hard drives put together, but amazingly enough it still works. But it only holds 30GB; what could I even use it for? And I’m trying to make my workspace quieter, not louder. But then my natural reluctance to throw out anything that might be useful in some theoretical way kicks in…

iDunno

Speaking of failures… I finally got around to upgrading my phone to iOS 7. I was hesitant at first after various reports of it performing poorly on older phones like mine, but after a while those reports started sounding a little overblown, so I took the plunge and it worked pretty well.

Except iMessage. Which is kinda important since I’d rather not incur SMS charges for myself and the friends that use it. It remained stuck at this “Waiting for activation…” state seemingly forever, even after running through all of the official troubleshooting steps, resetting everything, waiting a full day, trying a bunch of anecdotal “well this worked for me…” tips from forums, etc.

After a week of fiddling with it a hundred different ways, I eventually figured I’d try using iMessage from my Mac just to see if it at least worked from there, and maybe isolate whether the problem was with the phone or the account. But, when it asked me which email address I wanted to use, it wouldn’t let me actually select one. It displayed my email address and put a checkbox beside it, but it would immediately clear the checkbox if I tried to set it, and I couldn’t proceed any further without it checked.

Confused as hell by this point, I logged into my Apple ID to check whether anything was wrong with it or set incorrectly, but it all seemed fine. I did notice that I didn’t have an alternate email address set, though, so I figured I may as well add one while I was there. And suddenly the Messages app let me finish setting it up, now that it had two email addresses to choose from, and back on the phone, iMessage activated within seconds.

Was it the setting of the alternate address that fixed it? Do you really need an alternate address for iMessage? Maybe just logging into the Apple ID account management cleared out some old crud from the database that was interfering? *shrug* It’s one of the few times where Apple stuff hasn’t “just worked” for me and I start wishing I could browse through their server-side trace logs…

Arise!

Hard drive failures suck. A couple weeks ago, I noticed this server box was starting to make some odd squeaking noises, and I was reasonably sure that there weren’t any mice trapped inside. I checked the SMART stats, and sure enough, bad sectors had started appearing on one of the drives. In the days after that, the number kept steadily rising, and just trying to access some directories started spitting out I/O errors.

The data was safe, as it’s pretty robustly backed up, but it presented me with a bit of a dilemma: do I just replace the drive, or do I consider replacing the whole system, as it’s now around five years old?

In the end, I wound up kind-of upgrading it. I kept the system, but upgraded it to 4GB of memory cannibalized from my recently replaced gaming box, and replaced both of the hard drives in it instead of just the failing one. Instead of two 1TB drives, it now has one 240GB SSD, and one 4TB hard drive. The SSD makes for a nice and snappy boot and home drive, but it obviously can’t hold much, so any large collections of data will go on the 4TB drive, with symlinks all over the place as needed. The CPU’s still adequate for its relatively light needs, so with these upgrades, this system should hopefully last for quite a while longer.

Now how long is it going to take to restore all those files…

Damn Fans

I’m still having a bit of trouble with the new system, and primarily with the video card. The first problem is that I have to reboot every day or the drivers start ‘wigging out’, constantly issuing a stream of timeout and recovery errors and making the system unusable. This apparently happens after the system has been running for 36 hours, and there are a lot of complaints about it on NVIDIA’s forum and moderators have posted that it’s under investigation, so hopefully a fix isn’t too far away. At least rebooting is super-fast, though I’d be a lot more annoyed if I was currently doing development on this system and had to close and reopen everything every day.

The second problem was that I started getting spontaneous reboots in the middle of games. After checking the temperatures and seeing them get way too high, I cracked open the case and saw that a bundle of loose cables had drooped down into the video card fans, stopping one of them from spinning. There wasn’t really anything I could easily tie them to, but they were going underneath another cable that’s much stiffer, so it seemed like it would suffice to have them rest on top of it instead. Unfortunately that meant rerouting that other cable, which is the USB3 front panel cable that caused me grief before, so I had to pull the 3.5″ drive bay out again to put the USB3 plug back in.

And then in the course of doing that, I somehow yanked out two of those looser cables from the motherboard. They were just the front panel power LED pins, but if I didn’t do anything about them, then those would droop down into the video card fans. Unfortunately those pins are in a very narrow little region between the power supply and the video card, too narrow to reach with the kind of finger-pinching needed to guide the plugs properly. The one time in my life I actually specifically needed needle-nose pliers, and I discover I don’t have any…

So, I had to move the video card out of the way to give me more room to work in, and that meant moving some other cables that ran past the video card, and removing the ‘cage’ on the back of the case to get at some screws, and… In the end I wound up damn near disassembling and reassembling the whole system, just to put one cable on top of another. There’s probably a lesson here about doing proper cabling up front, but who knows if I’ll remember it by the next time I have to assemble a system…

Getting Rid Of The Loudmouth

All of my parts finally arrived, so I spent most of last weekend assembling the new PC and the last week reinstalling stuff and getting it set back up again. It is, of course, much faster than my old one, but the main thing I immediately noticed is just how much quieter it is compared to the old system, even when just sitting there idle. When I first powered it on, I was instantly concerned that I’d screwed something up because I couldn’t hear anything. Maybe I can hear myself think now…

As usual, the main hassle was cabling, especially since I went with a smaller mATX case this time. I always seem to screw up the order of something; I attached the USB3 header early on, but then it turned out it was blocking me from moving the video card into place, so I had to unplug it. But then with the card in place, I could no longer reach down to where the USB3 jack was, so I had to remove the 3.5″ drive bay for a second time, plug it in, put the drive bay back, reattach the drives… The SATA plug on the SSD also worries me a bit; there’s nothing for the clip on the plug to latch on to, so the cable sits rather loosely. Oh well, as long as nothing disturbs it. Ideally I’d like not to have to crack this thing open again for another 3-4 years.

I wound up putting Windows 8 on it anyway. I’m still not all that big a fan of the Metro side of things, but once you’ve got it all set up you can pretty much stay in the Desktop mode 99% of the time, and the 8.1 update due out soon is supposed to fix a lot of 8’s problems. Unfortunately my current version of Acronis True Image didn’t work on Win8, despite only being one version behind the current one, so they basically held my backup data hostage until I bought an upgrade…

With the SSD, Win8’s quick boot support, and the UEFI BIOS together, it only takes about 20 seconds to go from clicking ‘Restart’ to the login screen coming back up again. I’ve gone to reboot, turned to do something on the laptop, and turned back to see the login screen and suddenly become uncertain whether it really did reboot or if I accidentally just logged out instead.

The one problem I’m having right now, though, is that after the system’s been running for a while, the frame rate becomes erratic in games and it starts freezing for a couple seconds at a time. A lot of people seem to be having troubles like these with the latest NVIDIA drivers, so I’ve reverted to the ones that ASUS themselves provide; we’ll see if that helps. As always, driver issues are the bane of everyone’s existence.

Now I just have to get around to actually playing some games…

Forgettable

Remember that new router I mentioned below? I only just realized that it’s been running continuously since I first set it up, without a single reboot.

I’ve never had any trouble with its wireless reliability or firewalling or NAT, either. It’s nice to finally have a router that just does its job quietly and reliably, a far cry from the previous Linksys that would freeze up and need to be power cycled every week or two and would occasionally drop the wireless.

It’s Time To Get Smaller

Well, Haswell and the new Nvidia GPUs are out, so it’s time to start putting together a new system. It’s still too early for prebuilt ones to be available, and I wasn’t happy with the ones I looked at before since they all involved compromising on something due to limited choices, so I guess I’m rolling my own again. So far, the parts I’m looking at are:

CPU: Intel i5-4670. Of the new batch of Haswell chips, this seems to hit the sweet spot for price and performance. There are slightly faster ones, but the price climbs steeply past this point.

Motherboard: Asus Z87M-Plus uATX. There seems to be a lot less differentiation among motherboards nowadays, perhaps due to more stuff being moved right onto the CPU die. It’s mainly about size and the number of USB and PCIe slots, and I don’t really need very many of them. I’m not going SLI, so I don’t need more than one PCIe x16 slot, and this board satisfies everything even though it’s on the cheaper end.

Case: SilverStone PS07B uATX. I could reuse my current Antec P180, but it’s a behemoth, and I’d like to try something a bit more compact. I don’t need four external 5 1/4″ bays! It’s my first time looking at uATX cases, so I’m slightly worried about things fitting and cooling, but as far as I can tell from measurements and other people’s reports, it should all work.

Video: Asus GTX 770 DirectCU II. Not much differentiation among video cards either, but this new generation should last me a good while and this card is supposed to be particularly cool and quiet. I’m slightly concerned that 2GB might not be enough if games ported from next-gen consoles need more and more texture memory, but it’s easy enough to replace the card down the road if that happens.

Storage: SanDisk Extreme 480GB SSD and Seagate Barracuda 3TB. It’s about time I got an SSD, and this one’s been recommended for being reliable and reasonably performant. It won’t be big enough by itself, so there’s ye olde regular hard drive in there too.

Power Supply: SeaSonic X650 Gold. Not a particularly glamorous component, but I’m going for a modular one to cut down on the mess of cables inside, especially with a smaller case.

I’m also trying to choose quieter parts this time. I didn’t really focus on it last time, and the fans are fairly noticeable even at idle.

Bad Timing

Ugh, it’s not dead yet, but with 7 bad sectors and steadily climbing, plus the random system lockups it’s causing, it won’t be long now. This drive’s had a good life, lasting at least five or six years, and everything’s already backed up, so there’s no concern over losing anything.

It’s just really bad timing, since I was planning on putting together a new gaming system in a couple months or so, based on the upcoming Haswell chips and new GPU generation. If the drive dies tomorrow, there’s not much point in spending the time to immediately replace it and restore everything when I’m going to wipe and reinstall in a couple months anyway. But if I don’t, I’ll be left without my main gaming system for that couple months instead.

For now I’m just crossing my fingers and hoping it manages to limp along just long enough to assemble the new system. Come on and hurry up, Intel!

Faceless

A couple of weeks ago, I closed my Facebook account. I haven’t really missed it since.

I wasn’t really using it much anyway, and it was just becoming more of a burden than a benefit. I never posted any status updates; nothing in my life really seems worth injecting into other people’s news streams. Reading other people’s updates was just a depressing reminder of my own inadequacy. The constant stream of ads and game updates and ‘suggestions’ was annoying and nearly impossible to fully disable. The web page behaved poorly in Chrome, often chewing up huge amounts of CPU or crashing the tab. The privacy options were murky at best, and what they track is increasingly invasive. Perhaps the last straw was the update they tried to push to their Android client, which practically takes over the device and tracks what you do on it.

“But you’ll miss out on what all your friends are doing!”

I don’t need it.

Yeah, Facebook is great at efficiently distributing the news of your life to all the people you know, but I’ve been finding it increasingly alienating. Nobody communicates to me, it’s all indirect typing past each other. Telling Facebook about how your day went is not the same thing as telling me how your day went. Comments are not a replacement for conversation. If you want to talk to me, talk to me. If you don’t, that’s fine too, social interaction shouldn’t be forced.

And yeah, it’s a bit hypocritical in that this blog is the same kind of indirect interaction, but I’d like to think that the important differences are that I try to keep it to bigger and/or more niche topics, and stuff I’d rant about to no one in particular, not day-to-day stuff; that your decision to come here and read is voluntary, not obligatory just because you labeled someone a friend; and that I have total control here and am not trying to sell you something or track your reading habits.

Late Spam

Subject: They need you weak

Yeah, well, the cyber-samurai aren’t going to catch me off-guard that easily…

Subject: Postgraduate degree in economics is your dream? Get it right now over here.

Man, I don’t remember most of my dreams, but I hope they’re more exciting than that.

Subject: Get more swell down there

You’re going to help me make more friends down in the U.S.?

Subject: Check out my awesome racks

I would, but my keycard won’t let me in your server room.

Subject: Two simple steps,Add the title "Ph.D" to your resume

Hey, that’s only one step! I may not have a Ph.D, but I did take Counting 100.

Subject: Tarzan in bed after 1 doze.

He probably should have taken the NoDoz instead.

Roto-Router

Well, I finally got sick of the glitchiness of my old Linksys router. Every once in a while it just ‘goes away’ and has to be power cycled, and it’s annoying when that happens while I’m at work and I still need to access something at home.

Doing research on routers is usually a big pain, but this time I went with some recommendations and got the ASUS RT-N66U. Besides just generally good reviews, it’s one of the models that works well with TomatoUSB, and I’ve had good luck with the Tomato firmware in the past, so the first thing I did was replace the standard firmware with that via these instructions. It’s a bit trickier with this model since you have to use an external utility to flash the firmware, so there’s a slightly higher chance of turning it into an expensive brick, but it went pretty smoothly.

Only time will tell as far as reliability goes, but it’s nice to finally be able to use the 5GHz band separately, as it’s much less congested around here. This being an apartment- and condo-dense area, there’s a crazy number of APs nearby, and the status page for the 2.4GHz radio says “Interference level: Severe”. I’ll also have to fiddle with the QoS and USB options at some point.

Don’t Call It A Pad

Well, I gave in to my techno-curiosity about this whole Android thing (I’ve only ever had an iPhone) and picked up a Nexus 7 last night. It’s pretty cheap for a tablet while still being fairly high quality, so it lets me indulge my curiosity without it feeling like a big commitment.

The build quality is pretty great, the screen is awesome, and no problems at all with the touch interface. The only real concerns are that the camera is pretty low-res and grainy in my living room light, but it’s clearly just meant for Skype and such, and people who try and take photos with tablets are evil anyway; and when holding it in landscape mode to watch a video, the sound distractingly comes from just one side, so have a good set of headphones instead. It’s a little bit heavier than expected, and my arm got a bit tired of holding it up, but that’s really just because I’m not exactly in great shape…

I checked out some comics, and although cramped, I still found them to be reasonably legible at the full-page-fit level. That was admittedly with some DC/Marvel-style bright, large-text pages though, and I can see comics with denser art and text needing more fiddling with zooming in and out. I’d say it’s ‘adequate’, obviously not as good as a full-size tablet but I’d certainly rather view them on this than my iPhone.

Google Play seems nice enough, though I haven’t really probed the depth of their selection yet. The major omission in Canada is the music store and sync/streaming service, so for us it’s just not going to be a very good music device unless you set up your own DLNA server and use something like Subsonic to do your own streaming. The Google integration is nice, too: I recently switched to Chrome on the desktop, so it’s convenient already having bookmarks and such shared.

And I’m still exploring the whole Android ecosystem, but it does please the geek inside me that I can get fairly low-level with terminals, file management, etc. even without having rooted it. The app selection may be smaller, but it hasn’t really felt like anything’s missing yet (hell, I even found a .mod music player). I still have to check out the gaming selection in more depth.

I still do way too much fiddly stuff on my laptop for this to ever be a replacement for it, but I’ll just have to keep using it and see what kind of role it naturally falls into. It’s definitely way easier to use than the laptop while lying in bed…

(Though one thing that really bugged me while setting it up is that I had to log in to my Google account four times, typing out my whole crazy-long strong password each time. The first time fails because of two-factor authentication, so it redirects you to a web page to sign in again and enter the mobile code, but then I misclicked something and it took me to another web page with no obvious navigation or gesture controls to get back to the code entry. I wound up having to hard power it off and go through the whole setup process again, requiring another two password prompts. At least it was only a one-time process… The paranoid side of me also isn’t crazy about having yet another device from which essential credentials for things like Google could be leaked, but hopefully having a strong PIN on it suffices.)

It’s Shinier Too

(Yikes, it’s been a while…)

Well, it finally did it. I’ve been a staunch Firefox user ever since it was in beta, since I liked using the same browser across multiple platforms, but it’s gotten to the point where it’s just too glitchy to tolerate. Pages not responding to the refresh button properly, unexplained choppiness in video playback, the tab bar not scrolling all the way and leaving some tabs accessible only by the tab list, crashes when I use the Flash player full-screen, the tab list missing some tabs at the bottom, high CPU usage on OS X even when lightly loaded, the Firefox window suddenly losing focus so I have to click on it again, not remembering passwords on some pages it should, and so on. Little things, but they add up.

So, I’m going to give Chrome a try for a while. There’s no guarantee that I won’t find a bunch of things about it that’ll annoy me just as much, but it’s worth a shot. Now I just have to find a set of equivalent extensions…

The Settings Are A Bit Off

The offset sizes were another area I could experiment with a bit. Originally I had three different offset lengths, a short-range one, a medium-range one, and a long-range one, on the theory that shorter offsets might occur more often than longer offsets, and could be stored in fewer bits. If a buffer size of ‘n’ bits was specified, the long-range offset would be ‘n’ bits, the medium range offset would be ‘n-1’ bits, and the short-range offset would be ‘n-2’ bits.

Some experimentation showed that having these different ranges was indeed more efficient than having just a single offset length, but it was hard to tell just what the optimal sizes were for each range. I kept it to only three different ranges because initially I didn’t want the number of identifier symbols to be too large, but after merging the identifiers into the literals, I had a bit more leeway in how many more ranges I could add.

So…why not add a range for every bit length? I set it up so that symbol 256 corresponds to a 6-bit offset, 257 to a 7-bit offset, 258 to an 8-bit offset, and so on, all the way up to 24-bit offsets. This also had the property that, except for the bottom range, an ‘n’-bit offset could be stored in ‘n-1’ bits, since the uppermost bit would always be ‘1’ and could be thrown away (if it were ‘0’, it wouldn’t be considered an ‘n’-bit offset, since it would fit in a smaller range). Some testing against a set of data files showed that this did indeed improve the compression efficiency and produced smaller files.
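To make the scheme concrete, here’s roughly how the mapping works out (an illustrative sketch in C, not the actual compressor code):

#include <stdint.h>

/* Number of significant bits in a match offset, with a floor of 6,
   since the bottom range covers everything up to 6 bits. */
static int offset_bits(uint32_t off)
{
    int n = 0;
    while (off >> n)
        n++;
    return n < 6 ? 6 : n;
}

/* Map an offset to its symbol and the number of bits actually stored.
   Symbol 256 = 6-bit offset, 257 = 7-bit, ... up to 24-bit.  Outside
   the bottom range the top bit is always 1, so only n-1 bits need to
   be written. */
static void offset_symbol(uint32_t off, int *sym, int *stored)
{
    int n = offset_bits(off);
    *sym = 256 + (n - 6);
    *stored = (n == 6) ? 6 : n - 1;
}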

With all of these possible bit values and lengths though, there was still the open question of what should be considered reasonable values for things like the default history buffer size and match length. Unfortunately, the answer is that it…depends. I used a shell script called ‘explode’ to run files through the compressor with all possible combinations of a set of buffer sizes and match lengths to see which would produce the smallest files, and the results varied a lot depending on the type and size of the input file. Increasing the match length did not necessarily help, since it increased the average size of the length symbols and didn’t necessarily find enough long matches to cancel that out. Increasing the buffer size generally improved compression, but greatly increased memory usage and slowed down compression. After some more experimentation with the ‘explode’ script, I settled on defaults of 17 bits for the buffer size, and a match length of 130.
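The ‘explode’ script itself is nothing fancy, just a brute-force sweep; a minimal sketch of the idea looks something like this (the compressor name and its -b/-m flags here are placeholders, not the real interface):

#!/bin/sh
# Try every combination of buffer size and max match length on a file
# and report the resulting sizes; the smallest one wins.
# ./compress and its -b/-m flags are stand-ins for the real tool.
f="$1"
for b in 13 14 15 16 17 18; do
    for m in 66 130 258 514; do
        ./compress -b "$b" -m "$m" "$f" /tmp/out.cmp
        echo "$f b=$b m=$m $(wc -c < /tmp/out.cmp) bytes"
    done
done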

Another idea I remembered hearing about is that the longest match at the current byte isn’t necessarily the most efficient one. It can be better to emit the current byte as a literal instead, if the next byte is the start of an even longer match. It was only an intuitive feeling though, so I implemented it and tested it, and it did indeed give a consistent improvement in compression efficiency. As an example, in one text document the phrase ‘edge of the dock’ was compressed like so:

Literal: 'e' (101) (4 bits)
Literal: 'd' (100) (6 bits)
Literal: 'g' (103) (8 bits)
10-bit offset: 544   Length: 3 'e o' (16 bits)
 8-bit offset: 170   Length: 6 'f the ' (17 bits)
10-bit offset: 592   Length: 3 ' do' (16 bits)
Literal: 'c' (99) (6 bits)
Literal: 'k' (107) (7 bits)

but with the new test, it generated the following instead:

Literal: 'e' (101) (4 bits)
Literal: 'd' (100) (6 bits)
Literal: 'g' (103) (8 bits)
Literal: 'e' (101) (4 bits) (forced, match len=3)
 8-bit offset: 170   Length: 8 ' of the ' (19 bits)
10-bit offset: 592   Length: 3 ' do' (16 bits)
Literal: 'c' (99) (6 bits)
Literal: 'k' (107) (7 bits)

The ‘forced’ literal normally would have been part of the first match, but by emitting it as a literal instead it was able to find a more efficient match and only two offset/length tokens were needed instead of three, for a difference of 80 bits for the original versus 70 bits for the improved match. Doing these extra tests does slow down compression a fair bit though, so I made it an optional feature, enabled on the command line.
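In code, the test boils down to a one-byte lookahead, something like this (a simplified sketch; the helper names are stand-ins for the real routines, not the compressor’s actual interface):

#include <stddef.h>
#include <stdint.h>

#define MIN_MATCH 3

/* Stand-ins for the real search and output routines. */
extern int  find_longest_match(size_t pos, uint32_t *off);
extern void emit_literal(uint8_t byte);
extern void emit_match(uint32_t off, int len);
extern const uint8_t *buf;

void compress_lazy(size_t pos, size_t end)
{
    while (pos < end) {
        uint32_t off = 0, off2 = 0;
        int len  = find_longest_match(pos, &off);
        int len2 = (pos + 1 < end) ? find_longest_match(pos + 1, &off2) : 0;

        if (len >= MIN_MATCH && len2 > len) {
            emit_literal(buf[pos]);   /* the 'forced' literal */
            pos += 1;                 /* the longer match is found on the next pass */
        } else if (len >= MIN_MATCH) {
            emit_match(off, len);
            pos += len;
        } else {
            emit_literal(buf[pos]);
            pos += 1;
        }
    }
}

A real version would compare the actual encoded bit costs rather than just the raw match lengths, but this is the gist of it.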

At this point though, it’s getting harder and harder to extract gains in compression efficiency, as it starts devolving into a whole bunch of special cases. For example, increasing the buffer size sometimes makes compression worse, as in the following example:

'diff' output between two runs:
 17-bit offset: 87005   Length: 10 'with the t' (26 bits)
 14-bit offset: 10812   Length: 3 'arp' (18 bits)
-13-bit offset: 7705   Length: 3 ', w' (17 bits)
-13-bit offset: 5544   Length: 8 'ould you' (19 bits)
+18-bit offset: 131750   Length: 4 ', wo' (41 bits)
+13-bit offset: 5544   Length: 7 'uld you' (19 bits)
 16-bit offset: 50860   Length: 7 '?  You ' (22 bits)
 17-bit offset: 73350   Length: 10 'take that ' (26 bits)

The compressor looks for the longest matches, and in the ‘+’ run it found a longer match, but at a larger offset than in the ‘-’ run. In this case, 18-bit offsets are rare enough that their symbol has been pushed low in the Huffman tree and the bitstring is very long, making it even less efficient to use a long offset, and in the end a whopping 24 bits are completely wasted. Detecting these kinds of cases requires a bunch of extra tests though, and this is just one example.

So, I think that’s about all I’m going to do for attempting to improve the compression efficiency. How does it do overall? Well, that 195kB text file that originally compressed to 87.4kB and then made it down to 84.2kB can now be compressed down, with harder searching on and optimal buffer and match length sizes determined, to 77.9kB. That’s even lower than ‘gzip -9’ at 81.1kB!

It’s not all good news, though. If I take the Canterbury Corpus and test against it, the original total size is 2810784 bytes, ‘gzip -9’ reduces them to a total of 730732 bytes (26.0%), and at the default settings, my compressor gets…785421 bytes (27.9%). If I enable the extra searching and find optimal compression parameters for each file via ‘explode’, I can get it down to 719246 bytes (25.6%), but that takes a lot of effort. Otherwise, at the default settings, some of the files are smaller than gzip and others are larger; typically I do worse on the smaller files where there hasn’t really been much of a chance for the Huffman trees to adapt yet, and the Excel spreadsheet in particular does really poorly with my compressor, for some reason I’d have to investigate further.

But I’m not going to. No, the main remaining problem was one of speed…

I Ain’t No Huffman

In terms of compression efficiency, I knew there were some obvious places that could use improvement. In particular, my Huffman trees…weren’t even really Huffman trees. The intent was for them to be Huffman-like, in that the most frequently seen symbols would be closest to the top of the tree and thus have the shortest bitstrings, but the construction and balancing method was completely different. Whenever a symbol’s count increased, it was compared against its parent’s parent’s other child, and if the current symbol’s count was now greater, the two nodes were swapped, with a new branch inserted where the updated node used to be and the other child pushed down a level.

Unfortunately, that method led to horribly imbalanced trees, since it only considered nearby nodes when rebalancing, even though changing the frequency of one symbol can affect the relationship of symbols on relatively distant parts of the tree as well. As an example, here’s what a 4-bit length tree wound up looking like with my original adaptive method:

Lengths tree:
    Leaf node 0: Count=2256 BitString=1
    Leaf node 1: Count=1731 BitString=001
    Leaf node 2: Count=1268 BitString=0001
    Leaf node 3: Count=853 BitString=00001
    Leaf node 4: Count=576 BitString=000001
    Leaf node 5: Count=405 BitString=0000001
    Leaf node 6: Count=313 BitString=00000001
    Leaf node 7: Count=215 BitString=000000000
    Leaf node 8: Count=108 BitString=0000000011
    Leaf node 9: Count=81 BitString=00000000101
    Leaf node 10: Count=47 BitString=000000001001
    Leaf node 11: Count=22 BitString=00000000100001
    Leaf node 12: Count=28 BitString=0000000010001
    Leaf node 13: Count=15 BitString=000000001000000
    Leaf node 14: Count=9 BitString=000000001000001
    Leaf node 15: Count=169 BitString=01
    Avg bits per symbol = 3.881052

If you take the same data and manually construct a Huffman tree the proper way, you get a much more balanced tree without the ludicrously long strings:

    Leaf node 0: Count=2256 BitString=10
    Leaf node 1: Count=1731 BitString=01
    Leaf node 2: Count=1268 BitString=111
    Leaf node 3: Count=853 BitString=001
    Leaf node 4: Count=576 BitString=1100
    Leaf node 5: Count=405 BitString=0001
    Leaf node 6: Count=313 BitString=11011
    Leaf node 7: Count=215 BitString=00001
    Leaf node 8: Count=108 BitString=000001
    Leaf node 9: Count=81 BitString=000000
    Leaf node 10: Count=47 BitString=1101000
    Leaf node 11: Count=22 BitString=110100110
    Leaf node 12: Count=28 BitString=11010010
    Leaf node 13: Count=15 BitString=1101001111
    Leaf node 14: Count=9 BitString=1101001110
    Leaf node 15: Count=169 BitString=110101
    Avg bits per symbol = 2.969368

That’s nearly a bit per symbol better, which may not sound like much but with the original method there was barely any compression happening at all, whereas a proper tree achieves just over 25% compression.

So, I simply dumped my original adaptive method and made it construct a Huffman tree in the more traditional way, repeatedly pairing the two lowest-count nodes in a sorted list. To keep it adaptive, it still does the count check against the parent’s parent’s other child, and when a symbol crosses that threshold it simply rebuilds the entire Huffman tree from scratch based on the current symbol counts. This involves a lot more CPU work, but as we’ll see later, performance bottlenecks aren’t necessarily where you think they are…
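The rebuild itself is just the textbook construction; roughly like this (my compact O(n²) sketch for illustration, not the actual code; with only a few hundred symbols it’s plenty fast):

#include <stdlib.h>

struct node {
    unsigned long count;
    int left, right;                  /* child indices; -1 for a leaf */
};

/* Build a Huffman tree over nsyms symbols from their current counts,
   returning the index of the root.  nodes[] needs room for 2*nsyms-1
   entries.  Every symbol goes in, even the zero-count ones. */
int build_huffman(struct node *nodes, const unsigned long *counts, int nsyms)
{
    int *live = malloc(nsyms * sizeof *live);
    int n = nsyms, next = nsyms, i;

    for (i = 0; i < nsyms; i++) {
        nodes[i].count = counts[i];
        nodes[i].left = nodes[i].right = -1;
        live[i] = i;
    }
    while (n > 1) {
        int a = 0, b = 1;
        if (nodes[live[b]].count < nodes[live[a]].count) { a = 1; b = 0; }
        for (i = 2; i < n; i++) {     /* find the two lowest counts */
            if (nodes[live[i]].count < nodes[live[a]].count) { b = a; a = i; }
            else if (nodes[live[i]].count < nodes[live[b]].count) { b = i; }
        }
        nodes[next].count = nodes[live[a]].count + nodes[live[b]].count;
        nodes[next].left  = live[a];  /* merge the pair under a new */
        nodes[next].right = live[b];  /* internal node */
        live[a] = next++;
        live[b] = live[--n];
    }
    free(live);
    return next - 1;
}

Assigning the bitstrings afterwards is then just a walk down the tree, appending a 0 or 1 bit at each branch.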

My trees also differ from traditional ones in that they prepopulate the tree with all possible symbols with a count of zero, whereas usually you only insert nodes into a Huffman tree if they have a count greater than zero. This is slightly suboptimal, but it avoids a chicken-and-egg problem with the decoder not knowing what symbol a bitstring corresponds to if it doesn’t exist in the tree yet because it’s the first time the symbol has been seen.

Knowing that, and with the improved Huffman trees, another thing became clear: using Huffman trees for the offsets wasn’t really doing much good at all. With most files the offset values are too evenly distributed, and many are never used at all, so all those zero-count entries would get pushed down the tree and wind up with longer strings; the first time an offset got used, its string would often be longer than the offset’s basic bit length, causing file growth instead of compression. I instead just ripped those trees out and emitted plain old integer values for the offsets.

The way I was constructing my trees also had another limitation: the total number of symbols had to be a power of two. With the proper construction method, an arbitrary number of symbols could be specified, and that allowed another potential optimization: merging the identifier tree and the literals tree. The identifier token in the output stream guaranteed that there would always be at least one wasted non-data bit per token, and often two. Merging it with the literals would increase the size of the literal symbols, but the expectation was that the larger literal size would still, on average, be smaller than the sum of the identifier symbols and the smaller literal symbols, especially as more ‘special case’ symbols are added. Instead of reading an identifier symbol and deciding what to do based on that, the decoder would read a ‘literal’ symbol, and if it was in the range 0-255, it was indeed a literal byte value and interpreted that way, but if it was 256 or above, it would be treated as having a following offset/length pair.
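On the decoding side, the main loop then conceptually becomes something like this (sketch only; the helper names are made up, and end-of-stream handling is glossed over):

#include <stdint.h>

struct tree;                                  /* opaque Huffman tree */

/* Stand-ins for the real bit-reader and tree-walker routines. */
extern int      read_symbol(struct tree *t);  /* decode one symbol */
extern uint32_t read_bits(int nbits);         /* raw integer field */
extern void     put_byte(uint8_t b);
extern void     copy_match(uint32_t off, int len);
extern int      offset_nbits(int sym);        /* symbol -> offset bit count */
extern struct tree literal_tree, length_tree;

void decode_stream(long out_remaining)
{
    while (out_remaining > 0) {
        int sym = read_symbol(&literal_tree);
        if (sym < 256) {
            put_byte((uint8_t)sym);           /* plain literal byte */
            out_remaining -= 1;
        } else {
            /* 256 and up: an offset/length pair follows, and which
               symbol it is tells the decoder which offset range (and
               thus how many offset bits) to read. */
            uint32_t off = read_bits(offset_nbits(sym));
            int len = read_symbol(&length_tree);
            copy_match(off, len);
            out_remaining -= len;
        }
    }
}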

The range of offsets to handle would also have to change, but that’s for next time… With the Huffman tree improvements, my 195kB test file that compressed to 87.4kB before now compressed to 84.2kB. Still not as good as gzip, but getting there.