100% Guaranteed Genuine Drivel

After putting it off way too long, I finally have a TLS certificate set up for all the web stuff I host here, courtesy of Let’s Encrypt and Certbot.

It’s not like I’m particularly worried about attackers hijacking the site (because it has so much influence…), but the winds are blowing towards HTTPS-everywhere, and there’s no point in getting left behind. Certbot makes it pretty painless, at least, though I’m sure there are some older images or links that break under HTTPS and will need cleaning up over time. Any non-HTTPS links on or to the site should automatically redirect to the appropriate HTTPS URL.
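Certbot can set up the redirect itself, but for reference, the server side of it boils down to a couple of rewrite lines — a generic sketch, assuming Apache and mod_rewrite, not necessarily what Certbot actually writes:

# Sketch: send any plain-HTTP request to its HTTPS equivalent
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]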

Back…Sort Of

My old Linux box keeled over and died on me on the weekend. I’ve narrowed the problem down to either a bad motherboard or a bad CPU, and it’s far too old to be worth replacing individual parts, so I’ve gone ahead and ordered a replacement system. For once, I’m kind of tired of assembling my own systems.

It’s hard to find something that perfectly suits my needs: I wanted something small and quiet, capable of recording TV, with room for lots of storage, DVI or HDMI video output, and a CPU powerful enough for some video work (potentially playing HD in the future). There are a lot of options that fulfill *some* of those requirements, but not *all* of them. The Dell Studio Hybrid lacked eSATA and FireWire 800, so external storage would have been slow. The HP Slimline only had 100Mbit Ethernet. The Mac Mini is getting old and a potential refresh is too far away. And so on…

I eventually settled on the Dell Vostro 220s. The base spec isn’t perfect, and it’s a bit bigger than the Studio Hybrid, but the inadequacies can be overcome via the PCI/PCIe slots. I upgraded the video card and ordered a low-profile TV tuner card, which takes care of the DVI/HDMI and TV recording. There are two internal drive bays, so storage won’t be a problem, and I can add an eSATA card later on if I want fast external storage.

It could be a few weeks before the new system arrives and is set up, though, so for now I’ve ripped the main boot/home drive out of the old server and have it running on my gaming box. It actually booted straight off a USB enclosure and ran with no major problems (just some device names changing), other than being a little on the slow side…

Ha Ha, You Can’t Read This (Yet)

You never really realize how much you’ve come to depend on the Internet until you lose it for a while. My connection has been really dodgy all weekend, and it still is even as I write this (yay for hosting your own server).

So, maybe I should play a game. My current Diablo 2 character is level 26 on the USWest realm and OH WAIT. Well, I could play single-player. I don’t really want to build a character from scratch again though, so maybe I should go download the Atma editor and OH WAIT. Well, maybe I could play a class I haven’t really used much before, and see how it plays from the start. I’ll just check for a decent skill build on the Arreat Summit site and OH WAIT.

I can still watch TV through MythTV, at least. Some of the channel listing entries look a bit out of sync though. I should check that the lineup is right at schedulesdirect.org and OH WAIT. Oh well, I wonder what the other F1 thread guys thought about the race OH WAIT.

There is still some network activity getting through, but not enough to reliably connect to web sites and such. I’m curious as to what kind of traffic there is, so I could sniff it by getting the ‘tcpdump’ package, and Ubuntu will even download and install OH WAIT.

I even popped over to the office for a little while just to check some of my regular sites. Good thing it’s only a quick trip away. 99% of the time these problems resolve themselves, since they tend to affect large blocks of customers, but I’m not so sure this time.

Edit: Well, it seems to be partly working again, or at least some sites load (albeit slowly), though I still can’t get to some others. Only took 36 hours…

Edit Edit: And it’s dodgy again as of this morning. Woo.

Bad!

aipbot, you are a *bad robot*. Not only do you fetch the same URL multiple times unnecessarily, but you ignore robots.txt, and continue to repeatedly fetch URLs even after being 403ed.

Not that it’s draining all of my bandwidth or anything like that, but it’s annoying to see this kind of behaviour filling the logs of a very low-traffic site like this.
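For anyone wanting to do the same, serving the 403 is only a couple of mod_rewrite lines in the .htaccess — a sketch, assuming the bot really does announce itself as ‘aipbot’ in its User-Agent:

# Refuse the misbehaving crawler outright (the User-Agent pattern is a guess)
RewriteCond %{HTTP_USER_AGENT} aipbot [NC]
RewriteRule .* - [F]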

Bad robot! Go sit in the corner…

Thanks, eBay

No, seriously, I’m not being sarcastic for once. I was browsing through the server logs and noticed a sudden increase in hits on the picture of my iBook, with referers pointing back to an eBay auction. Great, another image leecher.

However, a few hours later in the log, the hits suddenly stopped. The referer of the final one looked slightly different, and looking up its IP address revealed that it was from within eBay itself. I tried to visit the eBay URL in the referers, but was informed that the item had been removed from auction.

A few hours later I started getting more hits from a different auction. Since I caught it sooner this time, I was able to visit eBay and catch the auction in action, and it was indeed largely composed of images taken from other sites (one from mine and a lot from a laptop review site). My anti-leeching protection was making my image show up as a ‘broken image’ icon at least, but it was lunchtime and I was feeling mischievous, so I started working on a script and some rules to redirect leeching attempts to a random image instead. Nothing offensive, just pictures that would be confusing out of context.
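The ‘rules’ half of it is just mod_rewrite again — roughly along these lines, with the picker script name being a placeholder here:

# Sketch: image requests with an eBay referer get a decoy instead
# (random-image.php is a placeholder name for the picker script)
RewriteCond %{HTTP_REFERER} ebay\. [NC]
RewriteRule \.(gif|jpe?g|png)$ /random-image.php [L]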

Except before I could apply any of it, eBay yanked this auction as well. Apparently they already take image leeching quite seriously…

Sneaky

Spammers are always waging an ever-escalating war, and it’s interesting to note what tricks they get up to. Lately I’ve been seeing a lot of hits with referrer strings pointing back to various domains they’re promoting, obviously hoping to show up in a site statistics page. And now I’ve seen referrer strings that are actually ‘add this RSS feed’ action URLs, designed to secretly subscribe anyone who follows them to the spammer’s feed on their own portal page (mainly Yahoo’s so far).

I’d be impressed, if they weren’t such sleazeballs.

Not Paranoid Enough

Dammit. Despite thinking of myself as someone careful about these things, my web server was hacked earlier this morning. It’s my own fault though, as I’ve been getting a bit sloppy. I tested out AWStats a while back, left it installed, forgot about it, didn’t keep it updated, and of course the hack was then done through an AWStats flaw…

What I should have done was one of: 1) not keep it installed, 2) put a password check on it, 3) join the AWStats announcements list, where I would have gotten a notice about the flaw earlier, or 4) use a distro where it’s part of the standard packages and gets updated automatically.

Oh well. Fortunately, since I watch logs like a hawk, I noticed it and shut it down within 15 minutes of the initial break-in. Since the web server runs as ‘nobody’ it couldn’t actually damage anything; it just kicked off a script to port-scan other systems. It’s still depressing to realize that you’ve helped make the problem worse, even if only a little, and if I can’t find the time to admin this properly, maybe it’s not worth the hassle.
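On point 2 in particular, a password check is only a few lines of Apache config on the stats directory — something like this, with placeholder paths:

# Sketch: basic auth on the stats directory (paths are made up)
AuthType Basic
AuthName "Stats"
AuthUserFile /path/to/.htpasswd
Require valid-user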

Non-Stop Upgrades

If the site is acting slightly quirky, it’s because I’ve recently upgraded to WordPress 1.5 and some kinks probably still need to be worked out. It should be worth it though, since it adds a fair bit:

1) Bulk moderation of comments, including a separate spam category. It was a pain going through and marking them one-by-one, when they accumulate hundreds at a time…

2) There’s now a limit to the number of articles per page, and entries beyond that are kept on separate pages. This matters more as time goes on; some of the categories were getting pretty big, and selecting one used to display every single entry at once.

3) Pages other than the main page can be managed from within WP now, too. That’ll help get a consistent look for things like the links page, which still has the old MT style.

4) A better theme system, that’s split up so it’s easier to edit and should be more future-proof.

And various other minor improvements in the admin panels and comment authentication…

Lockdown

The work/coding-related entries are now locked and discontinued. Sorry to all you searchers who were looking for some of the technical stuff in there.

If I have time, I may go back and edit out the work context from some of the more useful entries.

No Rank For You

Looks like Google is starting to address how blogs and spam interfere with search results and indexing.

On the plus side, this will help cut down on spammers who generate a high rank by spamming blogs. A recent search on a particular quote from Plato turned up pages and pages and pages of personal blogs where that quote had been used as part of some spammer’s random text corpus.

On the other hand, it’ll reduce the rank of normal commenters and their links back to their own homepages and blogs, too. Whether that’s a good thing or a bad thing is probably debatable; some people argue that blogs are overrepresented in search engine results already, but someone who regularly comments on-topic in various places might deserve a bit more exposure…

Math Is Fun II

Let’s see… Apparently there are 373 or so blog entries on this site, each one with a unique URL.

There’s also an RSS comments feed link for each entry, and although it’s not explicitly linked, some crawlers are following the trackback URL as well, for another 746 URLs.

Plus, entries can be referenced by date. I pretty much write only once per day, so that’s another 373 by-day links, plus 16 by-month links.

And, of course, various other miscellaneous links. 16 categories, two main page feeds, and the main page itself, for another 19.

That makes a total of at least 1527 different URLs just within the site itself.
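(That’s 373 + 746 + 373 + 16 + 19 = 1527, for anyone checking the math.)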

It’s no wonder then that web crawlers make up the vast majority of server hits…

Numbers Are Fun

Out of curiosity, I tried to break down all of the hits this site has ever received (except for a couple of months where I lost the client and referrer data) and see what kinds of categories they fell into. Out of 132,046 hits:

People I Know Personally: 10.5%
Myself: 10.6%
Web Spiders: 28.9%
Directed Here By Search Engines: 30.2%
RSS Aggregators: 4.5%
Bandwidth Thieves: 0.3%
IIS Backdoor Attempts: 5.7%
Proxy/Mail Relay Exploit Attempts: 0.2%

Yahoo Goes Nuts

Looking through the logs, I’ve been seeing some strange queries from Yahoo’s crawler recently:

[19/Jul/2004:21:34:09 -0600] "GET /MadonnaCiconne/parcel-problems/mboic.htm HTTP/1.0"
[19/Jul/2004:21:43:14 -0600] "GET /ambush/000122.htm HTTP/1.0"
[20/Jul/2004:05:21:00 -0600] "GET /sis/000186/favorpopscandy.htm HTTP/1.0"
[20/Jul/2004:07:20:38 -0600] "GET /lokalen_pa_nett.htm HTTP/1.0"

It’s like bits and pieces of legitimate paths on my site are getting mixed in with random keywords. Either their crawler has gone a bit bonkers, or some other site out there is making up random links and it’s trying to follow those…

Too Close, So Far

I know a lot of people on the Net. Or, rather, I know *of* a lot of people. Would I call these people friends? Not many of them; I don’t really know them at a sufficiently personal level to think of them as friends. Acquaintances, then? Some of them, certainly. Many, though, are people I know of by having been led to them through some other means (search engine, posting, referral, etc.), and they in turn aren’t necessarily even aware of my existence. A ‘fan’ then, perhaps.

On the Net though, everyone is equal. When someone’s name comes up via a comment or a link or such, it’s not immediately clear just what the relationship is; there’s often no distinction between a lifelong pal, a beer buddy, an acquaintance, or plain old hangers-on. As a result, someone’s circle of friends can appear to be larger than it really is.

This can lead to some odd behaviour, at least so far as I’ve seen. If a particular topic of interest comes up, someone may be inclined to comment on it. But, given the circles of friends that are already established, that person may also be afraid of overrepresenting their relationship with this group, and feel uncomfortable posting. Why would they care what some random bozo barges in and says, after all? Who are you to just show up and start spouting off? But, on the other hand, how else do connections get established in the first place? These circles had to start somewhere and develop somehow. Plus, those circles may not actually be as strong as they might seem to an outsider, due to the effect above.

I would imagine that there’s at least some portion of the Internet population who *want* to reach out to other people but are afraid to, for reasons that are often illusory yet difficult to pin down. The question is, how do you break the cycle…

Napster Still Bad, But WordPress Good

After using it a bit further, it looks like WordPress will work out well. In particular I like:

1) Changes are reflected immediately. With MT I had to load the admin URL, select the template section, select the Main Index template, edit in the change, save it, hit the Rebuild button, and wait a while, just to update the numbers on the front page. With WordPress I can just edit the main index file in a plain old text editor.

2) Experimental changes are easier to do. Modifying the style sheet or templates in MT often caused temporary disruptions while I worked out problems, and required a lot of rebuilding. Now I just make a copy of the main index page or style sheet (e.g., ‘index-test.php’), experiment with changes on it, and then copy it over the regular page when it’s complete, with no disruptions.

3) Posts can be given times and dates in the future, and will remain hidden until that time arrives and then automatically become visible. Writing posts a couple days in advance like this lets me sit on them a while and think of better ways to word things, more points to add, etc., without having to remember to go back and ‘activate’ them later on like a regular draft post.

It also lets posts be released on a more regular schedule, without me having to write them on that same schedule. If I’m feeling productive some night I could write out a half-dozen posts, and then automatically release one per day over the next week instead of overwhelming the site all at once, or trying to remember day-by-day what I was going to write about.

4) The handling of multiple and subdivided categories in WordPress is much better. Subcategories appear as expected in the category lists, and when multiple categories are specified they’re treated equally instead of being separated into ‘primary’ and tacked-on ‘secondary’ categories. (Though for some reason selecting a subcategory does not automatically select its parent category.)

5) Being PHP-based should allow for more flexibility. MT limited you to what you could do through its specific substitution tags, but if I wanted to I could use PHP and MySQL functions to extract whatever raw data I want from the WordPress database (see the sketch just after this list).

6) Posts can be split into multiple pages if they’re *really* long.

7) There’s a link management system built-in (i.e., the links in the upper-right corner). It’s not really necessary since you can always add links to the template yourself, but it works reasonably well enough.
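As an example of the flexibility mentioned in point 5, something like this would pull raw numbers straight out of the database — just a sketch, with placeholder credentials and assuming the default ‘wp_’ table prefix:

<?php
// Sketch: count published posts per month, straight from the database
// (connection details are placeholders; table prefix assumed to be wp_)
$db = mysql_connect('localhost', 'wp_user', 'password');
mysql_select_db('wordpress', $db);
$result = mysql_query(
    "SELECT DATE_FORMAT(post_date, '%Y-%m') AS month, COUNT(*) AS posts
       FROM wp_posts
      WHERE post_status = 'publish'
   GROUP BY month
   ORDER BY month", $db);
while ($row = mysql_fetch_assoc($result)) {
    echo $row['month'] . ': ' . $row['posts'] . "\n";
}
?>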

There are a few downsides too though:

1) Since everything is dynamically generated, it doesn’t get cached. Just browsing around the site generates a lot more page hits than you’d normally expect, and frequent hits from spiders and aggregators keep getting full sets of new data instead of the ‘304 you’ve-already-got-that’ response. At least it’s all text.

2) The post preview is on the same page as the edit boxes for the post itself. This is kind of a mixed blessing: it saves going to a separate page just to preview, but it can also have unexpected side effects. I have one draft post that I can’t edit anymore, because I accidentally put an HTML refresh directive in it, and as soon as I open the post for editing, the preview redirects to the new page.

3) The user registration system isn’t quite as complete as I’d have liked. Ideally registered users should be able to be marked for automatic access to protected posts, automatic clearance on comments while holding unregistered ones for approval, and such, but right now it doesn’t seem to do much besides let those users write their own posts. The groundwork is there though, so maybe in a future version…

4) Comments and trackbacks are mixed together, though they serve different purposes. There is a hack out there to fix that and put them in their own separate sections on the page though, and it’s not like I get a lot of trackbacks. :-)

Mapping The Past

Though I’ve converted the site over to WordPress, the old Movable Type archives are still there and accessible. This presents a bit of a problem, because how do I get rid of them? Search engines are going to continue referring to them for a while, so it would make sense to leave them in place so the searcher doesn’t get a 404, but leaving them there will just keep them in the search engines even longer. Redirecting all requests that point at the old archives to the new main page would be unfriendly, since it wouldn’t be what they were looking for.

What would be preferable is to have requests to the old pages automatically redirect to the same article under the new site. The major problem there though, is that the article numbers under the two systems are not the same; what was article #77 under MT is now #74 under WordPress, for example. It’s not a simple single offset for all the articles either, since MT exported them in the order they were posted, not by numerical order.

So, clearly what’s needed is some kind of script that will map the old numbers to the new numbers. Figuring that PHP might be worth a look, since WordPress itself is implemented in it, I brought up the PHP Manual, opened some of the WordPress code in an editor as a sample, and hacked out the following: redirect-mt.php

Instead of mapping each and every article number, it takes advantage of the fact that large ranges of numbers have the same offsets, so all it has to keep track of is the boundaries of those ranges. It took longer than I thought it would to extract those ranges from the MT articles, but the generation of that list could be automated, too. I was just lazy.
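The core of it is roughly the following — the boundary numbers here are made up for illustration (the real list in redirect-mt.php is longer), and the target URL assumes WordPress’s default ‘?p=’ style links:

<?php
// Sketch of the range-offset idea behind redirect-mt.php.
// Boundaries and offsets below are invented for illustration only.
$ranges = array(
    1   => 0,  // MT articles in the first range keep their numbers
    77  => 3,  // from 77 on, subtract 3 (so MT #77 becomes WP #74)
    150 => 5,  // another range with a different offset
);

$mt = (int) $_GET['index'];   // e.g. 77 from an "archives/000077.html" request
$offset = 0;
foreach ($ranges as $start => $o) {
    if ($mt >= $start) {
        $offset = $o;         // the last boundary at or below $mt wins
    }
}
header('Location: /heide/index.php?p=' . ($mt - $offset));
?>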

This script will remap the article numbers, but I still have to capture the requests and feed them to this script. Fortunately that’s a fairly simple addition to the .htaccess:

# Redirect Movable Type archive requests to the WordPress articles,
# through a remapper script
RewriteRule ^archives/([0-9]{6}).html$ /heide/redirect-mt.php?index=$1 [QSA]

It seems to work well enough, though I’m sure there’s probably some better way. I don’t think I’ll put PHP down on the resume quite yet… :-)