Keeping your Website Clean for Search Engines (and People!)

Thank goodness it's Friday! It's been an interesting week (aren't they all?), and I'm looking forward to the weekend when I get to spend more time with the family, and a bit less time with work. But then again, that's not what the title's about, is it?

Yesterday, while looking through the Google Webmaster Tools page for radkeland.org, I noticed that the Googlebots had stumbled across a whole bunch of bad links. OK, so Google gets to 'know' about these links in two ways. First, I build a sitemap of the site every night (with the help of a nice script they provide), based on pages visited (from the website log). Actually, the server does it ... I gave it a schedule, and it just does so when it's told ... nice, isn't it? Wouldn't it be swell if people acted that way? Isn't ... oh. Sorry, I think I was wandering off on a tangent there. The other way is by following links already on web pages.

Now ... there are roughly 340 links in the generated sitemap. I'm not about to go through them manually, but I may write a tool to parse the sitemap and report any bad pages to me (I have control over what goes into the sitemap, so this gives me a way to manage those bad links). Writing that tool is a couple-of-hours job, so it might be a while before I do it. What about broken links on the pages themselves? Well, they are very easy typographical mistakes to make. Take a look at:

<a href="www.whitehouse.gov">Whitehouse</a>
<a href="http://www.whitehouse.gov">Whitehouse</a>

In the first form, I get a link to http://www.radkeland.org/www.whitehouse.gov, because without the http:// in front the browser treats the address as a path relative to the current page. In the second form, I get a link to http://www.whitehouse.gov. The first one is a boo boo, and having pages with bad links makes search engines think you're a schmuck. Of course, you just might be a schmuck, but shhh .... you don't want the search engines knowing it, so we need to fix it. We have about a hundred pages to look at, and some of them have multiple entry points.
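If you want to see that behaviour for yourself, here is a minimal Python sketch. It uses the standard library's urljoin to resolve both forms against the site's front page, plus a small helper that flags hrefs which look like they are missing their http://. The names HrefCollector and suspicious_hrefs are my own inventions for illustration, and the "starts with www." test is only a rough heuristic, not a complete check.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in a chunk of HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def suspicious_hrefs(html):
    """Return hrefs with no scheme that start with 'www.' (a heuristic)."""
    collector = HrefCollector()
    collector.feed(html)
    return [
        href
        for href in collector.hrefs
        if not urlparse(href).scheme and href.lower().startswith("www.")
    ]


# Assumption: the links live on the site's front page.
page = "http://www.radkeland.org/"

# Resolve each form the way a browser would.
print(urljoin(page, "www.whitehouse.gov"))
print(urljoin(page, "http://www.whitehouse.gov"))

snippet = (
    '<a href="www.whitehouse.gov">Whitehouse</a>'
    '<a href="http://www.whitehouse.gov">Whitehouse</a>'
)
print(suspicious_hrefs(snippet))
```

The first urljoin call lands back on radkeland.org, the second goes where it was meant to, and only the first href gets flagged.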

There are lots of link checkers out there, but I finally found a nice free (as in freedom) one, called gurlchecker (Gnome URL Checker). The first step is to create a project (Project>New Project), and configure it as a website. You probably don't want to check the links of everything you point to, since ... for example, Wikipedia might keep your computer busy for a long time. When you accept the project's setup, gurlchecker will do a quick scan, based on your front page, and generate some links. You then need to actually do the full (recursive) scan of your website, by selecting Project>Update all links. At this point, you just need to wait, while you crawl yourself. Gurlchecker generates a nice tree showing all of your links, but better, it flags bad links in red, so you just need to scroll down the list to see what's missing. This run definitely takes a while, and I'm certain there are more efficient ways of doing it. If you have any, please leave a comment. Otherwise, I'll update when I find a better tool.

Finally, here's a picture of the generated website scan: Gnome URL Checker in action. This is a scan of the Radke Land website as of 3-30-07.

Happy Friday!

Addendum: After work, the kids went to bed before me (a rare occasion!). I've been playing with gurlchecker a bit more, and I've finally got a clean bill of health (no bad links). There is a nasty bug, which seems to manifest when you've saved a session and then re-crawl your site. It's very quick and easy to create a new session, so that's what I'll run with for now. Should I ever stumble into a pile of time, I may take up helping with the project. I think (pure speculation) there's some Python back there, and I need to learn that language anyhow, so ... that's as good an incentive as any.