Clobbered Apache, Virtual Hosts, and More Sitemap Cleaning with Drupal 6.x

Wow ... in the spirit of blog entries I feel the need to ... express a bit before getting to the meat. All of "this" started when I decided I should have a nifty GUI HTML editor to make it easier on users. Then ... oh! Yeah ... Drupal 6 (next blog, right after this ... if only rough notes ... I promise) and ... Image Module issues ... and then taking another look at my sitemap from previous efforts ... well ... it all devolved into chaos.

I recently noted in this post that log format changes had left my site updates in a sort of ... stasis. As I investigated this issue, I realized that a year ago I had chosen to ignore the fact that I was hosting two other virtual domains (http://www.basswackertackle.com and http://www.rapidbassfishing.com), and was filtering 'popular' hits out of my own sitemap as a bad, ugly, will-never-work-in-the-long-term hack. (As a side note, this is an encouragement to all those who think commenting in code and configuration files is useless ... this is what I found in the sitemap configuration files for those two domains:)

<!-- JR 5-30-07  I give up, doing directory crawl
    "accesslog" nodes tell the script to scan webserver log files to
    extract URLs on your site.  Both Common Logfile Format (Apache's default
    logfile) and Extended Logfile Format (IIS's default logfile) can be read. -->

That was my hint to actually separate out the virtual hosts' logs. Unfortunately, the default Apache log format doesn't record which virtual host was targeted by a request (in the access_logs), so we had to change a couple of things. Here are the (raw) notes I came up with as I struggled with this problem:

Ok, I wrestled with this last year, and finally gave up. I guess we get more stubborn every year. At any rate, I want to leave the existing mechanisms for log generation and rotation in place, and also create virtual host specific logfiles for each virtual host. This will be a monolithic Perl beastie, which will read the access logs, and report a single access log per host. We also want to minimize any change to the log file format, since third party programs will be looking for 'default' settings. So ... here's the start:

Edit: /etc/httpd/conf/httpd.conf
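Find the line:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
and change it to:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" vhost=%v" combined
Then reload Apache so the new format takes effect. Lines logged before the change carry no vhost= field.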
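
To see what that appended field means for a single log line, here is a minimal Perl sketch. The sample line and the split_vhost name are made up for illustration; the pattern is the same idea the script further down relies on, with the vhost= field peeled off so the remainder is an ordinary combined-format line again.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one access-log line into (vhost, original combined-format line).
# Returns an empty list when the line carries no vhost= field.
sub split_vhost {
    my ($line) = @_;
    chomp $line;
    # The greedy (.*) means the LAST " vhost=" on the line wins, which is
    # the field appended by LogFormat rather than anything in the user agent.
    return unless $line =~ /^(.*) vhost=([a-zA-Z0-9.-]+)(.*)$/;
    return ($2, "$1$3");
}

# Hypothetical sample line, not taken from a real log.
my $sample = '192.0.2.10 - - [30/May/2008:10:15:32 -0500] '
           . '"GET /node/1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0" '
           . 'vhost=www.example.com';

my ($vhost, $combined) = split_vhost($sample);
print "vhost:    $vhost\n";
print "combined: $combined\n";
```

Third party tools that expect the stock combined format can then be pointed at the stripped lines instead of the raw access_log.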

So there you have it. I wrote the 'monolithic Perl beastie', which needs to be run as a cron job before the sitemap generation cron jobs. Here are its contents:

#!/usr/bin/perl
#
# splitapachelogstovsites.pl
#
# Copyright 2008 Joshua Radke (josh at radkeland dot org)
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License (version 2) as
# published by the Free Software Foundation.
#
# Based on previous notes, the general format is as follows:
# Cat all of the log files together and put them in a single file at /tmp
# Remove all site-specific reprocessed log files
# Read through the temp file, and for each line with a
# '^(.*) vhost=([a-zA-Z0-9-\.]+)(.*)$' line
# (see http://www.dns.net/dnsrd/trick.html#legal-hostnames),
# make the new site specific logfile if it doesn't exist, then plunk the
# line into it, minus the non-vhost match ...  Note that we'll keep the
# open filehandles around (and available in a hash of references to
# filehandles) until we're done parsing.
# Close filehandles
# Delete tmp logfile
# Done.
#
# Ok, IO::All presents way too many ... excellent paradigms.  I will
# continue to use it.
#
# We make the assumption that logfiles are numbered 1, 2, 3, 4 .... etc.

use strict;
use IO::All;
use Data::Dumper;

my $logpath = "/var/log/httpd/";
# my $logpath = "/home/josh/tmp/logs/";
my $currentlogfile = "access_log";
my $logfilebasename = "access_log.";
# my $ologpath = "/home/josh/tmp/logs/vhosts/";
my $ologpath = "/var/log/httpd/vhosts/";
my $pid = $$;
my $tmppath = "/tmp/";
my $i;
my $error = 0;
my $contents;

# I was pondering whether to use the filesystem paradigm or perl paradigm
# for the file concatenation, but I think it's best to just do it in
# program (for portability).  We'll leave as much as we can to forward
# global variables.

# Okies, read and cat ... no formatting.
io("$logpath$currentlogfile") > $contents or die "Cannot read initial logfile";
$contents > io("${tmppath}acl-$pid"); # For Apache Cumulative Log

# Now ... we do the rest.
for ($i = 1; not $error; $i++) {
  unless (-e "$logpath$logfilebasename$i") {
    $error = 1;
    last;
  }
  io("$logpath$logfilebasename$i") > $contents or
    die "Cannot read $logpath$logfilebasename$i";
  $contents >> io("${tmppath}acl-$pid") or
    die "Unable to append to tmp cumulative file";
}

# Done ... now we have our fun.  Parse through the entire tmp log file,
# and sort the lines into their appropriate vhost specific files.  Note
# that we'll also do our own IO, so we don't open/close files for every
# matched line.

open(my $alllogs, "<", "${tmppath}acl-$pid") or
  die "Unable to open ${tmppath}acl-$pid for reading: $!";

my %openvlogs;

while (<$alllogs>) {
  chomp;
  next unless $_ =~ /^(.*) vhost=([a-zA-Z0-9-\.]+)(.*)$/;
  my ($pref, $vhost, $suff) = ($1, $2, $3);

  # Make sure we have an existing filehandle.  Create if needed.
  unless (exists($openvlogs{$vhost})) {
    # We need to open this file, and make the value a reference to the
    # open filehandle.  Delete first if needed.
    if ( -e "$ologpath$vhost") {
      unlink "$ologpath$vhost" or
        die "Unable to delete old log files at $ologpath$vhost: $!";
    }
    open($openvlogs{$vhost}, ">", "$ologpath$vhost") or
      die "Unable to open $ologpath$vhost for writing: $!";
  }

  # For each virtual host, the above section will only be evaluated once.
  # After that, we simply need to file the line where it goes, and be done.
  print {$openvlogs{$vhost}} "$pref$suff\n";
}

# Close everything first, then remove the temporary cumulative log.
close $alllogs;
foreach (keys(%openvlogs)) {
  close $openvlogs{$_};
}
unlink "${tmppath}acl-$pid";
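
The "numbered 1, 2, 3, 4" assumption in the header comment is worth spelling out, since it decides which files get read. Here is a small standalone sketch of just that part. The rotated_logs name is mine and not part of the script above, and unlike the script it quietly skips a missing current log instead of dying.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the current log followed by its numbered rotations
# (access_log.1, access_log.2, ...), stopping at the first missing number.
sub rotated_logs {
    my ($dir, $base) = @_;
    my @files;
    push @files, "$dir/$base" if -e "$dir/$base";
    for (my $i = 1; -e "$dir/$base.$i"; $i++) {
        push @files, "$dir/$base.$i";
    }
    return @files;
}

# Example: list what would be read from the directory used in the script.
print "$_\n" for rotated_logs('/var/log/httpd', 'access_log');
```

Note that a gap in the numbering ends the scan early, and rotated files with any other kind of suffix (a date, say) would not be picked up at all.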

After running this, adjusting my sitemap config file (see this; if you don't have an account, get one), and filtering out all of the cross site scripting attacks, I feel like my sitemap is once again in a state of good maintenance.
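
For the cross site scripting filtering, something along these lines is one way to go about it. This is only a rough sketch: the patterns and the looks_like_xss name are illustrative assumptions, not the exact filter running on this site, and the sample lines are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative patterns only (an assumption, not a complete list).
my @suspicious = (
    qr/<\s*script/i,      # literal <script
    qr/%3C\s*script/i,    # URL-encoded <script
    qr/javascript:/i,     # javascript: pseudo-URLs
);

# True when the request field of a combined-format line matches any of
# the patterns above.
sub looks_like_xss {
    my ($line) = @_;
    my ($request) = $line =~ /"([^"]*)"/;   # first quoted field is the request
    return 0 unless defined $request;
    for my $re (@suspicious) {
        return 1 if $request =~ $re;
    }
    return 0;
}

# Hypothetical per-vhost log lines.
my @lines = (
    '192.0.2.1 - - [x] "GET /node/1 HTTP/1.1" 200 1 "-" "UA"',
    '192.0.2.2 - - [x] "GET /?q=%3Cscript%3Ealert(1) HTTP/1.1" 404 1 "-" "UA"',
);

# Keep only the lines that do not look like an attack.
print "$_\n" for grep { !looks_like_xss($_) } @lines;
```

Run against the per-vhost files before the sitemap generator sees them, a filter like this keeps attack URLs from showing up as 'popular' pages.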

For the sake of the curious, and maybe robots (of the future?), I should mention that this is on a Fedora Core 8 machine, with all efforts being taken to adhere to the prescribed upgrade path.

Happy Hacking!

Josh