Taking out the Trash

Regular readers are aware that I disabled comments on this site about a week ago, after receiving over 2000 spam comments in just a few hours. Alert readers may also have noticed that all comments have been missing since: every posting has shown 0 comments regardless of prior comments. This is because when I disabled comments, I wanted not only to prevent further postings but also to prevent the display of the 2000+ morsels of nastiness. The quickest way to do both, given my Blosxom setup, was to simply remove the writeback plugin.

I’ve made adjustments to my .htaccess file that will prevent the same type of posting in the future, but I expect the spammers to adapt. The delay in re-enabling comments was the need to clean out the spam. Because I use Blosxom and writeback, my comments are stored in the filesystem, in a directory tree that matches the site layout, one file per blog post (i.e., all comments for a given post are in one file). With over 2000 bad comments spread across 224 different files, I wasn’t about to clean things up by hand. Instead, I wrote a Perl script to help me do it.

Even though it’s used a lot in Blosxom and various plugins, I’ve never had a firm grasp of Perl’s File::Find module, so I decided to use File::Find::Rule instead. (For a nice explanation of this module, see File::Find::Rule in the 2002 Perl Advent Calendar.) The only problem was that the module (and several prerequisites) were not installed on my webserver. Having shell access, I was able to install local copies of the needed modules. Being very busy over the holidays, I only got around to this today.
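If you haven’t seen File::Find::Rule before, here’s a minimal sketch of the kind of chained call it makes possible — no callback subroutine required. The directory layout and the ‘.wb’ extension below are made up for illustration; this builds a tiny stand-in data directory rather than touching a real Blosxom install:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find::Rule;
use File::Temp qw(tempdir);
use File::Path qw(make_path);

# Build a tiny stand-in for a Blosxom data directory (paths are invented)
my $datadir = tempdir( CLEANUP => 1 );
make_path("$datadir/tech");
for my $file ("$datadir/hello.wb", "$datadir/tech/scrub.wb") {
    open my $fh, '>', $file or die "can't write $file: $!";
    print $fh "a comment\n-----\n";
    close $fh;
}

# One chained call replaces a File::Find callback:
# find every regular file named *.wb anywhere under $datadir
my @files = sort( File::Find::Rule->file->name('*.wb')->in($datadir) );

print scalar(@files), " writeback files found\n";
```

Compare that with File::Find, where the same search means writing a `wanted` callback and stashing results in an outer variable — the rule-based interface reads much closer to what you actually mean.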

The script is called scrub, and is available under the GNU General Public License. You can download scrub. Please note that scrub is written as a command line utility – you will need shell access on your webserver to use it. If there is demand (and if no one else does it first), I may develop a CGI version of scrub to run from a web server.

So how does scrub work? Why, Voodoo magic, of course. In fact, the darkest Voodoo of all… regular expressions. Supply a regex, and scrub can list all comment files containing the regex. It can also display the matching comments, and a count of files matched. Most importantly, it can remove offending comments when run with the -scrub option.

scrub is designed to work with comment files created by the writeback plugin. These files contain each comment, along with the name and URL of the poster as supplied on the posting form. I’ve modified my copy of writeback to also log the IP address. Any information in the comment file can be matched by the regex… so if you are logging IPs as I am, you can quickly find (and eliminate) all comments from a given IP. scrub overrides Perl’s $/ magic variable, the input record separator. By setting $/ to "-----\n" (the comment separator in writeback files), scrub can process each comment as a single unit.
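Here’s a minimal sketch of that trick. The field names and sample comments below are invented, but the "-----\n" separator matches the writeback format; with $/ set, each call to the readline operator hands back one whole comment instead of one line:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented sample data in writeback's one-file-per-post layout:
# each comment ends with the "-----\n" separator line.
my $sample = "name: alice\ncomment: nice post\n-----\n"
           . "name: spammer\nurl: http://spam.com\ncomment: buy stuff\n-----\n";

# Read from an in-memory "file" so the example is self-contained;
# scrub opens the real writeback file here instead.
open my $fh, '<', \$sample or die "can't open sample: $!";

local $/ = "-----\n";    # one record per comment, not per line

my @kept;
while ( my $comment = <$fh> ) {
    next if $comment =~ /spam\.com/;    # drop any comment matching the regex
    push @kept, $comment;
}
close $fh;

print scalar(@kept), " comment(s) kept\n";
```

To scrub a file for real, you’d write the surviving records back out in place of the original — which is exactly why operating on whole comments, rather than lines, matters: a match anywhere in a comment removes the entire comment, name and URL included.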

Here are a few examples of scrub usage:

  1. List all files containing ‘spam.com’, display total:

    scrub -regex 'spam.com' -list -count
  2. Remove all comments containing ‘spam.com’, show progress via filenames:

    scrub -regex 'spam.com' -list -scrub
  3. Show all files containing raw html hyperlinks, and the actual comments:

    scrub -regex '<a href' -list -show

Example 3 above brings to light a deficiency in scrub – the regexes are always case-sensitive. If I post a revision, this will be addressed.
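In the meantime, one workaround is to embed Perl’s (?i) modifier directly in the pattern you supply, which turns on case-insensitive matching for the rest of the pattern. A quick illustration (the sample comment is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A spam comment using uppercase HTML, which a plain '<a href' misses
my $comment = 'Visit <A HREF="http://spam.com">here</A> today!';

my $plain = ( $comment =~ /<a href/ )     ? "match" : "no match";
my $insen = ( $comment =~ /(?i)<a href/ ) ? "match" : "no match";

print "plain:  $plain\n";   # case-sensitive pattern misses the uppercase tag
print "(?i):   $insen\n";   # embedded modifier catches it
```

So until a revision adds a proper case-insensitivity switch, passing `-regex '(?i)<a href'` should catch the shouty variants too.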

If you find this at all useful, please leave me a comment… they are enabled once again.

Both comments and pings are currently closed.

3 Responses to “Taking out the Trash”

  1. chornbe Says:

    Slick! I’m glad you were able to get it cleaned up. What a pain in the $@&@#(*&^@$(. Sorry it had to happen :(

  2. Chet Says:

You rock.

    Thanks VERY MUCH for this.

  3. roland Says:

oh please, please please, call it something else

    scrub is a DOE utility used for overwriting files, and couldn’t you just use grep to do the same thing?