Notes on Converting from Blosxom to WordPress

When I switched from Blosxom to WordPress, I had intended to write a HOWTO explaining exactly how to accomplish the task. Unfortunately, this proved to be a real challenge, since no two Blosxoms are exactly the same – there are hundreds of plugins that can change the core behavior in ways both subtle and profound. Instead, I decided to write up some notes on what I did (and why), in hopes that they can help others who want to make the switch. Note: I’m using WordPress 2.0.4; if you are on an older (especially pre-2.0) WordPress, your mileage may vary.

After getting a basic WordPress install set up on my webserver, I began by looking for a Blosxom import tool for WordPress. I found Eric Davis’ import-blosxom.php, available from Eric’s software page. (Note: Houran Bosci has a modified version of Eric’s script, which I only discovered after the fact. I’ve not tested it, but you may want to have a look at the changelog.) Eric’s script didn’t come with any docs, so I tried dropping it into my wp-admin/import folder. I then went to the Import section of my WordPress admin page, expecting to find a new option. Instead, the Blosxom import script’s output (a series of instructions) was oddly interspersed with the normal Import screen. I eventually found that the script had to be put directly in my wp-admin folder, and I had to go directly to that page with my browser.

Once you load import-blosxom.php in your browser, you get a page full of instructions. These instructions include the text for a Blosxom theme file, which creates a special RSS feed for your Blosxom site. It uses the extension .rss20 instead of .rss, so it shouldn’t conflict with your existing feed. You then fetch a copy of this feed, and the importer reads the feed to load your new WordPress blog with your old content. Sounds simple, right? Yeah, I thought so too.

First, a few notes about setting up the special RSS feed. By default, it needs the theme plugin, which I didn’t use. You can break the theme up into multiple flavour files, but I found it easier to just install the theme plugin. Drop and go, as I recall. The theme plugin requires the interpolate_fancy plugin, which I did use, and the filesystem plugin, which I didn’t, but that was also a quick install.

Now, by default, Blosxom doesn’t show all posts, only the most recent (by default, 10). This limit is applied for any theme/flavour, including feeds. The number can be configured via the $num_entries variable in the Blosxom script. I played around briefly with trying to use the config plugin to change the number of entries for only this feed, since I didn’t want anyone visiting the site or fetching the feed to get all my entries accidentally. Unfortunately, config could not do what I wanted (plugins are just loaded too late, I believe), so I cheated: I made a copy of my blosxom.cgi, changed $num_entries, and used that copy to fetch the special feed. I also had to disable my moreentries plugin, although I can’t remember exactly what it was breaking.

Now, the importer doesn’t fetch your new feed directly. You have to fetch and save a copy (curl or wget are good for this), copy the file to your server, and edit the import script to point to this file. This is for safety, to make sure you only import what you want to import. My first attempt revealed a few deficiencies in the RSS feed and the importer.
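
Before pointing the importer at the saved file, it may be worth confirming that the feed really contains every post, since the $num_entries limit truncates it silently. Here is a minimal sketch in Python, assuming each post appears as one <item> element in the saved feed. The function names and the sample text are made up for illustration, and a regular expression is used instead of an XML parser on the assumption that a hand-built flavour may not be well-formed XML.

```python
import re

def count_items(feed_text):
    """Count the <item> elements in the text of a saved feed."""
    # \b stops the pattern from matching a longer tag name such as <items>.
    return len(re.findall(r"<item\b", feed_text, flags=re.IGNORECASE))

def count_items_in_file(path):
    """Same count, reading the feed from a file saved with curl or wget."""
    with open(path, encoding="utf-8") as fh:
        return count_items(fh.read())

# Invented two-post feed fragment, just to show the call.
sample = "<channel><item><title>A</title></item><item><title>B</title></item></channel>"
print(count_items(sample))
```

If the count is lower than the number of post files in the Blosxom data directory, the feed was probably fetched with the entry limit still in place.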

First of all, I used Markdown to author my posts in Blosxom, and intended to keep doing so with WordPress. However, the RSS feed contains the rendered HTML, which is what gets imported into WordPress. This imports your posts just fine, but I wanted to maintain my Markdown formatting in case of future edits. I ended up disabling Markdown long enough to fetch the feed with the original Markdown formatting intact.

The next hurdle was URIs. In an earlier post I discussed some of the steps I went through to ensure all of my old URIs would work in WordPress, but the very first hurdle was preserving the post slug. In WordPress parlance (and I believe, in publishing in general), the slug is a short name for an article. Specifically for WordPress, the slug is the ‘file name’ of the post URI. The original import-blosxom.php created a slug for each post based on the title, similar to WordPress’s default mechanism for creating slugs. However, in order to keep my old URIs working with a little mod_rewrite magic, I needed the slugs to match the original filenames used to store the Blosxom posts. I hacked this in as follows:

  1. I modified the rss20 theme file to include the original file name, by adding a line to the <item> section:

    <slug>$fn</slug>
    
  2. I re-fetched the feed to pick up the change.

  3. I edited import-blosxom.php to use the slug. I replaced this line:

    $post_name = sanitize_title($title);
    

    with this:

    $slug = '';
    preg_match('|<slug>(.*?)</slug>|is', $post, $slug);
    $post_name = $slug[1];
    

Now, after the import, my WordPress slugs matched my Blosxom post names, making support of old URIs much simpler (see The Permalink Problem for more). But I wasn’t done yet.
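
Since the importer takes whatever it finds between the slug tags, a saved feed can be checked for missing slugs before it is imported. The following Python sketch applies the same regular expression as the modified importer to each item. The sample feed and the function name are invented for illustration, and this is not part of import-blosxom.php.

```python
import re

ITEM_RE = re.compile(r"<item\b.*?</item>", re.IGNORECASE | re.DOTALL)
# Same pattern the modified importer uses to find the slug.
SLUG_RE = re.compile(r"<slug>(.*?)</slug>", re.IGNORECASE | re.DOTALL)

def slugs_in_feed(feed_text):
    """Return one entry per <item>: its slug text, or None if it has no slug."""
    slugs = []
    for item in ITEM_RE.findall(feed_text):
        match = SLUG_RE.search(item)
        slugs.append(match.group(1).strip() if match else None)
    return slugs

# Invented sample: the second item lacks a <slug> element.
sample = """
<item><title>My MacBook</title><slug>macbook</slug></item>
<item><title>No slug here</title></item>
"""
print(slugs_in_feed(sample))
```

Any None in the result marks a post that would be imported without a usable post name.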

Now, this is going to seem petty to some of you, and that’s fine. But it bothered me that the WordPress post numbers of my imported posts were backwards. The most recent imported post was post 1, and the oldest imported post was post 300-and-something. Even though you should never see the post number since I use fancy URIs, it bugged me. So, I fixed it. It may also be worth noting that through these iterations, I ended up futzing with my WordPress DB by hand to remove prior imports and reset the post numbering. Hopefully, if you’re following along at home on your own import, you’ll get this all right the first time.

The problem is, Blosxom renders posts in reverse chronological order, like every other blog. The import script reads the RSS file and imports the posts in the order in which they appear in the file (which is to say, reverse chronological). But I wanted my post numbers to be chronological. Yes, I’m a geek. Came to terms with it years ago. Anyway, remember earlier that I couldn’t get config to change the number of entries for the feed? I was, however, able to use it to install a custom Blosxom sort method, like so:

  1. Config has to run first, so I renamed the installed config plugin to 000config, the standard Blosxom hack for plugin load ordering.

  2. In my Blosxom content directory, I created config.rss20, the theme-specific config file for the rss20 theme:

    package config;
    
    sub sort {
      return sub {
        my($files_ref) = @_;
        return sort { $files_ref->{$a} <=> $files_ref->{$b} } keys %$files_ref;
      }
    };
    
    1;
    
  3. Another fetch-import cycle, and my posts were numbered chronologically.
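
To make the effect of the custom sort concrete, here is a small Python sketch of the two orderings. It assumes, as in stock Blosxom, that the hash passed to the sort routine maps each post file to its modification time. The file names and timestamps are hypothetical.

```python
def newest_first(files):
    """Reverse chronological order: most recently modified post first."""
    return sorted(files, key=lambda name: files[name], reverse=True)

def oldest_first(files):
    """Order produced by the custom sort above: oldest post first."""
    return sorted(files, key=lambda name: files[name])

# Hypothetical post files mapped to modification times (seconds since epoch).
files = {
    "Apple/macbook.txt": 1150000000,
    "Apple/OSX/howto-install-carbon-emacs.txt": 1100000000,
    "meta/hello-world.txt": 1000000000,
}
print(oldest_first(files)[0])
```

With the oldest post at the top of the feed, the importer inserts it first, so it receives the lowest post number.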

Almost there. I had my content (including comments), but the comment and posts-per-category counts were all 0. It seems WordPress stores these numbers in the database instead of calculating them on the fly, and the importer didn’t update them. A couple of quick SQL statements set everything right:

UPDATE wp_posts p SET comment_count = ( SELECT count( * )
FROM wp_comments c
WHERE c.comment_post_id = p.id );

UPDATE wp_categories c SET category_count = ( SELECT count( * )
FROM wp_post2cat p
WHERE p.category_id = c.cat_id );
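
To see what the first statement does without touching a live database, the same correlated subquery can be tried on toy data. This Python sketch uses an in-memory SQLite database rather than MySQL, and the cut-down two-column tables are an assumption, not the real WordPress schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wp_posts (ID INTEGER PRIMARY KEY, comment_count INTEGER DEFAULT 0);
    CREATE TABLE wp_comments (comment_ID INTEGER PRIMARY KEY, comment_post_ID INTEGER);
    INSERT INTO wp_posts (ID) VALUES (1), (2), (3);
    INSERT INTO wp_comments (comment_post_ID) VALUES (1), (1), (3);
""")

# Same idea as the first UPDATE above, written without a table alias on the
# updated table so that it also runs on SQLite.
conn.execute("""
    UPDATE wp_posts SET comment_count = (
        SELECT COUNT(*) FROM wp_comments c WHERE c.comment_post_ID = wp_posts.ID
    )
""")

# Map each post ID to its recomputed comment count.
counts = dict(conn.execute("SELECT ID, comment_count FROM wp_posts"))
print(counts)
```

Posts with no comments end up with a count of 0 rather than NULL, because COUNT(*) over an empty set is 0.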

That’s everything I have in my notes. Hopefully, there’s enough here to help others with a similar conversion. If you find other conversion issues or have questions about my process, please leave a comment below and I’ll try to help.

The Permalink Problem

The conversion from Blosxom to WordPress began (in my head, at least) over a year ago, when I began to consider switching from categories to tags as a way to reduce the “friction” of writing. Blosxom is good at many things, but its filesystem-based storage isn’t well suited for tagging. After exploring several possibilities, I decided I’d move away from Blosxom; eventually I settled on WordPress.

The biggest hurdle I faced was dealing with permalinks. Blosxom supports two styles of permalinks: date-based and category-based. I decided to go with category-style permalinks (e.g., /weblog/Apple/macbook.html) when I first started using Blosxom because the URIs are “hackable”; you can whack the end (filename) off the URI and you’ve got the URI for the category. This worked fine when the site was category-based, but fails with tags. For this reason, I’ve configured WordPress to use date-based URIs (e.g., /weblog/2006/01/01/macbook/); I also decided to drop the “.html” bit since it’s not necessary.

However, Cool URIs Don’t Change. There are plenty of old links to my site out around the Internets that use category-based permalinks, and I don’t want to break them. Even the internal links within existing posts on this site will continue to use the old URIs, at least until I can get around to cleaning them up. So I needed to make sure the old permalinks keep working.

WordPress supports category based permalinks, but with some caveats. If I configure the whole site to use category permalinks, then date-based URIs don’t work. If I configure the site to use date-based permalinks, category-based URIs don’t work. After searching for a plugin to help without success, I tried the WordPress forums. When that failed to turn up a solution, I poked around the code looking for a solution.

(A brief aside. If you want to explore the WordPress codebase, I recommend this excellent online cross-reference.)

I ended up creating a simple plugin that uses the “generate_rewrite_rules” action hook to add additional URI rewriting rules to the internal set used by WordPress to resolve each URI. I hope to make it available as a plugin someday when I have time to make it more generic; currently it’s hardcoded to solve my problem. Here’s the heart of the code, if you want to build your own version:

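function add_permalink_style($rewriteobj) {
    // Build rewrite rules for the old Blosxom-style /category/postname.html permalinks.
    $extra_rewrite = $rewriteobj->generate_rewrite_rules('/%category%/%postname%.html', EP_PERMALINK);
    $extra_rewrite = apply_filters('post_rewrite_rules', $extra_rewrite);

    // Add them to the rule set WordPress has already generated.
    $rewriteobj->rules = array_merge($rewriteobj->rules, $extra_rewrite);
}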

add_action('generate_rewrite_rules', 'add_permalink_style');

To help me test everything, I tossed together a quick and dirty test suite, a simple list of links to test from my browser. The code above allows them all to pass.
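
A scripted version of such a link list might look like the Python sketch below. The paths are the examples used in this post, the host is a placeholder, and the function names are made up. Note that urlopen follows redirects, so a redirected URI still counts as passing here, and connection failures are not handled.

```python
import urllib.error
import urllib.request

def http_status(url):
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code

def failing_links(base, paths, fetch_status=http_status):
    """Return the paths under base that do not answer with status 200."""
    return [path for path in paths if fetch_status(base + path) != 200]

# Paths taken from the examples in this post.
PATHS = [
    "/weblog/Apple/macbook.html",
    "/weblog/2006/01/01/macbook/",
    "/weblog/Apple/OSX/",
]
# Example call (placeholder host): failing_links("http://example.com", PATHS)
```

The fetch_status parameter exists so the check can be exercised with a stand-in function instead of real network requests.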

Because my old permalinks all included .html at the end, I can use the rewrite rule above to strip it out. I originally was doing that with a mod_rewrite rule in my .htaccess file, but that caused problems for old permalinks to my category archives. Since the mod_rewrite rule was stripping the .html suffix before WordPress could see the URI, category URIs (e.g., /weblog/Apple/OSX/) and category-style post permalinks (e.g., /weblog/Apple/OSX/howto-install-carbon-emacs/) matched the same pattern, and WordPress thought the first example was for a post named “OSX” in the Apple category. I believe this is the same thing that’s causing my tag URIs (e.g., /weblog/tag/wordpress/) to fail at the moment; the only workaround I’ve found requires a mod_rewrite RewriteRule, and it only works with a browser redirect. I’m still working on this problem, as I really don’t want to use redirects. Also, for reasons I haven’t quite figured out, the above makes category permalinks work (and without a category URI prefix, see below), even though the rule expects a post name (slug) with .html appended. But I’m not going to argue with success.
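
The ambiguity is easy to reproduce outside WordPress. In the Python sketch below, the two patterns are hypothetical stand-ins for "category segments followed by a post slug", not the rules WordPress actually generates. Once the suffix is stripped, a category URI fits the post pattern; with the suffix kept, it does not.

```python
import re

# Hypothetical pattern for a category-style post permalink after ".html" has
# already been stripped: one or more category segments, then a post slug.
STRIPPED = re.compile(r"^/weblog/(?P<category>.+)/(?P<slug>[^/]+)/?$")
# The same permalink shape with the ".html" suffix still attached.
WITH_SUFFIX = re.compile(r"^/weblog/(?P<category>.+)/(?P<slug>[^/]+)\.html$")

def looks_like_post(pattern, path):
    """True if path matches the given post-permalink pattern."""
    return pattern.match(path) is not None

category_uri = "/weblog/Apple/OSX/"
post_uri = "/weblog/Apple/OSX/howto-install-carbon-emacs.html"

print(looks_like_post(STRIPPED, category_uri))     # category mistaken for a post
print(looks_like_post(WITH_SUFFIX, category_uri))  # the suffix keeps them apart
print(looks_like_post(WITH_SUFFIX, post_uri))
```

With the stripped pattern, the category URI parses as category "Apple" and slug "OSX", which is the misreading described above.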

Another issue I ran into was WordPress always wanting a prefix in front of category archive links. For example, my Apple category’s archive link (by default) was /weblog/category/Apple/. Even though my plugin’s extra rewrite rule seems to make URIs like /weblog/Apple/ work, I wanted the category links on my Archive page to omit the extra prefix, for consistency. WordPress allows you to change, but not omit, this prefix. I came up with a cheap hack for that problem: since I could already handle the URIs without the prefix, I only needed to change the URIs generated on the archive page via WordPress’ wp_list_cats() function. I was able to do this in my plugin by hooking into the list_cats filter:

function fix_category_links($content) {
    $content = str_replace('/category/', '/', $content);
    return $content;
}

add_filter('list_cats', 'fix_category_links');

It’s a hack, but it’s a hack that works.

At this time, I believe all the old permalinks from my Blosxom blog (both posts and categories) should work, as well as the new date-based permalinks. A key part of all of this was importing my existing Blosxom content while preserving the filename (the slug, in WordPress parlance); I’ll cover this and the rest of the import process in a subsequent post. If you should find any links that don’t work, especially from external sources, please let me know. Now if I can just fix those tag URIs….