Archive for the 'WebDev' Category

Note: I've reorganized this site to use tags; the category archive remains to support old links. Only posts prior to April, 2006 are categorized. Tag Archive »

No Follow

Unless you’ve been living under a rock, you’ve probably seen the big Google announcement, entitled “Preventing comment spam”. By adding a rel='nofollow' attribute to a link, you instruct Google (and Yahoo, and MSN Search) not to consider the link in things like PageRank calculations. Blog software producers have jumped on the bandwagon, promising to use this attribute in all links within comments, referer lists, etc.: anywhere a website visitor can create a link. The idea is that this removes the incentive for comment spam: nofollow = No PageRank.
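For blog software that wants to apply this to visitor-supplied links, the transformation itself is small. Here is a minimal Python sketch, not taken from any particular blog package: `add_nofollow` is a hypothetical helper that adds nofollow to every anchor tag in a fragment of comment HTML. It uses regular expressions for brevity, so it assumes reasonably well-formed markup; real software would more likely use an HTML parser.

```python
import re

# Matches an opening <a ...> tag and captures its attribute text.
ANCHOR_RE = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
# Matches an existing quoted rel attribute inside that attribute text.
REL_RE = re.compile(r"""\brel\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)


def add_nofollow(html):
    """Return html with rel="nofollow" present on every opening <a> tag."""

    def fix(match):
        attrs = match.group(1)
        rel = REL_RE.search(attrs)
        if rel is None:
            # No rel attribute yet: append one.
            return "<a" + attrs + ' rel="nofollow">'
        values = rel.group(2).split()
        if "nofollow" in (value.lower() for value in values):
            # Already marked: leave the tag untouched.
            return match.group(0)
        values.append("nofollow")
        new_rel = 'rel="' + " ".join(values) + '"'
        return "<a" + attrs[:rel.start()] + new_rel + attrs[rel.end():] + ">"

    return ANCHOR_RE.sub(fix, html)


if __name__ == "__main__":
    print(add_nofollow('Visit <a href="http://example.com/">my site</a>!'))
```

Existing rel values are kept and nofollow is appended to them, since rel is a space-separated list.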

Will it work? Ben Hammersley provides a lucid explanation of the economics of spam (it’s free!) and concludes that nofollow may increase the amount of spam. I tend to agree that spammers will pay it little mind; I think the real value (and Google’s real purpose) is that it may help improve search engine results. Remember when finding things via Google was easy, and you never saw links to commerce sites?

Robert Scoble notes this has another use - it allows a blogger to link directly to someone without giving Google juice. How often have you seen someone complain in a blog post about a spammer, without linking directly, for just this reason? Phil Ringnalda points out this can be used to selectively control the PageRank you bestow. Phil has added special style rules to his user style sheet (userContent.css in Firefox) to display all nofollow links in flashing lime green, which ensures he knows when a page he is viewing is fiddling with PageRank. I liked the idea so much I copied his style rule into my user stylesheet. So far it’s been enlightening to see who is using it, and for what links.

For now, I’m taking no action on this site - I already treat any comment containing a raw HTML tag as spam.

Fighting Referer Spam with deferer

I’ve been getting swamped by referer spam lately. Most of it is for domains that appear to have had hosting suspended. I had already seen a correlation between this referer spam and comment spam; what I didn’t know (but should have guessed) is that lots of folks are seeing this. Tim Bray wrote about the problem today, pointing to more info from John Sinteur and Ann Elisabeth. Apparently all of these referer URIs resolve to a single webhost, with an IP of 161.58.59.8.

Ann’s post is one of many on her blog about the subject; she is actively pursuing this and trying to get Verio to pull the miscreant’s hosting. I suggest reading everything on her homepage for lots of good info.

John’s post on the WordPress support blog includes PHP code that sends a 301 Moved Permanently redirection header to any request with a referer URI that resolves to the IP above. Where does he redirect them? Back to the referer URI, of course.

Now, that’s an idea I like. I liked it so much I wrote a Perl version as a Blosxom plugin. It’s called deferer, and is available here. Right now, it’s a one-trick pony, but (when time permits) I intend to expand it to a more comprehensive referer spam solution. I don’t know how effective redirecting these requests is: the spider scripts sending the requests may not follow redirects. Even so, deferer reduces server load (it ends the blosxom invocation early) and saves bandwidth.

Update: Deferer has been updated to version 0+2i, to fix a bug that caused a 500 Server Error if the referer hostname could not be resolved.
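For readers who don't use PHP or Perl, the redirect-to-referer idea can be sketched in a few lines of Python. This is an illustration of the technique only, not the code from John's post or from deferer: `referer_redirect`, its `resolve` parameter, and the return format are all made up for the example. It also treats a referer hostname that cannot be resolved as harmless, so a DNS failure never turns into a server error.

```python
import socket
from urllib.parse import urlsplit

SPAM_HOST_IP = "161.58.59.8"  # the address quoted in the post


def referer_redirect(referer, resolve=socket.gethostbyname,
                     bad_ips=frozenset({SPAM_HOST_IP})):
    """Return (status, headers) sending the request back to its referer,
    or None if the request should be served normally."""
    if not referer:
        return None
    try:
        host = urlsplit(referer).hostname
    except ValueError:
        # Malformed referer URI: nothing to look up.
        return None
    if not host:
        return None
    try:
        ip = resolve(host)
    except OSError:
        # Hostname could not be resolved (socket.gaierror is an OSError):
        # serve the page normally instead of failing.
        return None
    if ip in bad_ips:
        return "301 Moved Permanently", [("Location", referer)]
    return None


if __name__ == "__main__":
    # A stand-in resolver so the demo needs no network access.
    print(referer_redirect("http://spam.example/", resolve=lambda h: SPAM_HOST_IP))
```

Passing the resolver in as a parameter is only there to make the function easy to try without real DNS lookups; each lookup does add latency to a request, so caching results would be worth considering in real use.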

Site Linking Policy

All of the content on jclark.org is Creative Commons licensed - you may use it as you like as long as you give attribution and share your changes. This has been my policy almost since the beginning of the site.

Linking to the site is also encouraged; I make every effort to ensure my permalinks always work. However, linking directly to images is prohibited. If you’d like to use one of my images on your site, please make a copy and host it yourself. Someone has begun linking directly to one of my badges, which I consider a gross abuse of my bandwidth.

I have searched all over the offending website, and cannot find an email address for the site owner. Since I’ve never posted a linking policy, I would like to contact him and allow him to change his site before I take steps to prevent this. Therefore I’m posting this now, to allow him time to correct the problem. By this weekend I will be actively blocking this and possibly serving alternate content instead.

Subtraction

Jeremy at Alpha-Geek.com pointed today to Subtraction.com as a website design that stands out from the crowd (in a good way). He writes:

For some reason, I am really taken aback by the site. It’s so simplistic in design but so rich in all of the nuances.

Boy, do I agree. As I’ve noted before, I love clean, simple layouts (and do not hold up my own as a shining example). Subtraction is almost breathtakingly clean and stark, without feeling bare. There are so many nice things about this design.

I’ve been wanting to do a redesign for some time. As I first looked at Subtraction, I had a number of “I wish I’d thought of that” moments. With a design as distinctive as this, any thoughts of borrowing design elements are quickly suppressed by respect for the original. Take the links in the ‘Colophon’ section of the right column. They are text links with little pictures, so similar and yet so different to the ‘badges’ seen here and on so many other websites.

A couple of the ideas I saw were not completely new to me. I’ve actually been kicking around the idea of a black and white layout for a little while. Of course, if I do it now, it’ll feel like copying. The little image blocks at the beginning of many of the stories are another idea I’ve kicked around, but this implementation is much better than anything I’ve considered. The categorization scheme is also something I’ve been thinking a lot about - Subtraction uses tags, like del.icio.us. I’ve already made plans to switch this site to a tag-based system as soon as I figure out how I want to handle the permalinks.

Although the tag-category idea isn’t surprising, the four-column layout for displaying all of the categories is. As is the linkblog at the bottom of the page, which is more than just a link blog. Each item has a star-rating, and many of the comments are more extensive than the average linkblog. The layout is (like the rest of the design) understated and elegant.

Other details are subtle, but accumulate to great effect. The use of simple fonts, the font sizing, and the use of several shades of gray in addition to black and white form a whole greater than the sum of the parts. The use of horizontal lines also strikes me as stunning in a way I can’t quite put my finger on.

My only dislike is the use of done-to-death bright orange as the highlight color; in such a breakout design, it’s disappointing to see such a cliche. But this is a minor (and quickly forgettable) flaw, and every true masterpiece must have a single flaw.

January Blogging Challenge

Uncle Roger has again laid down the gauntlet, establishing the January challenge for posts about resolutions:

…come up with an evaluation of last year’s goals and a new set of goals for this year.

This should not be a facile collection of cliches like “I will lose weight” or “I’m going to save some money” but carefully thought out, significant goals for your life. Don’t just list indefinite, non-specific platitudes, but specific, achievable goals. Include a plan for accomplishing each goal with concrete milestones and dates.

And:

…this challenge includes a review of your goals from last year (if any).

I’m pleased to say that I’m half finished with the challenge, as I’ve already reviewed last year’s resolutions. I’ll be posting this year’s by the end of the week(end).

These challenges have been a lot of fun… I encourage you to join in this one. Be sure to link to Roger’s post, and comment over there (and here if you want). If you’ve already posted your resolutions, go link them to the challenge.

Googlebot, Update These

This post is for Googlebot. Go follow these links, and index them:

Thanks pal.

For the rest of you, who are probably wondering if I’ve hit my head (not that I remember), I’ll explain. While checking my referers briefly (whole other post on that topic forthcoming), I noticed what looked like a search engine hit for a topic I don’t normally post about, of an adult, or more likely a teenage male, nature. (I’m not a prude, but listing the search terms would defeat my purpose). A quick check showed that the site was a non-English search engine powered by Google.

It seems that Googlebot stopped by the day I was hit by over 2000 comment spams. Although I took the entire comment system offline to remove that crap as soon as I saw it, Googlebot must have indexed a few of the pages. Those pages linked above are the pages it indexed that day and has apparently not reindexed since. I like website traffic as much as the next blogger, but I’m really not in the market for search engine hits for animated attacks on women. I certainly don’t want to be the #6 Google hit for that search (which I am, for the moment). The sooner Googlebot indexes the clean versions of the pages above, the better.

Also of interest, Googlebot found those pages via www.jclark.org, instead of jclark.org. They are the same, but I omit the www as it is unnecessary. Other people who link to me occasionally use the www, which is why it is indexed both ways. Of course, it shouldn’t be indexed twice, so I’ll be adding a mod_rewrite rule to my .htaccess file to permanently redirect all www. URIs to their sub-domain-free equivalents. Just to be safe, however, I think I’ll wait until Googlebot reindexes the above links.
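The logic behind such a redirect is simple enough to show outside Apache. The following Python sketch is only an illustration of the mapping, not the mod_rewrite rule itself; `canonical_redirect` is a hypothetical name, and the example URL path is made up.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_redirect(url):
    """Return the www-free URL to redirect to (with a 301), or None if the
    URL is already in its canonical form."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not host.startswith("www."):
        return None
    netloc = host[len("www."):]
    if parts.port is not None:
        # Keep an explicit port if the original URL had one.
        netloc += ":" + str(parts.port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(canonical_redirect("http://www.jclark.org/weblog/"))
```

The path and query string are carried over unchanged, so every old www. link lands on the matching page rather than on the home page.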

Blapp

Inspired as I was by Michael McCracken’s I-Search plugin for Cocoa apps, I decided it was time to try another of Michael’s projects, Blapp. Blapp is a Cocoa weblog editor for Blosxom (and Blosxom-like) blogs. I’ve always edited my posts with wikieditish, a browser-based Blosxom editing solution. This has the advantage of allowing me to post from anywhere I have web access.

On the other hand, I write most of my posts on my Mac, and I can always use wikieditish when I’m not at my Mac. So far, I’m pretty impressed. Blapp allows me to preview my post as I type, even using my story template and css files from the site so I can see what the article will really look like. It also supports the use of an external app as a filter, which allows me to see my preview with full Markdown rendering.

Setup was a bit of a bear - I had to edit my story template a bit (Blapp doesn’t seem to understand interpolate_fancy-style variables), and I had to merge my css files (normally, one @imports the other). Figuring out the rsync setup was a bit of a pain as well, but liberal command-line testing of the rsync command with the -n (make no changes) switch helped.

More good news: Blapp is open source, so I can take a shot at fixing some of my complaints myself (when I have time, and when I finish reading Cocoa Programming for Mac OS X). If you blog with Blosxom and use OS X, I recommend giving it a try.

And yes, this is my first Blapp-edited post.

Taking out the Trash

Regular readers are aware that I disabled comments on this site about a week ago, after receiving over 2000 spam comments in just a few hours. Alert readers may have also noticed that all comments have been missing since; all postings have shown 0 comments regardless of prior comments. This is because when I disabled comments, I not only wanted to prevent further postings but also to prevent the display of the 2000+ morsels of nastiness. The quickest way to do both given my Blosxom setup was to simply remove the writeback plugin.

I’ve made adjustments to my .htaccess file that will prevent the same type of posting in the future, but I expect the spammers to adapt. The delay in re-enabling comments was due to the need to clean out the spam. Because I use Blosxom and writeback, my comments are stored in the filesystem, in a dir tree that matches the site layout, one file per blog post (i.e., all comments for a given post are in one file). With over 2000 bad comments spread across 224 different files, I wasn’t about to clean things up by hand. Instead, I wrote a Perl script to help me do it.

Even though it’s used a lot in Blosxom and various plugins, I’ve never had a firm grasp of Perl’s File::Find module, so I decided to use File::Find::Rule instead. (For a nice explanation of this module, see File::Find::Rule in the 2002 Perl Advent Calendar.) The only problem is that the module (and several prerequisites) were not installed on my webserver. Having shell access, I was able to install local copies of the needed modules. Being very busy over the holidays, I only got around to this today.

The script is called scrub, and is available under the GNU General Public License. You can download scrub. Please note that scrub is written as a command line utility - you will need shell access on your webserver to use scrub. If there is demand (and if no one else does it first), I may develop a CGI version of scrub to run from a web server.

So how does scrub work? Why, Voodoo magic, of course. In fact, the darkest Voodoo of all… regular expressions. Supply a regex, and scrub can list all comment files containing the regex. It can also display the matching comments, and a count of files matched. Most importantly, it can remove offending comments when run with the -scrub option.

scrub is designed to work with comment files created by the writeback plugin. These files contain each comment, along with the name and URL of the poster as supplied on the posting form. I’ve modified my copy of writeback to also log the IP address. Any information in the comment file can be matched by the regex… so if you are logging IPs as I am, you can quickly find (and eliminate) all comments from a given IP. scrub overrides Perl’s $/ magic variable, which is the input record separator. By setting $/ to "-----\n" (the comment separator in writeback files), scrub can process each comment as a single unit.
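The record-at-a-time approach is not specific to Perl. As a rough illustration, here is the same idea in Python; this is a sketch of the technique, not a port of scrub, and the function names and the sample field layout are invented for the example. Only the "-----\n" separator comes from the description above.

```python
import re

SEPARATOR = "-----\n"  # comment separator in writeback files, per the text above


def split_comments(text):
    """Split a comment file's contents into records, each keeping its
    trailing separator (much like reading with $/ set to the separator)."""
    chunks = text.split(SEPARATOR)
    records = [chunk + SEPARATOR for chunk in chunks[:-1]]
    if chunks[-1]:
        # Trailing text with no closing separator is kept as a final record.
        records.append(chunks[-1])
    return records


def scrub_text(text, pattern):
    """Drop every comment matching pattern; return (kept_text, removed_count)."""
    regex = re.compile(pattern)
    records = split_comments(text)
    kept = [record for record in records if not regex.search(record)]
    return "".join(kept), len(records) - len(kept)


if __name__ == "__main__":
    # Made-up sample data; real writeback files may use different fields.
    sample = (
        "name: Al\ncomment: Nice post\n-----\n"
        "name: Bo\nurl: http://spam.com/\ncomment: Buy now\n-----\n"
    )
    cleaned, removed = scrub_text(sample, r"spam\.com")
    print(removed, "comment(s) removed")
    print(cleaned, end="")
```

Because the regex is applied to the whole record, a match anywhere in a comment (name, URL, body, or a logged IP) removes that comment and nothing else.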

Here are a few examples of scrub usage:

  1. List all files containing 'spam.com', display total:

    scrub -regex 'spam.com' -list -count

  2. Remove all comments containing 'spam.com', show progress via filenames:

    scrub -regex 'spam.com' -list -scrub

  3. Show all files containing raw HTML hyperlinks, and the actual comments:

    scrub -regex '<a href' -list -show
    

Example 3 above brings to light a deficiency with scrub - the regexes are always case sensitive. If I post a revision, this will be addressed.
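To show what difference case sensitivity makes for a pattern like the one in example 3, here is a small Python illustration. It is not part of scrub; the `matches` helper and its `ignore_case` parameter are hypothetical.

```python
import re


def matches(comment, pattern, ignore_case=False):
    """Report whether pattern occurs in comment, optionally ignoring case."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, comment, flags) is not None


if __name__ == "__main__":
    spam = 'Cheap pills <A HREF="http://spam.example/">here</A>'
    print(matches(spam, "<a href"))                    # case-sensitive search
    print(matches(spam, "<a href", ignore_case=True))  # case-insensitive search
```

A spammer who writes the tag in upper case slips past the case-sensitive search but not the case-insensitive one.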

If you find this at all useful, please leave me a comment… they are enabled once again.

An Announcement

To anyone who has viewed my site in the past 24 hours or so: if you have seen any comments on this site which you found offensive, please accept my apologies. I have once again been hit by a determined comment spammer, an order of magnitude worse than anything I have seen before. Over 2000 spams have been posted. I have completely disabled the comment system.

Take a Hint, AllResearch - Go Away

It was a couple of months ago when I had to ban a bot for the first time. At the time, I noticed one IP address (My Most Frequent Visitor) outshining all others in my web stats. A little research showed MMFV to be an outfit called AllResearch.com, who appears to specialize in hoovering down other sites in order to provide such services as trademark tracking, webclipping, and “law enforcement” services.

As bad as that sounds, the only reason I even took note was the bandwidth consumption. Every 60 minutes, they hit my RSS feed, and then pulled down every item listed in the feed. What a horrible perversion of intent. At the time, I banned the IP address, thusly (in .htaccess):

    RewriteCond %{REMOTE_ADDR} "^38.144.36.16$"
    RewriteRule .* - [F,L]

I then watched, amused, for several days as the 403 errors stacked up, once an hour, from their IP address. Worst. Bot. Ever.

Well, I got around to checking my stats again today, and what do you know? My Most Frequent Visitor just can’t take a hint. He’s back, using IP address 38.144.36.19. And I didn’t find him due to vigilance; he’s just stupid. The exact same usage pattern. While expanding my ban, I even corrected the unescaped dots in my original version:

    RewriteCond %{REMOTE_ADDR} "^38\.144\.36\."
    RewriteRule .* - [F,L]
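Why bother escaping the dots? In a regular expression an unescaped dot matches any character, so a pattern with bare dots can ban addresses it was never meant to. The Python snippet below demonstrates the difference; Python's regex engine is not the one Apache uses, but the dot behaves the same way in both. `is_banned` is a hypothetical helper, and 38.144.360.1 is a deliberately impossible address chosen only to show the over-matching.

```python
import re

UNESCAPED = re.compile(r"^38.144.36.")   # each dot matches any character
ESCAPED = re.compile(r"^38\.144\.36\.")  # each dot matches a literal dot only


def is_banned(addr, pattern=ESCAPED):
    """Report whether addr starts with the banned prefix under pattern."""
    return pattern.search(addr) is not None


if __name__ == "__main__":
    for addr in ("38.144.36.16", "38.144.36.19", "38.144.360.1"):
        print(addr, is_banned(addr, UNESCAPED), is_banned(addr, ESCAPED))
```

Both patterns catch the two real offenders; only the escaped one leaves the third string alone.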

I’m not the only one seeing this. If you run a site, take a minute and check your stats or your logs for addresses beginning with 38.144.36.. If you see abuse like I’ve seen, take a minute and ban them. Maybe they will eventually take a hint.