UTF-8 Madness

One of my stated goals with this weblog is to better learn web standards, such as XHTML and CSS. So each time I make a change to the design, or post a new story, I use the validation badges (see the right-hand column) to check that everything is still copacetic. Yesterday, it was not.

The error message from the XHTML validator, in all its yellow-highlighted glory, was:

Sorry, I am unable to validate this document because on lines 52, 56-58, 62-66, 68-70, 73, 76-81, 83-84, 86 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.

Now, this is clear (invalid characters), and yet not so helpful (which characters? Where in the line?). I didn’t see anything unusual when viewing the source, or when opening the original file in wikieditish. And of course, I’m using textile, so that could be a culprit as well.

After some experimentation (by which I mean several hours of searching Google, threatening the Validator, and banging my head against the wall), I found that by saving the XHTML from my browser and running it through less, I could see the bad characters. They were all 0xA0, which is 160 decimal, a.k.a. &nbsp;. Of course, &nbsp; is valid in XHTML, as is &#160;. But utf-8 only encodes values up to 0x7F as single bytes; the non-breaking space, U+00A0, is encoded as the two bytes 0xC2 0xA0, so a lone 0xA0 byte is not the utf-8 encoding for nbsp.
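Since the validator only names line numbers, a small script can narrow things down to the exact byte. The following is a minimal sketch, not what I actually ran at the time; the helper name find_invalid_bytes is made up for illustration. It leans on the documented behaviour of Encode's FB_QUIET check, which stops decoding at the first malformed byte and leaves the unprocessed remainder in the input variable.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns one [line, column, byte] triple for every
# byte that cannot be decoded as UTF-8. Columns count bytes, starting at 1.
sub find_invalid_bytes {
    my ($bytes) = @_;
    my @hits;
    my $line_no = 0;
    for my $line (split /\n/, $bytes) {
        $line_no++;
        my $rest   = $line;   # scratch copy; decode() below consumes it
        my $offset = 0;       # bytes of this line already examined
        while (length $rest) {
            my $before = length $rest;
            # With FB_QUIET, decode() stops at the first malformed byte and
            # leaves the unprocessed remainder in $rest.
            Encode::decode('UTF-8', $rest, Encode::FB_QUIET);
            last unless length $rest;            # the remainder was clean
            $offset += $before - length $rest;   # skip what decoded fine
            push @hits, [ $line_no, $offset + 1, ord substr($rest, 0, 1) ];
            substr($rest, 0, 1, '');             # step over the bad byte
            $offset++;
        }
    }
    return @hits;
}

# Usage: a stray 0xA0 as the fourth byte of the second line.
for my $hit (find_invalid_bytes("fine\nfoo\xA0bar\n")) {
    printf "line %d, column %d: 0x%02X\n", @$hit;
}
```

For the sample string this reports line 2, column 4, byte 0xA0. To check a saved page, the file would need to be read as raw bytes (for example with binmode) before being handed to the helper.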

The thing is, I’m not sure how they got there. They seemed to appear mostly inside code blocks, in place of the original indent spacing. So I suspect it was a combination of textile2 trying to format the code, and wikieditish saving the document. I'm going to have to do some experimenting to make sure I understand how they interact.

I tried to fix the problem by adding some filtering to wikieditish, in the form of:

$_ = $body;
s/\xA0/&#160;/;  #non-breaking space
#snip!  more translations in here,
#  like emdash, soft hyphen, etc.
$body = $_;

I then re-edited the file, but this didn’t seem to fix it either. I eventually downloaded the story to my local system and cleaned it up with a perl one-liner (perl -pi -e 's/\xA0//g;' filename.txt), choosing to just eliminate the nbsp’s entirely for now, until I better understand them.
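Two things are worth keeping in mind with substitutions like these. A substitution without the /g modifier replaces only the first match, which may be one reason a filter appears to do nothing. And a blunt byte-level replacement of 0xA0 can damage text that is already valid utf-8, because 0xA0 also occurs as the trailing byte of legitimate sequences (à, for instance, is 0xC3 0xA0). Here is a rough sketch of a more careful filter; the name escape_stray_nbsp is invented for this example, and it assumes the input is a raw byte string. It steps over well-formed utf-8 sequences and rewrites only the 0xA0 bytes left standing on their own.

```perl
use strict;
use warnings;

# One well-formed UTF-8 sequence, written out as byte ranges.
my $UTF8_CHAR = qr/
      [\x00-\x7F]
    | [\xC2-\xDF] [\x80-\xBF]
    | \xE0 [\xA0-\xBF] [\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF] [\x80-\xBF]{2}
    | \xED [\x80-\x9F] [\x80-\xBF]
    | \xF0 [\x90-\xBF] [\x80-\xBF]{2}
    | [\xF1-\xF3] [\x80-\xBF]{3}
    | \xF4 [\x80-\x8F] [\x80-\xBF]{2}
/x;

# Hypothetical helper: replace stray 0xA0 bytes with the &#160; character
# reference, leaving valid multi-byte sequences untouched.
sub escape_stray_nbsp {
    my ($bytes) = @_;
    # Either consume a run of valid sequences (kept as-is), or match a
    # lone 0xA0 byte (replaced). /g repeats this across the whole string.
    $bytes =~ s/($UTF8_CHAR+)|\xA0/defined $1 ? $1 : '&#160;'/ge;
    return $bytes;
}

# Usage: the 0xA0 that belongs to "à" survives; the stray one is escaped.
my $cleaned = escape_stray_nbsp("voil\xC3\xA0\xA0indent");
```

With that input, $cleaned still contains the bytes 0xC3 0xA0 for à, followed by &#160; and then "indent". Other invalid bytes are left alone; this sketch only targets 0xA0.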

And the best part of all? My host is running perl 5.8.0 with PerlIO enabled, which, if I read perluniintro correctly, is supposed to automagically take care of all that utf-8 madness for me, so I don’t have to.
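I can't say what a given host's configuration does automatically, but asking for an encoding layer explicitly takes the guesswork out. The sketch below is only an illustration under that assumption: the helper names bytes_written_as_utf8 and hex_dump are made up, and an in-memory filehandle stands in for a real file. It shows that a non-breaking space pushed through an :encoding(UTF-8) output layer comes out as two bytes rather than a bare 0xA0.

```perl
use strict;
use warnings;

# Hypothetical helper: push text through an explicit UTF-8 output layer and
# return the raw bytes that would have been written to the file.
sub bytes_written_as_utf8 {
    my ($text) = @_;
    my $buffer = '';
    open my $out, '>:encoding(UTF-8)', \$buffer
        or die "cannot open in-memory handle: $!";
    print {$out} $text;
    close $out or die "cannot close in-memory handle: $!";
    return $buffer;
}

# Hypothetical helper: format a byte string as space-separated hex.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# Usage: character 0xA0 (nbsp) goes in, its two-byte encoding comes out.
print hex_dump(bytes_written_as_utf8("\xA0")), "\n";   # C2 A0
```

The same layer syntax works on ordinary files, e.g. opening with '<:encoding(UTF-8)' for reading, so the encoding is stated in the code instead of being left to defaults.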

Posted August 17th, 2003 at 7:50 pm

jclark.org/weblog
