{"id":12,"date":"2003-08-17T19:50:00","date_gmt":"2003-08-17T19:50:00","guid":{"rendered":"http:\/\/jclark.org\/weblog\/WebDev\/Blosxom\/utf8madness.html"},"modified":"-0001-11-30T00:00:00","modified_gmt":"-0001-11-30T04:00:00","slug":"utf8madness","status":"publish","type":"post","link":"https:\/\/jclark.org\/weblog\/2003\/08\/17\/utf8madness\/","title":{"rendered":"UTF-8 Madness"},"content":{"rendered":"<p>One of my stated goals with this weblog is to better learn web standards, such as <span class=\"caps\">XHTML<\/span> and <span class=\"caps\">CSS.  <\/span>So each time I make a change to the design, or post a new story, I use the validation badges (see the right hand column) to check that everything is still copacetic.  Yesterday, it was not.<\/p>\n<p>The error message from the <span class=\"caps\">XHTML<\/span> validator, in all its yellow-highlighted glory, was:<\/p>\n<blockquote><p><code>Sorry, I am unable to validate this document because on lines         52, 56-58, 62-66, 68-70, 73, 76-81, 83-84, 86         it contained one or more bytes that I cannot interpret as         utf-8 (in other words, the bytes         found are not valid values in the specified Character Encoding).         Please check both the content of the file and the character         encoding indication.<\/code><\/p><\/blockquote>\n<p>Now, this is clear (invalid characters), and yet not so helpful (which characters?  Where in the line?).  I didn&#8217;t see anything odd in the view source, or when opening the original file in wikieditish.  And of course, I&#8217;m using textile, so that could be a culprit as well.<\/p>\n<p>After some experimentation (by which I mean to say, several hours of searching Google, threatening the Validator, and banging my head against the wall), I found that by saving the <span class=\"caps\">XHTML<\/span> from my browser and running it through <code>less<\/code>, I could see the bad characters.  They were all 0xA0, which is 160 decimal, a.k.a. &amp;nbsp;.  
Of course, &amp;nbsp; is valid in <span class=\"caps\">XHTML,<\/span> as is &amp;#160;.  But utf-8 encodes every value above 0x7F as a multi-byte sequence, so a lone 0xA0 byte is not the utf-8 encoding for nbsp (that would be the two bytes 0xC2 0xA0).<\/p>\n<p>The thing is, I&#8217;m not sure how they got there.  They seemed to mostly appear inside <code>&lt;code&gt;<\/code> blocks, in place of the original indent spacing.  So I suspect it was a combination of textile2 trying to format the code, and wikieditish saving the document.  I&#8217;m going to have to do some experimenting to make sure I understand how they interact.<\/p>\n<p>I tried to fix the problem by adding some filtering to wikieditish, in the form of:<\/p>\n<pre>\n$_ = $body;\ns\/\\xA0\/&amp;#160;\/g;  #non-breaking space (\/g: replace every occurrence)\n#snip!  more translations in here,\n#  like emdash, soft hyphen, etc.\n$body = $_;\n<\/pre>\n<p>I then re-edited the file, but this didn&#8217;t seem to fix it either.  I eventually downloaded the story to my local system and cleaned it up with a perl one-liner (<code>perl -pi -e &#039;s\/\\xA0\/\/g;&#039; filename.txt<\/code>), choosing to just eliminate the nbsp&#8217;s entirely for now, until I better understand them.<\/p>\n<p>And the best part of all?  My host is running perl 5.8.0, with PerlIO enabled, which, if I read <cite>perluniintro<\/cite> correctly, is supposed to automagically take care of all that utf-8 madness for me, so I don&#8217;t have to.<\/p>","protected":false},"excerpt":{"rendered":"<p>One of my stated goals with this weblog is to better learn web standards, such as XHTML and CSS. So each time I make a change to the design, or post a new story, I use the validation badges (see the right hand column) to check that everything is still copacetic. Yesterday, it was not. 
[&hellip;]<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[],"class_list":["post-12","post","type-post","status-publish","format-standard","hentry","category-blosxom"],"_links":{"self":[{"href":"https:\/\/jclark.org\/weblog\/wp-json\/wp\/v2\/posts\/12","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jclark.org\/weblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jclark.org\/weblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jclark.org\/weblog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/jclark.org\/weblog\/wp-json\/wp\/v2\/comments?post=12"}],"version-history":[{"count":0,"href":"https:\/\/jclark.org\/weblog\/wp-json\/wp\/v2\/posts\/12\/revisions"}],"wp:attachment":[{"href":"https:\/\/jclark.org\/weblog\/wp-json\/wp\/v2\/media?parent=12"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jclark.org\/weblog\/wp-json\/wp\/v2\/categories?post=12"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jclark.org\/weblog\/wp-json\/wp\/v2\/tags?post=12"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}