UTF-8 isn't the default for HTML (was: xkcd: LTR)

Leif Halvard Silli Wed, 28 Nov 2012 09:56:11 -0800

Philippe Verdy, Wed, 28 Nov 2012 11:02:45 +0100:
> In this case, Firefox and IE should not even be able to render
> *any* XHTML page because it violates the HTML5 standard.


(1) The page in question (http://www.xn--elqus623b.net/XKCD/1137.html)
is, from a source code point of view, a pure XHTML page, and it contains
no HTML-compatible method for declaring the encoding. That page
therefore does indeed violate the HTML5 standard, with the result that
browsers are permitted to fall back to their built-in default encodings.
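
To make "HTML-compatible method" concrete, here is a minimal Python
sketch of the places an HTML parser can learn the encoding from: a byte
order mark, the charset parameter of the Content-Type header, or a meta
element near the top of the page. The function name, the regular
expressions and the 1024-byte window are my own simplifications for
illustration, not the algorithm of any particular browser or validator.

```python
import re

# Hypothetical helper for illustration only; real HTML parsers use a
# more elaborate prescan than this single regular expression.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)
HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([A-Za-z0-9._:-]+)', re.IGNORECASE)


def html_declared_encoding(body, content_type="text/html"):
    """Return the encoding declared in an HTML-compatible way, or None."""
    # A UTF-8 byte order mark at the very start of the file.
    if body.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    # The charset parameter of the HTTP Content-Type header.
    match = HEADER_CHARSET.search(content_type)
    if match:
        return match.group(1).lower()
    # <meta charset=...> or <meta http-equiv=... content="...; charset=...">
    # within the first 1024 bytes (an assumed, simplified window).
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # The XML prologue is deliberately not consulted here.
    return None


xhtml_only = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<html xmlns="http://www.w3.org/1999/xhtml"><head><title>x</title>'
    b"</head><body/></html>"
)
print(html_declared_encoding(xhtml_only))
print(html_declared_encoding(b'<meta charset="utf-8">'))
```

A page that declares its encoding only in the XML prologue yields None
here, which is the situation in which a browser may fall back to its
default.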

(2) According to XML, the XML prologue (the XML declaration at the top
of the file) can be omitted for UTF-8 encoded pages. When it is omitted,
XML parsers assume that the page is UTF-8 encoded. If you try that (that
is: if you *do* delete the XML prologue from that page), you will see
that the Unicorn validator *continues* to stamp that Web page as error
free. This is because the Unicorn validator only considers the rules for
XML; it doesn't consider the rules of HTML.
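
The XML side of this is easy to check. The following small Python
sketch uses the standard library's xml.etree.ElementTree parser; the
sample text is made up for the example. It parses the same UTF-8 bytes
with and without an XML prologue and gets the same text back both
times.

```python
import xml.etree.ElementTree as ET

# Made-up sample content containing non-ASCII characters.
text = "blåbærsyltetøy 漢字"

# UTF-8 bytes with no XML prologue and no byte order mark.
without_prologue = ("<p>" + text + "</p>").encode("utf-8")
# The same document with an explicit XML prologue.
with_prologue = b'<?xml version="1.0" encoding="UTF-8"?>' + without_prologue

parsed_without = ET.fromstring(without_prologue).text
parsed_with = ET.fromstring(with_prologue).text

print(parsed_without == text, parsed_with == text)
```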

(4) Also, when you do delete the XML prologue, not only will Firefox
and IE render the page in the "wrong" encoding, but so will Safari.
However, Opera and Chrome will continue to render the page as UTF-8
thanks to the UTF-8 sniffing that they cleverly have built in. Clearly,
the behaviour of Opera and Chrome is the way to go.
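
As a rough idea of what such sniffing amounts to, here is a minimal
Python sketch: try to decode the bytes as strict UTF-8, and only fall
back to a legacy encoding if that fails. The function name and the
windows-1252 fallback are assumptions for the example (browser defaults
vary by locale), and the real heuristics in Opera and Chrome are
certainly more involved than this.

```python
def sniff_decode(body, fallback="windows-1252"):
    """Decode as UTF-8 if the bytes are valid UTF-8, else use the fallback.

    Hypothetical helper: a simplified illustration, not browser code.
    """
    try:
        return body.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Invalid UTF-8, so assume a legacy single-byte encoding.
        return body.decode(fallback, errors="replace"), fallback


utf8_page = "Ünicode ✓".encode("utf-8")
legacy_page = "Ünicode".encode("windows-1252")

print(sniff_decode(utf8_page))
print(sniff_decode(legacy_page))
```

Text in a legacy encoding that contains non-ASCII characters is very
rarely valid UTF-8 by accident, which is why this check is reasonably
safe.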

(5) It is indeed backwards that the W3C Unicorn validator doesn't
inform its users when their pages fail to include an HTML-compatible
method for declaring the encoding. This suboptimal validation could
partly be related to libxml2, which Unicorn is partly based on. As it
turns out, the command line tool xmllint (which is part of libxml2)
shows a very similar behaviour to that of Unicorn: it pays no respect
to the fact that the MIME type (or Content-Type:) is 'text/html' and
not an XML MIME type. In fact, when you do delete the XML prologue,
Unicorn issues this warning (you must click to make it visible): "No
Character Encoding Found! Falling back to UTF-8." That is quite a
confusing message to send, given that HTML parsers do not, as their
last resort, fall back to UTF-8.
-- 
leif halvard silli
