Philippe Verdy, Wed, 28 Nov 2012 11:02:45 +0100:
> In this case, Firefox and IE should not even be able to render
> *any* XHTML page because it violates the HTML5 standard.

(1) The page in question (http://www.xn--elqus623b.net/XKCD/1137.html) is, from a source code point of view, a pure XHTML page and contains no HTML-compatible method for declaring the encoding. That page therefore does indeed violate the HTML5 standard, with the result that browsers are permitted to fall back to their built-in default encodings.

(2) According to XML, the XML prologue can be omitted for UTF-8 encoded pages. When it is omitted, XML parsers assume that the page is UTF-8 encoded. If you try that (that is, if you *do* delete the XML prologue from that page), you will see that the Unicorn validator *continues* to stamp that Web page as error free. This is because the Unicorn validator only considers the rules for XML; it doesn't consider the rules of HTML.

(4) Also, when you do delete the XML prologue, not only Firefox and IE but even Safari will render the page in the "wrong" encoding. However, Opera and Chrome will continue to render the page as UTF-8, thanks to the UTF-8 sniffing that they cleverly have built in. Clearly, Opera and Chrome's behaviour is the way to go.

(5) It is indeed backwards that the W3C Unicorn validator doesn't inform its users when their pages fail to include an HTML-compatible method for declaring the encoding. This suboptimal validation could partly be related to libxml2, which Unicorn is partly based on, because, as it turns out, the command line tool xmllint (which is part of libxml2) shows very similar behaviour to that of Unicorn: it pays no respect to the fact that the MIME type (Content-Type:) is 'text/html' and not an XML MIME type. In fact, when you do delete the XML prologue, Unicorn issues this warning (you must click to make it visible): "No Character Encoding Found! Falling back to UTF-8." That is a quite confusing message to send, given that HTML parsers do not, as their last resort, fall back to UTF-8.

-- 
leif halvard silli
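
To illustrate the difference described in points (2) and (4), here is a minimal Python sketch of the two lookup orders. It is an approximation under stated assumptions, not the actual algorithm of any browser or validator: the function names `xml_encoding` and `html_encoding` are hypothetical, the `windows-1252` default stands in for whatever locale-dependent default a given browser uses, and the real HTML sniffing rules have more steps than shown here.

```python
import re

# Assumed legacy fallback; real browsers choose their default per locale.
LEGACY_DEFAULT = "windows-1252"

# Simplified patterns: a <meta> charset declaration and an XML declaration.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.I)
XML_DECL = re.compile(
    rb"^<\?xml[^>]*encoding\s*=\s*[\"']([A-Za-z0-9._-]+)[\"']")


def xml_encoding(data: bytes) -> str:
    """XML view (simplified): BOM, then XML declaration, otherwise UTF-8."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    m = XML_DECL.match(data)
    return m.group(1).decode("ascii").lower() if m else "utf-8"


def html_encoding(data: bytes, http_charset=None, sniff_utf8=False) -> str:
    """HTML view (simplified): the XML declaration is never consulted."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if http_charset:  # charset parameter of the Content-Type header
        return http_charset.lower()
    m = META_CHARSET.search(data[:1024])  # prescan the start of the page
    if m:
        return m.group(1).decode("ascii").lower()
    if sniff_utf8:  # the kind of sniffing attributed to Opera and Chrome above
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            pass
    return LEGACY_DEFAULT


# A made-up XHTML page with no <meta> charset and no HTTP charset.
page = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>caf\u00e9</title></head><body/></html>"
).encode("utf-8")
without_prologue = page.split(b"\n", 1)[1]

for label, data in (("with prologue", page), ("without", without_prologue)):
    print(label, xml_encoding(data), html_encoding(data),
          html_encoding(data, sniff_utf8=True))
```

In this sketch the XML view reports UTF-8 with or without the prologue, while the HTML view without sniffing lands on the legacy default in both cases, because nothing in the page is an HTML-compatible declaration. Only the sniffing variant recovers UTF-8.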

