Lachlan Hunt writes:
Andrew Cunningham wrote:
I was wondering if you should have another test in there: an XHTML document with no encoding declared in the HTTP header or in a meta tag, and no XML declaration. Sent as html/text.

That's text/html and an XHTML document served as text/html is HTML, regardless of any lies the DOCTYPE tells you.

Oops ... yep, text/html; more sleep last night would have helped ;)
In theory the document should only be in one of the Unicode encodings, so without a BOM, the browser should try to render it as UTF-8.

No, because when it's served as text/html, HTML rules apply, not XML rules. So without the encoding declared in the HTTP headers or the meta element, the default of ISO-8859-1 should be used (if served over HTTP, technically US-ASCII otherwise). However, browsers will actually interpret ISO-8859-1 as the Windows-1252 superset and will also attempt to use unspecified heuristics to guess the encoding, before falling back to the default.
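
A minimal sketch of the lookup order described above, assuming a simplified model: HTTP header first, then the meta element, then the default. The function names and the `SUPERSETS` table are hypothetical, the meta scan is a crude regular expression rather than a real HTML parser, and the unspecified browser heuristics are deliberately left out.

```python
import re

# Labels that browsers are reported to treat as a superset encoding
# (illustrative table, not an exhaustive or authoritative list).
SUPERSETS = {"iso-8859-1": "windows-1252", "gb2312": "gbk"}


def charset_from_content_type(header):
    """Pull the charset parameter out of a Content-Type header value."""
    m = re.search(r'charset\s*=\s*"?([\w.:-]+)"?', header or "", re.I)
    return m.group(1).lower() if m else None


def charset_from_meta(body):
    """Crude scan of the first 1024 bytes for a charset in a meta element."""
    m = re.search(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', body[:1024], re.I)
    return m.group(1).decode("ascii").lower() if m else None


def resolve_html_encoding(content_type, body, over_http=True):
    """Return the encoding label a text/html document would be read with."""
    label = charset_from_content_type(content_type) or charset_from_meta(body)
    if label is None:
        # Nothing declared: fall back to the default for the transport.
        # (Real browsers try their own guessing heuristics before this step.)
        label = "iso-8859-1" if over_http else "us-ascii"
    # Apply the superset substitution browsers actually perform.
    return SUPERSETS.get(label, label)
```

With no declaration at all, `resolve_html_encoding("text/html", body)` yields `windows-1252` in this model, which is the case the proposed extra test would exercise.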

If you're going by the HTTP specs. If you go by the XHTML 1.0 Recommendation, Appendix C would indicate "... that when the XML declaration is not included in a document, the document can only use the default character encodings UTF-8 or UTF-16." But all that is neither here nor there. I'm not fussed about the whole HTML vs XHTML debate.
The point I wanted to make is that there is another way to declare the encoding for documents in UTF-16 or UTF-32, and that's the BOM. The test should also include BOM detection as an option, i.e. do various web browsers use the BOM as part of their heuristics?
As it is, web browsers do some odd things. You've already mentioned the ISO-8859-1 -> Windows-1252 behaviour; likewise GB2312 -> GBK, Big5 and various supersets of it, etc. It is unfortunate behaviour. Things would be more straightforward if browsers didn't do this.
If you need to do an encoding conversion on a document before processing it, we find that in most cases you can rely on the declared encoding within the document. But there will be cases where this will not work. In some cases we have to track declared and actual encodings of external documents. Unfortunate, but necessary.
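
For anyone wanting to check the BOM case outside a browser, here is a rough sketch of BOM sniffing using the constants in Python's standard `codecs` module. The function name `sniff_bom` and the table `_BOMS` are hypothetical; this only shows what a BOM check could look like, not how any particular browser does it.

```python
import codecs

# Longer BOMs are listed first: the UTF-32 LE BOM (FF FE 00 00) begins
# with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data, declared="iso-8859-1"):
    """Decode using the BOM if there is one, else the declared encoding."""
    encoding, skip = sniff_bom(data)
    if encoding is None:
        return data.decode(declared)
    return data[skip:].decode(encoding)
```

Comparing the result of `sniff_bom` with the declared encoding is also one way to track the declared versus actual mismatch mentioned above.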
Andrew
******************************************************
The discussion list for  http://webstandardsgroup.org/

See http://webstandardsgroup.org/mail/guidelines.cfm
for some hints on posting to the list & getting help
******************************************************