Lachlan Hunt writes:
Andrew Cunningham wrote:
I was wondering if you should have another test in there: XHTML document
with no encoding declared in the http header or in a meta tag, and no xml
declaration. Sent as html/text.
That's text/html and an XHTML document served as text/html is HTML,
regardless of any lies the DOCTYPE tells you.
Oops ... yep, text/html. More sleep last night would have helped ;)
In theory the document should only be in one of the Unicode encodings, so
without a BOM, the browser should try to render it as UTF-8.
No, because when it's served as text/html, HTML rules apply, not XML
rules. So without the encoding declared in the HTTP headers or the meta
element, the default of ISO-8859-1 should be used (if served over HTTP,
technically US-ASCII otherwise). However, browsers will actually
interpret ISO-8859-1 as the Windows-1252 superset and will also attempt to
use unspecified heuristics to guess the encoding, before falling back to
the default.
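As a rough illustration of the precedence described above (HTTP header, then meta element, then the default), here is a minimal Python sketch. The function name pick_html_encoding, the regular expressions and the 1024-byte scan window are assumptions made for illustration only. It does not attempt the unspecified guessing heuristics that real browsers apply.

```python
import re

# Hypothetical pattern: finds a charset value inside a <meta ...> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

def pick_html_encoding(content_type, body):
    """Return the encoding a text/html consumer might settle on."""
    declared = None
    # 1. charset parameter of the HTTP Content-Type header, if present
    if content_type:
        m = re.search(r'charset\s*=\s*"?([^\s;"]+)', content_type, re.IGNORECASE)
        if m:
            declared = m.group(1)
    # 2. otherwise a meta element near the start of the document
    #    (the 1024-byte window is an arbitrary choice for this sketch)
    if declared is None:
        m = META_CHARSET.search(body[:1024])
        if m:
            declared = m.group(1).decode("ascii")
    # 3. otherwise the HTTP default for text/* types
    if declared is None:
        declared = "iso-8859-1"
    declared = declared.lower()
    # In practice browsers treat ISO-8859-1 as its Windows-1252 superset
    if declared == "iso-8859-1":
        return "windows-1252"
    return declared
```

For example, pick_html_encoding("text/html", b"<html></html>") falls through all three steps and ends up at windows-1252.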
If you're going by the HTTP specs.
If you go by the XHTML 1.0 Recommendation, Appendix C would indicate
"... that when the XML declaration is not included in a document, the
document can only use the default character encodings UTF-8 or UTF-16."
But all that is neither here nor there. I'm not fussed about the whole HTML
vs XHTML debate.
The point I wanted to make is that there is another way to declare the encoding
for documents in UTF-16 or UTF-32, and that's the BOM. The test
should also include BOM detection as an option, i.e. do various web browsers
use the BOM as part of their heuristics?
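To make the BOM idea concrete, here is a minimal Python sketch of BOM sniffing. The name sniff_bom and the returned encoding labels are my own choices for illustration; this is not a description of what any particular browser does.

```python
import codecs

# Check the longest BOMs first: the UTF-32 LE BOM starts with the same
# two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

# Usage: strip the BOM before decoding with the sniffed encoding.
raw = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, skip = sniff_bom(raw)
text = raw[skip:].decode(encoding) if encoding else None
```

One known weakness of this approach: a UTF-16 LE document whose first character after the BOM is U+0000 would be mistaken for UTF-32 LE.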
As it is, web browsers do some odd things. You've already mentioned the
ISO-8859-1 -> Windows-1252 behaviour; likewise GB2312 -> GBK, Big5 and various
supersets of it, etc. It is unfortunate behaviour. Things would be more
straightforward if browsers didn't do this.
If you need to do an encoding conversion on a document before processing the
document, we find that in most cases you can rely on the declared encoding
within a document. But there will be cases where this will not work.
In some cases we have to track declared and actual encodings of external
documents. Unfortunate, but necessary.
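A minimal Python sketch of what tracking declared versus actual encodings could look like. The function name check_declared and the fallback list are hypothetical, not our actual tooling. Note that a clean decode only shows the bytes are valid in that encoding, not that the result is the text the author intended.

```python
def check_declared(data, declared, fallbacks=("utf-8", "windows-1252")):
    """Return (declared, usable), where usable is the first encoding
    that decodes the bytes without error, or None if none does."""
    for candidate in (declared, *fallbacks):
        try:
            data.decode(candidate)
        except (UnicodeDecodeError, LookupError):
            # Invalid bytes for this encoding, or an unknown encoding name
            continue
        return declared, candidate
    return declared, None

# Usage: a document declared as UTF-8 that is really Windows-1252.
declared, usable = check_declared("café".encode("windows-1252"), "utf-8")
mismatch = declared != usable
```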
Andrew
******************************************************
The discussion list for http://webstandardsgroup.org/
See http://webstandardsgroup.org/mail/guidelines.cfm
for some hints on posting to the list & getting help
******************************************************