On Wed, 28 Dec 2011 03:20:26 +0100, Leif Halvard Silli
<xn--mlform-iua@målform.no> wrote:
> By "default" you supposedly mean "default, before error
> handling/heuristic detection". Relevance: On the "real" Web, no browser
> fails to display utf-16 as often as WebKit, its defaulting behavior
> notwithstanding. It can't be a goal to replicate that, for instance.

Do you mean heuristics when it comes to the decoding layer? Or before
that? I do think any heuristics ought to be defined.

> utf-16le becomes a label for utf-16.
> * Logically, utf-16be should become a label for utf-16 then, as well.

That's not logical.

> Is that what you suggest? Because, if the BOM can change the meaning of
> utf-16be, then it makes sense to treat the utf-16be label as well as
> the utf-16le label as synonymous with utf-16. (Thus, effectively
> utf-16le and utf-16be become defunct/unreliable on the Web.)

No, because utf-16be actually has different behavior in absence of a BOM.
It does mean they can share some common algorithm(s), but they have to
stay different encodings.
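
To illustrate the point, here is a rough Python sketch (not taken from any spec). The helper name `decode_utf16` is made up for this example, and it only checks the two-byte UTF-16 BOM. The BOM check is the shared part; the fallback when no BOM is present is where the two labels differ:

```python
import codecs


def decode_utf16(data: bytes, label: str) -> str:
    """Hypothetical helper: decode bytes labelled utf-16le or utf-16be."""
    if label not in ("utf-16le", "utf-16be"):
        raise ValueError("unsupported label: " + label)

    # Shared step: a BOM, if present, overrides the declared byte order.
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")

    # No BOM: each label keeps its own byte order, so the two
    # labels cannot be collapsed into a single encoding.
    codec = "utf-16-le" if label == "utf-16le" else "utf-16-be"
    return data.decode(codec)
```

With a BOM the label stops mattering; without one, the same bytes decode to different text depending on the label.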

> SECONDLY: You effectively say that the UTF-16 BOM should override the
> HTTP-level charset info. OK. But then you should go the full way and
> give the BOM the same overriding authority when it comes to the UTF-8
> BOM. For instance, if the HTTP server's Content-Type header specifies
> ISO-8859-1 (or 'utf-8' or 'utf-16'), but the file itself contains a
> BOM (that contradicts the HTTP info), then the BOM "wins" - in IE and
> WebKit. (And, btw, w.r.t. IE, the X-Content-Type: header has no effect
> w.r.t. treating the HTTP's charset info as authoritative - the BOM
> wins even then.)

No, I don't see why we have to go there at all. All this suggests is that
within the two utf-16 encodings the first four bytes have special meaning.
That does not at all suggest we should do the same for numerous other
encodings unrelated to utf-16.
--
Anne van Kesteren
http://annevankesteren.nl/