On Wed, 28 Dec 2011 12:31:12 +0100, Leif Halvard Silli <xn--mlform-iua@målform.no> wrote:
Anne van Kesteren Wed Dec 28 01:05:48 PST 2011:
On Wed, 28 Dec 2011 03:20:26 +0100, Leif Halvard Silli wrote:
By "default" you supposedly mean "default, before error
handling/heuristic detection". Relevance: On the "real" Web, no browser
fails to display utf-16 as often as WebKit, its defaulting behavior
notwithstanding. It can't be a goal to replicate that, for instance.

Do you mean heuristics when it comes to the decoding layer? Or before
that? I do think any heuristics ought to be defined.

Meant: While UAs may prepare for little-endian when seeing the 'utf-16'
label, they should also be prepared for detecting it as big-endian.

As for Mozilla, if the HTTP Content-Type says 'utf-16', then it is prepared
to handle BOM-less little-endian as well as BOM-less big-endian.
Whereas if you send 'utf-16le' via HTTP, then it only accepts
'utf-16le'. The same also goes for Opera. But not for WebKit and IE.
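
To make "prepared for detecting it as big-endian" concrete, here is a minimal sketch of one possible heuristic for BOM-less input labelled 'utf-16'. It is an assumption for illustration only, not what Mozilla, Opera, or any other browser actually implements: the function name guess_utf16_endianness and the NUL-counting rule are hypothetical. The idea is that mostly-ASCII text has its zero bytes at even offsets in big-endian and at odd offsets in little-endian.

```python
def guess_utf16_endianness(data: bytes, default: str = "utf-16-le") -> str:
    """Guess the byte order of BOM-less UTF-16 (hypothetical heuristic).

    Counts NUL bytes at even and odd offsets. For text dominated by
    ASCII-range characters, big-endian puts the NUL high byte first
    (even offsets) and little-endian puts it second (odd offsets).
    Falls back to `default` when the counts are tied.
    """
    even_nuls = data[0::2].count(0)
    odd_nuls = data[1::2].count(0)
    if even_nuls > odd_nuls:
        return "utf-16-be"
    if odd_nuls > even_nuls:
        return "utf-16-le"
    return default


def decode_labelled_utf16(data: bytes) -> str:
    """Decode BOM-less bytes labelled 'utf-16' using the guess above."""
    return data.decode(guess_utf16_endianness(data))
```

A heuristic like this only works for text with many ASCII-range characters, which is one reason any such detection would need to be defined precisely rather than left to each UA.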

Right. I think we should do it like Trident.


utf-16le becomes a label for utf-16.

* Logically, utf-16be should become a label for utf-16 then, as well.

That's not logical.

Care to elaborate?

Not making 'utf-16be' a de facto label for 'utf-16' only makes sense
if you plan to make it non-conforming to send files with the 'utf-16'
label unless they are little-endian encoded.

I personally think everything but UTF-8 should be non-conforming, because of the large number of gotchas embedded in the platform if you don't use UTF-8. Anyway, it's not logical because I suggested following Trident, which has different behavior for utf-16 and utf-16be.


Meaning: The "BOM" should not, for UTF-16be/le, be removed. Thus, if
the ZWNBSP character at the beginning of a 'utf-16be' labelled file is
treated as the BOM, then we do not speak about the 'utf-16be' encoding,
but about a mislabelled 'utf-16' file.
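
The distinction described above can be observed in Python's built-in codecs, which are used here only as a convenient illustration of the Unicode-standard behavior and not as a model of any browser. The 'utf-16' codec consumes a leading BOM, while 'utf-16-be' keeps U+FEFF as an ordinary character:

```python
# Big-endian BOM (FE FF) followed by the letter "A".
data = b"\xfe\xff\x00A"

# Labelled 'utf-16': the BOM selects the byte order and is removed.
as_utf16 = data.decode("utf-16")

# Labelled 'utf-16-be': the same two bytes decode to U+FEFF (ZWNBSP)
# and stay in the output.
as_utf16be = data.decode("utf-16-be")

print(repr(as_utf16))    # 'A'
print(repr(as_utf16be))  # '\ufeffA'
```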

I never spoke of any existing standard. The Unicode standard is wrong here for all implementations.


the first four bytes have special meaning.
That does not at all suggest we should do the same for numerous other
encodings unrelated to utf-16.

Why not? I see absolutely no difference here. When would you like to
render a page with a BOM as anything other than what the BOM specifies?

Interesting, it does seem like Trident/WebKit look at the specific byte sequences the BOM has in utf-8 and utf-16 before paying attention to the "actual" encoding.
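
A minimal sketch of that observed order of operations, assuming the BOM check simply runs before the labelled encoding is consulted. The function names sniff_bom and decode_with_bom_precedence are hypothetical, and this is not taken from Trident or WebKit source:

```python
import codecs


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) for a UTF-8 or UTF-16 BOM, else None."""
    if data.startswith(codecs.BOM_UTF8):
        return ("utf-8", len(codecs.BOM_UTF8))
    if data.startswith(codecs.BOM_UTF16_BE):
        return ("utf-16-be", len(codecs.BOM_UTF16_BE))
    if data.startswith(codecs.BOM_UTF16_LE):
        return ("utf-16-le", len(codecs.BOM_UTF16_LE))
    return None


def decode_with_bom_precedence(data: bytes, label: str) -> str:
    """Decode `data`, letting a BOM override the declared `label`."""
    sniffed = sniff_bom(data)
    if sniffed is not None:
        encoding, bom_length = sniffed
        # Strip the BOM and ignore the label entirely.
        return data[bom_length:].decode(encoding)
    # No BOM: fall back to whatever the label says.
    return data.decode(label)
```

With this ordering, a page served with a mismatched label but starting with EF BB BF is still decoded as UTF-8, which matches the point made earlier in the thread that a page with a BOM should rarely be rendered as anything else.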


--
Anne van Kesteren
http://annevankesteren.nl/
