Re: Default endianness of Unicode, or not

Mark Davis Sun, 14 Apr 2002 16:32:25 -0700

If serialized UTF-16 without a BOM could be in either byte order, then
its interpretation would be indeterminate. If you want to output <0x34
0x12 0x61 0x00 0x00 0xD8 0x00 0xDC>, then tag it as UTF-16LE, not just
UTF-16.
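
As a rough illustration of why the label matters, here is a minimal Python
sketch that decodes the byte sequence above twice. It assumes Python's
standard "utf-16-le" and "utf-16-be" codecs stand in for the explicit
labels; the bytes are in little-endian order, so only the little-endian
codec recovers the intended text.

```python
# The BOM-less byte sequence discussed above (little-endian order).
data = bytes([0x34, 0x12, 0x61, 0x00, 0x00, 0xD8, 0x00, 0xDC])

# With an explicit little-endian label the original code points come back.
as_le = data.decode("utf-16-le")
print([hex(ord(c)) for c in as_le])   # ['0x1234', '0x61', '0x10000']

# With a big-endian interpretation the same bytes decode without error,
# but to four unrelated code points.
as_be = data.decode("utf-16-be")
print([hex(ord(c)) for c in as_be])   # ['0x3412', '0x6100', '0xd8', '0xdc']
```

Both decodes succeed, which is the point: nothing in the bytes themselves
says which reading was intended.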


Mark
—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Cc: "Kenneth Whistler" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Sunday, April 14, 2002 15:28
Subject: Re: Default endianness of Unicode, or not


> Mark Davis <[EMAIL PROTECTED]> wrote:
>
> > Part of the problem is that the term "UTF-16" means two different
> > things. Let me see if I can make it clearer.
> >
> > Let "UTF-16M" refer to the in-memory form, which is a sequence of
> > 16-bit code units. The byte ordering is logically immaterial, since
> > it is not a sequence of bytes. Such a sequence does not use a BOM.
> > The code point sequence <U+1234 U+0061 U+10000> is represented as
> > the UTF-16M sequence <0x1234 0x0061 0xD800 0xDC00>.
> >
> > Let "UTF-16", on the other hand, refer to only the byte-serialized
> > form.
>
> I think I understand the difference between the CEF called "UTF-16"
> and the CES called "UTF-16."  That isn't where I'm having a problem.
>
> > The UTF-16M sequence <0x1234, 0x0061, 0xD800, 0xDC00> is
> > represented as one of:
> > <0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00> // BOMless
> > <0xFE 0xFF 0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00> // BOM
> > <0xFF 0xFE 0x34 0x12 0x61 0x00 0x00 0xD8 0x00 0xDC> // MOB
>
> *This* is where I'm having a problem.  Mark states here, again, that
> BOM-less UTF-16 (the CES) must be big-endian.  That is:
>
> <0x34 0x12 0x61 0x00 0x00 0xD8 0x00 0xDC> // MOBless
>
> is not an instance of any valid CES.  That, to me, is a change from
> what Unicode has stated before, and from what Ken just said about
> using "other information" (which could include external tagging,
> knowledge of the originating platform, or heuristics) to determine
> the intended byte order.
>
> Remember, I like the BOM.  I happen to think it's a useful indicator
> of both file type and byte order (not really two different topics).
> But I do think the official deprecation, or omission from mention,
> of BOM-less little-endian UTF-16 is a change from past definitions
> that renders nonconformant a potentially large amount of existing
> UTF-16 data.
>
> -Doug Ewell
>  Fullerton, California
>
>
>
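
The serializations quoted above can be reproduced with a short Python
sketch. The helper decode_utf16 and its default parameter are hypothetical
names introduced only for illustration; the big-endian default mirrors the
reading debated in this thread and is an assumption, not a statement of
what the standard requires.

```python
import codecs

# Code point sequence <U+1234 U+0061 U+10000> from the quoted message.
text = "\u1234\u0061\U00010000"

bomless_be = text.encode("utf-16-be")                       # BOMless
with_bom = codecs.BOM_UTF16_BE + bomless_be                 # BOM
with_mob = codecs.BOM_UTF16_LE + text.encode("utf-16-le")   # MOB
bomless_le = text.encode("utf-16-le")                       # "MOBless"


def decode_utf16(data, default="utf-16-be"):
    """Hypothetical helper: honor a leading BOM, else use `default`.

    `default` plays the role of the "other information" (external tag,
    platform knowledge, heuristic) mentioned in the thread.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    return data.decode(default)


print(bomless_be.hex(" "))   # 12 34 00 61 d8 00 dc 00
print(decode_utf16(with_bom) == text)                         # True
print(decode_utf16(with_mob) == text)                         # True
print(decode_utf16(bomless_le) == text)                       # False
print(decode_utf16(bomless_le, default="utf-16-le") == text)  # True
```

The last two lines show the disputed case: BOM-less little-endian data is
only recovered when the byte order is supplied from outside the data.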
