Part of the problem is that the term "UTF-16" means two different things. Let me see if I can make it clearer.
Let "UTF-16M" refer to the in-memory form, which is a sequence of 16-bit
code units. The byte ordering is logically immaterial, since it is not a
sequence of bytes, and such a sequence does not use a BOM. The code point
sequence <U+1234 U+0061 U+10000> is represented as the UTF-16M sequence
<0x1234 0x0061 0xD800 0xDC00>.

Let "UTF-16", on the other hand, refer only to the byte-serialized form.
The UTF-16M sequence <0x1234 0x0061 0xD800 0xDC00> is represented as one
of:

  <0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00>            // BOMless
  <0xFE 0xFF 0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00>  // BOM
  <0xFF 0xFE 0x34 0x12 0x61 0x00 0x00 0xD8 0x00 0xDC>  // MOB

UTF-16BE is a big-endian serialization of UTF-16M into bytes. The UTF-16M
sequence <0x1234 0x0061 0xD800 0xDC00> is represented as:

  <0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00>

UTF-16LE is a little-endian serialization of UTF-16M into bytes. The
UTF-16M sequence <0x1234 0x0061 0xD800 0xDC00> is represented as:

  <0x34 0x12 0x61 0x00 0x00 0xD8 0x00 0xDC>

Note: if you have a code point sequence starting with U+FEFF (e.g.
<U+FEFF ...>), it is represented as:

  UTF-16M:  <0xFEFF ...>
  UTF-16BE: <0xFE 0xFF ...>
  UTF-16LE: <0xFF 0xFE ...>
  UTF-16:   <0xFF 0xFE 0xFF 0xFE ...> OR <0xFE 0xFF 0xFE 0xFF ...>

Mark
—————

Γνῶθι σαυτόν — Θαλῆς ("Know thyself", Thales)
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
http://www.macchiato.com

----- Original Message -----
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: "Kenneth Whistler" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Saturday, April 13, 2002 11:42
Subject: Re: Default endianness of Unicode, or not

> On Wednesday 2002-04-10, Kenneth Whistler <[EMAIL PROTECTED]> wrote:
>
> > There, feel better?
>
> Not really. I'm getting the sense on one hand that UTF-16, sans BOM,
> can be big-endian or little-endian depending on the platform, on the
> other hand that little-endian UTF-16 isn't "legal" unless it has a BOM,
> and on the third hand (!) that all this still hasn't been fully thought
> out.
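[Mark's two layers above, the in-memory "UTF-16M" code units versus the byte serializations, can be sketched in Python. This is my illustration, not part of the thread; the helper names `to_utf16m` and `serialize` are invented for the sketch.]

```python
# Hypothetical helpers (mine, not from the mail) for Mark's two layers:
# "UTF-16M" = in-memory 16-bit code units; UTF-16/BE/LE = byte serializations.

def to_utf16m(code_points):
    """Code points -> list of 16-bit code units (the in-memory form)."""
    units = []
    for cp in code_points:
        if cp < 0x10000:
            units.append(cp)
        else:                                    # supplementary plane:
            cp -= 0x10000                        # split into a surrogate pair
            units.append(0xD800 | (cp >> 10))    # high (lead) surrogate
            units.append(0xDC00 | (cp & 0x3FF))  # low (trail) surrogate
    return units

def serialize(units, scheme):
    """Code units -> bytes. "UTF-16" is shown here as BOM-first big-endian;
    per Mark's note it could equally be the little-endian ("MOB") variant."""
    order = "little" if scheme == "UTF-16LE" else "big"
    bom = [0xFEFF] if scheme == "UTF-16" else []
    return b"".join(u.to_bytes(2, order) for u in bom + units)

units = to_utf16m([0x1234, 0x0061, 0x10000])
print(units == [0x1234, 0x0061, 0xD800, 0xDC00])                          # True
print(serialize(units, "UTF-16BE") == bytes.fromhex("12340061D800DC00"))  # True
```

[The point of keeping the two functions separate is exactly Mark's: byte order only exists in `serialize`, so there is nothing for a BOM to do at the `to_utf16m` layer.]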
>
> (In the following text, I will deliberately spell out "big-endian" and
> "little-endian" instead of using the handy abbreviations "BE" and "LE,"
> because those refer to the specifically defined encoding schemes
> UTF-16BE and UTF-16LE and I don't always mean to do that.)
>
> > * In UTF-16, <004D 0061 0072 006B> is serialized as
> > <FF FE 4D 00 61 72 00 6B 00>, <FE FF 00 4D 00 61 00 72 00 6B>, or
> > <00 4D 00 61 00 72 00 6B>.
> >
> > The third instance cited above is the *unmarked* case -- what
> > you get if you have no explicit marking of byte order with the BOM
> > signature. The contrasting byte sequence <4D 00 61 72 00 6B 00>
> > would be illegal in the UTF-16 encoding scheme.
>
> You mean because of the missing 00 byte? (Rim shot.)
>
> > [It is, of course, perfectly legal UTF-16LE.]
>
> I don't know, looks to me like a perfectly good sequence of four CJK
> ideographs. (Rim shot.)
>
> No, but seriously, folks. Can we interpret the UTF-16 encoding
> *scheme* -- we're not talking about *form* here, since that has nothing
> to do with byte order -- as being platform-endian, or does it absolutely
> have to be big-endian? Because if it has to be big-endian, even on a
> little-endian platform, then there's an awful lot of non-conformant
> "UTF-16" lurking around in Windows NT (e.g. NTFS filenames).
>
> > The intent of all this is if you run into serialized UTF-16 data,
> > in the absence of any other information, you should assume and
> > interpret it as big-endian order. The "other information" (or
> > "higher-level protocol") could consist of text labelling (as
> > in MIME labels) or other out-of-band information. It could even
> > consist of just knowing what the CPU endianness of the platform
> > you are running on is (e.g., knowing whether you are compiled
> > with BYTESWAP on or off :-) ).
> > And, of course, it is always possible for the interpreting process
> > to perform a data heuristic on the byte stream, and use *that* as
> > the other information to determine that the byte stream is
> > little-endian UTF-16 (i.e. UTF-16LE), rather than big-endian.
>
> That's quite different from Yves' original statement that "UTF-16 is
> big-endian unless a BOM is present."
>
> > And a lot of the text in the standard about being neutral between
> > byte orders is the result of the political intent of the standard,
> > way back when, to deliberately not favor either big-endian or
> > little-endian CPU architectures, and to allow use of native
> > integer formats to store characters on either platform type.
>
> This is a bit troubling. It seems to imply that the decision "way back
> when" to be neutral about byte order was merely a political gesture to
> get the little-endian guys on board, and that the rules are changing
> somewhat to favor the big-endian guys.
>
> > Again, as for many of these kinds of issues being discovered by
> > the corps of Unicode exegetes out there, part of the problem is
> > the distortion that has set in for the normative definitions in
> > the standard as Unicode has evolved from a 16-bit encoding to
> > a 21-bit encoding with 3 encoding forms and 7 encoding schemes.
>
> No argument there. There are still plenty of common-man
> interpretations, and plenty of text in TUS 3.0, that treat UTF-16 as
> the "one true" encoding form of Unicode. I know this is being cleaned
> up for 4.0; I just hope public perceptions will follow.
>
> > For the UTF-16 character encoding *form*:
> >
> > "D32 <ital>UTF-16 character encoding form:</ital> the Unicode
> > CEF which assigns each Unicode scalar value in the ranges U+0000..
> > U+D7FF and U+E000..U+FFFF to a single 16-bit code unit with the
> > same numeric value as the Unicode scalar value, and which assigns
> > each Unicode scalar value in the ranges U+10000..U+10FFFF to a
> > surrogate pair, according to Table 3-X.
> >
> > * In UTF-16, <004D, 0430, 4E8C, 10302> is represented as
> > <004D 0430 4E8C D800 DF02>, where <D800 DF02> corresponds
> > to U+10302."
>
> Fine. I don't think there are any questions concerning UTF-16 as a CEF.
>
> > For the UTF-16 character encoding *scheme*:
> >
> > "D43 <ital>UTF-16 character encoding scheme:</ital> the Unicode
> > CES that serializes a UTF-16 code unit sequence as a byte sequence
> > in either big-endian or little-endian format.
> >
> > * In UTF-16 (the CES), the UTF-16 code unit sequence
> > <004D 0430 4E8C D800 DF02> is serialized as
> > <FE FF 00 4D 04 30 4E 8C D8 00 DF 02> or
> > <FF FE 4D 00 30 04 8C 4E 00 D8 02 DF> or
> > <00 4D 04 30 4E 8C D8 00 DF 02>."
>
> Here the draft text is saying in the description that UTF-16 can be
> either big-endian or little-endian, and can include a BOM or omit it.
> Four possibilities. Good. But then the examples leave out the non-BOM
> little-endian serialization, which implies it is not conformant like
> the other three. Not so good, because (a) the description and examples
> don't really match and (b) the examples rule out the possibility of
> UTF-16 text that we might know darn well to be little-endian, not
> because of a BOM but perhaps because of the other indicators Ken
> mentioned: MIME labeling, knowledge of the originating platform,
> heuristics, etc.
>
> The exegesis continues....
>
> -Doug Ewell
> Fullerton, California
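[The draft D43 example Doug quotes can be checked mechanically. The sketch below is mine, not from the thread; the helper name `ser` is invented. It reproduces the three byte serializations the draft gives, plus the bare little-endian form whose absence Doug complains about.]

```python
# The draft D43 example: the UTF-16 code unit sequence for
# <U+004D U+0430 U+4E8C U+10302> (my sketch, not from the mail).
units = [0x004D, 0x0430, 0x4E8C, 0xD800, 0xDF02]

def ser(units, order, bom=False):
    """Serialize 16-bit code units to bytes, optionally prefixing a BOM."""
    seq = ([0xFEFF] if bom else []) + units
    return b"".join(u.to_bytes(2, order) for u in seq)

# The three serializations the draft lists:
assert ser(units, "big", bom=True)    == bytes.fromhex("FEFF004D04304E8CD800DF02")
assert ser(units, "little", bom=True) == bytes.fromhex("FFFE4D0030048C4E00D802DF")
assert ser(units, "big")              == bytes.fromhex("004D04304E8CD800DF02")

# The fourth form, bare little-endian, is the one the examples omit; a
# decoder told the byte order out of band (e.g. a MIME label) accepts it:
assert ser(units, "little").decode("utf-16-le") == "M\u0430\u4e8c\U00010302"
print("all four forms check out")
```

[Note that Python's `utf-16-le` decoder recombines the surrogate pair <D800 DF02> into U+10302, matching the draft's CEF example as well.]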

