If serialized UTF-16 without a BOM could be in either byte order, then its interpretation would be indeterminate. If you want to output <0x34 0x12 0x61 0x00 0x00 0xD8 0x00 0xDC>, which is little-endian, then tag it as UTF-16LE, not just UTF-16.
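To make the ambiguity concrete, here is a minimal sketch in Python. It assumes Python's standard "utf-16-le" and "utf-16-be" codecs as stand-ins for the tagged encoding schemes; the variable names are illustrative only. The same BOM-less bytes decode without error under either byte order, but to different text:

```python
# The BOM-less byte sequence discussed above.
data = bytes([0x34, 0x12, 0x61, 0x00, 0x00, 0xD8, 0x00, 0xDC])

# Decode the same bytes under each explicit byte order.
as_le = data.decode("utf-16-le")
as_be = data.decode("utf-16-be")

# Code points seen by each reading.
le_points = [ord(c) for c in as_le]  # U+1234, U+0061, U+10000
be_points = [ord(c) for c in as_be]  # U+3412, U+6100, U+00D8, U+00DC

print([f"U+{p:04X}" for p in le_points])
print([f"U+{p:04X}" for p in be_points])
```

Neither reading fails, so nothing in the bytes themselves reveals which one was intended; only the tag (or a BOM) settles it.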
Mark
—————
Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
http://www.macchiato.com

----- Original Message -----
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Cc: "Kenneth Whistler" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Sunday, April 14, 2002 15:28
Subject: Re: Default endianness of Unicode, or not

> Mark Davis <[EMAIL PROTECTED]> wrote:
>
> > Part of the problem is that the term "UTF-16" means two different
> > things. Let me see if I can make it clearer.
> >
> > Let "UTF-16M" refer to the in-memory form, which is sequence of
> > 16-bit code units. The byte ordering is logically immaterial, since
> > it is not a sequence of bytes. Such a sequence does not use a BOM.
> > The code point sequence <U+1234 U+0061 U+10000> is represented as
> > the UTF-16M sequence <0x1234 0x0061 0xD800 0xDC00>.
> >
> > Let "UTF-16", on the other hand, refer to only the byte-serialized
> > form.
>
> I think I understand the difference between the CEF called "UTF-16" and
> the CES called "UTF-16." That isn't where I'm having a problem.
>
> > The UTF-16M sequence <0x1234, 0x0061, 0xD800, 0xDC00> is represented
> > as one of:
> > <0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00> // BOMless
> > <0xFE 0xFF 0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00> // BOM
> > <0xFF 0xFE 0x34 0x12 0x61 0x00 0x00 0xD8 0x00 0xDC> // MOB
>
> *This* is where I'm having a problem. Mark states here, again, that
> BOM-less UTF-16 (the CES) must be big-endian. That is:
>
> <0x34 0x12 0x61 0x00 0x00 0xD8 0x00 0xDC> // MOBless
>
> is not an instance of any valid CES. That, to me, is a change from what
> Unicode has stated before, and from what Ken just said about using
> "other information" (which could include external tagging, knowledge of
> the originating platform, or heuristics) to determine the intended byte
> order.
>
> Remember, I like the BOM. I happen to think it's a useful indicator of
> both file type and byte order (not really two different topics). But I
> do think the official deprecation, or omission from mention, of BOM-less
> little-endian UTF-16 is a change from past definitions that renders
> nonconformant a potentially large amount of existing UTF-16 data.
>
> -Doug Ewell
> Fullerton, California
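The serializations listed in the quoted message can be reproduced with a short Python sketch. It assumes Python's standard `codecs` and `struct` modules; the helper `hexbytes` and the variable names are hypothetical, chosen only for this illustration.

```python
import codecs
import struct

# Code point sequence <U+1234 U+0061 U+10000> from the quoted message.
text = "\u1234\u0061\U00010000"

# Byte-serialized forms without a BOM, in each byte order.
big = text.encode("utf-16-be")
little = text.encode("utf-16-le")

# The in-memory view: four 16-bit code units, including a surrogate pair.
code_units = struct.unpack(">4H", big)

# Serialized forms with a leading BOM (FE FF for big-endian, FF FE for little-endian).
with_bom_be = codecs.BOM_UTF16_BE + big
with_bom_le = codecs.BOM_UTF16_LE + little

def hexbytes(b: bytes) -> str:
    """Format bytes the way the message writes them, e.g. '0x12 0x34'."""
    return " ".join(f"0x{x:02X}" for x in b)

print(hexbytes(big))          # the "BOMless" line
print(hexbytes(with_bom_be))  # the "BOM" line
print(hexbytes(with_bom_le))  # the "MOB" line
print(hexbytes(little))       # the "MOBless" line under dispute
```

The last line printed is the sequence whose status the thread is arguing about: it is what a little-endian serializer produces when it omits the BOM.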

