Yves Arrouye <[EMAIL PROTECTED]> wrote:

> The last time I read the Unicode standard UTF-16 was big endian
> unless a BOM was present, and that's what I expected from a UTF-16
> converter.
Conformance requirement C2 (TUS 3.0, p. 37) says: "The Unicode Standard does not specify any order of bytes inside a Unicode value."

In Section 2.7, the passage on page 28 titled "Byte Order Mark (BOM)" says: "... Ideally, all implementations of the Unicode Standard would follow only one set of byte order rules, but this scheme would force one class of processors to swap the byte order on reading and writing plain text files, even when the file never leaves the system on which it was created."

Section 13.6, "Specials: U+FEFF, U+FFF0-U+FFFF," again acknowledges the potential ambiguity of byte order without indicating a preference: "... Some machine architectures use the so-called big-endian byte order, while others use the little-endian byte order. When Unicode text is serialized into bytes, the bytes can go in either order, depending on the architecture."

And Unicode Standard Annex #19, "UTF-32," Section 2, distinguishes between UTF-32BE, UTF-32LE, and UTF-32, specifically stating that the latter may be serialized "in either big-endian or little-endian format." Presumably UTF-16 would be consistent with this.

I do remember reading once, somewhere, that big-endian was a preferred default in the absence of *any* other information (including platform of origin). But I can't find anything in the Unicode Standard to back this up, so I'll assume for now that both byte orientations are considered equally legitimate.

-Doug Ewell
 Fullerton, California
 "Little-endian" user
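[Editor's note: the BOM-driven byte-order detection discussed above can be sketched in Python. The function name and the big-endian fallback are illustrative choices, not anything mandated by the standard; as the thread notes, absent a BOM the byte order must come from out-of-band information.]

```python
def decode_utf16(data: bytes, default_big_endian: bool = True) -> str:
    """Decode UTF-16 bytes, using a leading BOM to choose the byte order.

    Without a BOM, fall back to a caller-chosen default (big-endian here,
    matching the preference Yves describes; the standard itself leaves
    the no-BOM case to higher-level protocols).
    """
    if data[:2] == b"\xfe\xff":
        # U+FEFF serialized big-endian: remaining bytes are UTF-16BE.
        return data[2:].decode("utf-16-be")
    if data[:2] == b"\xff\xfe":
        # U+FEFF serialized little-endian: remaining bytes are UTF-16LE.
        return data[2:].decode("utf-16-le")
    # No BOM: byte order is ambiguous; apply the chosen default.
    return data.decode("utf-16-be" if default_big_endian else "utf-16-le")

# "A" (U+0041) serialized both ways round-trips to the same character:
print(decode_utf16(b"\xfe\xff\x00\x41"))  # big-endian with BOM -> A
print(decode_utf16(b"\xff\xfe\x41\x00"))  # little-endian with BOM -> A
print(decode_utf16(b"\x00\x41"))          # no BOM, big-endian default -> A
```

Note that the same byte sequence without a BOM decodes to different characters depending on the assumed order (b"\x00\x41" is U+0041 big-endian but U+4100 little-endian), which is exactly the ambiguity the quoted passages acknowledge.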

