Leif Halvard Silli wrote:

>> By definition, data in the "UTF-nBE" or "UTF-nLE" encoding scheme
>> (for whatever value of n) does not have a byte-order mark.

> Sounds like you see "UTF-32BE data" as a synonym for "UTF-32BE encoding".

"UTF-32BE data" is character data that is encoded according to definition D99, which defines the UTF-32BE encoding scheme. "UTF-32BE" is not merely a short way to say "big-endian UTF-32." As defined in TUS, it has a specific meaning that goes beyond that.

> The encompassed languages of a macrolanguage are not variations of the
> macrolanguage. Likewise, the "UTF-32BE" label does not designate a
> variant of "UTF-32". You may read me that way, but I have not meant
> that "UTF-32BE" is a variant of "UTF-32".

The analogy with ISO 639-3 macrolanguages implies that under some circumstances, it is appropriate to consider "UTF-32BE" and "UTF-32LE" as separate encoding schemes, while under other circumstances, it is more appropriate to lump them together under the single term "UTF-32". But that isn't right; it still misses the point. "UTF-32BE" is not the same as "UTF-32 that happens to be big-endian." The latter MAY begin with a BOM; the former MUST NOT.
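The difference is observable in any conformant implementation. As a concrete sketch, Python's standard codecs happen to implement the two schemes under the names "utf-32" and "utf-32-be":

```python
data = b'\x00\x00\xfe\xff' + b'\x00\x00\x00\x41'  # FE FF signature, then big-endian U+0041

# Under the "UTF-32" scheme, 00 00 FE FF is a BOM: it selects big-endian
# byte order and is not part of the text.
assert data.decode('utf-32') == 'A'

# Under the "UTF-32BE" scheme there is no BOM by definition, so the same
# bytes decode to U+FEFF ZERO WIDTH NO-BREAK SPACE followed by "A".
assert data.decode('utf-32-be') == '\ufeffA'

# And a conformant UTF-32BE encoder never prepends a BOM.
assert 'A'.encode('utf-32-be') == b'\x00\x00\x00\x41'
```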

> If something is labelled "no" (for "Norwegian"), then one must "taste"
> it to know whether the content is Norwegian Bokmål ("nb") or Norwegian
> Nynorsk ("nn"). Likewise, if something is labelled, by default (as in
> XML) or explicitly, as "UTF-16", then the parser must taste/sniff -
> typically by sniffing the BOM - whether the document is big-endian or
> little-endian.

Data tagged as "UTF-16" might contain a BOM, or it might not. If it does not, the endianness of the data is much more likely to be determined by platform or operating-system conventions than by heuristics. Comparatively few systems will accept and comprehend UTF-16 or UTF-32 data of the "wrong" endianness for the platform. Andrew West's BabelPad is one tool that will sniff non-BOM data, but the whole point of BabelPad is to be Unicode-aware and to help the user be Unicode-aware; most systems and apps are not like that.
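For what it's worth, the BOM sniffing itself is simple to sketch. This is an illustrative Python sketch (the function name sniff_bom is mine, not from any spec), and it only identifies a leading signature; the absence of a BOM tells you nothing by itself:

```python
def sniff_bom(data: bytes):
    """Return (description, bom_length) for a recognized leading BOM, else (None, 0)."""
    # Test the 4-byte UTF-32 BOMs before the 2-byte UTF-16 BOMs:
    # FF FE 00 00 (UTF-32LE) also begins with FF FE (UTF-16LE). In fact a
    # UTF-16LE document whose first character is U+0000 starts with the
    # same four bytes, so this check is a heuristic, not a certainty.
    if data.startswith(b'\x00\x00\xfe\xff'):
        return ('UTF-32, big-endian', 4)
    if data.startswith(b'\xff\xfe\x00\x00'):
        return ('UTF-32, little-endian', 4)
    if data.startswith(b'\xfe\xff'):
        return ('UTF-16, big-endian', 2)
    if data.startswith(b'\xff\xfe'):
        return ('UTF-16, little-endian', 2)
    return (None, 0)
```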

> When the BOM is supposed to be interpreted as the BOM, then we cannot
> label the document as e.g. "UTF-16BE" but must use, by default or
> explicitly, the label "UTF-16". But "UTF-16BE data" should be a valid
> term in either case (provided the UTF-16 file is big-endian).

Propose this change of terminology to the UTC. It is not consistent with their existing use of the terms.

> A file labelled "UTF-16" is specified to contain BOM + big-endian data
> or BOM + little-endian data or - third - just big-endian data, without
> BOM. Thus, one of the encoding variants that can be legally labelled
> "UTF-16" is inseparable from "UTF-16BE" in every way.

That's correct. That situation is called out in the official definitions; it doesn't imply a loosening of them.

> The "UTF-16" label does not mandate the use of the BOM.

I never said it did.

We are pretty much going round and round on this. The bottom line for me is, it would be nice if there were a shorthand way of saying "big-endian UTF-16," and many people (including you?) feel that "UTF-16BE" is that shorthand, but it is not. That term has a DIFFERENT MEANING. The following stream:

FE FF 00 48 00 65 00 6C 00 6C 00 6F

is valid big-endian UTF-16, but it is NOT valid "UTF-16BE" unless the leading U+FEFF is explicitly intended as a zero-width no-break space, in which case it must not be stripped.
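To put that in runnable terms: Python's stdlib codecs happen to implement the two schemes under the names "utf-16" and "utf-16-be", and they treat this exact stream differently.

```python
data = bytes.fromhex('FE FF 00 48 00 65 00 6C 00 6C 00 6F')

# "UTF-16" scheme: FE FF is a BOM; it selects big-endian order and is stripped.
assert data.decode('utf-16') == 'Hello'

# "UTF-16BE" scheme: no BOM exists by definition; the leading U+FEFF is a
# ZERO WIDTH NO-BREAK SPACE and remains in the decoded text.
assert data.decode('utf-16-be') == '\ufeffHello'
```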

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell
