Leif Halvard Silli wrote:

>> By definition, data in the "UTF-nBE" or "UTF-nLE" encoding scheme
>> (for whatever value of n) does not have a byte-order mark.

> Sounds like you see "UTF-32BE data" as a synonym for "UTF-32BE encoding".

"UTF-32BE data" is character data that is encoded according to definition D99, which defines the UTF-32BE encoding scheme. "UTF-32BE" is not merely a short way to say "big-endian UTF-32." As defined in TUS, it has a specific meaning that goes beyond that.

> The encompassed languages of a macrolanguage are not variations of the
> macrolanguage. Likewise, the "UTF-32BE" label does not designate a
> variant of "UTF-32". You may read me that way, but I have not meant
> that "UTF-32BE" is a variant of "UTF-32".

The analogy with ISO 639-3 macrolanguages implies that under some circumstances, it is appropriate to consider "UTF-32BE" and "UTF-32LE" as separate encoding schemes, while under other circumstances, it is more appropriate to lump them together under the single term "UTF-32". But that isn't right; it still misses the point. "UTF-32BE" is not the same as "UTF-32 that happens to be big-endian." The latter MAY begin with a BOM; the former MUST NOT.
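The difference is observable in any conformant implementation. As a concrete sketch, Python's standard codecs happen to implement the two schemes under the names "utf-32" and "utf-32-be":

```python
data = b'\x00\x00\xfe\xff' + b'\x00\x00\x00\x41'  # FE FF signature, then big-endian U+0041

# Under the "UTF-32" scheme, 00 00 FE FF is a BOM: it selects big-endian
# byte order and is not part of the text.
assert data.decode('utf-32') == 'A'

# Under the "UTF-32BE" scheme there is no BOM by definition, so the same
# bytes decode to U+FEFF ZERO WIDTH NO-BREAK SPACE followed by "A".
assert data.decode('utf-32-be') == '\ufeffA'

# And a conformant UTF-32BE encoder never prepends a BOM.
assert 'A'.encode('utf-32-be') == b'\x00\x00\x00\x41'
```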

> If something is labelled "no" (for "Norwegian"), then one must "taste"
> it to know whether the content is Norwegian Bokmål ("nb") or Norwegian
> Nynorsk ("nn"). Likewise, if something is labelled, by default (as in
> XML) or explicitly, as "UTF-16", then the parser must taste/sniff -
> typically by sniffing the BOM - whether the document is big-endian or
> little-endian.

Data tagged as "UTF-16" might contain a BOM, or it might not. If it does not, the endianness of the data is much more likely to be determined by platform or operating-system conventions than by heuristics. Comparatively few systems will accept and comprehend UTF-16 or UTF-32 data of the "wrong" endianness for the platform. Andrew West's BabelPad is one tool that will sniff non-BOM data, but the whole point of BabelPad is to be Unicode-aware and to help the user be Unicode-aware; most systems and apps are not like that.
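For what it's worth, the BOM sniffing itself is simple to sketch. This is an illustrative Python sketch (the function name sniff_bom is mine, not from any spec), and it only identifies a leading signature; the absence of a BOM tells you nothing by itself:

```python
def sniff_bom(data: bytes):
    """Return (description, bom_length) for a recognized leading BOM, else (None, 0)."""
    # Test the 4-byte UTF-32 BOMs before the 2-byte UTF-16 BOMs:
    # FF FE 00 00 (UTF-32LE) also begins with FF FE (UTF-16LE). In fact a
    # UTF-16LE document whose first character is U+0000 starts with the
    # same four bytes, so this check is a heuristic, not a certainty.
    if data.startswith(b'\x00\x00\xfe\xff'):
        return ('UTF-32, big-endian', 4)
    if data.startswith(b'\xff\xfe\x00\x00'):
        return ('UTF-32, little-endian', 4)
    if data.startswith(b'\xfe\xff'):
        return ('UTF-16, big-endian', 2)
    if data.startswith(b'\xff\xfe'):
        return ('UTF-16, little-endian', 2)
    return (None, 0)
```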

> When the BOM is supposed to be interpreted as the BOM, then we cannot
> label the document as e.g. "UTF-16BE" but must use, by default or
> explicitly, the label "UTF-16". But "UTF-16BE data" should be a valid
> term in either case (provided the UTF-16 file is big-endian).

Propose this change of terminology to the UTC. It is not consistent with their existing use of the terms.

> A file labelled "UTF-16" is specified to contain BOM + big-endian data
> or BOM + little-endian data or - third - just big-endian data, without
> BOM. Thus, one of the encoding variants that can be legally labelled
> "UTF-16" is inseparable from "UTF-16BE" in every way.

That's correct. That situation is called out in the official definitions; it doesn't imply a loosening of them.

> The "UTF-16" label does not mandate the use of the BOM.

I never said it did.

We are pretty much going round and round on this. The bottom line for me is, it would be nice if there were a shorthand way of saying "big-endian UTF-16," and many people (including you?) feel that "UTF-16BE" is that shorthand, but it is not. That term has a DIFFERENT MEANING. The following stream:

FE FF 00 48 00 65 00 6C 00 6C 00 6F

is valid big-endian UTF-16, but it is NOT valid "UTF-16BE" unless the leading U+FEFF is explicitly intended as a zero-width no-break space, in which case it must not be stripped.
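To put that in runnable terms: Python's stdlib codecs happen to implement the two schemes under the names "utf-16" and "utf-16-be", and they treat this exact stream differently.

```python
data = bytes.fromhex('FE FF 00 48 00 65 00 6C 00 6C 00 6F')

# "UTF-16" scheme: FE FF is a BOM; it selects big-endian order and is stripped.
assert data.decode('utf-16') == 'Hello'

# "UTF-16BE" scheme: no BOM exists by definition; the leading U+FEFF is a
# ZERO WIDTH NO-BREAK SPACE and remains in the decoded text.
assert data.decode('utf-16-be') == '\ufeffHello'
```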

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell
