Doug Ewell, Sat, 5 Jan 2013 18:11:59 -0700:
> "Martin J. Dürst" wrote:
>>> When Unicode data is sent across the Internet we would say, "The
>>> UTF-32 data was sent across the Internet."
>>
>> The first is correct. The second is correct. The third is wrong.
>> [ snip ] you would say:
>>
>> "The UTF-32BE data was sent across the Internet."
>
> The larger problem here is that most civilians don't understand what
> is truly meant by "UTF-32BE" and "UTF-32LE".
>
> In general, people think these terms simply mean "big-endian UTF-32"
> and "little-endian UTF-32" respectively, without the additional
> connotation (defined in D99 and D100) that U+FEFF at the beginning of
> a stream defined as "UTF-32BE" or "UTF-32LE" is supposed to be
> interpreted, against all logic, as a zero-width no-break space.

(I agree that it is against logic.)

> Because of this, it's not automatically the case that "the file
> contains UTF-32BE data." That statement implies that there is no
> initial U+FEFF, or if there is one, that it is meant to be a ZWNBSP.
> You could just as easily have a "UTF-32" file, which might have an
> initial U+FEFF (which then defines the endianness of the data) or
> might not (which means the data is big-endian unless a "higher-level
> protocol" dictates otherwise).

I believe that even the U+FEFF *itself* is either UTF-32LE or UTF-32BE.
Thus, there is, per se, no implication of lack of byte-order mark in
Martin’s statement. Assuming that the label "UTF-32" is defined the same
way as the label "UTF-16", then it is an umbrella label or a "macro
label" (hint: macro language) which covers the two *real* encodings -
UTF-32LE and UTF-32BE.

Just my 5 øre.
-- 
leif halvard silli
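
For readers who want to see the labelling difference in practice, here is a minimal sketch. It is not from the thread: it assumes Python 3 and its built-in "utf-32", "utf-32-be" and "utf-32-le" codecs, and the variable names are purely illustrative. It decodes the same kind of byte stream (U+FEFF followed by "A") under the different labels.

```python
import codecs

# U+FEFF followed by "A", serialized big-endian (four bytes per code point).
data_be = codecs.BOM_UTF32_BE + "A".encode("utf-32-be")
# The same two code points, serialized little-endian.
data_le = codecs.BOM_UTF32_LE + "A".encode("utf-32-le")

# Label "UTF-32": the leading U+FEFF is taken as a byte-order mark.
# It selects the byte order and is removed from the decoded text.
as_utf32_from_be = data_be.decode("utf-32")
as_utf32_from_le = data_le.decode("utf-32")

# Label "UTF-32BE": the byte order is fixed by the label itself, so the
# leading U+FEFF is kept as an ordinary character in the decoded text.
as_utf32be = data_be.decode("utf-32-be")

print(repr(as_utf32_from_be))
print(repr(as_utf32_from_le))
print(repr(as_utf32be))
```

Under these assumptions, both byte streams decode to the same text under the "UTF-32" label, while the "UTF-32BE" label leaves U+FEFF at the start of the decoded string. That is the behaviour Doug describes for D99 and D100, at least as this one implementation realizes it.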

