Tex Texin <tex at i18nguy dot com> wrote:

> Although it can help prevent that confusion, for it to be a *good
> reason*, it first has to be shown (or believed) that not only is there a
> need for an indicator of endian-ness, but there is also a need for a
> (weak) encoding indicator.
>
> Second, it has to be shown (or believed) that the indicator should be
> this particular value 00 00 FE FF and not another one that doesn't offer
> this potential confusion to begin with.
>
> I can buy endian-ness. I am not sold on (weak) encoding signatures.

These are good observations. I wasn't part of the decision-making process (then or now), but until Ken or Asmus or Mark comes up with a more authoritative response, here is how I see this issue.

The decision to encode some sort of byte-order mark probably occurred early in the design of Unicode. Remember what things were like 12 years ago, when this decision was likely made:

1. Plain text files were very common, much more common than fancy text, but they were generally not marked with respect to character encoding (except, I suppose, in the ISO 2022 world). This caused problems when files were interchanged between MS-DOS and Windows 2.x (or Unix or other 8859-1-ish systems) and the HP world with its "Roman-8" CCS. (I definitely remember the heuristics involved in auto-detecting CP437 vs. CP1252.)

2. Endianness was already known to be an issue, particularly between the Intel (PC) and Motorola (Mac) worlds. Considering the speed of hardware at the time, conversion between big-endian and little-endian was widely regarded as a performance bottleneck (despite the fact that all processors contained a SWAB-style machine instruction). Holy wars developed over the "correct" byte order.

3. Software written with integer data in mind often used the value -1 as a "sentinel" to signify the end of normal data. This practice was common not only in ASCII-based systems, but in EBCDIC as well (where the term EO (Eight Ones) was used).

4. There was widespread understanding that 16-bit Unicode was being introduced to an overwhelmingly 8-bit world. Unicode text data was in danger of being misinterpreted as 8-bit data, or data of the opposite byte order. There was a sense that it was necessary to introduce a character value that would function not only as a byte order mark, but also as what Tex calls a "weak encoding indicator," because for some time it would continue to be necessary to distinguish Unicode from non-Unicode data.
(Today, with most Unicode data in 8-bit-friendly UTF-8, we see that this need has not gone away.)

Again, it was *not* common at the time for text data to be supplemented with out-of-band encoding information. SGML, HTML, XML, etc. provide great mechanisms for this today, but in 1990 they either did not exist or were not in common use for ordinary text.

0xFFFF could not be used as a signature because of the prevalent use of -1 as a sentinel value. And in any case, if indication of byte order was a goal, then clearly no value of the form U+xxyy could be used where xx = yy.

0xFE and 0xFF were found to be particularly infrequent (in either order) at the beginning of contemporary text files.

If you are going to define a code point U+xxyy as a byte order mark, it makes sense to reserve U+yyxx as a noncharacter (modern terminology). This approach introduces less "potential confusion" than any other alternative.

Defining U+FEFF as the BOM and U+FFFE as the noncharacter, instead of the other way around, permitted the two noncharacter values U+FFFE and U+FFFF to be contiguous, which seems more elegant somehow than if they were separated by a 256-character row.

Later, when "Unicode" came to mean not only UTF-16 but also UTF-7, UTF-8, UTF-32, SCSU, BOCU, ACE, etc., the "encoding indicator" function of the BOM expanded, so that it distinguished UTF-16 not only from non-Unicode charsets, but also from UTF-8, UTF-32, etc. The "potential confusion" only occurs here when deciding between little-endian UTF-16 and UTF-32, and when allowing for the possibility of U+0000 in ordinary text (quite an unlikely scenario, IMHO).
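
To make that last point concrete, here is a rough sketch in Python of how a reader might sniff the signature at the start of a stream. This is only an illustration, not anything prescribed by the standard: the names `sniff_bom` and `is_ambiguous` are made up for this example, and it assumes the caller has the first few bytes of the data in hand. It relies only on the BOM constants in Python's standard `codecs` module.

```python
import codecs

# Order matters: the UTF-32 marks are tested before the UTF-16 marks because
# the UTF-32LE mark (FF FE 00 00) begins with the UTF-16LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no signature is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def is_ambiguous(data: bytes) -> bool:
    """True when the leading bytes fit both a UTF-32LE signature and a
    UTF-16LE signature followed by U+0000."""
    return data.startswith(codecs.BOM_UTF32_LE)


# Little-endian UTF-16 text that really does begin with U+FEFF U+0000 is
# byte-for-byte identical to a UTF-32LE signature, so the sniffer guesses
# UTF-32LE; this is the one case where the guess can be wrong.
tricky = "\ufeff\u0000".encode("utf-16-le")
print(sniff_bom(tricky), is_ambiguous(tricky))
```

Testing the longer signatures first is a design choice of this sketch: it resolves the overlap in favor of UTF-32, on the assumption stated above that U+0000 right after the signature is rare in ordinary text.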

The other source of confusion, of course, has to do with U+FEFF being given a second role as zero-width no-break space (and having its name changed from BYTE ORDER MARK), and even then the confusion only exists in the equally unlikely scenario that a ZWNBSP is assumed to be valid at the start of a text stream (where it doesn't have the requisite two adjacent characters between which to prevent breaking). In any event, we are now up to Unicode 3.2, where U+2060 WORD JOINER is poised to remove this second role from U+FEFF, thus removing the source of confusion.

In summary, I think "the need for a (weak) encoding indicator" had already been shown (or believed), and the choice of U+FEFF was made with that evidence or belief already in hand.

I would definitely appreciate any assistance from the Unicode pioneers if I got any of these facts or assumptions wrong.

-Doug Ewell
 Fullerton, California

