Shlomi Tal <[EMAIL PROTECTED]> wrote:

> Microsoft Unicode Text File Byte Order Mark (BOM) FAQ
> ...
> There is another, very common Unicode encoding scheme called UTF-8,
> which maps the Unicode repertoire into sequences of bytes. Since
> the order of bytes (as opposed to words of more than one byte) is
> the same for all processors, UTF-8 does not require a BOM. It can
> have one, though.
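
To make the byte-order point in the quoted passage concrete, here is a
minimal Python 3 sketch (not from the FAQ; the variable names are
illustrative). It encodes U+FEFF in the two UTF-16 byte orders and in
UTF-8, and compares the results with the constants in the standard
codecs module. Only the UTF-16 forms differ by byte order; the UTF-8
form is a fixed three-byte sequence.

```python
import codecs

BOM = "\ufeff"  # U+FEFF

utf16_be = BOM.encode("utf-16-be")  # big-endian: FE FF
utf16_le = BOM.encode("utf-16-le")  # little-endian: FF FE
utf8 = BOM.encode("utf-8")          # always EF BB BF, no byte order involved

# The standard library exposes the same byte sequences as constants.
matches_stdlib = (
    utf16_be == codecs.BOM_UTF16_BE
    and utf16_le == codecs.BOM_UTF16_LE
    and utf8 == codecs.BOM_UTF8
)

print(utf16_be.hex(), utf16_le.hex(), utf8.hex())  # feff fffe efbbbf
```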

Shlomi explains the "signature" function of the BOM much later in his
FAQ, but just to summarize, U+FEFF in its role as BYTE ORDER MARK -- as
opposed to ZERO WIDTH NO-BREAK SPACE (not "Non-Breaking") -- has two
(overlapping) purposes:

* as a true byte order mark
* as a text format signature

These two uses are explained in TUS 3.0, Section 13.6, "Specials"
(p. 324).

UTF-8 does not need a byte order mark per se, of course, but in certain
environments it may benefit from a signature. This point is regularly
missed by those who view BOMs in UTF-8 text as needless junk, and
Microsoft's use of this marker as evil (not necessarily Shlomi's
opinion, but certainly that of Markus Kuhn and many other Linux
faithful).

It is true that Unix/Linux systems, even those that are configured to
use UTF-8, often expect the first 'n' bytes of a file to identify the
file type (e.g. "#!"), and may not work correctly if the file starts
with 0xEF 0xBB 0xBF instead. But representing the issue as "U+FEFF is a
byte-order mark, therefore UTF-8 files don't need it" sheds no light on
the reasons why some vendors choose to include it.

> Web pages usually use UTF-8, and although they can handle the BOM,
> it may appear as a strange character (a blank square or a question
> mark) on a browser that doesn't recognize it, and may also cause
> the above troubles when the file is saved to the local disk.

There is no reason for a Unicode-compliant browser to display "a blank
square or a question mark" for U+FEFF instead of a zero-width no-break
space. U+FEFF is one of the better-known Unicode characters and
legitimately has the ZWNBSP semantic, and will (perhaps regrettably)
continue to have it even after U+2060 WORD JOINER becomes widely
recognized as the preferred character for that function.

> old 8-bit "ANSI" (Microsoft's non-standard name for its 8-bit
> Windows codepages

If it were up to me, I would dispense with the gratuitous, 15-year-old
jab at Microsoft for calling the Windows code pages "ANSI."
Their reasons for doing so have been documented often. Putting "ANSI"
in quotation marks might have been sufficient. But I understand that
this FAQ is intended for a Unix/Linux audience, and that may simply be
the price of admission.

> Since UTF-16 text files are not meant for open transfer anyway,
> this is not an important issue. As for database applications and
> other situations where text files are merged, a Unicode-aware
> application should be able to discard all following U+FEFF
> characters.

The reference to UTF-16 being "not meant for open transfer" and the
statement about discarding non-initial U+FEFF are not strictly correct
(because non-initial U+FEFF could also be ZWNBSP), but in the limited
context of this FAQ they are probably harmless.

Finally, I agree with Shlomi that the references to UTF-7 are "not of
any importance," to the point where I am not sure why UTF-7 is even
mentioned. As Shlomi points out, Microsoft products do not treat UTF-7
specially, except that IE recognizes the UTF-7 BOM and sets its
encoding accordingly (but this is true for any UTF-7 sequence, not just
the BOM; try loading a text file containing only the 11 ASCII
characters "M+APw-nchen").

-Doug Ewell
 Fullerton, California
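
For readers who want to try the last two points, here is a minimal
Python 3 sketch. It is not taken from the FAQ: strip_signature is a
hypothetical helper, and the sample byte strings are made up for
illustration. It relies only on Python's built-in "utf-8", "utf-8-sig"
and "utf-7" codecs. It shows that removing only an initial U+FEFF
leaves a later one (which could be a ZWNBSP) in place, and that the
11-character UTF-7 sample decodes to a word containing U+00FC.

```python
def strip_signature(text: str) -> str:
    """Hypothetical helper: drop one leading U+FEFF, keep any later ones."""
    return text[1:] if text.startswith("\ufeff") else text

# Made-up UTF-8 data: signature bytes EF BB BF, then a second U+FEFF mid-text.
raw = b"\xef\xbb\xbfabc\xef\xbb\xbfdef"

decoded = raw.decode("utf-8")        # plain UTF-8 keeps both U+FEFF characters
cleaned = strip_signature(decoded)   # only the initial one is removed

# The built-in "utf-8-sig" codec likewise skips only a leading signature.
via_codec = raw.decode("utf-8-sig")

# The 11 ASCII characters from the message, read as UTF-7.
utf7_sample = b"M+APw-nchen"
word = utf7_sample.decode("utf-7")   # "+APw-" is U+00FC in modified base64

print(len(utf7_sample), word)
```

Discarding every U+FEFF, as the quoted FAQ text suggests, would instead
be something like text.replace("\ufeff", ""), which also removes any
that were intended as ZWNBSP.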

