>UTF-8 does not need a byte order mark per se, of course, but in certain
>environments it may benefit from a signature. This point is regularly
>missed by those who view BOMs in UTF-8 text as needless junk, and
>Microsoft's use of this marker as evil (not necessarily Shlomi's
>opinion, but certainly that of Markus Kuhn and many other Linux
>faithful).
>
>It is true that Unix/Linux systems, even those that are configured to
>use UTF-8, often expect the first 'n' bytes of a file to identify the
>file type (e.g. "#!"), and may not work correctly if the file starts
>with 0xEF 0xBB 0xBF instead. But representing the issue as "U+FEFF is a
>byte-order mark, therefore UTF-8 files don't need it" sheds no light on
>the reasons why some vendors choose to include it.
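
To make the quoted "#!" point concrete, here is a minimal Python sketch of a naive byte-level file-type check. The helper names (has_utf8_signature, looks_like_script, strip_utf8_signature) are made up for illustration and are not taken from any real tool; actual kernels and shells do their own checks, so treat this only as a picture of the general problem.

```python
import codecs

# The UTF-8 signature is U+FEFF encoded in UTF-8: 0xEF 0xBB 0xBF.
UTF8_SIGNATURE = codecs.BOM_UTF8


def has_utf8_signature(data: bytes) -> bool:
    """True if the data starts with the three signature bytes."""
    return data.startswith(UTF8_SIGNATURE)


def looks_like_script(data: bytes) -> bool:
    """Naive magic-number test: the first two bytes must be '#!'."""
    return data[:2] == b"#!"


def strip_utf8_signature(data: bytes) -> bytes:
    """Return the data without a leading signature, if there is one."""
    if has_utf8_signature(data):
        return data[len(UTF8_SIGNATURE):]
    return data


plain = b"#!/bin/sh\necho hello\n"
signed = UTF8_SIGNATURE + plain

print(looks_like_script(plain))                         # check on the unsigned file
print(looks_like_script(signed))                        # check on the signed file
print(looks_like_script(strip_utf8_signature(signed)))  # check after stripping
```

The signed copy fails the two-byte test because its first bytes are 0xEF 0xBB rather than "#!". Python's own "utf-8-sig" codec drops a leading signature when decoding, which is one way a Unicode-aware reader can accept both forms.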

Well, I did receive a comment from a Linux user saying that the UTF-8
BOM is meant as a magic file-type marker, not as a mark for the byte
order, but such tagging of plain-text files sounds weird even to me,
who uses Windows 2000 most of the time. I may be completely off the mark
here, but tagging plain-text files seems to me an eerie reminder of the
ISO-2022 escape sequences. In UTF-16 it's sensible, because the OS needs
to interpret the byte values differently (e.g. "black heart suit"
instead of "ampersand, small Latin letter E").

> > Web pages usually use UTF-8, and although they can handle the BOM,
> > it may appear as a strange character (a blank square or a question
> > mark) on a browser that doesn't recognize it, and may also cause
> > the above troubles when the file is saved to the local disk.
>
>There is no reason for a Unicode-compliant browser to display "a blank
>square or a question mark" for U+FEFF instead of a zero-width no-break
>space. U+FEFF is one of the better-known Unicode characters and
>legitimately has the ZWNBSP semantic, and will (perhaps regrettably)
>continue to have it even after U+2060 WORD JOINER becomes widely
>recognized as the preferred character for that function.

If Mozilla has already fixed that, then good. I have Mozilla on Win2K,
but on my Linux partition I have only Netscape 4.7, and that one does
mishandle the BOM.

> > old 8-bit "ANSI" (Microsoft's non-standard name for its 8-bit
> > Windows codepages
>
>If it were up to me, I would dispense with the gratuitous, 15-year-old
>jab at Microsoft for calling the Windows code pages "ANSI." Their
>reasons for doing so have been documented often. Putting "ANSI" in
>quotation marks might have been sufficient. But I understand that this
>FAQ is intended for a Unix/Linux audience, and that may simply be the
>price of admission.

It wasn't in the original, and I don't have any particular grudge
against Microsoft.
I added it after Markus Kuhn pointed out that the Microsoft term "ANSI"
was a misnomer. I'd rather avoid the term altogether, but it is
ubiquitous in Windows 2000/XP.

> > Since UTF-16 text files are not meant for open transfer anyway,
> > this is not an important issue. As for database applications and
> > other situations where text files are merged, a Unicode-aware
> > application should be able to discard all following U+FEFF
> > characters.
>
>The reference to UTF-16 being "not meant for open transfer" and the
>statement about discarding non-initial U+FEFF are not strictly correct
>(because non-initial U+FEFF could also be ZWNBSP), but in the limited
>context of this FAQ they are probably harmless.

I have UTF-16 text files on my machine, but in exchange over the Web,
e-mail, and newsgroups you won't see anything but UTF-8. That's what I
meant by "open transfer".

>Finally, I agree with Shlomi that the references to UTF-7 are "not of
>any importance," to the point where I am not sure why UTF-7 is even
>mentioned. As Shlomi points out, Microsoft products do not treat UTF-7
>specially, except that IE recognizes the UTF-7 BOM and sets its encoding
>accordingly (but this is true for any UTF-7 sequence, not just the BOM;
>try loading a text file containing only the 11 ASCII characters
>"M+APw-nchen").

I mentioned UTF-7 (as opposed to UTF-1, which is really obsolete)
because it still appears in current environments: mailers (Outlook &
co.), browsers (Netscape, Mozilla, and also as an option in MSIE if you
add the META tag to call for it), and conversion routines (Win2K cmd.exe
handles UTF-7 when you do "chcp 65000"). I haven't seen any frequent use
of it, but mailers definitely still support it, as I verified in a post
of mine to misc.tests.

About "M+APw-nchen", are you quite sure? When I drag a text file
containing this string into Internet Explorer 5.0, it doesn't display
the text converted from UTF-7.
It displays "small Latin letter u with diaeresis" correctly when the
text file contains the UTF-7 BOM: "+/v8-M+APw-nchen".
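
For anyone who wants to check the byte-level claims in this exchange without a browser, here is a small sketch using Python's built-in codecs. It only shows what the strings decode to once the encoding is already known; it says nothing about how Internet Explorer or any mailer decides which encoding to apply. The variable names are mine.

```python
import unicodedata

# The 11 ASCII characters from the quoted message, decoded as UTF-7.
plain = b"M+APw-nchen".decode("utf-7")

# The same text preceded by "+/v8-", which is U+FEFF encoded in UTF-7.
# Python's utf-7 codec keeps the U+FEFF in the decoded string.
signed = b"+/v8-M+APw-nchen".decode("utf-7")

# The two bytes of "&e" (0x26 0x65), read as one big-endian UTF-16 unit.
heart = b"&e".decode("utf-16-be")

# ascii() keeps the output printable on any terminal encoding.
print(ascii(plain))
print(ascii(signed))
print(unicodedata.name(plain[1]))
print("U+%04X" % ord(heart), unicodedata.name(heart))
```

The last line is the UTF-16 case mentioned earlier in this message: the same two bytes are an ampersand and a small "e" in an 8-bit reading, and a single character when read as UTF-16.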

