> A Unicode text file beginning with FEFF is
> big-endian, and a file beginning with FFFE (not a legal Unicode
> character for any other purpose) is little-endian.

This is incorrect. Here is a summary of the meaning of those bytes at the
start of text files with different Unicode encoding forms.

beginning with bytes FE FF:
- UTF-16 => big endian, omitted from contents
- UTF-16BE => ZWNBSP
- UTF-16LE, UTF-8, UTF-32, UTF-32BE, UTF-32LE => malformed, file corrupted

beginning with bytes FF FE:
- UTF-16 => little endian, omitted from contents
- UTF-16LE => ZWNBSP
- UTF-32 => little endian (if followed by bytes 00 00), omitted from contents
- UTF-32LE => different code points, depending on following bytes
- UTF-16BE, UTF-8, UTF-32BE => malformed, file corrupted

> In addition, a Unicode encoding scheme named UTF-7, which was meant as

Worth mentioning that SCSU also has a BOM.

Mark
—————
Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
http://www.macchiato.com

----- Original Message -----
From: "Shlomi Tal" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, April 09, 2002 10:43
Subject: MS/Unix BOM FAQ again (small fix)

> A small fix for the FAQ; specifically, a fix for the typo/braino of
> construing 0x071F as little-endian 1F 70 instead of (the now fixed) 1F 07.
> Thanks to Wladislaw Vaintroub for pointing it out for me.
>
> --- BEGIN ---
>
> Microsoft Unicode Text File Byte Order Mark (BOM) FAQ
>
> by Shlomi Tal ([EMAIL PROTECTED])
>
> Contents
>
> 1. What is a BOM?
> 2. Why does it matter?
> 3. Is the BOM mandatory or optional?
> ---------------------------------------------------------------------
>
> 1. What is a BOM?
> ^^^^^^^^^^^^^^^^^
>
> BOM, or Byte-Order Mark, is a signature at the beginning of a Unicode
> text file. Since different processors handle sequences of bytes in a
> particular way, the BOM is used to mark which byte-order the text file
> was written in.
>
> Processors are either big-endian or little-endian. The former put the
> most significant byte first, and the latter put the least significant
> byte first.
> So that the 16-bit number 0x071F is serialized as:
>
>     Big-endian     07 1F
>     Little-endian  1F 07
>
> Obviously a code with the value 0x071F will be interpreted as 0x1F07
> if it passes from a processor of different byte-order without
> information about its original state. This is what the Unicode BOM
> seeks to avoid.
>
> The Unicode standard permits the character U+FEFF (Zero-Width
> Non-Breaking Space) at the beginning of the file as a mark for the
> byte order of the file. A Unicode text file beginning with FEFF is
> big-endian, and a file beginning with FFFE (not a legal Unicode
> character for any other purpose) is little-endian.
>
> All this is relevant to the 16-bit and 32-bit encodings of Unicode
> characters - UTF-16 and UTF-32 respectively. Thus:
>
>     FE FF        is UTF-16 Big-Endian
>     FF FE        is UTF-16 Little-Endian
>     00 00 FE FF  is UTF-32 Big-Endian
>     FF FE 00 00  is UTF-32 Little-Endian
>
> There is another, very common Unicode encoding scheme called UTF-8,
> which maps the Unicode repertoire into sequences of bytes. Since the
> order of bytes (as opposed to words of more than one byte) is the same
> for all processors, UTF-8 does not require a BOM. It can have one,
> though.
>
> In addition, a Unicode encoding scheme named UTF-7, which was meant as
> a mail-safe encoding but is now nearly obsolete, can have a BOM as
> well. Here too the BOM is not mandatory.
>
> 2. Why does it matter?
> ^^^^^^^^^^^^^^^^^^^^^^
>
> It matters because Microsoft tools (most prominently Windows Notepad)
> prefix the BOM to Unicode text files regularly, whereas other systems
> and environments (Unix, Linux, web pages) are better off without the
> BOM, especially in the case of UTF-8 text files.
>
> Unix systems, for example, search for an initial #! in a shell script
> file in order to determine the interpreter for it. An initial BOM
> coming instead of the #! could easily disrupt this convention.
> Also, and this applies particularly to databases, and not only in Unix,
> the BOM can cause disorder when files are merged. Web pages usually use
> UTF-8, and although they can handle the BOM, it may appear as a
> strange character (a blank square or a question mark) on a browser
> that doesn't recognize it, and may also cause the above troubles when
> the file is saved to the local disk.
>
> Most of the Unicode text meant for open transfer between various
> systems (and the Web) is encoded in UTF-8. Unix systems regularly form
> UTF-8 text files without the BOM, but Windows systems prefix the BOM
> as usual. Here follows an explanation of when the Unicode BOM can or
> cannot be removed from text files on Microsoft Windows systems.
>
> 3. Is the BOM mandatory or optional?
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Microsoft Windows, beginning with the Unicode-supporting operating
> systems Windows 2000 and Windows XP, can handle UTF-16 Little-Endian,
> UTF-16 Big-Endian, UTF-8 and old 8-bit "ANSI" (Microsoft's
> non-standard name for its 8-bit Windows codepages, consisting of the
> ASCII repertoire for the first 128 characters and varying characters
> for the other 128). The native encoding for these systems is UTF-16
> Little-Endian, which when saving under Notepad is called "Unicode".
> UTF-16 Big-Endian is called "Unicode Big-Endian", and UTF-8 keeps its
> name.
>
> Upon saving a Unicode text file in Notepad, the BOM is always
> prefixed. Thus, opening such a file with a text editor which is not
> Unicode-aware (such as edit.com) or doing a hexdump on it, you will
> see UTF-16 Little-Endian ("Unicode") starting with FF FE, UTF-16
> Big-Endian ("Unicode Big-Endian") starting with FE FF, and UTF-8
> starting with the UTF-8 encoding of the BOM: EF BB BF.
>
> For the first two encoding schemes (UTF-16), the user MUST NOT remove
> the BOM manually.
> Removing the BOM using an external tool (such as edit.com) and then
> opening the file with Notepad will reveal a pile of gibberish. Then,
> saving the file will corrupt it beyond recovery. This
> is because the BOM is necessary for the system to read the 16-bit
> values as they are and ignore their values as 8-bit sequences. Without
> the BOM, an 8-bit sequence forming part of a 16-bit Unicode character
> will be given its special ASCII value, which may be a control
> character. Many of these are transcoded into graphic ASCII characters
> when the file is saved again, and thus the original text is lost.
> Since UTF-16 text files are not meant for open transfer anyway, this
> is not an important issue. As for database applications and other
> situations where text files are merged, a Unicode-aware application
> should be able to discard all following U+FEFF characters.
>
> For UTF-8, Windows Notepad prefixes the sequence EF BB BF, but it is
> not mandatory. The sequence does not signal byte-order, but just that
> the file is in UTF-8 encoding, and strictly speaking is not necessary
> at all. In fact, Notepad can identify a text file as UTF-8 if it
> contains no illegal UTF-8 sequences. One Latin-1 accented European
> vowel standing alone in the text already prevents the text from being
> recognized as UTF-8. See for yourself: type ALT+0206 ALT+0177 (that
> is, those numbers with the ALT key held) on an empty text file, save
> and close it. The next time you open the file you will see a Greek
> small letter alpha in it - the file has been converted to UTF-8,
> though the BOM has not yet been added. Writing more and saving the
> file a second time will cause the BOM to be prefixed.
>
> Thus, when writing UTF-8 files for open transfer, it is best to keep
> the BOM until the text file is complete, and then the BOM can be
> safely removed (the author does so for all his HTML files: writing
> with the BOM until completion, then removing it using the Vim editor,
> which since version 6.0 can handle UTF-8). Upon making further changes
> to the file, remember to remove the BOM again.
>
> So the rules are:
>
> 1) Do not remove the BOM (FF FE or FE FF) from UTF-16 files.
> 2) Removing the BOM (EF BB BF) from UTF-8 is allowed.
>
> Finally, as a side note, and not of any importance, UTF-7 files can
> have a BOM too: 2B 2F 76 38 2D (ASCII +/v8-). UTF-7 files are no
> special type under Windows, they are saved as "ANSI", as if they were
> regular ASCII or Latin-1 text. The UTF-7 BOM is useful only for
> testing a UTF-7 encoded text file when dragging it into Internet
> Explorer (5 and upwards), which recognizes the BOM and promptly sets
> its encoding to UTF-7. However, given that the UTF-7 encoding has so
> little use (in our day of 8-bit clean systems, which let data with the
> high bit on pass uncorrupted), this can only serve as a piece of
> trivia.
>
> --- END ---
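
As a rough illustration of the summary at the top of this reply, here is a
minimal Python sketch that looks up what a leading FE FF or FF FE means
under a declared encoding scheme. The function name leading_bytes_meaning
and the returned strings are made up for this example and do not come from
any library. Treating FF FE under plain UTF-32 as malformed when 00 00 does
not follow is an assumption of the sketch, since the summary only covers the
00 00 case.

```python
# Sketch only: names and return strings are hypothetical, not a library API.
SCHEMES = {"UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE",
           "UTF-32", "UTF-32BE", "UTF-32LE"}

OMITTED = "omitted from contents"
MALFORMED = "malformed, file corrupted"


def leading_bytes_meaning(scheme: str, data: bytes) -> str:
    """Say what the first two bytes mean under the declared scheme."""
    scheme = scheme.upper()
    if scheme not in SCHEMES:
        raise ValueError(f"scheme not covered by the summary: {scheme}")

    head = data[:2]
    if head == b"\xfe\xff":
        meanings = {
            "UTF-16": "big endian, " + OMITTED,
            "UTF-16BE": "ZWNBSP",
        }
        # Every other listed scheme is malformed with this prefix.
        return meanings.get(scheme, MALFORMED)

    if head == b"\xff\xfe":
        if scheme == "UTF-32":
            # Assumption: without a following 00 00, call it malformed.
            if data[2:4] == b"\x00\x00":
                return "little endian, " + OMITTED
            return MALFORMED
        meanings = {
            "UTF-16": "little endian, " + OMITTED,
            "UTF-16LE": "ZWNBSP",
            "UTF-32LE": "different code points, depending on following bytes",
        }
        return meanings.get(scheme, MALFORMED)

    return "no FE FF or FF FE signature"


if __name__ == "__main__":
    # Show how one byte sequence reads differently per declared scheme.
    sample = b"\xff\xfe\x00\x00"
    for name in sorted(SCHEMES):
        print(name, "->", leading_bytes_meaning(name, sample))
```

The point of keying on the declared scheme first is that the same two bytes
are a signature in one scheme, ordinary content in another, and an error in
a third, so the bytes alone do not settle the question.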

