"Lars Kristan" <[EMAIL PROTECTED]> wrote: > AFAIK, UTF-8 files are NOT supposed to have a BOM in them.
Different operating systems and applications have different preferences.
There is no universal "right" or "wrong" about this. This is unfortunate,
but true.

> Why is UTF-16 perceived as UNICODE? Well, we all know it's because UCS-2
> used to be the ONLY implementation of Unicode. But there is another
> important difference between UTF-16 and UTF-8. It is barely possible to
> misinterpret UTF-16, because it uses shorts and not bytes. On the other
> hand, UTF-8 and ASCII are in extreme cases identical.

At the risk of being mistaken for juuitchan by citing a Japanese example:
a non-BOM file that starts with the bytes 0x30 0x42 could be the UTF-8
characters "0B", or it could be the UTF-16BE character HIRAGANA LETTER A.
(A similar situation applies for UTF-16LE.) Now, "0B" might not be the
first two characters of many novels, but in a techie Unix environment it
could easily be the start of a text-format data file.

Two common heuristics for determining whether a file is UTF-16 are to check
whether every other byte is 0x00, or whether every other byte is the same.
The former fails for non-Latin scripts; the latter fails (less frequently)
for scripts that are not confined to a smallish alphabet. That's the problem
with no BOM: you have to resort to heuristics, or to external tagging.

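To make that concrete, here is a small Python sketch of the ambiguity and of
both heuristics (the byte strings and the function names are mine, purely
for illustration):

    # The same two bytes, read two different ways.
    data = b"\x30\x42"
    print(data.decode("utf-8"))       # '0B'
    print(data.decode("utf-16-be"))   # HIRAGANA LETTER A (U+3042)

    def every_other_byte_is_zero(data):
        # Heuristic 1: every byte at even offsets, or every byte at odd
        # offsets, is 0x00.
        return all(b == 0 for b in data[0::2]) or all(b == 0 for b in data[1::2])

    def every_other_byte_is_same(data):
        # Heuristic 2: the bytes at even offsets (or at odd offsets) are
        # all identical, whatever that value is.
        return len(set(data[0::2])) <= 1 or len(set(data[1::2])) <= 1

    latin = "0B".encode("utf-16-be")
    kana = "あいうえお".encode("utf-16-be")
    print(every_other_byte_is_zero(latin))   # True
    print(every_other_byte_is_zero(kana))    # False - high bytes are 0x30, not 0x00
    print(every_other_byte_is_same(kana))    # True  - hiragana share one high byte

Feed the second check a page of CJK ideographs, whose high bytes vary, and
it starts failing too; that is all "fails less frequently" amounts to.
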
> Why not have BOM in UTF-8? Probably because of the applications that don't
> really need to know that a file is in UTF-8, especially since it may be pure
> ASCII in many cases (e.g. system configuration files). And if Unicode is THE
> codeset to be used in the future, then at some point in time all files would
> begin with a UTF-8 BOM. Quite unnecessary. Further problems arise when you
> concat files or start reading in the middle.

That's why U+2060 WORD JOINER is being introduced in Unicode 3.2. Hopefully
it will take over the ZWNBSP semantics from U+FEFF, which can then be used
*solely* as a BOM. Eventually, if this happens, it will become safe to strip
BOMs as they appear. (Of course, if you are splitting or concatenating
files, you should not do any interpretation anyway.)

I have never seen a non-pathological example where stripping a file- or
stream-initial U+FEFF was harmful because of the possibility that it was
intended as ZWNBSP. ZWNBSP (or WORD JOINER) affects the behavior of the
characters before and after it. If there is no character before ZWNBSP, it
doesn't belong there.

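The rule is easy enough to state in code. A minimal Python sketch (mine, not
anything prescribed by the standard) would be: drop U+FEFF only when it is
the very first character of a file or stream, and leave it alone anywhere
else.

    def strip_initial_bom(text):
        # U+FEFF at the very start can only sensibly be a byte order mark.
        # Anywhere else it might be a legacy ZWNBSP, so leave it alone.
        if text.startswith("\ufeff"):
            return text[1:]
        return text

    print(repr(strip_initial_bom("\ufeffHello")))    # 'Hello'
    print(repr(strip_initial_bom("Hel\ufefflo")))    # 'Hel\ufefflo' - untouched
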
> [O]n UNIX, it is essential that the user is aware of the codeset that is being
> used.

Unix users are accustomed to dealing with such details.

> Anyway, some invalid sequences will be encountered by the
> editor, but then hopefully it will simply display some replacement
> characters (or ask if it can do so). Hopefully it will allow me to save the
> file, with invalid sequences intact. Editing invalid sequences (or inserting
> new ones) would be too much to ask, right?
>
> What bothers me a little bit is that I would not be able to save such a file
> as UTF-16 because of the invalid sequences in it. Why would I? Well, Windows
> has more and more support for UTF-8, so maybe I don't really need to. I
> still wish I had an option though.
>
> This again makes me think that UTF-8 and UTF-16 are not both Unicode. Maybe
> UTF-16 is 'more' Unicode right now, because of the past. But maybe UTF-8
> will be 'more' Unicode in the future, because it can contain invalid
> sequences and these can be properly interpreted by someone at a later time.
> Unless UTF-16 has that same ability, it will lose the battle of being an
> 'equally good Unicode format'.

I don't think the fact that invalid sequences are possible in UTF-8 and not
in UTF-16 makes UTF-8 inferior, or any less "Unicode." It was designed that
way. Invalid sequences always represent a problem, just like line noise.
They should not be treated as a normal situation.

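The asymmetry Lars describes is easy to demonstrate, though. In a rough
Python sketch like the following (the byte value is just an example), a
stray 0xFF survives any byte-oriented copy but cannot survive conversion to
UTF-16, where the best a converter can do is substitute U+FFFD:

    data = b"abc\xffdef"    # 0xFF can never occur in well-formed UTF-8

    copy = bytes(data)      # a byte-level copy carries it through untouched
    assert copy == data

    text = data.decode("utf-8", errors="replace")
    print(repr(text))                        # 'abc\ufffddef' - bad byte became U+FFFD
    print(repr(text.encode("utf-16-le")))    # no trace of the original 0xFF remains

Substituting U+FFFD is error recovery, which is exactly why it should not be
the normal path.
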
-Doug Ewell
 Fullerton, California