On Fri, Feb 15, 2002 at 12:41:02PM -0800, Rick Cameron wrote:
> understanding that MBCS character sets are not uncommon on unix (for
> example, EUC). If my program is running on a unix system where the default
> character set is EUC-JP (as I believe it's called) and it tries to import a
> text file containing UTF-8 without a BOM, how is the program supposed to
> know that the file contains UTF-8 rather than EUC-JP?

Why can't EUC-JP start with EF BB BF? (Why can't UTF-16 start with EF BB BF?) If a user wants to open a UTF-8 file, he should either tell the program it's UTF-8 or run in a UTF-8 locale.

> So not a unix problem, but rather a problem with dumb command-line tools. I
> wonder whether the GNU people have thought of making their command-line
> tools aware of UTF-8 & BOMs.

It's impossible. For one thing, the tools work, and should continue to work, on binary data, given user care. They must continue to work on ASCII data, including not introducing spurious non-ASCII data.

Also, if I do "cat a b c >> d", and d doesn't exist, then who is supposed to add the BOM? And what happens when it turns out that a, b, and c are gzip archives, and adding the BOM corrupted them? Or d does exist, and is some funky binary format that doesn't mind extra data at the end (like some executable formats) but does mind the BOM at the start?

Adding the BOM would make text programs much more complex and more fragile, and in exchange for what? UTF-8 is fairly easy to distinguish from other formats, and telling encodings apart is a problem that people have been dealing with fairly well for over two decades, for 7/8-bit formats that are indistinguishable without user input.

-- 
David Starner / Давид Старнэр - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth. It's the hottest thing
with the youth. -- Information Society, "Peace and Love, Inc."
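
To make two of the points above concrete, here is a rough Python sketch. It is illustrative only and not taken from any GNU tool; the function names looks_like_utf8 and survives_bom_prefix are made up for this example. The first function uses strict UTF-8 decoding as a heuristic detector, which is one way of seeing why UTF-8 is fairly easy to tell apart from EUC-JP without a BOM. The second shows what happens to a gzip stream when a tool prepends EF BB BF to it.

```python
import gzip

UTF8_BOM = b"\xef\xbb\xbf"


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data is valid UTF-8 and has at least one non-ASCII byte.

    Pure ASCII is valid in both UTF-8 and EUC-JP, so it tells us nothing
    and is reported as False here (an assumption of this sketch).
    """
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return any(byte >= 0x80 for byte in data)


def survives_bom_prefix(payload: bytes) -> bool:
    """True if a gzip payload still decompresses after a BOM is prepended."""
    try:
        gzip.decompress(UTF8_BOM + payload)
    except OSError:  # gzip.BadGzipFile is a subclass of OSError
        return False
    return True


if __name__ == "__main__":
    text = "日本語のテキスト"
    print(looks_like_utf8(text.encode("utf-8")))    # same text, UTF-8 bytes
    print(looks_like_utf8(text.encode("euc_jp")))   # same text, EUC-JP bytes
    print(looks_like_utf8(b"plain ASCII"))          # no evidence either way
    print(survives_bom_prefix(gzip.compress(b"hello")))
```

The detector is only a heuristic: short or unusual EUC-JP input could happen to be valid UTF-8, which is why telling the program the encoding, or running in a UTF-8 locale, remains the reliable answer.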

