Re: Unicode and end users

David Starner Fri, 15 Feb 2002 17:49:45 -0800

On Fri, Feb 15, 2002 at 12:41:02PM -0800, Rick Cameron wrote:
> understanding that MBCS character sets are not uncommon on unix (for
> example, EUC). If my program is running on a unix system where the default
> character set is EUC-JP (as I believe it's called) and it tries to import a
> text file containing UTF-8 without a BOM, how is the program supposed to
> know that the file contains UTF-8 rather than EUC-JP?


Why can't EUC-JP start with EF BB BF? (Why can't UTF-16 start with EF BB
BF?) If a user wants to open a UTF-8 file, he should either tell the
program it's UTF-8 or run in a UTF-8 locale.
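
A rough sketch of that first point, assuming Python and its standard
euc_jp codec. The sample bytes are made up for illustration: the three
"BOM" bytes followed by one more high byte decode as ordinary EUC-JP
text, so the signature alone settles nothing.

```python
# Three bytes that look like a UTF-8 BOM, plus one more high byte.
data = b"\xef\xbb\xbf\xa4"

# As EUC-JP, EF BB and BF A4 are each a valid two-byte character,
# so the "BOM" is just the start of ordinary text.
as_euc = data.decode("euc_jp")
print(len(as_euc))  # number of characters decoded

# As UTF-8, the prefix is a BOM but the lone trailing A4 is invalid.
try:
    data.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)
```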

> So not a unix problem, but rather a problem with dumb command-line tools. I
> wonder whether the GNU people have thought of making their command-line
> tools aware of UTF-8 & BOMs.

It's impossible. For one thing, the tools work and should continue to
work on binary data, given user care. They must continue to work on
ASCII data, including not introducing spurious non-ASCII data. Also, if
I do "cat a b c >> d", and d doesn't exist, then who is supposed to add
the BOM? And what happens when it turns out that a, b, and c are gzip
archives, and adding the BOM corrupted them? Or d does exist, and is
some funky binary format that doesn't mind extra data at the end (like
some executable formats) but does mind the BOM at the start?
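
The gzip case can be sketched in a few lines of Python. This is only an
illustration under my own assumptions (the payload and the helper name
can_gunzip are invented): plain byte concatenation of gzip members
stays a valid gzip stream, while a BOM stuck on the front does not.

```python
import gzip
import zlib

BOM = b"\xef\xbb\xbf"
member = gzip.compress(b"hello\n")  # stand-in for one gzip archive


def can_gunzip(blob: bytes) -> bool:
    """Return True if blob decompresses as a gzip stream."""
    try:
        gzip.decompress(blob)
        return True
    except (OSError, EOFError, zlib.error):
        return False


# What "cat a b" does today: raw concatenation, still readable.
joined = member + member
print(gzip.decompress(joined))

# What a BOM-adding cat would produce: the gzip magic is no longer first.
print(can_gunzip(BOM + joined))
```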

To add the BOM would make text programs much more complex and more
fragile, and in exchange for what? UTF-8 is fairly easy to distinguish
from other encodings, and telling encodings apart is a problem people
have been dealing with fairly well for over two decades, even for
7/8-bit formats that are indistinguishable without user input.
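
For what "easy to recognize" can look like in practice, here is a
minimal sketch in Python, assuming a strict decode is good enough as a
heuristic. The function name classify and its three labels are my own
invention; it is a guess at the encoding, not proof of it.

```python
def classify(data: bytes) -> str:
    """Guess whether data is ASCII, UTF-8, or something else.

    Text in a legacy 8-bit or multibyte encoding rarely happens to
    form valid UTF-8 sequences, so a strict decode is a fair test.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "other"
    # Valid UTF-8 with no high bytes is indistinguishable from ASCII.
    if all(b < 0x80 for b in data):
        return "ascii"
    return "utf-8"


print(classify("David Starner".encode("ascii")))
print(classify("Давид Старнэр".encode("utf-8")))
print(classify("Давид Старнэр".encode("koi8_r")))
```

Nothing here needs a BOM, and pure ASCII input passes through without
being touched.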

-- 
David Starner / Давид Старнэр - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth. It's the hottest thing 
with the youth. -- Information Society, "Peace and Love, Inc."
