I would say that this isn't an issue specific to unix - many programmers who work primarily on Windows like to use command-line text-processing tools.
And OTOH surely the case where a BOM is useful also occurs on unix: when a program that wants to operate in Unicode must import a text file. It's my understanding that MBCS character sets are not uncommon on unix (for example, EUC). If my program is running on a unix system where the default character set is EUC-JP (as I believe it's called) and it tries to import a text file containing UTF-8 without a BOM, how is the program supposed to know that the file contains UTF-8 rather than EUC-JP?

So not a unix problem, but rather a problem with dumb command-line tools. I wonder whether the GNU people have thought of making their command-line tools aware of UTF-8 & BOMs.

Thanks

- rick cameron

-----Original Message-----
From: David Starner [mailto:[EMAIL PROTECTED]]
Sent: Friday, 15 February 2002 11:24
To: Rick Cameron
Cc: [EMAIL PROTECTED]
Subject: Re: Unicode and end users

On Fri, Feb 15, 2002 at 09:47:54AM -0800, Rick Cameron wrote:
> If there is a file on disc called foo.txt, it is clearly not typed
> data. Thus, it appears to be Mr Davis' opinion that when such a file
> contains UTF-8 data, it is quite appropriate for there to be a BOM at
> the start.

In a global sense, it may be appropriate for a UTF-8 file to have a BOM. However, in a Unix context - and UTF-8 was originally designed for Unix and Unix-like systems - it is worthless and annoying. Take, for example, three files:

    A: <BOM>C<LF>AB<LF>
    B: <BOM>ABC<LF>AB<LF>
    C: <BOM>ABCDEFG<LF>

and the operation

    grep "AB" A B > file; cat C >> file

you'll end up with

    file: A: AB<LF>B: <BOM>ABC<LF>B: AB<LF><BOM>ABCDEFG<LF>

That's a document with two BOM's, and none at the start of the file. There's no simple way to fix this; grep doesn't know if it's working on UTF-8 text or raw binary or Latin-1 (I frequently do grep foo file | recode l1..utf-8), and it doesn't know whether its output is going to the screen or a file or the tail of a file or the input of another program.
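
The grep/cat scenario above can be reproduced with a small byte-level sketch. This is only an illustration, not how grep is implemented: the `grep` function here is a hypothetical stand-in that does a plain substring match on bytes and prefixes each matching line with `name:`, and the `files` dictionary stands in for the three files on disk.

```python
BOM = b"\xef\xbb\xbf"  # the UTF-8 encoding of U+FEFF

# In-memory stand-ins for the three files from the example.
files = {
    "A": BOM + b"C\nAB\n",
    "B": BOM + b"ABC\nAB\n",
    "C": BOM + b"ABCDEFG\n",
}


def grep(pattern: bytes, names):
    """Hypothetical byte-oriented stand-in for `grep pattern A B`.

    Like a tool that knows nothing about encodings, it treats the BOM
    as three ordinary bytes belonging to the first line of each file.
    """
    out = b""
    for name in names:
        for line in files[name].splitlines(keepends=True):
            if pattern in line:
                out += name.encode() + b":" + line
    return out


# Equivalent of: grep "AB" A B > file; cat C >> file
result = grep(b"AB", ["A", "B"]) + files["C"]

# Byte offsets at which a BOM occurs in the combined output.
bom_offsets = [i for i in range(len(result)) if result.startswith(BOM, i)]

print(result)
print("BOM at start of file:", result.startswith(BOM))
print("BOM offsets:", bom_offsets)
```

The combined output contains two BOMs, both in the middle of the data, and none at the start, which matches the outcome described in the message.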
Again, while globally, UTF-8 BOM's might work, in Unix they will be more of a nuisance than a help.

--
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth.
It's the hottest thing with the youth.
-- Information Society, "Peace and Love, Inc."
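
For the other side of the argument (a program on an EUC-JP system importing a file that may be UTF-8), a minimal sketch of BOM sniffing might look like the following. The function name `sniff_encoding` and the choice of `euc_jp` as the fallback are assumptions made for illustration; a real importer would likely need further heuristics, since a missing BOM proves nothing either way.

```python
import codecs


def sniff_encoding(data: bytes, default: str = "euc_jp") -> str:
    """Return a Python codec name for `data` (hypothetical helper).

    If the data starts with the UTF-8 BOM, report "utf-8-sig", which
    decodes UTF-8 and drops the leading BOM. Otherwise fall back to
    the assumed locale default.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return default


text = "日本語"
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
without_bom = text.encode("utf-8")

# With a BOM the importer can pick the right decoder.
print(with_bom.decode(sniff_encoding(with_bom)))

# Without one it falls back to the locale default and misreads the text.
print(without_bom.decode(sniff_encoding(without_bom), errors="replace"))
```

With the BOM present the text round-trips; without it, the UTF-8 bytes are decoded as EUC-JP and do not come back as the original string.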

