> BOM is a piece of binary junk that most programs don't accept

I agree that it does not make sense philosophically for a "byte order
mark" to be in a file encoded in a way that is byte-order independent.
The Unicode Standard does not recommend its use in a UTF-8 file
because it is not necessary.

However, it is a bug for programs not to accept the BOM at the start
of a UTF-8 file, because the Unicode Standard explicitly allows its
presence in a UTF-8 file as a byte order mark.  See Table 2.4, "The
Seven Unicode Encoding Schemes", which has been in the Unicode
Standard for more than a decade (Table 2.4 was in Unicode version 5.0
and it is still in Unicode version 11.0, released in June 2018).
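
As an illustration of what "accepting" the BOM can look like in practice, here is a minimal Python sketch. The function name is hypothetical and this is only one possible approach, not something prescribed by the Unicode Standard. It relies on Python's built-in "utf-8-sig" codec, which drops a single leading BOM when decoding and otherwise behaves like plain UTF-8.

```python
def decode_utf8_accepting_bom(data: bytes) -> str:
    """Decode UTF-8 bytes, tolerating an optional BOM at the start.

    Hypothetical helper for illustration only.
    """
    # "utf-8-sig" strips one leading EF BB BF if it is present.
    # The plain "utf-8" codec would keep it as a U+FEFF character.
    return data.decode("utf-8-sig")


with_bom = b"\xef\xbb\xbfhello"
without_bom = b"hello"

# Both inputs decode to the same text.
print(decode_utf8_accepting_bom(with_bom) == decode_utf8_accepting_bom(without_bom))
```

Only a BOM at the very start is removed; a U+FEFF that appears later in the data is left in place.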

I use "od" (for example, "od -t c my-file | head -1" or "od -t x1
my-file | head -1") to see if a file starts with a BOM, or have a
program I wrote check for it.  In UTF-8 the BOM is the three bytes
EF BB BF, which "od -t x1" prints as "ef bb bf" at offset 0.  I would
not trust "less" or some other program besides "od" to display it.
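
A check of this kind takes only a few lines. The following Python sketch is an illustration rather than the actual program mentioned above, and both function names are hypothetical. It reads the first three bytes of a file in binary mode and compares them with the UTF-8 BOM constant from the standard "codecs" module.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string begins with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def file_starts_with_bom(path: str) -> bool:
    """Return True if the file at 'path' begins with the UTF-8 BOM."""
    # Binary mode matters: a text-mode read could decode or hide the BOM.
    with open(path, "rb") as f:
        return has_utf8_bom(f.read(len(codecs.BOM_UTF8)))
```

Calling file_starts_with_bom("my-file") answers the same question as inspecting the first line of "od -t x1 my-file" by eye.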

I think "less" should display printable characters the way they are
expected to be viewed unless viewing a file in raw mode.  The formal
rendering of the BOM (U+FEFF) is as a "zero-width no-break space".
Its original purpose was to act as a word joiner, and the Unicode
Standard says it should still be rendered that way for backward
compatibility with earlier versions of Unicode (see Unicode Standard
11.0, Section 23.2, Layout Controls).

Even in raw mode, I still would not trust less to find out if a file
started with a BOM--I use od.

Thanks,


Paul Hardy
