Re: UTF-8 BOM (Re: Charset declaration in HTML)

Steven Atreju Thu, 12 Jul 2012 05:44:10 -0700

Leif Halvard Silli <[email protected]> wrote:

 |Steven Atreju, Thu, 12 Jul 2012 12:32:46 +0200:
 |
 |> In the meanwhile the UTF-8 BOM is in the standard and thus
 |> contradicts forty years of (well) good (Unix/POSIX) engineering
 |> and craftsmanship.  Where a file is a file and everything is a
 |> file, holistically.  Where small tools which do their thing well
 |> can be plugged together to achieve complex tasks.  Unicode is
 |> very, very important.  Really.
 |> 
 |> In the future simple things like '$ cat File1 File2 > File3' will
 |> no longer work that easily.
 |
 |I guess you get the same problem with UTF-16 files also, then?
 |-- 
 |Leif Halvard Silli


UTF-8 is a byte stream, not a sequence of multi-octet units.  That
makes it a perfectly valid data interchange format (IMHO).  The
embedded BOM in UTF-8 streams seems to serve the purpose of
enabling automatic encoding detection.  To handle that, data
inspection is required, and the user-chosen locale settings
(LC_CTYPE, LC_COLLATE, ...) must be forcibly overridden.  That
_can_ be the wrong thing, can't it?  Especially behind someone's
back.
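
To make the quoted '$ cat File1 File2 > File3' worry concrete,
here is a minimal sketch in Python.  The file contents are made
up for illustration, and plain byte concatenation stands in for
what cat does.  It only shows that a BOM-aware decoder removes
the leading signature and leaves the second one in the middle of
the text.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

# Two hypothetical files, each saved with a UTF-8 BOM.
file1 = BOM + "first\n".encode("utf-8")
file2 = BOM + "second\n".encode("utf-8")

# What `cat File1 File2 > File3` does: plain byte concatenation.
file3 = file1 + file2

# "utf-8-sig" strips a BOM only at the very start of the data.
text = file3.decode("utf-8-sig")

# The second file's BOM survives as U+FEFF in the middle of the text.
stray_boms = text.count("\ufeff")
print(repr(text), stray_boms)
```

Any tool that wants clean output from such a concatenation has to
look for U+FEFF at every file boundary, which is exactly the kind
of special-casing the simple pipeline model avoids.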

I did like ISO 10646 more with respect to its clear 31-bit
statement, yes.  UTF-16 is a multi-unit encoding: a character
can consist of multiple code points, each of which can in turn
consist of multiple UTF-16 code units.  This is harder to handle
than having some UTF-32 integers around, where one integer
transports one code point.  I don't really understand why one
gives up the 1:1 relationship of codepoint<->storage, especially
if that doesn't gain a 1:1 relationship on the storage<->character
side.  Why not UTF-8 directly, then?  Solely MHO.
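
The two relationships can be counted directly.  The sketch below
is only an illustration in Python; the helper code_units() and the
two sample strings are my own choice, not anything from the
standard.  It counts storage units per encoding for one code point
outside the BMP and for one user-perceived character built from
two code points.

```python
def code_units(text: str, encoding: str, unit_size: int) -> int:
    """Number of storage units `text` occupies in `encoding`."""
    return len(text.encode(encoding)) // unit_size

# One code point outside the BMP: U+1D11E MUSICAL SYMBOL G CLEF.
clef = "\U0001D11E"
# One perceived character made of two code points: 'e' + U+0301.
e_acute = "e\u0301"

for label, s in (("clef", clef), ("e_acute", e_acute)):
    print(
        label,
        "code points:", len(s),
        "UTF-32 units:", code_units(s, "utf-32-le", 4),
        "UTF-16 units:", code_units(s, "utf-16-le", 2),
        "UTF-8 bytes:", code_units(s, "utf-8", 1),
    )
```

UTF-32 keeps codepoint<->storage at 1:1, UTF-16 loses it for the
clef (a surrogate pair), and none of the three gives 1:1 for the
combining sequence.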

Nothing against UTF-32 as a memory representation from my side.
Or, if it's your real desire, UTF-16.  For data interchange I
prefer bytes.  Besides, it is pretty clear that the Unix/POSIX
tools have to be adjusted for real Unicode awareness
(normalization, combining, and working on the result).  Why is
there a need to embed completely useless information in a file?
You have to special-case this.  Like running

  $ < nice-windows-file.txt iconv -f UTF-16 -t UTF-8 | some-work

or something.  Silently stripping the BOM changes the file's
checksum.  A UTF-8 BOM is horrible in normal data interchange.  It
may be OK for XHTML or XML, where some standard uses a fallback
encoding, but then again.  Ach.  ¡Viva la Revolución!
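
The checksum point, as a small hedged sketch in Python.  The
payload is invented and strip_utf8_bom() is a hypothetical helper
standing in for whatever tool drops the signature behind your
back; SHA-256 is just one example of a checksum.

```python
import codecs
import hashlib

def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 BOM, if it has one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Hypothetical file content, as written by a BOM-emitting editor.
payload = "hello, world\n".encode("utf-8")
with_bom = codecs.BOM_UTF8 + payload

stripped = strip_utf8_bom(with_bom)

digest_before = hashlib.sha256(with_bom).hexdigest()
digest_after = hashlib.sha256(stripped).hexdigest()

# Same text for a human reader, different bytes for a checksum tool.
print(digest_before == digest_after)
```

So a file that went through a BOM-stripping step no longer
verifies against the checksum published for the original.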

¡Hasta la Victoria Siempre!

  Steven
