Leif Halvard Silli <[email protected]> wrote:

 |Steven Atreju, Thu, 12 Jul 2012 12:32:46 +0200:
 |
 |> In the meanwhile the UTF-8 BOM is in the standard and thus
 |> contradicts fourty years of (well) good (Unix/POSIX) engineering
 |> and craftsmanship.  Where a file is a file and everything is a
 |> file, holistically.  Where small tools which do their thing well
 |> can be plugged together to achieve complex tasks.  Unicode is
 |> very, very important.  Really.
 |>
 |> In the future simple things like '$ cat File1 File2 > File3' will
 |> no longer work that easily.
 |
 |I guess you get the same problem with UTF-16 files also, then?
 |--
 |Leif Halvard Silli

UTF-8 is a byte stream, not multi-octet (/multi-sequence).  This is
a perfectly valid data interchange format (IMHO).  The embedded BOM
in UTF-8 streams seems to serve the purpose of enabling automatic
encoding detection.  To handle that, the data has to be inspected,
and user-chosen locale settings (LC_CTYPE, LC_COLLATE, ...) have to
be forcibly overridden.  This *can* be the wrong thing, can't it?
Especially behind someone's back.

I did like ISO 10646 more with respect to its clear 31-bit
statement, yes.

UTF-16 is a multi-sequence encoding: a character can consist of
multiple code points, each of which can in turn consist of multiple
UTF-16 code units.  This is harder to handle than having some
UTF-32 integers around, where one integer transports one code
point.  I don't really understand why one gives up the 1:1
relationship of code point <-> storage, especially if that doesn't
gain a 1:1 relationship on the storage <-> character side.  Why not
UTF-8 directly, then?  Solely MHO.

Nothing against UTF-32 as a memory representation from my side.
Or, if it's your real desire, UTF-16.  For data interchange I
prefer bytes.

Besides, it is pretty clear that the Unix/POSIX tools have to be
adjusted for real Unicode awareness (normalization and combining
and working on the result).

Why is there a need to embed completely useless information in a
file?  You have to special-case this.  Like running

  $ < nice-windows-file.txt iconv -f UTF-16 -t UTF-8 | some-work

or something.  Stripping the BOM silently may change the checksum.

The UTF-8 BOM is horrible in normal data interchange.  It may be OK
for XHTML or XML, where some standard uses a fallback encoding, but
then again.  Ach.

¡Viva la Revolución!
¡Hasta la Victoria Siempre!

Steven
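
As a rough illustration of the two complaints above (the `cat` case
and the checksum case), here is a minimal Python sketch.  The file
contents and the helper names `cat` and `strip_leading_bom` are
made up for the example; it only assumes that a decoder drops at
most the leading BOM, which is how Python's "utf-8-sig" codec
behaves.

```python
import codecs
import hashlib

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def cat(*chunks: bytes) -> bytes:
    """Plain byte-wise concatenation, like '$ cat File1 File2 > File3'."""
    return b"".join(chunks)


def strip_leading_bom(data: bytes) -> bytes:
    """Hypothetical helper: remove a UTF-8 BOM if the data starts with one."""
    return data[len(BOM):] if data.startswith(BOM) else data


# Two made-up "files", both saved with a UTF-8 BOM.
file1 = BOM + "first\n".encode("utf-8")
file2 = BOM + "second\n".encode("utf-8")

# Concatenation keeps both BOMs; the decoder only removes the first,
# so the second one ends up as U+FEFF in the middle of the text.
file3 = cat(file1, file2)
text = file3.decode("utf-8-sig")
embedded_boms = text.count("\ufeff")

# Silently stripping the BOM changes the bytes, hence the checksum.
digest_with_bom = hashlib.sha256(file1).hexdigest()
digest_without_bom = hashlib.sha256(strip_leading_bom(file1)).hexdigest()
checksum_changed = digest_with_bom != digest_without_bom
```

Under these assumptions `embedded_boms` is 1 and `checksum_changed`
is True, so a downstream tool either sees a stray U+FEFF or has to
special-case it.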

