Right. Unix was unique when it was created, in that it was built to handle all files as unstructured binary streams. The history elsewhere is quite different: text files had long used another paradigm, based on line records, and ends of lines were initially not really control characters. Even today the Unix-style end of line (spread to other systems by the C language) does not follow the international standard (CR+LF, which was NOT a Microsoft invention for DOS or Windows). In fact the "plain text" concept was created by taking the common denominator of many historical terminal and filesystem protocols. ASCII tried to unify all this, but most controls were only assigned symbolic names, not standard functions. So, to reconcile the various incarnations of plain text, we have to live with at least four end-of-line styles (LF, CR, CR+LF, NEL), yet many terminal protocols still only consider CR+LF (except those for Unix shells, which use the initial AT&T definition from the C language; and even there, Unix terminals have used other conventions under various emulations; see termcap).

Maybe you would think that "cat utf8file1.txt utf8file2.txt > utf8file.txt" would create problems. For plain-text files this is no longer a problem, even if extra BOMs remain in the middle, where they act as no-ops. Now try "cat utf8file1.txt utf16file2.txt > unknownfile.txt" and it will not work. It will fail just the same whenever your text files use various SBCS or DBCS encodings (there has never been a standard encoding in the Unix filesystem, simply because the convention was never stored in it; earlier filesystems DID have a way to track the encoding by storing metadata, and even NTFS can track the encoding instead of guessing it from the content). Nothing in fact prohibits Unix from supporting filesystems with out-of-band metadata. But for now, you have to assume that the "cat" tool is only usable to concatenate binary sequences, in arbitrary order: it is not properly a tool for handling text files. Use a "ucat" instead to indicate that the input files are in some standard UTF: the BOMs would be silently handled, and the various UTFs automatically recognized by their leading BOM (all other BOMs would be ignored and discarded, possibly with just a warning in a really pedantic mode); a rough sketch is given below.

No-op codes are not a problem. They have always existed in all terminal protocols, for various functions such as padding. Even Unicode documents characters that have no meaning at all except in very limited contexts (and those characters are strongly discouraged in "plain text" documents). More and more tools are now aware of the BOM as a convenient way to work reliably with the various UTFs; its absence means that the platform default encoding, the host default, or the encoding selected by the user in his locale environment will be used. BOMs are in fact most useful in contexts where the storage or transmission platform does not allow storing out-of-band metadata about the encoding. The BOM is extremely small, and it does not impact performance.
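Since "ucat" is only hypothetical, here is a minimal sketch of what it might look like, assuming POSIX iconv(3), whole-file buffering and UTF-8 output; the name, the defaults and the missing pedantic mode are all illustrative, not an existing tool:

/* ucat.c -- a minimal sketch of the hypothetical "ucat" described above:
   sniff each input file's UTF from its leading BOM (defaulting to UTF-8),
   transcode it to UTF-8 with the BOM stripped, and concatenate to stdout.
   Assumes POSIX iconv(3); error handling is kept minimal on purpose. */
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the UTF named by a leading BOM, and how many bytes to skip.
   NB: FF FE 00 00 is ambiguous (UTF-16LE text starting with a NUL);
   UTF-32LE wins here. */
static const char *sniff_bom(const unsigned char *p, size_t n, size_t *skip)
{
    if (n >= 4 && !memcmp(p, "\xFF\xFE\x00\x00", 4)) { *skip = 4; return "UTF-32LE"; }
    if (n >= 4 && !memcmp(p, "\x00\x00\xFE\xFF", 4)) { *skip = 4; return "UTF-32BE"; }
    if (n >= 3 && !memcmp(p, "\xEF\xBB\xBF", 3))     { *skip = 3; return "UTF-8"; }
    if (n >= 2 && !memcmp(p, "\xFF\xFE", 2))         { *skip = 2; return "UTF-16LE"; }
    if (n >= 2 && !memcmp(p, "\xFE\xFF", 2))         { *skip = 2; return "UTF-16BE"; }
    *skip = 0;
    return "UTF-8";                       /* no BOM: assume the UTF-8 default */
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        FILE *f = fopen(argv[i], "rb");
        if (!f) { perror(argv[i]); return 1; }
        fseek(f, 0, SEEK_END);
        long n = ftell(f);                /* whole-file buffering, for brevity */
        rewind(f);
        char *buf = malloc(n > 0 ? n : 1);
        fread(buf, 1, (size_t)n, f);
        fclose(f);

        size_t skip;
        const char *enc = sniff_bom((unsigned char *)buf, (size_t)n, &skip);
        iconv_t cd = iconv_open("UTF-8", enc);
        if (cd == (iconv_t)-1) { perror(enc); return 1; }

        char *src = buf + skip;           /* the BOM itself is never copied */
        size_t left = (size_t)n - skip;
        while (left > 0) {
            char out[8192], *dst = out;
            size_t room = sizeof out;
            if (iconv(cd, &src, &left, &dst, &room) == (size_t)-1
                && errno != E2BIG) {      /* E2BIG just means: flush and loop */
                perror(argv[i]);
                break;
            }
            fwrite(out, 1, (size_t)(dst - out), stdout);
        }
        iconv_close(cd);
        free(buf);
    }
    return 0;
}

With that, "ucat utf8file1.txt utf16file2.txt > utf8file.txt" would behave the way "cat" already does for matching encodings: each file is transcoded from the UTF named by its leading BOM, and the BOMs themselves never reach the output.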
The BOM should now even be completely ignorable in all contexts, including in the middle of combining sequences. A few contexts will remain where it may do harm, but software that still does not ignore it semantically (notably some syntactic parsers used by programming languages) should be corrected (this also includes HTML5, where it should have been ignored as well, except for its one useful function of allowing the autodetection of a standard UTF, wherever it occurs, as opposed to the legacy "platform default"). It could even be changed so that it could be present in any UTF-encoded text to allow transitions between distinct UTFs: for example when concatenating UTF-8 texts, best suited to most alphabetic scripts, with UTF-16 texts, best suited to ideographic or Hangul scripts, or to historic scripts encoded only outside the BMP with few occurrences of ASCII punctuation and controls (see the sketch at the end of this message). It would act as if it were an out-of-band control function (treated like surrogates, which have a dedicated special function and are not really characters by themselves).

This would solve many problems and maximize interoperability (there is no universal interoperability solution that solves all problems, but the UCS with its standardized UTFs solves many). It would solve problems far more often than it would create them in old legacy applications (most of which can be fixed by updating or upgrading the same software). The old legacy solutions will then become something only a few geeks need, and instead of blocking them when they persist in maintaining such content, it will be more valuable to isolate that content and offer it through a proxying conversion filter. BOMs are then not a problem but a solution (not the only one, but it helps fill the gap when other solutions are not usable or available).

2012/7/12 Julian Bradfield <[email protected]>:
> On 2012-07-12, Steven Atreju <[email protected]> wrote:
>> In the future simple things like '$ cat File1 File2 > File3' will
>> no longer work that easily. Currently this works *whatever* file,
>> and even program code that has been written more than thirty years
>> ago will work correctly. No! You have to modify content to get it
>> right!!
>
> Nice rant, but actually this has never worked like that. You can't cat
> .csv files with headers, html files, images, movies, or countless
> other "just files" and get a meaningful result, and never have been
> able to.
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
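P.S. A sketch of the "transitions between distinct UTFs" idea, to make it concrete. The names below (utf_segment, bom_at, next_segment) are mine, not any real API, and matching raw BOM byte patterns is a deliberate simplification: a robust implementation would decode in the current UTF and only honour a BOM that actually decodes as U+FEFF, because these byte sequences can occur inside legitimate UTF-16/32 code units.

/* Split a byte stream into segments at each BOM, every segment tagged
   with the UTF its BOM announces; bytes before any BOM default to UTF-8. */
#include <stddef.h>
#include <string.h>

struct utf_segment {
    const char *enc;             /* UTF announced by the segment's BOM */
    const unsigned char *data;   /* segment bytes, BOM stripped */
    size_t len;
};

/* If a BOM starts at p, report the UTF it names and the BOM's length. */
static const char *bom_at(const unsigned char *p, size_t n, size_t *blen)
{
    static const struct { const char *enc; const char *bom; size_t len; } t[] = {
        { "UTF-32LE", "\xFF\xFE\x00\x00", 4 },   /* longest patterns first */
        { "UTF-32BE", "\x00\x00\xFE\xFF", 4 },
        { "UTF-8",    "\xEF\xBB\xBF",     3 },
        { "UTF-16LE", "\xFF\xFE",         2 },
        { "UTF-16BE", "\xFE\xFF",         2 },
    };
    for (size_t i = 0; i < sizeof t / sizeof t[0]; i++)
        if (n >= t[i].len && memcmp(p, t[i].bom, t[i].len) == 0) {
            *blen = t[i].len;
            return t[i].enc;
        }
    return NULL;
}

/* Fill *seg with the next segment starting at *pos; return 0 at end. */
static int next_segment(const unsigned char *buf, size_t n, size_t *pos,
                        struct utf_segment *seg)
{
    if (*pos >= n) return 0;
    size_t blen = 0, ignore;
    const char *enc = bom_at(buf + *pos, n - *pos, &blen);
    seg->enc  = enc ? enc : "UTF-8";     /* leading bytes with no BOM: UTF-8 */
    seg->data = buf + *pos + blen;
    size_t i = *pos + blen;
    while (i < n && !bom_at(buf + i, n - i, &ignore))
        i++;                             /* scan forward to the next BOM */
    seg->len = i - (*pos + blen);
    *pos = i;
    return 1;
}

Each (enc, data, len) segment can then be handed to iconv(3) exactly as in the ucat sketch above, so a single stream could mix UTF-8 runs for alphabetic text with UTF-16 runs for ideographic text.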

