2012/7/13 Steven Atreju <[email protected]>:
> Philippe Verdy <[email protected]> wrote:
>
> |2012/7/12 Steven Atreju <[email protected]>:
> |> UTF-8 is a bytestream, not multioctet(/multisequence).
> |Not even. UTF-8 is a text-stream, not made of arbitrary sequences of
> |bytes. It has a lot of internal semantics and constraints. Some
> |things are very meaningful, some play absolutely no role at all and
> |could even be discarded from digital signature schemes (this includes
> |ignoring BOMs wherever they are, and ignoring the encoding
> |effectively used in checksum algorithms, whose first step will be to
> |uniformize and canonicalize the encoding into a single internal form
> |before processing).
> |The effective binary encoding of text streams should NOT play any
> |semantic role (all UTFs should be completely equivalent on the text
> |interface; the bytestream low level is definitely not suitable for
> |handling text and should not play any role in any text parser or
> |collator).
>
> I don't understand what you are saying here.
> UTF-8 is a data interchange format, a text-encoding.
> It is not a filetype!
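The canonicalize-before-checksum idea quoted above can be sketched in a few lines of Python. This is only an illustration of the principle, not any standard's prescribed procedure: decode whatever UTF the bytes arrived in, drop U+FEFF wherever it appears, normalize to one canonical form (NFC here), and hash the re-encoded UTF-8, so the digest reflects the text rather than its storage form.

```python
import hashlib
import unicodedata

def canonical_digest(data: bytes, encoding: str) -> str:
    """Hash the *text*, not the bytes: decode, drop BOM/ZWNBSP code
    points, normalize to NFC, then hash the re-encoded UTF-8."""
    text = data.decode(encoding)
    text = text.replace("\ufeff", "")          # ignore BOMs anywhere
    text = unicodedata.normalize("NFC", text)  # one canonical form
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

utf8_bom = "\ufeffHello".encode("utf-8")   # UTF-8 with a leading BOM
utf16_le = "Hello".encode("utf-16-le")     # same text, different UTF

# Raw byte digests differ, but the canonical digests agree.
assert hashlib.sha256(utf8_bom).hexdigest() != hashlib.sha256(utf16_le).hexdigest()
assert canonical_digest(utf8_bom, "utf-8") == canonical_digest(utf16_le, "utf-16-le")
```

With this, the same text produces the same signature whether it was stored as UTF-8 with or without a BOM, or as UTF-16 in either byte order.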
Not only! It is a format unambiguously bound to a text filetype, even if
that file type may not be intended to be read by humans (e.g. program
sources or rich text formats like HTML).

> A BOM is a byte-order-mark, used to signal different host
> endianesses.[...]

I've been on this list long enough to know all this already, and I have
not contradicted this role. However, it is not prescriptive for
anything other than text file types (whatever those are). For example,
BOMs play absolutely no role in encoding binary images, even if those
include internal multibyte numeric fields.

> |The history is a lot different, and text files have always used
> |another paradigm, based on line records. Ends of lines initially
> |were not really control characters. And even today the Unix-style
> |end of line (as advertised on other systems now with the C language)
> |is not using the international standard (CR+LF, which was NOT a
> |Microsoft creation for DOS or Windows).
>
> CR+LF seems to originate in teletypewriters (my restricted
> knowledge, sorry). CR+LF is used in a lot of internet protocols.
> Unix uses \n U+000A to indicate End-Of-Line in text files for a
> long time.

That usage is younger; it became widespread only because of the success
of the C language and its adoption for programming other systems. For a
long time (and still today), ends of lines/newlines have been encoded
very differently. Even on the earliest Unix terminals (most often based
on VT-* protocols), the LF character was only used to move the cursor
down, not to start a new paragraph, so Unix applications had to use a
"termcap" database to convert these newlines into visual ends of
paragraphs by converting them into CR+LF.

> |May be you would think that "cat utf8file1.txt utf8file2.txt
> |>utf8file.txt" would create problems. For plain text-files, this is
> |no longer a problem, even if there are extra BOMs in the middle,
> |playing as no-ops.
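The "extra BOMs in the middle, playing as no-ops" point quoted above is easy to verify: a UTF-8 BOM that ends up mid-stream (e.g. from naive byte concatenation, which is exactly what cat(1) does) simply decodes to U+FEFF (ZERO WIDTH NO-BREAK SPACE), which a text consumer can ignore or strip. A small sketch:

```python
# Two UTF-8 files, the second saved with a leading BOM; byte-level
# concatenation (what cat(1) does) leaves a stray U+FEFF in the middle.
part1 = "first line\n".encode("utf-8")
part2 = ("\ufeff" + "second line\n").encode("utf-8")

combined = (part1 + part2).decode("utf-8")  # still valid UTF-8
assert "\ufeff" in combined                 # the BOM survives as ZWNBSP...
clean = combined.replace("\ufeff", "")      # ...but is trivially ignorable
assert clean == "first line\nsecond line\n"
```

The concatenated stream stays well-formed UTF-8 throughout; the only residue is an invisible code point.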
> |now try "cat utf8file1.txt utf16file2.txt > unknownfile.txt" and it
> |will not work. It will not work as well each time you'll have text
> |files using various SBCS or DBCS encodings (there's never been any
> |standard encoding in the Unix filesystem, simply because the
> |convention was never stored in it; previous filesystems DID have a
> |way to track the encoding by storing metadata; even NTFS could track
> |the encoding, without guessing it from the content).
> |Nothing in fact prohibits Unix from supporting filesystems with
> |out-of-band metadata. But for now, you have to assume that the "cat"
> |tool is only usable to concatenate binary sequences, in arbitrary
> |orders: it is not properly a tool to handle text files.
>
> If there is a file, you can simply look at it. Use less(1) or any
> other pager to view it as text, use hexdump(1) or od(1) or whatever
> to view it in a different way. You can do that. It is not that you
> can't do that -- no dialog will appear to state that there is no
> application registered to handle a filetype; you look at the content,
> at a glance. You can use cat(1) to concatenate whatever files, and
> the result will be the exact concatenation of the exact content of
> the files you've passed. And you can concatenate as long as you want.
> For example, this mail is written in an UTF-8 enabled vi(1) basically
> from 1986, in UTF-8 encoding («Schöne Überraschung, gelle?» -- works
> from my point of view), and the next paragraph is inserted plain from
> a file from 1971 (http://minnie.tuhs.org/cgi-bin/utree.pl):
>
>                           K. Thompson
>                           D. M. Ritchie
>
>                         November 3, 1971
>
>                           INTRODUCTION
>
> This manual gives complete descriptions of all the publicly available
> features of UNIX.
>
> This worked well, and there is no magic involved here, ASCII in
> UTF-8, just fine. The metadata is in your head. (Nothing will help
> otherwise!) For metadata, special file formats exist, i.e., SGML.
> Or text based approaches, of which there are some, but I can't
> remember one at a glance ;}. Anyway, such things are used for
> long-time archiving of textdata. Though the http://www.bitsavers.org/
> have chosen a very different way for historic data. Metadata in a
> filesystem is not really something for me, in the end.
>
> |No-op codes are not a problem. They have always existed in all
> |terminal protocols, for various functions such as padding.
>
> Yes, there is some meaningful content around. Many C sources contain
> ^L to force a new page when printed on a line printer, for example.
>
> |More and more tools are now aware of the BOM as a convenient way to
> |work reliably with various UTFs. Its absence means that the platform
> |default encoding, or the host default, or the encoding selected by
> |the user in his locale environment will be used.
> |BOMs are in fact most useful in contexts where the storage or
> |transmission platform does not allow storing out-of-band metadata
> |about the encoding. The BOM is extremely small, and it does not
> |impact performance.
>
> A BOM is a byteorder mark. And as such simply overcomes a weakness in
> the definition of the multioctet UTF formats, and that is the missing
> definition of the used byteorder. Since network protocols were
> carefully designed already 40 years ago, taking care of such issues
> (the "network" byteorder is BE, but some protocols *define* a
> different one), someone has failed to dig deep enough before
> something else happened. I will *not*, as a human who makes a lot of
> errors himself, cheer «win-win situation» for something which in fact
> tries to turn a ridiculous miss into something win-win-win anything.
> That's simply not what it is. These are text-encodings, not file
> formats.
>
> |The BOM should now even be completely ignorable in all contexts,
> |including in the middle of combining sequences.
> This goes very well for Unicode text content, but i'm not so sure
> on the data storage side.
>
> |This solution would solve many problems and maximize
> |interoperability (there is no universal interoperability solution
> |that can solve all problems, but at least the UCS with its
> |standardized UTFs solves many). Effective solutions solve problems
> |much more often than they create new ones with old legacy
> |applications (most of which can be fixed by updating/upgrading the
> |same software). The old legacy solutions will then become something
> |only needed by some geeks, and instead of blocking them when they
> |persist in maintaining them, it will be more valuable for them to
> |isolate those contents and offer them via a proxying conversion
> |filter.
> |
> |BOMs are then not a problem, but a solution (not the only one, but
> |one that helps filling the gap when other solutions are not usable
> |or available).
>
> BOMs have never been a problem in at least binary data. They are an
> effective solution on old little endian processors which leave the
> task of swapping bytes into the correct order to the server, and so
> for the client. There is no old legacy solution. Unicode and UTF-8
> (I think this one exclusively on Unix, since byte-oriented) will
> surely be rather alone in a decade or two in respect to text content.
> This will be true, and that was clear before an UTF-8 BOM made it
> into the standard. Again, i will *not* cheer win-win, *these* BOMs
> are simply ridiculous and unnecessary. Different to a lot of control
> characters which had their time or still work today.
>
> It is clear that this discussion of mine is completely useless, the
> BOMs are real, they have been placed in the standard, so go and see
> how you can deal with it. It'll end up with radically mincing user
> data on the input side, and then
>
>   $ cat utf8withbom > output-withoutbom
>
> will be something very common.
> One project of mine contains a file cacert.pem.

For compatibility reasons, the "cat" tool will never be changed to do
that, as it is used in so many scripts that generate binary output. But
another tool, specifically made for text files, could do the trick of
converting BOMs, and even parse the encodings and re-encode on the fly,
even if the input contents use distinct UTFs. Such text-aware tools on
Linux/Unix should include: more, less, page, tail, od (except when used
in binary mode with explicit flags), ...

In my opinion they should not require you to specify which UTF is used
or which byte order is used internally, and they should recognize BOMs
wherever they are, even in the middle of an input, as a way to
autodetect a change of UTF binary form; extra BOMs (even ones that
don't change the current UTF or byte order) should never cause any
trouble (even a BOM in the middle of a combining sequence should not
break it: the combining characters encoded after the BOM should just be
decoded according to the UTF binary format indicated by this newly
detected BOM).

If you just want to restrict the tool to Unicode semantics (to avoid
the complications of the legacy encodings, which have their own mutual
incompatibilities), you could replace "cat" by something as short as
"ucat" (for Unicode-aware cat). And if this complicates your existing
shell scripts, you may define a shell alias to change the name, or play
with the PATH environment variable.

Unix is the only kind of system that does not differentiate or
structure text formats correctly. Its legacy filesystems don't provide
any support for identifying filetypes (only the filenames *may*
indicate it, but other things could cause this information to be lost,
including filename truncations, or lack of filename information,
notably in I/O streams).
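The core of such a Unicode-aware cat can be sketched quickly. The name "ucat" comes from the proposal above; this is only a minimal sketch that handles a single leading BOM and falls back to a default encoding, not the full mid-stream encoding switches described:

```python
import codecs

# BOM signatures, longest first so UTF-32-LE (FF FE 00 00) is not
# mistaken for UTF-16-LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8,     "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def ucat(data: bytes, default: str = "utf-8") -> str:
    """Decode a byte stream, honouring a leading BOM if present and
    falling back to `default` otherwise; the BOM itself is consumed,
    so re-encoding the result yields BOM-free output."""
    for bom, enc in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(enc)
    return data.decode(default)

assert ucat(codecs.BOM_UTF16_LE + "héllo".encode("utf-16-le")) == "héllo"
assert ucat("héllo".encode("utf-8")) == "héllo"
```

Note the well-known ambiguity this ordering papers over: a UTF-16-LE text whose first character is NUL is byte-identical to a UTF-32-LE BOM, which is one reason relying on BOM sniffing alone is fragile.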
Multiple unspecified encodings in Unix/Linux have always been a problem
when they imply that files will be incorrectly interpreted according to
the environment/locale of the viewing user (which has no reason to
change: this is a breach in the highly recommended separation of
layers). And I still don't see why I/O streams in Linux cannot convey
an out-of-band metadata substream (there has always been support for
that in the kernel, using ioctls, on which all the basic read/write
operations are built even in system drivers; they are just a particular
subset of ioctls for handling the default streams); metadata should
also be a property of the volume on which any filesystem is mounted and
stores filenames (this was not the case in early Unix filesystems).

Before Unix, its ancestors always had metadata streams for keeping
information about file types, encodings, security attributes and ACLs,
processing limits, and for controlling archiving/backups/replications,
or information about access time (e.g. the need to dynamically open a
remote connection, which may require prior human authorization, or
manual handling to ask some admin to mount an indexed tape, hard disk,
or stock of punch cards, or to provide decryption keys, plus
information for bookkeeping the offline storages or whether the volume
is currently available or when it will be available). On Unix/Linux
those metadata are represented using separate files (that are also
unstructured by nature), only assuming some naming conventions in the
directory hierarchy (and only if this hierarchy is visible).

So, early Unix filesystems are completely agnostic: even the directory
entries do not clearly enforce any naming convention or encoding, with
the exception of two bytes (0x00, used by the C language to terminate
strings, and 0x2F, the hierarchy separator); even the names "." and
".." are
not strictly bound to navigating the structure (unless they are
explicitly bound within the filesystem hierarchy as name entries); and
it remains valid in most Unix filesystems to name a file with only a
0x01 byte or just a LF byte, or using escape sequences, or a BEL
control (which will suspend the output for the time of playing a
sound), or backspace/delete characters (to hide some parts of the text,
or to hide some other files from the list), so that a simple "ls" can
corrupt your terminal session, change your terminal mode, or attempt to
access private user data by forging attacks against the terminal
protocol.

Security is only offered by separating users (and only a limited number
of users, as user ids were initially only small integers, and qualified
user names were not part of the security system). The absence of
structure and semantics in the core definition of the system has
required writing additional layers on top of that core system. This may
be an advantage because it minimizes the support code needed to write
the kernel, but this advantage is lost when you add layers on top of it
(and when there's no real enforcement of how these layers will
cooperate or compete to perform their services on top of their shared
kernel layer, so additional layers are also developed to implement a
cooperation and interoperability system).
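The point that only 0x00 and 0x2F are reserved in path components is easy to check on a typical Linux filesystem: control characters are perfectly legal file names. A small sketch (the helper name is mine, for illustration):

```python
import os
import tempfile

def weird_names_roundtrip(names):
    """Create files with the given names in a scratch directory and
    return the directory listing -- demonstrating that any byte except
    NUL (0x00) and '/' (0x2F) is legal in a POSIX path component."""
    with tempfile.TemporaryDirectory() as d:
        for name in names:
            open(os.path.join(d, name), "w").close()
        return sorted(os.listdir(d))

# Control characters -- 0x01, an embedded LF, BEL -- are all accepted
# as file names, which is exactly what makes a naive "ls" dangerous.
names = ["\x01", "a\nb", "bell\x07"]
assert weird_names_roundtrip(names) == sorted(names)
```

This is why printing directory listings raw to a terminal is risky: the bytes of the name, including escape and control sequences, reach the terminal unfiltered.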

