2012/7/16 Steven Atreju <[email protected]>:
> Fifteen years ago i think i would have put effort in including the
> BOM after reading this, for complete correctness! I'm pretty sure
> that i really would have done so.
Fifteen years ago I would not have advocated it, simply because support
for UTF-8 was very poor (and there were even differences of interpretation
between the ISO/IEC definition and the Unicode definition, notably in the
conformance requirements). This is no longer the case.

> So, given that this page ranks 3 when searching for «utf-8 bom»
> from within Germany i would 1), fix the «ecoding» typo and 2)
> would change this to be less «neutral». The answer to «Q.» is
> simply «Yes. Software should be capable to strip an encoded BOM
> in UTF, because some softish Unicode processors fail to do so when
> converting in between different multioctet UTF schemes. Using BOM
> with UTF-8 is not recommended.»
>
> |> I know that, in Germany, many, many small libraries become closed
> |> because there is not enough money available to keep up with the
> |> digital race, and even the greater *do* have problems to stay in
> |> touch!
> |
> |People like to complain about the BOM, but no libraries are shutting
> |down because of it. "Keeping up with the digital race" isn't about
> |handling two or three bytes at the beginning of a text file, in a way
> |that has been defined for two decades.
>
> RFC 2279 doesn't note the BOM.
>
> Looking at my 119,90.- German Mark Unicode 3.0 book, there is
> indeed talk about the UTF-8 BOM. We have (2.7, page 28)
> «Conformance to the Unicode Standard does not requires the use of
> the BOM as such a signature» (typo taken plain; or is it no
> typo?), and (13.6, page 324) «..never any questions of byte order
> with UTF-8 text, this sequence can serve as signature for .. this
> sequence of bytes will be extremely rare at the beginning of text
> files in other encodings ... for example []Microsoft Windows[]».
>
> So this is fine. It seems UTF-16 and UTF-32 were never ment for
> data exchange and the BOM was really a byte order indicator for a
> consumer that was aware of the encoding but not the byte order.
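The recommendation in the quoted answer — software should be able to strip
an encoded BOM when converting between UTF schemes — can be sketched as
follows. This is a minimal illustration in Python, not from the original
mail, and the helper name `strip_bom` is mine:

```python
# Minimal sketch: remove a leading BOM for any UTF scheme, if present.
# Note the check order: the UTF-32-LE BOM (FF FE 00 00) begins with the
# UTF-16-LE BOM (FF FE), so UTF-32 must be tested first.
import codecs

def strip_bom(data: bytes) -> bytes:
    """Return `data` with any leading UTF-8/16/32 BOM removed."""
    for bom in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE,
                codecs.BOM_UTF8,
                codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        if data.startswith(bom):
            return data[len(bom):]
    return data

# UTF-8 text carrying a signature, re-emitted without it:
text = codecs.BOM_UTF8 + "hello".encode("utf-8")
assert strip_bom(text) == b"hello"
```

A converter would call this on the decoded-and-re-encoded byte stream so
that the signature of the source scheme never leaks into the target scheme.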
> And UTF-8 got an additional «wohooo - i'm Unicode text» signature
> tag, though optional. I like the term «extremely rare» sooo much!!
> :-)

No need to rant. There is evidence that the role of the BOM in UTF-8 has
been to help the migration from legacy charsets to Unicode, to avoid
mojibake. And this role is still important. As UTF-8 became prominent in
interchanges, and the need for migration from older encodings greatly
increased, this small signature has helped identify which files were
converted and which were not, even when there was no metadata (metadata is
frequently dropped as soon as the resource is no longer on a web server,
but stored in a file on a local filesystem). As a lot of local resources
still use other encodings, the signature really helps in managing local
content.

More and more applications will recognize this signature automatically, to
avoid falling back on the default legacy encodings of the local system
(something they still do in the absence of both metadata and the BOM): you
no longer need to use a menu in apps to select the proper encoding (most
often that menu is not available, or requires restarting the application
or cancelling an ongoing transaction, and we still frequently have to
manage situations where resources in legacy local encodings and resources
in UTF-8 are mixed in the same application). The BOM is thus extremely
useful in a transition that will last several decades (or more), whenever
a resource is not strictly confined to the 7-bit US-ASCII subset.

I am also convinced that even shell interpreters on Linux/Unix should
recognize and accept the leading BOM before the hash-bang starting line
(which is commonly used for filetype identification and runtime behavior),
without claiming that they don't know how to run the file or which shell
interpreter to use. PHP itself should be allowed to use it as well (but
unfortunately it still does not have the concept of tracking the effective
encoding in order to parse its scripts simply).
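To illustrate the shell-interpreter point above, here is a hypothetical
sketch (mine, not from the mail) of how a script loader could tolerate a
UTF-8 BOM in front of the `#!` line instead of rejecting the file — which
is exactly what Unix kernels and shells do not do today:

```python
# Hypothetical script loader that skips an optional UTF-8 BOM (EF BB BF)
# before looking for the hash-bang line.
BOM_UTF8 = b"\xef\xbb\xbf"

def interpreter_for(script: bytes):
    """Return the interpreter command from a shebang line, or None."""
    if script.startswith(BOM_UTF8):
        script = script[len(BOM_UTF8):]   # tolerate the signature
    if script.startswith(b"#!"):
        first_line = script.split(b"\n", 1)[0]
        return first_line[2:].strip().decode("utf-8")
    return None                           # no shebang found

assert interpreter_for(BOM_UTF8 + b"#!/bin/sh\necho hi\n") == "/bin/sh"
```

The change is three lines of "skip the signature if present"; the
complaint in the text is that real interpreters instead fail with an
unhelpful error when those three bytes precede the `#!`.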
Yes, this requires modifying the database of filetype signatures, but this
kind of update has long been necessary to handle more and more filetypes
(see for example the frequent updates and the growth of the "/etc/magic"
database used by the Unix/Linux tool "file").
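For instance, a magic(5)-style signature entry for the UTF-8 BOM might
look like the fragment below. This is a sketch only; the entries actually
shipped with the "file" tool's magic database differ in detail:

```
# Offset 0: the three bytes EF BB BF (octal 357 273 277)
0	string	\357\273\277	UTF-8 Unicode (with BOM) text
```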

