Steven Atreju, Mon, 16 Jul 2012 13:35:04 +0200:
> "Doug Ewell" <[email protected]> wrote:
> And:
>
> > Q: Is the UTF-8 encoding scheme the same irrespective of whether
> > the underlying processor is little endian or big endian?
> > ...
> > Where a BOM is used with UTF-8, it is only used as an ecoding
> > signature to distinguish UTF-8 from other encodings — it has
> > nothing to do with byte order.
>
> Fifteen years ago i think i would have put effort in including the
> BOM after reading this, for complete correctness! I'm pretty sure
> that i really would have done so.

I believe that most people who are conscientious about inserting the
BOM do so because, without it, Web browsers (with Chrome as the
exception, at least whenever the page contains non-ASCII characters)
are unlikely to sniff a UTF-8 encoded page as being UTF-8 encoded. So
it has nothing to do with "complete correctness", but everything to do
with complete safety.

> So, given that this page ranks 3 when searching for «utf-8 bom»
> from within Germany i would 1), fix the «ecoding» typo and 2)
> would change this to be less «neutral». The answer to «Q.» is
> simply «Yes. Software should be capable to strip an encoded BOM
> in UTF, because some softish Unicode processors fail to do so when
> converting in between different multioctet UTF schemes. Using BOM
> with UTF-8 is not recommended.»

The current text is much to be preferred. Also, you put the cart
before the horse: you place tools over users.

There is one reason to use the UTF-8 BOM which that FAQ point doesn't
mention, however: Chrome/Safari/WebKit plus IE treat a UTF-8 encoded
text/html page with a BOM differently from one without a BOM, even
when the page is otherwise properly labelled as UTF-8. For the former,
the user is not able to override the encoding manually, whereas for a
page without the BOM, the user can override the encoding and shoot
themselves (and others) in the foot.

> And UTF-8 got an additional «wohooo - i'm Unicode text» signature
> tag, though optional.
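For illustration, the "encoding signature" role of the BOM comes down
to one fixed byte prefix: U+FEFF encoded as UTF-8 is the three bytes
EF BB BF. A minimal sketch (the function name and sample input are
hypothetical, not from the FAQ) of detecting and stripping it:

```python
# The UTF-8 encoded form of U+FEFF (the BOM) is always these three bytes,
# regardless of processor endianness.
UTF8_BOM = b"\xef\xbb\xbf"

def strip_utf8_bom(data: bytes) -> tuple[bool, bytes]:
    """Return (had_bom, data_with_any_leading_BOM_removed)."""
    if data.startswith(UTF8_BOM):
        return True, data[len(UTF8_BOM):]
    return False, data

# Hypothetical page bytes, with and without a leading BOM:
strip_utf8_bom(b"\xef\xbb\xbfHello")  # (True, b"Hello")
strip_utf8_bom(b"Hello")              # (False, b"Hello")
```

This is the kind of stripping the quoted suggestion says software
should be capable of when converting between UTF schemes.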
I like the term «extremely rare» sooo much!!

> :-) What's the problem?
> If you know how to deal with UTF-8, you can deal with UTF-8.
> If you don't, no signature ever will help you, no?!

Do you mean that, instead of the wohoo, one should do more thorough
sniffing? I have no insight into how reliable such non-BOM sniffing
is, but I take it that it is much less reliable than BOM sniffing.
Hence it would be risky (?) to prevent users from overriding the
encoding of a non-BOM-sniffed page. Which, bottom line, means that the
BOM has an advantage.

> If you don't know the charset of some text that comes from
> nowhere, i.e., no container format with meta-information, no
> filetype extension with implicit meta-information, as is used on
> Mac OS and DOS, then UTF-8 is still very easily identifiable by
> itself due to the way the algorithm is designed.

Is it? As I just said in a reply to Doug: of the Web browsers in
current use, Chrome is the very best. This is, I think, because it,
to a higher degree than the competition, assumes UTF-8 whenever it
finds non-ASCII characters.

Clearly, sniffing could improve, at least in the browser world. But is
that also true for command-line tools?

--
Leif H Silli
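The "identifiable by itself" claim quoted above can be sketched as
follows. This is a simplified illustration, not any browser's actual
sniffing algorithm: bytes that decode strictly as UTF-8 and contain at
least one non-ASCII character are unlikely to be in a legacy
single-byte encoding, because UTF-8's lead-byte/continuation-byte
patterns rarely occur by accident.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: valid UTF-8 containing at least one non-ASCII character."""
    try:
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # Pure ASCII is valid UTF-8 but gives no evidence either way.
    return any(ord(ch) > 0x7F for ch in text)

# The same non-ASCII text in two encodings (sample string is arbitrary):
looks_like_utf8("blåbær".encode("utf-8"))    # True
looks_like_utf8("blåbær".encode("latin-1"))  # False: 0xE5 lacks continuation bytes
```

Note the gap this sketch exposes: it says nothing about pure-ASCII
input, and it cannot distinguish which legacy encoding a non-UTF-8
page is in, which is where user override still matters.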

