Bjoern Hoehrmann <derhoermi at gmx dot net> wrote:

> ... However, there are no or insufficient recommendations when
> protocols should allow [U+FEFF signatures], and which of the many
> signatures should be recognized when performing auto-detection.

I assume you have read http://unicode.org/faq/utf_bom.html#BOM .

Increasingly, protocols tend to discourage or forbid the use of U+FEFF
signatures, either to achieve poor-man's compatibility with 8-bit
legacy applications (like shell scripts), or out of fears that two
encoding declarations in the same document (e.g. U+FEFF signature plus
XML "encoding") might disagree. This type of objection to in-band
tagging mechanisms tends to assume that all worthwhile data is in a
high-level markup format, or that processing these sequences is too
difficult for 21st-century software.

> Furthermore, the signatures are ambiguous.

The only ambiguity I can think of is where "little-endian UTF-16 BOM
followed by U+0000" can be confused with "little-endian UTF-32 BOM."
Both begin with the bytes FF FE 00 00. Most text strings do not begin
with U+0000, so even this case is more of a theoretical problem than a
real one.

There are several possible byte sequences for the UTF-7 signature, but
this is more of an inconvenience than an ambiguity. UTF-7 signatures
tend to appear more in comprehensive tables of signatures than in
actual content.

> This has lead to a situation where protocols vary considerably leading
> to interoperability failures and potential security problems. For
> instance, it is common for XML processors to support UTF-32 and detect
> it properly, while other formats, like "HTML5" require treating
> documents with a UTF-32 LE signature as UTF-16 LE. Yet other formats,
> like JSON, are textual in nature and permit only various Unicode
> encodings, but do not permit the BOM.

HTML5, at least, deliberately forbids the use of certain encodings
(like SCSU) and auto-detection of others (like UTF-32), not only to
prevent cross-site scripting attacks, but out of a belief that
supporting them "just wastes developer time." For this viewpoint as
expressed by an HTML Working Group participant, see
http://lists.w3.org/Archives/Public/public-html-comments/2008Jan/0032.html .

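To make the UTF-16 LE versus UTF-32 LE overlap concrete, here is a minimal Python sketch of signature sniffing. The names sniff_signature and SIGNATURES are my own illustrative choices and are not taken from any specification. UTF-7 signatures are deliberately left out. The sketch assumes the detector checks longer signatures first, which is one common approach, not the only one.

```python
import codecs

# Longest signatures first, so that UTF-32 LE (FF FE 00 00) is tried
# before UTF-16 LE (FF FE). This ordering is an assumption of the sketch.
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_signature(data: bytes):
    """Return the encoding implied by a leading U+FEFF signature, or None."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name
    return None


# Ordinary UTF-16 LE text is detected as expected.
normal = "\ufeffhi".encode("utf-16-le")
# UTF-16 LE text whose first character is U+0000 starts with FF FE 00 00,
# so this detector reports it as UTF-32 LE: the one ambiguous case.
ambiguous = "\ufeff\x00".encode("utf-16-le")

print(sniff_signature(normal))
print(sniff_signature(ambiguous))
```

A detector that never recognizes UTF-32, as described for HTML5 in the quoted text, would simply drop the two UTF-32 rows from the table, and the same FF FE 00 00 input would then be reported as UTF-16 LE.
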
--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s

