Hi,

As I understand it, the Unicode standard permits the interpretation of a leading U+FEFF as a Unicode signature, and sometimes as a byte order mark, if there is no higher-level encoding information, independently of the particular encoding chosen, so you have signatures for UTF-8, UTF-7, and so on. However, there are no recommendations, or only insufficient ones, on when protocols should allow them, and on which of the many signatures should be recognized when performing auto-detection. Furthermore, the signatures are ambiguous.
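To illustrate the ambiguity, here is a minimal sketch in Python. The function name `sniff` and the two signature tables are made up for this example and do not come from any specification; the UTF-7 signatures are left out for brevity. The UTF-32 LE signature (FF FE 00 00) begins with the UTF-16 LE signature (FF FE), so the result depends on which signatures a detector knows and in which order it checks them.

```python
import codecs

# Hypothetical table: longer signatures are listed first, so the UTF-32 LE
# signature is tried before the UTF-16 LE signature that is its prefix.
SIGNATURES_LONGEST_FIRST = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]

# Hypothetical table for a detector that does not know UTF-32 at all.
SIGNATURES_NO_UTF32 = [
    entry for entry in SIGNATURES_LONGEST_FIRST
    if not entry[1].startswith("UTF-32")
]


def sniff(data, table):
    """Return the name of the first signature in `table` that `data` starts with."""
    for signature, name in table:
        if data.startswith(signature):
            return name
    return None


# The letter "a" encoded as UTF-32 LE, preceded by the UTF-32 LE signature.
sample = codecs.BOM_UTF32_LE + "a".encode("utf-32-le")

print(sniff(sample, SIGNATURES_LONGEST_FIRST))  # UTF-32LE
print(sniff(sample, SIGNATURES_NO_UTF32))       # UTF-16LE

# The same bytes are also well-formed UTF-16 LE: U+FEFF, U+0000, "a", U+0000.
as_utf16 = sample.decode("utf-16-le")
```

Both answers are defensible from the bytes alone, so two consumers following different rules will disagree about the same document.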
This has led to a situation where protocols vary considerably, leading to interoperability failures and potential security problems. For instance, it is common for XML processors to support UTF-32 and detect it properly, while other formats, like "HTML5", require treating documents with a UTF-32 LE signature as UTF-16 LE. Yet other formats, like JSON, are textual in nature and permit only various Unicode encodings, but do not permit the BOM.

In the case of JSON the problem is further amplified by a primary consumer, the XMLHttpRequest interface, always checking for a signature, whether the format allows it or not, so your JSON content works in the browser when using that interface, but may not work elsewhere. XMLHttpRequest further does not check for UTF-32, with or without signature, but the JSON specification suggests performing auto-detection for it, relying on the fact that JSON entities start with some ASCII code point, which leads to another interoperability problem.

Is there some guidance in the Unicode standard that I've missed, or is there some guidance that could be offered to authors of new protocols, or those revising existing protocols, to ease the pain?

regards,
-- 
Björn Höhrmann · mailto:[email protected] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

