Recommendations for Unicode auto-detection

Bjoern Hoehrmann Mon, 04 Oct 2010 21:05:32 -0700

Hi,

  As I understand it, the Unicode standard permits interpreting a leading
U+FEFF as a Unicode signature, and sometimes as a byte order mark, if
there is no higher-level encoding information, independently of the
particular encoding chosen, so there are signatures for UTF-8, UTF-7, and
so on. However, there are no recommendations, or only insufficient ones,
on when protocols should allow them and on which of the many signatures
should be recognized when performing auto-detection. Furthermore, the
signatures are ambiguous: the UTF-32 LE signature, for example, begins
with the same octets as the UTF-16 LE signature.


This has led to a situation where protocols vary considerably, leading
to interoperability failures and potential security problems. For
instance, it is common for XML processors to support UTF-32 and detect it
properly, while other formats, like "HTML5", require treating documents
with a UTF-32 LE signature as UTF-16 LE. Yet other formats, like JSON,
are textual in nature and permit only various Unicode encodings, but do
not permit the BOM.
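
To make the divergence concrete, here is a minimal sketch of signature
sniffing in Python. The names sniff_signature, SIGNATURES and NO_UTF32 are
made up for illustration and are not taken from any specification; the
second table merely imitates a detector that does not know UTF-32.

```python
import codecs

# Signatures ordered so that the UTF-32 LE signature (FF FE 00 00) is
# tested before the UTF-16 LE signature (FF FE) that it begins with.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Hypothetical detector without UTF-32 support, for comparison only.
NO_UTF32 = [(bom, name) for bom, name in SIGNATURES
            if not name.startswith("utf-32")]


def sniff_signature(data, signatures=SIGNATURES):
    """Return (encoding, signature_length), or (None, 0) if nothing matches."""
    for bom, name in signatures:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


doc = codecs.BOM_UTF32_LE + "<a/>".encode("utf-32-le")
print(sniff_signature(doc))            # UTF-32-aware detector
print(sniff_signature(doc, NO_UTF32))  # detector that ignores UTF-32
```

The same octets come out as UTF-32 LE with the first table and as UTF-16 LE
with the second. Even the first table is a guess, since FF FE 00 00 is also
a UTF-16 LE signature followed by U+0000.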

In the case of JSON, the problem is further amplified by a primary
consumer, the XMLHttpRequest interface, which always checks for a
signature, whether the format allows it or not, so your JSON content
works in the browser when using that interface, but may not work
elsewhere. Furthermore, XMLHttpRequest does not check for UTF-32, with or
without signature, whereas the JSON specification suggests auto-detecting
it, relying on JSON entities starting with some ASCII code point, which
leads to another interoperability problem.
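
For reference, the detection suggested for JSON, as I read RFC 4627
section 3, looks at the pattern of null octets in the first four octets.
A rough Python sketch follows; the function name guess_json_encoding and
the fallback for inputs shorter than four octets are my own assumptions.

```python
def guess_json_encoding(data):
    """Guess the encoding of a signature-less JSON text from its first octets."""
    if len(data) < 4:
        # Assumption: too short to apply the pattern, fall back to UTF-8.
        return "utf-8"
    # True where an octet is zero, e.g. "{" in UTF-32 LE is 7B 00 00 00.
    nulls = tuple(octet == 0 for octet in data[:4])
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    # Anything else is treated as UTF-8.
    return patterns.get(nulls, "utf-8")


text = '{"a":1}'
for encoding in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(encoding, guess_json_encoding(text.encode(encoding)))
```

Note that this only works without a signature: prepend FF FE to the
UTF-16 LE form and the sketch falls through to UTF-8, whereas a consumer
that sniffs signatures first gets UTF-16 LE.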

Is there some guidance in the Unicode standard that I've missed, or is
there some guidance that could be offered to authors of new protocols,
or those revising existing protocols, to ease the pain?

regards,
-- 
Björn Höhrmann · mailto:[email protected] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
