[whatwg] Surrogate pairs and character references

Øistein E. Andersen Tue, 08 Sep 2009 15:39:17 -0700

According to the spec, character references may cause surrogate characters (0xD800 to 0xDFFF) to be inserted into the DOM. Assuming that the DOM is a UTF-16BE environment, &#xD800;&#xDC00; and &#x10000; will both result in \xD800\xDC00 or U+10000. This should probably be pointed out explicitly, since extra processing has to be done to achieve the same result in a parser that is not built atop UTF-16BE.
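To illustrate the extra processing meant here, the following is a rough Python sketch of what a parser working on whole code points (rather than UTF-16 code units) might do. The function name decode_hex_refs is made up for this example and is not taken from the spec. It also assumes, for simplicity, that the input consists only of hexadecimal character references.

```python
import re

# Hypothetical helper, not part of any spec or library: matches &#x...; only.
HEX_REF = re.compile(r"&#x([0-9A-Fa-f]+);")


def decode_hex_refs(text):
    """Return the code points denoted by the hex character references in text.

    A high surrogate reference directly followed by a low surrogate reference
    is merged into one supplementary code point, which is what a UTF-16 DOM
    would yield implicitly.
    """
    units = [int(digits, 16) for digits in HEX_REF.findall(text)]
    points = []
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= unit <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Standard UTF-16 pairing arithmetic.
            points.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # Ordinary code point, or a surrogate with no partner: kept as-is.
            points.append(unit)
            i += 1
    return points
```

With this sketch, decode_hex_refs("&#xD800;&#xDC00;") and decode_hex_refs("&#x10000;") give the same single code point, U+10000, whereas a lone &#xD800; is passed through unchanged.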

Furthermore, it is not entirely clear whether a mixed form like \xD800&#xDC00; encoded in UTF-16BE should give \xD800\xDC00 or \xFFFD\xDC00. Not all browsers convert unpaired surrogates in UTF-16 to U+FFFD, so in those that do not, the mixed form may be interpreted as U+10000.
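The two possible readings of the mixed form can be modelled with a small Python sketch. This is only a toy model of the ambiguity, not a description of any particular browser. The names sanitize_raw and pair are invented for this example, and each input item is a (code unit, came-from-a-character-reference) tuple.

```python
def is_high(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF


def sanitize_raw(items):
    """Model a decoder that replaces unpaired raw surrogates with U+FFFD.

    Only code units taken from the byte stream (from_ref is False) are
    checked, and only another raw unit counts as a partner, because the
    decoder runs before character references are resolved.
    """
    out = []
    for i, (unit, from_ref) in enumerate(items):
        lone = False
        if not from_ref and is_high(unit):
            nxt = items[i + 1] if i + 1 < len(items) else None
            lone = not (nxt and not nxt[1] and is_low(nxt[0]))
        elif not from_ref and is_low(unit):
            prev = items[i - 1] if i > 0 else None
            lone = not (prev and not prev[1] and is_high(prev[0]))
        out.append(0xFFFD if lone else unit)
    return out


def pair(units):
    """Merge adjacent high/low surrogates, as a UTF-16 DOM does implicitly."""
    out = []
    i = 0
    while i < len(units):
        if i + 1 < len(units) and is_high(units[i]) and is_low(units[i + 1]):
            out.append(0x10000 + ((units[i] - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(units[i])
            i += 1
    return out


# \xD800 from the byte stream, followed by the reference &#xDC00;
mixed = [(0xD800, False), (0xDC00, True)]

strict = pair(sanitize_raw(mixed))              # decoder replaces the lone \xD800
lenient = pair([unit for unit, _ in mixed])     # decoder lets it through
```

Here strict is [0xFFFD, 0xDC00], while lenient is [0x10000]. That is exactly the difference in question.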


--
Øistein E. Andersen
