On Nov 24, 2006, at 04:11, Øistein E. Andersen wrote:
> Section 8.1.4:
>
>     Bytes that are not valid UTF-8 sequences must be interpreted as
>     [...] U+FFFD
>
> Section 9.2.2:
>
>     Bytes or sequences of bytes [...] that could not be converted to
>     Unicode characters must be converted to U+FFFD
> If I read this correctly, section 8.1.4 requires that an illegal
> UTF-8 sequence like F2 BF BF (the first three bytes of a four-byte
> sequence, obviously not followed by a continuation byte) be converted
> into exactly three U+FFFD characters (one for each byte), whereas
> section 9.2.2 also allows one single replacement character (and
> possibly even two) in this case (and permits an arbitrary number n of
> repetitions of the three-byte sequence to be replaced by any number
> of U+FFFD characters between 1 and 3n).
I'm inclined to think that interop in error situations doesn't need
to go as deep as defining how many replacement characters (in the
range 1...number of bytes in a faulty sequence) a character decoder
has to emit. Apps may want to delegate character decoding to an
outside library whose authors don't care about the details of HTML5.
(For example, it appears that Safari is leaving this stuff to ICU.)
Chances are that there's more value in being able to use a library
than in getting a specific number of replacement characters on error.
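
FWIW, the divergence is easy to observe from outside a library. Here's
a minimal sketch (mine, not from either spec, using Python 3 purely as
a convenient stand-in for whatever decoder an app happens to link
against):

    # Truncated four-byte UTF-8 sequence: F2 BF BF with no final
    # continuation byte.
    data = b"\xf2\xbf\xbf"
    print(data.decode("utf-8", errors="replace").count("\ufffd"))

    # Repeating the sequence n times exercises the 1..3n latitude
    # that section 9.2.2 seems to allow.
    n = 4
    repeated = (data * n).decode("utf-8", errors="replace")
    print(repeated.count("\ufffd"))

Python 3.3 and later follow Unicode's maximal-subpart recommendation,
so this prints 1 and then n; a strictly per-byte decoder would print 3
and 3*n. Both are reasonable answers from a library's point of view,
which is rather the point.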
--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/