On Nov 24, 2006, at 04:11, Øistein E. Andersen wrote:

Section 8.1.4:
Bytes that are not valid UTF-8 sequences must be interpreted as [...] U+FFFD

Section 9.2.2:
Bytes or sequences of bytes [...] that could not be converted to Unicode characters
must be converted to U+FFFD

If I read this correctly, section 8.1.4 requires that an illegal UTF-8 sequence like F2 BF BF (the first three bytes of a four-byte sequence, obviously not followed by a continuation byte) be converted into exactly three U+FFFD characters, one for each byte. Section 9.2.2, by contrast, also allows a single replacement character (and possibly even two) in this case, and permits an arbitrary number n of repetitions of the three-byte sequence to be replaced by any number of U+FFFD characters between 1 and 3n.
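To make the per-byte reading concrete, here is a minimal decoder sketch (the helper name `per_byte_replace` is mine, and overlong forms, surrogates and out-of-range code points are simply delegated to the codec rather than handled explicitly) that turns a truncated sequence like F2 BF BF into exactly three U+FFFD characters:

```python
REPLACEMENT = "\ufffd"

def per_byte_replace(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD per byte of a faulty sequence.

    Simplified sketch of the section 8.1.4 reading; not a conforming
    decoder (overlong/out-of-range checks are left to the codec).
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                     # ASCII byte
            out.append(chr(b))
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            n = 2                        # lead byte of a two-byte sequence
        elif 0xE0 <= b <= 0xEF:
            n = 3                        # lead byte of a three-byte sequence
        elif 0xF0 <= b <= 0xF4:
            n = 4                        # lead byte of a four-byte sequence
        else:
            out.append(REPLACEMENT)      # stray continuation or bogus lead byte
            i += 1
            continue
        seq = data[i:i + n]
        if len(seq) == n and all(0x80 <= c <= 0xBF for c in seq[1:]):
            out.append(seq.decode("utf-8", "replace"))
            i += n
        else:
            # Truncated sequence: one U+FFFD per byte actually consumed,
            # i.e. the lead byte plus its run of continuation bytes.
            bad = 1
            while bad < n and i + bad < len(data) and 0x80 <= data[i + bad] <= 0xBF:
                bad += 1
            out.append(REPLACEMENT * bad)
            i += bad
    return "".join(out)
```

Under this sketch, `per_byte_replace(b"\xf2\xbf\xbf")` yields three U+FFFD characters, which is what section 8.1.4 seems to mandate; a decoder taking the section 9.2.2 latitude could collapse the same three bytes into one.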

I'm inclined to think that interop in error situations doesn't need to go as deep as defining how many replacement characters (in the range 1...number of bytes in a faulty sequence) a character decoder has to emit. Apps may want to delegate character decoding to an outside library whose authors don't care about the details of HTML5. (For example, it appears that Safari is leaving this stuff to ICU.) Chances are that there's more value in being able to use a library than in getting a specific number of replacement characters on error.
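The delegation approach looks like this in practice (using Python's built-in UTF-8 codec as a stand-in for a library like ICU): the app hands the bytes over and accepts however many replacement characters the library chooses to emit for the faulty sequence.

```python
# Delegating error handling to the platform codec: for the three bytes
# F2 BF BF, the number of U+FFFD characters emitted is the library's
# choice; any count from 1 to 3 is a plausible behavior.
decoded = (b"ab" + b"\xf2\xbf\xbf" + b"cd").decode("utf-8", "replace")
count = decoded.count("\ufffd")
assert 1 <= count <= 3
assert decoded == "ab" + "\ufffd" * count + "cd"
```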

--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/
