On Nov 24, 2006, at 04:11, Øistein E. Andersen wrote:
> Section 8.1.4:
>
>     Bytes that are not valid UTF-8 sequences must be interpreted as
>     [...] U+FFFD
>
> Section 9.2.2:
>
>     Bytes or sequences of bytes [...] that could not be converted to
>     Unicode characters must be converted to U+FFFD
> If I read this correctly, section 8.1.4 requires that an illegal
> UTF-8 sequence like F2 BF BF (the first three bytes of a four-byte
> sequence, obviously not followed by a continuation byte) be converted
> into exactly three U+FFFD characters (one for each byte), whereas
> section 9.2.2 also allows one single replacement character (and
> possibly even two) in this case (and permits an arbitrary number n of
> repetitions of the three-byte sequence to be replaced by any number
> of U+FFFD characters between 1 and 3n).
I'm inclined to think that interop in error situations doesn't need
to go as deep as defining how many replacement characters (in the
range 1...number of bytes in a faulty sequence) a character decoder
has to emit. Apps may want to delegate character decoding to an
outside library whose authors don't care about the details of HTML5.
(For example, it appears that Safari is leaving this stuff to ICU.)
Chances are that there's more value in being able to use a library
than in getting a specific number of replacement characters on error.
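
FWIW, the divergence is easy to observe from outside a library. Here's
a minimal sketch (mine, not from either spec, using Python 3 purely as
a convenient stand-in for whatever decoder an app happens to link
against):

    # Truncated four-byte UTF-8 sequence: F2 BF BF with no final
    # continuation byte.
    data = b"\xf2\xbf\xbf"
    print(data.decode("utf-8", errors="replace").count("\ufffd"))

    # Repeating the sequence n times exercises the 1..3n latitude
    # that section 9.2.2 seems to allow.
    n = 4
    repeated = (data * n).decode("utf-8", errors="replace")
    print(repeated.count("\ufffd"))

Python 3.3 and later follow Unicode's maximal-subpart recommendation,
so this prints 1 and then n; a strictly per-byte decoder would print 3
and 3*n. Both are reasonable answers from a library's point of view,
which is rather the point.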
--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/