[whatwg] Handling of illegal byte-sequences (typically in UTF-8)

Øistein E. Andersen Thu, 23 Nov 2006 18:12:03 -0800

Section 8.1.4:
> Bytes that are not valid UTF-8 sequences must be interpreted as [...] U+FFFD


Section 9.2.2:
> Bytes or sequences of bytes [...] that could not be converted to Unicode
> characters must be converted to U+FFFD

If I read this correctly, section 8.1.4 requires that an illegal UTF-8 sequence
like F2 BF BF (the first three bytes of a four-byte sequence, not followed by
the final continuation byte) be converted into exactly three U+FFFD characters
(one for each byte), whereas section 9.2.2 also allows a single replacement
character (and possibly even two) in this case, and permits n repetitions of
the three-byte sequence to be replaced by any number of U+FFFD characters
between 1 and 3n.
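
To make the two readings concrete, here is a minimal sketch in Python (not something either section prescribes). It assumes CPython 3's built-in UTF-8 decoder, whose "replace" handler emits one U+FFFD per undecodable span that the decoder reports; the handler name "fffd-per-byte" and both helper functions are hypothetical, introduced only for this illustration.

```python
import codecs


def _replace_each_byte(err):
    """Hypothetical handler: one U+FFFD for every byte in the bad span."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    # err.start and err.end delimit the bytes the decoder could not convert.
    return ("\ufffd" * (err.end - err.start), err.end)


# Register the handler under an illustrative (made-up) name.
codecs.register_error("fffd-per-byte", _replace_each_byte)

# First three bytes of a four-byte sequence, with no final continuation byte.
ILLEGAL = bytes([0xF2, 0xBF, 0xBF])


def decode_per_byte(data):
    """Section 8.1.4 reading: every illegal byte becomes its own U+FFFD."""
    return data.decode("utf-8", errors="fffd-per-byte")


def decode_per_span(data):
    """One reading permitted by section 9.2.2: one U+FFFD per bad span."""
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for n in (1, 2, 3):
        data = ILLEGAL * n
        print(n, len(decode_per_byte(data)), len(decode_per_span(data)))
```

Both results satisfy the wording of section 9.2.2, but only the first satisfies section 8.1.4, so two conforming decoders can disagree on the length of the decoded text.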

I realise that the underspecification in section 9.2.2 may well be intentional,
given that this section is not limited to UTF-8, but the requirement can (more
or less easily, quite possibly depending on the handling chosen) be expressed
in such a way that it applies to any encoding.

Alternatively, a reference to an authoritative source would of course fulfil 
the purpose in the particular case of UTF-8 (if such a document can be found).

[Currently, an alert reader might infer that the treatment indicated in section
8.1.4 would also be preferable in section 9.2.2, but such inference for
consistency can hardly be expected.]

-- 
Øistein E. Andersen
