> > UTF-8 encoded sequences can be up to 5 bytes long...
> 
> How is that possible? I was under the impression that a UTF-8 sequence
> could never be more than 4 bytes (i.e. U+10FFFF becomes F4 8F BF BF).

Philippe chastised Chan for mentioning illegal sequences, but then went
on to refer to other illegal sequences himself.

UTF-8 sequences, as originally defined, could be up to six bytes long,
in order to address code points in the vast expanse of UCS-4 at
U+110000..U+7FFFFFFF. Since the accepted code space has been constrained
to U+0000..U+10FFFF, at most four bytes are needed. There are
non-UTF-8s -- beasts that kind of look like UTF-8 but aren't -- in which
sequences of varying length represent the same character (non-shortest
forms) or sequences of more than four bytes appear. They are not UTF-8;
those byte sequences are illegal in UTF-8.
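
For anyone who wants to check this for themselves, here is a minimal
sketch in Python. The helper names utf8_length and is_legal_utf8 are my
own, made up for illustration, and the second one simply leans on
Python's built-in strict "utf-8" codec rather than on any normative
validator, so take it as a rough check and not a conformance test.

```python
def utf8_length(cp: int) -> int:
    """Return how many bytes UTF-8 uses for code point cp (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the code space U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points have no UTF-8 encoding")
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4  # U+10000..U+10FFFF


def is_legal_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts data (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # The highest code point takes exactly four bytes: F4 8F BF BF.
    print(utf8_length(0x10FFFF), chr(0x10FFFF).encode("utf-8").hex(" "))
    # A five-byte form in the old style (lead byte F8) is rejected.
    print(is_legal_utf8(bytes([0xF8, 0x88, 0x80, 0x80, 0x80])))
    # A two-byte, non-shortest form of U+0000 is rejected as well.
    print(is_legal_utf8(bytes([0xC0, 0x80])))
```

Note that F4 90 80 80, which would be U+110000 under the old
definition, is rejected too, even though it is only four bytes long.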



Peter Constable

