> > UTF-8 encoded sequences can be up to 5 bytes long...
>
> How is that possible. I was under the impression that a UTF-8 sequence
> could never be more than 4 bytes (i.e. U+10FFFF becomes F4 8F BF BF).

Philippe chastised Chan for mentioning illegal sequences, but then went on to refer to there being other illegal sequences. UTF-8 sequences, as originally defined, could be longer than four bytes (up to six), in order to address code points in the vast expanse of UCS-4 at U+110000..U+7FFFFFFF. Since the accepted code space has been constrained to U+0000..U+10FFFF, no more than four bytes are needed. There are non-UTF-8s -- beasts that kind of look like UTF-8 but aren't -- in which sequences of varying length represent the same character and sequences of more than four bytes appear; those byte sequences are considered illegal in UTF-8.

Peter Constable
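
P.S. For anyone who wants to see the lengths involved, here is a rough sketch in Python. The helper names utf8_length and is_legal_utf8 are made up for illustration and are not part of any library. The sketch assumes the original scheme allowed up to six bytes (topping out at U+7FFFFFFF), ignores the surrogate range for simplicity, and relies on Python 3's built-in strict "utf-8" decoder to judge what counts as legal.

```python
def utf8_length(cp: int, original_definition: bool = False) -> int:
    """Bytes needed to encode code point cp (hypothetical helper).

    original_definition=True uses the old 31-bit scheme (up to six
    bytes); otherwise the code space stops at U+10FFFF.
    Surrogates (U+D800..U+DFFF) are not special-cased here.
    """
    if cp < 0:
        raise ValueError("negative code point")
    limit = 0x7FFFFFFF if original_definition else 0x10FFFF
    if cp > limit:
        raise ValueError("code point outside the encodable range")
    # Upper bound of each sequence length, one through six bytes.
    bounds = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for length, upper in enumerate(bounds, start=1):
        if cp <= upper:
            return length
    raise AssertionError("unreachable")


def is_legal_utf8(data: bytes) -> bool:
    """True if Python's strict decoder accepts data as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # The highest code point needs exactly four bytes.
    print(chr(0x10FFFF).encode("utf-8").hex())
    # A five-byte sequence in the old style is rejected today.
    print(is_legal_utf8(b"\xf8\x88\x80\x80\x80"))
    # So is an overlong two-byte encoding of U+0000.
    print(is_legal_utf8(b"\xc0\x80"))
```

The last two checks correspond to the two kinds of illegal sequence mentioned above: ones longer than four bytes, and alternative longer encodings of a character that already has a shorter one.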

