> [Original Message]
> From: Jon Hanna <[EMAIL PROTECTED]>
>
> UTF-8 as defined in Unicode 4.0 can never be more than 4 bytes long.
> However, illegal sequences can be up to 6 (not just 5) bytes long.
>
> UTF-8 has been variously defined in various standards and specs as
> an encoding of either Unicode or of ISO 10646. ISO 10646 has space
> up to U+7FFFFFFF, although there is a commitment not to use anything
> above U+10FFFF to maintain compatibility with Unicode.
>
> Because of this, some of the specifications for UTF-8 that have been
> published allow for U+7FFFFFFF and below to be encoded
> (U+7FFFFFFF would be encoded as FD BF BF BF BF BF)[1]. For
> example, RFC 2279 (which is defined in terms of ISO 10646 alone)
> allows this, but it is obsoleted by RFC 3629 (STD 63), which references
> the Unicode standard.

Theoretically, it is possible to encounter 5 or 6 byte sequences that were valid under one of the older definitions of UTF-8. ISO 10646 IIRC originally had some private use areas above U+10FFFF. Therefore data encoded with a version of UTF-8 that referenced the earlier ISO 10646 definition could refer to such a character. Why anyone would need or want to do this is beyond me, but such data could exist. However, like the possibility of encountering Unicode 1 Hangul syllables, it isn't something I'd especially worry about.
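
For anyone who wants to see where the 5 and 6 byte forms come from, here is a rough Python sketch of the older 1-to-6 byte scheme as I understand it from RFC 2279. The function names encode_rfc2279 and fits_rfc3629 are made up for illustration and are not from any library, and the sketch does not reject surrogates or do any other validation, so treat it as a demonstration of the bit layout rather than a conforming encoder.

```python
def encode_rfc2279(cp: int) -> bytes:
    """Encode a code point with the older 1-to-6 byte UTF-8 layout.

    Hypothetical helper for illustration only: no surrogate checks.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit ISO 10646 range")
    if cp < 0x80:
        return bytes([cp])  # ASCII is encoded as itself

    # (exclusive upper limit, lead-byte marker, continuation byte count)
    layouts = (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),   # 5-byte form, not allowed by RFC 3629
        (0x80000000, 0xFC, 5),  # 6-byte form, not allowed by RFC 3629
    )
    for limit, lead, ncont in layouts:
        if cp < limit:
            break

    # Peel off 6 bits at a time, lowest bits first, for the continuation bytes.
    tail = []
    for _ in range(ncont):
        tail.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # Whatever bits remain go into the lead byte.
    return bytes([lead | cp] + tail[::-1])


def fits_rfc3629(cp: int) -> bool:
    """True if the code point is within the range RFC 3629 allows (length only)."""
    return 0 <= cp <= 0x10FFFF


if __name__ == "__main__":
    for cp in (0x20AC, 0x10FFFF, 0x200000, 0x7FFFFFFF):
        encoded = encode_rfc2279(cp)
        print(f"U+{cp:04X}: {encoded.hex(' ').upper()} "
              f"({len(encoded)} bytes, RFC 3629 range: {fits_rfc3629(cp)})")
```

Under this layout U+10FFFF still fits in 4 bytes (F4 8F BF BF), the 5 byte form starts at U+200000, and U+7FFFFFFF comes out as the FD BF BF BF BF BF quoted above. A decoder following RFC 3629 would reject any sequence starting with F8 or higher.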

