> >> it can be represented in UTF-8 format as:
> >> 1 byte: still 2F
> >> 2 bytes: C0 AF (illegal)
> >> 3 bytes: E0 80 AF (illegal)
> >
> > Thanks for keeping the indication that the last two are illegal with
> > UTF-8. But you should have better never listed them (even if there
> > still exists some legacy converters that will accept them, no one
> > should generate them). Note also that UTF-8 encoded sequences can be
> > up to 5 bytes long...
>
> How is that possible. I was under the impression that a UTF-8 sequence
> could never be more than 4 bytes (i.e. U+10FFFF becomes F4 8F BF BF).

UTF-8 as defined in Unicode 4.0 can never be more than 4 bytes long.
However, illegal sequences can be up to 6 (not just 5) bytes long.

UTF-8 has been defined in various standards and specs as an encoding of
either Unicode or ISO 10646. ISO 10646 has space up to U+7FFFFFFF,
although there is a commitment not to use anything above U+10FFFF, to
maintain compatibility with Unicode.

Because of this, some of the published specifications for UTF-8 allow
U+7FFFFFFF and below to be encoded (U+7FFFFFFF would be encoded as
FD BF BF BF BF BF)[1]. For example, RFC 2279 (which is defined in terms
of ISO 10646 alone) allows this, but it is obsoleted by RFC 3629
(STD 63), which references the Unicode standard.

A naïve processor that allowed both over-long sequences and code points
up to U+7FFFFFFF would treat the six-octet sequence FC 80 80 80 80 AF as
an encoding of U+002F SOLIDUS.

Indeed, depending on just how such a processor was getting things wrong
(and we can only specify correct behaviour after all, people are free to
get things wrong whatever way they want :) it's *just* about possible
that the seven-octet sequence FE 80 80 80 80 80 AF would also be treated
as U+002F SOLIDUS.

[1] Indeed, the format of UTF-8 would make it possible to unambiguously
encode any value up to 0xFFFFFFFFFF, but this exceeds the ISO 10646
codepoint space, and it would break one of UTF-8's design goals by
requiring the use of the octet FE.

-- 
Jon Hanna
<http://www.hackcraft.net/>
"...it has been truly said that hackers have even more words for
equipment failures than Yiddish has for obnoxious people." - jargon.txt
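
To make the point about naïve processors concrete, here is a rough Python sketch. The function names naive_decode and strict_decode are made up for illustration and do not come from any library or spec. naive_decode imitates a careless processor: it takes the sequence length from the lead octet, accepts anything up to seven octets, and checks neither for over-long forms nor for the U+10FFFF limit. strict_decode simply leans on Python's built-in "utf-8" codec, which rejects all of the illegal forms discussed above.

```python
def naive_decode(octets: bytes) -> int:
    """Decode ONE sequence the way a careless processor might.

    Hypothetical illustration only: no over-long check, no range check,
    and lead octets up to FE (seven-octet sequences) are accepted.
    """
    lead = octets[0]
    if lead < 0x80:
        if len(octets) != 1:
            raise ValueError("malformed sequence")
        return lead

    # The number of leading 1 bits in the lead octet gives the length.
    length = 0
    mask = 0x80
    while lead & mask:
        length += 1
        mask >>= 1

    if length < 2 or length > 7 or len(octets) != length:
        raise ValueError("malformed sequence")

    # Payload bits of the lead octet are those below the first 0 bit.
    value = lead & (mask - 1)
    for octet in octets[1:]:
        if octet & 0xC0 != 0x80:
            raise ValueError("bad continuation octet")
        value = (value << 6) | (octet & 0x3F)
    return value


def strict_decode(octets: bytes) -> int:
    """Decode ONE sequence with Python's own strict UTF-8 codec."""
    return ord(octets.decode("utf-8"))


if __name__ == "__main__":
    for text in ("2F", "C0 AF", "E0 80 AF",
                 "FC 80 80 80 80 AF", "FE 80 80 80 80 80 AF"):
        octets = bytes.fromhex(text)
        try:
            strict = f"U+{strict_decode(octets):04X}"
        except UnicodeDecodeError:
            strict = "rejected"
        print(f"{text}: naive U+{naive_decode(octets):04X}, strict {strict}")
```

Under these assumptions every sequence in the list comes out of naive_decode as U+002F, while the strict codec accepts only the single octet 2F. That is why a security check for "/" done on the raw octets can be bypassed if a lenient decoder runs afterwards.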

