Peter Constable wrote:

> UTF-8 sequences, as originally defined, could be longer than four
> bytes, in order to address codepoints in the vast expanse of UCS-4
> at U+110000..U+FFFFFFFF. Since the accepted code space has been
> constrained to U+0000..U+10FFFF, only four bytes are needed. There
> are non-UTF-8s -- beasts that kind of look like UTF-8 but aren't --
> in which sequences of varying length represent the same character
> and sequences of more than four bytes appear, but they are not
> UTF-8; those byte sequences are considered illegal in UTF-8.

1. UCS-4, which is still defined by 10646 (but never by Unicode),
    is limited at U-7FFF FFFF (nitpick: for some reason it's "U-",
    not "U+"; don't ask me why). U-FFFF FFFF has always been out
    of range, probably so that one could use "signed" 32-bit ints
    (not all programming languages have unsigned integer types).
    A quick sanity check on that follows after point 3.

2. That "original" definition of UTF-8 (which was never in Unicode)
    is still the definition of UTF-8 in 10646. So UTF-8/Unicode is
    not the same as UTF-8/10646. In practice it does not matter
    very much, since there are no (and will never be) any characters
    allocated above U+10FFFF, and the private use planes above
    U+10FFFF (which were specified in 10646) have been removed.

3. Another nitpick: to reach up to (and above...) U-FFFF FFFF in a
    UTF-8-like encoding would put the maximum number of bytes per
    character at 7. The first byte of a 7-byte sequence would carry
    no data bits, though, as it would consist of exactly seven 1s
    followed by one 0. ;-) (See the bit-budget sketch at the end.)
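To make point 1 concrete: U-7FFF FFFF is exactly the largest value a
signed 32-bit integer can hold. A throwaway C check (my own snippet,
nothing normative):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* The UCS-4 ceiling fits in a signed 32-bit int;
       U-FFFF FFFF (0xFFFFFFFF) would not. */
    printf("%d\n", 0x7FFFFFFF == INT32_MAX);  /* prints 1 */
    return 0;
}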
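And here is a sketch of the original 1..6-byte scheme from point 2
(the RFC 2279-era rules; utf8_orig_encode is just my name for it, and
this is an illustration, not a reference implementation):

#include <stdint.h>
#include <stdio.h>

/* Original, pre-restriction UTF-8 (RFC 2279 / early 10646):
   1..6 bytes, code points up to U-7FFF FFFF.
   Returns bytes written, or 0 if c is out of range. */
static int utf8_orig_encode(uint32_t c, unsigned char out[6])
{
    int len;
    if      (c < 0x80)        len = 1;
    else if (c < 0x800)       len = 2;
    else if (c < 0x10000)     len = 3;
    else if (c < 0x200000)    len = 4;
    else if (c < 0x4000000)   len = 5;
    else if (c < 0x80000000u) len = 6;
    else return 0;                 /* beyond U-7FFF FFFF */

    if (len == 1) { out[0] = (unsigned char)c; return 1; }

    /* Continuation bytes (10xxxxxx), six data bits each, last first. */
    for (int i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (c & 0x3F));
        c >>= 6;
    }
    /* Lead byte: len 1s, then a 0, then the remaining high bits. */
    out[0] = (unsigned char)(((0xFFu << (8 - len)) & 0xFF) | c);
    return len;
}

int main(void)
{
    unsigned char buf[6];
    int n = utf8_orig_encode(0x7FFFFFFFu, buf);   /* top of UCS-4 */
    for (int i = 0; i < n; i++)
        printf("%02X ", buf[i]);                  /* FD BF BF BF BF BF */
    printf("\n");
    return 0;
}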
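Finally, the bit arithmetic behind point 3: an n-byte sequence has n
leading 1s and a 0 in its first byte, leaving 7-n data bits there,
plus 6 per continuation byte. Six bytes top out at 31 bits, i.e.
U-7FFF FFFF; only a 7-byte sequence (lead byte 11111110, zero data
bits) reaches 32 bits and beyond:

#include <stdio.h>

int main(void)
{
    /* n-byte sequence, 2 <= n <= 7: (7 - n) data bits in the lead
       byte (zero when n == 7), plus 6 per continuation byte. */
    for (int n = 2; n <= 7; n++) {
        int bits = (7 - n) + 6 * (n - 1);
        unsigned long long max = (1ULL << bits) - 1;
        printf("%d bytes -> %2d data bits -> max U-%llX\n",
               n, bits, max);
        /* 6 bytes -> 31 data bits -> max U-7FFFFFFF
           7 bytes -> 36 data bits -> max U-FFFFFFFFF */
    }
    return 0;
}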

                /kent k

