Markus Scherer responded:

> Stefan Persson wrote:
>
> > This links to a different page on the same server:
> >
> > http://www.cl.cam.ac.uk/~mgk25/unicode.html
> >
> > That page contains a strange UTF-8 table:
> > ...
> > The last two byte sequences are invalid.
> >
> Markus Kuhn's page shows the original ISO 10646 definition.
                               ^^^^^^^^
                               and still current ISO/IEC 10646 definition.

Table D.1 in Annex D "UCS Transformation Format 8 (UTF-8)".

Note that the definition of the 5- and 6-byte UTF-8 sequences for
code positions past U-001FFFFF is essentially harmless, as ISO/IEC
10646 now contains explicit language indicating the non-intention
to encode any characters at code positions past U-0010FFFF. So the
definition of the 5- and 6-byte sequences is vacuous -- no such
sequence will ever be a valid representation of an *encoded
character* in 10646.

> This necessarily includes all codes up to 7FFFFFFF.
> It also includes D800..DFFF, which is not allowed in Unicode 3.2
> and the RFC on UTF-8, and I think implicitly not allowed in ISO 10646.

They are *explicitly* not allowed in UTF-8 in ISO/IEC 10646 as well.
From Clause D.4 "Mapping from UCS-4 form to UTF-8 form":

"Values of x in the range 0000 D800 .. 0000 DFFF are reserved for
the UTF-16 form and do not occur in UCS-4. The mappings of these
code positions in UTF-8 are undefined."

--Ken
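
For illustration only, here is a minimal Python sketch of the mapping
discussed above: the 1- to 6-byte layout covering values up to
7FFFFFFF, with D800..DFFF rejected as in the Clause D.4 text quoted
above. The function names ucs4_to_utf8 and is_encodable_character
are hypothetical, and the byte-length boundaries are the usual UTF-8
ones rather than a transcription of Table D.1.

```python
def ucs4_to_utf8(x: int) -> bytes:
    """Map a UCS-4 value to a UTF-8 byte sequence of 1 to 6 bytes.

    Sketch only: uses the customary UTF-8 bit layout and refuses
    D800..DFFF, which are reserved for the UTF-16 form.
    """
    if not 0 <= x <= 0x7FFFFFFF:
        raise ValueError("value outside 0..7FFFFFFF")
    if 0xD800 <= x <= 0xDFFF:
        raise ValueError("D800..DFFF is reserved for UTF-16")
    if x < 0x80:
        return bytes([x])  # single byte, identical to ASCII

    # (exclusive upper limit, total bytes, lead-byte marker bits)
    layouts = (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    )
    for limit, nbytes, lead in layouts:
        if x < limit:
            break

    out = []
    # Emit continuation bytes (10xxxxxx) from the low-order end.
    for _ in range(nbytes - 1):
        out.append(0x80 | (x & 0x3F))
        x >>= 6
    out.append(lead | x)  # remaining high bits go into the lead byte
    return bytes(reversed(out))


def is_encodable_character(x: int) -> bool:
    """True if x lies in the range where characters may be encoded:
    at most 10FFFF and not in D800..DFFF."""
    return 0 <= x <= 0x10FFFF and not (0xD800 <= x <= 0xDFFF)
```

With this sketch, ucs4_to_utf8(0x10FFFF) yields the four bytes
F4 8F BF BF, while ucs4_to_utf8(0x200000) yields a 5-byte sequence
(F8 88 80 80 80) that is_encodable_character reports as outside the
range where any character will be encoded.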

