Markus Scherer responded:

> Stefan Persson wrote:
> 
> > This links to a different page on the same server:
> > 
> > http://www.cl.cam.ac.uk/~mgk25/unicode.html
> > 
> > That page contains a strange UTF-8 table:
> > ...
> > The last two byte sequences are invalid.
> 
> 
> Markus Kuhn's page shows the original ISO 10646 definition.
                               ^^^^^^^^
and still current ISO/IEC 10646 definition. Table D.1 in Annex D
"UCS Transformation Format 8 (UTF-8)".

Note that the definition of the 5- and 6-byte UTF-8 sequences
for code positions past U-001FFFFF is essentially harmless,
as ISO/IEC 10646 now contains explicit language indicating
that there is no intention to encode any characters at code
positions past U-0010FFFF. So the definition of the 5- and 6-byte sequences
is vacuous -- no such sequence will ever be a valid representation
of an *encoded character* in 10646.
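
As a rough illustration of the ranges discussed above, here is a minimal
Python sketch of the sequence lengths in the original six-form table. The
function name utf8_length is hypothetical, chosen only for this example;
the range limits are those of the original 1- to 6-byte definition.

```python
def utf8_length(code):
    """Return the number of bytes the original 6-form table assigns to a
    UCS-4 code position (surrogate handling is deliberately ignored here)."""
    # Last code position covered by the 1-, 2-, 3-, 4-, 5- and 6-byte forms.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for length, last in enumerate(limits, start=1):
        if 0 <= code <= last:
            return length
    raise ValueError("code position outside 0..7FFFFFFF")


# U-0010FFFF, the last position intended for encoded characters,
# already fits in the 4-byte form.
print(utf8_length(0x10FFFF))
# The 5-byte form only starts past U-001FFFFF.
print(utf8_length(0x200000))
```

Since every position up to U-0010FFFF fits in four bytes or fewer, the
longer forms are never needed for an encoded character.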

> This necessarily includes all codes up to 7FFFFFFF.
> It also includes D800..DFFF, which is not allowed in Unicode 3.2 
> and the RFC on UTF-8, and I think implicitly not allowed in ISO 10646.

They are *explicitly* not allowed in UTF-8 in ISO/IEC 10646 as well.
From Clause D.4 "Mapping from UCS-4 form to UTF-8 form":

  "Values of x in the range 0000 D800 .. 0000 DFFF are
   reserved for the UTF-16 form and do not occur in UCS-4.
   The mappings of these code positions in UTF-8 are undefined."
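
The restriction quoted above can be sketched as a small Python encoder. This
is only an illustration under stated assumptions, not text from the standard:
the name ucs4_to_utf8 and the table _FORMS are hypothetical, and raising
ValueError is just one way to treat a mapping that is "undefined".

```python
# (last code position, sequence length, lead-byte marker) for each form
# of the original 1- to 6-byte table.
_FORMS = [
    (0x0000007F, 1, 0x00),
    (0x000007FF, 2, 0xC0),
    (0x0000FFFF, 3, 0xE0),
    (0x001FFFFF, 4, 0xF0),
    (0x03FFFFFF, 5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
]


def ucs4_to_utf8(x):
    """Map a UCS-4 value to UTF-8 bytes, refusing D800..DFFF."""
    if not 0 <= x <= 0x7FFFFFFF:
        raise ValueError("not a UCS-4 value")
    if 0xD800 <= x <= 0xDFFF:
        # Reserved for the UTF-16 form; no UTF-8 mapping is defined.
        raise ValueError("D800..DFFF has no UTF-8 mapping")
    for last, length, marker in _FORMS:
        if x <= last:
            break
    if length == 1:
        return bytes([x])
    out = []
    # Emit continuation bytes (10xxxxxx) from the low-order end.
    for _ in range(length - 1):
        out.append(0x80 | (x & 0x3F))
        x >>= 6
    # The remaining high-order bits go into the lead byte.
    out.append(marker | x)
    return bytes(reversed(out))


print(ucs4_to_utf8(0x20AC).hex())
```

For values that are valid Unicode scalar values, the result should agree
with Python's built-in chr(x).encode("utf-8").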

--Ken

