Re: UTF-8 syntax

DougEwell2 Wed, 06 Jun 2001 23:47:19 -0700
In a message dated 2001-06-06 9:35:45 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  we see that Unicode does not *exclude* D800 and DC00 from the
>  codespace for the CCS, and therefore it would seem that that UTF-8 sequence
>  would have to be interpreted (in the encoding form level of interpretation)
>  as the code points < D800 DC00 >, which have *no* meaning *as codepoints*!!

But definition D29 says that a UTF must round-trip these invalid code points, 
so we have no choice but to interpret them as <D800 DC00>.  That is why 
UTF-8s is ambiguous.  The sequence <ED A0 80 ED B0 80> could be mapped as 
either <D800 DC00>, because D29 says you have to allow for that, or as 
<10000>, because that is the real intent.
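
To make the two readings concrete, here is a minimal Python sketch. The helper decode_3byte is hypothetical and deliberately naive: it unpacks a three-byte sequence without any surrogate check, which is roughly what a decoder that honors the round-trip reading would do.

```python
def decode_3byte(b):
    """Unpack one 3-byte sequence 1110xxxx 10xxxxxx 10xxxxxx.

    Hypothetical helper for illustration only: it performs no
    surrogate or range checks.
    """
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)

seq = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])
hi = decode_3byte(seq[0:3])
lo = decode_3byte(seq[3:6])

# Reading 1: two separate code points, <D800 DC00>.
as_code_points = [hi, lo]

# Reading 2: treat the two values as a surrogate pair and combine them.
as_scalar = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

print([hex(cp) for cp in as_code_points], hex(as_scalar))
```

The same six bytes yield either two code points or one, depending on which reading the decoder picks.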

Note that UTF-8 is not ambiguous in this regard, unless you permit these 
so-called "lenient" processors, which I thought were made non-conformant by 
the Corrigendum.  The sequence <ED A0 80 ED B0 80> is every bit as much 
"overlong" as is <C0 80>.
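
As an illustration of a non-lenient processor, the sketch below uses the strict "utf-8" codec in Python 3, which I am using here only as one example of a strict decoder. The function name strict_utf8_ok is an assumption made for this example.

```python
def strict_utf8_ok(data):
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False

surrogate_pair_form = b"\xed\xa0\x80\xed\xb0\x80"  # <ED A0 80 ED B0 80>
overlong_nul = b"\xc0\x80"                         # <C0 80>
four_byte_form = b"\xf0\x90\x80\x80"               # U+10000 in four bytes

print(strict_utf8_ok(surrogate_pair_form))
print(strict_utf8_ok(overlong_nul))
print(strict_utf8_ok(four_byte_form))
```

A strict decoder rejects both of the first two sequences and accepts only the four-byte form of U+10000.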

-Doug Ewell
 Fullerton, California