In a message dated 2001-06-05 14:24:38 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  For me, that would be the one positive for defining UTF-8S: we could then
>  tighten up the definition of UTF-8 to require it to exclude 6-byte forms on
>  input. You could then have:
>  
>  UTF-8: only emits 4byte, only reads 4byte
>  UTF-8S: only emits 6byte, only reads 6byte

But there is still a problem, because of definition D29.  All UTFs have to be 
able to encode non-character code points, including U+D800 through U+DFFF.  
That means -- as unlikely as it is in the real world -- you could have a 
UTF-8 code sequence that represents an unpaired surrogate, and you would have 
to consider it valid, strict UTF-8 (although you can reject the unpaired 
surrogate itself).

I don't like definition D29 personally, but the experts (in particular Mark) 
have assured me that it is necessary and justified.  In my view, D29 just 
throws another monkey wrench into UTF-8S.

Remember that, to handle characters above U+FFFF, a UTF-8S processor would 
not actually emit and read 6-byte sequences per se.  It would emit and read 
*pairs of 3-byte sequences*.  The processor then has to put the two together, 
using the UTF-16 rules.
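
To make that concrete, here is a rough sketch (mine, not from the original 
message) of what such a processor would do for a character above U+FFFF: split 
it into a UTF-16 surrogate pair, emit each surrogate as its own 3-byte 
sequence, and on the way back in, recombine the pair using the UTF-16 rules. 
Function names are invented for illustration; this is essentially the CESU-8 
scheme.

```python
def encode_3byte(cp):
    """Encode one BMP-range code point (here, a surrogate) as a
    3-byte UTF-8-style sequence: 1110xxxx 10xxxxxx 10xxxxxx."""
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

def utf8s_encode_supplementary(cp):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair,
    then emit each surrogate as a 3-byte sequence (6 bytes total)."""
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)    # high (leading) surrogate
    low = 0xDC00 | (v & 0x3FF)   # low (trailing) surrogate
    return encode_3byte(high) + encode_3byte(low)

def utf8s_decode_pair(b):
    """Read two 3-byte sequences, recover the two surrogates, and
    put them back together using the UTF-16 rules."""
    def decode_3byte(s):
        return ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F)
    high, low = decode_3byte(b[:3]), decode_3byte(b[3:])
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+10000 -> surrogate pair D800 DC00 -> ED A0 80 ED B0 80
encoded = utf8s_encode_supplementary(0x10000)
assert encoded == b'\xed\xa0\x80\xed\xb0\x80'
assert utf8s_decode_pair(encoded) == 0x10000
```

Note that a conforming UTF-8 decoder would reject those same six bytes, since 
each 3-byte half decodes to a surrogate code point -- which is exactly why the 
two definitions can't simply be merged.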

-Doug Ewell
 Fullerton, California