In a message dated 2001-06-03 18:04:17 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  It would seem to me that there's
>  another issue that has to be taken into consideration here: normalisation.
>  You can't just do a simple sort using raw binary comparison; you have to
>  normalise strings before you compare them, even if the comparison is a
>  binary compare. 

I would be surprised if that has even been considered.  Normalization is one 
of those fine details of Unicode, like directionality and character 
properties, that may be completely unknown to a development team that thinks 
the strict binary order of UTF-16 code units makes a suitable collation 
order.  This is a sign of a company or development team that thinks Unicode 
support is a simple matter of handling 16-bit characters instead of 8-bit 
ones.
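
To make the point concrete, here is a minimal sketch in Python: two strings 
that any user would call identical compare unequal byte-for-byte until both 
are normalized.

    import unicodedata

    a = "caf\u00e9"     # precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE
    b = "cafe\u0301"    # decomposed: "e" plus U+0301 COMBINING ACUTE ACCENT
    print(a == b)       # False -- a raw binary compare sees different code points
    print(unicodedata.normalize("NFC", a) ==
          unicodedata.normalize("NFC", b))    # True -- equal once normalized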

While we are at it, here's another argument against letting UTF-8 and this 
new UTF-8s exist side by side.  Recently there was a discussion about the use of 
the U+FEFF signature in UTF-8 files, with a fair number of Unicode experts 
arguing against its necessity because UTF-8 is so easy to detect 
heuristically.  Without reopening that debate, it is worth noting that UTF-8s 
could not be distinguished from UTF-8 by that technique.  By definition D29, 
UTF-8s must support encoding of unpaired surrogates (as UTF-8 already does), 
so a byte sequence like ED A0 80 ED B0 80 is well-formed in both encodings 
but means different things: in UTF-8 it represents the two surrogate code 
points U+D800 and U+DC00, each as its own three-byte sequence, while in 
UTF-8s it represents the legitimate supplementary code point U+10000.  Such 
a sequence -- the only point on which UTF-8 and UTF-8s differ -- could 
therefore appear in either encoding, but with different interpretations, so 
auto-detection would not work.
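
To see the collision concretely, here is a small Python sketch (the 
"surrogatepass" error handler stands in for the lenient UTF-8 decoding of 
the day; strict modern decoders reject these bytes outright):

    cp = 0x10000
    # Standard UTF-8 encodes U+10000 as one four-byte sequence.
    print(chr(cp).encode("utf-8").hex(" "))        # f0 90 80 80

    # UTF-8s instead encodes the UTF-16 surrogate pair for U+10000,
    # each half as its own three-byte sequence.
    hi = 0xD800 + ((cp - 0x10000) >> 10)           # U+D800
    lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)         # U+DC00
    utf8s = (chr(hi) + chr(lo)).encode("utf-8", "surrogatepass")
    print(utf8s.hex(" "))                          # ed a0 80 ed b0 80

    # Read back as (lenient) UTF-8, the same six bytes yield the two
    # surrogate code points, not U+10000 -- the bytes alone cannot say
    # which encoding produced them.
    print([hex(ord(c)) for c in utf8s.decode("utf-8", "surrogatepass")])
    # ['0xd800', '0xdc00']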

Summary: UTF-8s is bad.

-Doug Ewell
 Fullerton, California
