In a message dated 2001-06-12 1:07:17 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  There's a mistake being made here that has been made repeatedly throughout
>  our discussion: that's to assume that there are two kinds of UTF-8: the
>  original, in which the code unit sequence < ED A0 80 ED B0 80 > meant the
>  coded character sequence < U-0000D800, U-0000DC00 >, and the new UTF-8 in
>  which this sequence means U-00010000. The only sensible interpretation of
>  the definitions of Unicode is that UTF-8 maps exactly one coded character
>  to exactly one code unit sequence. As far as I know, the UTF-8 mapping
>  hasn't changed; all that has changed are the range of USVs that are mapped
>  into it, and the introduction of some terms like "irregular".

There has only ever been one kind of UTF-8, but the Unicode underneath it has 
changed: from version 1.x, where there were no surrogates and U+D800, U+DC00 
was just an ordinary sequence of two characters, to version 2.x and beyond, 
where U+D800, U+DC00 is either (a) a surrogate pair representing U+10000 or 
(b) much less likely, two loose surrogates that happened to appear together 
by chance.

UTF-16, alone among UTFs (until this proposal), cannot distinguish case (a) 
from case (b), as definition D29 requires, but UTF-8 does have this power.  
You can say F0 90 80 80 to mean U+10000, or if you really want to, you can 
also say ED A0 80 ED B0 80 to mean U+D800, U+DC00.
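
To make the two byte sequences concrete, here is a small Python sketch.  It 
is only an illustration: a strict UTF-8 codec refuses surrogates outright, so 
Python's "surrogatepass" error handler is used here as a stand-in for the 
lenient reading described above.

```python
# U+10000 as one supplementary character: the four-byte form.
supplementary = "\U00010000".encode("utf-8")

# U+D800, U+DC00 as two separate code points.  Strict UTF-8 rejects
# surrogates, so "surrogatepass" stands in for the lenient interpretation
# (an assumption made for illustration only).
loose_pair = "\ud800\udc00".encode("utf-8", "surrogatepass")

print(supplementary.hex(" "))  # f0 90 80 80
print(loose_pair.hex(" "))     # ed a0 80 ed b0 80

# Decoding the six-byte form leniently yields two code points, not U+10000.
decoded = loose_pair.decode("utf-8", "surrogatepass")
print(len(decoded))            # 2
```

The two inputs produce different byte sequences, which is the distinction 
that UTF-16 cannot express.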

So Toby is correct -- UTF-8s is not UTF-8, but a completely different 
encoding scheme (although it walks and talks just like UTF-8 as long as the 
text in question contains no surrogates or supplementary characters).
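
A rough sketch of the difference, in Python.  The function name encode_utf8s 
is hypothetical, and the behavior shown (each supplementary character is 
split into its UTF-16 surrogates, and each surrogate is written as a 
three-byte sequence) is my reading of the proposal, not a reference 
implementation.

```python
def encode_utf8s(text: str) -> bytes:
    """Hypothetical UTF-8s encoder: supplementary characters go out as
    two three-byte surrogate sequences instead of one four-byte sequence."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split into a UTF-16 surrogate pair.
            cp -= 0x10000
            units = (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF))
        else:
            units = (cp,)
        for unit in units:
            # "surrogatepass" lets Python emit the three-byte surrogate form.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


bmp_text = "A\u00e9\u4e2d"  # BMP only: no surrogates, no supplementary chars
print(encode_utf8s(bmp_text) == bmp_text.encode("utf-8"))  # True

print(encode_utf8s("\U00010000").hex(" "))   # ed a0 80 ed b0 80
print("\U00010000".encode("utf-8").hex(" "))  # f0 90 80 80
```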

There are problems, though.  UTF-8s looks *so much* like UTF-8 that, as Peter 
notes, there is considerable opportunity for the two to become mixed up.  
Toby admits that although UTF-8s is intended to remain internal, sometimes 
"internal" things leak out into the external world.  Oracle's choice of names 
("UTF8" to mean UTF-8s, the non-intuitive "AL32UTF8" to mean UTF-8) doesn't 
help matters one bit.  And I still don't think UTF-8s is truly capable of 
round-tripping unpaired surrogates in the manner spelled out in D29.
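
One way to read the round-trip concern, sketched in Python.  Both functions 
are hypothetical and assume that UTF-8s writes every supplementary character 
as two three-byte surrogate sequences and that a decoder rejoins any adjacent 
high/low pair; a real implementation may behave differently.

```python
def encode_utf8s(text: str) -> bytes:
    """Hypothetical UTF-8s encoder (assumed behavior, see above)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            cp -= 0x10000
            units = (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF))
        else:
            units = (cp,)
        for unit in units:
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def decode_utf8s(data: bytes) -> str:
    """Hypothetical UTF-8s decoder: rejoins adjacent high/low surrogates."""
    chars = data.decode("utf-8", "surrogatepass")
    out = []
    i = 0
    while i < len(chars):
        cp = ord(chars[i])
        nxt = ord(chars[i + 1]) if i + 1 < len(chars) else 0
        if 0xD800 <= cp <= 0xDBFF and 0xDC00 <= nxt <= 0xDFFF:
            out.append(chr(0x10000 + ((cp - 0xD800) << 10) + (nxt - 0xDC00)))
            i += 2
        else:
            out.append(chars[i])
            i += 1
    return "".join(out)


pair = "\U00010000"        # case (a): a real supplementary character
loose = "\ud800\udc00"     # case (b): two loose surrogates side by side

# Both inputs collapse to the same six bytes, so the decoder cannot tell
# them apart and the loose surrogates do not come back as they went in.
print(encode_utf8s(pair) == encode_utf8s(loose))    # True
print(decode_utf8s(encode_utf8s(loose)) == loose)   # False
```

Under these assumptions a single isolated surrogate does survive the trip; 
it is the loose high surrogate followed by a loose low surrogate that is 
lost.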

All of these technical considerations need to be taken into account, as well 
as those presented by the database vendors.  The worst thing would be for 
UTF-8s to be just swept into the standard because of the political clout 
wielded by the proponents.

-Doug Ewell
 Fullerton, California
