Re: And Visions of Sugar Plum UTF-8's Dance in Their Heads

Kenneth Whistler Tue, 12 Jun 2001 18:57:51 -0700

Jianping wrote:

> One thing needs to clarify here is that there is no four byte encoding in
> UTF-8S proposal and four byte encoding is illegal but not irregular. As
> everything in UTF-8S is perfect match to UTF-16, any blame to this proposal
> also applies to UTF-16 encoding form.

Well, after a couple of months of arguing about this, it is nice to have
this little detail drop into place. Perhaps in another couple
of months we could have a complete specification, and then
restart the argument.

[BTW, as Peter has recently noted: 4 lines of new content, quoting my
 234 lines of content, with no commentary interspersed. This rampant
 failure to edit reply-to's is threatening to bring the wrath of
 Sarasvati back down on the list, folks.]

So, given the new information that the four byte form is illegal, not irregular,
in UTF-8s, here is a revised summary of UTF-8s:

===========================================================

Case III. Code points U-0000D800..U-0000DFFF included
        in the UTF's, using UTF-8s "The vision provided
        by the Oracle."

   code point     UTF-8s             UTF-16     UTF-32

a. 00000000  <=>  00                 0000       00000000
b. 0000D7FF  <=>  ED 9F BF           D7FF       0000D7FF
g. 0000E000  <=>  EE 80 80           E000       0000E000
h. 0000FFFF  <=>  EF BF BF           FFFF       0000FFFF
i. 00010000  <=>  ED A0 80 ED B0 80  D800 DC00  00010000
j. 0010FFFF  <=>  ED AF BF ED BF BF  DBFF DFFF  0010FFFF

Round-tripping isolated surrogate code points:

c. 0000D800  <=>  ED A0 80           D800       0000D800
d. 0000DBFF  <=>  ED AF BF           DBFF       0000DBFF
e. 0000DC00  <=>  ED B0 80           DC00       0000DC00
f. 0000DFFF  <=>  ED BF BF           DFFF       0000DFFF

Code point sequences that do not round-trip from all UTF code
unit sequences. (Could be termed "irregular code point
sequences" --Ken):

k. 0000D800 0000DC00  =>  ED A0 80 ED B0 80  D800 DC00  0000D800 0000DC00
l. 0000DBFF 0000DFFF  =>  ED AF BF ED BF BF  DBFF DFFF  0000DBFF 0000DFFF

=============================================================

What Jianping is saying now is that F0..F4 are illegal as
initiators in UTF-8s. (They are legal initiators in UTF-8.)
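
To make the difference in initiators concrete, here is a minimal sketch in Python. The function names are hypothetical, and the lower bound of C2 for multi-byte initiators is my assumption (it excludes the non-shortest forms C0 and C1); the only part taken from the discussion is that F0..F4 start four-byte sequences in UTF-8 and are rejected in UTF-8s.

```python
def legal_utf8s_initiator(b):
    """Hypothetical check: lead bytes allowed in UTF-8s as described here.

    Assumption: C0/C1 are excluded as non-shortest forms, as in UTF-8.
    F0..F4 are rejected because UTF-8s has no four-byte form.
    """
    return b <= 0x7F or 0xC2 <= b <= 0xEF


def legal_utf8_initiator(b):
    """Lead bytes allowed in UTF-8: the same set, plus F0..F4."""
    return legal_utf8s_initiator(b) or 0xF0 <= b <= 0xF4


# U+10000 in ordinary UTF-8 starts with F0, which UTF-8s would reject.
lead = "\U00010000".encode("utf-8")[0]
print(hex(lead), legal_utf8_initiator(lead), legal_utf8s_initiator(lead))
```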

Also, judging from his statement that "everything in UTF-8S is
perfect match to UTF-16", it is quite clear that UTF-8s does
*not* meet the Unicode Standard's definition of a UTF. To be
a UTF, it has to be a reversible transform of code points (or
Unicode scalar values -- there is some argument about which).

But UTF-8s is designed and conceived as a CODE UNIT TRANSFORM
of UTF-16. (A "CUT", not a "UTF".)

Basically, instead of starting with the code points, and deriving
the three UTF's, for UTF-8s you start with UTF-16 and derive
UTF-8s directly from it. (This is why I have been pounding on
the point that in order to understand the Oracle proposal, you
have to think in terms of the UTF-16 <==> UTF-8 convertors,
rather than in terms of the definitional UTF's.)
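
A rough sketch of that convertor view, in Python. This is my reading of the proposal, not Oracle's code: the function names are made up for illustration, and the only rule applied is "encode each UTF-16 code unit on its own as a 1-, 2-, or 3-byte sequence."

```python
def utf16_units(code_points):
    """Turn code points into UTF-16 code units (surrogate pairs above U+FFFF)."""
    units = []
    for cp in code_points:
        if cp > 0xFFFF:
            v = cp - 0x10000
            units.append(0xD800 | (v >> 10))    # high surrogate
            units.append(0xDC00 | (v & 0x3FF))  # low surrogate
        else:
            units.append(cp)  # isolated surrogates pass through unchanged
    return units


def encode_unit(u):
    """Encode one 16-bit code unit as a 1-, 2-, or 3-byte sequence."""
    if u < 0x80:
        return bytes([u])
    if u < 0x800:
        return bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
    return bytes([0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F)])


def utf8s_encode(code_points):
    """Hypothetical UTF-8s encoder: a code unit transform of UTF-16."""
    return b"".join(encode_unit(u) for u in utf16_units(code_points))


print(utf8s_encode([0x10000]).hex(" "))  # row i in the table above
```

Note that the code points only matter in the first step. Everything after `utf16_units` works on code units alone, which is the sense in which this is a "CUT" rather than a "UTF".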

In other words, while others are seeing:

U-00010000  ==>  ED A0 80 ED B0 80  in UTF-8s
            ==>  D800 DC00          in UTF-16

Oracle is seeing:

      (D800)(DC00) <==> (ED A0 80)(ED B0 80)

and pointing out the tremendous simplicity of the fact that
a code point, err... code unit in UTF-16 always corresponds
*exactly* to a code point, errr... well a 1-, 2-, or 3- code
unit sequence in UTF-8s that always corresponds to a, umm..
character, well, sort of.

Now, perhaps Jianping will care to step in and clarify how UTF-32
fits in this picture. How, for example, are the irregular UTF-32
sequences in k and l above to be treated? As I have indicated?
(in which case, as Peter points out, there is an ambiguity in
the interpretation of any 6-byte UTF-8s representation) Or in
some other manner? And if so, how so?
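
For anyone who wants to see the ambiguity rather than take it on faith, here is a small Python sketch. The decoder and both "readings" are hypothetical helpers of mine, written only to illustrate the question; the decoder assumes well-formed input and does not validate continuation bytes.

```python
def decode_units(data):
    """Decode UTF-8s-style bytes into 16-bit code units (1- to 3-byte forms only)."""
    units, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            units.append(b)
            i += 1
        elif 0xC0 <= b < 0xE0:
            units.append(((b & 0x1F) << 6) | (data[i + 1] & 0x3F))
            i += 2
        elif 0xE0 <= b < 0xF0:
            units.append(((b & 0x0F) << 12)
                         | ((data[i + 1] & 0x3F) << 6)
                         | (data[i + 2] & 0x3F))
            i += 3
        else:
            raise ValueError("illegal initiator: %02X" % b)
    return units


def pair_surrogates(units):
    """Reading 1: a high surrogate followed by a low one is a single code point."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out


def keep_units(units):
    """Reading 2: every code unit is its own code point (cases k and l)."""
    return list(units)


units = decode_units(bytes.fromhex("EDA080EDB080"))
print([hex(cp) for cp in pair_surrogates(units)])  # one code point, as in row i
print([hex(cp) for cp in keep_units(units)])       # two code points, as in row k
```

The same six bytes yield either one UTF-32 code unit or two, depending on which reading the decoder picks, and nothing in the byte sequence itself says which one was intended.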

--Ken