And Visions of Sugar Plum UTF-8's Dance in Their Heads

Kenneth Whistler Tue, 12 Jun 2001 16:53:41 -0700
Case I. Code points U-0000D800..U-0000DFFF excluded
        from the UTF's. "The way God intended it to be"

   code point     UTF-8              UTF-16     UTF-32

a. 00000000  <=>  00                 0000       00000000
b. 0000D7FF  <=>  ED 9F BF           D7FF       0000D7FF
g. 0000E000  <=>  EE 80 80           E000       0000E000
h. 0000FFFF  <=>  EF BF BF           FFFF       0000FFFF
i. 00010000  <=>  F0 90 80 80        D800 DC00  00010000
j. 0010FFFF  <=>  F4 8F BF BF        DBFF DFFF  0010FFFF

[Commentary by Ken: UTF-16 does not define the same
 binary ordering as UTF-8 or UTF-32. Big whoop.]
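
The rows above, and the ordering remark, can be checked with a minimal Python sketch. It relies only on Python's built-in codecs; the helper names hex_units and code_units are made up for this illustration and are not part of any proposal.

```python
def hex_units(data: bytes, width: int) -> str:
    """Format bytes as space-separated hex code units of `width` bytes each."""
    return " ".join(data[i:i + width].hex().upper() for i in range(0, len(data), width))


def code_units(cp: int) -> dict:
    """Return the UTF-8, UTF-16, and UTF-32 forms of one code point (Case I)."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are excluded in Case I")
    ch = chr(cp)
    return {
        "utf8": hex_units(ch.encode("utf-8"), 1),
        "utf16": hex_units(ch.encode("utf-16-be"), 2),
        "utf32": hex_units(ch.encode("utf-32-be"), 4),
    }


print(code_units(0x10000))

# Binary ordering: compare U+E000 (row g) with U+10000 (row i).
g, i = chr(0xE000), chr(0x10000)
utf8_in_order = g.encode("utf-8") < i.encode("utf-8")            # True
utf32_in_order = g.encode("utf-32-be") < i.encode("utf-32-be")   # True
utf16_in_order = g.encode("utf-16-be") < i.encode("utf-16-be")   # False: E000 > D800
print(utf8_in_order, utf32_in_order, utf16_in_order)
```

U+E000 sorts below U+10000 as a code point, but its UTF-16 code unit E000 sorts above the lead surrogate D800, which is the whole of the ordering difference.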

===========================================================

Case II. Code points U-0000D800..U-0000DFFF included
        in the UTF's. "Mark's hard look at the real
        world, where the angels have fallen."
        http://www.macchiato.com/utc/utf_comparison.htm

   code point     UTF-8              UTF-16     UTF-32

a. 00000000  <=>  00                 0000       00000000
b. 0000D7FF  <=>  ED 9F BF           D7FF       0000D7FF
g. 0000E000  <=>  EE 80 80           E000       0000E000
h. 0000FFFF  <=>  EF BF BF           FFFF       0000FFFF
i. 00010000  <=>  F0 90 80 80        D800 DC00  00010000
j. 0010FFFF  <=>  F4 8F BF BF        DBFF DFFF  0010FFFF

Round-tripping isolated surrogate code points (when not
appropriately paired):

c. 0000D800  <=>  ED A0 80           D800       0000D800
d. 0000DBFF  <=>  ED AF BF           DBFF       0000DBFF
e. 0000DC00  <=>  ED B0 80           DC00       0000DC00
f. 0000DFFF  <=>  ED BF BF           DFFF       0000DFFF

Code point sequences that do not round-trip from UTF code
unit sequences. [Could be termed "irregular code point
sequences" --Ken]:

k. 0000D800 0000DC00  =>  F0 90 80 80  D800 DC00  00010000
l. 0000DBFF 0000DFFF  =>  F4 8F BF BF  DBFF DFFF  0010FFFF

UTF code unit sequences that do not round-trip from code
points. (Irregular code unit sequences):

m. 00010000  <=   ED A0 80 ED B0 80   ----      0000D800 0000DC00
n. 0010FFFF  <=   ED AF BF ED BF BF   ----      0000DBFF 0000DFFF

[Commentary by Ken: k and l are a real problem here,
 since the conditional handling of "surrogate code points",
 where they convert to a single UTF-32 code unit when isolated,
 but *also* convert to a single UTF-32 code unit when paired,
 breaks the 1-to-1 relationship, character==>code unit, implicit
 for UTF-32. m and n have the same problem in reverse for UTF-32.
 I don't think either can be considered a correct specification
 for UTF-32.]
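
The loss of round-tripping in rows k and m can be reproduced with a short Python sketch. It uses Python's "surrogatepass" error handler as a stand-in for a lenient converter; that choice is an assumption of this illustration, not something the comparison page specifies.

```python
# Row k: two surrogate code points go in ...
pair = "\ud800\udc00"
utf16 = pair.encode("utf-16-be", "surrogatepass")   # code units D800 DC00
decoded = utf16.decode("utf-16-be")                  # ... one code point comes out
print(utf16.hex().upper(), len(pair), len(decoded), hex(ord(decoded)))

# Row m: the irregular six-byte UTF-8 sequence decodes leniently to the pair,
# but re-encoding through UTF-16 yields the regular four-byte form instead.
irregular = bytes.fromhex("EDA080EDB080")
lenient = irregular.decode("utf-8", "surrogatepass")
round_trip = (
    lenient.encode("utf-16-be", "surrogatepass")
    .decode("utf-16-be")
    .encode("utf-8")
)
print(round_trip.hex().upper())
```

In both directions the data that comes back differs from the data that went in, which is why these sequences are set apart from rows a through j.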

===========================================================

Case III. Code points U-0000D800..U-0000DFFF included
        in the UTF's, using UTF-8s. "The vision provided
        by the Oracle."

   code point     UTF-8s             UTF-16     UTF-32

a. 00000000  <=>  00                 0000       00000000
b. 0000D7FF  <=>  ED 9F BF           D7FF       0000D7FF
g. 0000E000  <=>  EE 80 80           E000       0000E000
h. 0000FFFF  <=>  EF BF BF           FFFF       0000FFFF
i. 00010000  <=>  ED A0 80 ED B0 80  D800 DC00  00010000
j. 0010FFFF  <=>  ED AF BF ED BF BF  DBFF DFFF  0010FFFF

Round-tripping isolated surrogate code points:

c. 0000D800  <=>  ED A0 80           D800       0000D800
d. 0000DBFF  <=>  ED AF BF           DBFF       0000DBFF
e. 0000DC00  <=>  ED B0 80           DC00       0000DC00
f. 0000DFFF  <=>  ED BF BF           DFFF       0000DFFF

Code point sequences that do not round-trip from all UTF code
unit sequences. (Could be termed "irregular code point
sequences" --Ken):

k. 0000D800 0000DC00  =>  ED A0 80 ED B0 80  D800 DC00  0000D800 0000DC00
l. 0000DBFF 0000DFFF  =>  ED AF BF ED BF BF  DBFF DFFF  0000DBFF 0000DFFF

UTF code unit sequences that do not round-trip from code
points. (Irregular code unit sequences):

m. 00010000  <=   F0 90 80 80        ----      ???
n. 0010FFFF  <=   F4 8F BF BF        ----      ???

[Commentary by Ken: The UTF-8s proposal reverses the
 sense of the irregular UTF-8 code unit sequences, making
 them regular for UTF-8s and making the regular UTF-8
 code unit sequences for supplementary characters *irregular*
 for UTF-8s. The proposal suffers the same nagging problem
 about what to do for UTF-32 for the odd cases of k, l, m, n.
 The UTF-32 *does* round-trip for k and l, but the UTF-8
 and UTF-16 do not. This leads to a conversion conundrum
 for UTF-32:

 <0000D800 0000DC00> => <U+D800, U+DC00> ==>
      <ED A0 80 ED B0 80> => U+10000 != <U+D800, U+DC00>

 Further note: To think about this Case the way Oracle does,
 recast everything in terms of UTF-8s <==> UTF-16 conversions.
 This vision of UTF-8s is really the extrapolation of the
 original UTF-2, as a transform on UCS-2, seeking not to
 special-case the handling of surrogate code units that
 were introduced in UTF-16. ]
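
Read as a transform on UTF-16 code units, UTF-8s can be sketched in a few lines of Python. This is only an illustration under that reading: the function name encode_utf8s is hypothetical, and the proposal text itself is not quoted here.

```python
def encode_utf8s(text: str) -> bytes:
    """Encode each UTF-16 code unit separately, surrogates included."""
    out = bytearray()
    utf16 = text.encode("utf-16-be", "surrogatepass")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        if unit < 0x80:
            out.append(unit)
        elif unit < 0x800:
            out += bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
        else:
            # Surrogate code units land here and get three bytes each.
            out += bytes([
                0xE0 | (unit >> 12),
                0x80 | ((unit >> 6) & 0x3F),
                0x80 | (unit & 0x3F),
            ])
    return bytes(out)


print(encode_utf8s("\U00010000").hex(" ").upper())   # row i

# Rows i and k produce the same bytes, so a decoder cannot tell them apart.
same_bytes = encode_utf8s("\U00010000") == encode_utf8s("\ud800\udc00")

# UTF-8s sorts U+10000 below U+E000, exactly as UTF-16 does.
matches_utf16_order = encode_utf8s("\U00010000") < encode_utf8s("\ue000")
print(same_bytes, matches_utf16_order)
```

The collision between rows i and k is the conversion conundrum in miniature: the bytes carry no record of whether they started as one supplementary code point or as two surrogate code points.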

===========================================================

Case IV. Code points U-0000D800..U-0000DFFF included
        in the UTF's, using UTF-8s and adding UTF-32s.
        "Let them order UTF-16 cake."

   code point     UTF-8s             UTF-16     UTF-32s

a. 00000000  <=>  00                 0000       00000000
b. 0000D7FF  <=>  ED 9F BF           D7FF       0000D7FF
g. 0000E000  <=>  EE 80 80           E000       0011E000
h. 0000FFFF  <=>  EF BF BF           FFFF       0011FFFF
i. 00010000  <=>  ED A0 80 ED B0 80  D800 DC00  00010000
j. 0010FFFF  <=>  ED AF BF ED BF BF  DBFF DFFF  0010FFFF

(and everything else follows the Oracle Case III.)

[Commentary by Ken: This one is *too* weird. UTF-32s
 now has the same binary order as UTF-16 and UTF-8s, but
 it breaks the numeric relationship between code point
 and UTF-32 code unit value, which is sure to break lots
 of code. Use of code unit values greater than 0x10FFFF would
 also break code that assumed the UTF-32 structure. Otherwise
 this has the same imprecision regarding irregular UTF-32
 for surrogate pairs as Case III.]
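
The UTF-32s rows are consistent with shifting U+E000..U+FFFF up by 0x110000. The Python sketch below assumes that rule, inferred only from rows g and h; the name utf32s_unit is hypothetical.

```python
def utf32s_unit(cp: int) -> int:
    """Map a code point to a UTF-32s code unit (rule inferred from rows g, h)."""
    if 0xE000 <= cp <= 0xFFFF:
        return cp + 0x110000   # pushed above the supplementary range
    return cp


samples = [0x0000, 0xD7FF, 0xE000, 0xFFFF, 0x10000, 0x10FFFF]

by_utf32s = sorted(samples, key=utf32s_unit)
by_utf16 = sorted(samples, key=lambda cp: chr(cp).encode("utf-16-be"))

print([hex(cp) for cp in by_utf32s])
print(by_utf32s == by_utf16)       # same binary order as UTF-16
print(by_utf32s == sorted(samples))  # but no longer code point order
```

The shifted units 0011E000 and 0011FFFF lie beyond 0x10FFFF, so any code that range-checks UTF-32 values would reject them.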

===========================================================

Case V. Code points U-0000D800..U-0000DFFF included
        in the UTF's, using UTF-16x. "Huh?"

   code point     UTF-8              UTF-16x    UTF-32

a. 00000000  <=>  00                 0000       00000000
b. 0000D7FF  <=>  ED 9F BF           D7FF       0000D7FF
g. 0000E000  <=>  EE 80 80           D800       0000E000
h. 0000FFFF  <=>  EF BF BF           F7FF       0000FFFF
i. 00010000  <=>  F0 90 80 80        F800 FC00  00010000
j. 0010FFFF  <=>  F4 8F BF BF        FBFF FFFF  0010FFFF

(And it isn't clear what else to do with this, as I
 haven't seen a complete specification yet.)

[Commentary by Ken: This one is *even* weirder, if
 I have interpreted what people have in mind. Mark already
 ruled it "impossible". While obtaining the goal of
 binary order compatibility between the three UTF's, it 
 would trash interoperability with existing UTF-16 data and 
 API's.]
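
One mapping that reproduces the four UTF-16x rows g through j is sketched below in Python. Since no complete specification exists, everything here is guesswork from the table: the name utf16x_units is hypothetical, and surrogate code points are simply rejected.

```python
def utf16x_units(cp: int) -> list:
    """Guess at UTF-16x: surrogates moved to F800..FFFF, E000..FFFF shifted down."""
    if cp < 0xD800:
        return [cp]
    if cp <= 0xDFFF:
        raise ValueError("the table does not say what happens to surrogate code points")
    if cp <= 0xFFFF:
        return [cp - 0x800]                 # E000..FFFF -> D800..F7FF
    if cp <= 0x10FFFF:
        v = cp - 0x10000
        return [0xF800 + (v >> 10), 0xFC00 + (v & 0x3FF)]
    raise ValueError("code point out of range")


for cp in (0xE000, 0xFFFF, 0x10000, 0x10FFFF):
    print(hex(cp), " ".join(format(u, "04X") for u in utf16x_units(cp)))

# With this mapping, code unit order follows code point order.
samples = [0xD7FF, 0xE000, 0xFFFF, 0x10000, 0x10FFFF]
in_code_point_order = sorted(samples, key=utf16x_units) == sorted(samples)
print(in_code_point_order)
```

The same code unit D800 that means "lead surrogate" in UTF-16 means U+E000 here, which is the interoperability damage in concrete form.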

===========================================================

Case VI. "Ken's Horrible Vision of the Future with
    UTF-8 *and* UTF-8s"

   code point     UTF-8/8s           UTF-16     UTF-32

a. 00000000  <=>  00                 0000       00000000
b. 0000D7FF  <=>  ED 9F BF           D7FF       0000D7FF
g. 0000E000  <=>  EE 80 80           E000       0000E000
h. 0000FFFF  <=>  EF BF BF           FFFF       0000FFFF

   code point     UTF-8              UTF-16     UTF-32

i. 00010000  <=>  F0 90 80 80        D800 DC00  00010000
j. 0010FFFF  <=>  F4 8F BF BF        DBFF DFFF  0010FFFF

   code point     UTF-8s             UTF-16     UTF-32

i. 00010000  <=>  ED A0 80 ED B0 80  D800 DC00  00010000
j. 0010FFFF  <=>  ED AF BF ED BF BF  DBFF DFFF  0010FFFF

Round-tripping isolated surrogate code points:

   code point     UTF-8/8s           UTF-16     UTF-32

c. 0000D800  <=>  ED A0 80           D800       0000D800
d. 0000DBFF  <=>  ED AF BF           DBFF       0000DBFF
e. 0000DC00  <=>  ED B0 80           DC00       0000DC00
f. 0000DFFF  <=>  ED BF BF           DFFF       0000DFFF

Code point sequences that do not round-trip from UTF code
unit sequences. [Commentary by Ken: These also have to
map from irregular UTF-32 code unit sequences, as currently
defined.]:

   code point             UTF-8              UTF-32

k. 0000D800 0000DC00  =>  F0 90 80 80        0000D800 0000DC00
l. 0000DBFF 0000DFFF  =>  F4 8F BF BF        0000DBFF 0000DFFF

   code point             UTF-8s             UTF-32

k. 0000D800 0000DC00  =>  ED A0 80 ED B0 80  0000D800 0000DC00
l. 0000DBFF 0000DFFF  =>  ED AF BF ED BF BF  0000DBFF 0000DFFF

UTF code unit sequences that do not round-trip from code
points. (Irregular UTF-8/8s code unit sequences):

   code point     UTF-8

m. 00010000  <=   ED A0 80 ED B0 80
n. 0010FFFF  <=   ED AF BF ED BF BF

   code point     UTF-8s        

m. 00010000  <=   F0 90 80 80
n. 0010FFFF  <=   F4 8F BF BF

[Commentary by Ken: All generic UTF-8 handlers will have
to be armed with the expectation that they may run into
supplementary characters encoded either as UTF-8 or as UTF-8s.
All processing of UTF-8 will necessitate normalization
between the two forms, to avoid inconsistencies, round-trip
failures, and security issues. The actual API's that people
want to write: UTF8toUTF16, UTF16toUTF8, UTF8toUTF32,
UTF32toUTF8, etc., will be greatly complicated by this
situation, compared to the situation for Case I, "The way
God intended it to be."]
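
As a rough Python sketch of what such a handler might look like, here is a UTF8toUTF32-style converter that accepts either form and normalizes to a single code point. The name utf8_to_utf32_lenient is hypothetical, and relying on the "surrogatepass" error handler for the lenient first pass is an assumption of this illustration.

```python
def utf8_to_utf32_lenient(data: bytes) -> list:
    """Decode UTF-8 or UTF-8s bytes to a list of UTF-32 code units."""
    # First pass: let three-byte surrogate encodings through undisturbed.
    text = data.decode("utf-8", "surrogatepass")
    units = []
    i = 0
    while i < len(text):
        hi = ord(text[i])
        lo = ord(text[i + 1]) if i + 1 < len(text) else 0
        if 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF:
            # Second pass: fold a surrogate pair into one supplementary code point.
            units.append(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
            i += 2
        else:
            units.append(hi)   # includes isolated surrogates, rows c through f
            i += 1
    return units


from_utf8 = utf8_to_utf32_lenient(bytes.fromhex("F0908080"))
from_utf8s = utf8_to_utf32_lenient(bytes.fromhex("EDA080EDB080"))
print(from_utf8 == from_utf8s, [hex(u) for u in from_utf8])
```

Even this small version needs a second pass and a decision about isolated surrogates, and it cannot reproduce the UTF-32 column of rows k and l, because it always folds a pair into one code unit.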

--Ken