Re: And Visions of Sugar Plum UTF-8's Dance in Their Heads

Jianping Yang Tue, 12 Jun 2001 19:54:16 -0700



Kenneth Whistler wrote:

> Jianping wrote:
>
> > One thing that needs to be clarified here is that there is no four-byte encoding
> > in the UTF-8S proposal; a four-byte encoding is illegal, not irregular. As
> > everything in UTF-8S is perfect match to UTF-16, any blame to this proposal
> > also applies to UTF-16 encoding form.
>
> Well after a couple months arguing about this, it is nice to have
> this little detail drop into place. Perhaps in another couple
> of months we could have a complete specification, and then
> restart the argument.
>

This is not true. From the very beginning, we stated that a supplementary character
would be encoded as a pair of three-byte sequences. That is why we have a new
proposal; otherwise the proposal would be groundless, as it would be the same as UTF-8.
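
A minimal Python sketch of that rule, assuming UTF-8S simply writes each UTF-16
code unit with the ordinary 1-, 2- or 3-byte UTF-8 bit pattern. The function name
encode_utf8s is hypothetical and not part of any proposal text:

```python
def encode_utf8s(text: str) -> bytes:
    """Encode text via UTF-16 code units, each written as 1 to 3 bytes."""
    out = bytearray()
    # Go through UTF-16 first, so each supplementary character becomes a
    # surrogate pair; "surrogatepass" lets isolated surrogates through.
    utf16 = text.encode("utf-16-be", "surrogatepass")
    for i in range(0, len(utf16), 2):
        unit = (utf16[i] << 8) | utf16[i + 1]
        if unit < 0x80:
            out.append(unit)
        elif unit < 0x800:
            out += bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
        else:
            # Surrogates D800..DFFF land here, so a supplementary
            # character comes out as two three-byte sequences.
            out += bytes([0xE0 | (unit >> 12),
                          0x80 | ((unit >> 6) & 0x3F),
                          0x80 | (unit & 0x3F)])
    return bytes(out)


print(encode_utf8s("\U00010000").hex(" "))
```

Under this assumption no lead byte in the range F0..F4 is ever produced, which is
the difference from standard UTF-8.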

>
> [BTW, as Peter has recently noted: 4 lines of new content, quoting my
>  234 lines of content, with no commentary interspersed. This rampant
>  failure to edit reply-to's is threatening to bring the wrath of
>  Sarasvati back down on the list, folks.]
>
> So, given the new information that the four byte form is illegal, not irregular,
> in UTF-8s, here is a revised summary of UTF-8s:
>
> ===========================================================
>
> Case III. Code points U-0000D800..U-0000DFFF included
>         in the UTF's, using UTF-8s "The vision provided
>         by the Oracle."
>
>    code point     UTF-8s             UTF-16     UTF-32
>
> a. 00000000  <=>  00                 0000       00000000
> b. 0000D7FF  <=>  ED 9F BF           D7FF       0000D7FF
> g. 0000E000  <=>  EE 80 80           E000       0000E000
> h. 0000FFFF  <=>  EF BF BF           FFFF       0000FFFF
> i. 00010000  <=>  ED A0 80 ED B0 80  D800 DC00  00010000
> j. 0010FFFF  <=>  ED AF BF ED BF BF  DBFF DFFF  0010FFFF
>
> Round-tripping isolated surrogate code points:
>
> c. 0000D800  <=>  ED A0 80           D800       0000D800
> d. 0000DBFF  <=>  ED AF BF           DBFF       0000DBFF
> e. 0000DC00  <=>  ED B0 80           DC00       0000DC00
> f. 0000DFFF  <=>  ED BF BF           DFFF       0000DFFF
>
> Code point sequences that do not round-trip from all UTF code
> unit sequences. (Could be termed "irregular code point
> sequences" --Ken):
>
> k. 0000D800 0000DC00  =>  ED A0 80 ED B0 80  D800 DC00  0000D800 0000DC00
> l. 0000DBFF 0000DFFF  =>  ED AF BF ED BF BF  DBFF DFFF  0000DBFF 0000DFFF
>
> =============================================================
>
> What Jianping is saying now is that F0..F4 are illegal as
> initiators in UTF-8s. (They are legal initiators in UTF-8.)
>
> Also, judging from his statement that "everything in UTF-8S is
> perfect match to UTF-16", it is quite clear that UTF-8s does
> *not* meet the Unicode Standard's definition of a UTF. To be
> a UTF, it has to be a reversible transform of code points (or
> Unicode scalar values -- there is some argument about which).
>

Does UTF-16 meet that definition? If UTF-16 does, UTF-8S should too.

>
> But UTF-8s is designed and conceived as a CODE UNIT TRANSFORM
> of UTF-16. (A "CUT", not a "UTF".)
>
> Basically, instead of starting with the code points, and deriving
> the three UTF's, for UTF-8s you start with UTF-16 and derive
> UTF-8s directly from it. (This is why I have been pounding on
> the point that in order to understand the Oracle proposal, you
> have to think in terms of the UTF-16 <==> UTF-8 convertors,
> rather than in terms of the definitional UTF's.)
>

This is your perception.

>
> In other words, while others are seeing:
>
> U-00010000  ==>  ED A0 80 ED B0 80  in UTF-8s
>             ==>  D800 DC00          in UTF-16
>
> Oracle is seeing:
>
>       (D800)(DC00) <==> (ED A0 80)(ED B0 80)
>

That is also your perception, not Oracle's, as we already support standard UTF-8
encoding in 9i.

>
> and pointing out the tremendous simplicity of the fact that
> a code point, err... code unit in UTF-16 always corresponds
> *exactly* to a code point, errr... well a 1-, 2-, or 3- code
> unit sequence in UTF-8s that always corresponds to a, umm..
> character, well, sort of.
>

It is meaningless to examine the individual bytes of a UTF-8S encoding, and the same
applies to UTF-8. A code unit in UTF-8S should be a 1-, 2- or 3-byte unit, and one or
two code units make up one code point. If you look at each byte of UTF-8S/UTF-8 and
truncate at random, you will get meaningless bytes. The best practice here is to
treat each 1-, 2- or 3-byte encoding as one unit.
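
As a rough illustration of treating each 1-, 2- or 3-byte sequence as one unit, here
is a Python sketch that splits a byte string on lead bytes and rejects truncated
input. The name split_units is hypothetical, and rejecting lead bytes F0 and above
is an assumption based on the statement that four-byte forms are illegal in UTF-8S:

```python
def split_units(data: bytes) -> list:
    """Split a UTF-8S byte string into 1-, 2- or 3-byte units."""
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            n = 1
        elif 0xC2 <= lead <= 0xDF:
            n = 2
        elif 0xE0 <= lead <= 0xEF:
            n = 3
        else:
            # Stray continuation bytes, C0/C1, and F0.. lead bytes.
            raise ValueError(f"illegal lead byte 0x{lead:02X} at offset {i}")
        unit = data[i:i + n]
        # A unit cut short, or with a bad trailing byte, is meaningless.
        if len(unit) != n or any((b & 0xC0) != 0x80 for b in unit[1:]):
            raise ValueError(f"truncated or malformed unit at offset {i}")
        units.append(unit)
        i += n
    return units


print(split_units(bytes.fromhex("ED A0 80 ED B0 80")))
```

Truncating the same input after its fourth byte raises ValueError instead of
yielding a partial unit.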

>
> Now, perhaps Jianping will care to step in and clarify how UTF-32
> fits in this picture. How, for example, are the irregular UTF-32
> sequences in k and l above to be treated? As I have indicated?
> (in which case, as Peter points out, there is an ambiguity in
> the interpretation of any 6-byte UTF-8s representation) Or in
> some other manner? And if so, how so?
>

Before answering these questions, just replace UTF-8S with UTF-16: can you give me
good answers here? If there is any ambiguity for UTF-8S, there is the same ambiguity
for UTF-16.

I don't think there are grounds here to argue this syntax or semantics issue, as
UTF-8S should meet the standard's requirements in exactly the same way as UTF-16. I
think the key issue here is its benefit and its implications for implementors, and I
think we should find the best balance between the two.

Regards,
Jianping.


>
> --Ken

