Kenneth Whistler wrote:

> Jianping wrote:
>
> > One thing needs to clarify here is that there is no four byte encoding in
> > UTF-8S proposal and four byte encoding is illegal but not irregular. As
> > everything in UTF-8S is perfect match to UTF-16, any blame to this proposal
> > also applies to UTF-16 encoding form.
>
> Well after a couple months arguing about this, it is nice to have
> this little detail drop into place. Perhaps in another couple
> of months we could have a complete specification, and then
> restart the argument.
>

This is not true. From the very beginning, we stated that a supplementary
character would be encoded as a pair of three-byte sequences. That is why we
have a new proposal; otherwise the proposal would be groundless, since it
would be the same as UTF-8.

>
> [BTW, as Peter has recently noted: 4 lines of new content, quoting my
> 234 lines of content, with no commentary interspersed. This rampant
> failure to edit reply-to's is threatening to bring the wrath of
> Sarasvati back down on the list, folks.]
>
> So, given the new information that the four byte form is illegal, not irregular,
> in UTF-8s, here is a revised summary of UTF-8s:
>
> ===========================================================
>
> Case III. Code points U-0000D800..U-0000DFFF included
>           in the UTF's, using UTF-8s "The vision provided
>           by the Oracle."
>
>      code point        UTF-8s             UTF-16     UTF-32
>
> a.   00000000    <=>   00                 0000       00000000
> b.   0000D7FF    <=>   ED 9F BF           D7FF       0000D7FF
> g.   0000E000    <=>   EE 80 80           E000       0000E000
> h.   0000FFFF    <=>   EF BF BF           FFFF       0000FFFF
> i.   00010000    <=>   ED A0 80 ED B0 80  D800 DC00  00010000
> j.   0010FFFF    <=>   ED AF BF ED BF BF  DBFF DFFF  0010FFFF
>
> Round-tripping isolated surrogate code points:
>
> c.   0000D800    <=>   ED A0 80           D800       0000D800
> d.   0000DBFF    <=>   ED AF BF           DBFF       0000DBFF
> e.   0000DC00    <=>   ED B0 80           DC00       0000DC00
> f.   0000DFFF    <=>   ED BF BF           DFFF       0000DFFF
>
> Code point sequences that do not round-trip from all UTF code
> unit sequences.
> (Could be termed "irregular code point
> sequences" --Ken):
>
> k.   0000D800 0000DC00  =>  ED A0 80 ED B0 80  D800 DC00  0000D800 0000DC00
> l.   0000DBFF 0000DFFF  =>  ED AF BF ED BF BF  DBFF DFFF  0000DBFF 0000DFFF
>
> =============================================================
>
> What Jianping is saying now is that F0..F4 are illegal as
> initiators in UTF-8s. (They are legal initiators in UTF-8.)
>
> Also, judging from his statement that "everything in UTF-8S is
> perfect match to UTF-16", it is quite clear that UTF-8s does
> *not* meet the Unicode Standard's definition of a UTF. To be
> a UTF, it has to be a reversible transform of code points (or
> Unicode scalar values -- there is some argument about which).
>

Does UTF-16 meet that definition? If UTF-16 does, UTF-8S should too.

>
> But UTF-8s is designed and conceived as a CODE UNIT TRANSFORM
> of UTF-16. (A "CUT", not a "UTF".)
>
> Basically, instead of starting with the code points, and deriving
> the three UTF's, for UTF-8s you start with UTF-16 and derive
> UTF-8s directly from it. (This is why I have been pounding on
> the point that in order to understand the Oracle proposal, you
> have to think in terms of the UTF-16 <==> UTF-8 convertors,
> rather than in terms of the definitional UTF's.)
>

This is your perception.

>
> In other words, while others are seeing:
>
>    U-00010000 ==> ED A0 80 ED B0 80 in UTF-8s
>               ==> D800 DC00 in UTF-16
>
> Oracle is seeing:
>
>    (D800)(DC00) <==> (ED A0 80)(ED B0 80)
>

That is also your perception, but it is not Oracle's view, as we already
support the standard UTF-8 encoding in 9i.

>
> and pointing out the tremendous simplicity of the fact that
> a code point, err... code unit in UTF-16 always corresponds
> *exactly* to a code point, errr... well a 1-, 2-, or 3- code
> unit sequence in UTF-8s that always corresponds to a, umm..
> character, well, sort of.
>

It is meaningless to examine each individual byte of a UTF-8S encoding, and
this also applies to UTF-8.
A code unit in UTF-8S should be understood as a 1-, 2-, or 3-byte unit, and
one or two code units make up one code point. If you look at individual bytes
of UTF-8S or UTF-8 and truncate at random, you will get meaningless bytes.
The best practice is to treat each 1-, 2-, or 3-byte encoding as one unit.

>
> Now, perhaps Jianping will care to step in and clarify how UTF-32
> fits in this picture. How, for example, are the irregular UTF-32
> sequences in k and l above to be treated? As I have indicated?
> (in which case, as Peter points out, there is an ambiguity in
> the interpretation of any 6-byte UTF-8s representation) Or in
> some other manner? And if so, how so?
>

Before answering these questions, just replace UTF-8S with UTF-16: can you
give me good answers then? If there is any ambiguity for UTF-8S, the same is
true of UTF-16. I don't think there are grounds here to argue this syntax or
semantics issue, as UTF-8S should meet the standard's requirements in exactly
the same way as UTF-16. I think the key issue is its benefit and its
implications for implementors, and I think we should find the best balance
between the two.

Regards,
Jianping.

>
> --Ken
begin:vcard
n:Yang;Jianping
tel;fax:650-506-7225
tel;work:650-506-4865
x-mozilla-html:FALSE
org:Server Gobalization Technology;Server Technology
version:2.1
email;internet:[EMAIL PROTECTED]
title:Senior Development Manager
adr;quoted-printable:;;500 Oracle Packway=0D=0AM/S 659407;Redwood Shores;CA;94065;
fn:Jianping Yang
end:vcard
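
For readers following the byte tables quoted in this message, here is a
minimal sketch of the mapping as Ken's summary describes it: each code point
is first turned into UTF-16 code units, and each 16-bit unit is then written
as a 1-, 2-, or 3-byte sequence. This is only one reading of the table, not a
specification from the proposal, and the function names (to_utf16_units,
encode_unit, encode_utf8s) are hypothetical.

```python
def to_utf16_units(code_points):
    """Turn code points into UTF-16 code units (hypothetical helper)."""
    units = []
    for cp in code_points:
        if cp >= 0x10000:
            # Supplementary code point: split into a surrogate pair.
            offset = cp - 0x10000
            units.append(0xD800 | (offset >> 10))    # high surrogate
            units.append(0xDC00 | (offset & 0x3FF))  # low surrogate
        else:
            # BMP values, including isolated surrogates, pass through as-is.
            units.append(cp)
    return units


def encode_unit(unit):
    """Encode one 16-bit code unit as a 1-, 2-, or 3-byte sequence."""
    if unit < 0x80:
        return bytes([unit])
    if unit < 0x800:
        return bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def encode_utf8s(code_points):
    """Assumed UTF-8s mapping: per-code-unit transform of UTF-16."""
    return b"".join(encode_unit(u) for u in to_utf16_units(code_points))


# Row i (one supplementary code point) and row k (two isolated surrogates)
# of the quoted table produce the same six bytes under this reading.
row_i = encode_utf8s([0x10000])
row_k = encode_utf8s([0xD800, 0xDC00])
print(row_i.hex(" "), row_i == row_k)
```

Under this reading, a decoder that sees the six bytes cannot tell whether they
came from row i or row k, which is the ambiguity Ken's question about UTF-32
points at.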

