Kenneth Whistler wrote:

> Jianping responded:
>
> > Kenneth Whistler wrote:
> >
> > > Jianping wrote:
> > >
> > > > One thing needs to clarify here is that there is no four byte encoding in
> > > > UTF-8S proposal and four byte encoding is illegal but not irregular. As
> > > > everything in UTF-8S is perfect match to UTF-16, any blame to this proposal
> > > > also applies to UTF-16 encoding form.
> > >
> > > Well after a couple months arguing about this, it is nice to have
> > > this little detail drop into place. Perhaps in another couple
> > > of months we could have a complete specification, and then
> > > restart the argument.
> >
> > This is not truth. In its very beginning, we stated that the supplementary character
> > will be encoded as a pair of three bytes. That's why we have a new proposal,
> > otherwise the proposal will be groundless as it will be same as UTF-8.
>
> You're missing the point. To date the specification for UTF-8s
> has not been complete. People have openly speculated on the
> list (before my message under this new topic) about what the
> status of a four-byte supplementary character representation in
> UTF-8s would be.
>
> You did state that supplementary characters are encoded in UTF-8s
> as pairs of three bytes. Everybody understood that. That was the
> well-formedness condition.
>
> What you didn't state were the ill-formedness conditions. Was
> valid UTF-8 considered allowable or not under UTF-8s? Was it
> ill-formed, and if so, how interpreted, or not?
>
> What you finally stated today is that <F0 90 80 80> is flat-out
> *illegal* in UTF-8s. That was a missing piece of the puzzle for anyone
> trying to interpret what you are proposing.
>

In UTF-8S, there should be no irregular forms; should we repeat history again?
Nobody except you thought that a 4-byte encoding was allowed in UTF-8S.
>
> > > =============================================================
> > >
> > > What Jianping is saying now is that F0..F4 are illegal as
> > > initiators in UTF-8s. (They are legal initiators in UTF-8.)
> > >
> > > Also, judging from his statement that "everything in UTF-8S is
> > > perfect match to UTF-16", it is quite clear that UTF-8s does
> > > *not* meet the Unicode Standard's definition of a UTF. To be
> > > a UTF, it has to be a reversible transform of code points (or
> > > Unicode scalar values -- there is some argument about which).
> > >
> >
> > Does UTF-16 meet? If UTF-16 does, UTF-8S should.
>
> UTF-16 is a reversible transform of Unicode scalar values
> to 16-bit code units.
>
> By contrast, you have conceived and defined UTF-8s as a
> reversible transform of UTF-16 16-bit code units to 8-bit
> code units.
>
> Not the same thing.
>
> Presumably UTF-8s could also be defined as a UTF, but
> that isn't how you have been presenting or (apparently)
> conceiving it.
>
> > > But UTF-8s is designed and conceived as a CODE UNIT TRANSFORM
> > > of UTF-16. (A "CUT", not a "UTF".)
> > >
> > > Basically, instead of starting with the code points, and deriving
> > > the three UTF's, for UTF-8s you start with UTF-16 and derive
> > > UTF-8s directly from it. (This is why I have been pounding on
> > > the point that in order to understand the Oracle proposal, you
> > > have to think in terms of the UTF-16 <==> UTF-8 convertors,
> > > rather than in terms of the definitional UTF's.)
> > >
> >
> > This is your perception.
>
> Yep, and I stand by it.
>
> > > In other words, while others are seeing:
> > >
> > >    U-00010000 ==> ED A0 80 ED B0 80 in UTF-8s
> > >               ==> D800 DC00 in UTF-16
> > >
> > > Oracle is seeing:
> > >
> > >    (D800)(DC00) <==> (ED A0 80)(ED B0 80)
> > >
> >
> > That's also your perception but not Oracle as we already support standard UTF-8
> > encoding in 9i.
>
> How is Oracle's support for standard UTF-8 relevant to the conceptual
> definition of UTF-8s?

That means we do recognize U-00010000 in our implementation for UTF formats.

>
> > > and pointing out the tremendous simplicity of the fact that
> > > a code point, err... code unit in UTF-16 always corresponds
> > > *exactly* to a code point, errr... well a 1-, 2-, or 3- code
> > > unit sequence in UTF-8s that always corresponds to a, umm..
> > > character, well, sort of.
> > >
> >
> > It is meaningless to examine each bytes of UTF-8S encoding, and this also applies to
> > UTF-8. What is code unit in UTF-8S should be 1-, 2- or 3-bytes unit, and one or two
> > code-unit will be one codepoint. If we still look at each byte of UTF-8S/UTF-S and
> > make random truncation, you will get meaningless bytes there. The best practice here
> > is that you have to treat this 1-, 2-, or 3-bytes encoding as one unit.
>
> Beautifully put. I think you have just confirmed my argument.
>
> > > Now, perhaps Jianping will care to step in and clarify how UTF-32
> > > fits in this picture. How, for example, are the irregular UTF-32
> > > sequences in k and l above to be treated? As I have indicated?
> > > (in which case, as Peter points out, there is an ambiguity in
> > > the interpretation of any 6-byte UTF-8s representation) Or in
> > > some other manner? And if so, how so?
> > >
> >
> > Before answering these questions, just replace UTF-8S by UTF-16, can you give me
> > good answers here? If this is any ambiguity for UTF-8S, so as UTF-16.
>
> Certainly I can give you a good answer. Return to Case I of my original document.
>
> In that formulation, the code point U-0000D800 and the code point
> U-0000DC00 are not mapped by the UTF's at all. <U-0000D800, U-0000DC00>
> is just an illegal representation.
>
> The UTF-16 code unit sequence <D800 DC00> *always* corresponds to U+10000.
> It also always corresponds to the UTF-32 code unit sequence <00010000>
> and the UTF-8 code unit sequence <F0 90 80 80>.
>
> No ambiguities, no mapping issues.
>
> Now please answer the question for UTF-32 under your formulation of
> UTF-8s.

My answer here is quite simple:

The UTF-8S code unit sequence <ED A0 80 ED B0 80> *always* corresponds to U+10000.
It also always corresponds to the UTF-32 code unit sequence <00010000>
and the UTF-8 code unit sequence <F0 90 80 80>.

No ambiguities, no mapping issues.

Regards,
Jianping.

>
> > I don't think there is ground here to argue this syntax or semantics issue as UTF-8S
> > should meet the standard requirement exactly the same way as UTF-16. I think the key
> > issue here is its benefit and its implication to the implementor, and I think we
> > should get a best balance between these two.
>
> No comment.
>
> --Ken
>
> >
> > Regards,
> > Jianping.
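
For readers who want to check the byte sequences quoted in this thread, here is a minimal Python sketch of the pair-of-three-bytes scheme. It assumes UTF-8S is obtained by applying the ordinary 1-, 2-, or 3-byte UTF-8 bit pattern to each UTF-16 code unit, which is how the proposal is described above. No complete UTF-8S specification is quoted in this thread, so the function names (utf16_code_units, encode_unit, utf8s_encode) are hypothetical and the error handling is only a guess.

```python
def utf16_code_units(cp: int) -> list[int]:
    """Split a Unicode scalar value into one or two UTF-16 code units."""
    if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000
    # High (lead) surrogate, then low (trail) surrogate.
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


def encode_unit(unit: int) -> bytes:
    """Apply the 1-, 2-, or 3-byte UTF-8 bit pattern to one 16-bit code unit.

    Assumption: surrogate code units are encoded like any other 16-bit
    value, which standard UTF-8 does not permit.
    """
    if unit < 0x80:
        return bytes([unit])
    if unit < 0x800:
        return bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def utf8s_encode(cp: int) -> bytes:
    """Hypothetical UTF-8S encoder: UTF-16 code units, each encoded separately."""
    return b"".join(encode_unit(u) for u in utf16_code_units(cp))


if __name__ == "__main__":
    cp = 0x10000
    print("UTF-8S:", utf8s_encode(cp).hex(" "))
    print("UTF-8 :", chr(cp).encode("utf-8").hex(" "))
    print("UTF-16:", chr(cp).encode("utf-16-be").hex(" "))
```

Under this assumption, characters in the BMP get the same bytes as in standard UTF-8, and only supplementary characters differ: they become two 3-byte sequences, so the lead bytes F0..F4 never appear in the output.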

