Ken,

From your analysis, I am even more convinced that we need a UTF-8S, not
only for the binary order but also because of this ambiguity, which applies
to both UTF-8S and UTF-16. Since the proposed UTF-8S encoding is logically
equivalent to UTF-16, the two share the same property, which differs from
that of UTF-8 and UTF-32. We need either to fix UTF-16 so that it has the
same property as UTF-8, or to define another form, UTF-8S. This would fix
the following problem, for example: a search engine searching for the
character U-00010000 in a UTF-8 string cannot find it. But when the UTF-8
is converted into UTF-16, it can be found there, because <ED A0 80> and
<ED B0 80> are converted into U-00010000 in UTF-16.

Regards,
Jianping.

Kenneth Whistler wrote:

> Jianping,
>
> > I don't get point from this argument as UTF-8S is exactly mapped to UTF-16 in
> > UTF-16 code unit which means one UTF-16 code unit will be mapped to either one,
> > two, or three bytes in UTF-8S. So if you are saying there is ambiguous in
> > UTF-8S, it should also apply to UTF-16, which does not make sense to me.
>
> I think the reason you are not following the argument that Doug and Peter
> have been presenting is that you are thinking in terms of a UTF-8s to
> UTF-16 converter, instead of thinking of the UTF's as they are defined
> in relation to scalar values. I.e.,
>
>    UTF-8s <==> UTF-16
>
> instead of:
>
>          |==> UTF-8
>   USV <==|==> UTF-16
>          |==> UTF-32
>
> Let me represent the Unicode Scalar Values (USV) in the 10646 *long*
> notation, so you can't confuse them with UTF-16 code unit values.
>
>                 |==> <F0 90 80 80>
>   U-00010000 <==|==> <D800 DC00>
>                 |==> <00010000>
>
> That is the current situation for UTF-8, UTF-16, and UTF-32 as
> defined in the standard.
> You want to introduce a UTF-8s, which
> would put us in the following situation:
>
>                 |==> <ED A0 80 ED B0 80>  UTF-8s
>                 |==> <F0 90 80 80>        UTF-8
>   U-00010000 <==|==> <D800 DC00>          UTF-16
>                 |==> <00010000>           UTF-32
>
> Then for interworking, you would choose UTF-8s and UTF-16, since
> they have the identical binary ordering properties you want,
> and simplify your conversion and allocation handling as well.
>
> Now the conundrum that Doug and Peter are putting out to you is
> what do you do about the handling of isolated surrogates, which
> the standard also requires you to have a unique sequence for
> (if we consider them to be Unicode scalar values)? Thus:
>
>                 |==> <ED A0 80>   UTF-8s
>                 |==> <ED A0 80>   UTF-8
>   U-0000D800 <==|==> <D800>       UTF-16
>                 |==> <0000D800>   UTF-32
>
> Now let's put two of those isolated surrogate code points
> together in sequence:
>
>                               |==> <ED A0 80 ED B0 80>  UTF-8s
>                               |==> <ED A0 80 ED B0 80>  UTF-8
>   <U-0000D800, U-0000DC00> <==|==> <D800 DC00>          UTF-16
>                               |==> <0000D800 0000DC00>  UTF-32
>
> Here, arguably, both UTF-32 and UTF-8 would maintain a unique,
> roundtrippable distinction between two isolated surrogate
> code points (i.e. Unicode scalar values) in sequence, and
> an ordinary supplemental code point. However, UTF-16 and
> UTF-8s would not. For UTF-16 this is understandable, since
> it was *designed* that way. It cannot really represent sequences of
> isolated surrogate code points, since it uses surrogate code
> *units* as part of the transformation. But by making UTF-8s
> mimic UTF-16, the problem gets worse. The UTF-8s sequence
> cannot distinguish the two either, so it is failing of
> the "unique sequence" requirement. But what is worse, the
> supposedly regular UTF-8s sequence cannot be distinguished from
> the *irregular* UTF-8 sequence for the same thing.
>
> Personally, I think there are other conundrums in the last two
> examples, as applied to UTF-16, that would lead me to prefer
> restricting "Unicode scalar value" itself to non-surrogate
> code points for the purposes of the definition of the UTF's,
> and then leave the last two examples to the error-handling
> exceptions. But in any case, the introduction of UTF-8s
> doesn't make the situation better for these definitions --
> it just creates more points of confusion and inconsistency
> in the definitions.
>
> --Ken
>
> > [EMAIL PROTECTED] wrote:
> >
> > > On 06/07/2001 10:38:15 AM DougEwell2 wrote:
> > >
> > > >The ambiguity comes from the fact that, if I am using UTF-8s and I want to
> > > >represent the sequence of (invalid) scalar values <D800 DC00>, I must use the
> > > >UTF-8s sequence <ED A0 80 ED B0 80>, and if I want to represent the (valid)
> > > >scalar value <10000>, I must *also* use the UTF-8s sequence <ED A0 80 ED B0
> > > >80>. Unless you have a crystal ball or are extremely good with tarot cards,
> > > >you have no way, upon reverse-mapping the UTF-8s sequence <ED A0 80 ED B0
> > > >80>, to know whether it is supposed to be mapped back to <D800 DC00> or to
> > > ><10000>.
> > >
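
The byte sequences quoted in this thread can be reproduced with a short Python sketch. It assumes, as the thread describes, that UTF-8s encodes each UTF-16 code unit separately using the ordinary one- to three-byte UTF-8 bit pattern. The function names (utf16_units, encode_unit, utf8s_encode) are hypothetical and are not taken from any proposal text or library.

```python
def utf16_units(cp):
    """Return the UTF-16 code units for one code point (no validity check)."""
    if cp < 0x10000:
        return [cp]  # includes isolated surrogates, passed through as-is
    cp -= 0x10000
    return [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]


def encode_unit(u):
    """Encode one 16-bit value with the 1- to 3-byte UTF-8 bit pattern."""
    if u < 0x80:
        return bytes([u])
    if u < 0x800:
        return bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
    return bytes([0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F)])


def utf8s_encode(code_points):
    """Assumed UTF-8s behaviour: encode via UTF-16 code units, unit by unit."""
    out = bytearray()
    for cp in code_points:
        for u in utf16_units(cp):
            out += encode_unit(u)
    return bytes(out)


supplementary = utf8s_encode([0x10000])          # one supplementary code point
surrogates = utf8s_encode([0xD800, 0xDC00])      # two isolated surrogates
standard_utf8 = chr(0x10000).encode("utf-8")     # Python's own UTF-8 encoder

print(supplementary.hex(" "))       # ed a0 80 ed b0 80
print(surrogates.hex(" "))          # ed a0 80 ed b0 80
print(supplementary == surrogates)  # True: the two inputs are indistinguishable
print(standard_utf8.hex(" "))       # f0 90 80 80
print(standard_utf8 in supplementary)  # False: a UTF-8 byte search misses it
```

Both inputs produce <ED A0 80 ED B0 80>, which is the reverse-mapping ambiguity Doug describes, and a byte-level search for the UTF-8 form <F0 90 80 80> does not match the six-byte form, which is the search mismatch raised at the top of this message.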

