Re: UTF-8 syntax

Jianping Yang Fri, 08 Jun 2001 12:02:49 -0700

Ken,

From your analysis, I am even more convinced that we need a UTF-8S, not only for the
binary order but also because this ambiguity applies to both UTF-8S and UTF-16. Since the
proposed UTF-8S encoding is logically equivalent to UTF-16, the two share the same
property, which differs from that of UTF-8 and UTF-32. So we need either to fix UTF-16
so that it has the same property as UTF-8, or to define a separate encoding, UTF-8S.
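
To illustrate the binary-order point, here is a minimal Python sketch. The helper
utf8s_encode is hypothetical (it is not from any standard or library) and assumes
that UTF-8S simply encodes each UTF-16 code unit on its own as one to three bytes.

```python
def utf8s_encode(text: str) -> bytes:
    """Hypothetical UTF-8S: encode each UTF-16 code unit separately (1-3 bytes)."""
    utf16 = text.encode("utf-16-be", "surrogatepass")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # 'surrogatepass' lets the UTF-8 codec emit ED A0..BF xx for surrogates.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

bmp = "\uff61"        # a BMP character above the surrogate range
supp = "\U00010000"   # the first supplementary character

# Does the BMP character sort before the supplementary one in each form?
utf8_order = bmp.encode("utf-8") < supp.encode("utf-8")            # EF BD A1 vs F0 90 80 80
utf16_order = bmp.encode("utf-16-be") < supp.encode("utf-16-be")   # FF61 vs D800 DC00
utf8s_order = utf8s_encode(bmp) < utf8s_encode(supp)               # EF BD A1 vs ED A0 80 ...

print(utf8_order, utf16_order, utf8s_order)
```

Under these assumptions the UTF-8 comparison disagrees with the UTF-16 one, while
the UTF-8S comparison agrees with UTF-16.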

A separate UTF-8S would fix, for example, the following problem:
A search engine looking for the character U-00010000 in a UTF-8 string cannot find
it there. But once the UTF-8 is converted into UTF-16, the search does find it,
because <ED A0 80> and <ED B0 80> are converted into U-00010000 in UTF-16.
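
A rough Python sketch of this search scenario follows. It assumes a lenient
converter that maps each three-byte surrogate sequence to one UTF-16 code unit;
Python's 'surrogatepass' error handler stands in for that converter here, and the
variable names are only illustrative.

```python
# Byte string holding the six-byte form <ED A0 80 ED B0 80>.
data = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])
needle = "\U00010000"

# Searching for the four-byte UTF-8 form <F0 90 80 80> finds nothing.
found_in_utf8 = needle.encode("utf-8") in data

# Lenient conversion: each three-byte sequence becomes one surrogate code unit.
as_code_units = data.decode("utf-8", "surrogatepass")
utf16 = as_code_units.encode("utf-16-be", "surrogatepass")   # D8 00 DC 00

# In UTF-16 the two code units are indistinguishable from U-00010000.
found_in_utf16 = needle.encode("utf-16-be") in utf16

print(found_in_utf8, found_in_utf16)
```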

Regards,
Jianping.




Kenneth Whistler wrote:

> Jianping,
>
> > I don't get the point of this argument, as UTF-8S maps exactly to UTF-16 at the
> > UTF-16 code unit level, which means one UTF-16 code unit is mapped to either one,
> > two, or three bytes in UTF-8S. So if you are saying there is an ambiguity in
> > UTF-8S, it should also apply to UTF-16, which does not make sense to me.
>
> I think the reason you are not following the argument that Doug and Peter
> have been presenting is that you are thinking in terms of a UTF-8s to
> UTF-16 converter, instead of thinking of the UTF's as they are defined
> in relation to scalar values. I.e.,
>
>                 UTF-8s  <==>  UTF-16
>
> instead of:
>                         |==> UTF-8
>                 USV  <==|==> UTF-16
>                         |==> UTF-32
>
> Let me represent the Unicode Scalar Values (USV) in the 10646 *long*
> notation, so you can't confuse them with UTF-16 code unit values.
>
>                                |==> <F0 90 80 80>
>                 U-00010000  <==|==> <D800 DC00>
>                                |==> <00010000>
>
> That is the current situation for UTF-8, UTF-16, and UTF-32 as
> defined in the standard. You want to introduce a UTF-8s, which
> would put us in the following situation:
>
>                                |==> <ED A0 80 ED B0 80>   UTF-8s
>                                |==> <F0 90 80 80>         UTF-8
>                 U-00010000  <==|==> <D800 DC00>           UTF-16
>                                |==> <00010000>            UTF-32
>
> Then for interworking, you would choose UTF-8s and UTF-16, since
> they have the identical binary ordering properties you want,
> and simplify your conversion and allocation handling as well.
>
> Now the conundrum that Doug and Peter are putting out to you is
> what do you do about the handling of isolated surrogates, which
> the standard also requires you to have a unique sequence for
> (if we consider them to be Unicode scalar values)? Thus:
>
>                                |==> <ED A0 80>            UTF-8s
>                                |==> <ED A0 80>            UTF-8
>                 U-0000D800  <==|==> <D800>                UTF-16
>                                |==> <0000D800>            UTF-32
>
> Now let's put two of those isolated surrogate code points
> together in sequence:
>                                |==> <ED A0 80 ED B0 80>   UTF-8s
>                                |==> <ED A0 80 ED B0 80>   UTF-8
>   <U-0000D800, U-0000DC00>  <==|==> <D800 DC00>           UTF-16
>                                |==> <0000D800 0000DC00>   UTF-32
>
> Here, arguably, both UTF-32 and UTF-8 would maintain a unique,
> roundtrippable distinction between two isolated surrogate
> code points (i.e. Unicode scalar values) in sequence, and
> an ordinary supplemental code point. However, UTF-16 and
> UTF-8s would not. For UTF-16 this is understandable, since
> it was *designed* that way. It cannot really represent sequences of
> isolated surrogate code points, since it uses surrogate code
> *units* as part of the transformation. But by making UTF-8s
> mimic UTF-16, the problem gets worse. The UTF-8s sequence
> cannot distinguish the two either, so it fails
> the "unique sequence" requirement. But what is worse, the
> supposedly regular UTF-8s sequence cannot be distinguished from
> the *irregular* UTF-8 sequence for the same thing.
>
> Personally, I think there are other conundrums in the last two
> examples, as applied to UTF-16, that would lead me to prefer
> restricting "Unicode scalar value" itself to non-surrogate
> code points for the purposes of the definition of the UTF's,
> and then leave the last two examples to the error-handling
> exceptions. But in any case, the introduction of UTF-8s
> doesn't make the situation better for these definitions --
> it just creates more points of confusion and inconsistency
> in the definitions.
>
> --Ken
>
> > [EMAIL PROTECTED] wrote:
> >
> > > On 06/07/2001 10:38:15 AM DougEwell2 wrote:
> > >
> > > >The ambiguity comes from the fact that, if I am using UTF-8s and I want to
> > > >represent the sequence of (invalid) scalar values <D800 DC00>, I must use the
> > > >UTF-8s sequence <ED A0 80 ED B0 80>, and if I want to represent the (valid)
> > > >scalar value <10000>, I must *also* use the UTF-8s sequence <ED A0 80 ED B0
> > > >80>.  Unless you have a crystal ball or are extremely good with tarot cards,
> > > >you have no way, upon reverse-mapping the UTF-8s sequence <ED A0 80 ED B0
> > > >80>, to know whether it is supposed to be mapped back to <D800 DC00> or to
> > > ><10000>.
> > >
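
The ambiguity described in the quoted text can be checked with a short Python
sketch. The function utf8s_encode is a hypothetical encoder written only for this
illustration; it assumes UTF-8s splits supplementary code points into a surrogate
pair and then encodes every 16-bit unit as ordinary UTF-8 bytes.

```python
def utf8s_encode(code_points):
    """Hypothetical UTF-8s encoder over a list of integer code points."""
    out = bytearray()
    for cp in code_points:
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair first.
            v = cp - 0x10000
            units = [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
        else:
            units = [cp]
        for unit in units:
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

supplementary = utf8s_encode([0x10000])          # the valid scalar value
surrogate_pair = utf8s_encode([0xD800, 0xDC00])  # two isolated surrogates

# Both inputs yield <ED A0 80 ED B0 80>, so a decoder cannot tell them apart.
utf8s_collides = supplementary == surrogate_pair

# UTF-8 (with surrogates let through) keeps the two inputs distinct.
utf8_distinct = (chr(0x10000).encode("utf-8")
                 != "\ud800\udc00".encode("utf-8", "surrogatepass"))

print(supplementary.hex(" "), utf8s_collides, utf8_distinct)
```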

