In a message dated 2001-06-07 1:03:04 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  >But definition D29 says that a UTF must round-trip these invalid code
>  >points, so we have no choice but to interpret them as <D800 DC00>.
>  >That is why UTF-8s is ambiguous.  The sequence <ED A0 80 ED B0 80>
>  >could be mapped as either <D800 DC00>, because D29 says you have to
>  >allow for that, or as <10000>, because that is the real intent.
>  
>  Well, I don't find round-trip implied in D29, but it does say that the
>  mapping from the CCS to 8-bit code sequences is unique:

The (unnumbered) paragraph immediately following D29 is what I was referring 
to:

<quote emphasis=original>
Because every Unicode coded character sequence maps to a unique sequence of 
code values in a given UTF, a reverse mapping can be derived.  Thus every UTF 
supports *lossless round-trip transcoding*:  mapping from any Unicode coded 
character sequence S to a sequence of code values and back will produce S 
again.  To ensure that round-trip transcoding is possible, a UTF mapping 
*must also* map invalid Unicode scalar values to unique code value sequences.
These invalid scalar values include FFFE, FFFF, and unpaired surrogates.
</quote>

I assume this paragraph, although unnumbered, is intended to supplement and 
clarify D29, and so in a sense it is part of D29.  (What other reason could 
it have for being there?)

(N.B.  The list of invalid scalar values also includes *all* values of the 
form U+xxFFFE and U+xxFFFF, as well as U+FDD0 through U+FDEF.)
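The full set of noncharacters mentioned in the N.B. can be captured in a one-line predicate (my own illustrative sketch, not part of the thread):

```python
def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF, and for U+xxFFFE / U+xxFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

# Examples: U+FFFE and U+10FFFF are noncharacters; U+0041 is not.
assert is_noncharacter(0xFFFE)
assert is_noncharacter(0x10FFFF)
assert not is_noncharacter(0x0041)
```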

>  >But definition D29 says that a UTF must round-trip these invalid code
>  >points, so we have no choice but to interpret them as <D800 DC00>.
>  >That is why UTF-8s is ambiguous.
>  
>  Not so. All that D29 imposes on UTF-8s is that its mapping from codepoints
>  to code units must be injective; i.e. there can be only one sequence for
>  any given codepoint. It does not make any further requirements as to the
>  nature of the mapping. Therefore, it is possible for UTF-8s to specify that
>  the representation of U+10000 is <ED A0 80 ED B0 80> (or anything else, for
>  that matter), but it can only specify one representation. D29 requires that
>  any UTF-8s, if it were to be defined in Unicode, could *not* be ambiguous.

The ambiguity comes from the fact that, if I am using UTF-8s and I want to 
represent the sequence of (invalid) scalar values <D800 DC00>, I must use the 
UTF-8s sequence <ED A0 80 ED B0 80>, and if I want to represent the (valid)
scalar value <10000>, I must *also* use the UTF-8s sequence
<ED A0 80 ED B0 80>.  Unless you have a crystal ball or are extremely good
with tarot cards, you have no way, upon reverse-mapping the UTF-8s sequence
<ED A0 80 ED B0 80>, to know whether it is supposed to be mapped back to
<D800 DC00> or to <10000>.
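The collision can be shown directly.  The sketch below is my own illustration (the name `utf8s_supplementary` is hypothetical, since UTF-8s was never standardized): it encodes U+10000 the way the UTF-8s proposal would, by encoding each half of its UTF-16 surrogate pair as a three-byte UTF-8 sequence.

```python
def utf8_3byte(cp):
    # Standard three-byte UTF-8 pattern: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (cp >> 12),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def utf8s_supplementary(cp):
    # Hypothetical UTF-8s encoder: split cp into its UTF-16 surrogate
    # pair, then encode each surrogate as a three-byte sequence.
    v = cp - 0x10000
    return utf8_3byte(0xD800 | (v >> 10)) + utf8_3byte(0xDC00 | (v & 0x3FF))

# U+10000 yields ED A0 80 ED B0 80 ...
encoded = utf8s_supplementary(0x10000)
assert encoded.hex() == "eda080edb080"

# ... which is byte-for-byte identical to encoding the two unpaired
# surrogates D800 and DC00 themselves.  A decoder seeing these six bytes
# cannot tell which source sequence produced them.
assert encoded == utf8_3byte(0xD800) + utf8_3byte(0xDC00)
```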

I mean, yes, you do have a way.  The great *likelihood* is that you want to 
represent the valid Unicode code point, not a sequence of two lonely 
surrogate code points that just coincidentally happen to appear together.  
But this heuristic does not satisfy the requirement of the paragraph following 
D29 that a UTF must map code points to code units unambiguously.

>  - Contrary to Doug, a UTF-8s could not be made ambiguous if it were defined
>  in Unicode. No argument on this basis against a proposed UTF-8s has been
>  made.

Premise:  Unicode should not, and does not, define ambiguous UTFs.
    I think we agree on this.

Premise:  UTF-8s is ambiguous in its handling of surrogate code points.
    I tried to prove this above.

Conclusion:  Unicode should not define UTF-8s.

-Doug Ewell
 Fullerton, California
