Re: Abstract character?

Mark Davis Wed, 24 Jul 2002 08:38:28 -0700

I disagree with Ken, but don't have time now to write a lengthy
reply. I'll try to get to that soon.


Mark
__________
http://www.macchiato.com
◄  “Eppur si muove” ►

----- Original Message -----
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: "Kenneth Whistler" <[EMAIL PROTECTED]>
Sent: Tuesday, July 23, 2002 19:44
Subject: Re: Abstract character?


> Kenneth Whistler <kenw at sybase dot com> wrote:
>
> >> UTF-16 does not allow the representation of an unpaired surrogate
> >> 0xD800 followed by another, coincidental unpaired surrogate 0xDC00.
> >> (It maps the two to U+10000.)  Among the standard UTFs, only UTF-32
> >> allows the two to be treated as unpaired surrogates.
> >
> > Actually, not that, either.
> >
> >> In fact, before UTF-8 was
> >> "tightened up" in 3.2, the only UTF that DID NOT permit these two
> >> coincidental unpaired surrogates was UTF-16.
> >>
> >> UTF-8:  D800 DC00 <==> ED A0 80 ED B0 80 (no longer legal)
> >> UTF-32:  D800 DC00 <==> 0000D800 0000DC00
> >
> > This is ill-formed in UTF-32, and thereby, illegal.
>
> I'm glad to hear that unpaired surrogates are now also illegal in
> UTF-32, and presumably also in UTF-16.  However, I did do my homework
> before writing yesterday's post, and that wasn't the impression I got,
> so I sense another opportunity to tighten up the definitions before
> Unicode 4.0 is released.
>
> In UAX #28, "Unicode 3.2," the section on "Elimination of Irregular
> Sequences" starts out talking about "transformation formats such as
> UTF-8."  However, the rest of the section deals exclusively with UTF-8;
> UTF-16 and UTF-32 are not mentioned.
>
> UAX #19, "UTF-32" (written by Mark) is listed in the header block as
> having been updated to Unicode 3.2, but it does not state anywhere that
> unpaired surrogates are illegal.  In particular, the following passages
> from UAX #19 led me to believe that all code points, from 0x0000 through
> 0x10FFFF inclusive, are legal in UTF-32:
>
> "UTF-32 is restricted in values to the range 0..10FFFF₁₆, which
> precisely matches the range of characters defined in the Unicode
> Standard (and other standards such as XML), and those representable by
> UTF-8 and UTF-16."
>
> "(b) An illegal UTF-32 code unit sequence is any byte sequence that
> would correspond to a numeric value outside of the range 0 to
> 10FFFF₁₆.
>
> "(c) An irregular UTF-32 code unit sequence is an eight-byte sequence
> where the first four bytes correspond to a high surrogate, and the next
> four bytes correspond to a low surrogate. As a consequence of C12, these
> irregular UTF-32 sequences shall not be generated by a conformant
> process."
>
> I suggest that the Unicode 4.0 text specifically state, in unambiguous
> terms, which code points are and are not valid in UTF-8, UTF-16, and
> UTF-32.  And if it is true that the surrogate code points 0xD800 through
> 0xDFFF are illegal in UTF-32, then I suggest that UAX #19 be revised to
> state this unambiguously.
>
> -Doug Ewell
>  Fullerton, California
>
>
>
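
The byte sequences and the surrogate-pair mapping discussed in the quoted
message can be checked with a short Python sketch. This is an illustration
only, not text from the Unicode Standard or from UAX #19. All helper names
are hypothetical, classify_utf32 encodes only the two passages (b) and (c)
quoted above, and the behavior of Python's built-in codecs reflects current
Python 3 rather than the Unicode 3.2 wording under discussion.

```python
# Illustrative sketch; helper names are hypothetical, not from any standard.

HIGH_START, LOW_START, SURR_END = 0xD800, 0xDC00, 0xDFFF
MAX_CODE_POINT = 0x10FFFF


def is_high_surrogate(cp: int) -> bool:
    return HIGH_START <= cp < LOW_START


def is_low_surrogate(cp: int) -> bool:
    return LOW_START <= cp <= SURR_END


def combine_surrogates(high: int, low: int) -> int:
    """Map a high/low surrogate pair to the code point UTF-16 assigns it."""
    if not (is_high_surrogate(high) and is_low_surrogate(low)):
        raise ValueError("expected a high surrogate followed by a low surrogate")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


def classify_utf32(units):
    """Label a list of UTF-32 code units using only passages (b) and (c)."""
    for unit in units:
        if not 0 <= unit <= MAX_CODE_POINT:
            return "illegal"  # (b): value outside 0..10FFFF
    for first, second in zip(units, units[1:]):
        if is_high_surrogate(first) and is_low_surrogate(second):
            return "irregular"  # (c): high surrogate followed by low surrogate
    return "no objection under (b) or (c)"


def strict_encode_rejects(text: str, codec: str) -> bool:
    """Return True if Python's default (strict) encoder refuses the text."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return True
    return False


pair = chr(0xD800) + chr(0xDC00)

# The 'surrogatepass' error handler reproduces the old permissive encodings.
utf8_bytes = pair.encode("utf-8", "surrogatepass")       # ED A0 80 ED B0 80
utf32_bytes = pair.encode("utf-32-be", "surrogatepass")  # 0000D800 0000DC00
utf16_bytes = pair.encode("utf-16-be", "surrogatepass")  # D800 DC00

# Decoding the UTF-16 bytes cannot yield two unpaired surrogates: U+10000.
roundtrip = utf16_bytes.decode("utf-16-be")
```

Under this sketch, classify_utf32([0xD800]) returns "no objection under (b)
or (c)", which is the gap the message points out: the quoted passages do
not, by themselves, rule out a lone surrogate in UTF-32.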
