Re: Abstract character?

Doug Ewell Mon, 22 Jul 2002 22:31:39 -0700

Mark Davis <mark at macchiato dot com> wrote:

> The UTC has decided to make scalar value mean unambiguously the
> code points 0000..D7FF, E000..10FFFF, i.e., everything but surrogate
> code points. While surrogate code points cannot be represented in
> UTF-8 (as of Unicode 3.2), the UTC has not decided that the surrogate
> code points are illegal in all UTFs; notably, they are legal in
> UTF-16.


They are not legal in UTF-16 unless you believe that the two code points
(0xD800, 0xDC00) are fundamentally equivalent to the single code point
0x10000 -- that is, unless you believe Unicode *is* UTF-16.

UTF-16 does not allow the representation of an unpaired surrogate 0xD800
followed by another, coincidental unpaired surrogate 0xDC00.  (It maps
the two to U+10000.)  Among the standard UTFs, only UTF-32 allows the
two to be treated as unpaired surrogates.  In fact, before UTF-8 was
"tightened up" in 3.2, the only UTF that DID NOT permit these two
coincidental unpaired surrogates was UTF-16.

UTF-8:  D800 DC00 <==> ED A0 80 ED B0 80 (no longer legal)
UTF-32:  D800 DC00 <==> 0000D800 0000DC00
- but -
UTF-16:  D800 DC00 ==> D800 DC00 ==> 10000
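
The three mappings above can be reproduced with a short Python sketch. It rests on some assumptions: Python's strict codecs follow later, tighter rules than the ones discussed here, so the "surrogatepass" error handler stands in for a permissive encoder and decoder; the helper names combine_surrogates, round_trip, and strict_utf8_accepts are made up for illustration.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Map a UTF-16 high/low surrogate pair to the code point it represents."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Two surrogate code points that merely happen to be adjacent.
PAIR = "\ud800\udc00"


def round_trip(text: str, codec: str):
    """Encode and decode permissively, letting surrogates pass through."""
    data = text.encode(codec, "surrogatepass")
    return data, data.decode(codec, "surrogatepass")


def strict_utf8_accepts(data: bytes) -> bool:
    """Report whether Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# UTF-8 (permissive): ED A0 80 ED B0 80, and the two code points come back.
utf8_bytes, utf8_back = round_trip(PAIR, "utf-8")

# UTF-32 (permissive): 0000D800 0000DC00, and the two code points come back.
utf32_bytes, utf32_back = round_trip(PAIR, "utf-32-be")

# UTF-16: the code units D800 DC00 are indistinguishable from the encoding
# of U+10000, so decoding yields one code point instead of the original two.
utf16_bytes, utf16_back = round_trip(PAIR, "utf-16-be")

print(hex(combine_surrogates(0xD800, 0xDC00)))
print(utf8_bytes.hex(), utf8_back == PAIR)
print(utf32_bytes.hex(), utf32_back == PAIR)
print(utf16_bytes.hex(), utf16_back == PAIR, utf16_back == "\U00010000")
print(strict_utf8_accepts(utf8_bytes))
```

Only the UTF-16 round trip fails to return the original two code points, which is the asymmetry at issue; the strict UTF-8 check corresponds to the "no longer legal" note.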

> Ken is pushing for this change; I believe it would be a very bad idea.
> (I think the reasons have already appeared on this list, so I am not
> trying to reopen the discussion; just state the current situation.)

I don't recall seeing the reasons conclusively discussed on this list;
I'd be happy to hear them again.  I've been complaining about the
paragraph after D29 for two years now.

-Doug Ewell
 Fullerton, California
