Mark Davis <mark at macchiato dot com> wrote:

> The UTC has decided to make scalar value mean unambiguously the
> code points 0000..D7FF, E000..10FFFF, i.e., everything but surrogate
> code points. While surrogate code points cannot be represented in
> UTF-8 (as of Unicode 3.2), the UTC has not decided that the surrogate
> code points are illegal in all UTFs; notably, they are legal in
> UTF-16.

They are not legal in UTF-16 unless you believe that the two code
points (0xD800, 0xDC00) are fundamentally equivalent to the single
code point 0x10000 -- that is, unless you believe Unicode *is* UTF-16.

UTF-16 does not allow the representation of an unpaired surrogate
0xD800 followed by another, coincidental unpaired surrogate 0xDC00.
(It maps the two to U+10000.) Among the standard UTFs, only UTF-32
allows the two to be treated as unpaired surrogates.

In fact, before UTF-8 was "tightened up" in 3.2, the only UTF that
DID NOT permit these two coincidental unpaired surrogates was UTF-16.

    UTF-8:   D800 DC00 <==> ED A0 80 ED B0 80  (no longer legal)
    UTF-32:  D800 DC00 <==> 0000D800 0000DC00

- but -

    UTF-16:  D800 DC00 ==> D800 DC00 ==> 10000

> Ken is pushing for this change; I believe it would be a very bad idea.
> (I think the reasons have already appeared on this list, so I am not
> trying to reopen the discussion; just state the current situation.)

I don't recall seeing the reasons conclusively discussed on this list;
I'd be happy to hear them again. I've been complaining about the
paragraph after D29 for two years now.

-Doug Ewell
 Fullerton, California
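For illustration, the three mappings in the message above can be reproduced with a short Python sketch. The helper names (`encode_utf8_loose`, `decode_utf16`, `hex_bytes`) are hypothetical and not taken from any library. The loose encoder simply applies the three-byte UTF-8 bit pattern with no surrogate check, and the decoder's pass-through of a lone surrogate is an assumption made to keep the example small.

```python
def encode_utf8_loose(cp):
    """Apply the three-byte UTF-8 pattern to a code point in 0800..FFFF
    without rejecting surrogates (the pre-3.2 behavior described above)."""
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("this sketch only handles the three-byte range")
    return bytes([0xE0 | (cp >> 12),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def decode_utf16(units):
    """Turn 16-bit code units into code points.

    A high surrogate followed by a low surrogate is always combined into
    one supplementary code point; the pair can never survive as two
    separate values. Anything else is passed through unchanged
    (assumption: a lenient decoder, for illustration only).
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            code_points.append(unit)
            i += 1
    return code_points


def hex_bytes(data):
    """Format bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in data)


pair = [0xD800, 0xDC00]
loose = b"".join(encode_utf8_loose(cp) for cp in pair)

print("UTF-8 (loose):", hex_bytes(loose))                      # ED A0 80 ED B0 80
print("UTF-32:", " ".join(f"{cp:08X}" for cp in pair))         # 0000D800 0000DC00
print("UTF-16:", " ".join(f"{cp:X}" for cp in decode_utf16(pair)))  # 10000
```

Only the UTF-16 path loses the distinction between the two values and the single code point 10000; the other two forms carry each value separately.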

