-----BEGIN PGP SIGNED MESSAGE-----

Mark Davis wrote:
> A small correction to Ken's message:
>
> > The Unicode scalar value
> > definitionally excludes D800..DFFF, which are only code unit
> > values used in UTF-16, and which are not code points associated
> > with any well-formed UTF code unit sequences.
>
> The UTC in has decided to make scalar value mean unambiguously the
> code points 0000..D7FF, E000..10FFFF, i.e., everything but surrogate
> code points.

I think it would be a mistake for the standard to refer to "surrogate
code points". The term "code point" is used for other CCS's where there
may also be gaps in the code space; in that case, the gaps are not
considered valid code points.

When 0xD800..0xDFFF are used in UTF-16, they are used as code units, not
code points. As Unicode code points, 0xD800..0xDFFF are (or at least
should be) invalid in the same sense that 0x110000 is.

I.e. IMHO "Unicode scalar value" and "Unicode code point" should be
synonyms, with the set of valid values 0..0xD7FF, 0xE000..0x10FFFF.
"code point" should be defined as an integer corresponding to an encoded
character in any CCS, not just Unicode.

> While surrogate code points cannot be represented in
> UTF-8 (as of Unicode 3.2), the UTC has not decided that the surrogate
> code points are illegal in all UTFs; notably, they are legal in
> UTF-16.

The integers 0xD800..0xDFFF are legal *as code units* in UTF-16. IMHO
allowing them as code points (i.e. allowing any process to conformantly
generate unpaired surrogates) is a really bad idea. The set of code point
sequences that are validly representable in each UTF should be identical
(which ensures that mappings between UTFs are bijective and always
succeed iff the input is valid in the source UTF). I.e. U+D800..DFFF,
like U+110000, should be undesignated and unrepresentable.

(As well as UTF-16, the definition of UTF-32 in UAX #19 does not
specifically exclude 0xD800..0xDFFF, although the ISO 10646 definition
does. In this case I think Unicode should be changed to be consistent
with ISO 10646.)

> Ken is pushing for this change; I believe it would be a very bad idea.

What precisely do you think would be a bad idea?

- -- 
David Hopwood <[EMAIL PROTECTED]>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding.
If I revoke a public key but refuse to specify why, it is because the
private key has been seized under the Regulation of Investigatory Powers
Act; see www.fipr.org/rip

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPT0/MjkCAxeYt5gVAQEOvQf8DEmtbZpQ59nSSbVa8HN/BXCoMG/UOqYy
lSknQ+dUaIS3S0QgpVSIs5tFOjShw2YZ117cXioxzADMbU2MlbY3NITJYkatbgqf
UWIH9ENnqe0YDLdg1FWjyFFWuYLz1kf7c4M16OblhrHMJCjc9+Gba8dikIjJolWi
WNtzfX9ftuzcvFwssReGjyemXMhN6ugeUv3T1hGXjMRT834rSG9eLEr98BWpE1xR
m8wQPBWizSCDF3xFrRg6SwfSt1g+SrhGjLd/ccG96ENdC1XBHYyF4WgggdIO6Ilb
0WSaLbBV4uEPxyPihsy4pV3w8GLRXDhwpK34InLRHJFkMcgNWMTE2w==
=Kn1u
-----END PGP SIGNATURE-----
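
The distinction the message draws between code units and code points can be
made concrete with a small sketch. It assumes the position argued above: the
valid values are 0..0xD7FF and 0xE000..0x10FFFF, and 0xD800..0xDFFF appear
only as UTF-16 code units. The function names (is_scalar_value, utf16_encode,
utf16_decode) are hypothetical and chosen for illustration; this is not code
from the Unicode Standard or from any library.

```python
# Illustrative sketch only; all names here are hypothetical.

def is_scalar_value(n):
    """True for 0..0xD7FF and 0xE000..0x10FFFF, the range given in the message."""
    return 0 <= n <= 0x10FFFF and not (0xD800 <= n <= 0xDFFF)


def utf16_encode(values):
    """Map a sequence of scalar values to a list of 16-bit UTF-16 code units."""
    units = []
    for n in values:
        if not is_scalar_value(n):
            # Under the view argued above, a surrogate is no more encodable
            # than 0x110000 is.
            raise ValueError("not a scalar value: %#x" % n)
        if n < 0x10000:
            units.append(n)
        else:
            v = n - 0x10000
            units.append(0xD800 | (v >> 10))    # high surrogate code unit
            units.append(0xDC00 | (v & 0x3FF))  # low surrogate code unit
    return units


def utf16_decode(units):
    """Map UTF-16 code units back to scalar values, rejecting unpaired surrogates."""
    values = []
    i = 0
    while i < len(units):
        u = units[i]
        if not 0 <= u <= 0xFFFF:
            raise ValueError("not a 16-bit code unit: %#x" % u)
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed immediately by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                values.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
            raise ValueError("unpaired high surrogate: %#x" % u)
        if 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate: %#x" % u)
        values.append(u)
        i += 1
    return values
```

With these definitions, utf16_decode(utf16_encode(s)) returns s for every
sequence s of scalar values, and a lone 0xD800 is rejected in both directions.
That is the round-trip property the message asks of every UTF: the same set of
inputs is valid everywhere, so conversion between forms never fails on valid
input.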

