Stephan Stiller seems unconvinced by the various attempts to explain the 
situation. Perhaps an authoritative explanation of the textual history might 
assist.

Stephan demands an answer:

I want to know why the Glossary claims that surrogate code points are 
"[r]eserved for use by UTF-16".

Reason #1 (historical): Because the Glossary entry for “Surrogate Code Point” 
has been worded thusly since Unicode 4.0 (p. 1377), published in 2003, and 
hasn’t been reworded since.

Reason #2 (substantive): Because UTC members have been satisfied with the 
content of the statement and have not required it be changed in subsequent 
versions of the standard.

Reason #3 (intentional): Because the wording was added in the first place as 
part of the change to identify the term “surrogate character”, which had been 
widely used before, as a misnomer and a usage to be deprecated. The term 
“surrogate code point” was a deliberate introduction at that time to refer 
specifically to the range U+D800..U+DFFF of “code points” which could *not* be 
used to encode abstract characters.

Reason #4 (proximal): Because no one has recently submitted a suggested 
improvement to the text of the relevant glossary entry (and the associated 
text in Chapter 3) that has passed muster with the editorial committee and 
been judged an improvement on the existing text.

If it is exegesis rather than textual history that concerns you, here is what I 
consider to be a full explanation of the meaning of the text that troubles you 
so:

Code points in the range U+D800..U+DFFF are reserved for a special purpose, and 
cannot be used to encode abstract characters (thereby making them encoded 
characters) in the Unicode Standard. Note that it is perfectly valid to refer 
to these as code points and to use the U+ prefix for them. The U+ prefix 
identifies the Unicode codespace, and the glossary (correctly) identifies that 
as the range of integers from 0 to 10FFFF.

O.k., if the range of code points U+D800..U+DFFF is reserved for a special 
purpose, what is that purpose and how do we designate the range? The 
designation is easy: we call each element of the subrange U+D800..U+DBFF a 
“high-surrogate code point” (see D71) and each element of the subrange 
U+DC00..U+DFFF a “low-surrogate code point” (see D73), and by construction 
(and common usage), an element of the union of those two subranges is called a 
“surrogate code point”.

What is the special purpose? The shorthand description of the purpose is that 
the “surrogate code points” are “used for UTF-16”. But since that seems to 
confuse a minority of the readers of the standard, here is a longer 
explication: the surrogate code points are deliberately precluded from use to 
encode abstract characters in order to enable the construction of an efficient 
and unambiguous mapping between Unicode scalar values (the U+0000..U+D7FF and 
U+E000..U+10FFFF subranges of the Unicode codespace) and the sequences of 
16-bit code units defined in the UTF-16 encoding form. In other words, the 
reservation *from* encoding for the code points U+D800..U+DFFF enables the use 
of the numerical range 0xD800..0xDFFF to define surrogate pairs that map 
U+10000..U+10FFFF, while otherwise retaining a simple one-to-one mapping from 
code point to code unit in UTF-16 for the BMP code points which *are* used to 
encode abstract characters. In short, the surrogate code points are “used for 
UTF-16”.
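
To make that mapping concrete, here is a minimal sketch of it in Python (the 
function name utf16_encode_scalar is purely illustrative, not anything defined 
by the standard):

    def utf16_encode_scalar(cp):
        """Map one Unicode scalar value to its UTF-16 code unit sequence."""
        if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
            # Surrogate code points (and out-of-range integers) are not
            # scalar values, so the mapping is simply not defined for them.
            raise ValueError("not a Unicode scalar value: 0x%X" % cp)
        if cp <= 0xFFFF:
            return [cp]                    # BMP: one code unit, identity mapping
        v = cp - 0x10000                   # 20-bit offset into the supplementary planes
        return [0xD800 + (v >> 10),        # high-surrogate code unit, 0xD800..0xDBFF
                0xDC00 + (v & 0x3FF)]      # low-surrogate code unit,  0xDC00..0xDFFF

    print([hex(u) for u in utf16_encode_scalar(0x10000)])   # ['0xd800', '0xdc00']
    print([hex(u) for u in utf16_encode_scalar(0x1F600)])   # ['0xd83d', '0xde00']

Because the values 0xD800..0xDFFF can never appear as the result of the 
single-unit (BMP) branch of the mapping, a decoder can always tell whether a 
given 16-bit code unit stands alone or is half of a surrogate pair, which is 
exactly the unambiguity the reservation buys.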

Stephan’s next demand for an answer was:

Remind me real quick, in what way does a function "use" the input values that 
it's not defined on?

Well, the problem here is in the formulation of the implied question. I 
suspect, from the discussion in this thread, that Stephan has concluded that 
the generic wording “used for” in the glossary item in question necessarily 
implies that the surrogate code points are therefore elements of the domain of 
the mapping function for UTF-16 (which maps Unicode scalar values to sequences 
of UTF-16 code units). Of course that inference is incorrect. Surrogate code 
points are excluded from that domain, by *definition*, as intended. And I have 
explained above what the phrase “used for” is actually used for in the glossary 
entry.
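
As one concrete illustration of that exclusion (using CPython’s utf-16 codec 
merely as an example implementation; the glossary of course mandates no 
particular API), attempting to encode a lone surrogate simply fails, because 
the encoding function has nothing defined for it:

    # A lone surrogate is a code point but not a Unicode scalar value,
    # so the UTF-16 encoding function has no image for it.
    try:
        '\ud800'.encode('utf-16')
    except UnicodeEncodeError as err:
        print(err.reason)    # reported as something like "surrogates not allowed"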

Finally:

And what does this have to do with UTF-16?

It is definitional for UTF-16. I think that should also be clear from the 
explanation above.

Now, rather than quibbling further about what the glossary says, if the 
explanation still does not satisfy, and if the text in the glossary (and in 
Chapter 3) still seems wrong and misleading in some way, here is a more 
productive way forward:

Submit a proposal for a small textual change to the Unicode Technical 
Committee. This can either consist of an extended document (if long), or can be 
done on the online contact form (if short). (See the web site for submission 
details.) In a case like this, to be effective, such a proposal should have the 
following rhetorical structure, approximately:

===========================================================================================

1. I find (glossary entry/conformance clause/section/page/…) 
(confusing/misleading/erroneous…) for XYZ reasons.

2. The following reformulation of that text [insert exact text suggestion here] 
might be a useful improvement.

3. Please consider this suggestion at your next available opportunity.

Sincerely, etc., etc., with appropriate contact information

===========================================================================================

Anyone who wants to make an actual textual improvement to the standard can 
follow that general outline.

If, on the other hand, the goal here is simply to have a rousing argument for 
argument’s sake on the email list, at a certain point, others on the list may 
conclude that enough is enough. It might be time then to take the argument 
private to those individual correspondents who wish to continue the argument.

--Ken



