Nicely stated.
Mark <https://plus.google.com/114199149796022210033>

*— Il meglio è l’inimico del bene —*

On Thu, Sep 19, 2013 at 11:21 PM, Whistler, Ken <[email protected]> wrote:

> Stephan Stiller seems unconvinced by the various attempts to explain the
> situation. Perhaps an authoritative explanation of the textual history
> might assist.
>
> Stephan demands an answer:
>
> I want to know why the Glossary claims that surrogate code points are
> "[r]eserved for use by UTF-16".
>
> Reason #1 (historical): Because the Glossary entry for “Surrogate Code
> Point” has been worded thusly since Unicode 4.0 (p. 1377), published in
> 2003, and hasn’t been reworded since.
>
> Reason #2 (substantive): Because UTC members have been satisfied with the
> content of the statement and have not required it be changed in subsequent
> versions of the standard.
>
> Reason #3 (intentional): Because the wording was added in the first place
> as part of the change to identify the term “surrogate character”, which had
> been widely used before, as a misnomer and a usage to be deprecated.
> The term “surrogate code point” was a deliberate introduction at that
> time to refer specifically to the range U+D800..U+DFFF of “code points”
> which could *not* be used to encode abstract characters.
>
> Reason #4 (proximal): Because nobody recently has submitted a suggested
> improvement to the text of the relevant entry in the glossary (and
> associated text in Chapter 3) which has passed muster in the editorial
> committee and been considered to be an improvement on the text.
>
> If it is exegesis rather than textual history that concerns you, here is
> what I consider to be a full explanation of the meaning of the text that
> troubles you so:
>
> Code points in the range U+D800..U+DFFF are reserved for a special
> purpose, and cannot be used to encode abstract characters (thereby making
> them encoded characters) in the Unicode Standard. Note that it is perfectly
> valid to refer to these as code points and use the U+ prefix for them. The
> U+ prefix identifies the Unicode codespace, and the glossary (correctly)
> identifies that as the range of integers from 0 to 10FFFF. O.k., if the
> range of code points U+D800..U+DFFF is reserved for a special purpose,
> what is that purpose and how do we designate the range? The designation is
> easy: we call elements of the subrange U+D800..U+DBFF “high-surrogate code
> point” (see D71) and the elements of the subrange U+DC00..U+DFFF
> “low-surrogate code point” (see D73), and by construction (and common
> usage), the elements contained in the union of those two subranges are
> called “surrogate code points”. What is the special purpose? The shorthand
> description of the purpose is that the “surrogate code points” are “used
> for UTF-16”.
> But since that seems to confuse a minority of the readers of the
> standard, here is a longer explication: The surrogate code points are
> deliberately precluded from use to encode abstract characters to enable the
> construction of an efficient and unambiguous mapping between Unicode scalar
> values (the U+0000..U+D7FF, U+E000..U+10FFFF subranges of the Unicode
> codespace) and the sequences of 16-bit code units defined in the UTF-16
> encoding form. In other words, the reservation *from* encoding for the code
> points U+D800..U+DFFF enables the use of the numerical range 0xD800..0xDFFF
> to define surrogate pairs to map U+10000..U+10FFFF, while otherwise
> retaining a simple one-to-one mapping from code point to code unit in
> UTF-16 for the BMP code points which *are* used for encoding abstract
> characters. In short, the surrogate code points are “used for UTF-16”.
>
> Stephan’s next demand for an answer was:
>
> Remind me real quick, in what way does a function "use" the input values
> that it's not defined on?
>
> Well, the problem here is in the formulation of the implied question. I
> suspect, from the discussion in this thread, that Stephan has concluded
> that the generic wording “used for” in the glossary item in question
> necessarily imputes that the surrogate code points are therefore elements
> of the domain of the mapping function for UTF-16 (which maps Unicode scalar
> values to sequences of UTF-16 code units). Of course that imputation is
> incorrect. Surrogate code points are excluded from that domain, by
> *definition*, as intended. And I have explained above what the phrase “used
> for” is actually used for in the glossary entry.
>
> Finally:
>
> And what does this have to do with UTF-16?
>
> It is definitional for UTF-16.
> I think that should also be clear from the explanation above.
>
> Now, rather than quibbling further about what the glossary says, if the
> explanation still does not satisfy, and if the text in the glossary (and in
> Chapter 3) still seems wrong and misleading in some way, here is a more
> productive way forward:
>
> Submit a proposal for a small textual change to the Unicode Technical
> Committee. This can either consist of an extended document (if long), or
> can be done on the online contact form (if short). (See the web site for
> submission details.) In a case like this, to be effective, such a proposal
> should have the following rhetorical structure, approximately:
>
> ===========================================================================
>
> 1. I find (glossary entry/conformance clause/section/page/…)
> (confusing/misleading/erroneous…) for XYZ reasons.
>
> 2. The following reformulation of that text [insert exact text suggestion
> here] might be a useful improvement.
>
> 3. Please consider this suggestion at your next available opportunity.
>
> Sincerely, etc., etc., with appropriate contact information
>
> ===========================================================================
>
> Anyone who wants to make an actual textual improvement to the standard can
> follow that general outline.
>
> If, on the other hand, the goal here is simply to have a rousing argument
> for argument’s sake on the email list, at a certain point, others on the
> list may conclude that enough is enough. It might be time then to take the
> argument private to those individual correspondents who wish to continue
> the argument.
>
> --Ken
>
> You haven't answered my questions.
> I want to know why the Glossary claims that surrogate code points are
> "[r]eserved for use by UTF-16". Remind me real quick, in what way does a
> function "use" the input values that it's not defined on? And what does
> this have to do with UTF-16?
>
> Stephan
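
For readers who find code easier to follow than prose, the mapping Ken describes can be sketched roughly as below. This is only an illustrative Python sketch, not text from the standard: the function names (`is_surrogate`, `utf16_encode_scalar`, `utf16_decode`) are made up for this example, and error handling is reduced to raising `ValueError`. It tries to show two things: surrogate code points are outside the domain of the encoding function, and the numeric range 0xD800..0xDFFF is therefore free to serve as code units in surrogate pairs.

```python
from typing import List

# Illustrative sketch only; all names here are hypothetical.
HIGH_START, HIGH_END = 0xD800, 0xDBFF  # high-surrogate range
LOW_START, LOW_END = 0xDC00, 0xDFFF    # low-surrogate range


def is_surrogate(cp: int) -> bool:
    """True for code points in U+D800..U+DFFF."""
    return HIGH_START <= cp <= LOW_END


def utf16_encode_scalar(cp: int) -> List[int]:
    """Map one Unicode scalar value to a sequence of 16-bit code units."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    if is_surrogate(cp):
        # Surrogate code points are not scalar values, so they are
        # not in the domain of this mapping.
        raise ValueError("surrogate code point is not a scalar value")
    if cp < 0x10000:
        return [cp]  # BMP: one code unit, numerically equal to the code point
    offset = cp - 0x10000  # 20-bit value split into two 10-bit halves
    return [HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)]


def utf16_decode(units: List[int]) -> List[int]:
    """Map a sequence of 16-bit code units back to scalar values."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if HIGH_START <= u <= HIGH_END:
            if i + 1 >= len(units) or not LOW_START <= units[i + 1] <= LOW_END:
                raise ValueError("unpaired high surrogate")
            low = units[i + 1]
            out.append(0x10000 + ((u - HIGH_START) << 10) + (low - LOW_START))
            i += 2
        elif LOW_START <= u <= LOW_END:
            raise ValueError("unpaired low surrogate")
        else:
            out.append(u)
            i += 1
    return out
```

Because no scalar value ever encodes to a lone unit in 0xD800..0xDFFF, a decoder that sees such a unit knows it must be half of a pair, which is what keeps the mapping unambiguous.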

