At 10:42 PM 10/1/01 -0700, Bernard Miller wrote:
>--- Asmus Freytag <[EMAIL PROTECTED]> wrote:
> > There are 66 non-characters as of Unicode 3.1, there
> > were 34 non-characters
> > before.
>
>I understand now.. the non characters in 16 higher
>planes were defined first, then the ones in the arabic
>presentation forms block. In this case it is as I
>suspected, just a documentation problem. The book says
>"None of these surrogate pairs has been ASSIGNED in
>this version of the standard" (emphasis mine).

There are three types of things that can be stated for a code point
(code point, not character):

- allocation
- designation
- assignment

Allocation refers to whether the code point is part of the standard.
Allocation changed once in the life of Unicode, to include the range
0x10000-0x10FFFF.

Designation refers to the status as character, non-character,
surrogate, private use character, etc. Designation changed twice in
Unicode, once to designate the surrogates, and once to designate
32 code points on the BMP as non-characters.

Assignment refers to assigning a character to a code point. New
assignments are made all the time, as new characters are added to the
standard. In the early history of Unicode, assignments changed twice,
once to reflect the merger with 10646, and once to add the Korean
Hangul. Future assignment changes are restricted to adding new
assignments.

Because people easily confuse code points and characters, few people
make the distinction between allocation, designation, and assignment.
New text being drafted for Unicode 4.0 will clarify these terms.

>It
>would merely be misleading to not mention 32 non
>characters in the section called "non characters" and
>to state that there are no characters in the higher
>planes as of Unicode 3.0; but I think we have a bona
>fide incorrect statement to say that no surrogate pair
>has been ASSIGNED when in fact 32 surrogate pairs were
>assigned the status of non characters.

As you can see from the above, they were "designated" and not
"assigned".

> > The reason to put the additional (defined in 3.1)
> > non-characters into the BMP is to allow them to
> > have single codes for UTF-16 implementation -
> > something that doesn't
> > work so well if they are on the higher planes.
>
>I don't understand this, the "arabic" non characters
>are supposed to REPRESENT the "hidden" non characters?

No, implementors in the UTC simply demonstrated a need to have 32
non-character code points - code points that they would be free to use
internally because they would never be a legal part of any interchanged
data.

For UTF-16 implementations, using the 32 supplementary non-characters
would have forced them to use surrogate pairs, which is awkward for the
kinds of use intended for internal-use code points. That's why 32 code
points in the BMP were re-designated from 'reserved' to 'non-character'.

A./
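
P.S. To make the counts and the UTF-16 point concrete, here is a rough Python sketch. It is only an illustration: the function names (is_noncharacter, designation, utf16_code_units) are made up for this message and are not part of any Unicode library. The non-character ranges are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes, and the private use ranges are the usual three. The sketch does not try to tell assigned characters from reserved code points, since that needs the character database.

```python
"""Sketch: designation of a code point and its cost in UTF-16 code units.

All names here are hypothetical helpers written for this example.
"""

MAX_CODE_POINT = 0x10FFFF  # upper end of the allocated range


def is_noncharacter(cp: int) -> bool:
    """True for the 66 non-character code points (as of Unicode 3.1)."""
    # The 32 BMP code points re-designated from 'reserved'.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: xxFFFE and xxFFFF.
    return (cp & 0xFFFE) == 0xFFFE


def designation(cp: int) -> str:
    """Coarse designation of a code point; ignores assignment entirely."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"not an allocated code point: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    if is_noncharacter(cp):
        return "non-character"
    if (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD):
        return "private use"
    # Telling assigned characters from reserved code points would need
    # the character database, so the two are lumped together here.
    return "character or reserved"


def utf16_code_units(cp: int) -> int:
    """Number of 16-bit code units UTF-16 needs for this code point."""
    return 1 if cp < 0x10000 else 2


if __name__ == "__main__":
    nonchars = [cp for cp in range(MAX_CODE_POINT + 1) if is_noncharacter(cp)]
    on_bmp = [cp for cp in nonchars if cp < 0x10000]
    print("total non-characters:", len(nonchars))
    print("on the BMP:", len(on_bmp))
    for cp in (0xFDD0, 0xFFFE, 0x1FFFE, 0x10FFFF):
        print(f"U+{cp:04X}: {designation(cp)}, "
              f"{utf16_code_units(cp)} UTF-16 code unit(s)")
```

U+FDD0 fits in a single 16-bit code unit, while U+1FFFE needs a surrogate pair. That difference is the whole reason for the BMP block.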