From: "Lars Marius Garshol" <[EMAIL PROTECTED]>

> Does this make sense? Is "code point" the right term, or should I say
> "scalar value"? And what about "abstract character"? Are two equal
> sequences of code points in NFC necessarily composed of the same
> sequence of abstract characters?
As Unicode and ISO/IEC 10646 assign the same code points to the same abstract characters, the code point should be unique (so there is a bijection). The issue is that Unicode strings are not required to be in any normalized form. So one Unicode string may be distinct from another even though both are "canonically equivalent", i.e. equal after transformation to a standard normalization form (NFC, NFD).

If you read this list, you'll see that some strings need to be encoded with distinct sequences even though they are canonically equivalent. This can cause interpretation problems, as the normalization process (even canonical, not "compatibility") may alter the semantics in some cases (we discussed the issue for Traditional Hebrew, Arabic, Tibetan, etc.), changing what is considered a string of abstract characters (canonicalization will alter the abstract characters in some cases, even though it is not supposed to change the way they are rendered to common readers).

Note also that the term "scalar value" relates to the assignment of a relative position in an ordered character set. The term "code point" is to be interpreted symbolically, so that distinct code points have no defined relative order (ordering code points is a question of collation, and collation in Unicode is defined to act not on the individual abstract characters that make up a string, but on the string as a whole).

I would not use the term "scalar value" in your definition, even if strings are normalized in a canonical composed form, where the representation of the string is made of code points that have an inherent scalar value, which may be stored in memory as code units, and then serialized as sequences of bytes through an encoding scheme.
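To make the "distinct sequences, canonically equivalent" point concrete, here is a minimal Python sketch using the standard `unicodedata` module (my example characters, not ones from the thread):

```python
import unicodedata

# Two distinct code point sequences for what readers see as "é":
precomposed = "\u00E9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# As raw code point sequences they compare unequal...
assert precomposed != decomposed

# ...but they are canonically equivalent: transforming both to the same
# normalization form (NFC or NFD) makes them compare equal.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

This is exactly why a raw binary comparison of two unnormalized Unicode strings can report "distinct" for text that is canonically the same.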
In fact there may exist fully Unicode-compliant applications that do not handle strings of abstract characters using the scalar value of code points, but instead handle them symbolically (think of a Lisp processor that handles each abstract character as a symbolic node, or of SGML applications that handle them by name or by character entity reference): the scalar value of each code point is not required for Unicode string handling, as strings may be serialized only on input and output, as sequences of bytes or code units in some encoding scheme or coded charset. If you then think about the normalization process, it too can be performed symbolically, without using code points, and even when using equivalent symbols to represent the same code point (for example, in SGML or XML, the numeric character references or named character entities). Am I wrong?
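To illustrate the "equivalent symbols for the same code point" idea, a small sketch using Python's standard `html` module as a stand-in for an SGML/XML reference resolver (the module choice is mine, not from the thread): three different symbolic spellings all denote the single code point U+00E9.

```python
from html import unescape

# Three symbolic representations of the same code point U+00E9 in
# XML/HTML source text: a decimal character reference, a hexadecimal
# character reference, and a named character entity.
decimal_ref = unescape("&#233;")
hex_ref = unescape("&#xE9;")
named_ref = unescape("&eacute;")

# All three resolve to the identical abstract character.
assert decimal_ref == hex_ref == named_ref == "\u00E9"
```

A processor could carry these references around symbolically and only resolve them (or normalize across them) at serialization time, never touching a scalar value internally.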