Lars Marius Garshol asked:

> I'm trying to find out what an abstract character is. I've been
> looking at chapter 3 of Unicode 3.0, without really achieving
> enlightenment.
>
> The term Unicode scalar value (apparently synonymous with code point)
> seems clear. It is the identifying number assigned to assigned
> Unicode characters.

Here is one of my attempts at a more rigorous term rectification:

Abstract character

  That which is encoded; an element of the repertoire (existing
  independent of the character encoding standard, and often
  identifiable in other character encoding standards, as well as
  the Unicode Standard); the implicit basis of transcodings.

  Note that while in some sense abstract characters exist a priori
  by virtue of the nature of the units of various writing systems,
  their exact nature is only pinned down at the point that an actual
  encoding is done. They are not always obvious, and many new
  abstract characters may arise as the result of particular textual
  processing needs that can be addressed by characters. (E.g.
  WORD JOINER, OBJECT REPLACEMENT CHARACTER, etc., etc.)

Code point

  A number from 0..10FFFF; a "point" in the codespace 0..10FFFF.

Encoded character

  An *association* of an abstract character with a code point.

Unicode scalar value

  A number from 0..D7FF, E000..10FFFF; the domain of the functions
  which define UTFs.

  The Unicode scalar value definitionally excludes D800..DFFF, which
  are only code unit values used in UTF-16, and which are not code
  points associated with any well-formed UTF code unit sequences.

Assignment (of code points)

  Refers to the process of associating abstract characters with code
  points. Mathematically a code point is "assigned to" an abstract
  character and an abstract character is "mapped to" a code point.

  This is distinguished from the vaguer sense of "assigned" in
  general parlance as meaning "a code point given some designated
  function by the standard", which would include noncharacters and
  surrogates.

> So far, so good. Some questions:
>
>  - are all assigned Unicode characters also abstract characters?

Yes. Or rather: all encoded characters are assigned to abstract characters.
(See above for my distinction between "assigned" and "designated", which
would apply to noncharacters and surrogate code points -- neither of which
classes of code points get assigned to abstract characters.)

>  - it seems that not all abstract characters have code points (since
>    abstract characters can be formed using combining characters). Is
>    that correct?

Yes. (Note above -- abstract characters are also a concept which applies
to other character encodings besides the Unicode Standard, and not all
encoded characters in other character encodings automatically make it
into the Unicode Standard, for various architectural reasons.)

>  - do <U+00C5> (Å) and <U+0041, U+030A> (A followed by combining ring
>    above) represent the same abstract character?

Yes. That is the implicit claim behind a specification of canonical
equivalence.

--Ken

> Would be good if someone could clear this up.
>
> --
> Lars Marius Garshol, Ontopian         <URL: http://www.ontopia.net >
> ISO SC34/WG3, OASIS GeoLang TC        <URL: http://www.garshol.priv.no >
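
The ranges given above for code points and Unicode scalar values, and the
canonical equivalence of <U+00C5> and <U+0041, U+030A>, can be tried out
directly. What follows is only a minimal Python sketch, assuming the standard
library's unicodedata module. The helper names is_code_point, is_scalar_value
and canonically_equivalent are made up for illustration and are not terms or
APIs from the Unicode Standard. Comparing NFD forms is one common way to test
canonical equivalence, not the only one.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF          # top of the codespace 0..10FFFF
SURROGATES = range(0xD800, 0xE000)  # D800..DFFF, excluded from scalar values


def is_code_point(n: int) -> bool:
    """True if n lies in the codespace 0..10FFFF (hypothetical helper)."""
    return 0 <= n <= MAX_CODE_POINT


def is_scalar_value(n: int) -> bool:
    """True if n lies in 0..D7FF or E000..10FFFF (hypothetical helper)."""
    return is_code_point(n) and n not in SURROGATES


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings by their canonical decompositions (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


precomposed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030A"   # A followed by COMBINING RING ABOVE

print(precomposed == decomposed)                        # False
print(canonically_equivalent(precomposed, decomposed))  # True
print(is_code_point(0xD800), is_scalar_value(0xD800))   # True False
```

A surrogate such as D800 passes is_code_point but fails is_scalar_value, which
matches the point above that D800..DFFF are not associated with any
well-formed UTF code unit sequence.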

