Suzanne responded:

> > Maybe Unicode is more of a shared set of rules that apply to
> > low level data structures surrounding text and its algorithms
> > then a character set.
>
> Sounds like the start of a philosophical debate.
>
> If Unicode is described as a set of rules, we'll be in a world of hurt.
> (On a serious note, these exceptions are exactly what make writing some
> sort of "is and isn't" FAQ pretty darned hard.

Hmm. Since the discussion which started out trying to specify a few examples of what kinds of entities would be inappropriate to proffer for encoding as Unicode characters seems to be in danger of mutating into the recurrent "What is Unicode?" question, perhaps it's time to start a new thread for the latter.

And now for some ontological ground rules.

When trying to decide what a "thing" is, it helps not to use an attribute nominatively, since that encourages people to privately visualize the noun the attribute is applied to, but to do so in different ways -- and then to argue past each other because they are, in the end, talking about different things.

"Unicode" is used attributively of a number of things, and if we are going to start arguing/discussing what "it" is, it would be better to lay out the alternative "it"s a little more specifically first.

1. The Unicode *Consortium* is a standardization organization. It started out with a charter to produce a single standard, but along the way has expanded that charter, in response to the desire of its membership. In addition to "The Unicode Standard", it has now adopted a terminology that refers to some of its other publications as "Unicode Technical Standards" [UTS], of which two formally exist now: UTS #6 SCSU, and UTS #10 Unicode Collation Algorithm [UCA].

It is important to keep this straight, because some people, when they say "Unicode", are talking about the *organization*, rather than the Unicode Standard per se. And when people talk about "the standard", they are generally referring to "The Unicode Standard", but the Unicode Consortium is actually responsible for several standards.

2. The Unicode *Standard* itself is a very complex standard, consisting of many pieces now.
To keep track of just what something like "The Unicode Standard, Version 3.2" means, we now have to keep web pages enumerating all the parts exactly -- like components in an assemble-your-own-furniture kit. See:

http://www.unicode.org/unicode/standard/versions/

In any one particular version, the Unicode Standard now consists of a book publication, some number of web publications (referred to as Unicode Standard Annexes [UAX]), and a large number of contributory data files -- some normative and some informative, some data and some documentation.

These definitions, including the exact list of contributory data files and their versions, are themselves under tight control by the Unicode Technical Committee, as they constitute the very *definition* of the Unicode Standard. It is not by accident that the version definitions start off now with the following wording: "The Unicode Standard, Version 3.2.0 is defined by the following list..." and so on for earlier versions.

3. The Unicode *Book* is a periodic publication, constituting the central document for any given version of the Unicode *Standard*, but is by no means the entire standard. The book, in turn, is very complex, consisting of many chapters and parts, some of which constitute tightly controlled, normative specification, and some of which are informative, editorial content.

The "book" now also exists in an online version (pdf files):

http://www.unicode.org/unicode/uni2book/u2.html

which is *almost* identical to the published hardcover book, but not quite. (The Introduction is slightly restructured, the online glossary is restructured and has been added to, the charts are constructed slightly differently and have introductory pages of their own, etc.)

4. The Unicode *CCS* [coded character set] is the mapping of the set of abstract characters contained in the Unicode repertoire (at any given version) to a bunch of code points in the Unicode codespace (0x0000..0x10FFFF).
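
[Editorial illustration, not part of the original message.] A minimal sketch of what "coded character set" means in practice, assuming Python and its standard `unicodedata` module. The helper names `in_codespace` and `describe` are made up for this example. The character names come from whichever Unicode version the interpreter was built against (reported by `unicodedata.unidata_version`), which echoes the "at any given version" caveat above.

```python
import unicodedata

# Bounds of the Unicode codespace as given in the text.
CODESPACE_MIN = 0x0000
CODESPACE_MAX = 0x10FFFF


def in_codespace(code_point: int) -> bool:
    """Return True if the integer lies inside the Unicode codespace."""
    return CODESPACE_MIN <= code_point <= CODESPACE_MAX


def describe(ch: str) -> str:
    """Show one coded character: its code point paired with its character name.

    Falls back to a placeholder when the interpreter's character database
    has no name for the code point (for example, control characters).
    """
    code_point = ord(ch)
    name = unicodedata.name(ch, "<no name in this database>")
    return f"U+{code_point:04X} {name}"


if __name__ == "__main__":
    print("Character database version:", unicodedata.unidata_version)
    for ch in ("A", "\u20ac"):
        print(describe(ch))
```

The point of the sketch is only that the CCS is a mapping between abstract characters and integers. Nothing in it touches encoding forms, properties, or algorithms, which are separate "things" in the sense of this list.
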
Technically speaking, it is the Unicode *CCS* which is synchronized closely with ISO/IEC 10646, rather than the Unicode *Standard*. 10646 and the Unicode CCS have exactly the same coded characters (at various key synchronization points in their joint publication histories), but the *text* of the ISO/IEC 10646 standard doesn't look anything like the *text* of the Unicode Standard, and the Unicode Standard [sensum #2 above] contains all kinds of material, both textual and data, that goes far beyond the scope of 10646.

There are other standards produced by some national bodies that are effectively just translations of 10646 (GB 13000 in China, JIS X 0221 in Japan), but the Unicode Standard is nothing like those.

Finally, the attribute "Unicode ..." can be applied to all kinds of other "things" characteristic of the Unicode Standard, including algorithms for the manipulation of characters. Obvious examples which come to mind are "Unicode Bidirectional Algorithm", "Unicode Normalization", "Unicode Encoding Forms", and "Unicode Character Properties".

O.k., so now before asserting or denying that "Unicode ... is a shared set of rules", it would be helpful to pin down first what you are referring to. That might make the ensuing debate more fruitful.

--Ken

