I have attached a proposal to describe the meaning of PUA characters in a document. The idea is that this description would be to the characters as the DTD is to the XML elements (but it also applies to non-XML documents).

Eric.

Title: Formalizing the Unicode Private Use Area
1. Motivation

The Unicode standard is a constantly evolving character collection, and there may be times when one needs a character that is not yet part of the standard. Unicode recognizes this situation:
Indeed, a document that uses PUA code points has no meaning by itself, just as a document whose encoding is not specified has no meaning by itself. First and foremost, this note provides a means to build those agreements.

The idea is that a document could specify a semantics for the Private Use Area characters it contains, at the same level as Unicode specifies a semantics for the assigned characters (i.e. those that are part of the Unicode repertoire). As with Unicode, part of the semantics is formalized and represented in machine-readable form, and part of it is informal.

2. Terminology

A gaiji character is a character that is not part of the Unicode repertoire and is encoded in the PUA. In this document, there is no intention to restrict gaiji characters to ideographs. Of course, this notion is relative to a particular version of the Unicode standard.

3. Requirements

The design goals are:
4. Overall Structure

The goals dealing with extensibility, human readability, and machine processing are easily satisfied by using XML. This document describes a DTD.

5. Characters

The unicode-name element encloses the Unicode name of that character. It is not applicable to gaiji characters. The name element is used for non-Unicode characters. Exactly one of unicode-name and name must be present. The unicode-1.0-name element encloses the Unicode 1.0 name of the character, if it exists. The alternative-names element encloses a set of alternative-name elements, which in turn enclose alternative names for this character.

<!ELEMENT unicode-name (#PCDATA)>

The code element encloses the Unicode code value of the character, using the U+xxxx syntax. The char element contains a single character, which is the character itself.

<!ELEMENT code (#PCDATA)>

The cross-references element encloses a set of cross-ref elements. Each cross-ref element contains a code element and a name element for the character which is referenced. The cross-ref element has a role attribute which can take the values inequal or other. The default value for that attribute is other.

<!ELEMENT cross-references (cross-ref)*>

The compatibility-decomposition element contains a sequence of characters into which the character being described can be compatibly decomposed. The canonical-decomposition element contains the characters into which the character being described is canonically decomposed.

<!ELEMENT compatibility-decomposition (#PCDATA)>

case can have the values UPPERCASE, TitleCase or lowercase.

combining-class encloses the combining class (in its numeric form).

directionality encloses the directionality property.

jamo-short-name encloses the Jamo short name property. It can be present only for Unicode conjoining Hangul jamo characters.

general-category

numeric-values is present if the character is a number. It encloses the numeric value as recorded in section 4.6. In addition, the attribute value is the numeric value represented as a decimal number, without ',' to separate the character groups. The attribute decimal can take the values yes or no.

mirrored is present for those characters that have the mirrored property.

mathematical is present for characters that have the mathematical property.

<!ELEMENT case (#PCDATA)>

The informative-note element contains an informative note.

<!ELEMENT informative-note ?>

Finally, these elements are assembled in a character element:

<!ELEMENT character ((unicode-name | name), unicode-1.0-name?,

Here are some examples:

<character>
  <unicode-name>LATIN CAPITAL LETTER A</unicode-name>
  <code>U+0041</code>
  <char>A</char>
  <directionality>LR</directionality>
</character>

<character>
  <name>COMBINING REVERSE SOLIDUS OVERLAY</name>
  <code>U+E000</code>
  <char>&#xE000;</char>
  <combining-class>1</combining-class>
</character>

<character>
  <unicode-name>DOLLAR SIGN</unicode-name>
  <alternative-names>
    <alternative-name>milreis</alternative-name>
    <alternative-name>escudo</alternative-name>
  </alternative-names>
  <code>U+0024</code>
  <char>$</char>
  <directionality>LR</directionality>
  <cross-references>
    <cross-ref><name>currency sign</name><code>U+00A4</code></cross-ref>
  </cross-references>
  <informative-note>Glyph may have one or two vertical bars. Other currency symbol characters: 20A0 ₠ - 20AF ₯</informative-note>
</character>

6. Collections

Collections are formed by grouping characters and by combining collections. A collection is well-formed iff:

An enumerated-collection is just a set of character elements.

<!ELEMENT enumerated-collection (character)*>

A ref-collection references an external collection (that is, external to the resource in which this reference occurs). It must have a system identifier, a URI, which may be used to retrieve the referenced collection. Relative URIs are relative to the location of the resource within which the ref-collection occurs. In addition, there may be a public identifier. A processor attempting to retrieve the referenced collection may use the public identifier to try to generate an alternative URI. If the processor is unable to do so, it must use the URI specified in the system identifier.

<!ELEMENT ref-collection EMPTY>

A union-collection groups the characters of multiple collections. If the set-wise union of those collections is not well-formed, characters of the later collections are removed from the union.

<!ELEMENT union-collection (%collection;)*>

A subsetted-collection removes some of the characters of a base collection. The characters to remove are identified by their code value.

<!ELEMENT subsetted-collection (%collection;, code*)>

A remapped-collection assigns new code points to the characters of a base collection.

<!ELEMENT remapped-collection (%collection;, %map;)>

A simple-map just lists pairs of code points. Characters which are not listed as the source of a pair are mapped to their original code point. No two pairs should map from the same character. The map should not assign two different characters to the same code point.

<!ELEMENT simple-map (replace)*>

A shift-map adds an offset (positive or negative) to each code point. By construction it preserves well-formedness.

<!ELEMENT shift-map (#PCDATA)> <!-- really, an integer -->

And this completes the means of constructing collections:

<!ENTITY % collection "(enumerated-collection|union-collection

Here are some examples.

<collection>
  <union-collection>
    <ref-collection publicID="-//Unicode Consortium//CHC Unicode v3.0"
                    systemID="ftp://ftp.unicode.org/data/chc/v3.0"/>
    <enumerated-collection>
      <character>
        <name>COMBINING REVERSE SOLIDUS OVERLAY</name>
        <code>U+E000</code>
        <char>&#xE000;</char>
        <combining-class>1</combining-class>
      </character>
    </enumerated-collection>
  </union-collection>
</collection>

Here is another collection that uses the same PUA code point, but defines it differently:

<collection>
  <union-collection>
    <ref-collection publicID="-//Unicode Consortium//CHC Unicode v3.0"
                    systemID="ftp://ftp.unicode.org/data/chc/v3.0"/>
    <enumerated-collection>
      <character>
        <name>Adobe Logo</name>
        <code>U+E000</code>
        <char>&#xE000;</char>
        <combining-class>1</combining-class>
      </character>
    </enumerated-collection>
  </union-collection>
</collection>

Let's assume that our first collection is accessible via the URI http://atm.corp.adobe.com/chc/eric.chc and the second is accessible via the URI http://oranda.corp.adobe.com/chc/adobecorp.chc. Just forming the union of those collections will drop one of the two PUA characters (the one in the collection mentioned second). The following collection can be built for documents that need both PUA characters:

<collection>
  <union-collection>
    <remapped-collection>
      <ref-collection systemID="http://atm.corp.adobe.com/chc/eric.chc"/>
      <simple-map>
        <replace from="U+E000" to="U+E001"/>
      </simple-map>
    </remapped-collection>
    <ref-collection systemID="http://oranda.corp.adobe.com/chc/adobecorp.chc"/>
  </union-collection>
</collection>

In documents that use this collection, the code point U+E000 refers to the Adobe Logo character, and the code point U+E001 refers to the COMBINING REVERSE SOLIDUS OVERLAY character.

7. Related work

The first source of inspiration is the XML world. In an XML document, the element names that are used have no particular meaning by themselves, just like the PUA code points have no meaning.
But in the XML world, this is the norm rather than the exception, and mechanisms have been designed to cope with it. In fact, these were a major source of inspiration: DTDs and XML schemas are similar to character collections, namespaces correspond to the collection bases, and the collection naming and referencing is based on DTD naming and referencing.

The W3C NOTE A Notation for Character Collections for the WWW by Martin Dürst is an XML DTD to describe sets of character code values. The main objective is to be able to answer the question "Is this character code in this collection?". Particular attention is paid to supporting efficient implementation when the set descriptions are resources on the network. While this is useful when the sets are made of standard characters, it is not enough to deal with private use characters, as it does not attach a meaning to them.

The ConScript Unicode Registry by John Cowan and Michael Everson is a registry of Private Use Area uses. The goal of this effort is to provide a centralized allocation of the Private Use Area. It does not attempt to record the semantics of the characters.

Adobe Systems Inc. Confidential. Copyright © 2001 Adobe Systems Inc.
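
Illustration of the collection operations in section 6. The following is a minimal sketch, not part of the proposal, of how a processor might evaluate union-collection, subsetted-collection, simple-map and shift-map. It assumes that a collection can be modelled as a mapping from code point to character name, and that "well-formed" means at most one character per code point; the function names (union, subset, simple_map, shift_map) and the sample data are hypothetical.

```python
from typing import Dict, Iterable

# Assumption: a collection is reduced to {code point: character name}.
# The proposal attaches more properties to each character than this.
Collection = Dict[int, str]


def union(*collections: Collection) -> Collection:
    """union-collection: on a code point clash, the earlier collection wins."""
    result: Collection = {}
    for collection in collections:
        for code, name in collection.items():
            result.setdefault(code, name)  # later duplicates are dropped
    return result


def subset(base: Collection, removed: Iterable[int]) -> Collection:
    """subsetted-collection: drop the characters with the listed code values."""
    removed_codes = set(removed)
    return {c: n for c, n in base.items() if c not in removed_codes}


def simple_map(base: Collection, pairs: Dict[int, int]) -> Collection:
    """simple-map: move listed code points; unlisted ones stay where they are."""
    result: Collection = {}
    for code, name in base.items():
        target = pairs.get(code, code)
        if target in result:
            # The map must not assign two characters to the same code point.
            raise ValueError(f"two characters mapped to U+{target:04X}")
        result[target] = name
    return result


def shift_map(base: Collection, offset: int) -> Collection:
    """shift-map: add a (possibly negative) offset to every code point."""
    shifted = {code + offset: name for code, name in base.items()}
    if any(code < 0 or code > 0x10FFFF for code in shifted):
        raise ValueError("offset moves a character outside the code space")
    return shifted


# Hypothetical stand-ins for the two collections used in the examples.
eric = {0x0041: "LATIN CAPITAL LETTER A",
        0xE000: "COMBINING REVERSE SOLIDUS OVERLAY"}
adobe = {0x0041: "LATIN CAPITAL LETTER A",
         0xE000: "Adobe Logo"}

# A plain union keeps only the first definition of U+E000.
plain = union(eric, adobe)

# Remapping the first collection before the union keeps both characters.
both = union(simple_map(eric, {0xE000: 0xE001}), adobe)
```

With this data, plain keeps COMBINING REVERSE SOLIDUS OVERLAY at U+E000 and drops the Adobe Logo, while both has the Adobe Logo at U+E000 and COMBINING REVERSE SOLIDUS OVERLAY at U+E001, matching the remapped-collection example in section 6.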

