Re: Defined Private Use was: SSP default ignorable characters

Philippe Verdy Wed, 28 Apr 2004 13:13:28 -0700

From: "Peter Kirk" <[EMAIL PROTECTED]>
>
> Software developers, or applications, are not supposed to be party to
> the agreement between *users*.


Are you saying that software developers fail to comply with Unicode rules when
they refuse to develop systems that allow *users* to make such private
agreements and to use the PUAs effectively, as users are legitimately entitled
to ask of their software developers?

Interesting point. This would be an argument for the development (outside of
Unicode) of some standard technical solutions to exchange these private
conventions on PUA usage, including exchange of character properties, etc.

Why not then within fonts -- namely in Opentype tables for fonts built with
these PUA assignments?

If so, a fully Unicode-compliant system should offer ways to interchange data
between the parties to these private agreements, and ensure that the PUA
encoding conventions are isolated and kept within the domain of the private
agreement. For example, documents could be labelled with tags containing a URI,
either out of band in rich text formats such as XML or precomposed PDF files,
or in band within the encoded text using special tags, in a way similar to
language tags. Currently, however, Unicode has not defined such an area in
plane 14 for any use other than language tags.

I note however that language tags (even if they are discouraged by Unicode) are
not deprecated, and that they could even be used, following the RFC 3066
format or one of its extensions, to carry additional attributes identifying
private communities that share a common agreement.
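
To make the in-band option concrete, here is a minimal sketch of how a tag
string is carried in plane 14: U+E0001 LANGUAGE TAG followed by tag characters
that mirror printable ASCII at U+E0020..U+E007E. The tag value "x-pua-demo" is
purely hypothetical; nothing in Unicode or RFC 3066 assigns it any meaning, and
the function names are mine.

```python
LANGUAGE_TAG = 0xE0001  # U+E0001 LANGUAGE TAG
TAG_BASE = 0xE0000      # tag characters mirror ASCII: U+E0020..U+E007E


def encode_tag(tag: str) -> str:
    """Encode a printable-ASCII tag string as a plane 14 tag sequence."""
    if not all(0x20 <= ord(c) <= 0x7E for c in tag):
        raise ValueError("tag must be printable ASCII")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_BASE + ord(c)) for c in tag)


def decode_tag(text: str) -> tuple[str, str]:
    """Split a leading tag sequence off text; return (tag, remaining text)."""
    if not text or ord(text[0]) != LANGUAGE_TAG:
        return "", text
    i = 1
    while i < len(text) and 0xE0020 <= ord(text[i]) <= 0xE007E:
        i += 1
    tag = "".join(chr(ord(c) - TAG_BASE) for c in text[1:i])
    return tag, text[i:]


# A hypothetical private tag in front of one BMP PUA character.
tagged = encode_tag("x-pua-demo") + "\uE000"
tag, rest = decode_tag(tagged)
```

This only shows the transport; what scope the tag has over the text that
follows is exactly the part that would need to be specified.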

So with Unicode language tags containing a standard language code and
attributes, such an extension would become possible if Unicode specified less
ambiguously how to handle documents containing language tags (notably their
scope of application within the encoded document). When a plain text document
is later converted to some rich text format, the language tag could be
extracted and put out of band within some XML schema to describe the semantics
of the encoded plain-text fragments containing PUAs, within their restricted
scope.

So instead of identifying PUAs only by their codepoint (which is bound to a
unique namespace), they would be identified within a namespace made of the
private agreement URI and the codepoint (quite similar to the concept of
namespaces in XML, where all entities are named within a well defined scope).
One way to cope with this would then be to reserve and bind all non-PUA and all
invalid codepoints, in all possible namespaces, to the Unicode.org namespace.
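
As a rough sketch of that naming rule (not a proposal for an actual API), a
character can be modelled as a (namespace, codepoint) pair in which only PUA
codepoints take the agreement's URI. The UNICODE_NS string and the
"urn:example:..." URIs below are placeholders I made up for illustration.

```python
from typing import NamedTuple

UNICODE_NS = "unicode.org"  # placeholder name for the shared default namespace


def is_pua(cp: int) -> bool:
    """True for the three Private Use ranges (BMP, plane 15, plane 16)."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


class QualifiedChar(NamedTuple):
    namespace: str
    codepoint: int


def qualify(cp: int, agreement_uri: str) -> QualifiedChar:
    """Bind PUA codepoints to the agreement, everything else to Unicode."""
    return QualifiedChar(agreement_uri if is_pua(cp) else UNICODE_NS, cp)


# The same PUA codepoint under two hypothetical agreements stays distinct,
# while a standard character is identical under both.
a = qualify(0xE000, "urn:example:agreement-a")
b = qualify(0xE000, "urn:example:agreement-b")
same = qualify(0x41, "urn:example:agreement-a") == qualify(0x41, "urn:example:agreement-b")
```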

There's a way to make those PUAs easily manageable by users:
- let each user have a registry of PUA agreements (identified in interchanges by
their URI). If the user accepts an agreement, it is recorded in that user's
registry.
- the registry will map each described Unicode PUA codepoint to a non-Unicode
codepoint (for example in the larger 31-bit space which was originally defined
for ISO 10646). These internal mappings will allow local-only management of
these encoded strings. For all interchanges, each non-Unicode codepoint (outside
the first 17 planes) will be looked up in the user database, which will remap
this 31-bit codepoint into the URI + the 21-bit Unicode PUA codepoint, so that
the document can be regenerated either as plain text using language-tag
tagging, or using XML attributes, or in some other rich text format...
- for local document handling, UTF-8 (the original version!) or UTF-32 could be
used to easily manage all private character properties, without colliding with
PUAs used in other private agreements or with other standard Unicode codepoints.

Such a solution would have the additional effect of greatly reducing the
number of PUAs needed in Unicode, and everyone could use them as they wish, each
with their own sets of character properties (including by overriding the default
combining classes and canonical decompositions!). There is no need to split the
PUA space, which, with more than 135,000 codepoints, is really large enough to
encode any single private agreement.

The difficulty will lie in how to describe this agreement within a URI:
what should that URI provide? If it's a URL, it could be that of an XML
document describing the set of conventions, the property tables and the sets of
suggested or required fonts... The problem is then to create and maintain a
schema that allows describing these conventions. Such a schema should be able to
contain at least all the properties that are already described in Unicode, plus
some other private data or tables.

The next complexity will arise when one wants to extend an agreement to allow
migrating data from one private convention to another. This looks exactly
like describing a transliteration scheme working within the larger local-only
31-bit space... And it can be as complex as other stateful transliteration
schemes, or as simple as mapping legacy 8-bit sets to Unicode (using
simple stateless mappings).
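
For the simple end of that range, a stateless migration is just a table
lookup per codepoint. The sketch below assumes a made-up one-to-one table
between two agreements; the codepoint values and names are hypothetical, and
the stateful case would need far more than this.

```python
def is_pua(cp: int) -> bool:
    """True for the three Private Use ranges (BMP, plane 15, plane 16)."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


# Hypothetical table: agreement A's U+E000 and U+E001 correspond to
# agreement B's U+E100 and U+E101.
A_TO_B = {0xE000: 0xE100, 0xE001: 0xE101}


def migrate(codepoints: list[int], table: dict[int, int]) -> list[int]:
    """Apply a stateless mapping; standard characters pass through."""
    out = []
    for cp in codepoints:
        if not is_pua(cp):
            out.append(cp)         # standard Unicode means the same in both
        elif cp in table:
            out.append(table[cp])  # private character with a counterpart
        else:
            raise KeyError(f"U+{cp:04X} has no counterpart in the target agreement")
    return out


converted = migrate([0x41, 0xE000, 0xE001], A_TO_B)
```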
