Towards a classification system for uses of the Private Use Area (derives from Re: Private Use Agreements and Unapproved Characters)

William Overington Fri, 26 Apr 2002 06:21:03 -0700

I have now modified my idea for a system and put the modified system forward
for discussion.  A copy of the original document is appended.


----

Consider please that there exists for the Private Use Area the concept of
the hexadecimal point.  The term "hexadecimal point" is similar to the
concept of a decimal point, the difference being that a decimal point is for
base 10 numbers and a hexadecimal point is for base 16 numbers.

A classification system could regard all characters that are defined within
it to have a code point value that is a real number, consisting of a part to
the left of the hexadecimal point that is a value from the Private Use Area
range of code points, and a part to the right of the hexadecimal point that
is a value that is assigned as part of the method of registering the
characters as being included in the classification system.  So, for example,
if a set of characters for some particular script P is registered, it might
be registered as having a part to the right of the hexadecimal point of,
say, 1005 so that if the characters were placed at U+E000 through to U+E0FF
then they would be regarded within the classification system as being at
A+E000.1005 through to A+E0FF.1005 at integer spaced intervals.

Four hexadecimal places would seem to be a good balance between having scope
and avoiding complexity, with .0 being unused for allocating to characters
and having a meaning of "undefined".  However, the possibility of more than
four hexadecimal places is kept open so that later expansion is always
possible.

Although phrased as a part to the right of the hexadecimal point, these
agreed system codes are really designations of trays of character designs;
however, the use of the hexadecimal point is convenient for expressing an
individual character in the form, for example, of A+E023.1005 so that its
meaning is uniquely defined within the system.
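The notation above can be illustrated with a minimal Python sketch.  The
function names and the choice to hold the tray code as a string of
hexadecimal digits are assumptions made for illustration only; they are not
part of the proposal.  The Private Use Area ranges are those defined by
Unicode.

```python
# Private Use Area ranges defined by Unicode (BMP, plane 15, plane 16).
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)

HEX_DIGITS = "0123456789ABCDEF"


def is_private_use(code_point: int) -> bool:
    """Return True if code_point lies in one of the Private Use Area ranges."""
    return any(lo <= code_point <= hi for lo, hi in PRIVATE_USE_RANGES)


def classified_name(code_point: int, tray_code: str) -> str:
    """Format a character as 'A+XXXX.TTTT' (hypothetical helper).

    tray_code is the part to the right of the hexadecimal point, given as
    at least four hexadecimal digits.  A value of zero is rejected because
    .0 is reserved to mean "undefined".
    """
    if not is_private_use(code_point):
        raise ValueError("code point is not in the Private Use Area")
    tray = tray_code.upper()
    if len(tray) < 4 or any(c not in HEX_DIGITS for c in tray):
        raise ValueError("tray code must be four or more hexadecimal digits")
    if int(tray, 16) == 0:
        raise ValueError(".0 means 'undefined' and is not allocated")
    return f"A+{code_point:04X}.{tray}"
```

Keeping the tray code as a string rather than an integer preserves leading
zeros and leaves room for more than four hexadecimal places later.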

The classification system would primarily use U+E000 through to U+EFFF for
defining blocks of up to 4096 characters.  U+F35B would mark the start of
defining the part to the right of the hexadecimal point and U+F35D would
mark the end.  Between them, the characters U+F330 through to U+F339 provide
the digits 0 to 9 and U+F341 through to U+F346 provide the letters A to F
for expressing the hexadecimal value.  In founts that implement scripts that
are codified within the Private Use Area using this classification system,
these characters are set as being zero width so that they do not display in
a document.

There would, however, be the possibility of having a general fount that is
provided specifically as an analysis tool, to be used when trying to detect
such a type tray sequence within a plain text file where the characters are
from the Private Use Area and the coding is as yet unknown to the analyst.
In that fount the characters would be displayed as follows.

U+F35B as a LEFT SQUARE BRACKET AUGMENTED WITH A SQUARE.
U+F35D as a RIGHT SQUARE BRACKET AUGMENTED WITH A SQUARE.
Digits along the pattern of U+F330 as DIGIT ZERO WITH A SQUARE BENEATH.
Letters along the pattern of U+F342 as LATIN CAPITAL LETTER B WITH A SQUARE
BENEATH.

For the avoidance of doubt, these are open squares, not filled squares.  For
the brackets, the square would be a small square superimposed on top of the
vertical line of the bracket symbol, centred on the centre of that vertical
line.
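As a rough text-level counterpart to such an analysis fount, the following
Python sketch makes the marker characters visible by substituting ordinary
brackets, digits and letters.  It relies on the observation that the marker
code points given above sit at an offset of 0xF300 from their ASCII
counterparts (U+F35B from U+005B, U+F330 from U+0030, U+F341 from U+0041).
The function name and the substitution approach are assumptions for
illustration; the proposal itself describes only a fount.

```python
# Offset between each marker character and its ASCII counterpart,
# inferred from the code points listed in the proposal.
MARKER_OFFSET = 0xF300

# The bracket, digit and letter characters that have marker counterparts.
VISIBLE_FORMS = "[]0123456789ABCDEF"

# Map each marker code point to its visible ASCII form.
MARKER_TO_VISIBLE = {MARKER_OFFSET + ord(c): c for c in VISIBLE_FORMS}


def reveal_markers(text: str) -> str:
    """Replace type tray marker characters with visible ASCII forms.

    All other characters, including other Private Use Area characters,
    are passed through unchanged.
    """
    return "".join(MARKER_TO_VISIBLE.get(ord(ch), ch) for ch in text)
```

An analyst could run a file of unknown coding through such a function and
look for bracketed hexadecimal values in the output.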

A plain text file could indicate, for uses involving use of the
classification system, the use of the particular script P mentioned above
using the following sequence of characters.

U+F35B U+F331 U+F330 U+F330 U+F335 U+F35D
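A minimal Python sketch of how such a sequence could be built from a type
tray code follows.  The function names are hypothetical; the code points are
those given above.

```python
START_MARKER = "\uF35B"  # start of the part to the right of the hexadecimal point
END_MARKER = "\uF35D"    # end of the part to the right of the hexadecimal point


def marker_for(hex_digit: str) -> str:
    """Return the marker character for one hexadecimal digit (0-9, A-F)."""
    if len(hex_digit) == 1 and hex_digit in "0123456789":
        return chr(0xF330 + int(hex_digit))               # U+F330..U+F339
    if len(hex_digit) == 1 and hex_digit in "ABCDEF":
        return chr(0xF341 + ord(hex_digit) - ord("A"))    # U+F341..U+F346
    raise ValueError(f"not a hexadecimal digit: {hex_digit!r}")


def encode_tray_designator(tray_code: str) -> str:
    """Wrap the digits of tray_code between the start and end markers."""
    digits = "".join(marker_for(d) for d in tray_code.upper())
    return START_MARKER + digits + END_MARKER


def as_code_points(text: str) -> str:
    """Show a string in 'U+XXXX U+XXXX ...' form for inspection."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)
```

Calling as_code_points(encode_tray_designator("1005")) reproduces the
sequence shown above for script P.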

All characters in the Private Use Area would be presumed to have that part
to the right of the hexadecimal point until another sequence starting U+F35B
were received.  This classification system would be primarily intended to be
used for characters in the range U+E000 through to U+EFFF, yet is defined
for all Private Use Area characters, including those in planes 15 and 16,
for completeness.  This means that the classification system can also apply
to those character designs, such as user defined founts of dingbat type
things, that use U+F000 through to U+F0FF.  Codes starting with a D to the
right of the hexadecimal point could be allocated to them for permanent use.
For example, some particular such fount Q might be designated to have, say,
D157 to the right of the hexadecimal point.

The use of this hexadecimal point technique would allow characters from
several different character sets to be used in the same plain text file if
so desired.
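The way a reader of such a file might track the current type tray can be
sketched in Python as below.  This is only one possible reading of the
rules above: the function name, the use of "0" for Private Use Area
characters seen before any designator (following the "undefined" meaning of
.0), and the decision to raise an error on a malformed sequence are all
assumptions.

```python
PRIVATE_USE_RANGES = ((0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD))
START, END = 0xF35B, 0xF35D
# Marker code points for the hexadecimal digits, mapped to plain digits.
HEX_MARKERS = {0xF330 + i: "0123456789"[i] for i in range(10)}
HEX_MARKERS.update({0xF341 + i: "ABCDEF"[i] for i in range(6)})


def is_private_use(code_point: int) -> bool:
    return any(lo <= code_point <= hi for lo, hi in PRIVATE_USE_RANGES)


def classify(text: str) -> list:
    """Return (code point, tray code) pairs for the characters of text.

    Private Use Area characters carry the tray code most recently set by a
    designator sequence; other characters carry None.  The designator
    sequences themselves are consumed and do not appear in the result.
    """
    current_tray = "0"   # assumption: .0 ("undefined") until a designator is seen
    pending = None       # digits collected inside an open designator, else None
    result = []
    for ch in text:
        cp = ord(ch)
        if cp == START:
            pending = []
        elif pending is not None:
            if cp == END:
                current_tray = "".join(pending)
                pending = None
            elif cp in HEX_MARKERS:
                pending.append(HEX_MARKERS[cp])
            else:
                raise ValueError(f"unexpected U+{cp:04X} inside designator")
        elif is_private_use(cp):
            result.append((cp, current_tray))
        else:
            result.append((cp, None))
    if pending is not None:
        raise ValueError("designator sequence was not terminated")
    return result
```

A file that switches from script P (1005) to fount Q (D157) part way through
would then yield pairs carrying "1005" up to the second designator and
"D157" after it.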

Naturally, it would be best to avoid using characters in the range U+F300
through to U+F3FF for characters within type trays defined using this
system.  I have chosen the U+F3.. block for this suggested classification
system as I am unaware of any existing use of that area.  If anyone knows of
any present use of the U+F3.. block I would be pleased to know of that use.
However, as it may be difficult to find any part of the Private Use Area
that is not being used for something by somebody somewhere, perhaps any such
overlap will just have to exist as a known limitation of the classification
system.

The classification system could also include codes for characters that are
never intended to become standard Unicode characters yet for which a
universal designation would be helpful.

The matter of keeping the system up to date could be partly resolved by, for
scripts that are under consideration for inclusion in Unicode, a time out of
validity of a year and a day after the publication of some particular
version number of the Unicode specification.  If necessary the time out
could be extended if a decision about whether to include the characters in
Unicode has not been reached by the time of that version of Unicode being
published, or, if necessary the validity could be made permanent if the
decision is not to include the characters in Unicode.

I feel that such a classification system would be very helpful and
potentially of great use.

I feel that such a classification system for the Private Use Area should be
established and I am interested in writing it up and publishing it and in
participating in allocations of type tray codes to those people who would
like to have a code allocated.

I recognize that this classification system will not be codified as part of
the Unicode system as such, unless the meanings of the codes in the U+F3..
block that I have suggested become promoted to regular Unicode codes at some
future date.  As to whether they could be promoted, that is a debatable
point, yet I suggest that regular Unicode code points that designate an
optional classification system that may but need not be used in conjunction
with the Private Use Area would not be endorsing any "assignment to a
particular set of characters" but simply allowing a more rigorous
classification system to be used than can be used where the codes used to
produce the classification system are within the area that is being
classified.

However, whilst recognizing that having the codes used to produce the
classification system of the Private Use Area within the Private Use Area
itself could lead to problems, I feel that, with care, the suggested
classification system could be used to provide a workable system that could
be of good use in practice.

William Overington

26 April 2002

-----Original Message-----
From: William Overington <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
Cc: Patrick Rourke <[EMAIL PROTECTED]>; Doug Ewell
<[EMAIL PROTECTED]>; [EMAIL PROTECTED]
<[EMAIL PROTECTED]>
Date: Wednesday, March 13, 2002 12:50 PM
Subject: Re: Private Use Agreements and Unapproved Characters


>Here is a system that I think would work.
>
>Consider please that there exists for the private use area the concept of
>the hexadecimal point.  The term "hexadecimal point" is similar to the
>concept of a decimal point, the difference being that a decimal point is for
>base 10 numbers and a hexadecimal point is for base 16 numbers.
>
>An agreed system could regard all characters that are defined within it to
>have a code point value that is a real number, consisting of a part to the
>left of the hexadecimal point that is a value from the private use area
>range of code points, and a part to the right of the hexadecimal point that
>is a value that is assigned as part of the method of registering the
>characters as being included in the agreed system.  So, for example, if a
>set of characters for some particular script P is registered, it might be
>registered as having a part to the right of the hexadecimal point of, say,
>1005 so that if the characters were placed at U+E000 through to U+E0FF then
>they would be regarded within the agreed system as being at A+E000.1005
>through to A+E0FF.1005 at integer spaced intervals.
>
>Four hexadecimal places would seem to be a good balance between having scope
>and avoiding complexity, with .0 being unused for allocating to characters
>and having a meaning of "undefined".
>
>One possibility for the agreed system would be to use U+E000 through to
>U+EFFF for defining blocks of up to 4096 characters and then using two
>characters from the range U+F000 through to U+F8FF to mean the start and the
>end of defining the part to the right of the hexadecimal point.  I am aware
>at the back of my mind that some of the characters in the range U+F000
>through to U+F8FF are often used for a particular type of user defined fount
>such as dingbat type things, so I wonder if someone could please say if they
>know of that matter so that any suggestions for defining these start and end
>of defining codes does not clash with that usage.  Indeed that usage could
>be included into the agreed system and codes starting with a D to the right
>of the hexadecimal point could be allocated to them for permanent use.  For
>example, some particular such fount Q might be designated to have, say,
>D157 to the right of the hexadecimal point.
>
>Suppose though, on a temporary basis herein pending resolution of that
>matter, that within the agreed system U+F000 were to mean the start of
>defining the part to the right of the hexadecimal point and U+F001 were to
>mean the end of defining the part to the right of the hexadecimal point,
>then a plain text file could indicate, for uses involving use of the agreed
>system, the use of the particular script P mentioned above using the
>following sequence of characters.
>
>U+F000 U+0031 U+0030 U+0030 U+0035 U+F001
>
>All characters in the private use area would be presumed to have that part
>to the right of the hexadecimal point until another sequence starting U+F000
>were received.
>
>The use of this hexadecimal point technique would allow characters from
>several different character sets to be used in the same plain text file.
>
>The agreed system could also include codes for characters that are never
>intended to become standard Unicode characters yet for which a universal
>designation would be helpful.  These character sets could use designations
>starting with some character such as C to the right of the hexadecimal
>point.
>
>Although phrased as a part to the right of the hexadecimal point, these
>agreed system codes are really designations of trays of character designs;
>however, the use of the hexadecimal point is convenient for expressing an
>individual character in the form, for example, of A+E023.1005 so that its
>meaning is uniquely defined within the system.
>
>Also, by using a part to the right of the hexadecimal point, the system has
>unlimited scope.
>
>The matter of keeping the system up to date could be partly resolved by, for
>scripts that are under consideration for inclusion in Unicode, a time out of
>validity of a year and a day after the publication of some particular
>version number of the Unicode specification.  If necessary the time out
>could be extended if a decision about whether to include the characters in
>Unicode has not been reached by the time of that version of Unicode being
>published, or, if necessary the validity could be made permanent if the
>decision is not to include the characters in Unicode.
>
>I feel that such an agreed system would be very helpful and potentially of
>great usefulness.
>
>The next matter is as to what is meant by agreed in the phrase agreed
>system.
>
>I feel that if the matter is discussed here in this discussion forum then
>whatever consensus exists when the discussion hopefully reaches a consensus
>could be taken as the agreed system.  Please know that although the phrase
>"private agreement" is used in the specification in the section about the
>private use area, later in that section the word "published" is used, so one
>does not, in fact, need any agreement at all, it is quite permissible to
>simply publish one's own suggested system.  Naturally, the more agreement
>amongst those people who express an interest that one can achieve the better
>that is, yet I feel that the best way forward is to discuss a system and
>then proceed by taking on board such comments that are received that can be
>accommodated in the system and then publishing a system and starting to use
>that system and then anyone who so wishes may participate in the use of that
>published system.
>
>William Overington
>
>13 March 2002