Michael Everson raises some interesting matters.

>>Development of a character set for Egyptian hieroglyphics.
>
>I'm involved with this and there's been no talk about using the PUA.
>This is because existing 8-bit systems comprising a superset of what
>is likely to be encoded can easily be transcoded directly to Unicode
>without need for testing in the PUA. Because there are existing
>implementation and all that's needed is a set of (probably rich text)
>mapping tables.

My reason for including Egyptian Hieroglyphics was that I seemed to remember
something in one of the threads in this discussion forum about Egyptian
being encoded, starting with about 800 characters.  It may have been at
the time, in the same or another thread, that someone mentioned using the
Private Use Area for testing out characters before they are added into
regular Unicode.

Mention was made of continuing the discussion in a specialised discussion
group at Unicode and I tried to join that discussion list.  I could not find
it on the webspace and so I enquired.  It appears that I would most likely
have needed existing expertise in the language in order to join the
discussion group and, since I did not have that and was simply interested in
learning about the typography as a matter of general education, I decided
not to apply.  I mention this because it has been a feature of this
discussion group that when I or some others put forward a new idea, some
people - not you - suggest that the person with the idea spend their
energies on one of the many problems that need attention and so forth, yet
the reality is that unless one is representing an organization or already
has specific linguistic expertise, such opportunities are not available.

>
>>Development of a character set for Cuneiform tablets.
>
>Some use of PUA characters has I think been used by some font
>developers. But it's for exchange within a very small group of
>investigators.
>

Well, I would be interested to have a look at those founts if that were
possible.  Indeed, following the recent posting about cuneiform in this
forum, I have been having a look at various web sites about cuneiform and
the subject is fascinating.

If one first goes to

http://www.jhu.edu/ice/

one can follow a link in the SEE ALSO: section of that document labelled
database which takes one to

http://www.jhu.edu/ice/database/cuneiformsigns.xml

which contains a link to a project at the University of Birmingham at

http://www.eee.bham.ac.uk/cuneiform/

in England.

Clicking there on the NoFrames link leads to

http://www.eee.bham.ac.uk/cuneiform/menunf.html

and there is lots of interesting information to be found in the links from
that page.

In particular, the link Cuneiform leads to

http://www.eee.bham.ac.uk/cuneiform/cuneiform.html

which page includes a marvellous interactive illustration that uses a
Viewpoint Media Player.

There is a link labelled Instructions which leads to

http://www.eee.bham.ac.uk/cuneiform/instructions.html

and there is there a link for a free download of a Viewpoint Media Player
plug in.

Going back to

http://www.eee.bham.ac.uk/cuneiform/menunf.html

one can then use the link labelled Results in the 3D Digital Imaging list
which leads to the

http://www.eee.bham.ac.uk/cuneiform/3dresults.html

web page.

Down that page is a link labelled

Click here to interact with a complete 3D scan of the cuneiform tablet!

and that takes one to

http://www.eee.bham.ac.uk/cuneiform/tab1.html

which is a large display using the Viewpoint Media Player.

This is absolutely fantastic, and I feel that my knowledge of what is
available to be done on computers in the home over the internet leapt
forward greatly through this demonstration.

I then disconnected the telephone line link to the internet and the display
still worked.

However, if I go back then forward on the browser, the display is lost.

The instructions page

http://www.eee.bham.ac.uk/cuneiform/instructions.html

has a link labelled Viewpoint with the message To find out more about VET,
go to Viewpoint.

which opens a new window at

http://www.viewpoint.com/

VET stands for Viewpoint Experience Technology.

There is at that website a DEVELOPER CENTRAL section and there is there a
link labelled learn the basics where there is a file getstarted.pdf
available.  This is 792 kilobytes and can be downloaded to local storage for
off-line viewing.

There are also various other items available for download, yet I have not
yet reached that stage.

----

As I read about these ancient clay tablets and the great quantity of them
that exist I began to wonder as to what information about observations of
the night sky in ancient times survive.  For example, are there any
observations of comets that could be tied in with the object that we now
know as Halley's comet?  Or, when a new comet appears in the present day sky
and its orbit is calculated and it is said to return once in every 3000
years or so, is it possible to find it being observed in ancient times and
that observation recorded on clay tablets?

A search at http://www.yahoo.com using an advanced search with an AND on the
two words cuneiform and astronomy gave a number of interesting links.

One such is http://www.lexiline.com/lexiline/lexi42.htm which has a picture
of a clay tablet with a diagram upon it.


>>The eutocode system, which is part of my own research, which is mentioned in
>>the DVB-MHP (Digital Video Broadcasting - Multimedia Home Platform) section
>>at http://www.users.globalnet.co.uk/~ngo which is our family webspace in
>>England.
>
>"The unicode system is today a 21 bit system. Details are at the
>http://www.unicode.org website."? Where is a "21-bit system"
>mentioned on the Unicode website?

The quote is from the file
http://www.users.globalnet.co.uk/~ngo/ast02900.htm and is at the start of
the text of the document, which document is dated Tuesday 12 February 2002.

The answer to your specific question is that I am, after a bit of a search,
not able to find "21-bit system" mentioned on the Unicode website.

There is a reference in a FAQ about UTF and BOM to 21 significant bits.

I have learned a lot by looking up that document and in trying to understand
quite exactly why my statement should be questioned.  Is it that Unicode is
not defined as being a system of any number of bits but just as a system
that can be represented in various ways using 8 bit bytes, or 16 bit words
or 32 bit words?

Is what I wrote wrong?

My two sentences were intended to have the following meaning.

The unicode system is today a 21 bit system.  Details of unicode, which is
today a 21 bit system, are available at the http://www.unicode.org website.

I am fond of precision and try to be precise, so, if my statement is wrong I
will happily change it.  Yet what should I change it to become?  As far as I
know, unicode is a 21 bit system.  I accept that I cannot find a direct
statement to that effect on the Unicode website.

I ask that readers please consider someone approaching the Unicode website
for the first time and wondering what Unicode is.  He or she finds out that
it encodes the characters of the languages of the world and symbols for
mathematics and so on, may have heard of ASCII and know that it is 7 bit or
even 8 bit, and wonders how exactly Unicode manages to encode all the
characters for all of the languages of the world.  Perhaps the Unicode
website needs some more direct information straight off, as unfortunately
the size of the coding space only gradually gets through.  Indeed, perhaps
the first paragraph of Chapter 1 of the Unicode specification needs to
include a mention of the 21 bit coding space so that the new user of Unicode
can immediately get a grasp of what is going on.

There is a note on the page that I am reading - and I note that I saved the
file to hard disc on 22 February 2001 - that this excerpt from the book, The
Unicode Standard, Version 3.0, has been slightly modified for the web, so I
do not know what the book says, as the nature of the modifications is not
stated on the web page.

Please know that I am not trying to quibble my way out of criticism by
seeking to purport that precision is unimportant.  I believe that precision
is important, and I don't like it when other people pretend that precision
is unimportant in order to justify imprecision, for precision is the basis
of success.  Yet is my statement that the unicode system is today a 21 bit
system even imprecise, even if it is not wrong?  Let us go into this as
precisely as we in this discussion group can go.

So, is the unicode system today a 21 bit system?

Is there anyone prepared to state that it is not a 21 bit system, stating
reasons for so saying?
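As an aside, and this is simply my own illustration rather than anything
taken from the Unicode website, the arithmetic behind the figure of 21 bits
can be checked in a few lines of Python: the highest code point is U+10FFFF,
which needs 21 bits to express, whereas 20 bits would reach only up to
hexadecimal FFFFF.

```python
# The Unicode code space runs from U+0000 to U+10FFFF.
MAX_CODE_POINT = 0x10FFFF

# 21 bits are needed: 20 bits are not enough.
assert MAX_CODE_POINT.bit_length() == 21
assert MAX_CODE_POINT > 2 ** 20 - 1   # beyond what 20 bits can hold
assert MAX_CODE_POINT < 2 ** 21       # within what 21 bits can hold

print(hex(2 ** 20 - 1))       # the 20 bit ceiling, 0xfffff
print(hex(MAX_CODE_POINT))    # 0x10ffff
```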

Now, thinking about this whole situation, it appears to me that there is
scope for considerable future development of my idea that characters from a
cuneiform type tray within the Private Use Area and characters from a
eutocode type tray within the Private Use Area could be used together within
the same plain text file, so as to store both character codes and data about
the three-dimensional physical shape of a particular clay tablet which
carries cuneiform characters upon it.

It would seem possible that, in time, cuneiform characters will be
designated as regular Unicode characters in a plane that is not plane 0.  It
seems to me that, since unicode uses 21 bits whereas computer storage media
such as hard discs are oriented to 8 bit bytes, there is scope to develop a
file coding format that is essentially of a plain text nature where
characters are treated as being 24 bits long, so that each character is
stored as 3 bytes.  Of the 24 bit code points, expressed in hexadecimal,
those having the first hexadecimal character as 0 or 1 are reserved for
Unicode, 2 is unused, and the rest are used for inputting and manipulating
data.  Such a special file format and coding system might be very useful for
encoding physical data about cuneiform tablets and Unicode character codes
together in one essentially plain text file, the format of which could be
open so that everyone could use it without having to buy specific
proprietary software to handle that file format.

I have it in mind that the various sections would be as follows.

000000 to 10FFFF Unicode
110000 to 2FFFFF reserved
300000 to 3FFFFD control, though only a few of these code points are going
to be used.
400000 to 4FFFFD obey the current x0p process, load 18 bits of data into
register x0, obey the current x0q process.
500000 to 5FFFFD obey the current y0p process, load 18 bits of data into
register y0, obey the current y0q process.
600000 to 6FFFFD obey the current z0p process, load 18 bits of data into
register z0, obey the current z0q process.
700000 to 7FFFFD obey the current t0p process, load 18 bits of data into
register t0, obey the current t0q process.
800000 to 8FFFFD obey the current x0a process, load 18 bits of data into
register x0, obey the current x0b process.
900000 to 9FFFFD obey the current y0a process, load 18 bits of data into
register y0, obey the current y0b process.
A00000 to AFFFFD obey the current z0a process, load 18 bits of data into
register z0, obey the current z0b process.
B00000 to BFFFFD obey the current t0a process, load 18 bits of data into
register t0, obey the current t0b process.
C00000 to CFFFFD obey the current x1a process, load 18 bits of data into
register x1, obey the current x1b process.
D00000 to DFFFFD obey the current y1a process, load 18 bits of data into
register y1, obey the current y1b process.
E00000 to EFFFFD obey the current z1a process, load 18 bits of data into
register z1, obey the current z1b process.
F00000 to FFFFFD obey the current t1a process, load 18 bits of data into
register t1, obey the current t1b process.

The codes from 400000 through to FFFFFD are such that the two least
significant bits are always set to 00, so as to avoid any problems with any
ending FE or FF being encountered whilst still having total coverage of the
data set.
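To make the idea concrete, here is a minimal sketch in Python of how such 24
bit codes might be stored as three bytes each, and how an 18 bit data value
might be placed into one of the data sections with the two least significant
bits kept at 00.  The function names are simply my own, for illustration
only, and are not part of any existing format.

```python
def pack24(code):
    """Store one 24 bit code as three bytes, most significant first."""
    if not 0 <= code <= 0xFFFFFF:
        raise ValueError("code outside the 24 bit range")
    return code.to_bytes(3, "big")

def unpack24(data):
    """Read back 24 bit codes, three bytes at a time."""
    return [int.from_bytes(data[i:i + 3], "big")
            for i in range(0, len(data), 3)]

def encode_data(section, value):
    """Place an 18 bit value into a data section (first hexadecimal
    digit 4 to F), leaving the two least significant bits as 00 so
    that no code ends in FE or FF."""
    if not 0x4 <= section <= 0xF:
        raise ValueError("not a data section")
    if not 0 <= value < 1 << 18:
        raise ValueError("value does not fit in 18 bits")
    return (section << 20) | (value << 2)

# A Unicode character and one x0 data value, together in one stream.
stream = pack24(0x000041) + pack24(encode_data(0x4, 1234))
assert unpack24(stream) == [0x000041, 0x400000 | (1234 << 2)]
```

Note that the largest 18 bit value placed into section 4 gives 4FFFFC, which
stays within the 400000 to 4FFFFD range given in the list above.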

The 24 processes x0p, x0q and so on all have default actions.  There is a
choice of processes available, which can be set using 24-bit codes that
begin with hexadecimal 3.  For example, defaults might include actions such
as do nothing and move pen and draw line and move data to oldx, oldy, oldz
and oldt registers so as to produce a vector graphics drawing system with a
data resolution of 36 bits in x, y, z and t axes.  Data storage needing 18
or less bits of resolution would not use the codes from C00000 through to
FFFFFD at all.  There could be processes that could be chosen that do such
things as a choice for z0p that will autoincrement x by one step so that a
sequence of 24 bit codes could be used to encode a sequence of values of z
which automatically increment x by one step for each value of z that is
supplied.  This would mean, for example, that physical scan data for heights
of clay in a clay tablet in a scan along a line could be encoded in a fairly
packed manner, yet the file coding having the flexibility to include tags
that link specific Unicode cuneiform characters to specific areas of the
clay tablet.  Please know that this is suggested as a general overview.  A
lot of research needs to be done in order to devise a coding system that
would be useful in practice, rather than merely the interesting speculation
on the possibilities that is presented here.
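As a further illustration of the autoincrement idea, the following is a
speculative sketch, with class and process names of my own invention, of how
a chosen z0q process might record a height and then step x0 along by one for
each z value supplied, so that a scan line of clay heights packs into a
plain sequence of z0 codes.

```python
class TabletScanDecoder:
    """A much simplified, speculative decoder: each code in the z0
    data section loads 18 bits into the z0 register and then the
    current z0q process runs.  The default z0q process here records
    the point and autoincrements x0 by one step."""

    def __init__(self):
        self.x0 = 0
        self.z0 = 0
        self.points = []
        self.z0q = self.record_and_step_x0   # the current z0q process

    def record_and_step_x0(self):
        self.points.append((self.x0, self.z0))
        self.x0 += 1

    def feed(self, code):
        section = code >> 20            # first hexadecimal digit
        value = (code >> 2) & 0x3FFFF   # the 18 data bits
        if section == 0x6:              # z0 section, 600000 to 6FFFFD
            self.z0 = value
            self.z0q()

# Heights along one scan line, encoded as z0 codes only.
decoder = TabletScanDecoder()
for height in [7, 9, 12]:
    decoder.feed(0x600000 | (height << 2))
assert decoder.points == [(0, 7), (1, 9), (2, 12)]
```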

Yet I feel that such application possibilities, being able to use Unicode
characters in conjunction with graphic data with everything encoded together
in an open format file, are important for the future.

>
>>Within my classification system, suppose please that someone developing the
>>character codes for Egyptian hieroglyphics requests
>
>Of whom? Of the maintainers of the "type tray" maintainers (an
>analogue to John Cowan and me, for ConScript)? But why? ConScript has
>a number of fun scripts in it and people might be interested in
>encoding or exchanging more than one script, that's why there's a
>central registry. But it seems to me that people exploring the
>encoding of Egyptian and Cuneiform won't be worried about an overlap
>-- apart from me, and if asked I'd just get my fellows to assign two
>separate blocks to do it just in case there were a problem. For those
>scripts, though (or for Blissymbols, another candidate for PUA test
>implementation), the intention would be to use PUA as a very
>temporary stopgap for testing.

Well, one of the difficulties of being an inventor, a matter which needs
great tact and diplomacy, is that when putting forward a new idea that one
wishes to become accepted, one must balance carefully between offering
leadership and acting in a self-centred manner.  So, although I am happy to
be the person of whom the requests are made if that helps get the idea into
use, I am reluctant to suggest that it be me in case it look as if I am on
an empire building trip, yet I am also reluctant to suggest that the
requests be considered by some other person or committee lest it be thought
that I am trying to get someone else to do the work rather than carry it out
myself.  Sigh.

I had in mind a sort of informal arrangement whereby someone requesting a
type tray designation, something which would probably only happen
infrequently, would post the request in this [EMAIL PROTECTED] discussion
forum, and any one of a number of people interested in the classification
system could offer advice if a clash with some existing use would occur or
if the suggested type tray code would be in the wrong part of the
classification; after any discussion had settled within a few days, any
requester would typically have his or her request granted.  I got the idea
for this system from observing the newsgroup alt.config in action in
defining the alt.* part of the newsgroup hierarchy.  It sounds unstructured,
yet it works very efficiently in practice, and people wishing to start new
newsgroups are encouraged and helped to get their new newsgroup started.

When I first started researching eutocode, I took great care to have a look
at the ConScript registry, specifically to try to avoid clashes in code
point allocation between eutocode and the ConScript registry.  I am aware
that I did not need to do that in the sense of any legalistic obligation,
yet I felt that I needed to do so in order to gain knowledge of the state of
the art of use of the Private Use Area.  As it happens, eutocode only
overlaps a little with the ConScript registry allocations of code points.
Eutocode was, in fact, placed in the upper part of the U+E... section,
towards the middle of the Private Use Area, so as to minimize clashes with
other uses of the Private Use Area as far as possible, and I was pleased
that I had, at that time, avoided overlap with allocations in the ConScript
registry.  There is now, unfortunately, an overlap of eutocode with the
ConScript Registry around U+E800.

Certainly, if you so wish, and I accept that you may not so wish, the
ConScript registry could have a type tray code within the classification
system.  Perhaps I might also be allowed to request that the existence of
the classification system codes be noted in the ConScript registry, in the
sense that perhaps you might choose to avoid placing anything in the U+F3..
block.

In relation to the Private Use Area as mentioned in the Unicode
specification, I feel that it would be helpful if there were a note of
guidance that the area U+F000 to U+F0FF is used for the symbol founts.  I
was aware that there was such an area and was trying to find out where it
was located.  Now, I appreciate that such a note might be seen as
endorsement of specific allocations, yet the other way of looking at it is
that someone could be trying to work within the Unicode specification rules,
doing his or her best to do so, yet a piece of information like that, which
is very relevant, is not easily available, perhaps because placing such a
note in the specification would be seen as endorsement.  Please consider:
apart from picking the information up serendipitously by happening to read
this posting or some other posting that refers to it, how exactly is a
person learning to use the Unicode system and to apply it as an end user
supposed to find out such information, which, after all, is quite important
when it comes to allocating codes within the Private Use Area?

Maybe some sort of all but tacit understanding that, say, Egyptian
Hieroglyphics and Cuneiform development within the Private Use Area do not
overlap might be achieved.  Yet such an arrangement would necessarily cut
down on the number of code points available for each.  Also, eventually, if
some other learned group tried to "fit in" and overlap neither Egyptian
Hieroglyphics nor Cuneiform in the Private Use Area, it might become
impossible to do.  Certainly, some scripts might have been promoted out of
the Private Use Area by then, yet I feel that even where there is promotion
to permanent places in regular Unicode, there are three problems that need
to be addressed.  The first is that there may be works in existence that
will not get updated - for example, a student project done with what was
available at the time when the fount was in the Private Use Area.  The
second is that the Private Use Area founts don't get changed on all of the
PCs: obsolescent encodings still get produced.  The third is that some
systems will not handle 21 bit unicode and so the old Private Use Area
founts will remain in use.  Someone in this discussion forum referred some
time ago to the great tsunami (tidal wave) whereby companies assume that
everyone has converted to the latest version of their hardware and software.
This is not always the actual situation: old PCs go on for years in college
departments, and a new PC often means that the total number available
increases, not that any machine is actually discarded.

Thank you for mentioning Bliss.

>
>>that he or she be assigned type tray 3001 and that someone
>>developing character codes for cuneiform requests the assignment of
>>type tray C001 and that I request type tray E001 for the eutocode
>>system, and that all of these requests are granted.
>
>That's more or less how ConScript functions.
>
>>Then, in order to apply the classification system to any plain text file,
>>the file needs to contain some classification characters near the start.
>>
>>For a file using the Egyptian hieroglyphics characters, the following
>>sequence would be needed.
>>
>>U+F35B U+F333 U+F330 U+F330 U+F331 U+F35D
>
>I don't understand this. Just assign Egyptian and Cuneiform to two
>separate areas.

No, it is important that the individual or group designating codes for
Egyptian and Cuneiform have a good number of code points in order to have
scope for their work.  My suggested classification system means that they
need to avoid only the U+F3.. block for their designations if they choose to
use it.  Also, my suggestion can be applied retrospectively if someone has
already prepared documents and founts, as long as there is no clash with the
U+F3.. block.
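By way of illustration, and the function name here is simply my own
invention, a check for a clash with the U+F3.. block, that is, the 256 code
points from U+F300 to U+F3FF, could be as simple as the following.

```python
def clashes_with_classification_block(text):
    """Return True if any character of the text lies in the U+F300
    to U+F3FF block, which the suggested classification system
    would reserve."""
    return any(0xF300 <= ord(ch) <= 0xF3FF for ch in text)

# The suggested classification sequence itself uses the block...
assert clashes_with_classification_block(
    "\uF35B\uF333\uF330\uF330\uF331\uF35D")
# ...while ordinary text does not.
assert not clashes_with_classification_block("cuneiform")
```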

>
>>Suppose then that one day someone comes across a plain text file and within
>>that plain text file are character codes from the Private Use Area and that
>>person has no idea as to which character set those character codes may be
>>intended to represent.
>
>The person wouldn't, because PUA values are agreed between sender and
>receiver.

Or published.

Yet, what if that person is a researcher in a department of ancient
languages in a university, and there are all of these files on the hard disc
of a computer, files which were produced two years ago by a student who has
since left?

>I think people working with Egyptian, Cuneiform, or Blissymbols will
>use their own fonts for private research and can't imagine how a
>central clearing house would benefit them.

It is not a central clearing house.  It is just a list of type trays.  A
person or group that requests a type tray simply has, within the
classification system, that type tray designation: designations of
individual code points within the type tray are not noted in the list.  This
would be radically different from the ConScript registry, which provides
lots of interesting information about the various scripts.

William Overington

29 April 2002