On 12/12/2003 04:29, Philippe Verdy wrote:
...
But what you suggest here is exactly what a standard file compressor does.
It does not solve any problem in the representation of characters, the
compression scheme remains private, and can only be interpreted as text by
redecomposing these PUAs (in their scope) to the appropriate complex DGCs.
In addition, you need to find a way to store these associations between PUAs
and DGCs, so the complexity is even worse.
You would probably use it only if there are multiple occurrences of these
complex DGCs, just to save some space (this is what is performed in the
Hangul Johab syllables, as they occur very frequently when writing modern
Korean, and the space benefit comes from the fact that there is no need to
encode the associations between syllables and DGCs of jamos, as this is
defined by their canonical equivalences and implemented with a very basic
algorithm).
So unless you can create such a simple algorithm to map complex DGCs to PUA
ranges, there's little use for what you propose here.
This is not intended as a file compression technique. (Indeed it would
be an extremely poor one as it is based on UTF-32!) It is intended only
to solve the problem Mark mentioned that indexing etc of strings is
inefficient when the string is counted and divided according to grapheme
clusters - according to the recommendations for editing in UAX #29. The
mechanism I proposed was intended to allow a string of grapheme clusters
to be indexed efficiently, and nothing else - although as you point out
it might also help with rendering (although not necessarily, as the
same grapheme cluster is not always rendered the same, e.g. in Arabic).
--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/