On 12/12/2003 04:29, Philippe Verdy wrote:
...
But what you suggest here is exactly what a standard file compressor does.
It does not solve any problem in the representation of characters, the
compression scheme remains private, and can only be interpreted as text by
redecomposing these PUAs (in their scope) to the appropriate complex DGCs.
In addition, you need to find a way to store these associations between PUAs
and DGCs, so the complexity is even worse.
You would probably use it only if there are multiple occurrences of these
complex DGCs, just to save some space (this is what is performed in the
Hangul Johab syllables, as they occur very frequently when writing modern
Korean, and the space benefit comes from the fact that there is no need to
encode the associations between syllables and DGCs of jamos, as this is
defined by their canonical equivalences and implemented with a very basic
algorithm).
So unless you can create such a simple algorithm to map complex DGCs to PUA
ranges, there's little use for what you propose here.
This is not intended as a file compression technique. (Indeed it would
be an extremely poor one as it is based on UTF-32!) It is intended only
to solve the problem Mark mentioned that indexing etc of strings is
inefficient when the string is counted and divided according to grapheme
clusters - according to the recommendations for editing in UAX #29. The
mechanism I proposed was intended to allow a string of grapheme clusters
to be indexed efficiently, and nothing else - although as you point out
it might also help with rendering (although not necessarily, as the
same grapheme cluster is not always rendered the same, e.g. in Arabic).
--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/