Mark,

This sounds like a great idea.  I was wondering, however, whether spaces in non-plane 0 
character sets will cause problems with compression efficiency.  Maybe you should 
consider a special case for spaces.  

Maybe you could shift the encoded displacement values to make room for special 
markers, for example:

Encoded Offset                  Actual Offset
+2                                      +1
+1                                      0
0                                       Space character
-1                                      Restart next character from offset 0 (Resync)
-2                                      -1
-3                                      -2
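
Read as code, the table amounts to the mapping below.  This is only a sketch in 
Python: the names SPACE, RESYNC, encode_offset and decode_offset are made up for 
illustration and are not taken from the BOCU draft.

```python
# Hypothetical markers for the two special cases in the table.
SPACE = "space"
RESYNC = "resync"

def encode_offset(actual):
    """Map an actual offset (or a special marker) to its encoded value."""
    if actual == SPACE:
        return 0
    if actual == RESYNC:
        return -1
    # Non-negative offsets shift up by one, negative offsets shift down by one,
    # which frees the encoded values 0 and -1 for the markers.
    return actual + 1 if actual >= 0 else actual - 1

def decode_offset(encoded):
    """Inverse of encode_offset."""
    if encoded == 0:
        return SPACE
    if encoded == -1:
        return RESYNC
    return encoded - 1 if encoded > 0 else encoded + 1

if __name__ == "__main__":
    for enc in (2, 1, 0, -1, -2, -3):
        print(enc, "->", decode_offset(enc))
```

The cost is that every ordinary offset is one step further from zero, so an offset 
at the very edge of a byte-length range may need one more byte.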

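If nothing else, it should give better Korean compression, since Korean separates 
words with spaces and each space would otherwise force a large jump out of the Hangul 
range and back.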

A Resync followed by a null character could serve as a string terminator.  Nulls not 
preceded by a Resync would not be terminating nulls.
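
To illustrate the terminator idea, here is a toy encoder and decoder in Python.  It 
assumes a deliberately simplified model in which each offset is just the difference 
from the previous non-space code point and the base starts at 0; the actual BOCU 
draft picks its base differently, and the function names are hypothetical.

```python
def encode_text(text):
    """Toy model: list of encoded offsets, ending with Resync + null."""
    out = []
    prev = 0                      # assumed starting base
    for ch in text:
        if ch == " ":
            out.append(0)         # space marker; the base is left untouched
            continue
        d = ord(ch) - prev
        out.append(d + 1 if d >= 0 else d - 1)
        prev = ord(ch)
    out.append(-1)                # Resync: next character is relative to 0
    out.append(1)                 # null, i.e. actual offset 0 from base 0
    return out

def decode_text(values):
    """Inverse of encode_text; stops at a null that follows a Resync."""
    chars = []
    prev = 0
    after_resync = False
    for v in values:
        if v == 0:
            chars.append(" ")
            after_resync = False
            continue
        if v == -1:
            prev = 0
            after_resync = True
            continue
        d = v - 1 if v > 0 else v + 1
        cp = prev + d
        if cp == 0 and after_resync:
            break                 # terminating null
        chars.append(chr(cp))     # an ordinary null lands here as data
        prev = cp
        after_resync = False
    return "".join(chars)

if __name__ == "__main__":
    print(encode_text("ab c"))    # [98, 2, 0, 2, -1, 1]
    print(decode_text(encode_text("a\x00b")) == "a\x00b")
```

In this model an embedded null is an ordinary negative offset back to 0, so it is 
never mistaken for the terminator.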

This scheme would require a slight modification to comparison routines.  However, you 
should still be able to compare without full decoding.  Resync will cause a problem 
for comparisons, whereas the space marker should need only minor adjustments.

Carl

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Mark Davis
Sent: Thursday, May 31, 2001 11:27 PM
To: Unicode
Cc: Unicore
Subject: Compression - binary ordered


As a by-product of our recent work on collation, we developed a method of
Unicode compression that is similar to SCSU, in that small alphabets are
about a byte per character and large alphabets are about two bytes per
character.

The main difference from SCSU is that this method preserves binary order. As
this is a hot topic right now, I thought it might be of interest. The latest
draft description is on http://oss.software.ibm.com/icu/develop/bocu.htm.
Comments are welcome.

Mark
—————

πάντων μέτρον ἄνθρωπος — Πρωταγόρας

[http://www.macchiato.com]



