Thanks for your comments.

We aren't worried about the transitions to and from space for supplementary
characters, since as a fraction of all text they will be exceedingly rare
(under 0.01%, by our estimate).

As to Korean, always resetting at a space might save some storage, but:
(a) I don't see an obvious way to modify the algorithm and still preserve
binary order, which is, after all, the whole point.
(b) The algorithm already takes only one "long jump" for a single space,
which is the typical case. Always adding a Resync byte would probably make
storage worse, though one would have to run the numbers again to be sure.
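
To illustrate point (a), here is a minimal sketch in Python. It is a toy
model, not the actual byte layout from the draft: it assumes each character
is coded as a signed difference from the previous code point, and that
comparing those differences stands in for comparing the encoded bytes. The
names plain_delta and shifted_delta are made up for this example;
shifted_delta follows the offset table in the quoted message below (0
reserved for space, -1 for Resync).

```python
SPACE = 0x20

def plain_delta(prev, cp):
    """Toy difference coder: signed distance from the previous code point."""
    return cp - prev

def shifted_delta(prev, cp):
    """Hypothetical remapping: 0 means space, -1 is reserved for Resync,
    and every real difference is pushed one step away from zero."""
    if cp == SPACE:
        return 0
    d = cp - prev
    return d + 1 if d >= 0 else d - 1

# After 'a' (U+0061), compare the codes for space (U+0020) and '!' (U+0021).
prev = ord("a")
for name, encode in (("plain", plain_delta), ("shifted", shifted_delta)):
    space_code = encode(prev, SPACE)
    bang_code = encode(prev, ord("!"))
    # In code point order, space sorts before '!'.
    print(name, space_code, bang_code, "order kept:", space_code < bang_code)
```

Under these assumptions the plain differences keep space before '!', while
the remapped code for space (0) lands above the code for '!', so a
byte-wise compare would no longer agree with code point order.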

Mark

----- Original Message -----
From: "Carl W. Brown" <[EMAIL PROTECTED]>
To: "Unicode" <[EMAIL PROTECTED]>
Sent: Friday, June 01, 2001 09:26
Subject: RE: Compression - binary ordered


> Mark,
>
> This sounds like a great idea.  I was wondering, however, if spaces in
> non-plane 0 character sets will cause problems with the compression
> efficiency.  Maybe you should consider a special case for spaces.
>
> Maybe you could use something like offsetting the displacement values to
> accommodate special markings.
>
> Encoded Offset    Actual Offset
> +2                +1
> +1                0
> 0                 Space character
> -1                Restart next character from offset 0 (Resync)
> -2                -1
> -3                -2
>
> If nothing else it should give better Korean compression.
>
> The Resync could be used prior to a null character as a string terminator.
> Nulls not preceded by a resync are not termination nulls.
>
> This scheme would require a slight modification to comparison routines.
> However, you should still be able to compare without full decoding.  Resync
> will cause a problem with compares.  The space insertion will only require
> minor adjustments.
>
> Carl
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Mark Davis
> Sent: Thursday, May 31, 2001 11:27 PM
> To: Unicode
> Cc: Unicore
> Subject: Compression - binary ordered
>
>
> As a by-product of our recent work on collation, we developed a method of
> Unicode compression that is similar to SCSU, in that small alphabets are
> about a byte per character and large alphabets are about two bytes per
> character.
>
> The main difference from SCSU is that this method preserves binary order.
> As this is a hot topic right now, I thought it might be of interest. The
> latest draft description is on
> http://oss.software.ibm.com/icu/develop/bocu.htm.
> Comments are welcome.
>
> Mark
> —————
>
> πάντων µέτρον ἄνθρωπος — Πρωταγόρας
>
> [http://www.macchiato.com]
>
>
>
>

