RE: Compression through normalization

D. Starner Wed, 26 Nov 2003 15:06:24 -0800

> Use Base64 - it is stable through all normalisation forms.

The problem with Base64 (and, worse yet, PUA characters for bytes) is that
it's inefficient. Base64 offers 6 bits per 8 (75%) on UTF-8, 6 bits per 16 (37%)
on UTF-16. You can get 15 bits per 16 (93%) on UTF-16 and 15 bits per 24 (62%)
on UTF-8 with the following scheme, and the only characters that normalization
changes are the Hangul syllables, whose decomposition is at least algorithmic.
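
The Hangul point can be checked with Python's standard unicodedata module.
This is only a sketch: it tests one sample code point from each of the three
ranges used by the scheme (the choice of samples is mine), not every
character in them.

```python
import unicodedata

# One sample code point from each range the scheme draws on.
samples = {
    "CJK Extension A": "\u3400",
    "CJK Unified Ideographs": "\u4E00",
    "Hangul Syllables": "\uAC00",
}

def survives(form, ch):
    """Return True if ch is unchanged by the given normalization form."""
    return unicodedata.normalize(form, ch) == ch

for name, ch in samples.items():
    print(name, {f: survives(f, ch) for f in ("NFC", "NFD", "NFKC", "NFKD")})

# A Hangul syllable decomposes into jamo under NFD, and NFC puts it back.
decomposed = unicodedata.normalize("NFD", "\uAC00")
assert decomposed == "\u1100\u1161"
assert unicodedata.normalize("NFC", decomposed) == "\uAC00"
```

So text that has passed through NFD or NFKD would need to be renormalized to
NFC before the Hangul part of the data can be read back.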


You could avoid normalization entirely and increase compression in the UTF-8
case (at a cost in the SCSU case) by using low characters that don't decompose
or compose, but then you have to carry long lists of usable characters.
(The numbers get tricky, so I haven't run them.)
You could also avoid the normalization cost by using Plane 2 in place of
Hangul, but any characters on Plane 2 would be larger.
(Assuming your binary data is uniformly distributed, the numbers are
easy, except for SCSU. I think astral windows in SCSU have little
effect when used on data like this.)

       Base64 / CJK15 / CJ15
UTF-8  75%    / 62%   / 59%
UTF-16 37%    / 93%   / 78%
SCSU   75%    / 93%   / 78%(?)
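
The UTF-8 and UTF-16 rows for CJK15 and CJ15 can be recomputed by brute force
over all 32768 chunk values, assuming every 15-bit value is equally likely.
This is a sketch; the function names are made up for illustration, and SCSU
is left out because its size depends on encoder state.

```python
def to_codepoint(a, plane2=False):
    """Map a 15-bit chunk to a code point (CJK15, or CJ15 if plane2)."""
    if a < 0x1800:
        return a + 0x3400
    if a < 0x6800:
        return a + 0x3600
    return a + (0x19800 if plane2 else 0x4400)

def utf8_bits(cp):
    """Bits UTF-8 spends on one code point."""
    if cp < 0x80:
        return 8
    if cp < 0x800:
        return 16
    if cp < 0x10000:
        return 24
    return 32

def utf16_bits(cp):
    """Bits UTF-16 spends on one code point (surrogate pair above the BMP)."""
    return 16 if cp < 0x10000 else 32

def efficiency(bits_fn, plane2):
    """Payload bits divided by encoded bits, averaged over all chunks."""
    total = sum(bits_fn(to_codepoint(a, plane2)) for a in range(1 << 15))
    return 15 * (1 << 15) / total

for label, plane2 in (("CJK15", False), ("CJ15", True)):
    print(label,
          "UTF-8 %.1f%%" % (100 * efficiency(utf8_bits, plane2)),
          "UTF-16 %.1f%%" % (100 * efficiency(utf16_bits, plane2)))
```

The results agree with the table to within a percentage point.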

CJK15:
Break the byte stream into 15-bit chunks. Let a be a 15-bit chunk and U
be the resulting Unicode character. Then
if a < 1800h then U = a + 3400h
else if a < 6800h then U = a - 1800h + 4E00h /* a + 3600h */
else U = a - 6800h + AC00h /* a + 4400h */

CJ15:
Same as CJK15, but replace the last else with
else U = a - 6800h + 20000h /* a + 19800h */
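
Both mappings can be sketched in Python as below. The post does not say how
to handle a byte stream whose length in bits is not a multiple of 15, so this
sketch assumes the final chunk is padded with zero bits and that the decoder
is told the original byte length; those choices, and all the function names,
are mine.

```python
def chunk_to_cp(a, plane2=False):
    """Map a 15-bit chunk to a code point (CJK15, or CJ15 if plane2)."""
    if a < 0x1800:
        return a + 0x3400                        # CJK Extension A
    if a < 0x6800:
        return a + 0x3600                        # CJK Unified Ideographs
    return a + (0x19800 if plane2 else 0x4400)   # Plane 2 or Hangul

def cp_to_chunk(u):
    """Inverse of chunk_to_cp; accepts code points from either scheme."""
    if 0x3400 <= u < 0x4C00:
        return u - 0x3400
    if 0x4E00 <= u < 0x9E00:
        return u - 0x3600
    if 0xAC00 <= u < 0xC400:
        return u - 0x4400
    if 0x20000 <= u < 0x21800:
        return u - 0x19800
    raise ValueError("code point U+%04X is outside the scheme" % u)

def encode(data, plane2=False):
    """Encode bytes as text, 15 bits per character (assumed zero padding)."""
    nbits = len(data) * 8
    pad = -nbits % 15                  # zero bits appended to the last chunk
    bits = int.from_bytes(data, "big") << pad
    nchunks = (nbits + pad) // 15
    return "".join(
        chr(chunk_to_cp((bits >> (15 * i)) & 0x7FFF, plane2))
        for i in reversed(range(nchunks))
    )

def decode(text, length):
    """Decode text back to `length` bytes; expects precomposed (NFC) text."""
    bits = 0
    for ch in text:
        bits = (bits << 15) | cp_to_chunk(ord(ch))
    pad = len(text) * 15 - length * 8  # drop the padding added by encode
    return (bits >> pad).to_bytes(length, "big")

data = bytes(range(256))
for plane2 in (False, True):
    assert decode(encode(data, plane2), len(data)) == data
```

With CJK15, text that may have been through NFD should be renormalized to NFC
before calling decode, since the decoder only recognizes precomposed Hangul
syllables.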