...To get an idea of what orders of magnitude we are talking about here:
And what about a compressor that would identify the source as Unicode and convert it first to NFC, but including the composed forms normally excluded from NFC? This may seem marginal, but some languages, such as pointed Hebrew and Arabic, would compress better when these canonically equivalent compositions are taken into account.
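Concretely, here is a quick illustration (a Python sketch of my own, using only the standard unicodedata module; not part of any Unicode specification): NFC itself will never produce these composed forms, precisely because they are in the composition exclusion table, even though they are canonically equivalent to the decomposed pairs.

    import unicodedata

    # U+05D0 HEBREW LETTER ALEF followed by U+05BC HEBREW POINT DAGESH OR MAPIQ.
    # The composite U+FB30 (HEBREW LETTER ALEF WITH MAPIQ) canonically decomposes
    # to this pair, but U+FB30 is a composition exclusion, so NFC leaves the
    # pair uncomposed.
    pair = "\u05D0\u05BC"
    print(unicodedata.normalize("NFC", pair) == pair)      # True: NFC does not compose
    print(unicodedata.normalize("NFD", "\uFB30") == pair)  # True: canonically equivalent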
The Hebrew Bible consists of about 2,881,000 Unicode characters including accents, or 2,632,000 excluding accents; these figures include spaces. Of these, about 172,000 are U+05BC dagesh or mapiq, 46,000 are U+05C1 shin dot and 12,000 are U+05C2 sin dot. All of these, or very nearly all, can be canonically composed with the preceding base characters: the dagesh and mapiq combinations into U+FB30..U+FB4A, and the shin and sin dot combinations into U+FB2A..U+FB2D, saving about 230,000 characters. A significant number of further combinations could be composed into U+FB2E (alef with patah), U+FB2F (alef with qamats) and U+FB4B (vav with holam). So the Hebrew text could be compressed by something approaching 10% simply by composing it with characters already defined. This composed version is canonically equivalent to the uncompressed one, but it is not normalised, because these composed characters are listed in the composition exclusion table.
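To show that such a composer is only a few lines of code, here is a minimal sketch, again assuming Python's standard unicodedata module (the function name compose_excluded and the table-building approach are my own illustration, not an established API). It derives the pair-to-composite table from the canonical decompositions of the Hebrew presentation forms U+FB1D..U+FB4E, then lets each base letter absorb any following mark that is not canonically blocked:

    import unicodedata

    # Map each decomposable base+mark pair to its excluded composite, derived
    # from the canonical (not compatibility) decompositions of U+FB1D..U+FB4E.
    compose = {}
    for cp in range(0xFB1D, 0xFB4F):
        decomp = unicodedata.decomposition(chr(cp))
        if decomp and not decomp.startswith("<"):          # canonical only
            parts = "".join(chr(int(h, 16)) for h in decomp.split())
            if len(parts) == 2:
                compose[parts] = chr(cp)

    def compose_excluded(text):
        """Compose base+mark pairs into their excluded composites. A mark may
        join its base unless it is blocked, i.e. unless an intervening kept
        mark has a combining class greater than or equal to its own."""
        chars = unicodedata.normalize("NFC", text)  # guarantees canonical order
        out = []
        i = 0
        while i < len(chars):
            base = chars[i]
            i += 1
            if unicodedata.combining(base) != 0:
                out.append(base)                    # stray mark with no base
                continue
            kept, last_ccc = [], 0
            while i < len(chars) and unicodedata.combining(chars[i]) != 0:
                mark = chars[i]
                ccc = unicodedata.combining(mark)
                if ccc > last_ccc and base + mark in compose:
                    base = compose[base + mark]     # absorb an unblocked mark
                else:
                    kept.append(mark)               # this mark stays decomposed
                    last_ccc = ccc
                i += 1
            out.append(base)
            out.extend(kept)
        return "".join(out)

On the first word of Genesis, for example, the sketch composes the dagesh into the bet (giving U+FB31) and the shin dot into the shin (giving U+FB2A), saving two characters while preserving canonical equivalence:

    word = "\u05D1\u05B0\u05BC\u05E8\u05B5\u05D0\u05E9\u05B4\u05C1\u05D9\u05EA"
    packed = compose_excluded(word)
    print(len(word), len(packed))                   # 11 9
    print(unicodedata.normalize("NFD", word) ==
          unicodedata.normalize("NFD", packed))     # True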
--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

