Quoting Philippe Verdy <[EMAIL PROTECTED]>:

> [EMAIL PROTECTED] wrote:
> > Further, a Unicode-aware algorithm would expect a choseong character
> > to be followed by a jungseong and a jongseong to follow a jungseong,
> > and could realise essentially the same compression benefit that
> > normalising to NFC provides, but without making an irreversible
> > change (i.e. it could tokenise the jamo sequences rather than
> > normalising and then tokenising).
>
> Isn't it equivalent to what bzip2 does, but without knowledge of
> Unicode composition rules, simply by discovering that jamos are
> structured within their syllables, and creating, on the fly, code
> positions to represent their composition?

I imagine so.
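To make the quoted idea concrete, here is a minimal sketch in Python. It is
purely illustrative: the names (tokenize, detokenize, TOKEN_BASE) are mine,
not from any existing compressor, and it assumes the compressor's token
alphabet can hold values beyond the Unicode code space. Each
<choseong, jungseong[, jongseong]> run becomes one synthetic token via the
standard Hangul syllable arithmetic, but, unlike NFC, the step is exactly
invertible because the tokens are disjoint from real code points:

    S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
    L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
    TOKEN_BASE = 0x110000  # first integer beyond the Unicode code space

    def tokenize(text):
        """One synthetic token per <L, V[, T]> jamo cluster; everything
        else passes through as a bare code point, so the map is injective."""
        out, i = [], 0
        while i < len(text):
            l = ord(text[i]) - L_BASE
            v = ord(text[i + 1]) - V_BASE if i + 1 < len(text) else -1
            if 0 <= l < L_COUNT and 0 <= v < V_COUNT:
                t = ord(text[i + 2]) - T_BASE if i + 2 < len(text) else 0
                if not 0 < t < T_COUNT:
                    t = 0
                out.append(TOKEN_BASE + (l * V_COUNT + v) * T_COUNT + t)
                i += 3 if t else 2
            else:
                out.append(ord(text[i]))
                i += 1
        return out

    def detokenize(tokens):
        """Exact inverse of tokenize(): the original jamo spelling,
        composed or decomposed, comes back unchanged."""
        chars = []
        for tok in tokens:
            if tok < TOKEN_BASE:
                chars.append(chr(tok))
                continue
            s = tok - TOKEN_BASE
            l, v = s // (V_COUNT * T_COUNT), (s // T_COUNT) % V_COUNT
            t = s % T_COUNT
            chars.append(chr(L_BASE + l))
            chars.append(chr(V_BASE + v))
            if t:
                chars.append(chr(T_BASE + t))
        return ''.join(chars)

Because a precomposed syllable in the input stays below TOKEN_BASE while a
decomposed cluster maps above it, detokenize(tokenize(s)) == s always holds,
so checksums of the round-tripped text are unchanged, which is exactly the
property that normalise-then-tokenise gives up.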
> A 2% difference can be explained by the fact that bzip2 must still
> discover the new "clusters" by encoding them first in their decomposed
> form before using codes to represent the composed forms for the rest
> of the text.

Yes. Do we care about that 2%? Can we improve upon it?

> > > Whether a "silent" normalization to NFC can be a legitimate part
> > > of Unicode compression remains in question. I notice the list is
> > > still split as to whether this process "changes" the text (because
> > > checksums will differ) or not (because C10 says processes must
> > > consider the text to be equivalent).
>
> And what about a compressor that would identify the source as being
> Unicode, and would convert it first to NFC, but including composed
> forms for the compositions normally excluded from NFC? This seems
> marginal, but some languages would have better compression results
> when taking these canonically equivalent compositions into account,
> such as pointed Hebrew and Arabic.

Agreed. If we are to rely upon the equivalence of sequences, then there
is no need to exclude such compositions.
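Something along these lines, perhaps. This is only a sketch, assuming
Python's unicodedata module is an acceptable stand-in for the compressor's
own tables; the function names are mine, and the greedy replace ignores
canonical-ordering subtleties a real normaliser would have to handle. It
derives the composition-excluded pairs (the Hebrew presentation forms such
as U+FB2A SHIN WITH SHIN DOT fall in this set) rather than hard-coding
them, then applies them on top of ordinary NFC:

    import sys
    import unicodedata

    def excluded_compositions():
        """Map each two-character canonical decomposition that NFC refuses
        to recompose back onto its precomposed character."""
        table = {}
        for cp in range(sys.maxunicode + 1):
            ch = chr(cp)
            decomp = unicodedata.decomposition(ch)
            if not decomp or decomp.startswith('<'):
                continue  # none, or a compatibility mapping ('<final>' etc.)
            src = ''.join(chr(int(h, 16)) for h in decomp.split())
            if len(src) < 2:
                continue  # skip singletons such as U+212B ANGSTROM SIGN
            if unicodedata.normalize('NFC', src) != ch:
                table[src] = ch  # composition-excluded, e.g. U+FB2A
        return table

    def over_compose(text, table):
        """Ordinary NFC first, then substitute the excluded compositions."""
        text = unicodedata.normalize('NFC', text)
        for src, dst in table.items():
            text = text.replace(src, dst)
        return text

Unlike the jamo tokenisation above, this rewriting is not reversible: the
decompressor can only restore a canonically equivalent string, not the
original code point sequence, so it belongs squarely in the "silent
normalization" camp that the checksum objection applies to.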

