Quoting Philippe Verdy <[EMAIL PROTECTED]>:

> [EMAIL PROTECTED] wrote:
> > Further, a Unicode-aware algorithm would expect a choseong character to 
> > be followed by a jungseong and a jongseong to follow a jungseong, and 
> > could deliver essentially the same compression benefit that normalising 
> > to NFC provides, but without making an irreversible change (i.e. it 
> > could tokenise the Jamo sequences rather than normalising and then 
> > tokenising).
> 
> Isn't it equivalent to what bzip2 does, but without knowledge of Unicode 
> composition rules: simply discovering that jamos are structured within 
> their syllables and creating, on the fly, code positions to represent 
> their composition?

I imagine so.
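
For what it's worth, the tokenisation I had in mind is cheap to sketch. 
Here is a rough illustration in Python (a sketch only): the code point 
ranges are the modern conjoining jamo blocks, and emit_token is a 
hypothetical hook into whatever symbol model the compressor uses.

    # Sketch: tokenise conjoining jamo runs reversibly, without
    # rewriting the underlying text (i.e. no NFC step).
    def is_choseong(c):  return 0x1100 <= ord(c) <= 0x1112
    def is_jungseong(c): return 0x1161 <= ord(c) <= 0x1175
    def is_jongseong(c): return 0x11A8 <= ord(c) <= 0x11C2

    def tokenise(text, emit_token):
        i = 0
        while i < len(text):
            c = text[i]
            if is_choseong(c) and i + 1 < len(text) and is_jungseong(text[i + 1]):
                j = i + 2
                if j < len(text) and is_jongseong(text[j]):
                    j += 1
                emit_token(text[i:j])   # one token per L V (T) run
                i = j
            else:
                emit_token(c)           # everything else passes through
                i += 1

    # e.g. tokenise(text, tokens.append) to collect the tokens in a list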

> A 2% difference can be explained by the fact that bzip2 must still 
> discover the new "clusters" by encoding them first in their decomposed 
> form before using codes to represent the composed forms for the rest of 
> the text.

Yes. Do we care about that 2%? Can we improve upon it?
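
Possibly: the clusters need not be discovered at all, because Hangul 
composition is closed-form arithmetic (section 3.12 of the standard). A 
compressor seeded with that mapping skips bzip2's learning phase 
entirely. A sketch of both directions, continuing the Python above:

    # Standard Hangul composition arithmetic (Unicode section 3.12):
    # the mapping between an L V (T) jamo run and its precomposed
    # syllable is closed-form in both directions.
    S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
    V_COUNT, T_COUNT = 21, 28

    def jamo_to_syllable(run):
        # run is an L V or L V T string, e.g. from the tokeniser above
        l = ord(run[0]) - L_BASE
        v = ord(run[1]) - V_BASE
        t = ord(run[2]) - T_BASE if len(run) == 3 else 0
        return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)

    def syllable_to_jamo(s):
        # exact inverse, so the original jamo sequence is recoverable
        i = ord(s) - S_BASE
        l, v, t = i // (V_COUNT * T_COUNT), (i // T_COUNT) % V_COUNT, i % T_COUNT
        lv = chr(L_BASE + l) + chr(V_BASE + v)
        return lv + chr(T_BASE + t) if t else lv

    # A real codec would flag tokens produced this way, so text that
    # already contained precomposed syllables is not confused with
    # tokenised jamo runs (keeping the transform lossless).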

> > > Whether a "silent" normalization to NFC can be a legitimate part of
> > > Unicode compression remains in question.  I notice the list is still
> > > split as to whether this process "changes" the text (because checksums
> > > will differ) or not (because C10 says processes must consider the text
> > > to be equivalent).
> 
> And what about a compressor that identifies the source as Unicode and 
> converts it first to NFC, but also includes composed forms for the 
> compositions normally excluded from NFC? This seems marginal, but some 
> languages, such as pointed Hebrew and Arabic, would get better 
> compression when these canonically equivalent compositions are taken 
> into account.

Agreed: if we are to rely upon the equivalence of sequences, then there is 
no need to exclude such compositions.
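
The excluded compositions can even be recovered mechanically, since their 
canonical decompositions are still in the character database; one simply 
builds the composition table without honouring the exclusion list. A 
rough sketch in Python (illustrative only; a real codec would have to 
derive or transmit the same table on the decompressor side to keep the 
mapping invertible):

    import sys, unicodedata

    # Rebuild the canonical composition table straight from the UCD,
    # deliberately ignoring the NFC composition exclusions.
    def full_composition_table():
        table = {}
        for cp in range(sys.maxunicode + 1):
            d = unicodedata.decomposition(chr(cp))
            if d and not d.startswith('<'):     # canonical only, not compat
                parts = d.split()
                if len(parts) == 2:             # pairwise compositions
                    table[tuple(int(p, 16) for p in parts)] = cp
        return table

    FULL = full_composition_table()
    # U+FB2A HEBREW LETTER SHIN WITH SHIN DOT is composition-excluded,
    # so NFC never produces it, but the table above still reaches it:
    assert FULL[(0x05E9, 0x05C1)] == 0xFB2A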
