Re: Compression through normalization

Peter Kirk Fri, 05 Dec 2003 04:31:11 -0800

On 05/12/2003 00:34, Doug Ewell wrote:

Peter Kirk <peterkirk at qaya dot org> wrote:

Surely ignoring Composition Exclusions is not unilaterally extending C10. The excluded precomposed characters are still canonically equivalent to the decomposed (and normalised) forms. And so composing a text with them, for compression or any other purpose, still conforms to C10, which explicitly allows "replacement of character sequences by their canonical-equivalent sequences" - not only when the resulting sequence is NFC or NFD.

Ignoring the composition exclusions does still respect canonical equivalence, but does not preserve a canonical normalization form (using the language of UAX #15).  So although it is not a violation of C10, it does seem to run afoul of Mark's recommendation:

"In practice, if a compressor does not produce codepoint-identical text, it should produce NFC (not just any canonically equivalent text), and should document that it does so."

OK. So it's Mark, not me, who is unilaterally extending C10. Well, Ken said much the same, so it's bilateral; and I agree it is a sensible extension.

But, as Ken also pointed out, it is quite permissible to use any encoding for the intermediate (e.g. compressed) form of the text, as long as it is possible to recover from this the normalised form of the original text. My suggestion of composing the text with the precomposed characters listed in Composition Exclusions meets this test, in a way that some of the other suggestions do not, e.g. composing Korean characters into precomposed forms which are (sadly) not canonically equivalent.
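
The recoverability test can be checked mechanically. The following is only a minimal sketch using Python's standard unicodedata module; the helper name same_after_nfc and the two sample pairs are assumptions chosen for illustration, not examples taken from this thread. U+FB1D (HEBREW LETTER YOD WITH HIRIQ) stands in for a character listed in Composition Exclusions, and the jamo pair U+1100 U+1100 versus U+1101 stands in for a Korean composition that is not a canonical equivalence.

```python
import unicodedata


def same_after_nfc(a: str, b: str) -> bool:
    """Hypothetical helper: True if a and b normalise to the same NFC string."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# U+FB1D is excluded from composition, so NFC never produces it,
# but it is still canonically equivalent to YOD + POINT HIRIQ.
excluded = "\uFB1D"
decomposed = "\u05D9\u05B4"

# Normalising the "compressed" form recovers the normalised original.
recovered = unicodedata.normalize("NFC", excluded)
hebrew_ok = same_after_nfc(excluded, decomposed)

# Two CHOSEONG KIYEOK versus one CHOSEONG SSANGKIYEOK: these look alike
# but are not canonically equivalent, so normalisation cannot undo the swap.
korean_ok = same_after_nfc("\u1100\u1100", "\u1101")

print(recovered == decomposed, hebrew_ok, korean_ok)
```

In the first case a decompressor only has to normalise its output to get back the NFC form of the original; in the second case it would need extra information to know which spelling the original used.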

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
