RE: Compression through normalization

Philippe Verdy Sat, 06 Dec 2003 05:47:53 -0800

Doug Ewell
> OK, then I suppose I should play devil's advocate and ask Peter's and
> Philippe's question again:  If C10 only restricts the modifications to
> "canonically equivalent sequences," why should there be an additional
> restriction that further limits them to NFC or NFD?  Or, put another
> way, shouldn't such a restriction be part of C10, if it is important?


First, C10 restricts modifications only in order to preserve the semantics of
the encoded text in any context. There are situations where this restriction
does not apply: text transformations such as folding, or even substringing,
may or may not respect canonical equivalence. Case folding applied to
substrings does not, because concatenating the folded substrings does not
always return canonically equivalent results, even if canonical cluster
boundaries are preserved when substringing.

Compression of an existing text is not viewed as being a text
transformation, so the intent of C10 should be observed, but only if the
compressor claims that _it preserves_ canonical equivalence.

But as C10 does not mandate any normalization form (just canonical
equivalence of the results), I don't think it requires a compressor to
produce its result in either NFC or NFD form.
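
To illustrate the distinction, here is a minimal Python sketch showing a
string that is in neither NFC nor NFD, yet is canonically equivalent to both.
The helper name canonically_equivalent is my own, and comparing NFD forms is
just one common way to test equivalence:

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    # Two strings are canonically equivalent when their NFD forms are identical.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Precomposed e-acute followed by "e" + COMBINING ACUTE ACCENT:
# a mixed string that is in neither NFC nor NFD.
mixed = "\u00e9e\u0301"
nfc = unicodedata.normalize("NFC", mixed)
nfd = unicodedata.normalize("NFD", mixed)

print(mixed == nfc, mixed == nfd)          # neither normal form matches
print(canonically_equivalent(mixed, nfc))  # but all three are equivalent
print(canonically_equivalent(mixed, nfd))
```

A process that emitted the mixed string would still preserve canonical
equivalence, which is all that C10 asks for.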

Instead, I think it's up to the next process to determine which
normalization form best fits its needs: if the compressor was designed to
recompose to NFC, and the next process then prefers NFD, the last
renormalization in the compressor will be superfluous.

So for me, a compressor can choose its own normalization on input and apply
it before compressing, and the decompressor needs to do nothing more than
decompress and keep the string in the form that was accepted or forced
on input by the compressor and encoded in the compressed stream.
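
A rough Python sketch of that division of labour follows. The function names
compress and decompress, the choice of zlib over UTF-8 bytes, and NFC as the
default form are all assumptions made for illustration, not a description of
any particular compressor:

```python
import unicodedata
import zlib

def compress(text: str, form: str = "NFC") -> bytes:
    # The compressor picks its own normalization form and applies it on input.
    normalized = unicodedata.normalize(form, text)
    return zlib.compress(normalized.encode("utf-8"))

def decompress(blob: bytes) -> str:
    # The decompressor does no renormalization: it returns the string in
    # whatever form the compressor encoded into the stream.
    return zlib.decompress(blob).decode("utf-8")

original = "e\u0301cole"            # "e" + COMBINING ACUTE ACCENT + "cole"
restored = decompress(compress(original))
```

Here restored is not code-point identical to original, but it is canonically
equivalent to it, and it comes out in the form the compressor chose.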

With this view, normalization of strings should not be done on output from a
process but on its input.

One can save unnecessary normalizations across multiple processes in one
system, provided that the strings produced on output are reliably marked
(out of band, with some meta-data) to indicate which normalization forms
they currently satisfy (i.e. whether the string is already in one or more of
the 4 standardized normalization forms). This requires just 8 bits of
information on output of any process, in addition to the output string, with
2 bits per form to mean: YES, NO, UNKNOWN/UNTESTED (and possibly MAYBE, if
the input is also MAYBE and the process neither forces nor checks any
normalization form on input, this test being left for the next process if it
needs it).
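
As a sketch of how such a status byte could be laid out in Python: the
numeric values of the four states, the order of the forms, and the names
Status, pack and unpack are my own assumptions, since nothing above fixes a
particular bit layout:

```python
from enum import IntEnum

class Status(IntEnum):
    # Hypothetical 2-bit encoding of the four states.
    UNKNOWN = 0
    NO = 1
    YES = 2
    MAYBE = 3

# Hypothetical field order: bits 0-1 NFC, 2-3 NFD, 4-5 NFKC, 6-7 NFKD.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def pack(statuses: dict) -> int:
    # Forms that are not mentioned default to UNKNOWN.
    byte = 0
    for i, form in enumerate(FORMS):
        byte |= int(statuses.get(form, Status.UNKNOWN)) << (2 * i)
    return byte

def unpack(byte: int) -> dict:
    # Extract each 2-bit field back into a Status value.
    return {form: Status((byte >> (2 * i)) & 0b11) for i, form in enumerate(FORMS)}

flags = pack({"NFC": Status.YES, "NFD": Status.NO})
```

The next process would read the byte alongside the string and normalize only
when the field for the form it needs is not YES.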

If this is not indicated (in the output from an external and not directly
supported process), then a fast-check on input may be used if this saves
work.
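
For the unmarked case, a minimal Python sketch of checking on input before
normalizing. It relies on unicodedata.is_normalized, available from Python
3.8; the wrapper name ensure_form is hypothetical:

```python
import unicodedata

def ensure_form(text: str, form: str = "NFC") -> str:
    # Check first; normalize only when the input is not already in the
    # requested form, so already-normalized input is returned unchanged.
    if unicodedata.is_normalized(form, text):
        return text
    return unicodedata.normalize(form, text)

already = ensure_form("abc")        # ASCII text is already in NFC
fixed = ensure_form("e\u0301")      # decomposed input gets recomposed
```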

