Re: Compression through normalization

Peter Kirk Sat, 06 Dec 2003 10:31:30 -0800

On 06/12/2003 03:48, Philippe Verdy wrote:

...
But as C10 does not mandate any normalized form (just canonical equivalence
of the results), I don't think that it requires that a compressor should
produce its result in either NFC or NFD form.
Instead I think that it's up to the next process to determine which normalization form best fits its need: if the compressor was designed to recompose to NFC, and then the next process prefers NFD, the last renormalization in the compressor will be superfluous.

Bear in mind that according to C9

no process can assume that another process will make a distinction between two different, but canonical-equivalent character sequences

(including a distinction between normalisation forms) and therefore the next process is not supposed to rely on the (de)compressor to normalise into any particular form.

So for me, a compressor can choose its own normalization on input and apply
it before compressing, and the decompressor needs nothing else than just
decompressing and keeping the string in the form that was accepted or forced
on input by the compressor and encoded in the compressed stream.

With this view, normalization of strings should not be done on output from a process but on its input.

This is what C9 seems to require. Normalisation on output is not forbidden, of course, but the next process is not supposed to rely on it having been done, certainly not to fail if it has not been.
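
A minimal Python sketch of the "normalize on input" idea discussed above. The function names, the choice of NFC, and the use of zlib as a stand-in compressor are assumptions made for illustration; nothing here comes from an actual Unicode compression scheme.

```python
import unicodedata
import zlib


def compress(text: str, form: str = "NFC") -> bytes:
    """Normalize on input, then compress (form and zlib are illustrative choices)."""
    normalized = unicodedata.normalize(form, text)
    return zlib.compress(normalized.encode("utf-8"))


def decompress(data: bytes) -> str:
    """Just decompress; no renormalization is done on output."""
    return zlib.decompress(data).decode("utf-8")


composed = "\u00e9"      # e-acute as one precomposed code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Canonically equivalent inputs give the same compressed stream.
same_stream = compress(composed) == compress(decomposed)

# The round trip is only guaranteed to be canonically equivalent to the input.
round_trip = decompress(compress(decomposed))
equivalent = (unicodedata.normalize("NFD", round_trip)
              == unicodedata.normalize("NFD", decomposed))
```

A consumer that wants NFD simply normalizes what it receives; it does not depend on which form the compressor happened to choose.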

One can save unnecessary normalizations across multiple processes in one system, provided that the strings produced on output are reliably marked (out of band, with some meta-data) to indicate their current supported normalization forms (i.e. whether the string is already in one or more of the 4 standardized normalization forms). This requires just 8 bits of information on output of any process, in addition to the output string, with 2 bits per form to mean: YES, NO, UNKNOWN/UNTESTED (and possibly MAYBE, if the input is also MAYBE and the process does not force or check any normalization form on input, this test being left for the next process if it needs it).

I'm not sure how well this one agrees with C9. These 8 bits have to be communicated between the processes in question by some protocol separate from the Unicode text. I am not sure if a process is permitted to rely on such information.
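
To make the proposal concrete, here is one possible layout of those 8 bits in Python. The bit order, the numeric values of the four states, and the helper names are all hypothetical; the proposal above does not define an encoding. `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Hypothetical 2-bit state values; the original proposal assigns no numbers.
UNKNOWN, YES, NO, MAYBE = 0, 1, 2, 3


def pack(status: dict) -> int:
    """Pack one 2-bit state per form into a byte; missing forms are UNKNOWN."""
    byte = 0
    for i, form in enumerate(FORMS):
        byte |= status.get(form, UNKNOWN) << (2 * i)
    return byte


def unpack(byte: int) -> dict:
    """Recover the per-form states from the packed byte."""
    return {form: (byte >> (2 * i)) & 0b11 for i, form in enumerate(FORMS)}


def describe(text: str) -> int:
    """Test all four forms and return the packed metadata byte."""
    return pack({form: YES if unicodedata.is_normalized(form, text) else NO
                 for form in FORMS})


flags = unpack(describe("e\u0301"))  # decomposed e-acute: NFD and NFKD only
```

In the spirit of C9, a receiver would treat such a byte purely as an optimization hint and would still normalize when the hint is absent or UNKNOWN.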

If this is not indicated (in the output from an external and not directly
supported process), then a fast-check on input may be used if this saves
work.

A fast check on input is of course sensible if the input is expected to be in a particular form e.g. if it is recommended by the higher level protocol in use; but if the fast check fails the process should not fail but should perform full normalisation.
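
The "check first, normalize on failure" behaviour can be sketched in Python as follows. The function name `ensure_form` and the NFC default are assumptions for illustration; `unicodedata.is_normalized` (Python 3.8+) plays the role of the fast check here.

```python
import unicodedata


def ensure_form(text: str, form: str = "NFC") -> str:
    """Return text in the requested form, normalizing only when needed."""
    if unicodedata.is_normalized(form, text):
        # Check passed: the input is already in the expected form.
        return text
    # Check failed: do not reject the input, perform full normalization.
    return unicodedata.normalize(form, text)


already_nfc = ensure_form("abc")
fixed = ensure_form("e\u0301")  # decomposed input is recomposed for NFC
```

The process never fails on input that is not in the expected form; it only pays the cost of full normalization in that case.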

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
