Re: Compression through normalization

Peter Kirk Sat, 06 Dec 2003 12:58:33 -0800

On 06/12/2003 09:49, Doug Ewell wrote:

...

But as C10 does not mandate any normalized form (just canonical equivalence of the results), I don't think that it requires that a compressor should produce its result in either NFC or NFD form.

Right. I know that. But Mark and Ken said it should, and so I'm trying to find out if this *SHOULD* be in C10, or alternatively how I would expect to find out about this extra recommendation if I weren't on the Unicode mailing list.

My feeling is that it should NOT be in C10. The conformance rules do not mention normalisation, which is a convenience and a recommendation but not a conformance issue - that is, apart from C14-C16 which do not specify that normalisation should be done, only how it must be done if it is done. I would also argue that adding such a rule to C10 would conflict with C9 or at least with the principle underlying it, that for conformance purposes all canonically equivalent forms are equal and indistinguishable.

Note that this is a devil's advocate argument.  I'm not necessarily
disagreeing with Mark and Ken's recommendations.  I'm just trying to
reconcile the differences between what they say (NFC or NFD only) and
what C10 says (any canonically equivalent sequence).

I would agree with these recommendations as long as they remain recommendations rather than conformance requirements, and are perhaps added to section 5.6 of TUS (4.0), which already includes the sentence (in the context of canonically equivalent alternative spellings):

Implementations that are “liberal” in what they accept, but “conservative” in what they issue, will have the fewest compatibility problems.

... With this view, normalization of strings should not be done on output from a process but on its input.

Yes. Correct. To paraphrase Peter Kirk, according to C9, if my Unicode text is going through Process A (which outputs NFD) and on to Process B (which wants NFC input), A does not have to convert the text to NFC just to appease B. Instead, it is up to B to do the work of converting to NFC.

A more accurate paraphrase would be that process A may output any canonically equivalent form, perhaps one which is not a normalisation form at all, e.g. because it contains precomposed characters that are composition exclusions.
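
To make this concrete, here is a minimal sketch in Python using the standard unicodedata module. It takes U+0958 DEVANAGARI LETTER QA, which is a composition exclusion, and shows that the precomposed character is canonically equivalent to its NFC and NFD forms without being identical to either. The helper name canonically_equivalent is my own, for illustration only.

```python
import unicodedata

# U+0958 DEVANAGARI LETTER QA has a canonical decomposition (U+0915 U+093C)
# but is a composition exclusion, so NFC never produces it.
composed = "\u0958"
nfc = unicodedata.normalize("NFC", composed)
nfd = unicodedata.normalize("NFD", composed)


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: two strings are canonically equivalent
    when their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The precomposed string is in neither normalisation form ...
print(composed == nfc, composed == nfd)
# ... yet it is canonically equivalent to both.
print(canonically_equivalent(composed, nfc))
```

A process that emits the precomposed character is therefore producing text that is neither NFC nor NFD, yet is still canonically equivalent to the normalised text.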

So, in a compressor/decompressor environment, the compressor is the one
that has to do any normalization work.  The decompressor just
decompresses.  This is consistent with what I wrote a few days ago about
wanting this normalized-compression bit to work with existing
decompressors.

For "any normalization work" I would prefer to say "any canonically equivalent transformation work". The compressor performs whatever canonically equivalent transformations it chooses, perhaps because they produce the best compression. The decompressor is only obliged to decompress; C9 implies that it is not obliged to normalise and that no other process can rely on it doing so - although it is free to do so and may choose to do so if that is expected to improve overall efficiency. Mark and Ken's recommendation is indeed that it should do so, and I have no quarrel with that.
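
As a rough illustration of that division of labour, the following Python sketch lets a compressor pick whichever canonically equivalent form compresses best, while the decompressor only decompresses. zlib over UTF-8 is merely a stand-in for a real Unicode compression scheme, and the function names compress, decompress and canonically_equivalent are hypothetical.

```python
import unicodedata
import zlib


def compress(text: str) -> bytes:
    """Try the input as given, plus its NFC and NFD forms, and keep
    whichever yields the smallest compressed output. All three candidates
    are canonically equivalent to the input."""
    candidates = [
        text,
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFD", text),
    ]
    return min((zlib.compress(c.encode("utf-8")) for c in candidates), key=len)


def decompress(blob: bytes) -> str:
    """Only decompress; no normalisation is performed or promised."""
    return zlib.decompress(blob).decode("utf-8")


def canonically_equivalent(a: str, b: str) -> bool:
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


original = "re\u0301sume\u0301 \u00e9t\u00e9 " * 20  # mixed composed/decomposed
restored = decompress(compress(original))
print(canonically_equivalent(original, restored))
```

The round trip is not guaranteed to be code-point-for-code-point identical, only canonically equivalent, and any downstream process that needs a particular form has to establish it for itself.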

...

Subprocesses within a closed system may be able to make certain
assumptions for efficiency.  Process B, for example, may know that its
only source of input is Process A, which is guaranteed always to produce
NFC. ...

Does C9 actually allow this? Well, perhaps within a closed system, but then standardisation, and therefore Unicode, is irrelevant to data transfer between sub-processes within a closed system. Outside a closed system, Process B's best assumption for efficiency may be that Process A has *probably* normalised, and so it is worth performing a quick check first, before a full normalisation.
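
A minimal sketch of that "check first, normalise only if needed" approach on Process B's input side, assuming Python 3.8 or later (where unicodedata.is_normalized is available). The function name accept_input is hypothetical.

```python
import unicodedata


def accept_input(text: str) -> str:
    """Process B's input stage: assume the text is probably already NFC,
    confirm that with a check, and fall back to full normalisation
    only when the check fails."""
    if unicodedata.is_normalized("NFC", text):
        return text  # already NFC, nothing to do
    return unicodedata.normalize("NFC", text)


print(accept_input("e\u0301") == "\u00e9")  # decomposed input gets composed
print(accept_input("\u00e9") == "\u00e9")   # NFC input passes through unchanged
```

This keeps B liberal in what it accepts while paying the cost of full normalisation only when the input turns out not to be in the expected form.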

... For any other situation, the algorithm described in Annex 8 of UAX
#15 should be employed (not re-invented).


--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
