> But as C10 does not mandate any normalized form (just canonical
> equivalence of the results), I don't think that it requires that a
> compressor should produce its result in either NFC or NFD form.
> Instead I think that it's up to the next process to determine which
> normalization form best fits its need: if the compressor was designed to
> recompose to NFC, and then the next process prefers NFD, the last
> renormalization in the compressor will be superfluous.

...Bear in mind that according to C9 no process can assume that another process will make a distinction between two different, but canonical-equivalent character sequences (including a distinction between normalisation forms), and therefore the next process is not supposed to rely on the (de)compressor to normalise into any particular form.

> So for me, a compressor can choose its own normalization on input and
> apply it before compressing, and the decompressor needs nothing else than
> just decompressing and keeping the string in the form that was accepted
> or forced on input by the compressor and encoded in the compressed stream.
> With this view, normalization of strings should not be done on output from a
> process but on its input.

This is what C9 seems to require. Normalisation on output is not forbidden, of course, but the next process is not supposed to rely on it having been done, and certainly must not fail if it has not been.
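
As a rough illustration of normalising on input rather than on output, here is a minimal Python sketch. It is only a sketch under assumptions: zlib stands in for whatever compression scheme is being discussed, and the function names and the default of NFC are hypothetical choices made for the example, not anything mandated by C9 or C10.

```python
import unicodedata
import zlib


def compress_text(text: str, form: str = "NFC") -> bytes:
    """Normalise once, on input, then compress (zlib is only a stand-in)."""
    normalized = unicodedata.normalize(form, text)
    return zlib.compress(normalized.encode("utf-8"))


def decompress_text(data: bytes) -> str:
    """Decompress only; the string stays in the form the compressor encoded."""
    return zlib.decompress(data).decode("utf-8")


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings up to canonical equivalence via a common form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


original = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
round_tripped = decompress_text(compress_text(original))
# The round trip need not be code-point identical, only canonically equivalent.
print(canonically_equivalent(original, round_tripped))
```

The consumer of `round_tripped` should not depend on it being in NFC; if it needs a particular form, it normalises on its own input.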

> One can save unnecessary normalizations across multiple processes in one
> system, provided that the strings produced on output are reliably marked
> (out of band with some meta-data) to indicate their current supported
> normalization forms (i.e. if the string is already in one or more of the 4
> standardized normalization forms): this just requires 8 bits of information
> on output of any process, in addition to the output string, with 2 bits per
> form to mean: YES, NO, UNKNOWN/UNTESTED (and possibly MAYBE, if the input is
> also MAYBE and the process does not force or check any normalization form
> on input, this test being left for the next process if it needs it).

I'm not sure how well this one agrees with C9. These 8 bits have to be communicated between the processes in question by some protocol separate from the Unicode text. I am not sure if a process is permitted to rely on such information.
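
For concreteness, the 8 bits of out-of-band metadata described above (2 bits for each of the 4 normalization forms) could be laid out as in the following Python sketch. The bit order, the numeric values assigned to YES/NO/UNKNOWN/MAYBE, and all names are assumptions made for illustration; the proposal does not specify them.

```python
import unicodedata
from enum import IntEnum

# Assumed order of the four 2-bit fields, lowest bits first.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


class Status(IntEnum):
    UNKNOWN = 0  # untested
    YES = 1
    NO = 2
    MAYBE = 3


def pack(statuses: dict) -> int:
    """Pack one Status per form into a single byte; missing forms are UNKNOWN."""
    byte = 0
    for i, form in enumerate(FORMS):
        byte |= int(statuses.get(form, Status.UNKNOWN)) << (2 * i)
    return byte


def unpack(byte: int) -> dict:
    """Recover the per-form Status values from the packed byte."""
    return {form: Status((byte >> (2 * i)) & 0b11) for i, form in enumerate(FORMS)}


def tag(text: str) -> int:
    """Test a string against all four forms and pack the results."""
    return pack({
        form: Status.YES if unicodedata.is_normalized(form, text) else Status.NO
        for form in FORMS
    })


flags = unpack(tag("e\u0301"))  # decomposed: expected to be NFD but not NFC
print(flags["NFC"].name, flags["NFD"].name)
```

Whether a receiving process may trust such a byte is exactly the open question here, so a cautious receiver would treat it as a hint and still be prepared to normalise.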

> If this is not indicated (in the output from an external and not directly
> supported process), then a fast-check on input may be used if this saves work.

A fast check on input is of course sensible if the input is expected to be in a particular form, e.g. if that form is recommended by the higher-level protocol in use; but if the fast check fails, the process should not fail but should perform full normalisation.

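
The "check first, normalise only if needed" behaviour can be sketched in Python as follows. The function name `ensure_form` is hypothetical; `unicodedata.is_normalized` is the standard-library check (available since Python 3.8), used here on the assumption that it is cheaper than normalising unconditionally.

```python
import unicodedata


def ensure_form(text: str, form: str = "NFC") -> str:
    """Return text in the requested form, normalising only when the check fails."""
    if unicodedata.is_normalized(form, text):
        return text  # already in the requested form; nothing to do
    # The check failed: do not reject the input, fall back to full normalisation.
    return unicodedata.normalize(form, text)


print(ensure_form("e\u0301") == "\u00e9")
```

The point of the design is that a failed check never turns into an error: the input is accepted either way and simply costs a full normalisation pass.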
-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/

