Doug Ewell wrote:
> OK, then I suppose I should play devil's advocate and ask Peter's and
> Philippe's question again: If C10 only restricts the modifications to
> "canonically equivalent sequences," why should there be an additional
> restriction that further limits them to NFC or NFD? Or, put another
> way, shouldn't such a restriction be part of C10, if it is important?
First, C10 restricts modifications only in order to preserve all the semantics of the encoded text in any context. There are situations where this restriction does not apply: when performing text transformations such as folding, or even substringing, which may or may not respect canonical equivalence. Case folding applied to substrings does not, because concatenating the foldings of the substrings does not always return a result canonically equivalent to the folding of the whole string, even if canonical cluster boundaries are preserved when substringing.

Compression of an existing text is not viewed as a text transformation, so the intent of C10 should be observed, but only if the compressor claims that _it preserves_ canonical equivalence. But as C10 does not mandate any normalization form (just canonical equivalence of the results), I don't think that it requires a compressor to produce its result in either NFC or NFD form.

Instead, I think that it's up to the next process to determine which normalization form best fits its needs: if the compressor was designed to recompose to NFC, and the next process prefers NFD, the last renormalization in the compressor will be superfluous. So for me, a compressor can choose its own normalization on input and apply it before compressing, and the decompressor needs to do nothing more than decompress and keep the string in the form that was accepted or forced on input by the compressor and encoded in the compressed stream.

With this view, normalization of strings should not be done on output from a process but on its input. One can save unnecessary normalizations across multiple processes in one system, provided that the strings produced on output are reliably marked (out of band, with some metadata) to indicate their current normalization status, i.e. whether the string is already in one or more of the 4 standardized normalization forms. This requires just 8 bits of information on output of any process, in addition to the output string, with 2 bits per form to mean YES, NO, or UNKNOWN/UNTESTED (and possibly MAYBE, if the input is also MAYBE and the process does not force or check any normalization form on input, this test being left to the next process if it needs it). If this is not indicated (in the output from an external and not directly supported process), then a fast check on input may be used if this saves work.
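To make the 8-bit marking above more concrete, here is a minimal Python sketch of one way it could work. Everything named here (Status, FORMS, pack, get_status, with_status, ensure) is invented for the illustration, as are the bit layout and the choice of 0 for UNKNOWN; none of it comes from the standard. It assumes Python 3.8 or later for unicodedata.is_normalized.

```python
import unicodedata
from enum import IntEnum

# Hypothetical layout: 2 bits per form, NFC in the lowest two bits.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


class Status(IntEnum):
    UNKNOWN = 0   # untested; also the default for unmarked strings
    YES = 1
    NO = 2
    MAYBE = 3


def pack(statuses):
    """Pack a {form: Status} mapping into one byte; missing forms are UNKNOWN."""
    flags = 0
    for i, form in enumerate(FORMS):
        flags |= int(statuses.get(form, Status.UNKNOWN)) << (2 * i)
    return flags


def get_status(flags, form):
    """Read the 2-bit status of one form out of the flags byte."""
    return Status((flags >> (2 * FORMS.index(form))) & 0b11)


def with_status(flags, form, status):
    """Return a copy of flags with the status of one form replaced."""
    shift = 2 * FORMS.index(form)
    return (flags & ~(0b11 << shift) & 0xFF) | (int(status) << shift)


def ensure(text, flags, form):
    """Normalize on input, and only when the marking does not rule it out.

    Returns (text, flags) with text guaranteed to be in `form`.
    """
    status = get_status(flags, form)
    if status is Status.YES:
        return text, flags                    # trusted marking, no work at all
    if status is not Status.NO and unicodedata.is_normalized(form, text):
        return text, with_status(flags, form, Status.YES)
    # The text has to be rewritten, so what was known about the other
    # forms no longer holds; reset them to UNKNOWN.
    return unicodedata.normalize(form, text), pack({form: Status.YES})


if __name__ == "__main__":
    s, f = ensure("e\u0301", pack({}), "NFC")   # unmarked, decomposed input
    print(repr(s), format(f, "08b"))
    s, f = ensure(s, f, "NFC")                  # second process: nothing to do
    print(repr(s), format(f, "08b"))
```

A process that needs NFD would call ensure(text, flags, "NFD") on its own input in the same way, so no process pays for a normalization form that nobody downstream asked for.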

