Peter Kirk <peterkirk at qaya dot org> wrote:
OK. So it's Mark, not me, who is unilaterally extending C10. Well, Ken said much the same, so it's bilateral; and I agree it is a sensible extension.
Surely ignoring Composition Exclusions is not unilaterally extending
C10. The excluded precomposed characters are still canonically
equivalent to the decomposed (and normalised) forms. And so composing
a text with them, for compression or any other purpose, still conforms
to C10, which explicitly allows "replacement of character sequences by
their canonical-equivalent sequences" - not only when the resulting
sequence is NFC or NFD.
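
To make the point above concrete, here is a small sketch using Python's standard unicodedata module. It takes U+0958 DEVANAGARI LETTER QA, which is on the composition exclusion list, as the example; the helper name canonically_equivalent is my own and not anything from the standard. It only illustrates that the excluded character and its decomposition share the same NFD form, while NFC never produces the excluded character.

```python
import unicodedata

# U+0958 DEVANAGARI LETTER QA is a composition-excluded character.
precomposed = "\u0958"
# Its canonical decomposition: U+0915 KA followed by U+093C NUKTA.
decomposed = "\u0915\u093C"


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: two strings are canonically equivalent
    when their NFD forms are codepoint-identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The excluded character is still canonically equivalent to its decomposition.
equivalent = canonically_equivalent(precomposed, decomposed)

# NFC does not recompose the pair, so the precomposed form is not NFC.
nfc_of_precomposed = unicodedata.normalize("NFC", precomposed)
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)
```

So a text containing U+0958 is a legitimate canonical equivalent of the original in the sense of C10, but it is in neither NFC nor NFD.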
Ignoring the composition exclusions does still respect canonical equivalence, but does not preserve a canonical normalization form (using the language of UAX #15). So although it is not a violation of C10, it does seem to run afoul of Mark's recommendation:
"In practice, if a compressor does not produce codepoint-identical text, it should produce NFC (not just any canonically equivalent text), and should document that it does so."
But, as Ken also pointed out, it is quite permissible to use any encoding for the intermediate (e.g. compressed) form of the text, as long as it is possible to recover from it the normalised form of the original text. My suggestion of composing the text using the excluded precomposed characters meets this test, in a way not met by some of the other suggestions, e.g. composing Korean characters into precomposed forms which are (sadly) not canonically equivalent.
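
A rough sketch of that round trip, again in Python with the standard unicodedata module. The names EXCLUDED_PAIRS, pack and unpack are invented for illustration, and the table holds only two Devanagari entries rather than the full exclusion list; it is not meant as a real compressor, just as a check that the normalised form can be recovered from the intermediate form.

```python
import unicodedata

# Hypothetical, deliberately tiny table: decomposed pair -> excluded
# precomposed character. A real tool would cover the whole exclusion list.
EXCLUDED_PAIRS = {
    "\u0915\u093C": "\u0958",  # KA + NUKTA   -> QA
    "\u0916\u093C": "\u0959",  # KHA + NUKTA  -> KHHA
}


def pack(text: str) -> str:
    """Intermediate form: decompose fully, then substitute the excluded
    precomposed characters. Each substitution is a canonical equivalent."""
    intermediate = unicodedata.normalize("NFD", text)
    for pair, single in EXCLUDED_PAIRS.items():
        intermediate = intermediate.replace(pair, single)
    return intermediate


def unpack(packed: str) -> str:
    """Recover the normalised (NFC) form from the intermediate form."""
    return unicodedata.normalize("NFC", packed)


original = "\u0915\u093C\u093E"   # KA + NUKTA + vowel sign AA
packed = pack(original)           # shorter, but not NFC
restored = unpack(packed)         # NFC of the original text
```

Because every substitution in pack replaces a sequence by a canonical equivalent, normalising the intermediate form gives the same result as normalising the original.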
-- 
Peter Kirk
http://www.qaya.org/

