...Philippe, you have some interesting ideas here and in your previous posting.
The bad thing is that there's no way to say that a superfluous
CGJ character can be "safely" removed if CC(char1) <= CC(char2),
so that removal preserves the semantics of the encoded text even
though the filtered text would not be canonically equivalent.
I wonder if it would be possible to define a character with combining class zero which is automatically removed during normalisation when it is superfluous, in the sense that you define here. Of course that means a change to the normalisation algorithm, but one which does not cause backward compatibility issues.
I guess what is more likely to be acceptable, as it doesn't require but only suggests a change to the algorithm, is a character which can optionally be removed, when superfluous, as a matter of canonical or compatibility equivalence. If we call this character CCO, we can define that a sequence <c1, CCO, c2> is canonically or compatibly equivalent to <c1, c2> if cc(c1) <= cc(c2), or if either cc(c1) or cc(c2) is 0. I am deliberately not using CGJ here, as this behaviour might destabilise the normalisation of existing text which uses CGJ; there would be no stability impact if this is a new character.
The advantage of doing this is that a text could be generated with lots of CCOs which could then be removed automatically if they are superfluous.
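Just to make the removal rule concrete, here is a rough sketch in Python of such a filter. CCO is of course hypothetical, so a private-use code point stands in for it, and the standard library's unicodedata.combining supplies the canonical combining classes:

```python
import unicodedata

# Hypothetical CCO separator: no such character is assigned, so a
# private-use code point stands in purely for illustration.
CCO = "\uE000"

def cc(ch):
    """Canonical combining class of a character (0 for starters)."""
    return unicodedata.combining(ch)

def strip_superfluous_cco(text):
    """Drop each CCO whose neighbours satisfy the proposed rule:
    <c1, CCO, c2> is equivalent to <c1, c2> if cc(c1) <= cc(c2),
    or if either cc(c1) or cc(c2) is 0."""
    out = []
    for i, ch in enumerate(text):
        if ch == CCO and 0 < i < len(text) - 1:
            c1, c2 = text[i - 1], text[i + 1]
            if cc(c1) <= cc(c2) or cc(c1) == 0 or cc(c2) == 0:
                continue  # superfluous: remove it
        out.append(ch)
    return "".join(out)

# cc('a') is 0, so this CCO is superfluous and is removed:
print(strip_superfluous_cco("a" + CCO + "b"))
# U+0301 (cc 230) before U+0316 (cc 220): the CCO is doing real
# work, so it is kept:
print(strip_superfluous_cco("e\u0301" + CCO + "\u0316") ==
      "e\u0301" + CCO + "\u0316")
```

This only looks at the immediate neighbours in the input, which is enough to show the rule; a production filter would presumably also have to consider runs of adjacent CCOs.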
I am half feeling that there must be some objections to this, but it's too late at night here to put my finger on them, so I will send this out and see what response it generates.
--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

