On Sat, 18 May 2013 02:02:07 +0200
Philippe Verdy <[email protected]> wrote:

> Yes it is expected. And in fact very common in Unicode since long
> (there are in fact many "Mn" marks with combining class 0, this is
> not just for one script).
> 
> A combining class 0 DOES NOT mean that the character will not be a
> non-spacing mark (or that it will be spacing), but just that it blocks
> reorderings under standard normalizations and for recognizing
> canonical equivalences.
> 
> (see for example CGJ which also has combining class 0 and which is
> used mostly to insert such blocking behavior, without having any real
> semantic meaning by itself ; once the normalization step has been
> done, it can be discarded from the input stream in renderers or
> collators, except for special purposes like rendering CGJ with its
> own visible glyph in some "visible controls" edit mode)

It cannot be discarded before collation when the collation is used for
sorting, because it may be there to block a contraction.

> Technically CGJ is not "Mn" (not a combining mark) but a "formatting
> control",

Wrong!  Its general category is still Mn.
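
The property values are easy to check.  A minimal sketch, assuming
Python's standard unicodedata module (whose data follows whichever
Unicode version the interpreter ships with); the variable names are
mine, for illustration only:

```python
import sys
import unicodedata

CGJ = "\u034f"  # U+034F COMBINING GRAPHEME JOINER

# General category and canonical combining class of CGJ.
cgj_category = unicodedata.category(CGJ)
cgj_ccc = unicodedata.combining(CGJ)
print(unicodedata.name(CGJ), cgj_category, cgj_ccc)

# All code points that are non-spacing marks (Mn) yet have combining
# class 0.  The exact count depends on the Unicode version in use.
mn_ccc0 = [
    cp
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Mn"
    and unicodedata.combining(chr(cp)) == 0
]
print(len(mn_ccc0), "Mn characters have combining class 0")
```

CGJ itself turns up in that list, alongside marks from many scripts.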

> but it still participates to the grouping of "default
> grapheme clusters", as if it was a combining mark -- and for most
> parts, it is an artefact of the encoding in the UCS, and considered
> foreign to the script by native writers, but it is also needed for
> compatibility reasons.

As far as I am aware, there are no 'compatibility' reasons for
preserving CGJ.  It was initially convenient to assign CGJ the general
category Mn, and this has been found to be a happy coincidence, for it
can serve a useful role in disrupting various processes.  The earliest
such role was in disrupting contractions in collation, for experience
has shown that it is more natural to treat a potential digraph as such
unless there is a mark to the contrary, rather than to require a
marker to show that it is a digraph. 

Secondly, it has been found useful for preserving the arrangement of
combining marks, which canonical reordering would otherwise change.
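
The blocking effect on canonical reordering can be sketched as
follows, again assuming Python's standard unicodedata module.  The
choice of acute (combining class 230) and dot below (combining class
220) is just a convenient illustration, not a case anyone has asked
for:

```python
import unicodedata

CGJ = "\u034f"        # combining class 0
ACUTE = "\u0301"      # combining class 230
DOT_BELOW = "\u0323"  # combining class 220

# Without CGJ, the two marks are not in canonical order.
plain = "a" + ACUTE + DOT_BELOW
# With CGJ between them, each mark sits in its own reorderable run.
guarded = "a" + ACUTE + CGJ + DOT_BELOW

nfd_plain = unicodedata.normalize("NFD", plain)
nfd_guarded = unicodedata.normalize("NFD", guarded)

# Normalization swaps the marks in the unguarded sequence...
print([f"U+{ord(c):04X}" for c in nfd_plain])
# ...but leaves the guarded sequence exactly as typed.
print([f"U+{ord(c):04X}" for c in nfd_guarded])
```

NFD is used rather than NFC so that composition with the base letter
does not obscure the ordering of the marks.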

> However, in many scripts, there exists true
> combining marks (Mn) that have combining class 0 (i.e. whose relative
> ordering in the encoded stream is semantically significant when they
> are used in conjunction with other reorderable combining marks).

It is usually the case that one order is right and the others are
wrong.  The only situation I can think of where more than one order
might be legitimate is one where a base character is omitted, and even
then I have no clear candidates.
