On Sat, 18 May 2013 02:02:07 +0200 Philippe Verdy <[email protected]> wrote:

> Yes it is expected. And in fact very common in Unicode since long
> (there are in fact many "Mn" marks with combining class 0, this is
> not just for one script).
>
> A combining class 0 DOES NOT mean that the character will not be a
> non-spacing mark (or that it will be spacing), but just that it blocks
> reorderings under standard normalizations and for recognizing
> canonical equivalences.
>
> (see for example CGJ which also has combining class 0 and which is
> used mostly to insert such blocking behavior, without having any real
> semantic meaning by itself ; once the normalization step has been
> done, it can be discarded from the input stream in renderers or
> collators, except for special purposes like rendering CGJ with its
> own visible glyph in some "visible controls" edit mode)

It cannot be discarded when collation is used for sorting.

> Technically CGJ is not "Mn" (not a combining mark) but a "formatting
> control",

Wrong!  Its general category is still Mn.

> but it still participates to the grouping of "default
> grapheme clusters", as if it was a combining mark -- and for most
> parts, it is an artefact of the encoding in the UCS, and considered
> foreign to the script by native writers, but it is also needed for
> compatibility reasons.

As far as I am aware, there are no 'compatibility' reasons for
preserving CGJ.  It was initially convenient to assign CGJ the general
category Mn, and this has been found to be a happy coincidence, for it
can serve a useful role in disrupting various processes.  The earliest
such role was in disrupting contractions in collation, for experience
has shown that it is more natural to treat a potential digraph as such
unless there is a mark to the contrary, rather than to require a marker
to show that it is a digraph.  Secondly, it has been found useful to
preserve the arrangement of combining marks.

> However, in many scripts, there exists true
> combining marks (Mn) that have combining class 0 (i.e. whose relative
> ordering in the encoded stream is semantically significant when they
> are used in conjunction with other reorderable combining marks).

It is usually the case that one order is right and the others are
wrong.  The only situation I can think of is where a base character is
omitted, and even then I have no clear candidates.
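
For anyone who wants to check the properties discussed here, the
following is a minimal sketch using Python's standard unicodedata
module.  The choice of Python, the helper name describe() and the
sample characters are my own assumptions for illustration, not
something taken from the thread, and the values reported depend on the
Unicode version bundled with the interpreter.  It looks up the general
category and canonical combining class of CGJ (U+034F), shows that two
ordinary script marks are also Mn with combining class 0, and shows
that CGJ blocks canonical reordering under NFD.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220


def describe(ch):
    """Return (general category, canonical combining class) for one character.

    Hypothetical helper for this sketch; it only wraps two unicodedata calls.
    """
    return unicodedata.category(ch), unicodedata.combining(ch)


# CGJ is a non-spacing mark (Mn) with combining class 0.
cgj_props = describe(CGJ)

# Two ordinary marks that are also Mn with combining class 0.
thai_props = describe("\u0e31")        # THAI CHARACTER MAI HAN-AKAT
devanagari_props = describe("\u0941")  # DEVANAGARI VOWEL SIGN U

# Without CGJ, NFD sorts the marks by combining class: 220 before 230.
without_cgj = unicodedata.normalize("NFD", "a" + ACUTE + DOT_BELOW)

# With CGJ between them, the class-0 character blocks the reordering.
with_cgj = unicodedata.normalize("NFD", "a" + ACUTE + CGJ + DOT_BELOW)

if __name__ == "__main__":
    print("CGJ:", cgj_props)
    print("MAI HAN-AKAT:", thai_props)
    print("VOWEL SIGN U:", devanagari_props)
    print("without CGJ:", [f"U+{ord(c):04X}" for c in without_cgj])
    print("with CGJ:   ", [f"U+{ord(c):04X}" for c in with_cgj])
```

This only exercises normalization.  The collation behaviour (CGJ
disrupting contractions) needs a UCA implementation, which the Python
standard library does not provide, so it is not shown here.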

