Back in the topic 'Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?', I mentioned that normalisation can be necessary even to collate FCD text correctly, and gave two examples:
- Danish (still at CLDR Version 22.1): <U+0061 LATIN SMALL LETTER A, U+00E5 LATIN SMALL LETTER A WITH RING ABOVE>, for which there is an ICU bug report: http://bugs.icu-project.org/trac/ticket/9319
- Default collation: <U+0F71 TIBETAN VOWEL SIGN AA, U+0F73 TIBETAN VOWEL SIGN II>

I remarked that the UCA (Technical Report 10) and LDML (Technical Report 35) specifications, taken together, make sense only if there is no such problem. Before raising a specific Unicode bug, I think it would be worth exploring the options.

The concept of FCD is defined in Unicode Technical Note #5 (UTN #5), Canonical Equivalence in Applications, http://www.unicode.org/notes/tn5/ , along with the concept of canonical closure. Unicode has not admitted to endorsing it, so I suspect that I can't raise a bug report against it!

The process of Unicode collation, in its full form, proceeds through at least the following steps:

1) Normalise the text string to NFD.

2) Split a fully decomposed canonically equivalent string (a rearrangement of the normalised string) into sequences of 'collating elements'*. As the non-zero values of the canonical combining classes are to a degree arbitrary, this rearrangement attempts to undo the artificiality of the canonical order to better accord with the language of the text. Compared to the original normalised string, these collating elements may interleave.

3) Look up the sequence of ordered n-tuples of numbers, known as 'collation elements'*, for each collating element.

4) Adjust the sequences of n-tuples to reduce the untoward effects of symbols, spaces and punctuation.

5) Convert the n-tuples to a simple sequence of numbers - the 'sort key' - that may be used for comparison.

*The term 'collating element' is taken from ISO 14651, 'collation element' from the UCA. I am not at all sure how one distinguishes them in French!

The ordering is largely defined by the mapping from collating elements to collation elements. Step 1 is a complete waste of time for most text in many languages, and therefore there is great interest in omitting it. Step 2 is easy to get wrong, especially if the text has not in fact been normalised. The primary problem with UTN #5 is that it fails to address the issue of decomposing the normalised string into collating elements, which is how the two examples above fail. Markus Scherer has identified the problem as being that in some collations, characters need to be split between collating elements.

There are several tweaks and options that could be chosen. The FCD check uses, for each character x, the canonical combining class (ccc) of the leading character in its NFD decomposition, lcc(x), and the canonical combining class of the trailing character, tcc(x). The FCD check verifies, for each pair of adjacent characters x and y, that one of the following three conditions holds:

(1) tcc(x) = 0; or
(2) lcc(y) = 0; or
(3) tcc(x) <= lcc(y).

The Tibetan example fails because the components of <U+0F73 TIBETAN VOWEL SIGN II> have different canonical combining classes. <U+0F71, U+0F73> decomposes to <U+0F71, U+0F71, U+0F72>, which is then split into the collating elements <U+0F71, U+0F72> and <U+0F71>. To stop this case being FCD, we could replace condition (3) by

(3') tcc(x) <= lcc(y) and lcc(y) = tcc(y).

Markus Scherer has suggested simply prohibiting composed characters whose complete decompositions lack a character with ccc = 0. I think the difference amounts to one character, U+0344 COMBINING GREEK DIALYTIKA TONOS, which decomposes to two characters with the same canonical combining class.
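To make the check concrete, here is a rough sketch in Python; the names lcc, tcc and is_fcd are mine, and this merely illustrates the check as described above, not ICU's implementation (Python's unicodedata.combining() returns the ccc):

    import unicodedata

    def lcc(ch):
        # ccc of the first character of ch's full canonical decomposition
        return unicodedata.combining(unicodedata.normalize('NFD', ch)[0])

    def tcc(ch):
        # ccc of the last character of ch's full canonical decomposition
        return unicodedata.combining(unicodedata.normalize('NFD', ch)[-1])

    def is_fcd(text, strict=False):
        # strict=False applies condition (3); strict=True applies (3')
        for x, y in zip(text, text[1:]):
            if tcc(x) == 0 or lcc(y) == 0:      # conditions (1) and (2)
                continue
            if tcc(x) <= lcc(y) and (not strict or lcc(y) == tcc(y)):
                continue
            return False
        return True

    tibetan = '\u0F71\u0F73'
    print(is_fcd(tibetan))               # True: passes the current check
    print(is_fcd(tibetan, strict=True))  # False: rejected under (3')

    # Characters Markus Scherer's prohibition would affect: composed
    # characters whose complete decompositions lack a ccc = 0 character.
    suspects = [c for c in map(chr, range(0x110000))
                if not 0xD800 <= ord(c) <= 0xDFFF  # skip surrogates
                and len(unicodedata.normalize('NFD', c)) > 1
                and all(unicodedata.combining(d)
                        for d in unicodedata.normalize('NFD', c))]
    print(['U+%04X' % ord(c) for c in suspects])

If I have this right, the scan should report just U+0344 and the Tibetan vowel signs U+0F73, U+0F75 and U+0F81; of these, only U+0344 has lcc = tcc, which is why I said the difference amounts to one character.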
Note that the 'Tibetan example' comes from the default collation; the most relevant language appears to be Sanskrit!

The next tweak would be to canonical closure. I should first comment on a little-known potential issue with the generated collation element table. Consider a hypothetical language whose collation differed from the default by having a 'LETTER Y WITH DOUBLE ACUTE' as a letter of the alphabet. There is no composed character for this. Suppose further that it used a dot below as an ordinary accent. Now, when collating <U+1EF5 LATIN SMALL LETTER Y WITH DOT BELOW, U+030B COMBINING DOUBLE ACUTE ACCENT>, the NFD form would be <U+0079 LATIN SMALL LETTER Y, U+0323 COMBINING DOT BELOW, U+030B>, and the relevant possible collating elements would be:

<U+0079>
<U+0079, U+030B> ! a letter in this language
<U+030B>
<U+0323>

The correct decomposition into collating elements would then be:

<U+0079, U+030B>, <U+0323>

When we form the 'canonical closure' as described in UTN #5, we also generate an entry in the canonically closed collation table for U+1EF5. When we try to use the FCD trick for the collation of <U+1EF5, U+030B> (note that this string passes the FCD check, since tcc(U+1EF5) = 220 <= lcc(U+030B) = 230), we are liable to decompose it into the collating elements <U+1EF5>, <U+030B>, which would have the same collation elements as <U+0079>, <U+0323>, <U+030B>, instead of using the collating elements for <U+0079, U+030B>. The tweak is for canonical closure to generate a collating element for <U+1EF5, U+030B> (and similarly for all the other precomposed characters containing 'y'). More subtly, the creation of a collating element for <U+1EF5> should *not* lead to the creation of a collating element for <U+0079, U+0323>, for that would lead to errors in the detection of interleaving collating elements! The handling of the Danish case would also be included in the adjustment of canonical closure. (A demonstration of this failure and of the tweak is sketched below.)

Does anyone feel up to rigorously justifying revisions to the concepts and algorithms of FCD and canonical closure? Occasionally one will encounter cases where the canonical closure is infinite - in these cases, normalisation will be necessary regardless of the outcome of the FCD check. Perhaps one could merely revise the definition of FCD, and devise a test for the adequacy of the current canonical closure. If a collation fails this adequacy test, then again disabling normalisation should be prohibited. (I would suggest that in these cases the normalisation setting should be overridden with only the gentlest of chidings.) A lazy option would be to wait (how long?) and then remove the option of no normalisation on the grounds that sufficient computing power is available.
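To demonstrate the failure and the tweak, here is a toy sketch in Python; the contraction table and the function are mine, and the discontiguous matching is deliberately simplified - a sketch of the decomposition into collating elements, not of any actual implementation:

    import unicodedata

    # Toy tailoring for the hypothetical language: collating elements are
    # keyed by strings of code points; <y, double acute> is a letter.
    TAILORED = {'y', 'y\u030B', '\u030B', '\u0323'}

    # Naive canonical closure as in UTN #5: add the precomposed character,
    # to be given the collation elements of its NFD form <y, dot below>.
    CLOSED = TAILORED | {'\u1EF5'}

    def split_collating_elements(text, table):
        # Longest contiguous match first, then a simplified discontiguous
        # extension: a later non-starter may join the match only if its
        # ccc exceeds that of every character skipped over (unblocked).
        out, taken = [], [False] * len(text)
        i = 0
        while i < len(text):
            if taken[i]:
                i += 1
                continue
            m_end = i + 1            # single character if nothing matches
            for j in range(len(text), i, -1):
                if text[i:j] in table:
                    m_end = j
                    break
            m = text[i:m_end]
            skipped_ccc = 0
            for k in range(m_end, len(text)):
                if taken[k]:
                    continue
                ccc = unicodedata.combining(text[k])
                if ccc == 0:
                    break            # a starter ends the extension
                if ccc > skipped_ccc and (m + text[k]) in table:
                    m += text[k]     # consumed discontiguously
                    taken[k] = True
                else:
                    skipped_ccc = max(skipped_ccc, ccc)
            out.append(m)
            i = m_end
        return out

    def show(elements):
        # display each collating element as its code points
        return [' '.join('U+%04X' % ord(c) for c in e) for e in elements]

    nfd = unicodedata.normalize('NFD', '\u1EF5\u030B')  # <y, 0323, 030B>
    print(show(split_collating_elements(nfd, TAILORED)))
    # ['U+0079 U+030B', 'U+0323'] - correct: the letter, then dot below
    print(show(split_collating_elements('\u1EF5\u030B', CLOSED)))
    # ['U+1EF5', 'U+030B'] - wrong: the weights of <y>, <0323>, <030B>
    print(show(split_collating_elements('\u1EF5\u030B',
                                        CLOSED | {'\u1EF5\u030B'})))
    # ['U+1EF5 U+030B'] - the tweaked closure restores the correct weights

The tweaked entry <U+1EF5, U+030B> would of course be given the collation elements of <U+0079, U+030B> followed by those of <U+0323>, so the FCD shortcut and full normalisation would once more agree.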
Thoughts, anyone?

Richard.
