Hi Richard,

I was looking again at your example where U+0344 causes bad results in collation of FCD strings. See inline below.

On Tue, Feb 12, 2013 at 12:19 PM, Richard Wordingham <[email protected]> wrote:

> On Mon, 11 Feb 2013 17:13:58 -0800
> Markus Scherer <[email protected]> wrote:
>
> > I would not revise FCD itself. For a number of processes, it is
> > sufficient as is. For collation it's not.
> >
> > About the Tibetan precomposed vowels:
> >
> > For the LDML spec, I submitted a CLDR ticket this morning:
> > http://unicode.org/cldr/trac/ticket/5667
>
> If we want to proceed along the current lines, then all we need is
> 'CFCD' (Collation FCD), which differs from FCD by excluding characters
> that decompose to two or more characters of which none have canonical
> combining class zero. The motivation for the sterner exclusion is
> provided by adding the following contrived collating elements to
> a default collation:
>
> <U+03B1 GREEK SMALL LETTER ALPHA, U+0308 COMBINING DIAERESIS>
> <U+0301 COMBINING ACUTE ACCENT, U+0345 COMBINING GREEK YPOGEGRAMMENI>
>
> Proper canonical closure then requires contractions for:
> a) <U+03B1, U+0344 COMBINING GREEK DIALYTIKA TONOS> - this sequence is
> canonically equivalent to <U+03B1, U+0308, U+0301>,
> b) <U+03B1, U+0344, U+0345>, and
> c) <U+0344, U+0345>
>

This "proper canonical closure" assumes adding contractions for overlaps between existing contractions and decomposition mappings.

Canonical closure will then also add the decompositions of b) and c):
d) <03B1, 0308, 0301, 0345>
e) <0308, 0301, 0345>

> Now consider the sequence <U+03B1, U+0359 COMBINING ASTERISK BELOW,
> U+0344, U+0345>. Using the extended set of contractions, this
> splits into the discontiguous collating elements <U+03B1, U+0344,
> U+0345> and <U+0359>.
>
> However, using the original contractions along with normalisation, we
> obtain the collating elements <U+03B1, U+0308>, <U+0359>, <U+0301,
> U+0345>, which in general will sort differently.
>

This is true when "using the original contractions", but I would argue that the goal of canonical closure is that *with the canonically-closed mappings* we get the same result for FCD input text (minus the Tibetan composite vowels) as for NFD input text -- but such an implementation will get different results for NFD input than an implementation without overlap closure.

In your example: With the canonical closure adding contraction d) we obtain the collating elements <03B1, 0308, 0301, 0345>, <0359>, which will collate the same as the FCD version.

I think we should remove U+0344 from the FCD exclusions <http://unicode.org/repos/cldr/trunk/specs/ldml/tr35-collation.html#Collation_Settings> where I added it a few weeks ago. Instead, we should document that an implementation (like ICU currently) which does not add the overlap contractions will get some different FCD/NFD results, and an implementation that does add the overlaps will get some different results for NFD than an implementation that doesn't add the overlaps.

markus
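
The two outcomes discussed in this thread can be reproduced with a small Python sketch. It is only a toy model: the function name match_contractions and the two tables ORIGINAL and CLOSED are made up for illustration, the input is always normalized to NFD first, and the matching loop is a simplified reading of the UCA rule for discontiguous contractions (a combining mark may be pulled into the match only if no skipped mark has an equal or higher combining class). It is not how ICU or any real collator is implemented, and it does not model the FCD code path.

```python
import unicodedata


def nfd(s):
    return unicodedata.normalize("NFD", s)


def match_contractions(text, table):
    """Toy discontiguous contraction matching over the NFD form of `text`.

    `table` is a set of NFD strings; every single character also matches itself.
    Returns the matched strings (one per collation-element mapping), in order.
    """
    chars = list(nfd(text))
    result = []
    while chars:
        # Step 1: longest contiguous match at the current position.
        s_len = 1
        for n in range(len(chars), 1, -1):
            if "".join(chars[:n]) in table:
                s_len = n
                break
        s = "".join(chars[:s_len])
        rest = chars[s_len:]
        # Step 2: try to extend the match with following unblocked non-starters.
        i = 0
        max_skipped = 0  # highest combining class among marks skipped so far
        while i < len(rest):
            cc = unicodedata.combining(rest[i])
            if cc == 0:
                break  # a starter blocks everything after it
            if cc > max_skipped and s + rest[i] in table:
                s += rest[i]
                del rest[i]  # consumed by the contraction
            else:
                max_skipped = max(max_skipped, cc)
                i += 1
        result.append(s)
        chars = rest
    return result


def show(strings):
    return ["<" + ", ".join(f"{ord(c):04X}" for c in s) + ">" for s in strings]


# The two contrived contractions from the example.
ORIGINAL = {"\u03B1\u0308", "\u0301\u0345"}
# Assumed closure: NFD forms of a), b), c); these cover d) and e) as well.
CLOSED = ORIGINAL | {nfd("\u03B1\u0344"), nfd("\u03B1\u0344\u0345"), nfd("\u0344\u0345")}

SEQ = "\u03B1\u0359\u0344\u0345"

print(show([nfd(SEQ)]))                       # ['<03B1, 0359, 0308, 0301, 0345>']
print(show(match_contractions(SEQ, ORIGINAL)))  # ['<03B1, 0308>', '<0359>', '<0301, 0345>']
print(show(match_contractions(SEQ, CLOSED)))    # ['<03B1, 0308, 0301, 0345>', '<0359>']
```

With the original table the NFD text splits into <03B1, 0308>, <0359>, <0301, 0345>; with the closed table it matches contraction d) followed by <0359>, which is the difference between the two kinds of implementation described above.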

