On Mon, 11 Feb 2013 17:13:58 -0800 Markus Scherer <[email protected]> wrote:
> I would not revise FCD itself. For a number of processes, it is > sufficient as is. For collation it's not. > > About the Tibetan precomposed vowels: > > For the LDML spec, I submitted a CLDR ticket this morning: > http://unicode.org/cldr/trac/ticket/5667 If we want to proceed along the current lines, then all we need is 'CFCD' (Collation FCD), which differs from FCD by excluding characters that decompose to two or more characters of which none have canonical combining class zero. The motivation for the sterner exclusion is provided by adding the following contrived collating elements to the a default collation: <U+03B1 GREEK SMALL LETTER ALPHA, U+0308 COMBINING DIAERESIS> <U+0301 COMBINING ACUTE ACCENT, U+0345 COMBINING GREEK YPOGEGRAMMENI> Proper canonical closure then requires contractions for: a) <U+03B1, U+0344 COMBINING GREEK DIALYTIKA TONOS> - this sequence is canonically equivalent to <U+03B1, U+0308, U+0301>, b) <U+03B1, U+0344, U+0345>, and c) <U+0344, U+0345> Now consider the sequence <U+03B1, U+0359 COMBINING ASTERISK BELOW, U+0344, U+0345>. Using the extended set of contractions, this splits into the discontiguous collating elements <U+03B1, U+0344, U+0345> and <U+0359>. However, using the original contractions along with normalisation, we obtain the collating elements <U+03B1, U+0308>, <U+0359>, <U+0301, U+0345>, which in general will sort differently. At first sight, the solution would be to prohibit <U+03B1, U+0344> and <U+03B1, U+0344, U+0345> from being discontiguous collating elements. But then the sequence will split into the collating elements <U+03B1>, <U+0359>, <U+0344, U+0345> - yet a third set of collating elements! > For UTS #10 section 6.5, I just now submitted an error report on > unicode.org . > For ICU, we have a statement in the User Guide that overlaps between > contractions and decomposition mappings are not supported, and we > have a ticket for trying to fix this by building more data. I believe the algorithm is less difficult than I thought. Having reduced the set of characters in CFCD strings, the algorithm simplifies. I use the following notation: Let F be the set of all CFCD strings. Let E(s) be the set of CFCD strings canonically equivalent to s. Let U be the set of strings of length one. Let T be a set of NFD collating elements. Then the canonical closure S of T is the least set such that: 1) E(T) ⊂ S 2) If xu ∈ S, vy ∈ T, u and v are characters, and vy is the last collation element in xuvy, then x(E(uv) ∩ U ∩ F)E(y) ⊂ S. It may be argued that the concept of CFCD strings is too strict: (i) The total exclusion of characters decomposing to two characters with non-zero canonical combining class is too strict - I can't see any problem in their following characters with trailing canonical class of zero. This would cover most of the cases where they are likely to be encountered. (ii) It does not seem onerous for a collating application to decompose the non-CFCD FCD characters on the fly. (iii) Having a string in FCD is not very far from having it in NFD. An iterator or stream delivering FCD can very simply be converted to one delivering NFD - all that is required is a mapping from characters to their NFD decompositions. Richard.

