On Sun, 23 Feb 2014 20:49:24 +0100 Philippe Verdy <[email protected]> wrote:
> It seems surprising that Michael Everson asks the question, when he
> already knows so much about Unicode algorithms (but may be less about
> notations used in CLDR data)
>
> The CLDR also has several competing notations for specifying
> collations so that may be the purpose of his question.

I have no confidence that his question has been understood. Collation is a monster, and it is unsafe to assume that one understands it. The ICU notation and implementation for an abstract definition of collation turned out to be full of traps, and won't catch up with CLDR definitions until Markus Scherer's raft of collation amendments goes in. (Or have I missed the announcement?) Rigorous definitions have had to address collation elements (i.e. sets of weights, one at each level, with 0 a special value), which is not as abstract as the ICU notation was meant to be.

As an example of the treachery of collation definitions, one might naïvely think that adding &a<<ạ to the default collation would result in ạ << á holding, but it doesn't, for á has two collation elements, not one. CLDR has now* redefined the notation so that &[before 2]á << ạ will give the ordering relationships a << ạ << á << à without having to reorder U+0323 COMBINING DOT BELOW.

In the default collations, secondary differences are implemented by adding collation elements with zero primary weights, while tertiary differences are implemented as different tertiary weights in collation elements with non-zero primary weights. I doubt that using both methods at the same level works well.

Fortunately, collation generally only needs to work well when restricted to valid words. For some languages, the task of placing an arbitrary string of the language's characters in the correct place by alphabetical order is meaningless.

*At least, referring to Version 24 of the LDML specification, I assume Part 5 Section 3.5, which defines "&..<<", also applies to Section 3.9, which purports to define the meaning of "&[before 2]..<<".
It's conceivable that I am wrong, and the meaning of "&[before 2]á << ạ" is undefined.

Richard.
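
To make the &a<<ạ trap concrete, here is a minimal sketch in Python. The weights are invented for illustration and are not DUCET or CLDR values; the names CES and sort_key are hypothetical, and the assumption that the tailoring gives ạ a single collation element whose secondary weight sits just above that of a is mine. It only shows why a one-element ạ ends up after a two-element á at the secondary level.

```python
# Toy weights only: these are NOT real DUCET or CLDR values.
P_A = 0x1000        # hypothetical primary weight shared by a, á and ạ
S_COMMON = 0x20     # hypothetical "common" secondary weight
S_TAILORED = 0x21   # hypothetical secondary assumed for ạ after &a<<ạ
S_ACUTE = 0x24      # hypothetical secondary weight of U+0301
T_COMMON = 0x02     # hypothetical "common" tertiary weight

A_DOT = "\u1EA1"    # ạ LATIN SMALL LETTER A WITH DOT BELOW
A_ACUTE = "\u00E1"  # á LATIN SMALL LETTER A WITH ACUTE

# Each string maps to its collation elements: (primary, secondary, tertiary).
CES = {
    "a": [(P_A, S_COMMON, T_COMMON)],
    # Assumed effect of the tailoring: one element, secondary just above a's.
    A_DOT: [(P_A, S_TAILORED, T_COMMON)],
    # á behaves as a + U+0301: two elements, the second with a zero primary.
    A_ACUTE: [(P_A, S_COMMON, T_COMMON), (0, S_ACUTE, T_COMMON)],
}


def sort_key(s):
    """Collect the non-zero weights level by level; earlier levels win."""
    ces = CES[s]
    return tuple(
        tuple(ce[level] for ce in ces if ce[level] != 0)
        for level in range(3)
    )


for s in sorted(CES, key=sort_key):
    print(s, sort_key(s))
```

With these toy weights the secondary level compares (0x21) for ạ against (0x20, 0x24) for á, and the first weight already decides the matter, so á sorts before ạ rather than after it.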

