On Sun, Feb 23, 2014 at 2:13 PM, Philippe Verdy <[email protected]> wrote:
> 2014-02-23 22:32 GMT+01:00 Richard Wordingham <[email protected]>:
>
>> On Sun, 23 Feb 2014 20:49:24 +0100
>> Philippe Verdy <[email protected]> wrote:
>>
>> At least, referring to Version 24 of the LDML specification, I assume
>> Part 5 Section 3.5, which defines "&..<<", also applies to Section 3.9,
>> which purports to define the meaning of "&[before 2]..<<". It's
>> conceivable that I am wrong, and the meaning of "&[before 2]á << ạ" is
>> undefined.

No, it's well-defined, and I believe that part of the spec has been
fairly complete since CLDR 24.

> This looks like a cryptic notation anyway. If we assume that there's an
> implicit reset at start of a collation rule, and that collation does not
> define any relative order for the empty string, you could simply write
> this reset at level 2 as:
>   << á << ạ

It might have made sense 15 years ago to permit relations without an
initial reset, because at the time the rules were applied on a blank
slate. Ever since ICU/CLDR collation rules were redefined to apply on top
of DUCET (and later on top of the CLDR root collation), you really need
to reset to something for the result to make sense. CLDR 24 forbids rules
without an initial reset, and ICU 53 will follow suit.

> instead of the mysterious notation (and in fact verbose and probably
> inconsistent in the way the same level 2 is further used with "<<"):
>   &[before 2]á << ạ

It is true that the "2" and the strength of the operator are redundant,
but the notation is now well-defined. I don't know your criteria for
"mysterious" :-) It does help to know the root collation mappings, or at
least how they are generally constructed; for example, that á maps to two
collation elements.

> I don't think the "&" is necessary except as a separator between
> separate rules (where all rules must implicitly start by a reset at some
> level).

See above.

> The "monster" you describe belongs to the ICU implementation (which is
> not part of any standard but is now integrated in various products that
> have abandoned the idea of implementing (unstable and complex)
> collations themselves).

I think Richard refers to the "monster" because it is very, very tricky
to get one's head around the interaction of all of the pieces of the UCA,
Unicode normalization, and the CLDR additions. At least when it comes to
the heads of Richard, Mark, Ken, and my own...

Also, the implementation of UCA is easy if you don't care about data size
or speed of string comparisons. Once you care about size and speed and
want additional functionality (like in ICU), it's a major chunk of code.

In the case of ICU, that code had accreted functionality, changed with
changing specs, and had gotten buggy and hard to maintain, so I am in the
process of reimplementing it, with hopes of getting it into ICU 53 in
March. The code and data actually got smaller, but it's still large.

> My opinion is that this part of ICU should be detached from it in a
> completely separated project, to help simplify it,

It's complex for the reasons stated above, and it benefits from many
lower-level parts of ICU (Unicode properties, normalization, data
loading, data structures, ...).

> It is notable that after so many years, collation is still not
> implemented in Javascript, and still does not have a standardized API in
> Javascript/ECMAscript minimum support for strings

Collation was added to the ECMAScript standard in 2012, with several
browsers implementing it. PyICU makes it available in Python.

If someone wanted to port code to JavaScript or Python, and wanted it to
be fast, the new (upcoming) ICU Java code might be a reasonable start.

> When performance of applications on the client side is a problem (for
> client-side applications needing to perform dynamic collations), full
> collators are not implemented at all, and these applications use a much
> simpler model (even if they don't work very well with lots of languages).

Right. If the client code need not collate newly typed strings, then one
good technique is to have the server send the corresponding sort keys.

By the way, ICU makes a strong effort to write very short sort keys.

Best regards,
markus
--
Google Internationalization Engineering
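
For illustration, the sort-key technique mentioned above could be
sketched roughly as follows in Python. This is only a toy under stated
assumptions: toy_sort_key, server_payload and client_sort are
hypothetical names, and the key is a simplified two-level scheme built
with the standard unicodedata module, not the UCA algorithm or ICU's
sort-key format. The point is only that the client orders strings by
comparing opaque bytes, without any collation data of its own.

```python
import unicodedata


def toy_sort_key(text: str) -> bytes:
    """Build a simplified two-level sort key (NOT UCA/ICU; illustration only)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Level 1 (toy "primary"): base characters only, case-folded.
    primary = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).casefold()
    # Level 2 (toy "secondary"): the combining marks, in order of appearance.
    secondary = "".join(ch for ch in decomposed if unicodedata.combining(ch))
    # The 0x00 separator compares lower than any encoded character, so a
    # string whose primary part is a prefix of another sorts first.
    return primary.encode("utf-8") + b"\x00" + secondary.encode("utf-8")


def server_payload(strings):
    """Server side: pair each string with its precomputed sort key."""
    return [(toy_sort_key(s), s) for s in strings]


def client_sort(payload):
    """Client side: order by plain byte comparison of the received keys."""
    return [text for _key, text in sorted(payload, key=lambda pair: pair[0])]


if __name__ == "__main__":
    words = ["Zebra", "résumé", "rot", "apple", "resume"]
    print(client_sort(server_payload(words)))
    print(sorted(words))  # plain code point order, for comparison
```

With a real collator the server would send the collator's own sort keys
instead; the client-side byte comparison would stay the same.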
_______________________________________________
Unicode mailing list
[email protected]
http://unicode.org/mailman/listinfo/unicode

