2014-02-24 20:38 GMT+01:00 Richard Wordingham <[email protected]>:
> My understanding of the meaning of the notation is that:
>
> 1) ạ is to have the same number and type of collation elements as á currently has;
> 2) The last collation element of ạ that has a positive weight at level 2 is to be immediately before the corresponding collation element of á at the secondary level;
> 3) No collation element is to be ordered between these two collation elements; and
> 4) Their other collation elements are to be the same.

I disagree with your point (1).

* The number of levels does not matter; the notation just indicates that the relation does not specify any starting weight for levels lower than the one indicated by the reset.
* The effective number of collation elements does not matter either: we should assume that if one of the items does not have enough collation elements, there is a zero weight for each missing level. In practice this only affects the first element, except in the case of contractions.

I disagree as well with point (2). The starting element (at the reset) may have a null weight at that level, so that we can still order other elements with the same null weight at that level, notably if they have non-null weights for higher levels.

I agree with your point (3), EXCEPT when the first item of a pair is a "reset" (i.e. an empty string).

Point (4) is completely wrong. The other collation elements in the first pair may be arbitrary (also possibly with distinct weights, but at higher levels)! In fact the two items listed after the reset do not matter at all. All that matters is the item for the reset itself and the first non-empty item ordered after it.

That's why I think that "&[before2] xxx" makes sense (even alone) and is in fact the same as "& << xxx", or even just "<< xxx" if you consider that every rule starts with an implicit reset in order to create a valid pair (in the first pair, the first item is the reset itself, i.e. an empty string; the second item is the first non-empty string indicated after it; and the pair itself has a numeric property specifying its level, here 2).

The form "&a < b < c < d ..." is a compressed form of these rules: "<a", "a<b", "b<c", "c<d" (where "<" is any kind of comparator for some level). The reset is then automatically the first item of each pair.

Internally, my implementation does not really use pairs, but triplets of the form (item1, weight, item2), where weight is a numeric value: initially it is set to an integer matching the level number, but I may set it to a fraction between two integers to define frequencies (adding a fraction between 0 included and 1 excluded to indicate the probability of having other collation elements with a lower weight at that level); see the sketch below. The fraction is then used to generate extremely compact collation keys (using arithmetic coding), even shorter than what we get in ICU or with Huffman coding or other context-free prefix codings.

So my own syntax never needs any explicit reset; it just orders collation elements with simple rules, to which I can also add optional statistics (used only for the generation of collation keys, not needed at all for comparing two strings). My own syntax is even simpler: I don't use the multiple operators "<", "<<", "<<<" at all; I just specify rules for each level separately, and for that I do not even need any operator, just a common separator, so that I can create the list of successive pairs (starting with the pair containing the implicit "reset" for that level).
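To illustrate the expansion, here is a minimal sketch in Java (the language my code is written in, as noted below); the names RuleExpander, Triplet and expand are hypothetical, and this is not my actual implementation:

    import java.util.ArrayList;
    import java.util.List;

    // Minimal sketch (hypothetical names): expands a chained rule such as
    // "&a < b < c < d" into the successive (item1, weight, item2) triplets
    // described above; the weight starts as the integer level of the
    // comparator (1 for "<", 2 for "<<", 3 for "<<<"), and an empty string
    // models the implicit reset.
    final class RuleExpander {

        static final class Triplet {
            final String item1, item2;
            final double weight;      // integer level, plus an optional fraction
            Triplet(String item1, double weight, String item2) {
                this.item1 = item1; this.weight = weight; this.item2 = item2;
            }
            @Override public String toString() {
                return "(" + item1 + ", " + weight + ", " + item2 + ")";
            }
        }

        static List<Triplet> expand(String reset, int level, String... items) {
            List<Triplet> triplets = new ArrayList<Triplet>();
            String previous = reset;  // "" when there is no explicit reset
            for (String item : items) {
                triplets.add(new Triplet(previous, level, item));
                previous = item;      // each item is the reset of the next pair
            }
            return triplets;
        }

        public static void main(String[] args) {
            // "&a < b < c < d" at level 1: (a,1,b), (b,1,c), (c,1,d)
            System.out.println(expand("a", 1, "b", "c", "d"));
            // "&[before2] xxx", i.e. "<< xxx" with the implicit reset: (,2,xxx)
            System.out.println(expand("", 2, "xxx"));
        }
    }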
In the middle of the list, however, I may set some percentage value to indicate a relative frequency for the second element of a pair (the first element of the pair with the reset always starts at 0%, so the percentage is only indicated for the second element of each pair, and this percentage can never be 100% or more). Unless I specify a percentage, the list of pairs uses implicit percentage values distributed linearly over the given list; a sketch of this follows below. (Internally the precision of a double is sufficient for practical cases, though doubles are immediately converted to 64-bit integers, which may be needed if the statistics given are extremely skewed. If statistics are not used to generate collation keys, I could even use 32-bit integers only for building the lookup tables for comparing strings; so a tuning parameter can simply be used to ignore/discard the statistics provided in the rules, in which case the construction of the lookup tables is significantly faster and the binary lookup tables are 50% more compact.)

Such a specification is much simpler to edit, and in many cases I don't need to specify everything, because I also perform the normalization closure of the rules. The result is that the rules are much shorter: just to represent the DUCET itself, via a script parsing it and generating the rules, I get strings converted to short JSON data that can be parsed easily (the normalization closure still being the longest part of the algorithm when instantiating a new collator). I also never need or use the collation weights suggested in the DUCET, except for their relative values to order the elements in much more compact rules. Building the effective table of weights is then extremely trivial. Performing the final encoding of collation keys is not much more complex than a traditional arithmetic coder. (It is a bit simpler if you use Huffman coding; however it is *slower* when instantiating the collator, because you need to precompute the bit patterns, something not needed at all with arithmetic coding, which uses the statistics directly without needing to prebuild a helper lookup table, and also because of the complex bit stuffing, which is simpler to do with the arithmetic coder.)

I already spoke about this several months ago on this list. However, I've not experimented with it on lots of locales needing complex collation rules (so I did not investigate all the data already specified in CLDR for some locales). My code remains experimental, mainly because I do not handle specific rules needing custom word breakers (e.g. for South-East Asian scripts). I've also not experimented much with CJK sinograms (and the complex lookup they require if we need to look up supplementary data such as transliteration mappings for Pinyin or similar). And I still don't handle some of the preprocessing needed for some Indic scripts (including Thai), or ignorable "stop words" (which may be needed even for English or French in some kinds of collators); and the initial code to handle parsing and preprocessing of numbers is broken in my "pipelined" implementation, which attempts to avoid most internal buffering and internal copies of transformed strings (to avoid stressing memory allocators and garbage collectors).

The code is not written in C/C++ but in Java (easier to test; I have little experience with Python and don't like PHP; I could have used Lua, also good for experimenting with designs, but it still lacks a JIT compiler to native code and its performance is too poor compared to current implementations of JavaScript).
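Here is the promised sketch of the implicit statistics (again with hypothetical names, not my actual code): when no explicit percentage is given, the elements ordered at a level get cumulative frequencies distributed linearly over [0%, 100%), with the reset element starting at 0%, and the doubles are then scaled to 64-bit integers since the coder works on integer ranges:

    // Minimal sketch (hypothetical names) of the implicit linear
    // distribution of percentages, scaled to 64-bit integers.
    final class ImplicitStats {

        // Returns the cumulative starting frequency of each of n ordered
        // elements: element i starts at i/n of the full range, so the
        // first element (the reset) starts at 0.
        static long[] linearCumulative(int n) {
            long[] starts = new long[n];
            for (int i = 0; i < n; i++) {
                // scale i/n into the positive 63-bit range to avoid overflow
                starts[i] = (long) ((double) i / n * Long.MAX_VALUE);
            }
            return starts;
        }

        public static void main(String[] args) {
            // 4 elements ordered at some level: starts at 0%, 25%, 50%, 75%
            for (long s : linearCumulative(4)) {
                System.out.printf("%.2f%%%n", 100.0 * s / Long.MAX_VALUE);
            }
        }
    }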
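And a toy illustration of why the arithmetic coder mentioned above needs no precomputed bit patterns, unlike Huffman: the interval is narrowed directly with the probabilities taken from the rule statistics. (This double-based version is only valid for short inputs; a real coder renormalizes with integer arithmetic. All names are hypothetical.)

    // Toy demonstration, not the actual coder: no Huffman-style tree or
    // bit-pattern table is prebuilt; the statistics are used directly.
    final class ToyArithmeticCoder {

        // cumulative[i] = probability mass of all symbols ordered before i;
        // p[i] = probability of symbol i.
        static double encode(int[] symbols, double[] cumulative, double[] p) {
            double low = 0.0, range = 1.0;
            for (int s : symbols) {
                low += range * cumulative[s];  // narrow the interval to the
                range *= p[s];                 // sub-interval of this symbol
            }
            return low + range / 2;            // any value inside the interval
        }

        public static void main(String[] args) {
            // 3 collation elements with skewed frequencies 70% / 20% / 10%
            double[] p = {0.7, 0.2, 0.1};
            double[] cum = {0.0, 0.7, 0.9};
            System.out.println(encode(new int[]{0, 0, 1, 2, 0}, cum, p));
        }
    }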
I'd like to have a version that can be deployed on the client side using a scripting language that is easily deployed (this also excludes C/C++ implementations like ICU4C, which I consider simply not available if they are only supported by specific framework libraries). But my initial attempts with JavaScript have also shown that it was still too slow even with a JIT, and still stressed the memory allocator too much; that code is also very unsafe, error-prone, and a nightmare to debug.

