I expect to publish soon a completely different way to represent the DUCET, compatible with the rationals representation. In fact this new dataset will be FULLY compatible with it (and also with the CLDR root derived DUCET), without even needing to specify *any* numeric value in the data (say goodbye to *all* numeric weights in this data file, as well as to the existing very complex representation in CLDR, which needlessly tries to assign fixed collation levels).
My representation will in fact contain ONLY ordered lists of collation elements.

The primary level will contain only 10 pseudo-elements: <ignorables>, <variables>, <whitespaces>, <punctuations>, <symbols>, <currencyunits>, <numbers>, <letters>, <privateuses>, <levelterminator>.

The secondary level will list only the "leaders" in the sequences of collation elements, listed after the 10 pseudo-elements above; new pseudo-elements will be added for scripts, in the subsequence between <letters> and <privateuses>.

The ternary level will list only the existing "leaders" for the elements currently differentiated at the primary level in the existing DUCET.

The format will allow specifying arbitrary rational values for some collation elements (listed in ascending order at each level), based on estimated statistics of use of the collation elements in the sequence (this will in fact be given as a percentage). All other collation elements will be assigned rational values algorithmically. Typically, a tailoring could change those statistics to compute a good rational approximation of the statistics for the fixed elements, and would then assign the other unspecified weights using a linear distribution over the rational range delimited by the collation elements with assigned statistics. These statistics will not need to be specified for all collation elements, only for a few significant subranges of collation elements in the sequence.

Pseudo-elements in the sequences of collation elements will automatically be assigned an unmodifiable statistic of "EPSILON", i.e. the smallest representable rational 1/N, where N is any integer between the total number of collation elements and pseudo-elements in the subsequence and the largest acceptable integer supported by the implementation.
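As a minimal sketch of the basic algorithmic assignment described above (class and method names are hypothetical, not from any published implementation): count the collation elements of a subsequence to obtain the common denominator, then assign numerators as an increasing integer sequence over the half-open range [0/1 .. 1/1[.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: assign each collation element of an ordered
// subsequence the rational weight i/N over the half-open range
// [0/1 .. 1/1[, where N is the number of elements in the subsequence.
public class RationalWeights {
    // Maps each element to {numerator, denominator}; element i gets i/N.
    static Map<String, int[]> assignWeights(List<String> elements) {
        int n = elements.size();
        Map<String, int[]> weights = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            weights.put(elements.get(i), new int[] { i, n });
        }
        return weights;
    }

    public static void main(String[] args) {
        // First element stands for ignorables, last for the level terminator.
        List<String> seq = List.of("<ignorables>", "a", "b", "<levelterminator>");
        for (Map.Entry<String, int[]> e : assignWeights(seq).entrySet()) {
            System.out.println(e.getKey() + " -> "
                    + e.getValue()[0] + "/" + e.getValue()[1]);
        }
    }
}
```

A tailoring would only override some of these defaults with statistics-based rationals, leaving the rest to this linear fill.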
If a tailoring assigns a statistic lower than this EPSILON value, it will automatically be increased to at least this value, and the sum of the statistics for all collation elements in the subsequence will then be checked to make sure that it does not exceed 100%, after temporarily assigning this epsilon value to all collation elements with unspecified statistics; if 100% is exceeded, the EPSILON value will be reduced and all other statistics will be proportionally adjusted. This makes all subsequences conforming, so that each one gets a well-defined proportional representation as a subsegment of rationals over the half-open range [0/1 .. 1/1[ of rationals. I just decided to remove the last boundary 1/1: it is not necessary, and it later complicates the conversion to bitstrings in the Huffman or arithmetic conversion used when generating collation keys.

But when only comparing pairs of strings, where no collation key is used or needed, we can just compare the rational weights without generating Huffman or arithmetic bitstrings, which is MUCH faster, and there's no need in this case to use the statistics. The rationals don't have to be tuned per locale to match their frequencies of use just for comparing texts. That's why they will be left unspecified for use in the DUCET or CLDR root (all the rational weights will be generated by a very basic algorithm counting the number of collation elements in each subsequence to compute their common denominator, and then assigning numerators as an integer sequence).

It will be implied that the first element in the sequence for a given level will be used to represent all ignorable collation elements at this level, and that the last will be used for the level terminator (the only pseudo-element to keep in this sequence, and one that you won't even need to specify in the data set).
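The fast key-less comparison mentioned above reduces to comparing two fractions, which needs no division at all: a sketch using cross-multiplication with 64-bit intermediates so that 32-bit numerators and denominators cannot overflow (names are illustrative only).

```java
// Hypothetical sketch: compare two rational collation weights a/b and c/d
// directly, without generating any Huffman or arithmetic bitstring.
public class RationalCompare {
    // a/b < c/d  <=>  a*d < c*b  (denominators are assumed positive).
    static int compare(int aNum, int aDen, int bNum, int bDen) {
        return Long.compare((long) aNum * bDen, (long) bNum * aDen);
    }

    public static void main(String[] args) {
        // 1/3 < 2/5 because 1*5 < 2*3
        System.out.println(compare(1, 3, 2, 5)); // negative value
    }
}
```

Because the comparison ignores the statistics entirely, the same code works for the untuned DUCET weights and for any locale-tuned tailoring.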
So the data set will consist only of lists of collation elements (for example as a comma-separated list; the separator could as well be newlines, or "<", or just spaces or tabs, to allow commenting the list after a "#" sharp sign up to the newline), and nothing else. Using a syntax that does not need to specify the level number between each element, it will be MUCH more compact than even the existing "abbreviated" syntax used in LDML. No complex parser will be needed, except for the isolated collation elements themselves.

There will be no "reset" at all: a single reset will be implied at the beginning of each sequence for a given level N, or whenever a collation element matches <levelterminator>. It will be implicit that levels need to be specified separately in ascending order, and it will not be necessary to assign numeric values to these levels, only possibly symbolic names (such as "accents" or "cases") if one wants to swap levels in a subsequent tailoring.

With some help, I could demonstrate a basic implementation in Java, with almost no use of object programming, of how to infer the rational weights from this simpler data file (for the simplest case where you don't want to tune the statistics to make the generated collation keys optimally compressed with Huffman or arithmetic coding). Optionally, but later, the code could demonstrate how to use the existing DUCET format, using the integer numeric values exposed there as a way to define the numerator of a rational whose denominator would be inferred; but I am now convinced that this DUCET format currently used in the UCD creates more trouble than it helps.

-- Philippe.

2011/9/7 Philippe Verdy <[email protected]>:
> I've just tried this idea, and this is working magically as I wanted.
> It generates extremely compact collation keys, notably if I no longer
> restrict collation levels so that they use a limited range: instead I
> explicitly use a level separator, to which I assign the final range of
> rationals [7/16 ..
> 16/16], that I can represent as 4 bits (1111).
>
> Then each level can use the collation weights in the half-open range
> [0/16 .. 7/16[, which can be split into as many elements as needed. I
> have also used the [0/16 .. 1/16[ half-open range for the weight
> representing ignorable elements: this will use 4 bits (0000) in the
> collation keys. This leaves the half-open range [1/16 .. 7/16[ for
> representing all non-ignorable weights for this level.
>
> For the primary level, I can predefine fixed ranges of rationals for
> representing variable elements, whitespaces, symbols, punctuations,
> numbers, and then letters for each script (also with fixed ranges for
> each script).
>
> Loading tailorings becomes extremely easy when using rationals; the
> algorithms become MUCH simpler.
>
> And finally, once you have assigned rationals to all collation
> elements for a specific tailoring (including for the DUCET itself), it
> is an extremely easy transform to infer integer weights. This means
> that you no longer need to use gaps in the sequences of default
> weights, or to give them special meanings. But my opinion is that
> this representation using integers is much more difficult to compress
> in collation keys.
>
> In fact, in a specific locale, where you have a good estimation of
> the frequency of use of collation elements, it is easy to assign them
> an associated fixed rational, and to place all other collation
> elements by splitting the intervals as needed to fit. And you get
> extremely compact collation keys represented either by Huffman
> encoding (or by arithmetic coding if you have a licence for it from
> IBM, but this does not reduce much the length of collation keys).
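One way to see where 4-bit codes like 0000 come from is the truncated binary expansion of a rational in [0 .. 1[: any rational in [0/16 .. 1/16[ starts with the bits 0000, and any rational in [15/16 .. 1[ starts with 1111. The sketch below is only that expansion (the method name is hypothetical), not the author's actual coder, which assigns codes from cumulative ranges.

```java
public class BitPrefix {
    // Hypothetical sketch: emit the first `bits` binary digits of the
    // rational num/den in [0 .. 1[, i.e. the bit prefix a Huffman- or
    // arithmetic-style coder would start from when packing a weight
    // into a collation key.
    static String bitPrefix(int num, int den, int bits) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bits; i++) {
            num *= 2;                  // shift the binary expansion left
            if (num >= den) {
                sb.append('1');
                num -= den;
            } else {
                sb.append('0');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(bitPrefix(0, 16, 4));  // start of the ignorables range
        System.out.println(bitPrefix(15, 16, 4)); // start of the last 1/16 range
    }
}
```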

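Going back to the proposed data file format (flat lists of collation elements with "#" comments running to the end of the line), a minimal parser sketch could look like this; names are hypothetical, and for simplicity only commas, spaces, tabs and newlines are handled as separators, not "<", since "<" also appears inside the pseudo-element names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a parser for the proposed flat data file:
// collation elements separated by commas or whitespace, with "#"
// starting a comment that runs to the end of the line.
public class SequenceParser {
    static List<String> parse(String text) {
        List<String> elements = new ArrayList<>();
        for (String line : text.split("\n")) {
            int hash = line.indexOf('#');
            if (hash >= 0) line = line.substring(0, hash); // strip comment
            for (String tok : line.split("[,\\s]+")) {
                if (!tok.isEmpty()) elements.add(tok);
            }
        }
        return elements;
    }

    public static void main(String[] args) {
        String data = "<ignorables>, a b\tc # trailing comment\nd";
        System.out.println(parse(data)); // prints [<ignorables>, a, b, c, d]
    }
}
```

As claimed above, nothing beyond tokenizing the isolated collation elements is needed; the implied resets and level ordering carry all the remaining structure.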
