2011/8/31 <[email protected]>: > PRI #203 Proposed Update for UTS #10: Unicode Collation Algorithm > http://www.unicode.org/review/pri203/
I just have had an idea, which may be interesting. Currently, the collation weights defined in the DUCET for each level are maintained as a list of integers only. The main problem is that at each revision of the UCS (when characters get added), or for the need to correct the relative weights of characters, the collation weights need to be renumbered (notably for the primary level, but in fact as well for the subsequent levels, if there's a need to make more numeric space available for higher collation levels that normally have lower collation weight values). Why not using a representation not based on integers, but instead on *rationals* ? We can indefinitely split a rational interval without having to renumber all other nearby collation elements at the same or lower collation levels. And when designing a tailoring, instead of complex renumbering strategies, we can as well subdivide intervals into as many rational subintervals as we want (just choose the denominator of the inserted rationals as the product of the denominator D for the existing level you want to split, by the number N of (element to insert, plus one), and then you can insert between the rational weights (x)/(D) and (X+1)/(D), whose value does not have to be changed, the (N-1) new distinct rational collation weights (XN+1)/(ND) to (XN+N-1)/(ND). Let's just remember that the absolute value of collation weights does not matter, only their relative order in the sequence for any specific collation level (as well, if collation weights are allocated so that you don't need to insert a level separator in the generated collation keys, the set of collation weights for level N+1 just has to be fully lower than the set of collation weights for level N. And rational numbers perfectly fit this need. Once you have computed all rationals for all levels needed for a specific tailoring (or even for the DUCET) a very basic integer counting of the ordered sequence of rationals can convert them into compact integers that are then easy to insert in collation keys. Using rationals would also have some benefits: it allows compression according to statistics of use: you can start by definining an interval of integers (i.e. rationals with denominator 1) from 0 to 255: you'll split the sequence of all collation weights according to their relative frequency. For more collation weights between two elements, count the number you need to insert, compute the new denominator, and insert them there. More formally, you can just start by the rational inclusive interval [0/1,1/1] which contains the full sequence of all possible collation weights. The position at which you'll split this interval is arbitrary, so you can use technics similar to the Arithmetic coding (used for data compression), or to the Huffmann coding, to split it conveniently with a varying number of bits for each subdivision, based on the estimated statistics of use. With this concept, you'll be able to build extremely compact collation keys that will use variable numbers of bits): just append those sequences of bits, level by level for all collation elements, then add arbitrary padding null bits to fill the collation key into an integer number of bytes if you want. In addition, you'll be able to define *very stable* collation weights in the DUCET, now represented as rationals instead of just integers. Notably, you'll be able to assign *permanent* collation weights for the pseudo-collation elements that represent the script separation. What do you think of this marvelous idea ? --Philippe.

