> 2013/3/16 Richard Wordingham <[email protected]>: >> But with the low/high split scheme, start units have to have low values >> (e.g. 20, 21 & 22) and continuation units have high values (e.g. 80) >> just to stop this very problem.
Note also that all technics used for data compression can be used there as well. For example frequent sequences of weights could also have a predictive encoding model, notably when creating collation keys for strenghts 2 or higher, because there will be very frequent sequences of identical secondary weights. So instead of encoding the same secondary weights independantly, you could also represent the sequnce of two identical secondaries by a single code. Predictive models include distinary-based compression schemes such as LZW, and it is also combinable woth the Huffmann or aithmetic coding to compress these seuqnces with the smallest number of bits. in the final collation key. The result of these optimization steps is such that collation keys could even be the same size as the oriignal plain-text, but they could even be shorter and have the same average size as the result of a data-compression of plain-text (if you have first trained your encoding using a significant corpus of text from which you have precomputed the predictive model which will finally be used to encode all collation keys. In such a scheme you don't even need to make distinction between weights for different collation levels. For example if two distinct collation elements only have primary differences, and there cannot exist any other collation element thatwill sort between them, then their secondary differences does not matter, and zeroes no longer need to be encoded (the compression scheme will then generate zero bits for these secondaries) : the LZW algorithm would detect this implicitly. The only complexity is about how you will precompute the dictionary needed for the predictive model, and the number of entries it will contain (because this dictionnary will need to be static, and not dynamic as collation keys must still be computable separately from separate occurences of plain text).

