From: "Philippe Verdy" <[EMAIL PROTECTED]>
> To make things simpler, introduce a special collation key value which is lower
> than all others (0 in the example above), and you get a simpler view of
> collation elements as a single vector of numeric values, if you use it as a
> terminator between each level for the resulting collation string:
>   "aa"  => (1, 1, 0, 10, 10, 0)
>   "AB"  => (1, 2, 0, 11, 11, 0)
>   "Ab"  => (1, 2, 0, 11, 10, 0)
>   "Aba" => (1, 2, 1, 0, 11, 10, 10, 0)
> This simplifies things to get binary comparable vectors of numeric values. The
> length of the vector depends on the length (in characters or collation elements)
> of input strings, and on the number of levels considered.

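
The terminator scheme quoted above can be sketched in a few lines of Python. This is only an illustration with the toy weights of the example (a=1, b=2, lowercase=10, uppercase=11); the names PRIMARY, LOWER, UPPER, TERMINATOR and sort_key are made up for the sketch and come from no library:

```python
# Toy weights taken from the quoted example; they are not DUCET values.
PRIMARY = {"a": 1, "b": 2}   # level 1: letter identity, case ignored
LOWER, UPPER = 10, 11        # level 2 in this toy model: case
TERMINATOR = 0               # lower than every real weight


def sort_key(text, levels=2):
    """Return one flat tuple: the weights of each level, then the terminator."""
    key = []
    if levels >= 1:
        key += [PRIMARY[ch.lower()] for ch in text] + [TERMINATOR]
    if levels >= 2:
        key += [UPPER if ch.isupper() else LOWER for ch in text] + [TERMINATOR]
    return tuple(key)


words = ["Aba", "Ab", "AB", "aa"]
# Tuples compare element by element, like a binary comparison of the vectors.
print(sorted(words, key=sort_key))  # ['aa', 'Ab', 'AB', 'Aba']
```

Because the terminator is lower than any real weight, a string that is a primary prefix of another ("Ab" against "Aba") sorts first without ever looking at the next level.
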
Note that DUCET uses another solution: no separator is used; instead, all primary weights are made higher than all secondary weights. If you read the Unicode Collation Algorithm, you'll see that the value 0 is used to mean "ignorable at that level", so that it can be dropped from the collation keys generated from input strings.

I would have done better to use the values already assigned in the current version of DUCET. Note, however, that these values are arbitrary; only their relative order is important.

Also, the Unicode Collation Algorithm is a bit more complex, for two reasons. First, it allows tailoring the order in which the weights of a given level are written to the output collation key: they can be output in forward or backward order, notably backward at level 2 (accents) for French ordering. Second, it takes into account not only single characters but also groups of Unicode characters treated as single units for collation, for example the digraphs that many languages sort as if they were one letter, such as in Spanish, or the recently discussed "gb" in Yoruba.
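
To make these points concrete, here is a rough Python sketch. It is not the UCA itself and uses no DUCET values: the weight tables, the "ch" contraction and the names collation_elements and sort_key are assumptions made up for the illustration. It shows keys built without a separator (every primary weight is higher than every secondary weight), zero weights dropped as ignorable, an optional backward level 2, and a digraph handled as one unit:

```python
import unicodedata
from string import ascii_lowercase

# Hypothetical toy weights (not DUCET values). Every primary weight is larger
# than every secondary weight, so no level separator is needed.
COMMON = 1
ACCENTS = {"\u0301": 2, "\u0302": 3}      # combining acute, combining circumflex
PRIMARY = {ch: 100 + 10 * i for i, ch in enumerate(ascii_lowercase)}
CONTRACTIONS = {"ch": PRIMARY["c"] + 5}   # one unit sorting between c and d


def collation_elements(text):
    """Yield (primary, secondary) pairs; 0 means ignorable at that level."""
    s = unicodedata.normalize("NFD", text.lower())
    i = 0
    while i < len(s):
        if s[i:i + 2] in CONTRACTIONS:    # try a group before single letters
            yield CONTRACTIONS[s[i:i + 2]], COMMON
            i += 2
        elif s[i] in ACCENTS:             # accents are ignorable at level 1
            yield 0, ACCENTS[s[i]]
            i += 1
        else:
            yield PRIMARY[s[i]], COMMON   # KeyError outside the toy table
            i += 1


def sort_key(text, backward_secondary=False):
    elements = list(collation_elements(text))
    primary = [p for p, _ in elements if p != 0]    # drop ignorable weights
    secondary = [s for _, s in elements if s != 0]
    if backward_secondary:
        secondary.reverse()               # French-style accent ordering
    return tuple(primary + secondary)


french = ["côté", "coté", "côte", "cote"]
print(sorted(french, key=sort_key))
# ['cote', 'coté', 'côte', 'côté']
print(sorted(french, key=lambda w: sort_key(w, backward_secondary=True))) 
# ['cote', 'côte', 'coté', 'côté']
print(sorted(["dama", "chico", "cuna"], key=sort_key))
# ['cuna', 'chico', 'dama']
```

With the backward option, the last accent difference in the word decides first, which is what gives the French order; with the "ch" unit, "chico" sorts after every word starting with a plain "c".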

