On Sat, 16 Mar 2013 09:29:07 -0700 Markus Scherer <[email protected]> wrote:
> On Sat, Mar 16, 2013 at 4:09 AM, Richard Wordingham
> <[email protected]> wrote:
>
> > Please give an example of how the low/high split would fail. With
> > the primary collation weights 20, 21, 21 80 and 22 I get the
> > following primary collation weight sequences for one and two
> > collating elements, marking boundaries of collating elements with
> > commas:
> >
>
> The problem is that if you have 21 and 21 80, and another primary
> starts with 80, you can't distinguish the sequence 21 | 80 from the
> one weight 21 80.

But with the low/high split scheme, start units have to have low values
(e.g. 20, 21 & 22) and continuation units have high values (e.g. 80)
just to stop this very problem.

> > For most uses, in particular, those in DUCET, the trailing units
> > must not be mistakable for variable primary collation elements.

> You have to know which one is a trailing unit. I suppose you could do
> it via ranges like in UTF-8, but that means you can use fewer byte
> values per position and thus yields longer weights, and longer sort
> keys.

With allkeys-type definitions and no more tailoring than strengths and
variable weight schemes (with untailorable variable weight ranges), the
implementation doesn't need to know which are trailing units, unless it
is checking well-formedness. Should it need to know, all it has to
check for is zero level 3 weights.

If the variableTop parametric tailoring parameter is effectively
removed, then a very well-formed table would be such that all four
possibilities for the set of variable primaries selectable by standard
UCA parametric tailoring had a well-formed collection of variable
weights. DUCET does this by ensuring that there are no large weights in
the region of interest, and that keeps sorting implementations simple
once one has split a string (and its characters!) into collating
elements.

The only size-related issue left is specifying how to mimic the odd
behaviour of some ICU rules defining ordering. Perhaps that is not a
UCA issue - the standard UCA parametric tailorings do not call up such
definitions.

Richard.
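
For illustration only, here is a minimal Python sketch of why the
low/high split keeps weight boundaries recoverable. The boundary value
0x80 and the names CONTINUATION_MIN and split_weights are assumptions
made up for this example (the thread only says that start units are low
and continuation units are high); none of this is taken from DUCET or
ICU.

```python
# Hypothetical boundary: units below CONTINUATION_MIN start a weight,
# units at or above it continue one. 0x80 is an assumption based on the
# example values in this thread, not on any specification.
CONTINUATION_MIN = 0x80


def split_weights(units):
    """Split a flat sequence of weight units into individual weights."""
    weights = []
    for unit in units:
        if unit < CONTINUATION_MIN:
            weights.append([unit])      # low value: a new weight starts
        elif weights:
            weights[-1].append(unit)    # high value: extends the current weight
        else:
            raise ValueError("sequence starts with a continuation unit")
    return [tuple(w) for w in weights]


# The two-unit weight 21 80 followed by the weight 22:
print(split_weights([0x21, 0x80, 0x22]))  # [(33, 128), (34,)]
# The single-unit weights 21 and 22:
print(split_weights([0x21, 0x22]))        # [(33,), (34,)]
```

Under these assumptions the sequence 21 | 80 cannot arise, because 80
is never allowed to start a weight; split_weights([0x80]) raises
ValueError rather than guessing a boundary.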

