On Fri, 15 Mar 2013 21:12:48 -0700, Markus Scherer wrote:

> On Fri, Mar 15, 2013 at 6:52 PM, Richard Wordingham wrote:

(Well, actually the send button was pressed at 01.52 GMT on Saturday.)

> > > The point is that no sequence of
> > > units (8-bit, 16-bit or whatever the implementation uses) can be
> > > an exact prefix of another sequence.

> > That's only for efficiency.

> No, it's critical for correctness.

> > One could allocate low unit values to the
> > start units and high unit values to continuation units.

(Paragraph split in this post, for greatly improved clarity.)

Please give an example of how the low/high split would fail. With the
primary collation weights 20, 21, 21 80 and 22 I get the following
primary collation weight sequences for one and two collating elements,
marking boundaries of collating elements with commas:

20       20, 20       20, 21       20, 21 80       20, 22
21       21, 20       21, 21       21, 21 80       21, 22
21 80    21 80, 20    21 80, 21    21 80, 21 80    21 80, 22
22       22, 20       22, 21       22, 21 80       22, 22

They seem to be in perfect order to me.

> > By using
> > high values for continuation units, DUCET simplifies the
> > identification'

> One could pick nearly any range for the trailing units. With the UCA
> spec using 16-bit units and only 21 bits to encode in a pair, there
> is nearly free choice for the range of trail units.

For most uses, in particular, those in DUCET, the trailing units must
not be mistakable for variable primary collation elements. Before
positive non-variable primary weights less than variable primary
weights were allowed, it was very easy to check for such a problem as
one read in an allkeys-style UCET. (It's still very easy if the first
positive weight is variable, as in allkeys.txt itself.)

Richard.
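P.S. For anyone who wants to repeat the ordering check on the example weights 20, 21, 21 80 and 22 mechanically, here is a minimal sketch in Python. It is only an illustration of the check, not code from any UCA implementation. The names ELEMENTS, flatten, sequences and order_is_preserved are made up for this post, and reading the weights as hexadecimal is an assumption (the ordering is the same if they are read as decimal).

```python
from itertools import product

# Collating elements from the example, listed in their intended order.
# Each element is a tuple of units; 0x80 is used only as a continuation unit.
# Assumption: the weights in the post are hexadecimal.
ELEMENTS = [(0x20,), (0x21,), (0x21, 0x80), (0x22,)]


def flatten(elements):
    """Concatenate the units of a sequence of collating elements."""
    return tuple(unit for element in elements for unit in element)


def sequences(max_len=2):
    """Yield every sequence of 1..max_len elements as a tuple of element ranks."""
    for n in range(1, max_len + 1):
        yield from product(range(len(ELEMENTS)), repeat=n)


def order_is_preserved(max_len=2):
    """Check that unit-by-unit comparison agrees with element-by-element order."""
    # Sorting rank tuples gives the intended element-level order.
    by_element = sorted(sequences(max_len))
    keys = [flatten(ELEMENTS[rank] for rank in seq) for seq in by_element]
    # The flattened keys must be strictly increasing: sorted and without duplicates.
    return keys == sorted(keys) and len(set(keys)) == len(keys)


if __name__ == "__main__":
    print(order_is_preserved(2))
```

The check covers only this four-weight example with sequences of one and two collating elements; it says nothing about other weight assignments.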

