Re: Phoenician

Philippe Verdy Fri, 07 May 2004 20:29:10 -0700

From: "Philippe Verdy" <[EMAIL PROTECTED]>
> To make things simpler, introduce a special collation key value which is lower
> than all others, (0 in the example above), and you get a simpler view of
> collation elements as a single vector of numeric value, if you use it as a
> terminator between each level for the resulting collation string:
> "aa" => (1, 1, 0, 10, 10, 0)
> "AB" => (1, 2, 0, 11, 11, 0)
> "Ab" => (1, 2, 0, 11, 10, 0)
> "Aba" => (1, 2, 1, 0, 11, 10, 10, 0)
> This simplifies things to get binary comparable vectors of numeric values. The
> length of the vector depends on the length (in characters or collation
> elements)
> of input strings, and on the number of levels considered.
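
As a rough illustration of the scheme quoted above, here is a minimal Python
sketch. The weight table (a=1, b=2, lowercase=10, uppercase=11) comes from the
example; the names PRIMARY, TERMINATOR and collation_key are made up for the
illustration and are not part of any real collation library.

```python
# Toy weights taken from the quoted example: primary weight by letter,
# second-level weight by case. All names here are illustrative only.
PRIMARY = {"a": 1, "b": 2}
LOWER, UPPER = 10, 11
TERMINATOR = 0  # lower than every real weight


def collation_key(text):
    """Return one flat tuple: primary weights, 0, case weights, 0."""
    primary = [PRIMARY[ch.lower()] for ch in text]
    secondary = [UPPER if ch.isupper() else LOWER for ch in text]
    return tuple(primary + [TERMINATOR] + secondary + [TERMINATOR])


words = ["Aba", "AB", "aa", "Ab"]
# Tuples compare element by element, so the keys are directly comparable.
print(sorted(words, key=collation_key))
```

Because the terminator is lower than any real weight, a string whose primary
level ends first sorts before any longer string that shares its prefix.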


Note that DUCET uses another solution: no separator is used, but instead all
primary weights are made higher than all secondary weights. If you read the
Unicode collation algorithm, you'll see that the value 0 is used to mean
"ignorable at that level", so that it can be suppressed for the collation keys
generated from input strings.
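
A small Python sketch of that idea, under stated assumptions: the weights in
TOY are invented for the illustration (they are not actual DUCET values), and
only two levels are modelled. Every first-level weight is larger than every
second-level weight, and 0 marks "ignorable at that level".

```python
# Hypothetical toy weights, NOT real DUCET values. Each entry is
# (level-1 weight, level-2 weight); 0 means "ignorable at that level".
TOY = {
    "a": (0x20, 0x05),
    "b": (0x21, 0x05),
    "A": (0x20, 0x06),
    "B": (0x21, 0x06),
    "-": (0x00, 0x03),  # ignorable at level 1, still weighted at level 2
}


def key_without_separator(text):
    """Concatenate the non-zero weights level by level, with no separator."""
    elements = [TOY[ch] for ch in text]
    key = []
    for level in range(2):
        key.extend(e[level] for e in elements if e[level] != 0)
    return tuple(key)


print(key_without_separator("ab"))
print(key_without_separator("aba"))
print(key_without_separator("a-b"))
```

Where the shorter string runs out of level-1 weights, its key continues with a
level-2 weight, which is smaller than any level-1 weight, so the shorter string
still sorts first without an explicit separator.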

I should rather have used the values already assigned in the current version of
DUCET. Note, however, that these values are arbitrary; only their relative order
matters.

The Unicode collation algorithm is also a bit more complex, because it allows
tailoring the order in which the weights of a given level are written to the
output collation key (forward or backward; backward is used notably for French
accents at level 2), and because it treats not only single characters but also
groups of Unicode characters as single collation units (for example digraphs
that many languages sort as if they were one letter, such as in Spanish, or the
recently discussed "gb" in Yoruba).
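
Both points can be sketched in a few lines of Python. This is only an
illustration under assumptions: the UNITS table and its weights are invented,
the 0 terminator from the quoted example is reused instead of the real key
format, and "ch" stands in for a digraph sorted as one letter after "c" (as in
traditional Spanish ordering).

```python
# Hypothetical toy table: (primary, secondary) weight per collation unit.
# "ch" is treated as a single unit that sorts between "c" and "d".
UNITS = {
    "c": (3, 1), "ch": (4, 1), "d": (5, 1),
    "e": (6, 1), "é": (6, 2), "h": (9, 1),
    "o": (16, 1), "ô": (16, 2), "t": (21, 1),
}


def split_units(text):
    """Greedy longest match, so "ch" is one unit rather than "c" + "h"."""
    longest = max(len(u) for u in UNITS)
    units, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + size] in UNITS:
                units.append(text[i:i + size])
                i += size
                break
        else:
            raise KeyError(text[i])  # no unit in the toy table matches
    return units


def sort_key(text, backward_secondary=False):
    """Primary weights, 0, secondary weights (optionally reversed), 0."""
    weights = [UNITS[u] for u in split_units(text)]
    primary = [w[0] for w in weights]
    secondary = [w[1] for w in weights]
    if backward_secondary:
        secondary.reverse()  # last accent difference becomes most significant
    return tuple(primary + [0] + secondary + [0])


french = ["côté", "coté", "côte", "cote"]
print(sorted(french, key=sort_key))
print(sorted(french, key=lambda w: sort_key(w, backward_secondary=True)))
print(sorted(["do", "cho", "co"], key=sort_key))
```

With the secondary level reversed, an accent near the end of the word outweighs
one near the start, which is what distinguishes the French ordering of "côte"
and "coté" from the forward ordering.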
