Jonathan Kaye wrote:

The problem seems clear: the addition of numbers changes the sensitivity of
the sort. If there are no numbers then the characters O, Ô and Õ are
distinct and sorted in the order you gave. If you add a following number
they all merge and the sort is then based on sorting the following number.
To summarise: aO5 aÕ2 aÔ4 will give a bad sort. If the numbers are the same
as in aO5 aÕ5 aÔ5 then the sort is good BUT if anything follows the "5" in
the previous example then the sort is bad so aO5z aÕ5a aÔ5g is sorted based
on the FINAL character. The sort comes out as aÕ5a, aÔ5g, aO5z.

But in the Unicode Collation Algorithm diacritics are only used for secondary level sorting. See again http://unicode.org/reports/tr10 .

So the strings aÕ5a, aÔ5g, and aO5z in the Unicode Collation Algorithm are first sorted as if each word had no diacritics, which is why the final letters take precedence over the diacritics. Only when two forms are identical, save for diacritics, will diacritics be considered. That is why O, Ô, and Õ sort properly when alone: they are forms distinguished only by diacritics.

That's what people normally want in sorts, as you can see by looking at dictionaries for languages which contain numerous diacritics.

In French dictionaries enculé comes before enculer, which is what users expect, and that is what happens now in OpenOffice.org. The diacritics are ignored except when two forms are identical except for diacritics. In such cases the diacritics are counted, but only in respect to such forms. Whether digits occur in the forms doesn't matter one way or the other.

Your forms are being sorted properly, if you recognize that diacritics are a secondary element in the sort, only taken into account for forms that are identical save for diacritics.

Diacritics are logically considered as additions to the base letters to be added for collation purposes after the base letters. In effect aÕ5a, aÔ5g, and aO5z are being sorted as though the forms were aO5a_~__, aO5g_^__, and aO5z____ with "_" representing blanks. Letters take precedence in words over diacritics because that's what people normally want. And diacritics are only applied in collation after all the base letters in the form have been considered.
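To make the padded forms above concrete, here is a rough Python sketch of a two-level sort key. It is only an illustration of the idea, not the actual Unicode Collation Algorithm and not what OpenOffice.org does internally. The function name sketch_key is made up for this example, and comparing base letters and combining marks by their code values is a stand-in assumption for the real collation weights.

```python
import unicodedata

def sketch_key(text):
    """Hypothetical two-level key: base characters first, diacritics second."""
    primary = []    # base characters with case folded away
    secondary = []  # for each base character, the combining marks attached to it
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # A combining mark belongs to the preceding base character.
            if secondary:
                secondary[-1] += (ch,)
        else:
            primary.append(ch.casefold())
            secondary.append(())
    # Tuples compare left to right, so the secondary level is consulted
    # only when the primary levels are identical.
    return (tuple(primary), tuple(secondary))

print(sorted(["aO5z", "aÕ5a", "aÔ5g"], key=sketch_key))
# ['aÕ5a', 'aÔ5g', 'aO5z']
print(sorted(["Õ", "O", "Ô"], key=sketch_key))
# ['O', 'Ô', 'Õ']
```

The first call reproduces the order reported in the original message, because the final letters a, g, and z are compared before any diacritic is looked at. The second call shows the diacritics deciding the order once the base letters are all the same.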

Note that casing is a third-level criterion, so you can turn case sensitivity on and mostly get the same results. The only change is that, for forms identical in both base letters and diacritics, the first case difference between them determines which form collates first.
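A similar rough sketch shows case acting as a third level. Again this is an illustration under assumptions, not the real algorithm: three_level_key is a made-up name, code values stand in for real collation weights, and putting lowercase before uppercase is an assumption of this sketch (implementations may allow the opposite).

```python
import unicodedata

def three_level_key(text):
    """Hypothetical key: base letters, then diacritics, then case."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Attach the mark to the preceding base character.
            if secondary:
                secondary[-1] += (ch,)
        else:
            primary.append(ch.casefold())
            secondary.append(())
            # Assumption: lowercase (0) collates before uppercase (1).
            tertiary.append(1 if ch.isupper() else 0)
    return (tuple(primary), tuple(secondary), tuple(tertiary))

print(sorted(["Ao5", "ao5", "aÔ5", "aO5"], key=three_level_key))
# ['ao5', 'aO5', 'Ao5', 'aÔ5']
```

The three forms without diacritics differ only in case, so the position of the first case difference orders them. The form aÔ5 comes last regardless of case, since the diacritic level is compared before the case level.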

The innate Unicode value of the symbol is only used in generating a possible fourth level of collation, usually not employed. Modern sorting technology has largely moved past ordering by the code value of a character.

Jim Allan






