[users] Re: an arcane property of OO sorting

Jim Allan Thu, 21 Jun 2007 14:53:29 -0700

Jonathan Kaye wrote:

The problem seems clear: the addition of numbers changes the sensitivity of
the sort. If there are no numbers then the characters O, Ô and Õ are
distinct and sorted in the order you gave. If you add a following number
they all merge and the sort is then based on sorting the following number.
To summarise: aO5 aÕ2 aÔ4 will give a bad sort. If the numbers are the same
as in aO5 aÕ5 aÔ5 then the sort is good BUT if anything follows the "5" in
the previous example then the sort is bad so aO5z aÕ5a aÔ5g is sorted based
on the FINAL character. The sort comes out as aÕ5a, aÔ5g, aO5z.

But in the Unicode Collation Algorithm diacritics are only used for secondary-level sorting. See again http://unicode.org/reports/tr10 .

So the strings aÕ5a, aÔ5g, and aO5z in the Unicode Collation Algorithm are first sorted as if each word had no diacritics, which is why the final letters take precedence over the diacritics. Only when two forms are identical, save for diacritics, will diacritics be considered. That is why O, Ô, and Õ sort properly when alone: they are forms distinguished only by diacritics.

That's what people normally want in sorts, as you can see by looking at dictionaries for languages which contain numerous diacritics.

In French dictionaries enculé comes before enculer, which is what users expect, and that is what happens now in OpenOffice.org. The diacritics are ignored except when two forms are identical except for diacritics. In such cases the diacritics are counted, but only in respect to such forms. Whether digits occur in the forms doesn't matter one way or the other.

Your forms are being sorted properly, if you recognize that diacritics are a secondary element in the sort, only taken into account for forms that are identical save for diacritics.

Diacritics are logically considered as additions to the base letters to be added for collation purposes after the base letters. In effect aÕ5a, aÔ5g, and aO5z are being sorted as though the forms were aO5a_~__, aO5g_^__, and aO5z____ with "_" representing blanks. Letters take precedence in words over diacritics because that's what people normally want. And diacritics are only applied in collation after all the base letters in the form have been considered.
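
A rough way to picture this is a two-part sort key, sketched here in Python. This is only an illustration, not the actual Unicode Collation Algorithm and not what OpenOffice.org runs: the function name two_level_key is made up, and ordering diacritics by the code point of their combining marks is an assumption standing in for real collation weights.

```python
import unicodedata

def two_level_key(word):
    """Hypothetical key: base letters first, diacritics second."""
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", word)
    base = []    # primary level: the characters with diacritics stripped
    marks = []   # secondary level: marks attached to each base character
    for ch in decomposed:
        if unicodedata.combining(ch) and marks:
            marks[-1] += ch       # attach the mark to the preceding character
        else:
            base.append(ch)
            marks.append("")      # "" plays the role of the "_" blank
    # Tuples compare left to right, so marks only matter when bases are equal.
    return ("".join(base), tuple(marks))

with_tail = sorted(["aO5z", "aÕ5a", "aÔ5g"], key=two_level_key)
alone = sorted(["aÕ", "aO", "aÔ"], key=two_level_key)
print(with_tail)
print(alone)
```

With the trailing letters present, the base forms aO5a, aO5g, and aO5z already differ, so the diacritics never get consulted. Without them, all three base forms are aO and the diacritics decide the order.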

Note that casing is a third-level criterion, so you can turn case sensitivity on and mostly get the same results. The only change is that, for forms otherwise identical, the first case difference between them determines which form collates first.
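
The same idea with case added as a third level might look like the sketch below. Again this is a simplification under stated assumptions: three_level_key is a hypothetical name, the secondary level is reduced to the bare sequence of combining marks, and placing lowercase before uppercase is a choice made for this example.

```python
import unicodedata

def three_level_key(word):
    """Hypothetical key: letters, then diacritics, then case."""
    decomposed = unicodedata.normalize("NFD", word)
    letters = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    marks = "".join(ch for ch in decomposed if unicodedata.combining(ch))
    primary = letters.lower()                         # ignores case and marks
    secondary = marks                                 # diacritics only
    tertiary = tuple(ch.isupper() for ch in letters)  # False (lower) sorts first
    return (primary, secondary, tertiary)

by_case = sorted(["Ab", "ab", "aa"], key=three_level_key)
mixed = sorted(["Ôb", "ob", "Ob", "ôb"], key=three_level_key)
print(by_case)
print(mixed)
```

Case only breaks the tie between ab and Ab, and between forms carrying the same diacritics. It never moves a form past one whose letters or diacritics differ.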

The innate Unicode value of the symbol is only used in generating a possible fourth level of collation, usually not employed. Modern sorting technology has moved past considering the character value of a character at all.


Jim Allan






