[users] Re: an arcane property of OO sorting

Jonathan Kaye Thu, 21 Jun 2007 23:02:16 -0700

Jim Allan wrote:

<snip>
> But in the Unicode Collation Algorithm diacritics are only used for
> secondary level sorting. See again http://unicode.org/reports/tr10 .
> 
> So the strings aÕ5a, aÔ5g, and aO5z in the Unicode Collation Algorithm
> are first sorted as if each word had no diacritics, which is why the
> final letters take precedence over the diacritics. Only when two forms
> are identical, save for diacritics, will diacritics be considered. That
> is why O, Ô, Õ, and sort properly when alone, because they are forms
> distinguished only by diacritics.
> 
> That's what people normally want in sorts, as you can see by looking at
> dictionaries for languages which contain numerous diacritics.
> 
> In French dictionaries enculé comes before enculer, which is what users
> expect, and that is what happens now in OpenOffice.org. The diacritics
> are ignored except when two forms are identical except for diacritics.
> In such cases the diacritics are counted, but only in respect to such
> forms. Whether digits occur in the forms doesn't matter one way or the
> other.
> 
> Your forms are being sorted properly, if you recognize that diacritics
> are a secondary element in the sort, only taken into account for forms
> that are identical save for diacritics.
> 
> Diacritics are logically considered as additions to the base letters to
> be added for collation purposes after the base letters. In effect aÕ5a,
> aÔ5g, and aO5z are being sorted as though the forms were aO5a_~__,
> aO5g_^__, and aO5z____ with "_" representing blanks. Letters take
> precedence in words over diacritics because that's what people normally
> want. And diacritics are only applied in collation after all the base
> letters in the form have been considered.
> 
> Note that casing is a third-level criterion, so you can set casing on
> and mostly get the same results, save that now, but only in forms
> otherwise identical, the first case difference between the two forms
> will determine which one collates first.
> 
> The innate Unicode value of the symbol is only used in generating a
> possible fourth level of collation, usually not employed. Modern sorting
> technology has moved past considering the character value of a character
> at all.
> 
> Jim Allan
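
The two-level ordering described in the quoted text can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the real Unicode Collation Algorithm takes its weights from a collation table, whereas this sketch compares code points within each level, and the helper name two_level_key is made up for the example.

```python
import unicodedata

def two_level_key(s):
    """Sort key: base characters first, diacritics only as a tie-breaker."""
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", s)
    # Primary level: the string with every combining mark removed.
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Secondary level: the combining marks alone, in order of appearance.
    marks = "".join(c for c in decomposed if unicodedata.combining(c))
    return (base, marks)

words = ["aO5z", "aÔ5g", "aÕ5a"]
by_levels = sorted(words, key=two_level_key)
# The final letters a, g, z decide the order; the diacritics never get a say.

alone = sorted(["Õ", "O", "Ô"], key=two_level_key)
# Here the base letters are identical, so the diacritics decide.
```

With this key the three strings come out as aÕ5a, aÔ5g, aO5z, while the bare letters sort as O, Ô, Õ, matching the behaviour described above.
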
Ok Jim,
Sorry for being thick. I suspected this might be a "feature" rather than a
bug. As I'm working on Namibian languages (Nama and Khoekhoegowab,
specifically) this is going to be a problem. "Phonemes" in these languages
are often expressed by digraphs or even trigraphs. They need to be encoded
into a special sort field which respects their identity. For example "kh"
is not the same as k+h (just like English sh is not the same as s+h) and
the Namibians want it to occupy a special place in the collating sequence.
I have encoded all such cases as a single character and quickly run out of
normal characters. Way back in the days when this project was started I
used qsort to do the sorting. It worked on the 256 (minus the reserved codes)
8-bit character codes and gave me a sort in strict numeric order regardless of
the "semantics" of the code. So ö, code f6, was just a number, f6, and bore
no special relation to "o", which is 6f. When you consider that Namibian
languages have tones (up to 4 level plus more contour ones) and these come
out as accents in the final written form, you can start to appreciate the
scope of the problem. What's more, tones are not taken into consideration
for sorting unless they are the sole means of distinguishing two otherwise
identical forms. My coding strips off the tones (represented by numbers and
normally following the vowel they sit on) and puts them at the end of the
recoded string used for sorting.
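
A minimal Python sketch of this kind of recoding, for illustration only. The collating sequence in ORDER, the sample strings, and the name recode are all made up for the example and are not the actual project's tables; the only points taken from the description above are that a digraph such as "kh" becomes one sorting unit and that tone digits are moved to the end of the recoded string.

```python
# Hypothetical collating sequence: each entry is one sorting unit, listed in
# the order the units should collate. "kh" is its own unit, placed after "k".
ORDER = ["a", "e", "g", "h", "i", "k", "kh", "m", "n", "o", "s", "t", "u"]

# One single-character code per unit. Codes start at "A", so every unit code
# is numerically greater than the tone digits "0" to "9".
CODE = {unit: chr(ord("A") + i) for i, unit in enumerate(ORDER)}

# Try longer units first so "kh" is matched before "k".
UNITS_LONGEST_FIRST = sorted(ORDER, key=len, reverse=True)

def recode(word):
    """Return the sort field: unit codes first, tone digits moved to the end."""
    body, tones = [], []
    i = 0
    while i < len(word):
        if word[i].isdigit():          # tone number following a vowel
            tones.append(word[i])
            i += 1
            continue
        for unit in UNITS_LONGEST_FIRST:
            if word.startswith(unit, i):
                body.append(CODE[unit])
                i += len(unit)
                break
        else:
            raise ValueError(f"no code for {word[i]!r} in {word!r}")
    return "".join(body) + "".join(tones)

# Made-up strings, not real words: letters plus a tone digit.
samples = ["khu1", "ku2", "ka2", "kha1", "ka1"]
ordered = sorted(samples, key=recode)
```

Sorting on the recoded field places every "kh" form after all the "k" forms, and the tone digit only matters between forms that are otherwise identical, such as ka1 and ka2.
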
So you are quite correct in saying that normally accented characters should
be treated this way but in my case this is a disaster. My question would
then be: is there a way of turning off this feature and having my codes
sorted purely in terms of their numeric code values?
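
For reference, the strict numeric ordering meant here is what a plain byte-wise comparison gives. The Python sketch below shows that ordering outside OpenOffice.org; it is not an OpenOffice.org setting, and both the helper name byte_order_key and the choice of Latin-1 as the 8-bit encoding are assumptions made for the example.

```python
def byte_order_key(s, encoding="latin-1"):
    """Sort key: the raw bytes of the string, compared purely as numbers."""
    # Python compares bytes objects value by value, with no notion of
    # letters, accents, or case.
    return s.encode(encoding)

words = ["oz", "öa", "ob", "pa"]
ordered = sorted(words, key=byte_order_key)
# "ö" encodes to the single byte f6, so "öa" lands after "pa" (70 61):
# it has no special relation to "o" (6f).
```
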
Thanks for your patience and sorry about being thick in seeing what you are
talking about. If you have any suggestions maybe we could continue this thread
offlist between the two of us as it's getting rather technical and probably
doesn't interest the typical OO user.
Cheers,
Jonathan
-- 
Registered Linux user #445917 at http://counter.li.org/

