RE: UCA and Russian letter Ё

Whistler, Ken Wed, 26 Dec 2012 11:25:46 -0800

Leo asked:

> My question was narrower: assuming that the strings being compared are
> words, could it be supported without any markup?


... where "it" refers to conditional weighting based on the (identified) word 
boundary. And the answer to that is no, unless the word boundary is explicitly 
indicated with some kind of markup character. The sequence of that markup 
character plus the target character of interest (in this case Russian Yo) would 
then need a tailored contraction in the weight table, one that weights it 
differently from any Russian Yo not in that particular contraction sequence.
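
To make the contraction idea concrete, here is a minimal toy sketch in Python. 
It is not UCA, DUCET, or any CLDR tailoring: the marker character "|", the 
weight values, and the names TOY_PRIMARY and toy_primary_key are all 
assumptions made up for illustration. It only shows how a marker-plus-Yo 
sequence can be given its own entry in a weight table.

```python
# Toy model only: the weights and the marker are made up for illustration.
MARK = "|"        # hypothetical word-boundary markup character
E = "\u0435"      # CYRILLIC SMALL LETTER IE
YO = "\u0451"     # CYRILLIC SMALL LETTER IO (Yo)

TOY_PRIMARY = {
    E: 60,
    YO: 60,          # plain Yo: same primary weight as Ie in this toy table
    MARK + YO: 65,   # tailored contraction: marker + Yo gets its own weight
    MARK: 0,         # the marker on its own is ignorable (weight 0)
}

def toy_primary_key(s):
    """Build a list of primary weights, preferring two-character contractions."""
    key = []
    i = 0
    while i < len(s):
        pair = s[i:i + 2]
        if len(pair) == 2 and pair in TOY_PRIMARY:
            # Longest match first: the contraction wins over its parts.
            key.append(TOY_PRIMARY[pair])
            i += 2
            continue
        # Characters outside the toy table get an arbitrary fallback weight.
        weight = TOY_PRIMARY.get(s[i], 1000 + ord(s[i]))
        if weight:  # skip ignorable (zero-weight) characters
            key.append(weight)
        i += 1
    return key
```

In this sketch a Yo preceded by the marker is weighted differently from any 
other Yo, and the segmentation decision (where to put the marker) is made 
entirely outside the weighting function.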

> (NB that the "backward accents" feature is also, strictly speaking, 
> word-based.)

A correction here. The backwards accents feature in UCA is *not* word-based. As 
for any other string being compared via the UCA mechanism, weights are simply 
assigned to *all* characters in the string. The difference for weighting when 
using the backwards accents feature is that secondary weight significance in 
comparison is calculated from the end of the string, instead of the start of 
the string. This works when comparing single words, but it is applied 
indifferently to entire strings. And it gets the correct results, by the way. 
Work it out: you take two strings containing entire phrases in French, which 
only differ by accents on some word in the middle of the string. The only 
difference in weights assigned will be for the secondary weights for those 
accents, and if you use the backwards accents feature they will be calculated 
from the end of the string.
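
A small Python sketch of that "work it out" exercise follows. It is a toy 
model, not an implementation of UCA: the accent weights are invented, only 
two levels are modeled, and the names ACCENT_WEIGHT and toy_key are 
assumptions for illustration. The point it shows is that the secondary 
weights are reversed for the whole string, with no notion of words.

```python
import unicodedata

# Invented secondary weights for a few combining accents (not DUCET values).
ACCENT_WEIGHT = {"\u0301": 2, "\u0300": 3, "\u0302": 4}  # acute, grave, circumflex
NO_ACCENT = 1

def toy_key(s, backwards=False):
    """Two-level toy sort key: base letters first, then accent weights."""
    primaries, secondaries = [], []
    for ch in unicodedata.normalize("NFD", s):
        if ch in ACCENT_WEIGHT and secondaries:
            # An accent changes the secondary weight of the preceding base.
            secondaries[-1] = ACCENT_WEIGHT[ch]
        else:
            # Every other character, spaces included, is just a base character.
            primaries.append(ch.lower())
            secondaries.append(NO_ACCENT)
    if backwards:
        # Reverse the secondary level over the entire string.
        secondaries.reverse()
    return (primaries, secondaries)

words = ["côté", "coté", "côte", "cote"]
forward_order = sorted(words, key=toy_key)
backward_order = sorted(words, key=lambda w: toy_key(w, backwards=True))

phrases = ["le coté nord", "le côte nord"]
phrase_forward = sorted(phrases, key=toy_key)
phrase_backward = sorted(phrases, key=lambda p: toy_key(p, backwards=True))
```

In this toy model the single words sort as cote, coté, côte, côté going 
forwards and as cote, côte, coté, côté going backwards, and the two phrases 
swap order in the same way even though the differing word sits in the middle 
of the string.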

Once again, let me emphasize: the UCA algorithm per se simply has no concept at 
all of word boundaries. It applies strictly and only to string input, which 
could contain *anything*.

> In other words, after adoption, LDML became prescriptive in the sense
> "don't even think of inventing any sorting rules that cannot be
> described by LDML as it stands; we're not going to augment it". The
> Quebecois were very lucky, then.

No, I think that is an incorrect characterization of the situation for LDML. It 
can be, and at times has been, augmented for new parameterizations which make 
sense. Those changes, however, have to make sense within the overall context of 
the way the multilevel weighting and string comparison algorithm works. The 
basic issue here is that because UCA is a string weighting and comparison 
algorithm, but does *not* have any kind of text segmentation logic built in 
(whether to identify words, syllables, or any other language-specific segment), 
it simply does not make sense to expect LDML to be augmented to describe 
collation behavior that depends on conditional behavior at segmentation 
boundaries. That is simply outside the scope of UCA and LDML. It isn't outside 
the scope of the bigger issue of sorting and collation behavior in general, of 
course -- it is just outside the scope of what UCA addresses.

Incidentally, for the record, backwards weighting of accents for French doesn't 
have anything particular to do with Quebecois. It is a feature of *some* 
influential French dictionary lexicographical ordering traditions -- in France 
-- and not in others.

--Ken
