RE: UCA and Russian letter Ё

2012-12-26 Thread Whistler, Ken
The UCA algorithm itself has no opinion on this issue. It is simply a 
specification of *how* to compare strings at multiple levels, given a 
multi-level collation weight table.

The UCA *does* have a default behavior, of course, based on the DUCET table. 
And the DUCET table puts all Unicode characters in *some* order, so there is a 
default answer for Russian Ye and Yo, as there is for everything else. The 
current default answer for UCA 6.2 (omitting the unnecessary fourth-level 
weights) is:

0435 ; [.19D9.0020.0002] # CYRILLIC SMALL LETTER IE
0450 ; [.19D9.0020.0002][..0035.0002] # CYRILLIC SMALL LETTER IE WITH GRAVE
0451 ; [.19D9.0020.0002][..0047.0002] # CYRILLIC SMALL LETTER IO

So by default, DUCET weights Ye with grave as a secondary difference from Ye, 
and also weights Yo as a secondary difference from Ye. (The secondary weights 
can be seen in the second collation elements for those letters, the 0035 and 
0047 weights, respectively.)
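
To make the multi-level comparison concrete, here is a minimal Python sketch that builds sort keys from just the three entries quoted above. It is not a UCA implementation: the table name, the function name, and the reading of the empty primary field in the second collation elements as 0 (ignorable) are assumptions made for illustration.

```python
# Collation elements copied from the excerpt above as (primary, secondary, tertiary).
# Assumption: the empty primary field in "[..0035.0002]" means 0, i.e. ignorable.
DUCET_EXCERPT = {
    "\u0435": [(0x19D9, 0x0020, 0x0002)],                            # IE
    "\u0450": [(0x19D9, 0x0020, 0x0002), (0x0000, 0x0035, 0x0002)],  # IE WITH GRAVE
    "\u0451": [(0x19D9, 0x0020, 0x0002), (0x0000, 0x0047, 0x0002)],  # IO
}


def sort_key(text, table=DUCET_EXCERPT):
    """Build a three-level sort key: all primaries, then secondaries, then tertiaries."""
    elements = [ce for ch in text for ce in table[ch]]
    key = []
    for level in range(3):
        if level:
            key.append(0)  # level separator, lower than any real weight
        # Zero weights are ignorable at that level and are left out of the key.
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return tuple(key)


ie, ie_grave, io = "\u0435", "\u0450", "\u0451"
# Same primary weight, so the three letters differ only at the secondary level.
print(sort_key(ie) < sort_key(ie_grave) < sort_key(io))
```

Because the keys compare all primary weights before any secondary weight, Ye and Yo only separate when two strings are otherwise identical at the primary level.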

Those weights would be applied to *all* instances of Ye and Yo in a string, 
because there is no concept in the algorithm of conditional weighting in 
particular positions in a word.

But it is important to note also that those weights are just defaults, and the 
concept here is that they are set up to be defaults for the Cyrillic script as 
a whole, and not as defaults for Russian language data in particular. The 
defaults were chosen so that any particular language written with the Cyrillic 
script (including Russian) doesn't get *too* screwed up if strings in it are 
sorted using the default table, but the default is not intended to be fully 
correct for *any* particular language, including Russian. Instead, that is what 
tailoring (using LDML or some other mechanism) is aimed at.

So I would say that UCA per se is not meant to solve the issue of how to 
collate Russian Ye and Yo. It is meant to provide a mechanism for tailoring 
weights for characters to provide appropriate collation orders for particular 
languages.

However, where a language requires collation rules that depend on boundary 
conditions, the algorithm by itself cannot handle them. But marking up the text 
to *indicate* boundaries explicitly, and then tailoring the weights of the 
characters used for that markup, can result in strings which then *could* be 
compared using UCA to provide the expected results. That kind of markup could 
be done by a preprocessing step, which could, for example, process for word or 
syllabic boundaries (according to particular language and orthographic rules) 
and then pass the marked-up text to the string comparison step.
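
As a rough illustration of such a preprocessing step, the following Python sketch inserts a marker character at the start of every word before the text is handed to a collator. The marker (the private-use code point U+E000), the function name, and the use of the regex class `\w+` as a stand-in for real word segmentation are all assumptions; a real implementation would follow the segmentation rules of the language in question.

```python
import re

# Hypothetical boundary marker: a private-use code point chosen for this sketch.
WORD_START = "\uE000"


def mark_word_starts(text, marker=WORD_START):
    """Prefix every run of word characters with the marker.

    Using \\w+ as the definition of a word is a simplification for illustration.
    """
    return re.sub(r"\w+", lambda m: marker + m.group(0), text)


marked = mark_word_starts("ёлка и ёж")
print([hex(ord(ch)) for ch in marked[:2]])
```

The marked-up string is then what gets compared, using a table in which the marker (and sequences beginning with it) have tailored weights.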

But in any case, it isn't the job of UCA to arbitrate what the correct or 
expected result for comparison in a particular language is.

--Ken


 A basic question: does the UCA algorithm consider the Russian Ye and the
 Russian Yo as equal with regard to sort order? Or is it not meant to solve
 that issue?
 
 Leif Halvard Silli





RE: UCA and Russian letter Ё

2012-12-26 Thread Whistler, Ken
Leo asked:

 My question was narrower: assuming that the strings being compared are
 words, could it be supported without any markup?

... where "it" refers to conditional weighting based on the (identified) word 
boundary. And the answer to that is no, unless the word boundary was explicitly 
indicated with some kind of a markup character, and then the sequence of that 
markup character plus the target character of interest (in this case Russian 
Yo) was given a tailored contraction in the weight table which weighted it 
differently from any Russian Yo not in that particular contraction sequence.
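
A small Python sketch of that idea follows. Everything in it is hypothetical: the marker character, the two-level toy weights, and the choice to give the marker-plus-Yo contraction its own primary weight are illustrative assumptions, not a statement of what a correct Russian tailoring would be.

```python
MARK = "\uE000"  # hypothetical word-start marker inserted by a preprocessing step

# Toy table of (primary, secondary) collation elements; all weights are invented.
TOY_TABLE = {
    MARK: [],                # the marker on its own is completely ignorable
    "е": [(10, 1)],
    "ё": [(10, 2)],          # elsewhere: only a secondary difference from е
    MARK + "ё": [(11, 1)],   # contraction: marker + ё gets its own primary weight
    "ж": [(12, 1)],
    "л": [(13, 1)],
}


def collation_elements(text, table=TOY_TABLE):
    """Map text to collation elements, preferring the longest matching table key."""
    longest = max(len(key) for key in table)
    elements, i = [], 0
    while i < len(text):
        for n in range(longest, 0, -1):
            chunk = text[i:i + n]
            if len(chunk) == n and chunk in table:
                elements.extend(table[chunk])
                i += n
                break
        else:
            raise KeyError(f"no weights for {text[i]!r}")
    return elements


def sort_key(text, table=TOY_TABLE):
    """Two-level key: all primary weights, a 0 separator, then all secondary weights."""
    elements = collation_elements(text, table)
    return (tuple(p for p, _ in elements) + (0,) + tuple(s for _, s in elements))


print(sort_key(MARK + "ёж"), sort_key("лё"))
```

Note that the algorithm still knows nothing about words here: it only sees a two-character sequence that happens to have an entry in the table.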

 (NB that the backward accents feature is also, strictly speaking, 
 word-based.)

A correction here. The backwards accents feature in UCA is *not* word-based. As 
for any other string being compared via the UCA mechanism, weights are simply 
assigned to *all* characters in the string. The difference for weighting when 
using the backwards accents feature is that secondary weight significance in 
comparison is calculated from the end of the string, instead of the start of 
the string. This works when comparing single words, but it is applied 
indifferently to entire strings. And it gets the correct results, by the way. 
Work it out: you take two strings containing entire phrases in French, which 
only differ by accents on some word in the middle of the string. The only 
difference in weights assigned will be for the secondary weights for those 
accents, and if you use the backwards accents feature they will be calculated 
from the end of the string.
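
That can be worked out in a few lines of Python. The sketch below uses invented two-level weights (primary from the base letter, secondary from the accent) and is only meant to show the effect of reversing the secondary weights over the whole string; the function names and weight values are assumptions.

```python
# Toy decomposition of the two accented letters used in the example.
BASE = {"é": "e", "ô": "o"}
ACCENT = {"é": 2, "ô": 3}  # invented secondary weights; 1 means "no accent"


def element(ch):
    """Return an invented (primary, secondary) pair for a space or a lowercase letter."""
    base = BASE.get(ch, ch)
    primary = 1 if ch == " " else ord(base) - ord("a") + 2
    return primary, ACCENT.get(ch, 1)


def sort_key(text, backwards=False):
    """Two-level key; with backwards=True the secondary weights of the
    entire string are reversed. No word boundaries are consulted."""
    elements = [element(ch) for ch in text]
    secondary = [s for _, s in elements]
    if backwards:
        secondary.reverse()
    return tuple(p for p, _ in elements) + (0,) + tuple(secondary)


words = ["coté", "côté", "cote", "côte"]
phrases = [f"la {w} ici" for w in words]

print(sorted(words, key=sort_key))
print(sorted(words, key=lambda s: sort_key(s, backwards=True)))
print(sorted(phrases, key=lambda s: sort_key(s, backwards=True)))
```

The phrases come out in the same relative order as the bare words, even though the reversal was applied to the whole string rather than to the word carrying the accents.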

Once again, let me emphasize: the UCA algorithm per se simply has no concept at 
all of word boundaries. It applies strictly and only to string input, which 
could contain *anything*.

 In other words, after adoption, LDML became prescriptive in the sense
 don't even think of inventing any sorting rules that cannot be
 described by LDML as it stands; we're not going to augment it. The
 Quebecois were very lucky, then.

No, I think that is an incorrect characterization of the situation for LDML. It 
can be, and at times has been, augmented with new parameterizations which make 
sense. Those changes, however, have to make sense within the overall context of 
the way the multilevel weighting and string comparison algorithm works. The 
basic issue here is that because UCA is a string weighting and comparison 
algorithm, but does *not* have any kind of text segmentation logic built in 
(whether to identify words, syllables, or any other language-specific segment), 
it simply does not make sense to expect LDML to be augmented to describe 
collation behavior that depends on conditional behavior at segmentation 
boundaries. That is simply outside the scope of UCA and LDML. It isn't outside 
the scope of the bigger issue of sorting and collation behavior in general, of 
course -- it is just outside the scope of what UCA addresses.

Incidentally, for the record, backwards weighting of accents for French doesn't 
have anything particular to do with Quebecois. It is a feature of *some* 
influential French dictionary lexicographical ordering traditions -- in France 
-- and not in others.

--Ken