The UCA itself has no "opinion" on this issue. It is simply a 
specification of *how* to compare strings at multiple levels, given a 
multi-level collation weight table.

The UCA *does* have a default behavior, of course, based on the DUCET table. 
And the DUCET table puts all Unicode characters in *some* order, so there is a 
default answer for Russian Ye and Yo, as there is for everything else. The 
current default answer for UCA 6.2 (omitting the unnecessary 4th level 
weights) is:

0435 ; [.19D9.0020.0002] # CYRILLIC SMALL LETTER IE
0450 ; [.19D9.0020.0002][.0000.0035.0002] # CYRILLIC SMALL LETTER IE WITH GRAVE
0451 ; [.19D9.0020.0002][.0000.0047.0002] # CYRILLIC SMALL LETTER IO

So by default, DUCET weights Ye with grave as a secondary difference from Ye, 
and also weights Yo as a secondary difference from Ye. (The secondary weights 
can be seen in the second collation elements for those letters, the 0035 and 
0047 weights, respectively.)
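
To make the level-by-level comparison concrete, here is a minimal sketch in Python. It is not a UCA implementation: the table holds only the three entries quoted above, and the names TABLE and sort_key are made up for illustration. It only shows how the primary weights tie while the secondary weights decide the order.

```python
# Collation elements copied from the three DUCET 6.2 lines quoted above,
# as (primary, secondary, tertiary). The 4th level is left out.
TABLE = {
    "\u0435": [(0x19D9, 0x0020, 0x0002)],                            # IE
    "\u0450": [(0x19D9, 0x0020, 0x0002), (0x0000, 0x0035, 0x0002)],  # IE WITH GRAVE
    "\u0451": [(0x19D9, 0x0020, 0x0002), (0x0000, 0x0047, 0x0002)],  # IO
}

def sort_key(text, levels=3, table=TABLE):
    """Build a simplified sort key: all non-zero primary weights, then a
    0 separator, then all non-zero secondary weights, and so on."""
    elements = [ce for ch in text for ce in table[ch]]
    key = []
    for level in range(levels):
        if level:
            key.append(0)  # level separator, lower than any real weight
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return tuple(key)

# At the primary level alone, Ye and Yo are indistinguishable.
same_primary = sort_key("\u0435", levels=1) == sort_key("\u0451", levels=1)

# With all three levels, the secondary weights break the tie.
ordered = sorted(["\u0451", "\u0435", "\u0450"], key=sort_key)
```

With these entries, Ye sorts first, then Ye with grave, then Yo, and the only thing separating them is the secondary weight in the second collation element.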

Those weights would be applied to *all* instances of Ye and Yo in a string, 
because there is no concept in the algorithm of conditional weighting in 
particular positions in a word.

But it is important to note also that those weights are just defaults, and the 
concept here is that they are set up to be defaults for the Cyrillic script as 
a whole, and not as defaults for Russian language data in particular. The 
defaults were chosen so that any particular language written with the Cyrillic 
script (including Russian) doesn't get *too* screwed up if strings in it are 
sorted using the default table, but the default is not intended to be fully 
correct for *any* particular language, including Russian. Instead, that is what 
tailoring (using LDML or some other mechanism) is aimed at.
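
As a rough illustration of what tailoring changes, the Python sketch below overrides one default entry. The weight 0x19DA and the names DEFAULT, TAILORED and primary_key are invented for the example. It makes no claim about what the right order for Russian or any other language is, and it is not LDML syntax.

```python
# Default elements for Ye and Yo as quoted from DUCET 6.2 (levels 1-3 only).
DEFAULT = {
    "\u0435": [(0x19D9, 0x0020, 0x0002)],
    "\u0451": [(0x19D9, 0x0020, 0x0002), (0x0000, 0x0047, 0x0002)],
}

# Hypothetical tailoring: give Yo a primary weight of its own, just after Ye.
# 0x19DA is an invented value; a real tailoring has to pick a weight that
# does not collide with whatever follows Ye in the full table.
TAILORED = dict(DEFAULT)
TAILORED["\u0451"] = [(0x19DA, 0x0020, 0x0002)]

def primary_key(text, table):
    """Collect the non-zero primary weights of a string, in order."""
    return tuple(p for ch in text for (p, _s, _t) in table[ch] if p != 0)

default_tie = primary_key("\u0435", DEFAULT) == primary_key("\u0451", DEFAULT)
tailored_order = primary_key("\u0435", TAILORED) < primary_key("\u0451", TAILORED)
```

Under the default table the two letters tie at the primary level. Under the tailored one Yo is a separate letter following Ye. The comparison procedure is the same in both cases, and only the weight table differs.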

So I would say that UCA per se is not meant to "solve the issue" of how to 
collate Russian Ye and Yo. It is meant to provide a mechanism for tailoring 
weights for characters to provide appropriate collation orders for particular 
languages.

However, where a language requires collation rules that depend on boundary 
conditions, the algorithm by itself cannot handle them. But marking up the 
text to *indicate* boundaries explicitly, and then tailoring the weights of 
the characters used for that markup, can produce strings which *could* then be 
compared using UCA to give the expected results. That kind of markup could be 
done by a preprocessing step, which could, for example, detect word or 
syllable boundaries (according to particular language and orthographic rules) 
and then pass the marked-up text to the string comparison step.
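
A preprocessing step of that kind might look roughly like the Python sketch below. The marker character (a private-use code point), the function name, and the very naive notion of "word start" are all assumptions for illustration. A real preprocessor would follow the boundary rules of the language in question, and the marker would still need tailored weights in the collation table.

```python
import re

# Hypothetical boundary marker: a private-use code point chosen only for
# this example. Its weights would have to be supplied by a tailoring.
WORD_START = "\uE000"

def mark_word_starts(text):
    """Insert the marker before every word-initial character.

    "Word-initial" here just means a word character that is not preceded
    by another word character, which is far cruder than real
    language-specific boundary rules.
    """
    return re.sub(r"(?<!\w)(?=\w)", WORD_START, text)

marked = mark_word_starts("\u0451\u0436 \u0438 \u0435\u043b\u044c")
```

The marked-up string is what would then be handed to the comparison step, so that a word-initial Yo and a word-medial Yo are no longer the same sequence of characters as far as the weight table is concerned.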

But in any case, it isn't the job of UCA to arbitrate what the correct or 
expected result for comparison in a particular language is.

--Ken


> A basic question: does the UCA algorithm consider the Russian Ye and the
> Russian Yo as equal with regard to sort order? Or is it not meant to solve
> that issue?
> 
> Leif Halvard Silli


