Leo Broukhis, Fri, 21 Dec 2012 08:57:11 -0800:
> On Fri, Dec 21, 2012 at 4:56 AM, Leif Halvard Silli wrote:
>> 
>> You say that the difference is primary in the beginning of a word but
>> elsewhere secondary. And yes, that orthographic dictionary that you
>> link to above, looks as you describe.
>> 
>> However, in reality, the difference is secondary - if that is the right
>> word - even as the first letter in a word. Wikipedia has the following
>> example: едок > ёж > ездит.[1] And, for instance the word ёлка could
>> also be written елка.
> 
>> [1] <http://en.wikipedia.org/wiki/Ё#Russian>
> 
> Wikipedia's example is sadly unsourced, unlike mine.

My Moscow Russian-Norwegian dictionary from 1987 and my Pocket Oxford 
Russian Dictionary from 2003 both list words beginning with Ё and Е 
under the same heading, namely under the letter Е. The Russian 
Wikipedia article on the letter Ё also says that this is how sorting 
should be done. 
<http://ru.wikipedia.org/wiki/Ё#.D0.A1.D0.BE.D1.80.D1.82.D0.B8.D1.80.D0.BE.D0.B2.D0.BA.D0.B0>
 
And the article lists xindy as one application that handles this. 
<http://en.wikipedia.org/wiki/Xindy>

>> Hence I would argue that the dictionary you linked to above considers
>> the difference to *always* be secondary. It is just that the dictionary
>> applies the sorting algorithm to a collection where the words that
>> begin with the letter Ё have been separated from the words that begin
>> with the letter Е.
> 
> Isn't that notionally the same as having the difference primary for
> the first letter?

Input from a collation expert would be welcome. However, this is how I 
think: 

Should one expect such an algorithm to write the phone book on one’s 
behalf? Or that it writes the dictionary? I think that would be an 
unrealistic expectation. E.g. a dictionary or phone book has precise 
rules for how the words are written and grouped before they are sorted.

Fact is, again, that ёлка - "in the wild" - can be written either ёлка 
or елка. So if you assume that the algorithm should only deal with ёлка, 
then you are also saying that you want the algorithm to deal with words 
that have been prepared for sorting. Thus you are talking about a well 
prepared text where ёлка is always written ёлка and not елка.

While not a definitive "proof", I may also mention that the CSS list 
module defines an enumeration style based on the Russian alphabet, in 
which the ё is excluded.

http://www.w3.org/TR/css3-lists/#lower-russian

>>> A cursory scan of the UCA doesn't reveal if that's implementable, and
>>> experiments in a fairly fresh Linux Mint yield either
>>> ель < ёлка < тель < тёлка or ель < тель < тёлка < ёлка depending on
>>> the LANG setting (en_US works better than ru_RU).
>> 
>> (Both examples consider the difference primary, but the last
>> example is incorrect, as ёлка comes after тёлка, which is wrong
>> from every angle except that of the letters' code point order in
>> Unicode.)
> 
> Right. And, ironically, the [en] collation is the correct one.

Perhaps this bug is because the Russian localizers failed to get it the 
way they wanted: Full alignment of Е and Ё? ;-) 
 
>>> Could someone tell if the UCA in its current form is able to support that?
>> 
>> Is there not a need for 3 kinds of sorting? Namely: a) Е/Ё as always
>> distinct letters, b) Е/Ё as always non-distinct letters, c) Е/Ё as
>> non-distinct letters except when used as the first letter. (Note that
>> the last variant would only yield correct results on collections of
>> words where a first-letter Ё is guaranteed to be rendered with a Ё. Thus,
>> if ёлка is written елка, then the result becomes incorrect.)
> 
> We're not talking here about *words per se* that may or may not be
> rendered with a Ё, we're talking about letter sequences with Ё as a
> given. The dictionary order shows that all word-initial Ёs go after
> all word-initial Еs, but within a word the difference is secondary.
> For a set of letter sequences using canonical spelling of words, the
> collation algorithm should give their dictionary ordering, shouldn't
> it?

I believe the English Wikipedia article is pretty "canonical" when it 
says that it can be done both ways - see the sources I pointed to above 
for examples of sorting where the status as first letter doesn't matter.
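
To make the variants (a), (b) and (c) quoted above concrete, here is a 
minimal Python sketch of three sort keys. It is only an illustration 
under my own assumptions: it is not the UCA, it is not how any locale 
or library implements collation, and the names ALPHABET, RANK, 
key_distinct, key_secondary and key_initial_primary are made up for 
this example. It handles lowercase Russian letters only and assumes 
non-empty words.

```python
# Hypothetical sketch: three ways to order Е/Ё, using plain sort keys.
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def key_distinct(word):
    """(a) Е and Ё are always distinct letters, Ё directly after Е."""
    return [RANK[ch] for ch in word.lower()]


def key_secondary(word):
    """(b) Ё differs from Е only as a tie-breaker (secondary level)."""
    w = word.lower()
    primary = [RANK[ch] for ch in w.replace("ё", "е")]
    secondary = [ch == "ё" for ch in w]  # False (е) sorts before True (ё)
    return (primary, secondary)


def key_initial_primary(word):
    """(c) Like (b), but a word-initial Ё is a primary difference."""
    w = word.lower()
    return (RANK[w[0]], key_secondary(w[1:]))


words = ["тёлка", "ездит", "ёлка", "тель", "ёж", "ель", "едок"]

for key in (key_distinct, key_secondary, key_initial_primary):
    print(key.__name__, " < ".join(sorted(words, key=key)))

# The same word in its two spellings lands in different places under (c):
print(sorted(["ель", "ёж", "ёлка"], key=key_initial_primary))
print(sorted(["ель", "ёж", "елка"], key=key_initial_primary))
```

With key (b) the list comes out as едок < ёж < ездит < ёлка < ель < 
тёлка < тель, which agrees with the Wikipedia example. With key (c) 
the spelling елка sorts before ель while ёлка sorts after ёж, which is 
the dependence on "prepared" spelling that I describe above.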

I don't know why the dictionary you pointed to 
<http://ru.wikisource.org/wiki/Орфографический_словарь_русского_языка> 
has separated the words. It could be a technical limitation of 
MediaWiki. Or it could be because those who initiated the project felt 
it made the most sense. (It does make a lot of sense to me  … he, he.)  
But that dictionary is also "peculiar" in that it lists words that 
begin with the letter "Ы". :-) It is typical to say that no words begin 
with the letter Ы. :-) But the list managed to find some … (Including one 
word that simply means "to say ы".) Neither of the dictionaries I 
mentioned above has any words under the letter Ы. Even in the above 
mentioned CSS list module’s definition, the ы is excluded.

> Re the linguistic PS: you're right, and that proves that an
> approximation to the proper collation using secondary ordering is
> preferred to an approximation using primary ordering.

Probably.
-- 
leif halvard silli

