Re: UCA and Russian letter Ё
On Wed, Dec 26, 2012 at 11:18 AM, Whistler, Ken ken.whist...@sap.com wrote: Leo asked: My question was narrower: assuming that the strings being compared are words, could it be supported without any markup? ... where it refers to conditional weighting based on the (identified) word boundary. And the answer to that is no, unless the word boundary was explicitly indicated with some kind of a markup character, and then the sequence of that markup character plus the target character of interest (in this case Russian Yo) was given a tailored contraction in the weight table which weighted it differently from any Russian Yo not in that particular contraction sequence. I see your point: if something can be trivially emulated with a markup character, there is no need to augment the algorithm (emulating backward accents with markup is possible but much more cumbersome). What characters should be used for such markup, if need be? Thanks, Leo
RE: UCA and Russian letter Ё
The UCA algorithm itself has no opinion on this issue. It is simply a specification of *how* to compare strings at multiple levels, given a multi-level collation weight table. The UCA *does* have a default behavior, of course, based on the DUCET table. And the DUCET table puts all Unicode characters in *some* order, so there is a default answer for Russian Ye and Yo, as there is for everything else. The current default answer for UCA 6.2 (abbreviating the unnecessary 4th level weights) is:

0435 ; [.19D9.0020.0002] # CYRILLIC SMALL LETTER IE
0450 ; [.19D9.0020.0002][..0035.0002] # CYRILLIC SMALL LETTER IE WITH GRAVE
0451 ; [.19D9.0020.0002][..0047.0002] # CYRILLIC SMALL LETTER IO

So by default, DUCET weights Ye with grave as a secondary difference from Ye, and also weights Yo as a secondary difference from Ye. (The secondary weights can be seen in the second collation elements for those letters, the 0035 and 0047 weights, respectively.) Those weights would be applied to *all* instances of Ye and Yo in a string, because there is no concept in the algorithm of conditional weighting in particular positions in a word. But it is important to note also that those weights are just defaults, and the concept here is that they are set up to be defaults for the Cyrillic script as a whole, and not as defaults for Russian language data in particular. The defaults were chosen so that any particular language written with the Cyrillic script (including Russian) doesn't get *too* screwed up if strings in it are sorted using the default table, but the default is not intended to be fully correct for *any* particular language, including Russian. Instead, that is what tailoring (using LDML or some other mechanism) is aimed at. So I would say that UCA per se is not meant to solve the issue of how to collate Russian Ye and Yo. It is meant to provide a mechanism for tailoring weights for characters to provide appropriate collation orders for particular languages.
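[Editorial sketch] The effect of the three DUCET entries quoted above can be illustrated with a toy sort-key builder. This is only a minimal sketch, not a UCA implementation: the table holds just those three characters, the empty primary in the second collation element is written as 0, and the names DUCET_SAMPLE and sort_key are made up for the illustration.

```python
# Collation elements copied from the UCA 6.2 lines quoted in the message:
# (primary, secondary, tertiary). An absent primary weight is written as 0.
DUCET_SAMPLE = {
    "\u0435": [(0x19D9, 0x0020, 0x0002)],                       # е  IE
    "\u0450": [(0x19D9, 0x0020, 0x0002), (0, 0x0035, 0x0002)],  # ѐ  IE WITH GRAVE
    "\u0451": [(0x19D9, 0x0020, 0x0002), (0, 0x0047, 0x0002)],  # ё  IO
}


def sort_key(text, table=DUCET_SAMPLE):
    """Return (primaries, secondaries, tertiaries) for a string.

    Only characters present in the toy table are supported.
    """
    elements = [ce for ch in text for ce in table[ch]]
    levels = []
    for level in range(3):
        # A zero weight is ignorable at that level, so it is dropped.
        levels.append(tuple(ce[level] for ce in elements if ce[level] != 0))
    return tuple(levels)


# The primary level cannot tell the three letters apart...
same_primary = sort_key("\u0435")[0] == sort_key("\u0450")[0] == sort_key("\u0451")[0]
# ...so the order is decided by the secondary weights 0035 and 0047.
ordered = sorted(["\u0451", "\u0435", "\u0450"], key=sort_key)
```

Because Python compares tuples element by element, all primaries are compared before any secondary is looked at, which mirrors the multi-level comparison described in the message. The key is built the same way for every position in the string, so nothing in it can depend on where in a word the letter occurs.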
However, in some cases, where languages require collation rules that depend on boundary conditions, the algorithm by itself cannot handle those. But appropriate markup of text to *indicate* boundaries explicitly, and then to tailor the weights of the characters used for that markup, can result in strings which then *could* be compared using UCA to provide the expected results. That kind of markup could be done by a preprocessing step, which could, for example, process for word or syllabic boundaries (according to particular language and orthographic rules) and then pass the marked-up text to the string comparison step. But in any case, it isn't the job of UCA to arbitrate what the correct or expected result for comparison in a particular language is. --Ken A basic question: does the UCA algorithm consider the Russian Ye and the Russian Yo as equal with regard to sort order? Or is it not meant to solve that issue? Leif Halvard Silli
RE: UCA and Russian letter Ё
Leo asked: My question was narrower: assuming that the strings being compared are words, could it be supported without any markup? ... where it refers to conditional weighting based on the (identified) word boundary. And the answer to that is no, unless the word boundary was explicitly indicated with some kind of a markup character, and then the sequence of that markup character plus the target character of interest (in this case Russian Yo) was given a tailored contraction in the weight table which weighted it differently from any Russian Yo not in that particular contraction sequence. (NB that the backward accents feature is also, strictly speaking, word-based.) A correction here. The backwards accents feature in UCA is *not* word-based. As for any other string being compared via the UCA mechanism, weights are simply assigned to *all* characters in the string. The difference for weighting when using the backwards accents feature is that secondary weight significance in comparison is calculated from the end of the string, instead of the start of the string. This works when comparing single words, but it is applied indifferently to entire strings. And it gets the correct results, by the way. Work it out: you take two strings containing entire phrases in French, which only differ by accents on some word in the middle of the string. The only difference in weights assigned will be for the secondary weights for those accents, and if you use the backwards accents feature they will be calculated from the end of the string. Once again, let me emphasize: the UCA algorithm per se simply has no concept at all of word boundaries. It applies strictly and only to string input, which could contain *anything*. In other words, after adoption, LDML became prescriptive in the sense don't even think of inventing any sorting rules that cannot be described by LDML as it stands; we're not going to augment it. The Quebecois were very lucky, then. 
No, I think that is an incorrect characterization of the situation for LDML. It can be, and at times has been, augmented for new parameterizations which make sense. Those changes, however, have to make sense within the overall context of the way the multilevel weighting and string comparison algorithm works. The basic issue here is that because UCA is a string weighting and comparison algorithm, but does *not* have built in any kind of text segmentation logic (whether to identify words, syllables, or any other language-specific segment), it simply does not make sense to expect LDML to be augmented to describe collation behavior that depends on conditional behavior at segmentation boundaries. That is simply outside the scope of UCA and LDML. It isn't outside the scope of the bigger issue of sorting and collation behavior in general, of course -- it is just outside the scope of what UCA addresses. Incidentally, for the record, backwards weighting of accents for French doesn't have anything particular to do with Quebecois. It is a feature of *some* influential French dictionary lexicographical ordering traditions -- in France -- and not in others. --Ken
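[Editorial sketch] The "work it out" exercise about backwards accents can be tried with a toy two-level key. This is a minimal sketch under loud assumptions: the secondary weights are invented for the illustration (they are not DUCET values), the primary weight is simply the code point of the base letter, and each base letter carries at most one accent.

```python
import unicodedata

# Hypothetical secondary weights: no accent < acute < circumflex.
SECONDARY = {"": 1, "\u0301": 2, "\u0302": 3}


def collation_elements(text):
    """Decompose to NFD and attach each accent to the preceding base letter."""
    elements = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            base, _ = elements[-1]
            elements[-1] = (base, SECONDARY[ch])
        else:
            elements.append((ch, SECONDARY[""]))
    return elements


def sort_key(text, backwards=False):
    """Two-level key; with backwards=True the secondary level is read from the end."""
    ces = collation_elements(text)
    primary = tuple(ord(base) for base, _ in ces)
    secondary = tuple(weight for _, weight in ces)
    if backwards:
        secondary = secondary[::-1]
    return (primary, secondary)


words = ["côté", "coté", "côte", "cote"]
forward = sorted(words, key=sort_key)
backward = sorted(words, key=lambda w: sort_key(w, backwards=True))

# The same reversal applied to whole phrases that differ only in the middle word.
phrases = ["la " + w + " ici" for w in words]
backward_phrases = sorted(phrases, key=lambda p: sort_key(p, backwards=True))
```

The reversal is applied to the secondary weights of the entire string, with no notion of where one word ends and the next begins, and the phrases still come out in the same relative order as the single words.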
Re: UCA and Russian letter Ё
Hello, Leo Broukhis wrote: In Russian, the difference between Е and Ё is primary at the beginning of a word as they are considered distinct letters of the alphabet, yet secondary in the middle of a word, as the dieresis over Ё is not mandatory. As an example, ель ёлка, but тёлка тель, see http://ru.wikisource.org/wiki/Орфографический_словарь_русского_языка On 2012-12-21 at 20:05, Leif Halvard Silli wrote: My Moscow Russian-Norwegian from 1987 and my Pocket Oxford Russian Dictionary from 2003 agree that both list words on Ё and Е under the same category – namely, under the letter Е. So do both “Русско-Немецкий Словар” (Moscow, 1955) and “Langenscheidts Taschenwörterbuch”, 4th ed. (Berlin 1963). Hence, I deem Leo’s example a red herring. Best wishes for a merry Xmas (or whatever) and a happy New Year, Otto
Re: UCA and Russian letter Ё
Ken, A basic question: does the UCA algorithm consider the Russian Ye and the Russian Yo as equal with regard to sort order? Or is it not meant to solve that issue? Leif Halvard Silli --- Original message --- From: Whistler, Ken ken.whist...@sap.com To: l...@mailcom.com, jkorp...@cs.tut.fi Cc: unicode@unicode.org, ken.whist...@sap.com Sent: 21/12/'12, 22:49 Leo Broukhis said: Granted, not yet, but by itself the argument is invalid. Unicode collation rules are descriptive; I'm not sure what you mean by that. UTS #10 is a *specification* of an algorithm, with various options for tailoring and parameterization which make it possible to accommodate various needs for particular cases. It is not intended as a descriptive mechanism. Perhaps you are referring to LDML, which includes a formal mechanism for describing a particular collation in terms of the default table and tailoring options and parameterization options of the UCA. if, for example, a language happens to sort accents backwards, this rule has to be - and is - accommodated despite its apparent illogicality; Backwards accent secondary weighting was actually included primarily because of prior art in collation standards, because of the need to be able to synchronize the UCA algorithm with ISO 14651, and because it makes it easier to explain how folks can implement versions of multi-level collation which can pass the conformance tests of the Canadian sorting standard, etc. along the same lines, if a language happens to make a distinction discussed in this thread, it has to be accommodated just as well. No, I don't think so. It is rather easy to come up with distinctions or collation requirements which simply cannot be accommodated within the intended bounds of the UCA. For example, sorting all numerical expressions mixed with text strictly by their numeric values, or sorting all (or some specified list) of abbreviations as if they were spelled out, and so forth.
Many lexicographical ordering rules cannot be fully accommodated within the context of the UCA algorithm, which is a multilevel *string comparison* specification, and not a dictionary ordering specification. My question is as follows: does UCA have to be modified (e.g. by adding another bit flag word-initial primary next to the existing backward secondary) to support the feature if it were to be implemented, or is there a way to achieve the new Russian online collation within the existing UCA without modifying the strings to be sorted before the application of the algorithm? I don't think there is any out-of-the-box way to use UCA so that an implementation would automatically recognize a word boundary context and weight characters conditionally based on that context. So no, I don't think you could get an implementation to do that without first marking up text with additional characters to indicate word boundaries and then tailoring the weight table to weight sequences including that markup accordingly. This is actually derived trivially from the fact that UCA knows nothing whatsoever about word boundaries. At core, it is just a mechanism to take a string input and provide an output vector of collation weights. You would have to hook it up to a text segmentation algorithm to even identify words, and then that text segmentation algorithm would itself have to be tailored and tuned to whatever language you had in mind, because the criteria for identifying words will vary from language to language, and even orthography to orthography. But there is another possible sense of the question, does UCA have to be modified... to support..., i.e. is the UTC somehow required to augment the algorithm to support some particular kind of behavior for a particular language's sorting rules, just because someone has turned up particular odd behavior. And I think the answer to that is clearly no.
Oh, and by the way, I don't think LDML must (or should) be augmented to enable it to describe any and all lexicographical ordering practices, either. That isn't the function of LDML. --Ken
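[Editorial sketch] The markup-plus-contraction approach described in this message could look roughly like the following. It is a minimal sketch, not an LDML tailoring or a real collator: the marker character (a private-use code point), the primary weights and every function name are assumptions made for the illustration, and only lowercase Russian letters are handled.

```python
import re

MARK = "\ue000"  # hypothetical boundary marker; the thread does not name one

# Toy primary weights: even numbers, leaving a gap after each letter.
ALPHABET = "абвгдежзийклмнопрстуфхцчшщъыьэюя"
PRIMARY = {ch: 2 * (i + 1) for i, ch in enumerate(ALPHABET)}


def mark_word_starts(text):
    """Preprocessing step: put MARK in front of every word-initial ё."""
    return re.sub(r"\bё", MARK + "ё", text)


def collation_elements(marked):
    """Map marked-up text to (primary, secondary) pairs."""
    elements = []
    i = 0
    while i < len(marked):
        if marked.startswith(MARK + "ё", i):
            # Tailored contraction: MARK + ё gets its own primary, right after е.
            elements.append((PRIMARY["е"] + 1, 1))
            i += 2
        elif marked[i] == "ё":
            # Any other ё: same primary as е, secondary difference only.
            elements.append((PRIMARY["е"], 2))
            i += 1
        elif marked[i] == MARK:
            i += 1  # a marker outside the contraction carries no weight
        else:
            elements.append((PRIMARY[marked[i]], 1))
            i += 1
    return elements


def sort_key(text):
    ces = collation_elements(mark_word_starts(text))
    return (tuple(p for p, _ in ces), tuple(s for _, s in ces))


result = sorted(["тель", "ёлка", "тёлка", "ель"], key=sort_key)
```

The comparison step itself stays a plain multi-level key comparison; all knowledge of word boundaries lives in mark_word_starts, which is the part that would have to be tuned per language and orthography.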
Re: UCA and Russian letter Ё
My opinion is that BOTH the UCA algorithm AND the LDML formal description of collations are just best known practices to accommodate collation (i.e. dictionary ordering AND string searches AND string comparisons). But neither of them can accommodate all possible orders or weak comparison systems for all languages. Notably, they cannot directly accommodate the contextual mutation of initial or medial letters in words. But true linguistic dictionaries have ordered their entries by grouping together, in a single entry or in successive entries, all variations of a word, including these initial mutations, or derivations (like conjugated verbs, grammatical declensions, plurals, or genders). The UCA still works provided that there is a prior preprocessing step that infers (or looks up, for exceptions) another form of the words (or numbers) on which the multilevel weight-generating algorithm can then work. The UCA admits it, but LDML cannot describe these preprocessing rules with just rules used to assign weights to some groups of characters, independently of what words could mean in the intended language. So the LDML collation rules cannot be sufficient for all cases, and there must also be another formal language for describing the preprocessing rules. For now this does not exist, but nothing prevents it from appearing in the future as additional data, and LDML could be extended to describe these preprocessing transforms. But some steps will remain: the initial normalization; the place where letter case is assigned (or not) some higher collation weight; the place where preprocessing can perform lookups using this simplified view, based on the first steps, which perform more than just a standard Unicode normalization; and then the last steps, which output the weights level by level. Collation is a very complex concept, and it is not fully standardized for interchange in LDML, the most notable gap being the preprocessing steps.
But note that even different dictionaries for the same language will vary in how they perform this preprocessing (for example, a dictionary may include and sort the derived terms separately, using a simpler rule that does not require this preprocessing, so it will have MORE entries, even if they link their actual definition to another entry; and most dictionaries do not include separate entries for derivations like regular conjugations or plurals or declensions, unless they are VERY irregular; and most dictionaries for languages that have standard mutation rules for initial letters will NOT list separate entries for these mutations, and users know, for example, that if they cannot find a word starting with one of these mutable letters, they should look for the word starting with the unmutated letter). The same is true for languages that use agglutination: it is not possible to list all possible agglutinations, and users need to know how to recognize the morphemes. Here again another thing is not described in collation rules: the breaking rules that allow the separation of words or morphemes. But collation may not work correctly without them when agglutination also implies mutations between two agglutinated morphemes. This is clearly a limitation of LDML, but not of UCA itself. 2012/12/23 Leo Broukhis l...@mailcom.com On Fri, Dec 21, 2012 at 1:49 PM, Whistler, Ken ken.whist...@sap.com wrote: Leo Broukhis said: Granted, not yet, but by itself the argument is invalid. Unicode collation rules are descriptive; I'm not sure what you mean by that. UTS #10 is a *specification* of an algorithm, with various options for tailoring and parameterization which make it possible to accommodate various needs for particular cases. It is not intended as a descriptive mechanism.
What I meant is that the way its feature set which makes it flexible enough had been decided was descriptive, or, if you will, adaptive, following the pre-existing collation traditions in various languages and/or pre-existing standards. Perhaps you are referring to LDML, which includes a formal mechanism for describing a particular collation in terms of the default table and tailoring options and parameterization options of the UCA. You're right, LDML may be a better application of the word descriptive. if, for example, a language happens to sort accents backwards, this rule has to be - and is - accommodated despite its apparent illogicality; Backwards accent secondary weighting was actually included primarily because of prior art in collation standards, because of the need to be able to synchronize the UCA algorithm with ISO 14651, and because it makes it easier to explain how folks can implement versions of multi-level collation which can pass the conformance tests of the Canadian sorting standard, etc. along the same lines, if a language happens to make a distinction discussed in this thread, it has to be accommodated just as well. No, I don't think so.
Re: UCA and Russian letter Ё
On Fri, Dec 21, 2012 at 1:49 PM, Whistler, Ken ken.whist...@sap.com wrote: Leo Broukhis said: Granted, not yet, but by itself the argument is invalid. Unicode collation rules are descriptive; I'm not sure what you mean by that. UTS #10 is a *specification* of an algorithm, with various options for tailoring and parameterization which make it possible to accommodate various needs for particular cases. It is not intended as a descriptive mechanism. What I meant is that the way its feature set which makes it flexible enough had been decided was descriptive, or, if you will, adaptive, following the pre-existing collation traditions in various languages and/or pre-existing standards. Perhaps you are referring to LDML, which includes a formal mechanism for describing a particular collation in terms of the default table and tailoring options and parameterization options of the UCA. You're right, LDML may be a better application of the word descriptive. if, for example, a language happens to sort accents backwards, this rule has to be - and is - accommodated despite its apparent illogicality; Backwards accent secondary weighting was actually included primarily because of prior art in collation standards, because of the need to be able to synchronize the UCA algorithm with ISO 14651, and because it makes it easier to explain how folks can implement versions of multi-level collation which can pass the conformance tests of the Canadian sorting standard, etc. along the same lines, if a language happens to make a distinction discussed in this thread, it has to be accommodated just as well. No, I don't think so. My question can be construed as a hypothetical: had the described Ё-collation been a prior art in collation standards by the time of development of LDML and the UCA, how different would they have been? 
I'm hoping for an answer not at all or very little, and here's how it could have been implemented: It is rather easy to come up with distinctions or collation requirements which simply cannot be accommodated within the intended bounds of the UCA. For example, sorting all numerical expressions mixed with text strictly by their numeric values, or sorting all (or some specified list) of abbreviations as if they were spelled out, and so forth. Many lexicographical ordering rules cannot be fully accommodated within the context of the UCA algorithm, which is a multilevel *string comparison* specification, and not a dictionary ordering specification. That is true in general if rules happen to involve semantics, but we're discussing a formal rule here. Imagine that the backward accents feature was missing from LDML, e.g. because it was an emerging trend rather than a standard way of collation at the time of formalizing LDML, thus not included in LDML. Would you have said the same about it today if someone had asked about supporting it ? If not, why not? My question is as follows: does UCA have to be modified (e.g. by adding another bit flag word-initial primary next to the existing backward secondary) to support the feature if it were to be implemented, or is there a way to achieve the new Russian online collation within the existing UCA without modifying the strings to be sorted before the application of the algorithm? I don't think there is any out-of-the-box way to use UCA so that an implementation would automatically recognize a word boundary context and weight characters conditionally based on that context. So no, I don't think you could get an implementation to do that without first marking up text with additional characters to indicate word boundaries and then tailoring the weight table to weight sequences including that markup accordingly. My question was narrower: assuming that the strings being compared are words, could it be supported without any markup? 
(NB that the backward accents feature is also, strictly speaking, word-based.) But there is another possible sense of the question, does UCA have to be modified... to support..., i.e. is the UTC somehow required to augment the algorithm to support some particular kind of behavior for a particular language's sorting rules, just because someone has turned up particular odd behavior. And I think the answer to that is clearly no. Oh, and by the way, I don't think LDML must (or should) be augmented to enable it to describe any and all lexicographical ordering practices, either. That isn't the function of LDML. In other words, after adoption, LDML became prescriptive in the sense don't even think of inventing any sorting rules that cannot be described by LDML as it stands; we're not going to augment it. The Quebecois were very lucky, then. Leo
UCA and Russian letter Ё
In Russian, the difference between Е and Ё is primary at the beginning of a word as they are considered distinct letters of the alphabet, yet secondary in the middle of a word, as the dieresis over Ё is not mandatory. As an example, ель ёлка, but тёлка тель, see http://ru.wikisource.org/wiki/Орфографический_словарь_русского_языка A cursory scan of the UCA doesn't reveal if that's implementable, and experiments in a fairly fresh Linux Mint yield either ель ёлка тель тёлка or ель тель тёлка ёлка depending on the LANG setting (en_US works better than ru_RU). Could someone tell if the UCA in its current form is able to support that? Thanks, Leo
Re: UCA and Russian letter Ё
Leo Broukhis, Fri, 21 Dec 2012 01:31:18 -0800: In Russian, the difference between Е and Ё is primary at the beginning of a word as they are considered distinct letters of the alphabet, yet secondary in the middle of a word, as the dieresis over Ё is not mandatory. As an example, ель ёлка, but тёлка тель, see http://ru.wikisource.org/wiki/Орфографический_словарь_русского_языка You say that the difference is primary in the beginning of a word but elsewhere secondary. And yes, that orthographic dictionary that you link to above looks as you describe. However, in reality, the difference is secondary - if that is the right word - even as the first letter in a word. Wikipedia has the following example: едок ёж ездит.[1] And, for instance, the word ёлка could also be written елка. Hence I would argue that the dictionary you linked to above considers the difference to *always* be secondary. It is just that the dictionary applies the sorting algorithm to a collection where the words that begin with the letter Ё have been separated from the words that begin with the letter Е. A cursory scan of the UCA doesn't reveal if that's implementable, and experiments in a fairly fresh Linux Mint yield either ель ёлка тель тёлка or ель тель тёлка ёлка depending on the LANG setting (en_US works better than ru_RU). (Both examples consider the difference primary, but the last example is incorrect as the ёлка follows after the тёлка - which is incorrect from every angle (except from the angle of the number of the letter inside Unicode).) Could someone tell if the UCA in its current form is able to support that? Is there not a need for 3 kinds of sorting? Namely: a) Е/Ё as always distinct letters, b) Е/Ё as always non-distinct letters, c) Е/Ё as non-distinct letters except when used as the first letter. (Note that the last variant would only yield correct results on collections of words where a first-letter Ё is guaranteed to be rendered with a Ё.
Thus, if ёлка is written елка, then the result becomes incorrect.) Linguistic PS: From the angle of the color of the sound, Russian Ё is the light version of Russian О. (Its predecessor was also a digraph - IO.) But from the angle of stress, when the Ё loses its stress, it alternates with Russian Е (since Е can be both with and without stress, whereas Ё can only be with stress). The reason why Е/Ё is often considered a secondary difference is (I think) related to the stress: except in lexicons and dictionaries, Russian texts typically do not mark where the stress of a word is. The stress is simply known by the reader/user. [1] http://en.wikipedia.org/wiki/Ё#Russian -- leif halvard silli
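[Editorial sketch] The three kinds of sorting listed in this message, (a), (b) and (c), can be written down as three toy key functions. This is a minimal sketch that assumes single lowercase Russian words and reads "non-distinct" as "a secondary, tie-breaking difference only"; the function names are made up for the illustration.

```python
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"  # 33 letters, ё right after е
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def key_distinct(word):
    """(a) Е and Ё are always distinct letters: ё has its own primary rank."""
    return tuple(RANK[ch] for ch in word)


def key_secondary(word):
    """(b) Е and Ё are never distinct at the primary level: ё only breaks ties."""
    folded = tuple(RANK["е" if ch == "ё" else ch] for ch in word)
    return (folded, key_distinct(word))


def key_initial_distinct(word):
    """(c) Like (b), except that a word-initial ё counts as a separate letter."""
    return (RANK[word[0]],) + key_secondary(word[1:])


words = ["тель", "ёлка", "тёлка", "ель"]
order_a = sorted(words, key=key_distinct)
order_b = sorted(words, key=key_secondary)
order_c = sorted(words, key=key_initial_distinct)

# The caveat from the message: variant (c) depends on how the word is spelled.
with_yo = sorted(["ёлка", "ель", "ёж"], key=key_initial_distinct)
without_yo = sorted(["елка", "ель", "ёж"], key=key_initial_distinct)
```

Variant (a) gives ель ёлка тель тёлка, variant (b) gives ёлка ель тёлка тель (and едок ёж ездит for the Wikipedia example), and variant (c) gives ель ёлка тёлка тель. The last two lines show the caveat: spelled ёлка the word lands after ёж, spelled елка it lands before ель.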
Re: UCA and Russian letter Ё
[Philippe tells me that his message that I'm quoting could have been rejected by the mailing list as spam; my answer is below.] On Fri, Dec 21, 2012 at 5:13 AM, Philippe Verdy verd...@wanadoo.fr wrote: This is an interesting case. A solution would be to be able to define a distinct collation element for ^ë, where ^ means beginning of a word (even if there's no character encoded there). That element would be such that : e ë ^ë But this requires a prior definition of word boundaries to recognize the ^ as an additional collation element by itself (usable distinctly only in context, and ignored when it occurs anywhere else, meaning that all weights assigned to ^ alone would be null.) So ^ë would become valid as a collation element, but т^ё makes no sense if there's no possible word boundary between т and ё. This would work with the UCA algorithm, which does not really mandate what is a collation element (not only in terms of encoding as characters), or any syntax to support it. This mechanism of incorporating word boundaries in UCA would be an interesting extension for section 6.9 (Handling Collation Graphemes) of UTS#10 (but for now there's no support for it in LDML with a defined syntax allowing the insertion of boundaries or other contextual conditions). Would it also mean that using a CGJ at the beginning of a word will cause a ё at the beginning of a word to be treated as a mid-word one? Is space, CGJ a well-formed character sequence? Leo
Re: UCA and Russian letter Ё
On Fri, Dec 21, 2012 at 4:56 AM, Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no wrote: You say that the difference is primary in the beginning of a word but elsewhere secondary. And yes, that orthographic dictionary that you link to above looks as you describe. However, in reality, the difference is secondary - if that is the right word - even as the first letter in a word. Wikipedia has the following example: едок ёж ездит.[1] And, for instance, the word ёлка could also be written елка. [1] http://en.wikipedia.org/wiki/Ё#Russian Wikipedia's example is sadly unsourced, unlike mine. Hence I would argue that the dictionary you linked to above considers the difference to *always* be secondary. It is just that the dictionary applies the sorting algorithm to a collection where the words that begin with the letter Ё have been separated from the words that begin with the letter Е. Isn't that notionally the same as having the difference primary for the first letter? A cursory scan of the UCA doesn't reveal if that's implementable, and experiments in a fairly fresh Linux Mint yield either ель ёлка тель тёлка or ель тель тёлка ёлка depending on the LANG setting (en_US works better than ru_RU). (Both examples consider the difference primary, but the last example is incorrect as the ёлка follows after the тёлка - which is incorrect from every angle (except from the angle of the number of the letter inside Unicode).) Right. And, ironically, the [en] collation is the correct one. Could someone tell if the UCA in its current form is able to support that? Is there not a need for 3 kinds of sorting? Namely: a) Е/Ё as always distinct letters, b) Е/Ё as always non-distinct letters, c) Е/Ё as non-distinct letters except when used as the first letter. (Note that the last variant would only yield correct results on collections of words where a first-letter Ё is guaranteed to be rendered with a Ё. Thus, if ёлка is written елка, then the result becomes incorrect.)
We're not talking here about *words per se* that may or may not be rendered with a Ё, we're talking about letter sequences with Ё as a given. The dictionary order shows that all word-initial Ёs go after all word-initial Еs, but within a word the difference is secondary. For a set of letter sequences using canonical spelling of words, the collation algorithm should give their dictionary ordering, shouldn't it? Re the linguistic PS: you're right, and that proves that an approximation to the proper collation using secondary ordering is preferred to an approximation using primary ordering. Leo
Re: UCA and Russian letter Ё
Resending my earlier reply. Apparently, by default, Gmail sends subject lines in KOI8-R if they contain Cyrillic, and unicode.org rejects those as likely spam. I just changed my Gmail settings to Use Unicode (UTF-8) encoding for outgoing messages and hope this goes through. (*Please change the subject line* if you want to discuss *this* issue.) My earlier reply was: Theoretically, it is possible to select collation elements based on the proximity of word boundaries or other criteria. However, I don't know if there is an implementation that has that built in. ICU (one of the commonly used implementations of UCA+CLDR) does not. It sounds like the secondary difference is ok for sorting, but you are looking to customize an alphabetic index such that there is a separate bucket for words beginning with Ё. I think the best would be to do that with some custom code that looks for Ё as the first character, in addition to the regular bucketing and sorting. Best regards, markus -- Google Internationalization Engineering
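[Editorial sketch] The suggestion in this message, ordinary sorting with a secondary Е/Ё difference plus custom code that gives word-initial Ё its own index bucket, could be sketched as follows. This is not ICU code: secondary_key is a plain-Python stand-in for a real collator, it relies on the code points of а to я being in alphabetical order, and all names are made up for the illustration.

```python
from itertools import groupby


def secondary_key(word):
    """Stand-in for the regular sort: ё differs from е only as a tie-breaker."""
    folded = word.lower()
    return (folded.replace("ё", "е"), folded)


def bucket_label(word):
    """Index bucket: the first letter, so word-initial Ё forms its own bucket."""
    return word[0].upper()


def bucket_order(word):
    """Order of buckets: the Ё bucket sits directly after the Е bucket."""
    label = bucket_label(word)
    return (label.replace("Ё", "Е"), label == "Ё")


def build_index(words):
    """Return [(bucket label, sorted words in that bucket), ...]."""
    ordered = sorted(words, key=lambda w: (bucket_order(w), secondary_key(w)))
    return [(label, list(group)) for label, group in groupby(ordered, key=bucket_label)]


index = build_index(["ёлка", "ель", "тель", "тёлка", "ёж", "едок", "ездит"])
```

Inside each bucket the words are ordered by the regular key alone; the first-letter check only decides which bucket a word goes into and where that bucket sits.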
Re: UCA and Russian letter Ё
Leo Broukhis, Fri, 21 Dec 2012 08:57:11 -0800: On Fri, Dec 21, 2012 at 4:56 AM, Leif Halvard Silli wrote: You say that the difference is primary in the beginning of a word but elsewhere secondary. And yes, that orthographic dictionary that you link to above, looks as you describe. However, in reality, the difference is secondary - if that is the right word - even as the first letter in a word. Wikipedia has the following example: едок ёж ездит.[1] And, for instance the word ёлка could also be written елка. [1] http://en.wikipedia.org/wiki/Ё#Russian Wikipedia's example is sadly unsourced, unlike mine. My Moscow Russian-Norwegian from 1987 and my Pocket Oxford Russian Dictionary from 2003 agree that both list words on Ё and Е under the same category – namely, under the letter Е. Also, the Russian wikipedia article on the letter Ё says as well that this is how sorting should happen. http://ru.wikipedia.org/wiki/Ё#.D0.A1.D0.BE.D1.80.D1.82.D0.B8.D1.80.D0.BE.D0.B2.D0.BA.D0.B0 And the article list xindy as one applications that handles this. http://en.wikipedia.org/wiki/Xindy Hence I would argue that the dictionary you linked to above considers the difference to *always* be secondary. It is just that the dictionary applies the sorting algorithm to a collection where the words that begins with the letter Ё has been separated from words that begins on the letter Е. Isn't that notionally the same as having the difference primary for the first letter? Input from a coalition expert would be welcome. However, this is how I think: Should one expect such an algorithm to write the phone book on one’s behalf? Or that it writes the dictionary? I think that would be an unrealistic expectation. E.g. a dictionary or phone book has precise rules for how the words as written and grouped before they are sorted. Fact is, again, that ёлка - in the wild - can be written ёлка and елка. 
So if you assume that the algorithm should only deal with ёлка, then you are also saying that you want the algorithm to deal with words that have been prepared for sorting. Thus you are talking about a well prepared text where ёлка is always written ёлка and not елка. While not a definitive proof, I may also mention that the CSS list module defines an enumeration style based on the Russian alphabet, in which the ё is excluded. http://www.w3.org/TR/css3-lists/#lower-russian A cursory scan of the UCA doesn't reveal if that's implementable, and experiments in a fairly fresh Linux Mint yield either ель ёлка тель тёлка or ель тель тёлка ёлка depending on the LANG setting (en_US works better than ru_RU). (Both examples consider the difference primary, but the last example is incorrect, as ёлка follows after тёлка, which is incorrect from every angle except that of the letter's code point in Unicode.) Right. And, ironically, the [en] collation is the correct one. Perhaps this bug is because the Russian localizers failed to get it the way they wanted: Full alignment of Е and Ё? ;-) Could someone tell if the UCA in its current form is able to support that? Is there not a need for 3 kinds of sorting? Namely: a) Е/Ё as always distinct letters, b) Е/Ё as always non-distinct letters, c) Е/Ё as non-distinct letters except when used as the first letter. (Note that the last variant would only yield a correct result on collections of words where a first-letter Ё is guaranteed to be rendered with a Ё. Thus, if ёлка is written елка, then the result becomes incorrect.) We're not talking here about *words per se* that may or may not be rendered with a Ё, we're talking about letter sequences with Ё as a given. The dictionary order shows that all word-initial Ёs go after all word-initial Еs, but within a word the difference is secondary. 
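To make the three variants a), b) and c) concrete, here is a rough Python sketch of one sort key per variant. It is not UCA: weights are derived from code points, which only works because lowercase а to я are in alphabetical order in Unicode, and the key names are made up for this example. Variant c) assumes non-empty words whose word-initial Ё is actually spelled with Ё.

```python
def key_distinct(word):
    """(a) ё is a separate letter that sorts right after е at every position."""
    return tuple((ord("е"), 1) if ch == "ё" else (ord(ch), 0) for ch in word.lower())


def key_secondary(word):
    """(b) ё differs from е only at the secondary level."""
    lowered = word.lower()
    return (lowered.replace("ё", "е"), tuple(ch == "ё" for ch in lowered))


def key_initial_primary(word):
    """(c) primary difference for a word-initial ё, secondary difference elsewhere."""
    lowered = word.lower()
    first = (ord("е"), 1) if lowered[0] == "ё" else (ord(lowered[0]), 0)
    return (first,) + key_secondary(lowered[1:])


if __name__ == "__main__":
    words = ["ель", "ёлка", "тель", "тёлка", "ёж", "едок", "ездит"]
    for key in (key_distinct, key_secondary, key_initial_primary):
        print(key.__name__, sorted(words, key=key))
```

On this word list, a) puts тель before тёлка and all ё-initial words after all е-initial ones, b) interleaves them (едок ёж ездит), and c) separates the initial letters but still puts тёлка before тель.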
For a set of letter sequences using canonical spelling of words, the collation algorithm should give their dictionary ordering, shouldn't it? I believe the English Wikipedia article is pretty canonical when it says that it can be done both ways - see the sources I pointed to above for examples of sorting where the status as first letter doesn't matter. I don't know why the dictionary you pointed to http://ru.wikisource.org/wiki/Орфографический_словарь_русского_языка has separated the words. It could be a technical limitation of MediaWiki. Or it could be because those who initiated the project felt it made the most sense. (It does make a lot of sense to me … he, he.) But that dictionary is also peculiar in that it lists words that begin with the letter Ы. :-) It is typical to say that no words begin with the letter Ы. :-) But the list managed to find some … (Including one word that simply means to say ы.) Neither of the dictionaries I mentioned above has any words under the letter Ы. Even in the above mentioned CSS list module’s definition, the ы is excluded. Re the linguistic PS: you're right, and
Re: UCA and Russian letter Ё
2012-12-21 21:05, Leif Halvard Silli wrote: My Moscow Russian-Norwegian from 1987 and my Pocket Oxford Russian Dictionary from 2003 agree that both list words on Ё and Е under the same category – namely, under the letter Е. This appears to be the case in any serious dictionary. The use of the Cyrillic letter yo (ё, called IO in the Unicode name) has varied through ages, but it has never been a dominant spelling to use it. According to “The World’s Writing Systems”, edited by Peter T. Daniels and William Bright (Oxford University Press, 1995), “The letter ё is used virtually only in dictionaries or language textbooks.” It may have become more popular in the Internet, but still less common than using the letter ye (IE, е) in its stead. Fact is, again, that ёлка - in the wild - can be written ёлка and елка. And in most contexts, it is written “елка”. It is of course possible that some people would prefer treating “ё” as a primarily different letter. But it’s rather illogical to require that it be treated that way at the start of a word only. I don’t think collation rules need to accommodate such preferences. Yucca
RE: UCA and Russian letter Ё
Fact is, again, that ёлка - in the wild - can be written ёлка and елка Though you need a better dictionary: it's the diminutive of ель (as in Yel'tsin) meaning fir tree, and is the 4-letter word for Christmas tree. С Рождеством, Joe
Re: UCA and Russian letter Ё
On Fri, Dec 21, 2012 at 11:35 AM, Jukka K. Korpela jkorp...@cs.tut.fi wrote: 2012-12-21 21:05, Leif Halvard Silli wrote: My Moscow Russian-Norwegian from 1987 and my Pocket Oxford Russian Dictionary from 2003 agree that both list words on Ё and Е under the same category – namely, under the letter Е. This appears to be the case in any serious dictionary. You're right. In an influential orthographic dictionary the difference is secondary, e.g. ёлка is between елисейский дворец and ёлки-палки: http://lopatina-slovar.com/description/elka/34736 (The site database has been built by scanning a printed dictionary) However, the preferences could change, as electronic dictionaries seem to demonstrate. It is of course possible that some people would prefer treating “ё” as a primarily different letter. But it’s rather illogical to require that it be treated that way at the start of a word only. I don’t think collation rules need to accommodate such preferences. Granted, not yet, but by itself the argument is invalid. Unicode collation rules are descriptive; if, for example, a language happens to sort accents backwards, this rule has to be - and is - accommodated despite its apparent illogicality; along the same lines, if a language happens to make a distinction discussed in this thread, it has to be accommodated just as well. Also, In several languages the rules have changed over time, and so *older dictionaries may use a different order than modern ones* [emph. mine - LB]. Furthermore, collation may depend on use. For example, German dictionaries and telephone directories use different approaches. [http://en.wikipedia.org/wiki/Collation] The distinction in two collation methods in German (secondary vs expanded umlauts) is prominent enough to be mentioned in UCA. Luckily for Germans, both methods are covered by the algorithm thanks to requirements of other languages. My question is as follows: does UCA have to be modified (e.g. 
by adding another bit flag word-initial primary next to the existing backward secondary) to support the feature if it were to be implemented, or is there a way to achieve the new Russian online collation within the existing UCA without modifying the strings to be sorted before the application of the algorithm? Leo
Re: UCA and Russian letter Ё
Jukka K. Korpela, Fri, 21 Dec 2012 21:35:16 +0200: 2012-12-21 21:05, Leif Halvard Silli wrote: My Moscow Russian-Norwegian from 1987 and my Pocket Oxford Russian Dictionary from 2003 agree that both list words on Ё and Е under the same category – namely, under the letter Е. This appears to be the case in any serious dictionary. «Tolkovïj slovar’ sovremennogo russkogo jazïka» from 2005 («Dictionary of the contemporary Russian language») has placed words on Ё in a separate category, consisting of exactly one word: Ёмкость. That, and the dictionary Leo pointed to, tell me that there is a difference between categorization and collation. The use of the Cyrillic letter yo (ё, called IO in the Unicode name) has varied through ages, but it has never been a dominant spelling to use it. According to “The World’s Writing Systems”, edited by Peter T. Daniels and William Bright (Oxford University Press, 1995), “The letter ё is used virtually only in dictionaries or language textbooks.” It may have become more popular in the Internet, but still less common than using the letter ye (IE, е) in its stead. The internet has also really boomed since 1995. ;-) Fact is, again, that ёлка - in the wild - can be written ёлка and елка. And in most contexts, it is written “елка”. Google Trends has «ёлка» as *pretty* close, I think, but «елка» remains in the lead. http://www.google.com/trends/explore#q=ёлка,елка It is of course possible that some people would prefer treating “ё” as a primarily different letter. But it’s rather illogical to require that it be treated that way at the start of a word only. I don’t think collation rules need to accommodate such preferences. Right: To require it would not be in tune with praxis. -- leif halvard silli
Re: UCA and Russian letter Ё
On Fri, Dec 21, 2012 at 1:08 PM, Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no wrote: «Tolkovïj slovar’ sovremennogo russkogo jazïka» from 2005 («Dictionary of the contemporary Russian language») has placed words on Ё in a separate category, consisting of exactly one word: Ёмкость. This is either a mistake or a misunderstanding. There are a few dozen words starting with Ё: http://ru.wikisource.org/wiki/%D0%9E%D1%80%D1%84%D0%BE%D0%B3%D1%80%D0%B0%D1%84%D0%B8%D1%87%D0%B5%D1%81%D0%BA%D0%B8%D0%B9_%D1%81%D0%BB%D0%BE%D0%B2%D0%B0%D1%80%D1%8C_%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D1%8F%D0%B7%D1%8B%D0%BA%D0%B0_%28%D0%81%29 Some online dictionaries may require you to click on a word to expand a word range. That, and the dictionary Leo pointed to, tell me that there is a difference between categorization and collation. You're right. A primary difference is categorizing (e.g. when many people have to check in to an event, the waiting lines may be categorized by several primarily distinct letters of the last name), a secondary difference isn't. Also, speaking of dictionary vs phone book collation, I'd like to know how Ельцин vs Ёлкин would be sorted but I don't know how to find out. During Soviet times, the White Pages weren't accessible to the public. It is of course possible that some people would prefer treating “ё” as a primarily different letter. But it’s rather illogical to require that it be treated that way at the start of a word only. I don’t think collation rules need to accommodate such preferences. Right: To require it would not be in tune with praxis. I'm not in a rush. :) Leo
RE: UCA and Russian letter Ё
Leo Broukhis said: Granted, not yet, but by itself the argument is invalid. Unicode collation rules are descriptive; I'm not sure what you mean by that. UTS #10 is a *specification* of an algorithm, with various options for tailoring and parameterization which make it possible to accommodate various needs for particular cases. It is not intended as a descriptive mechanism. Perhaps you are referring to LDML, which includes a formal mechanism for describing a particular collation in terms of the default table and tailoring options and parameterization options of the UCA. if, for example, a language happens to sort accents backwards, this rule has to be - and is - accommodated despite its apparent illogicality; Backwards accent secondary weighting was actually included primarily because of prior art in collation standards, because of the need to be able to synchronize the UCA algorithm with ISO 14651, and because it makes it easier to explain how folks can implement versions of multi-level collation which can pass the conformance tests of the Canadian sorting standard, etc. along the same lines, if a language happens to make a distinction discussed in this thread, it has to be accommodated just as well. No, I don't think so. It is rather easy to come up with distinctions or collation requirements which simply cannot be accommodated within the intended bounds of the UCA. For example, sorting all numerical expressions mixed with text strictly by their numeric values, or sorting all (or some specified list) of abbreviations as if they were spelled out, and so forth. Many lexicographical ordering rules cannot be fully accommodated within the context of the UCA algorithm, which is a multilevel *string comparison* specification, and not a dictionary ordering specification. My question is as follows: does UCA have to be modified (e.g. 
by adding another bit flag "word-initial primary" next to the existing "backward secondary") to support the feature if it were to be implemented, or is there a way to achieve the new Russian online collation within the existing UCA without modifying the strings to be sorted before the application of the algorithm? I don't think there is any out-of-the-box way to use UCA so that an implementation would automatically recognize a word boundary context and weight characters conditionally based on that context. So no, I don't think you could get an implementation to do that without first marking up text with additional characters to indicate word boundaries and then tailoring the weight table to weight sequences including that markup accordingly. This is actually derived trivially from the fact that UCA knows nothing whatsoever about word boundaries. At core, it is just a mechanism to take a string input and provide an output vector of collation weights. You would have to hook it up to a text segmentation algorithm to even identify words, and then that text segmentation algorithm would itself have to be tailored and tuned to whatever language you had in mind, because the criteria for identifying words will vary from language to language, and even orthography to orthography. But there is another possible sense of the question, "does UCA have to be modified... to support...", i.e. is the UTC somehow required to augment the algorithm to support some particular kind of behavior for a particular language's sorting rules, just because someone has turned up particular odd behavior. And I think the answer to that is clearly no. Oh, and by the way, I don't think LDML must (or should) be augmented to enable it to describe any and all lexicographical ordering practices, either. That isn't the function of LDML. --Ken
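The markup-plus-contraction idea described above can be illustrated with a toy two-level collator in Python. Everything here is an assumption made for the sketch: the marker character (`MARK`, an arbitrary private-use code point), the weights (derived from code points, not taken from DUCET), and the function names. It only shows the shape of the technique: a preprocessing step inserts the marker, and a tailored contraction of marker + ё receives its own primary weight.

```python
MARK = "\uE000"  # hypothetical word-start marker; any otherwise unused character would do


def collation_elements(text):
    """Toy weight lookup returning (primary, secondary) pairs. Not DUCET."""
    elements = []
    i = 0
    while i < len(text):
        if text.startswith(MARK + "ё", i):
            # Tailored contraction: marker + ё gets a primary between е and ж.
            elements.append((2 * ord("е") + 1, 0))
            i += 2
        elif text[i] == MARK:
            i += 1  # a marker before any other letter is ignored
        elif text[i] == "ё":
            # Elsewhere ё keeps the primary of е and differs at the secondary level.
            elements.append((2 * ord("е"), 1))
            i += 1
        else:
            elements.append((2 * ord(text[i]), 0))
            i += 1
    return elements


def sort_key(text):
    """Multi-level key: all primary weights first, then all secondary weights."""
    ces = collation_elements(text)
    return (tuple(p for p, _ in ces), tuple(s for _, s in ces))


def mark_up(word):
    """Preprocessing: prefix the marker. Running text would need real word segmentation."""
    return MARK + word.lower()


if __name__ == "__main__":
    words = ["ель", "ёлка", "тель", "тёлка", "ёж", "едок", "ездит"]
    print(sorted(words, key=lambda w: sort_key(mark_up(w))))  # word-initial ё is primary
    print(sorted(words, key=lambda w: sort_key(w.lower())))   # no markup: ё is secondary
```

Note that the comparison routine itself never learns what a word is. The word-boundary knowledge lives entirely in `mark_up`, which is the point made above about needing a separate segmentation step.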
RE: UCA and Russian letter Ё
Joe, Fri, 21 Dec 2012 12:48:47 -0800: Fact is, again, that ёлка - in the wild - can be written ёлка and елка Though you need a better dictionary: it's the diminutive of ель (as in Yel'tsin) meaning fir tree, and is the 4-letter word for Christmas tree. The dictionary of Dal,[1] says: «Ель, ели́на, умал. ёлка [snip]», which ought to mean that ёлка is a diminutive of ель. My impression is the same as yours with regard to the Christmas tree/New year tree meaning, but many dictionaries do list fir tree as the primary meaning of ёлка and Christmas/New year tree as a secondary meaning. [1] http://en.wikipedia.org/wiki/Vladimir_Dal С праздником! -- leif halvard silli
Re: UCA and Russian letter Ё
Leo Broukhis, Fri, 21 Dec 2012 13:43:14 -0800: On Fri, Dec 21, 2012 at 1:08 PM, Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no wrote: «Tolkovïj slovar’ sovremennogo russkogo jazïka» from 2005 («Dictionary of the contemporary Russian language») has placed words on Ё in a separate category, consisting of exactly one word: Ёмкость. This is either a mistake or a misunderstanding. [ snip ] Not at all. The dictionary I referred to is a dictionary on paper which only contains new words or words with changed meaning etc. Thus, a dictionary of hot words for the time being. That particular dictionary only found room for one such word on ё-. :-) That, and the dictionary Leo pointed to, tell me that there is a difference between categorization and collation. You're right. A primary difference is categorizing (e.g. when many people have to check in to an event, the waiting lines may be categorized by several primarily distinct letters of the last name), a secondary difference isn't. Also, speaking of dictionary vs phone book collation, I'd like to know how Ельцин vs Ёлкин would be sorted but I don't know how to find out. During Soviet times, the White Pages weren't accessible to the public. I think that this is definitely one thing that can be affected by electronic media. But I just checked how Thunderbird sorts words on Ё- and Е-, and it treats them as one and the same, even when the Ё is the first letter of the word. Which to me makes sense in such an uncategorized medium as a list of e-mails, since the user wants to verify for him- or herself that he/she has seen all the messages. However, I agree that in a dictionary etc. it could probably make sense to have separate categories for Ё and Е. The question is whether categorization is a subject for a collation algorithm. -- leif halvard silli