Re: UCA and Russian letter Ё

2012-12-30 Thread Leo Broukhis
On Wed, Dec 26, 2012 at 11:18 AM, Whistler, Ken ken.whist...@sap.com wrote:
 Leo asked:

 My question was narrower: assuming that the strings being compared are
 words, could it be supported without any markup?

 ... where it refers to conditional weighting based on the (identified) word 
 boundary. And the answer to that is no, unless the word boundary was 
 explicitly indicated with some kind of a markup character, and then the 
 sequence of that markup character plus the target character of interest (in 
 this case Russian Yo) was given a tailored contraction in the weight table 
 which weighted it differently from any Russian Yo not in that particular 
 contraction sequence.

I see your point: if something can be trivially emulated with a markup
character, there is no need to augment the algorithm (emulating
backward accents with markup is possible but much more cumbersome).

What characters should be used for such markup, if need be?

Thanks,
Leo




RE: UCA and Russian letter Ё

2012-12-26 Thread Whistler, Ken
The UCA algorithm itself has no opinion on this issue. It is simply a 
specification of *how* to compare strings at multiple levels, given a 
multi-level collation weight table.

The UCA *does* have a default behavior, of course, based on the DUCET table. 
And the DUCET table puts all Unicode characters in *some* order, so there is a 
default answer for Russian Ye and Yo, as there is for everything else. The 
current default answer for UCA 6.2 (abbreviating the unnecessary 4th level 
weights) is:

0435 ; [.19D9.0020.0002] # CYRILLIC SMALL LETTER IE
0450 ; [.19D9.0020.0002][..0035.0002] # CYRILLIC SMALL LETTER IE WITH GRAVE
0451 ; [.19D9.0020.0002][..0047.0002] # CYRILLIC SMALL LETTER IO

So by default, DUCET weights Ye with grave as a secondary difference from Ye, 
and also weights Yo as a secondary difference from Ye. (The secondary weights 
can be seen in the second collation elements for those letters, the 0035 and 
0047 weights, respectively.)

Those weights would be applied to *all* instances of Ye and Yo in a string, 
because there is no concept in the algorithm of conditional weighting in 
particular positions in a word.
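
As a rough illustration of how those three entries behave in a multi-level comparison, here is a minimal Python sketch. It is a toy, not an implementation of UTS #10: the `TABLE` dictionary and the `sort_key` helper are made up for this example and hold only the weights quoted above.

```python
# Toy weight table: only the three DUCET 6.2 entries quoted above.
# Each character maps to a list of (primary, secondary, tertiary) elements.
TABLE = {
    "\u0435": [(0x19D9, 0x0020, 0x0002)],                            # IE
    "\u0450": [(0x19D9, 0x0020, 0x0002), (0x0000, 0x0035, 0x0002)],  # IE WITH GRAVE
    "\u0451": [(0x19D9, 0x0020, 0x0002), (0x0000, 0x0047, 0x0002)],  # IO
}

def sort_key(s):
    """Build a key level by level: all primaries, then all secondaries,
    then all tertiaries. Zero weights are skipped at each level."""
    ces = [ce for ch in s for ce in TABLE[ch]]
    return tuple(
        tuple(ce[level] for ce in ces if ce[level] != 0)
        for level in range(3)
    )

ie, ie_grave, io = "\u0435", "\u0450", "\u0451"

# All three letters share the same primary weight.
assert sort_key(ie)[0] == sort_key(ie_grave)[0] == sort_key(io)[0]

# The order is decided only at the secondary level.
assert sorted([io, ie_grave, ie], key=sort_key) == [ie, ie_grave, io]
```

The key function looks only at the characters themselves, never at their position in a word, which is the point made above.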

But it is important to note also that those weights are just defaults, and the 
concept here is that they are set up to be defaults for the Cyrillic script as 
a whole, and not as defaults for Russian language data in particular. The 
defaults were chosen so that any particular language written with the Cyrillic 
script (including Russian) doesn't get *too* screwed up if strings in it are 
sorted using the default table, but the default is not intended to be fully 
correct for *any* particular language, including Russian. Instead, that is what 
tailoring (using LDML or some other mechanism) is aimed at.

So I would say that UCA per se is not meant to solve the issue of how to 
collate Russian Ye and Yo. It is meant to provide a mechanism for tailoring 
weights for characters to provide appropriate collation orders for particular 
languages.

However, in some cases, where languages require collation rules that depend on 
boundary conditions, the algorithm by itself cannot handle those. But 
appropriate markup of text to *indicate* boundaries explicitly, and then to 
tailor the weights of the characters used for that markup, can result in 
strings which then *could* be compared using UCA to provide the expected 
results. That kind of markup could be done by a preprocessing step, which 
could, for example, process for word or syllabic boundaries (according to 
particular language and orthographic rules) and then pass the marked-up text to 
the string comparison step.
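
A minimal sketch of that two-step idea, in Python, might look as follows. Everything in it is an assumption made for illustration: the marker character `MARK` (an arbitrary private-use code point), the toy weight table built from a lowercase Russian alphabet string, and the helper names. It handles only lowercase Russian words and is not a UCA implementation.

```python
import re
import unicodedata

MARK = "\uE000"  # hypothetical boundary marker; any otherwise unused character would do
ALPHABET = "абвгдежзийклмнопрстуфхцчшщъыьэюя"  # ё is handled separately below

def mark_word_initial_yo(text):
    """Preprocessing step: insert MARK before every word-initial ё."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\bё", MARK + "ё", text)

def collation_elements(marked):
    """Map marked text to (primary, secondary) pairs from a toy tailored table."""
    ces = []
    i = 0
    ye = 2 * ALPHABET.index("е")
    while i < len(marked):
        if marked.startswith(MARK + "ё", i):
            # Tailored contraction: marker + ё gets its own primary, right after е.
            ces.append((ye + 1, 0))
            i += 2
        elif marked[i] == "ё":
            # Any other ё: same primary as е, secondary difference only.
            ces.append((ye, 1))
            i += 1
        else:
            ces.append((2 * ALPHABET.index(marked[i]), 0))
            i += 1
    return ces

def sort_key(word):
    ces = collation_elements(mark_word_initial_yo(word))
    return ([p for p, _ in ces], [s for _, s in ces])

words = ["тёлка", "ёлка", "тель", "ель"]
print(sorted(words, key=sort_key))  # ['ель', 'ёлка', 'тёлка', 'тель']
```

The comparison step itself stays position-blind; all knowledge of word boundaries lives in `mark_word_initial_yo`.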

But in any case, it isn't the job of UCA to arbitrate what the correct or 
expected result for comparison in a particular language is.

--Ken


 A basic question: does the UCA algorithm consider the Russian Ye and the
 Russian Yo as equal with regard to sort order? Or is it not meant to solve
 that issue?
 
 Leif Halvard Silli





RE: UCA and Russian letter Ё

2012-12-26 Thread Whistler, Ken
Leo asked:

 My question was narrower: assuming that the strings being compared are
 words, could it be supported without any markup?

... where it refers to conditional weighting based on the (identified) word 
boundary. And the answer to that is no, unless the word boundary was explicitly 
indicated with some kind of a markup character, and then the sequence of that 
markup character plus the target character of interest (in this case Russian 
Yo) was given a tailored contraction in the weight table which weighted it 
differently from any Russian Yo not in that particular contraction sequence.

 (NB that the backward accents feature is also, strictly speaking, 
 word-based.)

A correction here. The backwards accents feature in UCA is *not* word-based. As 
for any other string being compared via the UCA mechanism, weights are simply 
assigned to *all* characters in the string. The difference for weighting when 
using the backwards accents feature is that secondary weight significance in 
comparison is calculated from the end of the string, instead of the start of 
the string. This works when comparing single words, but it is applied 
indifferently to entire strings. And it gets the correct results, by the way. 
Work it out: you take two strings containing entire phrases in French, which 
only differ by accents on some word in the middle of the string. The only 
difference in weights assigned will be for the secondary weights for those 
accents, and if you use the backwards accents feature they will be calculated 
from the end of the string.
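
To make that concrete, here is a toy two-level key in Python. It is only a sketch under simplifying assumptions (base letters compared by code point as the primary level, combining marks used as secondary weights, and a made-up helper name `toy_key`); it is not how UCA weights are actually assigned, but it shows the effect of reading the secondary level from the end of the whole string.

```python
import unicodedata

def toy_key(s, backwards=False):
    """Two-level toy key: base characters are the primary level, combining
    marks the secondary level. With backwards=True the secondary level of
    the entire string is reversed, so the last accent difference wins."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch) and secondary:
            # Record the accent as the secondary weight of the preceding base.
            secondary[-1] = ord(ch)
        else:
            primary.append(ch)
            secondary.append(0)
    if backwards:
        secondary.reverse()
    return (primary, secondary)

words = ["côté", "cote", "coté", "côte"]
forward = sorted(words, key=toy_key)
backward = sorted(words, key=lambda w: toy_key(w, backwards=True))

print(forward)   # ['cote', 'coté', 'côte', 'côté']
print(backward)  # ['cote', 'côte', 'coté', 'côté']
```

Note that nothing in `toy_key` looks for word boundaries: the reversal is applied to the secondary weights of the string as a whole.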

Once again, let me emphasize: the UCA algorithm per se simply has no concept at 
all of word boundaries. It applies strictly and only to string input, which 
could contain *anything*.

 In other words, after adoption, LDML became prescriptive in the sense
 "don't even think of inventing any sorting rules that cannot be
 described by LDML as it stands; we're not going to augment it". The
 Quebecois were very lucky, then.

No, I think that is an incorrect characterization of the situation for LDML. It 
 can be, and at times has been, augmented for new parameterizations which make 
sense. Those changes, however, have to make sense within the overall context of 
the way the multilevel weighting and string comparison algorithm works. The 
basic issue here is that because UCA is a string weighting and comparison 
algorithm, but does *not* have built in any kind of text segmentation logic 
(whether to identify words, syllables, or any other language-specific segment), 
it simply does not make sense to expect LDML to be augmented to describe 
collation behavior that depends on conditional behavior at segmentation 
boundaries. That is simply outside the scope of UCA and LDML. It isn't outside 
the scope of the bigger issue of sorting and collation behavior in general, of 
course -- it is just outside the scope of what UCA addresses.

Incidentally, for the record, backwards weighting of accents for French doesn't 
have anything particular to do with Quebecois. It is a feature of *some* 
influential French dictionary lexicographical ordering traditions -- in France 
-- and not in others.

--Ken





Re: UCA and Russian letter Ё

2012-12-23 Thread Otto Stolz

Hello,

Leo Broukhis wrote:

In Russian, the difference between Е and Ё is primary at the beginning
of a word as they are considered distinct letters of the alphabet, yet
secondary in the middle of a word, as the dieresis over Ё is not
mandatory. As an example, ель < ёлка, but тёлка < тель, see
http://ru.wikisource.org/wiki/Орфографический_словарь_русского_языка


On 2012-12-21 at 20:05, Leif Halvard Silli wrote:

My Moscow Russian-Norwegian from 1987 and my Pocket Oxford Russian
Dictionary from 2003 agree that both list words on Ё and Е under the
same category – namely, under the letter Е.


So do both “Русско-Немецкий Словарь” (Moscow, 1955) and “Langenscheidts
Taschenwörterbuch”, 4th ed. (Berlin, 1963).

Hence, I deem Leo’s example a red herring.

Best wishes for a merry Xmas (or whatever) and a happy New Year,
  Otto








Re: UCA and Russian letter Ё

2012-12-23 Thread Leif H Silli

Ken,

A basic question: does the UCA algorithm consider the Russian Ye and the 
Russian Yo as equal with regard to sort order? Or is it not meant to solve 
that issue?


Leif Halvard Silli



--- Original message ---

From: Whistler, Ken ken.whist...@sap.com
To: l...@mailcom.com, jkorp...@cs.tut.fi
Cc: unicode@unicode.org, ken.whist...@sap.com
Sent: 21/12/'12, 22:49

Leo Broukhis said:


Granted, not yet, but by itself the argument is invalid. Unicode
collation rules are descriptive;


I'm not sure what you mean by that. UTS #10 is a *specification* of an 
algorithm, with various options  for tailoring and parameterization which 
make it possible to accommodate various needs for particular cases. It is 
not intended as a descriptive mechanism.


Perhaps you are referring to LDML, which includes a formal mechanism for 
describing a particular collation in terms of the default table and 
tailoring options and parameterization options of the UCA.



if, for example,  a language happens to sort accents backwards, this
rule has to be - and is - accommodated despite its apparent
illogicality;


Backwards accent secondary weighting was actually included primarily 
because of prior art in collation standards, because of the need to be 
able to synchronize the UCA algorithm with ISO 14651, and because it makes 
it easier  to explain how folks can implement versions of multi-level 
collation which can pass the conformance tests of the Canadian sorting 
standard, etc.



along the same lines, if a language happens to make a distinction
discussed in this thread, it has to be accommodated just as well.


No, I don't think so.

It is rather easy to come up with distinctions or collation requirements 
which simply cannot be accommodated within the intended bounds of the UCA. 
For example, sorting all numerical expressions mixed with text strictly by 
their numeric values, or sorting all (or some specified list) of 
abbreviations as if they were spelled out, and so forth.


Many lexicographical ordering rules cannot be fully accommodated within 
the context of the UCA algorithm, which is a multilevel *string 
comparison* specification, and not a dictionary ordering specification.




My question is as follows: does UCA have to be modified (e.g. by
adding another bit flag "word-initial primary" next to the existing
"backward secondary") to support the feature if it were to be
implemented, or is there a way to achieve the new Russian online
collation within the existing UCA without modifying the strings to
be sorted before the application of the algorithm?


I don't think there is any out-of-the-box way to use UCA so that an 
implementation would automatically recognize a word boundary context and 
weight characters conditionally based on that context. So no, I don't 
think you could get an implementation to do that without first marking up 
text with additional characters to indicate word boundaries and then 
tailoring the weight table to weight sequences including that markup 
accordingly.


This is actually derived trivially from the fact that UCA knows nothing 
whatsoever about word boundaries. At core, it is just a mechanism to take 
a string input and provide an output vector of collation weights. You 
would have to hook it up to a text segmentation algorithm to even 
identify words, and then that text segmentation algorithm would itself 
have to be tailored and tuned to whatever language you had in mind, 
because the criteria for identifying words will vary from language to 
language, and even orthography to orthography.


But there is another possible sense of the question "does UCA have to be 
modified... to support...", i.e. is the UTC somehow required to augment 
the algorithm to support some particular kind of behavior for a particular 
language's sorting rules, just because someone has turned up particular 
odd behavior. And I think the answer to that is clearly no. Oh, and by the 
way, I don't think LDML must (or should) be augmented to enable it to 
describe any and all lexicographical ordering practices, either. That 
isn't the function of LDML.


--Ken








Re: UCA and Russian letter Ё

2012-12-23 Thread Philippe Verdy
My opinion is that BOTH the UCA algorithm AND the LDML formal description of
collations are just "best known practices" to accommodate collation
(i.e. dictionary ordering AND string searches AND string comparisons).

But neither of them can accommodate all possible orders or weak comparison
systems for all languages. Notably, they cannot directly accommodate the
contextual mutation of initial or medial letters in words.

But true linguistic dictionaries have ordered their entries by grouping
together in a single entry or in successive entries all variations of a
word, including these initial mutations, or derivations (like conjugated
verbs, grammatical declensions, plurals, or genders.)

The UCA still works provided that a prior preprocessing step infers (or
looks up, for exceptions) another form of the words (or of the numbers) on
which the multilevel algorithm that generates the weights can then work.

The UCA admits it, but LDML cannot describe these preprocessing rules with
just the rules used to assign weights to some groups of characters,
independently of what words could mean in the intended language. So the
LDML collation rules cannot be sufficient for all cases, and there must
also be another formal language for describing the preprocessing rules. For
now this does not exist, but nothing prevents it from appearing in the
future as additional data, and LDML could be extended to describe these
preprocessing transforms.

But some steps will remain: the initial normalization; the place where
letter case is assigned (or not) some higher collation weight; the place
where preprocessing can perform lookups using this simplified view based on
the first steps, which perform more than just a standard Unicode
normalization; and then the last steps, which output the weights level
by level.

Collation is a very complex concept, and it is not fully standardized for
interchange in LDML, the most notable missing parts being the preprocessing
steps.

But note that even various dictionaries for the same language will vary in
how they perform this preprocessing (for example, a dictionary may
include and sort the derived terms separately, using a simpler rule not
requiring this preprocessing, so it will have MORE entries, even if they
link their actual definition to another entry; most
dictionaries do not include separate entries for derivations like regular
conjugations, plurals, or declensions, unless they are VERY irregular;
and most dictionaries for languages that have standard mutation
rules for initial letters will NOT list separate entries for these
mutations, and users know, for example, that if they cannot find a word
starting with one of these mutable letters, they should look for the word
starting with the unmutated letter).

The same is true for languages that use agglutination: it is not possible
to list all possible agglutinations, and users need to know how to recognize
the morphemes. Here again something is not described in collation rules:
the breaking rules that allow the separation of words or morphemes. But
collation may not work correctly without them when agglutination also
implies mutations between two agglutinated morphemes.

This is clearly a limitation of LDML, but not of UCA itself.


2012/12/23 Leo Broukhis l...@mailcom.com

 On Fri, Dec 21, 2012 at 1:49 PM, Whistler, Ken ken.whist...@sap.com
 wrote:
  Leo Broukhis said:
 
  Granted, not yet, but by itself the argument is invalid. Unicode
  collation rules are descriptive;
 
  I'm not sure what you mean by that. UTS #10 is a *specification* of an
 algorithm, with various options  for tailoring and parameterization which
 make it possible to accommodate various needs for particular cases. It is
 not intended as a descriptive mechanism.

 What I meant is that the way its feature set which makes it flexible
 enough had been decided was descriptive, or, if you will, adaptive,
 following the pre-existing collation traditions in various languages
 and/or pre-existing standards.

  Perhaps you are referring to LDML, which includes a formal mechanism for
 describing a particular collation in terms of the default table and
 tailoring options and parameterization options of the UCA.

 You're right, LDML may be a better application of the word "descriptive".

  if, for example,  a language happens to sort accents backwards, this
  rule has to be - and is - accommodated despite its apparent
  illogicality;
 
  Backwards accent secondary weighting was actually included primarily
 because of prior art in collation standards, because of the need to be able
 to synchronize the UCA algorithm with ISO 14651, and because it makes it
 easier  to explain how folks can implement versions of multi-level
 collation which can pass the conformance tests of the Canadian sorting
 standard, etc.
 
  along the same lines, if a language happens to make a distinction
  discussed in this thread, it has to be accommodated just as well.
 
  No, I don't think so.

Re: UCA and Russian letter Ё

2012-12-22 Thread Leo Broukhis
On Fri, Dec 21, 2012 at 1:49 PM, Whistler, Ken ken.whist...@sap.com wrote:
 Leo Broukhis said:

 Granted, not yet, but by itself the argument is invalid. Unicode
 collation rules are descriptive;

 I'm not sure what you mean by that. UTS #10 is a *specification* of an 
 algorithm, with various options  for tailoring and parameterization which 
 make it possible to accommodate various needs for particular cases. It is not 
 intended as a descriptive mechanism.

What I meant is that the way its feature set (which makes it flexible
enough) was decided on was descriptive, or, if you will, adaptive,
following the pre-existing collation traditions in various languages
and/or pre-existing standards.

 Perhaps you are referring to LDML, which includes a formal mechanism for 
 describing a particular collation in terms of the default table and tailoring 
 options and parameterization options of the UCA.

You're right, LDML may be a better application of the word "descriptive".

 if, for example,  a language happens to sort accents backwards, this
 rule has to be - and is - accommodated despite its apparent
 illogicality;

 Backwards accent secondary weighting was actually included primarily because 
 of prior art in collation standards, because of the need to be able to 
 synchronize the UCA algorithm with ISO 14651, and because it makes it easier  
 to explain how folks can implement versions of multi-level collation which 
 can pass the conformance tests of the Canadian sorting standard, etc.

 along the same lines, if a language happens to make a distinction
 discussed in this thread, it has to be accommodated just as well.

 No, I don't think so.

My question can be construed as a hypothetical: had the described
Ё-collation been prior art in collation standards by the time of
development of LDML and the UCA, how different would they have been?
I'm hoping for an answer of "not at all" or "very little", and here's
how it could have been implemented: 

 It is rather easy to come up with distinctions or collation requirements 
 which simply cannot be accommodated within the intended bounds of the UCA. 
 For example, sorting all numerical expressions mixed with text strictly by 
 their numeric values, or sorting all (or some specified list) of 
 abbreviations as if they were spelled out, and so forth.

 Many lexicographical ordering rules cannot be fully accommodated within the 
 context of the UCA algorithm, which is a multilevel *string comparison* 
 specification, and not a dictionary ordering specification.

That is true in general if rules happen to involve semantics, but
we're discussing a formal rule here.
Imagine that the backward accents feature was missing from LDML, e.g.
because it was an emerging trend rather than a standard way of
collation at the time of formalizing LDML, thus not included in LDML.
Would you have said the same about it today if someone had asked about
supporting it? If not, why not?


 My question is as follows: does UCA have to be modified (e.g. by
 adding another bit flag "word-initial primary" next to the existing
 "backward secondary") to support the feature if it were to be
 implemented, or is there a way to achieve the new Russian online
 collation within the existing UCA without modifying the strings to
 be sorted before the application of the algorithm?

 I don't think there is any out-of-the-box way to use UCA so that an 
 implementation would automatically recognize a word boundary context and 
 weight characters conditionally based on that context. So no, I don't think 
 you could get an implementation to do that without first marking up text with 
 additional characters to indicate word boundaries and then tailoring the 
 weight table to weight sequences including that markup accordingly.

My question was narrower: assuming that the strings being compared are
words, could it be supported without any markup?
(NB that the backward accents feature is also, strictly speaking, word-based.)

 But there is another possible sense of the question "does UCA have to be 
 modified... to support...", i.e. is the UTC somehow required to augment the 
 algorithm to support some particular kind of behavior for a particular 
 language's sorting rules, just because someone has turned up particular odd 
 behavior. And I think the answer to that is clearly no. Oh, and by the way, I 
 don't think LDML must (or should) be augmented to enable it to describe any 
 and all lexicographical ordering practices, either. That isn't the function 
 of LDML.

In other words, after adoption, LDML became prescriptive in the sense
"don't even think of inventing any sorting rules that cannot be
described by LDML as it stands; we're not going to augment it". The
Quebecois were very lucky, then.

Leo




UCA and Russian letter Ё

2012-12-21 Thread Leo Broukhis
In Russian, the difference between Е and Ё is primary at the beginning
of a word as they are considered distinct letters of the alphabet, yet
secondary in the middle of a word, as the dieresis over Ё is not
mandatory. As an example, ель < ёлка, but тёлка < тель, see
http://ru.wikisource.org/wiki/Орфографический_словарь_русского_языка

A cursory scan of the UCA doesn't reveal if that's implementable, and
experiments in a fairly fresh Linux Mint yield either
ель < ёлка < тель < тёлка or ель < тель < тёлка < ёлка depending on
the LANG setting (en_US works better than ru_RU).

Could someone tell if the UCA in its current form is able to support that?

Thanks,
Leo




Re: UCA and Russian letter Ё

2012-12-21 Thread Leif Halvard Silli
Leo Broukhis, Fri, 21 Dec 2012 01:31:18 -0800:
 In Russian, the difference between Е and Ё is primary at the beginning
 of a word as they are considered distinct letters of the alphabet, yet
 secondary in the middle of a word, as the dieresis over Ё is not
 mandatory.

 As an example, ель < ёлка, but тёлка < тель, see
 http://ru.wikisource.org/wiki/Орфографический_словарь_русского_языка

You say that the difference is primary in the beginning of a word but 
elsewhere secondary. And yes, that orthographic dictionary that you 
link to above, looks as you describe.

However, in reality, the difference is secondary - if that is the right 
word - even as the first letter in a word. Wikipedia has the following 
example: едок < ёж < ездит.[1] And, for instance, the word ёлка could 
also be written елка.

Hence I would argue that the dictionary you linked to above considers 
the difference to *always* be secondary. It is just that the dictionary 
applies the sorting algorithm to a collection where the words that 
begin with the letter Ё have been separated from words that begin with 
the letter Е.

 A cursory scan of the UCA doesn't reveal if that's implementable, and
 experiments in a fairly fresh Linux Mint yield either
 ель < ёлка < тель < тёлка or ель < тель < тёлка < ёлка depending on
 the LANG setting (en_US works better than ru_RU).

(Both examples consider the difference primary, but the last 
example is incorrect, as ёлка follows тёлка, which is 
incorrect from every angle except that of the code point order of the 
letters in Unicode.)

 Could someone tell if the UCA in its current form is able to support that?

Is there not a need for 3 kinds of sorting? Namely: a) Е/Ё as always 
distinct letters, b) Е/Ё as always non-distinct letters, c) Е/Ё as 
non-distinct letters except when used as the first letter. (Note that 
the last variant would only yield correct results on collections of 
words where a first-letter Ё is guaranteed to be rendered with a Ё. Thus, 
if ёлка is written елка, then the result becomes incorrect.)

Linguistic PS: From the angle of the color of the sound, Russian 
Ё is the light version of Russian О. (Its predecessor was also a 
digraph - IO.) But from the angle of stress, when Ё loses 
its stress, it alternates with Russian Е (since Е can occur both with and 
without stress, whereas Ё can only occur with stress). The reason why Е/Ё 
is often considered a secondary difference is (I think) related to 
stress: except in lexicons and dictionaries, Russian texts 
typically do not mark where the stress of a word is. The stress is 
simply known by the reader/user.

[1] http://en.wikipedia.org/wiki/Ё#Russian
-- 
leif halvard silli




Re: UCA and Russian letter Ё

2012-12-21 Thread Leo Broukhis
[Philippe tells me that his message that I'm quoting could have been
rejected by the mailing list as spam; my answer is below.]

On Fri, Dec 21, 2012 at 5:13 AM, Philippe Verdy verd...@wanadoo.fr wrote:
 This is an interesting case. A solution would be to be able to define a
 distinct collation element for ^ë, where ^ means the beginning of a word
 (even if there's no character encoded there). That element would be such
 that:

   e  ë  ^ë

 But this requires a prior definition of word boundaries to recognize the ^
 as an additional collation element by itself (usable distinctly only in
 context, and ignored when it occurs anywhere else, meaning that all weights
 assigned to ^ alone would be null.)

 So ^ë would become valid as a collation element, but т^ё makes no sense
 if there's no possible word boundary between т and ё.

 This would work with the UCA algorithm, which does not really mandate what
 is a collation element (not only in terms of encoding as characters), or
 any syntax to support it.

 This mechanism of incorporating word boundaries in UCA would be an
 interesting extension for section 6.9 (Handling Collation Graphemes) of
 UTS#10 (but for now there's no support for it in LDML with a defined syntax
 allowing the insertion of boundaries or other contextual conditions).

 Would it also mean that using a CGJ at the beginning of a word will
cause a ё at the beginning of a word to be treated as a mid-word one?
Is space, CGJ  a well-formed character sequence?

Leo




Re: UCA and Russian letter Ё

2012-12-21 Thread Leo Broukhis
On Fri, Dec 21, 2012 at 4:56 AM, Leif Halvard Silli
xn--mlform-...@xn--mlform-iua.no wrote:

 You say that the difference is primary in the beginning of a word but
 elsewhere secondary. And yes, that orthographic dictionary that you
 link to above, looks as you describe.

 However, in reality, the difference is secondary - if that is the right
 word - even as the first letter in a word. Wikipedia has the following
 example: едок < ёж < ездит.[1] And, for instance, the word ёлка could
 also be written елка.

 [1] http://en.wikipedia.org/wiki/Ё#Russian

Wikipedia's example is sadly unsourced, unlike mine.

 Hence I would argue that the dictionary you linked to above considers
 the difference to *always* be secondary. It is just that the dictionary
 applies the sorting algorithm to a collection where the words that
 begin with the letter Ё have been separated from words that begin with
 the letter Е.

Isn't that notionally the same as having the difference primary for
the first letter?

 A cursory scan of the UCA doesn't reveal if that's implementable, and
 experiments in a fairly fresh Linux Mint yield either
 ель < ёлка < тель < тёлка or ель < тель < тёлка < ёлка depending on
 the LANG setting (en_US works better than ru_RU).

 (Both examples consider the difference primary, but the last
 example is incorrect, as ёлка follows тёлка, which is
 incorrect from every angle except that of the code point order of the
 letters in Unicode.)

Right. And, ironically, the [en] collation is the correct one.

 Could someone tell if the UCA in its current form is able to support that?

 Is there not a need for 3 kinds of sorting? Namely: a) Е/Ё as always
 distinct letters, b) Е/Ё as always non-distinct letters, c) Е/Ё as
 non-distinct letters except when used as the first letter. (Note that
 the last variant would only yield correct results on collections of
 words where a first-letter Ё is guaranteed to be rendered with a Ё. Thus,
 if ёлка is written елка, then the result becomes incorrect.)

We're not talking here about *words per se* that may or may not be
rendered with a Ё, we're talking about letter sequences with Ё as a
given. The dictionary order shows that all word-initial Ёs go after
all word-initial Еs, but within a word the difference is secondary.
For a set of letter sequences using canonical spelling of words, the
collation algorithm should give their dictionary ordering, shouldn't
it?

Re the linguistic PS: you're right, and that proves that an
approximation to the proper collation using secondary ordering is
preferred to an approximation using primary ordering.

Leo




Re: UCA and Russian letter Ё

2012-12-21 Thread Markus Scherer
Resending my earlier reply. Apparently, by default, Gmail sends subject
lines in KOI8-R if they contain Cyrillic, and unicode.org rejects those as
likely spam. I just changed my Gmail settings to "Use Unicode (UTF-8)
encoding for outgoing messages" and hope this goes through. (*Please change
the subject line* if you want to discuss *this* issue.)

My earlier reply was:

Theoretically, it is possible to select collation elements based on the
proximity of word boundaries or other criteria. However, I don't know if
there is an implementation that has that built in. ICU (one of the commonly
used implementations of UCA+CLDR) does not.

It sounds like the secondary difference is ok for sorting, but you are
looking to customize an alphabetic index such that there is a separate
bucket for words beginning with Ё. I think the best would be to do that
with some custom code that looks for Ё as the first character, in addition
to the regular bucketing and sorting.
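
A rough sketch of such custom code, in Python, could look like this. It is only an illustration under stated assumptions: the helper names are invented, and a simple fold of ё to е with code point comparison stands in for a real collator such as ICU, which this sketch does not use.

```python
from collections import defaultdict

def fold_yo(word):
    """Treat ё as е for the primary comparison."""
    return word.replace("ё", "е").replace("Ё", "Е")

def sort_key(word):
    # Primary level: the word with ё folded to е (code point order stands in
    # for real collation weights). Secondary level: е before ё per position.
    return (fold_yo(word), [ch in "ёЁ" for ch in word])

def build_index(words):
    """Sort with ё as a secondary difference, then bucket by first letter,
    giving words that begin with Ё their own bucket right after Е."""
    buckets = defaultdict(list)
    for w in sorted(words, key=sort_key):
        buckets[w[0].upper()].append(w)
    return [(label, buckets[label]) for label in sorted(buckets, key=sort_key)]

for label, entries in build_index(["тёлка", "ёлка", "тель", "ель"]):
    print(label, entries)
# Е ['ель']
# Ё ['ёлка']
# Т ['тёлка', 'тель']
```

The sorting itself never treats Ё as a separate letter; only the bucketing step looks at the first character.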

Best regards,
markus
-- 
Google Internationalization Engineering


Re: UCA and Russian letter Ё

2012-12-21 Thread Leif Halvard Silli
Leo Broukhis, Fri, 21 Dec 2012 08:57:11 -0800:
 On Fri, Dec 21, 2012 at 4:56 AM, Leif Halvard Silli wrote:
 
 You say that the difference is primary in the beginning of a word but
 elsewhere secondary. And yes, that orthographic dictionary that you
 link to above, looks as you describe.
 
 However, in reality, the difference is secondary - if that is the right
 word - even as the first letter in a word. Wikipedia has the following
 example: едок < ёж < ездит.[1] And, for instance, the word ёлка could
 also be written елка.
 
 [1] http://en.wikipedia.org/wiki/Ё#Russian
 
 Wikipedia's example is sadly unsourced, unlike mine.

My Moscow Russian-Norwegian dictionary from 1987 and my Pocket Oxford Russian 
Dictionary from 2003 agree: both list words in Ё and Е under the 
same category – namely, under the letter Е. Also, the Russian 
Wikipedia article on the letter Ё says that this is how sorting 
should happen. 
http://ru.wikipedia.org/wiki/Ё#.D0.A1.D0.BE.D1.80.D1.82.D0.B8.D1.80.D0.BE.D0.B2.D0.BA.D0.B0
 
And the article lists xindy as one application that handles this. 
http://en.wikipedia.org/wiki/Xindy

 Hence I would argue that the dictionary you linked to above considers
 the difference to *always* be secondary. It is just that the dictionary
 applies the sorting algorithm to a collection where the words that
 begin with the letter Ё have been separated from words that begin with
 the letter Е.
 
 Isn't that notionally the same as having the difference primary for
 the first letter?

Input from a collation expert would be welcome. However, this is how I 
think: 

Should one expect such an algorithm to write the phone book on one’s 
behalf? Or that it writes the dictionary? I think that would be an 
unrealistic expectation. E.g. a dictionary or phone book has precise 
rules for how the words are written and grouped before they are sorted.

Fact is, again, that ёлка - in the wild - can be written ёлка and 
елка. So if you assume that the algorithm should only deal with ёлка, 
then you are also saying that you want the algorithm to deal with words 
that have been prepared for sorting. Thus you are talking about a 
well-prepared text where ёлка is always written ёлка and not елка.

While not a definitive proof, I may also mention that the CSS list 
module defines an enumeration style based on the Russian alphabet, in 
which the ё is excluded.

http://www.w3.org/TR/css3-lists/#lower-russian

 A cursory scan of the UCA doesn't reveal if that's implementable, and
 experiments in a fairly fresh Linux Mint yield either
 ель  ёлка  тель  тёлка or ель  тель  тёлка  ёлка depending on
 the LANG setting (en_US works better than ru_RU).
 
 (Both examples consider the difference primary, but the last
 example is incorrect as ёлка follows after тёлка, which is
 incorrect from every angle except that of the letter's code point
 number in Unicode.)
 
 Right. And, ironically, the [en] collation is the correct one.

Perhaps this bug is because the Russian localizers failed to get it the 
way they wanted: Full alignment of Е and Ё? ;-) 
 
 Could someone tell if the UCA in its current form is able to support that?
 
 Is there not a need for 3 kinds of sorting? Namely: a) Е/Ё as always
 distinct letters, b) Е/Ё as always non-distinct letters, c) Е/Ё as
 non-distinct letters except when used as the first letter. (Note that
 the last variant would only yield correct results on collections of
 words where a first-letter Ё is guaranteed to be rendered with a Ё. Thus,
 if ёлка is written елка, then the result becomes incorrect.)
 
 We're not talking here about *words per se* that may or may not be
 rendered with a Ё, we're talking about letter sequences with Ё as a
 given. The dictionary order shows that all word-initial Ёs go after
 all word-initial Еs, but within a word the difference is secondary.
 For a set of letter sequences using canonical spelling of words, the
 collation algorithm should give their dictionary ordering, shouldn't
 it?

I believe the English Wikipedia article is pretty canonical when it 
says that it can be done both ways - see the sources I pointed to above 
for examples of sorting where the status as first letter doesn't matter.

I don't know why the dictionary you pointed to 
http://ru.wikisource.org/wiki/Орфографический_словарь_русского_языка 
has separated the words. It could be a technical limitation of 
MediaWiki. Or it could be because those who initiated the project felt 
it made the most sense. (It does make a lot of sense to me  … he, he.)  
But that dictionary is also peculiar in that it lists words that 
begin with the letter Ы. :-) It is typical to say that no words begin 
with the letter Ы. :-) But the list managed to find some … (Including one 
word that simply means to say ы.) Neither of the dictionaries I 
mentioned above have any words under the letter Ы. Even in the above 
mentioned CSS list module’s definition, the ы is excluded.

 Re the linguistic PS: you're right, and 

Re: UCA and Russian letter Ё

2012-12-21 Thread Jukka K. Korpela

2012-12-21 21:05, Leif Halvard Silli wrote:


My Moscow Russian-Norwegian from 1987 and my Pocket Oxford Russian
Dictionary from 2003 agree that both list words on Ё and Е under the
same category – namely, under the letter Е.


This appears to be the case in any serious dictionary.

The use of the Cyrillic letter yo (ё, called IO in the Unicode name) has 
varied through ages, but it has never been a dominant spelling to use 
it. According to “The World’s Writing Systems”, edited by Peter T. 
Daniels and William Bright (Oxford University Press, 1995), “The letter 
ё is used virtually only in dictionaries or language textbooks.” It may 
have become more popular in the Internet, but still less common than 
using the letter ye (IE, е) in its stead.



Fact is, again, that ёлка - in the wild - can be written ёлка and
елка.


And in most contexts, it is written “елка”.

It is of course possible that some people would prefer treating “ё” as a 
primarily different letter. But it’s rather illogical to require that it 
be treated that way at the start of a word only. I don’t think collation 
rules need to accommodate such preferences.


Yucca






RE: UCA and Russian letter Ё

2012-12-21 Thread Joe

 Fact is, again, that ёлка - in the wild - can be written ёлка and елка

Though you need a better dictionary: it's the diminutive of ель (as in 
Yel'tsin) meaning fir tree, and is the 4-letter word for Christmas tree.

С Рождеством,

Joe







Re: UCA and Russian letter Ё

2012-12-21 Thread Leo Broukhis
On Fri, Dec 21, 2012 at 11:35 AM, Jukka K. Korpela jkorp...@cs.tut.fi wrote:
 2012-12-21 21:05, Leif Halvard Silli wrote:

 My Moscow Russian-Norwegian from 1987 and my Pocket Oxford Russian
 Dictionary from 2003 agree that both list words on Ё and Е under the
 same category – namely, under the letter Е.

 This appears to be the case in any serious dictionary.

You're right. In an influential orthographic dictionary the difference
is secondary,
e.g. ёлка is between елисейский дворец and ёлки-палки:
http://lopatina-slovar.com/description/elka/34736
(The site database has been built by scanning a printed dictionary)

However, the preferences could change, as electronic dictionaries seem
to demonstrate.

 It is of course possible that some people would prefer treating “ё” as a
 primarily different letter. But it’s rather illogical to require that it be
 treated that way at the start of a word only. I don’t think collation rules
 need to accommodate such preferences.

Granted, not yet, but by itself the argument is invalid. Unicode
collation rules are descriptive;
if, for example,  a language happens to sort accents backwards, this
rule has to be - and is - accommodated despite its apparent
illogicality;
along the same lines, if a language happens to make a distinction
discussed in this thread, it has to be accommodated just as well.
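
To illustrate the "backward accents" case, here is a toy sketch, not the
UCA itself: it assumes Latin text, takes the base letters after NFD
decomposition as the primary level and the combining marks as the
secondary level, and reverses the secondary level when a hypothetical
`backward_secondary` flag is set.

```python
import unicodedata


def levels(word):
    """Split a word into base letters and per-letter tuples of accents."""
    primaries, secondaries = [], []
    for ch in unicodedata.normalize("NFD", word.lower()):
        if unicodedata.combining(ch):
            # A combining mark belongs to the preceding base letter.
            if secondaries:
                secondaries[-1] += (ord(ch),)
        else:
            primaries.append(ch)
            secondaries.append(())
    return primaries, secondaries


def sort_key(word, backward_secondary=False):
    primaries, secondaries = levels(word)
    if backward_secondary:
        # Compare accents starting from the end of the word.
        secondaries = secondaries[::-1]
    return (primaries, secondaries)


words = ["côté", "cote", "coté", "côte"]
forward = sorted(words, key=sort_key)
backward = sorted(words, key=lambda w: sort_key(w, backward_secondary=True))
# forward:  cote, coté, côte, côté
# backward: cote, côte, coté, côté
```

The primary level is identical in both runs; only the direction in which
the accent differences are examined changes.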

Also, "In several languages the rules have changed over time, and so
*older dictionaries may use a different order than modern ones* [emph.
mine - LB]. Furthermore, collation may depend on use. For example,
German dictionaries and telephone directories use different
approaches."
[http://en.wikipedia.org/wiki/Collation]

The distinction in two collation methods in German (secondary vs
expanded umlauts) is prominent enough to be mentioned in UCA. Luckily
for Germans, both methods are covered by the algorithm thanks to
requirements of other languages.

My question is as follows: does UCA have to be modified (e.g. by
adding another bit flag "word-initial primary" next to the existing
"backward secondary") to support the feature if it were to be
implemented, or is there a way to achieve the new Russian online
collation within the existing UCA without modifying  the strings to
be sorted before the application of the algorithm?

Leo




Re: UCA and Russian letter Ё

2012-12-21 Thread Leif Halvard Silli
Jukka K. Korpela, Fri, 21 Dec 2012 21:35:16 +0200:
 2012-12-21 21:05, Leif Halvard Silli wrote:
 
 My Moscow Russian-Norwegian from 1987 and my Pocket Oxford Russian
 Dictionary from 2003 agree that both list words on Ё and Е under the
 same category – namely, under the letter Е.
 
 This appears to be the case in any serious dictionary.

«Tolkovïj slovar’ sovremennogo russkogo jazïka» from 2005 
(«Dictionary of contemporary Russian language») has placed words on Ё 
in a separate category, consisting of exactly one word: Ёмкость. 
That, and the dictionary Leo pointed to, tell me that there is a 
difference between categorization and collation.

 The use of the Cyrillic letter yo (ё, called IO in the Unicode name) 
 has varied through ages, but it has never been a dominant spelling to 
 use it. According to “The World’s Writing Systems”, edited by Peter 
 T. Daniels and William Bright (Oxford University Press, 1995), “The 
 letter ё is used virtually only in dictionaries or language 
 textbooks.” It may have become more popular in the Internet, but 
 still less common than using the letter ye (IE, е) in its stead.

The internet has also really boomed since 1995. ;-)

 Fact is, again, that ёлка - in the wild - can be written ёлка and
 елка.
 
 And in most contexts, it is written “елка”.

Google Trends has «ёлка» as *pretty* close, I think, but «елка» 
remains in the lead. http://www.google.com/trends/explore#q=ёлка,елка

 It is of course possible that some people would prefer treating “ё” 
 as a primarily different letter. But it’s rather illogical to require 
 that it be treated that way at the start of a word only. I don’t 
 think collation rules need to accommodate such preferences.

Right: To require it would not be in tune with praxis.
-- 
leif halvard silli




Re: UCA and Russian letter Ё

2012-12-21 Thread Leo Broukhis
On Fri, Dec 21, 2012 at 1:08 PM, Leif Halvard Silli
xn--mlform-...@xn--mlform-iua.no wrote:

 «Tolkovïj slovar’ sovremennogo russkogo jazïka» from 2005
 («Dictionary of contemporary Russian language») has placed words on Ё
 in a separate category, consisting of exactly one word: Ёмкость.

This is either a mistake or a misunderstanding. There are a few dozen
words starting with Ё:
http://ru.wikisource.org/wiki/%D0%9E%D1%80%D1%84%D0%BE%D0%B3%D1%80%D0%B0%D1%84%D0%B8%D1%87%D0%B5%D1%81%D0%BA%D0%B8%D0%B9_%D1%81%D0%BB%D0%BE%D0%B2%D0%B0%D1%80%D1%8C_%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D1%8F%D0%B7%D1%8B%D0%BA%D0%B0_%28%D0%81%29

Some online dictionaries may require you to click on a word to expand
a word range.

 That, and the dictionary Leo pointed to, tell me that there is a
 difference between categorization and collation.

You're right. A primary difference is categorizing (e.g. when many
people have to check in to an event, the waiting lines may be
categorized by several primarily distinct letters of the last name), a
secondary difference isn't. Also, speaking of dictionary vs phone book
collation, I'd like to know how Ельцин vs Ёлкин would be sorted but I
don't know how to find out. During Soviet times, the White Pages
weren't accessible to the public.

 It is of course possible that some people would prefer treating “ё”
 as a primarily different letter. But it’s rather illogical to require
 that it be treated that way at the start of a word only. I don’t
 think collation rules need to accommodate such preferences.

 Right: To require it would not be in tune with praxis.

I'm not in a rush. :)

Leo




RE: UCA and Russian letter Ё

2012-12-21 Thread Whistler, Ken
Leo Broukhis said:

 Granted, not yet, but by itself the argument is invalid. Unicode
 collation rules are descriptive;

I'm not sure what you mean by that. UTS #10 is a *specification* of an 
algorithm, with various options  for tailoring and parameterization which make 
it possible to accommodate various needs for particular cases. It is not 
intended as a descriptive mechanism.

Perhaps you are referring to LDML, which includes a formal mechanism for 
describing a particular collation in terms of the default table and tailoring 
options and parameterization options of the UCA.

 if, for example,  a language happens to sort accents backwards, this
 rule has to be - and is - accommodated despite its apparent
 illogicality;

Backwards accent secondary weighting was actually included primarily because of 
prior art in collation standards, because of the need to be able to synchronize 
the UCA algorithm with ISO 14651, and because it makes it easier  to explain 
how folks can implement versions of multi-level collation which can pass the 
conformance tests of the Canadian sorting standard, etc.

 along the same lines, if a language happens to make a distinction
 discussed in this thread, it has to be accommodated just as well.

No, I don't think so.

It is rather easy to come up with distinctions or collation requirements which 
simply cannot be accommodated within the intended bounds of the UCA. For 
example, sorting all numerical expressions mixed with text strictly by their 
numeric values, or sorting all (or some specified list) of abbreviations as if 
they were spelled out, and so forth.

Many lexicographical ordering rules cannot be fully accommodated within the 
context of the UCA algorithm, which is a multilevel *string comparison* 
specification, and not a dictionary ordering specification.

 
 My question is as follows: does UCA have to be modified (e.g. by
 adding another bit flag "word-initial primary" next to the existing
 "backward secondary") to support the feature if it were to be
 implemented, or is there a way to achieve the new Russian online
 collation within the existing UCA without modifying  the strings to
 be sorted before the application of the algorithm?

I don't think there is any out-of-the-box way to use UCA so that an 
implementation would automatically recognize a word boundary context and weight 
characters conditionally based on that context. So no, I don't think you could 
get an implementation to do that without first marking up text with additional 
characters to indicate word boundaries and then tailoring the weight table to 
weight sequences including that markup accordingly.

This is actually derived trivially from the fact that UCA knows nothing 
whatsoever about word boundaries. At core, it is just a mechanism to take a 
string input and provide an output vector of collation weights. You would 
have to hook it up to a text segmentation algorithm to even identify 
words, and then that text segmentation algorithm would itself have to be 
tailored and tuned to whatever language you had in mind, because the criteria 
for identifying words will vary from language to language, and even 
orthography to orthography.
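
A toy sketch of the markup-plus-tailoring approach described above. Every
specific in it is an assumption made for illustration: the marker is an
arbitrary private-use code point (U+E000), the weights are invented rather
than taken from DUCET, and the `\w+` regular expression is only a crude
stand-in for real text segmentation.

```python
import re

MARK = "\uE000"  # hypothetical word-start marker (private use)

# Toy weight table: string -> (primary, secondary). Invented weights.
LETTERS = "абвгдежзийклмнопрстуфхцчшщъыьэюя"
TABLE = {ch: (10 * (i + 1), 1) for i, ch in enumerate(LETTERS)}
TABLE["ё"] = (TABLE["е"][0], 2)             # default: secondary difference
TABLE[MARK + "ё"] = (TABLE["е"][0] + 5, 1)  # tailored contraction
TABLE[MARK] = (0, 0)                        # a lone marker is ignorable


def mark_words(text):
    """Prefix every word with MARK."""
    return re.sub(r"\w+", lambda m: MARK + m.group(0), text)


def sort_key(text):
    s = mark_words(text.lower())
    primaries, secondaries = [], []
    i = 0
    while i < len(s):
        # Longest match first, so the two-character contraction wins.
        unit = s[i:i + 2] if s[i:i + 2] in TABLE else s[i]
        primary, secondary = TABLE.get(unit, (0, 0))
        if primary:  # skip ignorable and unknown characters in this toy
            primaries.append(primary)
            secondaries.append(secondary)
        i += len(unit)
    # Primary weights are compared first, secondary weights only on a tie.
    return (primaries, secondaries)


words = ["тель", "ёлка", "тёлка", "ель", "ёж", "едок"]
ordered = sorted(words, key=sort_key)
# едок, ель, ёж, ёлка, тёлка, тель
```

The string comparison itself is unchanged; the word-initial behavior comes
entirely from the preprocessing step and the one extra table entry.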

But there is another possible sense of the question, "does UCA have to be 
modified... to support...", i.e. is the UTC somehow required to augment the 
algorithm to support some particular kind of behavior for a particular 
language's sorting rules, just because someone has turned up particular odd 
behavior. And I think the answer to that is clearly no. Oh, and by the way, I 
don't think LDML must (or should) be augmented to enable it to describe any and 
all lexicographical ordering practices, either. That isn't the function of LDML.

--Ken





RE: UCA and Russian letter Ё

2012-12-21 Thread Leif Halvard Silli
Joe, Fri, 21 Dec 2012 12:48:47 -0800:
 
 Fact is, again, that ёлка - in the wild - can be written ёлка and елка
 
 Though you need a better dictionary: it's the diminutive of ель (as 
 in Yel'tsin) meaning fir tree, and is the 4-letter word for 
 Christmas tree.

The dictionary of Dal,[1] says: «Ель, ели́на, умал. ёлка [snip]», which 
ought to mean that ёлка is a diminutive of ель. My impression is the 
same as yours with regard to the Christmas tree/New year tree meaning, 
but many dictionaries do list fir tree as the primary meaning of ёлка 
and Christmas/New year tree as a secondary meaning.

[1] http://en.wikipedia.org/wiki/Vladimir_Dal

С праздником!
-- 
leif halvard silli




Re: UCA and Russian letter Ё

2012-12-21 Thread Leif Halvard Silli
Leo Broukhis, Fri, 21 Dec 2012 13:43:14 -0800:
 On Fri, Dec 21, 2012 at 1:08 PM, Leif Halvard Silli
 xn--mlform-...@xn--mlform-iua.no wrote:
 
 «Tolkovïj slovar’ sovremennogo russkogo jazïka» from 2005
 («Dictionary of contemporary Russian language») has placed words on Ё
 in a separate category, consisting of exactly one word: Ёмкость.
 
 This is either a mistake or a misunderstanding. [ snip ]

Not at all. The dictionary I referred to is a dictionary on paper which 
only contains new words or words with changed meaning etc. Thus, a 
dictionary of hot words for the time being. That particular 
dictionary only found room for one such word on ё-. :-)

 That, and the dictionary Leo pointed to, tell me that there is a
 difference between categorization and collation.
 
 You're right. A primary difference is categorizing (e.g. when many
 people have to check in to an event, the waiting lines may be
 categorized by several primarily distinct letters of the last name), a
 secondary difference isn't. Also, speaking of dictionary vs phone book
 collation, I'd like to know how Ельцин vs Ёлкин would be sorted but I
 don't know how to find out. During Soviet times, the White Pages
 weren't accessible to the public.

I think that this is definitely one thing that can be affected by 
electronic media. But I just checked how Thunderbird sorts words on Ё- 
and Е-, and it treats them as one and the same, even when the Ё is 
the first letter of the word. To me that makes sense in such an 
uncategorized medium as a list of e-mails, since users want to 
verify for themselves that they have seen all the messages. However, I 
agree that in a dictionary etc. it could probably make sense to 
have separate categories for Ё and Е. 

The question is whether categorization is a subject for a collation algorithm.
-- 
leif halvard silli