Leo Broukhis said:

> Granted, not yet, but by itself the argument is invalid. Unicode
> collation rules are descriptive;

I'm not sure what you mean by that. UTS #10 is a *specification* of an 
algorithm, with various options for tailoring and parameterization which make 
it possible to accommodate various needs for particular cases. It is not 
intended as a descriptive mechanism.

Perhaps you are referring to LDML, which includes a formal mechanism for 
describing a particular collation in terms of the default table and tailoring 
options and parameterization options of the UCA.

> if, for example,  a language happens to sort accents backwards, this
> rule has to be - and is - accommodated despite its apparent
> illogicality;

Backwards accent secondary weighting was actually included primarily because of 
prior art in collation standards, because of the need to be able to synchronize 
the UCA algorithm with ISO 14651, and because it makes it easier to explain 
how folks can implement versions of multi-level collation which can pass the 
conformance tests of the Canadian sorting standard, etc.
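
To make the effect of the backward-secondary option concrete, here is a minimal sketch of a two-level comparison. It is a toy, not the UCA: the function name toy_sort_key is made up for this illustration, primary weights are just the lowercased base letters, and secondary weights are simply the code points of combining marks rather than values from any real collation table. The only point is that reversing the secondary level makes the last accent difference in a string the deciding one.

```python
import unicodedata

def toy_sort_key(text, backward_secondary=False):
    """Toy two-level sort key (an assumption for illustration, not the UCA)."""
    primary = []    # one entry per base letter
    secondary = []  # one accent weight per base letter, 0 = unaccented
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Attach the accent to the preceding base letter.
            # A later accent on the same letter overwrites an earlier one.
            if secondary:
                secondary[-1] = ord(ch)
        else:
            primary.append(ch.lower())
            secondary.append(0)
    if backward_secondary:
        # Compare accent differences starting from the end of the string.
        secondary.reverse()
    return (primary, secondary)

words = ["côté", "cote", "coté", "côte"]

forward = sorted(words, key=toy_sort_key)
backward = sorted(words, key=lambda w: toy_sort_key(w, backward_secondary=True))

print(forward)   # ['cote', 'coté', 'côte', 'côté']
print(backward)  # ['cote', 'côte', 'coté', 'côté']
```

All four words are identical at the primary level, so the order is decided entirely by the secondary level, and the two settings disagree only on the middle pair.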

> along the same lines, if a language happens to make a distinction
> discussed in this thread, it has to be accommodated just as well.

No, I don't think so.

It is rather easy to come up with distinctions or collation requirements which 
simply cannot be accommodated within the intended bounds of the UCA. For 
example, sorting all numerical expressions mixed with text strictly by their 
numeric values, or sorting all (or some specified list of) abbreviations as if 
they were spelled out, and so forth.

Many lexicographical ordering rules cannot be fully accommodated within the 
context of the UCA algorithm, which is a multilevel *string comparison* 
specification, and not a dictionary ordering specification.

> 
> My question is as follows: does UCA have to be modified (e.g. by
> adding another bit flag "word-initial primary" next to the existing
> "backward secondary") to support the feature if it were to be
> implemented, or is there a way to achieve the "new Russian online
> collation" within the existing UCA without modifying  the strings to
> be sorted before the application of the algorithm?

I don't think there is any out-of-the-box way to use UCA so that an 
implementation would automatically recognize a word boundary context and weight 
characters conditionally based on that context. So no, I don't think you could 
get an implementation to do that without first marking up text with additional 
characters to indicate word boundaries and then tailoring the weight table to 
weight sequences including that markup accordingly.

This is actually derived trivially from the fact that UCA knows nothing 
whatsoever about word boundaries. At core, it is just a mechanism to take a 
string input and provide an output vector of collation weights. You would have 
to hook it up to a text segmentation algorithm to even identify 
"words", and then that text segmentation algorithm would itself have to be 
tailored and tuned to whatever language you had in mind, because the criteria 
for identifying "words" will vary from language to language, and even 
orthography to orthography.
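
The markup-then-tailor approach described above can be sketched roughly as follows. Everything here is hypothetical: the marker character, the function names, the weight values, and the rule itself (word-initial "e" sorting ahead of all other letters) are invented for the illustration and have nothing to do with any actual Russian rule. The "segmentation" is a naive regular expression, and the weight table is a toy with a single level, not a UCA tailoring.

```python
import re
import string

WORD_START = "\uE000"  # hypothetical private-use marker, chosen for this sketch only

# Toy primary weights: a=10, b=11, ... (not values from any real table).
BASE = {ch: 10 + i for i, ch in enumerate(string.ascii_lowercase)}

# Toy "tailoring": the sequence marker + "e" is weighted as a unit,
# below every ordinary letter. This rule is purely illustrative.
CONTRACTIONS = {WORD_START + "e": 5}

def mark_word_starts(text):
    """Naive segmentation: prefix every run of word characters with the marker."""
    return re.sub(r"\w+", lambda m: WORD_START + m.group(0), text)

def toy_primary_key(text):
    """Map a string to a list of weights, knowing nothing about words."""
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in CONTRACTIONS:
            key.append(CONTRACTIONS[pair])  # tailored marker+letter sequence
            i += 2
        elif text[i] == WORD_START:
            i += 1                          # a bare marker is ignorable
        else:
            key.append(BASE.get(text[i], 0))  # unknown characters sort first
            i += 1
    return key

words = ["apple", "echo", "beet"]

plain = sorted(words, key=toy_primary_key)
marked = sorted(words, key=lambda w: toy_primary_key(mark_word_starts(w)))

print(plain)   # ['apple', 'beet', 'echo']
print(marked)  # ['echo', 'apple', 'beet']
```

The key function itself never looks at word boundaries; the conditional behavior comes entirely from the preprocessing step plus the extra table entry, which is why the strings have to be modified before the comparison is applied. The "e" characters inside "beet" keep their ordinary weight.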

But there is another possible sense of the question, "does UCA have to be 
modified... to support...", i.e. is the UTC somehow required to augment the 
algorithm to support some particular kind of behavior for a particular 
language's sorting rules, just because someone has turned up particular odd 
behavior. And I think the answer to that is clearly no. Oh, and by the way, I 
don't think LDML must (or should) be augmented to enable it to describe any and 
all lexicographical ordering practices, either. That isn't the function of LDML.

--Ken


