My opinion is that both the UCA algorithm and the LDML formal description of collations are just "best known practices" for accommodating collation (i.e. dictionary ordering, string searches and string comparisons).
But neither of them can accommodate all possible orderings or weak comparison systems for all languages. Notably, they cannot directly accommodate the contextual mutation of initial or medial letters in words. Yet true linguistic dictionaries order their entries by grouping together, in a single entry or in successive entries, all variations of a word, including these initial mutations, or derivations (like conjugated verbs, grammatical declensions, plurals, or genders). The UCA still works provided that there is a prior preprocessing step that infers (or looks up, for exceptions) another form of the words (or of numbers) on which the multilevel algorithm that generates weights can then work. The UCA admits this, but LDML cannot describe these preprocessing rules, because its rules only assign weights to groups of characters, independently of what words could mean in the intended language.
So the LDML collation rules cannot be sufficient for all cases, and there must also be another formal language for describing the preprocessing rules. For now this does not exist, but nothing prevents it from appearing in the future as additional data, and LDML could be extended to describe these preprocessing transforms. But some steps will remain: the initial normalization; the place where letter case is assigned (or not) some higher collation weight; the place where preprocessing can perform lookups using this simplified view based on the first steps, which perform more than just a standard Unicode normalization; and then the last steps, which output the weights level by level. Collation is a very complex concept and it is not fully standardized for interchange in LDML, the most notable missing part being the preprocessing steps.
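To make the idea of a preprocessing step concrete, here is a minimal sketch in Python. It is not the UCA and not anything LDML can express today: the lookup table RADICAL_FORMS, the helper names and the few Welsh words are assumptions chosen only for illustration, and the final comparison is a plain code point comparison of base letters rather than real collation weights (Welsh digraph letters such as "ch" or "ng" are ignored). It only shows how a lookup done before the weights are generated can group mutated forms under their unmutated entry.

```python
import unicodedata

# Hypothetical exception table: mutated form -> radical (dictionary) form.
# Real data would come from a lexicon; these entries are illustrative only.
RADICAL_FORMS = {
    "gath": "cath",    # soft mutation of "cath"
    "nghath": "cath",  # nasal mutation of "cath"
    "ben": "pen",      # soft mutation of "pen"
    "mhen": "pen",     # nasal mutation of "pen"
}

def base_letters(word):
    """Crude primary level: drop combining marks and case differences."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

def sort_key(word):
    """Preprocess (normalize, then look up the radical), then build a key."""
    normalized = unicodedata.normalize("NFC", word)
    folded = normalized.casefold()
    is_mutated = folded in RADICAL_FORMS
    radical = RADICAL_FORMS.get(folded, normalized)
    # Order: radical form first, unmutated before mutated, then the word itself.
    return (base_letters(radical), is_mutated, base_letters(normalized), normalized)

words = ["pen", "gath", "ci", "cath", "mhen", "dant"]
print(sorted(words))                # ['cath', 'ci', 'dant', 'gath', 'mhen', 'pen']
print(sorted(words, key=sort_key))  # ['cath', 'gath', 'ci', 'dant', 'pen', 'mhen']
```

With the lookup, "gath" is listed right after "cath" and "mhen" right after "pen", which is the grouping a dictionary user would expect. Nothing in the key function itself depends on the language; all the linguistic knowledge sits in the table, which is exactly the data that has no standard description for now.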
But note that even different dictionaries for the same language vary in how they perform this preprocessing. For example, a dictionary may include and sort derived terms separately, using a simpler rule that does not require this preprocessing, so it will have more entries, even if they link their actual definition to another entry. Most dictionaries do not include separate entries for derivations like regular conjugations, plurals or declensions, unless they are very irregular. And most dictionaries for languages that have standard mutation rules for initial letters will not list separate entries for these mutations: users know, for example, that if they cannot find a word starting with one of these mutable letters, they should look for the word starting with the unmutated letter.
The same is true for languages that use agglutination: it is not possible to list all possible agglutinations, and users need to know how to recognize the morphemes. Here again, another thing is not described in collation rules: the breaking rules that allow the separation of words or morphemes. But collation may not work correctly without them when agglutination also implies mutations between two agglutinated morphemes. This is clearly a limitation of LDML, but not of the UCA itself.

2012/12/23 Leo Broukhis <[email protected]>
> On Fri, Dec 21, 2012 at 1:49 PM, Whistler, Ken <[email protected]> wrote:
> > Leo Broukhis said:
> >
> >> Granted, not yet, but by itself the argument is invalid. Unicode
> >> collation rules are descriptive;
> >
> > I'm not sure what you mean by that. UTS #10 is a *specification* of an
> > algorithm, with various options for tailoring and parameterization which
> > make it possible to accommodate various needs for particular cases. It is
> > not intended as a descriptive mechanism.
>
> What I meant is that the way its feature set which makes it flexible
> enough had been decided was descriptive, or, if you will, adaptive,
> following the pre-existing collation traditions in various languages
> and/or pre-existing standards.
>
> > Perhaps you are referring to LDML, which includes a formal mechanism for
> > describing a particular collation in terms of the default table and
> > tailoring options and parameterization options of the UCA.
>
> You're right, LDML may be a better application of the word "descriptive".
>
> >> if, for example, a language happens to sort accents backwards, this
> >> rule has to be - and is - accommodated despite its apparent
> >> illogicality;
> >
> > Backwards accent secondary weighting was actually included primarily
> > because of prior art in collation standards, because of the need to be able
> > to synchronize the UCA algorithm with ISO 14651, and because it makes it
> > easier to explain how folks can implement versions of multi-level
> > collation which can pass the conformance tests of the Canadian sorting
> > standard, etc.
> >
> >> along the same lines, if a language happens to make a distinction
> >> discussed in this thread, it has to be accommodated just as well.
> >
> > No, I don't think so.
>
> My question can be construed as a hypothetical: had the described
> Ё-collation been a prior art in collation standards by the time of
> development of LDML and the UCA, how different would they have been?
> I'm hoping for an answer "not at all" or "very little", and "here's
> how it could have been implemented: ...".
>
> > It is rather easy to come up with distinctions or collation requirements
> > which simply cannot be accommodated within the intended bounds of the UCA.
> > For example, sorting all numerical expressions mixed with text strictly by
> > their numeric values, or sorting all (or some specified list) of
> > abbreviations as if they were spelled out, and so forth.
> >
> > Many lexicographical ordering rules cannot be fully accommodated within
> > the context of the UCA algorithm, which is a multilevel *string comparison*
> > specification, and not a dictionary ordering specification.
>
> That is true in general if rules happen to involve semantics, but
> we're discussing a formal rule here.
> Imagine that the backward accents feature was missing from LDML, e.g.
> because it was an emerging trend rather than a standard way of
> collation at the time of formalizing LDML, thus not included in LDML.
> Would you have said the same about it today if someone had asked about
> supporting it ? If not, why not?
>
> >> My question is as follows: does UCA have to be modified (e.g. by
> >> adding another bit flag "word-initial primary" next to the existing
> >> "backward secondary") to support the feature if it were to be
> >> implemented, or is there a way to achieve the "new Russian online
> >> collation" within the existing UCA without modifying the strings to
> >> be sorted before the application of the algorithm?
> >
> > I don't think there is any out-of-the-box way to use UCA so that an
> > implementation would automatically recognize a word boundary context and
> > weight characters conditionally based on that context. So no, I don't think
> > you could get an implementation to do that without first marking up text
> > with additional characters to indicate word boundaries and then tailoring
> > the weight table to weight sequences including that markup accordingly.
>
> My question was narrower: assuming that the strings being compared are
> words, could it be supported without any markup?
> (NB that the "backward accents" feature is also, strictly speaking,
> word-based.)
>
> > But there is another possible sense of the question, "does UCA have to
> > be modified... to support...", i.e. is the UTC somehow required to augment
> > the algorithm to support some particular kind of behavior for a particular
> > language's sorting rules, just because someone has turned up particular odd
> > behavior. And I think the answer to that is clearly no. Oh, and by the way,
> > I don't think LDML must (or should) be augmented to enable it to describe
> > any and all lexicographical ordering practices, either. That isn't the
> > function of LDML.
>
> In other words, after adoption, LDML became prescriptive in the sense
> "don't even think of inventing any sorting rules that cannot be
> described by LDML as it stands; we're not going to augment it". The
> Quebecois were very lucky, then.
>
> Leo

