Leo asked:
My question was narrower: assuming that the strings being compared are
words, could it be supported without any markup?
... where it refers to conditional weighting based on the (identified) word
boundary. And the answer to that is no, unless the word boundary were explicitly
indicated with some kind of markup character, and the sequence of that
markup character plus the target character of interest (in this case Russian
Yo) were then given a tailored contraction in the weight table, weighting it
differently from any Russian Yo not in that particular contraction sequence.
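
To make the contraction idea concrete, here is a minimal sketch in Python. It is
a toy, not UCA or any real tailoring syntax: the marker character "#", the
weight values, the primary-only keys, and the choice to make marked Yo sort
after Ie are all assumptions invented for illustration.

```python
# Toy illustration only: every weight and the marker itself are hypothetical.
MARK = "#"      # hypothetical explicit word-boundary markup character
IE = "\u0435"   # CYRILLIC SMALL LETTER IE
YO = "\u0451"   # CYRILLIC SMALL LETTER IO (Russian Yo)

WEIGHTS = {
    MARK: [],            # the marker on its own is ignorable
    "д": [10],
    IE: [20],
    YO: [20],            # ordinary Yo: same toy primary weight as Ie
    MARK + YO: [25],     # contraction: Yo directly after the marker
    "ж": [30],
    "л": [40],
}

def toy_key(s):
    """Build a primary-only key, preferring two-character contractions."""
    key, i = [], 0
    while i < len(s):
        for length in (2, 1):            # longest match first
            chunk = s[i:i + length]
            if len(chunk) == length and chunk in WEIGHTS:
                key.extend(WEIGHTS[chunk])
                i += length
                break
        else:
            raise KeyError(f"no toy weight for {s[i]!r}")
    return key

# Yo right after the marker gets the contraction weight ...
print(toy_key(MARK + YO + "ж"), toy_key(MARK + IE + "ж"))
# ... while Yo anywhere else is weighted like any other Yo.
print(toy_key(MARK + "л" + YO + "д"), toy_key(MARK + "л" + IE + "д"))
```

Note that the weighting code never looks for a word boundary. It only sees a
two-character sequence in the string, which is why the marker has to be present
in the input in the first place.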
(NB that the backward accents feature is also, strictly speaking,
word-based.)
A correction here. The backwards accents feature in UCA is *not* word-based. As
with any other string being compared via the UCA mechanism, weights are simply
assigned to *all* characters in the string. The difference for weighting when
using the backwards accents feature is that secondary weight significance in
comparison is calculated from the end of the string, instead of the start of
the string. This works when comparing single words, but it is applied
indifferently to entire strings. And it gets the correct results, by the way.
Work it out: you take two strings containing entire phrases in French, which
only differ by accents on some word in the middle of the string. The only
difference in weights assigned will be for the secondary weights for those
accents, and if you use the backwards accents feature they will be calculated
from the end of the string.
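
If you want to work it out mechanically, the following Python sketch may help.
It is a simplified two-level model under stated assumptions, not the UCA itself:
the function name toy_sort_key, the secondary weight values, and the use of code
points as primary weights are all made up for illustration. The only point is
that the secondary weights for the whole string are reversed, with no reference
to words.

```python
import unicodedata

COMMON = 0x20  # toy secondary weight carried by every base character
# Toy secondary weights for a few combining accents (illustrative values).
ACCENT = {"\u0301": 0x24, "\u0300": 0x25, "\u0302": 0x27}

def toy_sort_key(s, backwards=False):
    """Return (primary, secondary) weight lists for the entire string."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", s.lower()):
        if ch in ACCENT:
            secondary.append(ACCENT[ch])   # accents: secondary weight only
        else:
            primary.append(ord(ch))        # toy primary weight: code point
            secondary.append(COMMON)
    if backwards:
        secondary.reverse()                # whole string, not per word
    return (primary, secondary)

def toy_sort(items, backwards=False):
    return sorted(items, key=lambda x: toy_sort_key(x, backwards))

words = ["côté", "coté", "côte", "cote"]
print(toy_sort(words))                  # ['cote', 'coté', 'côte', 'côté']
print(toy_sort(words, backwards=True))  # ['cote', 'côte', 'coté', 'côté']

# The same accent differences buried in the middle of a longer phrase.
phrases = [f"la {w} est ici" for w in words]
print(toy_sort(phrases, backwards=True))
```

In this toy model the phrases come out in the same relative order as the bare
words, even though nothing in the code knows where one word ends and the next
begins.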
Once again, let me emphasize: the UCA algorithm per se simply has no concept at
all of word boundaries. It applies strictly and only to string input, which
could contain *anything*.
In other words, after adoption, LDML became prescriptive in the sense
don't even think of inventing any sorting rules that cannot be
described by LDML as it stands; we're not going to augment it. The
Quebecois were very lucky, then.
No, I think that is an incorrect characterization of the situation for LDML. It
can be, and at times has been, augmented for new parameterizations which make
sense. Those changes, however, have to make sense within the overall context of
the way the multilevel weighting and string comparison algorithm works. The
basic issue here is that UCA is a string weighting and comparison algorithm
which does *not* have any kind of text segmentation logic built in
(whether to identify words, syllables, or any other language-specific segment).
So it simply does not make sense to expect LDML to be augmented to describe
collation behavior that depends on conditional behavior at segmentation
boundaries. That is simply outside the scope of UCA and LDML. It isn't outside
the scope of the bigger issue of sorting and collation behavior in general, of
course -- it is just outside the scope of what UCA addresses.
Incidentally, for the record, backwards weighting of accents for French doesn't
have anything particular to do with Quebecois. It is a feature of *some*
influential French dictionary lexicographical ordering traditions -- in France
-- and not in others.
--Ken