2011/8/31 Mark Davis ☕ <[email protected]>:
> It is not so easy.
> Let's suppose that we had a special character to suppress expansions. We'd
> still have to be able to specify what the alternate collation order should
> be. DUCET has the following for æ. Suppressing the expansion would require
> an alternative weighting. What would that be? Like Danish? After 'a'? That
> would require extra structure to have the alternate value.
>
> 00E6 ; [.15A3.0020.0004.00E6][.0000.015F.0004.00E6][.15FF.0020.001F.00E6] # LATIN SMALL LETTER AE; QQKN
>
> Moreover, it is not even clear that it is a good idea.
> If the collation rules are uniform, then I can expect (say) for Danish to
> always find XæY to sort after XzY in a long list. If the text could contain
> these special characters and change the ordering, then I'd have to look in
> two places, and know what the alternative was, and when it should be used,
> and that the author of the text inserted the right characters, etc.
>
> Moreover, expansions are not fundamentally different than other cases where
> characters sort differently in different languages.
>
> So in my view, this is a fringe feature, that would make the algorithms and
> data structures more complex (and thus slower and possibly less robust).
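To make the quoted point concrete, here is a toy sort-key builder (a minimal sketch, not a real UCA implementation: primary-level only, and all weights are invented) showing why suppressing the æ expansion is not enough by itself — some alternative weight still has to come from somewhere:

```python
# Toy primary weights for a-z (invented; real DUCET weights differ).
PRIMARY = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz", start=1)}

def sort_key(text, ae_expands=True, ae_weight=None):
    """Build a primary-level sort key. By default 'æ' expands to the
    weights of 'a' + 'e' (DUCET-style); suppressing the expansion
    requires supplying an alternative weight explicitly."""
    key = []
    for ch in text:
        if ch == "æ":
            if ae_expands:
                key += [PRIMARY["a"], PRIMARY["e"]]
            else:
                # the "extra structure to have the alternate value"
                key.append(ae_weight)
        else:
            key.append(PRIMARY[ch])
    return key

# Default (expansion): "xæy" sorts like "xaey", i.e. before "xzy".
assert sort_key("xæy") < sort_key("xzy")

# Danish-style alternative: æ gets its own primary weight after z.
def danish(s):
    return sort_key(s, ae_expands=False, ae_weight=PRIMARY["z"] + 1)

assert danish("xæy") > danish("xzy")
```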
My main concern is not about how the DUCET would be built, but about how a tailoring can be built on top of the DUCET (or from the CLDR "root", which is different, and will remain different in some documented cases) when it already contains these contractions. Note the effect of the ordering of collation rules: a contraction changes the behavior of expansions specified in a later tailoring rule. You can see a more precise exhibit of this effect in UTS #35 (LDML), section 5.14.7 (Expansions), when using the "sequence expansion" syntax. Of course you can avoid this effect by using the "normal expansion" syntax, which explicitly separates the base collation element from its contextual "extension" (this syntax is more complex, and often completely misunderstood, in addition to requiring more maintenance). This has already caused me lots of trouble when trying to create a working tailoring based on the DUCET (or now on the CLDR root).

In fact I have other, more general problems with tailorings, for which I think I'll design an alternate generic solution. This involves the tailoring of "variable" collation elements and the Hiragana/Katakana exception (they are exactly similar: the goal is to change the level at which some collation elements are differentiated), and I'm thinking about unifying it as well with the tailoring of case differences (if one wants them to have lower weight than diacritic/variant differences, and also wants to move these secondary differences to tertiary differences). These are kinds of tailoring that are currently impossible to perform with ICU (or with the current LDML specification, independently of the XML or abbreviated syntax).
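The level-shifting idea above can be sketched as follows (a toy model with invented weights, not ICU's actual data structures; it only shows the bookkeeping of choosing which level a given distinction — here, case — is emitted at):

```python
def sort_key(text, case_level=3):
    """Toy three-level sort key. The case distinction is emitted at
    either level 2 or level 3, depending on the tailoring choice."""
    primary = [ord(ch.lower()) for ch in text]     # base letters
    accents = [0 for _ in text]                    # placeholder: no diacritics handled
    case = [1 if ch.isupper() else 0 for ch in text]
    if case_level == 2:
        levels = [primary, case, accents]          # case outranks accent differences
    else:
        levels = [primary, accents, case]          # the usual order: case is tertiary
    key = []
    for lvl in levels:
        key += lvl + [-1]                          # -1 terminates each level
    return key

# Same strings, same distinction, but it is resolved at different levels.
assert sort_key("ab") < sort_key("Ab")                          # tertiary case
assert sort_key("ab", case_level=2) < sort_key("Ab", case_level=2)  # secondary case
```

The same mechanism would cover the Hiragana/Katakana distinction: it is just another per-character property whose comparison level a tailoring should be able to choose.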
I'm not much concerned by the XML syntax; I intend to use exclusively the abbreviated syntax (with some generic additions), simply because it requires much less maintenance and is much more readable (the XML syntax is best generated on the fly from the abbreviated syntax by a simple bot, but most UCA implementations will use another, more compact form based on lookup tables). -- Philippe.
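P.S. For readers unfamiliar with the two notations, here is one rule written both ways (a minimal sketch following UTS #35; the XML element names are those of the LDML collation DTD as I recall it, so treat them as illustrative):

```
Abbreviated syntax:
    & a << á <<< B

Equivalent XML syntax:
    <rules>
      <reset>a</reset>
      <s>á</s>   <!-- "<<": secondary difference -->
      <t>B</t>   <!-- "<<<": tertiary difference -->
    </rules>
```

The one-line form is what a mechanical translator would take as input; the XML form is what it would emit.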

