On Sat, 19 May 2012 01:12:17 +0100 Richard Wordingham <[email protected]> wrote:
> Just in case you haven't already thought of it, one reasonable scheme
> would be to decompose input if and only if searching for contractions
> or the input character could *hide* the start of a contraction, e.g.
> one starting with a combining accent or the non-initial part of an
> Indic vowel.

You may think the suggestion about hiders is excessive, but a real example of hiding occurs when the current Lithuanian collation in CLDR is applied to arbitrary FCD strings. That collation has a humanly unreadable contraction making <U+0307, U+0301> collate the same as U+0301, so as to undo the ill-effects of soft-dottedness. U+0117 LATIN SMALL LETTER E WITH DOT ABOVE is protected from this contraction because it is the subject of yet another contraction. However, even with full optimisation switched on, the ICU demonstrator sorts the NFC and FCD string <U+0227 LATIN SMALL LETTER A WITH DOT ABOVE, U+0301> differently to its NFD equivalent <U+0061, U+0307, U+0301>, which, in accordance with the collation rules, sorts identically to U+00E1 LATIN SMALL LETTER A WITH ACUTE. Toggling the normalisation setting has no effect on the ICU outcome. I don't know if ICU needs another bug report.

Formally, one could just handle it with 22 times 2 (case) times 3 (Lithuanian intonation accents) = 132 derived contractions, as opposed to tagging 46 (44 if clever) characters as needing decomposition.

For the general case, we ought to be able to express a rule such as 'ignore the countering of soft-dottedness', as in Lithuanian casing, but I don't see any finite method of expressing it under the UCA. Similarly, handling Thai-style preposed vowels requires a great many contractions, and handling Lao collation gets even worse - <P,C,T,V> needs to be collated as though it were <C,T,P+V> (or, equivalently for well-formed text, <C,P+V,T>). (We also need not just 'backwards' as an option for Level 2, but a rule that a secondary difference before certain breaks takes precedence over a primary difference after them.)
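To make the hiding of <U+0307, U+0301> inside <U+0227, U+0301> concrete, here is a rough Python sketch using only the standard unicodedata module. It is not ICU's algorithm: is_fcd and contraction_visible are hypothetical helper names invented for this illustration, and the "contraction match" is just a naive code point substring test.

```python
import unicodedata

# Contraction from the Lithuanian example: U+0307 followed by U+0301.
CONTRACTION = "\u0307\u0301"

def is_fcd(s: str) -> bool:
    """True if decomposing each character separately, without any
    reordering, already gives canonically ordered text."""
    prev_trail = 0
    for ch in s:
        decomp = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomp[0])
        # A non-zero leading class lower than the previous trailing
        # class would need reordering, so the string is not FCD.
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = unicodedata.combining(decomp[-1])
    return True

def contraction_visible(s: str) -> bool:
    """Naive code point match, with no decomposition of the input."""
    return CONTRACTION in s

composed = "\u0227\u0301"  # <U+0227, U+0301>
decomposed = unicodedata.normalize("NFD", composed)

print(unicodedata.normalize("NFC", composed) == composed)  # True: already NFC
print(is_fcd(composed))                                    # True: also FCD
print([f"U+{ord(c):04X}" for c in decomposed])             # ['U+0061', 'U+0307', 'U+0301']
print(contraction_visible(composed))                       # False: U+0307 is hidden in U+0227
print(contraction_visible(decomposed))                     # True
```

So a matcher that accepts FCD input without decomposing U+0227 never sees the start of the contraction, which is the case the "hider" tagging is meant to catch.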
I spoke above of the ill-effects of soft-dottedness - I appreciate that not having soft-dottedness causes its own problems.

Richard.

