On Wed, 6 Feb 2013 10:18:33 +0100 Philippe Verdy <verd...@wanadoo.fr> wrote:
> 2013/2/5 Richard Wordingham <richard.wording...@ntlworld.com>:
> > Try doing UCA collation with <U+0302 COMBINING CIRCUMFLEX ACCENT,
> > U+0067 LATIN SMALL LETTER G> being a collation element (with
> > arbitrary collation elements) without doing normalisation.
>
> <0302, 0067> is defective, and its normalisation is still <0302,
> 0067>, it is NOT canonically equivalent to <0067, 0302>
>
> I was not speaking about arbitrary collation elements containing
> defective sequences, is it a real case ?

This one wasn't, but the mediaeval use of tilde to abbreviate a nasal
consonant comes tantalisingly close. The CLDR collation has entries for
<U+0C82 KANNADA SIGN ANUSVARA, U+0C95 KANNADA LETTER KA> (a defective
string) and other combinations making anusvara almost equivalent to the
homorganic nasal. The European analogue would be to make <U+0303
COMBINING TILDE, U+0076 LATIN SMALL LETTER V> sort almost the same as
<U+006E LATIN SMALL LETTER N, U+0076>. A repeated sequence of instances
of U+1E7D LATIN SMALL LETTER V WITH TILDE would then require canonical
decomposition to collate in accordance with the rules, because the
contraction <U+0303, U+0076> only appears across the boundary between
two decomposed letters.

I've already mentioned Burmese as having defective sequences containing
letters (category L). There is a third language in CLDR having such
sequences, but these collating elements are only to support mistypings
of U+0E33 THAI CHARACTER SARA AM.

Richard.
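
P.S. A minimal sketch of the two normalisation points above, using only Python's standard unicodedata module. The contraction <U+0303, U+0076> is the hypothetical European analogue from this message, not an actual CLDR entry, and the variable names are mine. The sketch checks only where the sequence occurs in the text; it does not perform any collation.

```python
import unicodedata


def nfd(s: str) -> str:
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


# <U+0302, U+0067>: combining circumflex followed by 'g' (defective sequence).
defective = "\u0302\u0067"
# <U+0067, U+0302>: 'g' followed by combining circumflex.
g_circumflex = "\u0067\u0302"

# Normalisation leaves the defective sequence unchanged ...
defective_is_stable = nfd(defective) == defective
# ... and the two strings are not canonically equivalent.
canonically_equivalent = nfd(defective) == nfd(g_circumflex)

# Hypothetical contraction from the message: <U+0303, U+0076>.
contraction = "\u0303\u0076"
# Two instances of U+1E7D LATIN SMALL LETTER V WITH TILDE.
text = "\u1e7d\u1e7d"

# The contraction is invisible in the precomposed text and only shows up,
# straddling the two letters, once the text is decomposed.
found_without_nfd = contraction in text
found_with_nfd = contraction in nfd(text)

print(defective_is_stable, canonically_equivalent)
print(found_without_nfd, found_with_nfd)
```

A collator that matched contractions against the undecomposed text would therefore never see <U+0303, U+0076> in this input.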