Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

Richard Wordingham Thu, 07 Feb 2013 14:27:14 -0800

On Wed, 6 Feb 2013 10:18:33 +0100
Philippe Verdy <verd...@wanadoo.fr> wrote:


> 2013/2/5 Richard Wordingham <richard.wording...@ntlworld.com>:

> > Try doing UCA collation with <U+0302 COMBINING CIRCUMFLEX ACCENT,
> > U+0067 LATIN SMALL LETTER G> being a collation element (with
> > arbitrary collation elements) without doing normalisation.
> 
> <0302, 0067> is defective, and its normalisation is still <0302,
> 0067>, it is NOT canonically equivalent to <0067, 0302>
> 
> I was not speaking about arbitrary collation elements containing
> defective sequences, is it a real case?

This wasn't, but the mediaeval use of tilde to abbreviate a nasal
consonant comes tantalisingly close.  The CLDR collation has entries for
<U+0C82 KANNADA SIGN ANUSVARA, U+0C95 KANNADA LETTER KA> (a
defective string) and other combinations making anusvara almost
equivalent to the homorganic nasal.  The European analogue would be to
make <U+0303 COMBINING TILDE, U+0076 LATIN SMALL LETTER V> sort almost
the same as <U+006E LATIN SMALL LETTER N, U+0076>, and then a repeated
sequence of instances of U+1E7D LATIN SMALL LETTER V WITH TILDE would
require canonical decomposition to collate in accordance with the rules.
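
As a minimal sketch of that last point, using only Python's standard
unicodedata module: the contraction <U+0303, U+0076> is the hypothetical
European analogue described above, not a real CLDR entry, and the helper
name is_defective is made up for the illustration.

```python
import unicodedata

# Hypothetical contraction from the European analogue above:
# <U+0303 COMBINING TILDE, U+0076 LATIN SMALL LETTER V>.
TILDE_V = "\u0303\u0076"

def is_defective(s):
    """True if s starts with a combining mark (General_Category M*)."""
    return bool(s) and unicodedata.category(s[0]).startswith("M")

# Two instances of U+1E7D LATIN SMALL LETTER V WITH TILDE in a row.
word = "\u1E7D\u1E7D"

# Canonical decomposition yields v, tilde, v, tilde.
nfd = unicodedata.normalize("NFD", word)

print(is_defective(TILDE_V))  # the contraction is itself a defective string
print(TILDE_V in word)        # is it visible in the precomposed text?
print(TILDE_V in nfd)         # is it visible after canonical decomposition?
```

A matcher that looks for the contraction in the stored code points only
finds it in the decomposed string, where the first tilde is followed by
the second v.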

I've already mentioned Burmese as having defective sequences containing
letters (category L).  There is a third language in CLDR having such
sequences, but these collating elements are only to support mistypings
of U+0E33 THAI CHARACTER SARA AM.
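
For illustration only, and assuming the mistyping in question is the
two-part spelling <U+0E4D THAI CHARACTER NIKHAHIT, U+0E32 THAI CHARACTER
SARA AA> (I have not listed the actual CLDR entries here), Python's
standard unicodedata module shows why canonical normalisation does not
fold the two spellings together:

```python
import unicodedata

SARA_AM = "\u0E33"            # THAI CHARACTER SARA AM
TWO_PART = "\u0E4D\u0E32"     # NIKHAHIT + SARA AA (assumed mistyping)

# SARA AM has only a compatibility decomposition, so canonical
# normalisation (NFD/NFC) leaves it untouched ...
nfd = unicodedata.normalize("NFD", SARA_AM)

# ... while compatibility decomposition (NFKD) splits it in two.
nfkd = unicodedata.normalize("NFKD", SARA_AM)

print(nfd == SARA_AM)
print(nfkd == TWO_PART)
print(unicodedata.category(TWO_PART[0]))  # general category of NIKHAHIT
```

Since NFD and NFC leave U+0E33 alone, a collation that wants the two
spellings to sort together has to say so with its own collating
elements.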

Richard.
