Re: Collation Grapheme Clusters and Canonical Equivalence

2019-10-18 Thread Richard Wordingham via Unicode
On Thu, 17 Oct 2019 23:11:55 +0100
Richard Wordingham via Unicode wrote:

> There seems to be a Unicode non-compliance (C6) issue in the
> definition of collation grapheme clusters (defined in UTS#10 Section
> 9.9).  Using the DUCET collation, the canonically equivalent strings
> รู้  U+0E49 THAI CHARACTER MAI THO> and รัู 
> decompose into collation grapheme clusters in two different ways.
> The first decomposes into  and  and the
> second decomposes into  and .  

Correction:

One has to take the collating elements in NFD order, so the tone mark
(secondary weight) and the vowel (primary weight) also form a cluster,
so the division into clusters is , .  This
split respects canonical equivalence.
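
As a minimal sketch of the NFD ordering relied on here, the following uses Python's standard unicodedata module to show the canonical combining classes of the vowel and the tone mark. The tone-first spelling is an assumption chosen for illustration and is not necessarily the sequence used in the message above.

```python
import unicodedata

RO_RUA = "\u0E23"   # THAI CHARACTER RO RUA
SARA_UU = "\u0E39"  # THAI CHARACTER SARA UU (vowel, primary weight in DUCET)
MAI_THO = "\u0E49"  # THAI CHARACTER MAI THO (tone mark)

def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)

# Two spellings that differ only in the order of the two marks.
# tone_first is a hypothetical spelling picked for this illustration.
vowel_first = RO_RUA + SARA_UU + MAI_THO
tone_first = RO_RUA + MAI_THO + SARA_UU

# The vowel has the lower combining class, so NFD puts it before the tone mark.
for ch in (SARA_UU, MAI_THO):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ccc={unicodedata.combining(ch)}")

# Both spellings have the same NFD form, i.e. they are canonically equivalent.
print(nfd(vowel_first) == nfd(tone_first))
```

Because the collating elements are taken in NFD order, both spellings present the vowel before the tone mark to the clustering step.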

Replacement:

Now, one form of typo one may see in Thai is where the
vowel is typed twice.  Thai fonts often lack mark-to-mark positioning
for sequences that should not occur, so the two copies of the vowel may
be overlaid.  Proof-reading will not spot the mistake if the font or
layout engine does not assist.

Thus we can get  (417,000 raw Google
hits, the first 10 all good).  That splits into *three* collation
grapheme clusters - ,  and .  Its
canonical equivalent  splits into two
grapheme clusters because, to form a sequence of collating elements
without skipping starting at the U+0E49, one must take all three
characters.  Overall, we end up with *two* collation grapheme clusters,
 and .
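
The doubled-vowel case can be checked in the same way. This is only a sketch: the code point sequences below are my reading of the typo described above (vowel, tone mark, vowel) and of a canonically equivalent spelling that starts its marks with U+0E49, so treat them as assumptions rather than the exact strings of the message.

```python
import unicodedata

RO_RUA, SARA_UU, MAI_THO = "\u0E23", "\u0E39", "\u0E49"

def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)

def code_points(s):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Assumed typo: the vowel typed once before and once after the tone mark.
typed = RO_RUA + SARA_UU + MAI_THO + SARA_UU
# Assumed canonically equivalent spelling with the tone mark first.
tone_first = RO_RUA + MAI_THO + SARA_UU + SARA_UU

for label, s in (("typed", typed), ("tone first", tone_first)):
    print(f"{label:10} {code_points(s)} -> NFD {code_points(nfd(s))}")
```

Both spellings normalise to the same NFD string, so any difference in how they are divided into collation grapheme clusters is a difference between canonically equivalent strings.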

> Thus UTS#18 RL3.2 'Tailored Grapheme Clusters', namely "To meet this
> requirement, an implementation shall provide for collation grapheme
> clusters matches based on a locale's collation order", requires
> canonically equivalent sequences to be interpreted differently.

Richard.



Collation Grapheme Clusters and Canonical Equivalence

2019-10-17 Thread Richard Wordingham via Unicode
There seems to be a Unicode non-compliance (C6) issue in the definition
of collation grapheme clusters (defined in UTS#10 Section 9.9).  Using
the DUCET collation, the canonically equivalent strings รู้  and รัู  decompose into collation
grapheme clusters in two different ways.  The first decomposes into
 and  and the second decomposes into  and .

Thus UTS#18 RL3.2 'Tailored Grapheme Clusters', namely "To meet this
requirement, an implementation shall provide for collation grapheme
clusters matches based on a locale's collation order", requires
canonically equivalent sequences to be interpreted differently.

Is this a known issue?

Should I report it against UTS#10 or UTS#18?

Is the phrase 'collation order' intended to preclude the use of search
collations?  Search collations allow one to find a collation grapheme
cluster starting with U+0E15 THAI CHARACTER TO TAO in its exemplifying
word เต่า .  DUCET splits it into , , but most (all?) CLDR search collations split
it into , , , matching the division
into grapheme clusters.

If we accept that in the Latin script Vietnamese tone marks have
primary weights (this only shows up with strings more than one
syllable long), I can produce more egregious examples based on the
various sequences canonically equivalent to U+1EAD LATIN SMALL LETTER A
WITH CIRCUMFLEX AND DOT BELOW or to U+1EDB LATIN SMALL LETTER O WITH
HORN AND ACUTE.
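
For reference, the sequences canonically equivalent to those two characters can be enumerated with a short sketch like the one below. It only demonstrates the canonical equivalence; it says nothing about the collation weights, which depend on the collation in use.

```python
import unicodedata

def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)

def nfc(s):
    """Return the canonical composition (NFC) of s."""
    return unicodedata.normalize("NFC", s)

# Spellings of U+1EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW.
spellings_1EAD = [
    "\u1EAD",         # precomposed
    "a\u0323\u0302",  # a, dot below, circumflex (NFD order)
    "a\u0302\u0323",  # a, circumflex, dot below
    "\u1EA1\u0302",   # a with dot below, circumflex
    "\u00E2\u0323",   # a with circumflex, dot below
]

# Spellings of U+1EDB LATIN SMALL LETTER O WITH HORN AND ACUTE.
spellings_1EDB = [
    "\u1EDB",         # precomposed
    "o\u031B\u0301",  # o, horn, acute (NFD order)
    "o\u0301\u031B",  # o, acute, horn
    "\u01A1\u0301",   # o with horn, acute
    "\u00F3\u031B",   # o with acute, horn
]

def all_equivalent(spellings):
    """True if every spelling has the same NFD form."""
    return len({nfd(s) for s in spellings}) == 1

print(all_equivalent(spellings_1EAD), all_equivalent(spellings_1EDB))
```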

The root of the problem is the desire to match only contiguous
substrings.  This does not play nicely with canonical equivalence.
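
One way to see the tension, as a rough sketch only: the hypothetical helper below does a plain code point substring search on NFD text, not a collation-based match, and shows that a base letter plus circumflex is not a contiguous substring of the NFD form of U+1EAD, because the dot below sorts between them.

```python
import unicodedata

def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)

def contiguous_match(pattern, text):
    """Naive contiguous substring test on NFD code points (illustration only)."""
    return nfd(pattern) in nfd(text)

# U+00E2 (a with circumflex) against U+1EAD: NFD of the text is
# a, U+0323, U+0302, so a + U+0302 is not contiguous in it.
print(contiguous_match("\u00E2", "\u1EAD"))

# U+1EA1 (a with dot below) against U+1EAD: a + U+0323 is contiguous.
print(contiguous_match("\u1EA1", "\u1EAD"))
```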

Richard.