On Thu, 17 Oct 2019 23:11:55 +0100
Richard Wordingham via Unicode wrote:
> There seems to be a Unicode non-compliance (C6) issue in the
> definition of collation grapheme clusters (defined in UTS#10 Section
> 9.9). Using the DUCET collation, the canonically equivalent strings
> รู้ <U+0E23 THAI CHARACTER RO RUA, U+0E39 THAI CHARACTER SARA UU,
> U+0E49 THAI CHARACTER MAI THO> and รัู
> decompose into collation grapheme clusters in two different ways.
> The first decomposes into and and the
> second decomposes into and .
Correction:
One has to take the collating elements in NFD order, so the tone mark
(secondary weight) and the vowel (primary weight) also form a cluster,
so the division into clusters is , . This
split respects canonical equivalence.
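
The NFD reordering can be checked with a minimal sketch, assuming Python's
standard unicodedata module. The variable and function names are mine, and
this only demonstrates canonical ordering, not the UTS#10 cluster algorithm
itself:

```python
import unicodedata

RO_RUA, SARA_UU, MAI_THO = "\u0e23", "\u0e39", "\u0e49"

# The two canonically equivalent spellings: vowel before tone mark,
# and tone mark before vowel.
vowel_first = RO_RUA + SARA_UU + MAI_THO
tone_first = RO_RUA + MAI_THO + SARA_UU

def code_points(s):
    """Format a string as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in s]

def canonical_classes(s):
    """Canonical combining class of each character (0 for a starter)."""
    return [unicodedata.combining(c) for c in s]

nfd_vowel_first = unicodedata.normalize("NFD", vowel_first)
nfd_tone_first = unicodedata.normalize("NFD", tone_first)

# SARA UU has a lower combining class than MAI THO, so NFD puts the
# vowel first whichever way the string was typed.
print(canonical_classes(tone_first))
print(code_points(nfd_tone_first))
print(nfd_vowel_first == nfd_tone_first)
```

Because both spellings have the same NFD form, any clustering computed from
the NFD sequence is necessarily the same for both.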
Replacement:
Now, one form of typo one may see in Thai is where the
vowel is typed twice. Thai fonts often lack mark-to-mark positioning
for sequences that should not occur, so the two copies of the vowel may
be overlaid. Proof-reading will not spot the mistake if the font or
layout engine does not assist.
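
A machine check is easy enough where the eye fails. The following is a
rough sketch, assuming Python's unicodedata module. The function name
doubled_marks and the sample string are illustrative assumptions, not
taken from any real text or tool:

```python
import unicodedata

def doubled_marks(text):
    """Return (index, character) pairs where a combining mark directly
    follows an identical mark in the NFD form of text."""
    nfd = unicodedata.normalize("NFD", text)
    hits = []
    for i in range(1, len(nfd)):
        same_as_previous = nfd[i] == nfd[i - 1]
        # General categories Mn, Mc and Me all begin with "M".
        is_mark = unicodedata.category(nfd[i]).startswith("M")
        if same_as_previous and is_mark:
            hits.append((i, nfd[i]))
    return hits

# Assumed example of the typo: SARA UU typed both before and after
# the tone mark MAI THO.
sample = "\u0e23\u0e39\u0e49\u0e39"
print(doubled_marks(sample))
```

Normalising first matters here: in the typed order the two vowels are
separated by the tone mark, and only canonical reordering brings them
together. Marks with combining class 0 are not reordered, so a repeat
split by another mark would escape this check.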
Thus we can get (417,000 raw Google
hits, the first 10 all good). That splits into *three* collation
grapheme clusters - , and . Its
canonical equivalent splits into two
grapheme clusters, because to form a sequence of collating elements
without skipping starting at the U+0E49, one must take all three
characters. Overall, we end up with *two* collation grapheme clusters,
and .
> Thus UTS#18 RL3.2 'Tailored Grapheme Clusters', namely "To meet this
> requirement, an implementation shall provide for collation grapheme
> clusters matches based on a locale's collation order", requires
> canonically equivalent sequences to be interpreted differently.
Richard.