Hi Richard,

I was looking again at your example where U+0344 causes bad results in
collation of FCD strings. See inline below.

On Tue, Feb 12, 2013 at 12:19 PM, Richard Wordingham <
[email protected]> wrote:

> On Mon, 11 Feb 2013 17:13:58 -0800
> Markus Scherer <[email protected]> wrote:
>
> > I would not revise FCD itself. For a number of processes, it is
> > sufficient as is. For collation it's not.
> >
> > About the Tibetan precomposed vowels:
> >
> > For the LDML spec, I submitted a CLDR ticket this morning:
> > http://unicode.org/cldr/trac/ticket/5667
>
> If we want to proceed along the current lines, then all we need is
> 'CFCD' (Collation FCD), which differs from FCD by excluding characters
> that decompose to two or more characters of which none have canonical
> combining class zero.  The motivation for the sterner exclusion is
> provided by adding the following contrived collating elements to a
> default collation:
>
> <U+03B1 GREEK SMALL LETTER ALPHA, U+0308 COMBINING DIAERESIS>
> <U+0301 COMBINING ACUTE ACCENT, U+0345 COMBINING GREEK YPOGEGRAMMENI>
>
> Proper canonical closure then requires contractions for:
> a) <U+03B1, U+0344 COMBINING GREEK DIALYTIKA TONOS> - this sequence is
> canonically equivalent to <U+03B1, U+0308, U+0301>,
> b) <U+03B1, U+0344, U+0345>, and
> c) <U+0344, U+0345>
>

This "proper canonical closure" assumes adding contractions for overlaps
between existing contractions and decomposition mappings.

Canonical closure will then also add the decompositions of b) and c):
d) <03B1, 0308, 0301, 0345>
e) <0308, 0301, 0345>
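
As a quick sanity check of the equivalences above, here is a minimal Python sketch using only the standard unicodedata module. The helper names nfd and hexes are mine, purely for illustration; nothing here reflects how ICU or CLDR computes the closure.

```python
import unicodedata

def nfd(s):
    # Full canonical decomposition with canonical reordering.
    return unicodedata.normalize("NFD", s)

def hexes(s):
    # Render a string as space-separated code points.
    return " ".join(f"{ord(c):04X}" for c in s)

# Contractions a), b), c) from the quoted message.
a = "\u03B1\u0344"
b = "\u03B1\u0344\u0345"
c = "\u0344\u0345"

for label, s in (("a", a), ("b", b), ("c", c)):
    print(label, hexes(s), "->", hexes(nfd(s)))
```

The decomposed forms of b) and c) are the sequences listed as d) and e).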

> Now consider the sequence <U+03B1, U+0359 COMBINING ASTERISK BELOW,
> U+0344, U+0345>.  Using the extended set of contractions, this
> splits into the discontiguous collating elements <U+03B1, U+0344,
> U+0345> and <U+0359>.
>
> However, using the original contractions along with normalisation, we
> obtain the collating elements <U+03B1, U+0308>, <U+0359>, <U+0301,
> U+0345>, which in general will sort differently.
>

This is true when "using the original contractions", but I would argue that
the goal of canonical closure is that *with the canonically-closed
mappings* we get the same result for FCD input text (minus the Tibetan
composite vowels) as for NFD input text. Such an implementation will,
however, get different results for NFD input than an implementation without
overlap closure.
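
To make the FCD/NFD distinction concrete, here is a rough Python sketch of an FCD test, assuming the usual formulation: for each character, take the combining classes of the first and last characters of its canonical decomposition, and require that the trailing class of one character never exceeds a non-zero leading class of the next. The function names are mine and this is not ICU's implementation.

```python
import unicodedata

def lead_trail_ccc(ch):
    # Combining classes of the first and last characters of NFD(ch).
    d = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(d[0]), unicodedata.combining(d[-1])

def is_fcd(s):
    # True if decomposing each character in place needs no reordering.
    prev_trail = 0
    for ch in s:
        lead, trail = lead_trail_ccc(ch)
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True

example = "\u03B1\u0359\u0344\u0345"
print(is_fcd(example))
print(" ".join(f"{ord(c):04X}" for c in unicodedata.normalize("NFD", example)))
```

The example sequence passes this test even though it is not in NFD, because U+0344 is left undecomposed.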

In your example: With the canonical closure adding contraction d) we obtain
the collating elements <03B1, 0308, 0301, 0345>, <0359> which will collate
the same as the FCD version.
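
The two outcomes can be reproduced with a toy model of contraction matching on NFD text. This is a simplified sketch of longest contiguous match followed by discontiguous extension over unblocked non-starters, in the spirit of the UCA matching step; it assumes every single character has a mapping, it ignores weights entirely, and the function name collation_units is mine. It is not how ICU is implemented.

```python
import unicodedata

def ccc(ch):
    return unicodedata.combining(ch)

def collation_units(text, contractions):
    # Split NFD(text) into the character groups that map to collation elements.
    chars = list(unicodedata.normalize("NFD", text))
    keys = {unicodedata.normalize("NFD", c) for c in contractions}
    units = []
    while chars:
        # Longest contiguous prefix that is a contraction, else one character.
        n = 1
        for k in range(len(chars), 1, -1):
            if "".join(chars[:k]) in keys:
                n = k
                break
        s = "".join(chars[:n])
        rest = chars[n:]
        # Discontiguous extension: try each following non-starter in turn.
        i = 0
        max_skipped = 0  # highest ccc skipped so far; blocks any ccc <= it
        while i < len(rest) and ccc(rest[i]) != 0:
            c = rest[i]
            if ccc(c) > max_skipped and s + c in keys:
                s += c
                del rest[i]
            else:
                max_skipped = max(max_skipped, ccc(c))
                i += 1
        units.append(s)
        chars = rest
    return units

original = ["\u03B1\u0308", "\u0301\u0345"]
closed = original + ["\u03B1\u0344", "\u03B1\u0344\u0345", "\u0344\u0345"]
text = "\u03B1\u0359\u0344\u0345"

for name, table in (("original", original), ("closed", closed)):
    groups = collation_units(text, table)
    print(name, [" ".join(f"{ord(c):04X}" for c in g) for g in groups])
```

Note that in this model contraction a) is what lets the match grow step by step from <03B1, 0308> to d); without it the discontiguous extension stops after U+0308.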

I think we should remove U+0344 from the FCD exclusions
<http://unicode.org/repos/cldr/trunk/specs/ldml/tr35-collation.html#Collation_Settings>
where I added it a few weeks ago. Instead, we should document that an
implementation (like ICU currently) which does not add the overlap
contractions will get some different FCD/NFD results, and an implementation
that does add the overlaps will get some different results for NFD than an
implementation that doesn't add the overlaps.

markus
