> Does anyone feel up to rigorously justifying revisions to the concepts
> and algorithms of FCD and canonical closure?  Occasionally one will
> encounter cases where the canonical closure is infinite - in these
> cases, normalisation will be necessary regardless of the outcome of the
> FCD check.

Personally, no. One of the reasons I resisted incorporating canonical 
closure into the basic UCA algorithm and the DUCET table is its 
infinitesimal ROI. It complicates the table and its processing substantially, 
all in service of "fixing" edge cases of edge cases, which have to be dealt 
with in tailorings anyway.

I think the current wording of Section 6.5 in UCA is appropriate as is. It 
doesn't say you must or should use FCD, but rather that you should do the right 
thing for strings that are in FCD, even if not normalizing. If that is hard or 
impossible for some edge-case tailorings or for the weird (and deprecated) 
sequences in Tibetan, then those are exactly the edge cases I am talking about, 
which aren't worth handling in the basic algorithm.
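
(For concreteness, "doing the right thing for strings in FCD, even if not 
normalizing" can look like something along the following lines. This is just 
my own sketch using ICU4J's Normalizer2 FCD mode and the Collator 
decomposition setting, not anything normative from UCA, and whether skipping 
decomposition is actually safe for a particular tailoring is exactly the 
question Richard is raising.)

    import com.ibm.icu.text.Collator;
    import com.ibm.icu.text.Normalizer2;
    import java.util.Locale;

    public class FcdAwareCompare {
        // ICU4J exposes an FCD "normalizer" whose isNormalized() acts as
        // an FCD test over a string.
        private static final Normalizer2 FCD =
            Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.FCD);

        // Compare two strings, paying for canonical decomposition only
        // when one of the inputs is not already in FCD.
        public static int compare(String a, String b, Locale locale) {
            Collator coll = Collator.getInstance(locale);
            if (FCD.isNormalized(a) && FCD.isNormalized(b)) {
                coll.setDecomposition(Collator.NO_DECOMPOSITION);
            } else {
                coll.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
            }
            return coll.compare(a, b);
        }
    }

The only point of the sketch is that the FCD test is cheap relative to full 
normalization, which is why implementations bother with it at all.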

> 
> Perhaps one could merely revise the definition of FCD, and devise a test
> for the adequacy of the current canonical closure.  If the collation
> fails this adequacy test, then again disabling normalisation should be
> prohibited.  (I would suggest that in these cases the normalisation
> setting should be overridden with only the gentlest of chidings.)

FCD ("fast C or D" form) isn't part of the Unicode Standard, or of UCA, for 
that matter. It is an implementation optimization promulgated in ICU. So 
tweaking its definition would be a matter for ICU, in my opinion.
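
(For anyone not steeped in ICU internals, the FCD condition itself is easy to 
state: a string is in FCD unless some character has a nonzero lead canonical 
combining class that is lower than the trail canonical combining class of the 
character before it. A rough check along those lines, using ICU4J's lead/trail 
combining-class properties, might look like this; take it as an illustration 
of the definition, not as ICU's actual implementation.)

    import com.ibm.icu.lang.UCharacter;
    import com.ibm.icu.lang.UProperty;

    public class FcdCheck {
        // True if the string satisfies the FCD condition: no character has
        // a nonzero lead canonical combining class lower than the trail
        // canonical combining class of the preceding character.
        public static boolean isFcd(String s) {
            int prevTrailCC = 0;
            int i = 0;
            while (i < s.length()) {
                int cp = s.codePointAt(i);
                i += Character.charCount(cp);
                int leadCC = UCharacter.getIntPropertyValue(
                    cp, UProperty.LEAD_CANONICAL_COMBINING_CLASS);
                if (leadCC != 0 && leadCC < prevTrailCC) {
                    return false;
                }
                prevTrailCC = UCharacter.getIntPropertyValue(
                    cp, UProperty.TRAIL_CANONICAL_COMBINING_CLASS);
            }
            return true;
        }
    }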

As regards the normalization on/off parameter, although UCA mentions it as a 
possible tailoring one could do, it goes no further. The detailed definition 
of such a parameter now belongs to LDML and the CLDR-TC, and to their use of 
it in defining locales. Personally, I think it should stay that way.
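
(If I recall correctly, UTS #35 defines that setting as the collation 
parameter "normalization", surfaced in locale identifiers via the BCP 47 "kk" 
(colNormalization) keyword. The little ICU4J sketch below is only my guess at 
one plausible way to exercise it; whether a given ICU build honors the keyword 
for collation is an assumption to check against the ICU documentation, not 
something UCA says anything about.)

    import com.ibm.icu.text.Collator;
    import com.ibm.icu.util.ULocale;

    public class NormalizationSetting {
        public static void main(String[] args) {
            // Locale-keyword route: "kk" is LDML's colNormalization key.
            // (Assumption: this ICU build honors the keyword for collation.)
            Collator viaKeyword =
                Collator.getInstance(ULocale.forLanguageTag("en-u-kk-true"));

            // Explicit-API route: in ICU4J the same switch surfaces as the
            // collator's decomposition mode.
            Collator viaApi = Collator.getInstance(ULocale.ENGLISH);
            viaApi.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

            // Either way, canonically equivalent inputs should compare equal
            // even when they arrive unnormalized.
            String precomposed = "\u00E9";   // e with acute
            String decomposed  = "e\u0301";  // e + combining acute
            System.out.println(viaApi.compare(precomposed, decomposed) == 0);
            System.out.println(viaKeyword.compare(precomposed, decomposed) == 0);
        }
    }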

I don't doubt that there are real issues in some collation tailorings defined 
in CLDR (or prospective problems for tailorings that someone might want to 
*add* to CLDR), but the issues around those should be handled in the CLDR-TC, I 
think.

> 
> A lazy option would be to wait (how long?) and then remove the option of no
> normalisation on the ground that sufficient computing power is
> available.

Unfortunately, I don't think that is ever going to be an option. This year, in 
2013, I still know engineers who are busy tweaking code for speed in databases 
because the C or C++ library implementations of memmove() are not fast enough 
for their taste! Anything as time-critical as basic string comparisons in 
sorting is always going to attract attention for optimization.

--Ken

> 
> Thoughts, anyone?
> 
> Richard.


