On 7/6/2011 1:40 PM, Mark Davis ☕ wrote:

        The other two are special cases; they casefold together
        because of the way that the full case mapping is computed.
        Their equivalence is normally captured by a
        canonical-equivalent folding. Because the simple folding is
        only codepoint by codepoint, and only resulting in single
        code points, they can't be added.

    I didn't understand the sentence above.  But would it be fair to
    say that a plausible case could be made for FB06 folding to FB05
    simply, but that there really shouldn't be a simple fold for the
    other two cases?


Yes, that's what I mean. You can propose all three if you want, via the reporting form, but I think only #1 is a real possibility (IMO).

For those following along (or not), this has to do with entries in
CaseFolding.txt. The current relevant sections of CaseFolding.txt are:

FB05; F; 0073 0074; # LATIN SMALL LIGATURE LONG S T
FB06; F; 0073 0074; # LATIN SMALL LIGATURE ST

0390; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
1FD3; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA

03B0; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
1FE3; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA

What Karl is suggesting amounts to updating those entries to:

FB05; S; FB06; # LATIN SMALL LIGATURE LONG S T
FB05; F; 0073 0074; # LATIN SMALL LIGATURE LONG S T
FB06; F; 0073 0074; # LATIN SMALL LIGATURE ST

0390; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
1FD3; S; 0390; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
1FD3; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA

03B0; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
1FE3; S; 03B0; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
1FE3; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA

Note that I think the plausible simple folding for the first group is FB05 *to* FB06, not vice versa.
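
For those who want to see how the status field drives the two kinds of folding, here is a minimal Python sketch. It assumes the usual reading of CaseFolding.txt (simple folding uses statuses C and S, full folding uses C and F). The names parse_case_folding, simple_fold, and full_fold are made up for illustration, and the S lines in the sample data are the proposed entries, not lines from the published file.

```python
# Illustrative only: the S entries below are the *proposed* additions,
# not lines from the published CaseFolding.txt.
PROPOSED = """\
FB05; S; FB06; # LATIN SMALL LIGATURE LONG S T
FB05; F; 0073 0074; # LATIN SMALL LIGATURE LONG S T
FB06; F; 0073 0074; # LATIN SMALL LIGATURE ST
0390; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
1FD3; S; 0390; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
1FD3; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
"""


def parse_case_folding(text):
    """Return (simple, full) dicts mapping a character to its folded string."""
    simple, full = {}, {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        code, status, mapping = [f.strip() for f in data.split(";")[:3]]
        source = chr(int(code, 16))
        folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
        if status in ("C", "S"):  # simple folding: one code point out
            simple[source] = folded
        if status in ("C", "F"):  # full folding: may expand to several
            full[source] = folded
    return simple, full


SIMPLE, FULL = parse_case_folding(PROPOSED)


def simple_fold(s):
    # Characters with no C or S entry fold to themselves.
    return "".join(SIMPLE.get(ch, ch) for ch in s)


def full_fold(s):
    # Characters with no C or F entry fold to themselves.
    return "".join(FULL.get(ch, ch) for ch in s)


print(simple_fold("\uFB05") == "\uFB06")
print(full_fold("\u1FD3") == full_fold("\u0390"))
```

With only these sample lines loaded, every other character folds to itself, so the sketch is no substitute for the real data file.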

As for the other two, taking the 0390/1FD3 pair as the example
we would have, currently, for simple case folding:

simpleCaseFold(0390) = 0390
simpleCaseFold(1FD3) = 1FD3

simpleCaseFold(NFD(0390)) = 03B9 0308 0301
simpleCaseFold(NFD(1FD3)) = 03B9 0308 0301

and for full case folding:

CaseFold(0390) = 03B9 0308 0301
CaseFold(1FD3) = 03B9 0308 0301

CaseFold(NFD(0390)) = 03B9 0308 0301
CaseFold(NFD(1FD3)) = 03B9 0308 0301
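
The full case folding rows can be sanity-checked in Python, where str.casefold() is documented as implementing the Unicode case folding algorithm and unicodedata.normalize provides NFD. This is only a sketch that prints whatever the interpreter's bundled Unicode data yields. Python has no built-in simple case folding, so the simpleCaseFold rows are not covered, and the helper names nfd and hexes are made up here.

```python
import unicodedata


def nfd(s):
    """Canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


def hexes(s):
    """Render a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in s)


for ch in ("\u0390", "\u1FD3"):
    # Full case folding applied directly to the precomposed character.
    print(f"CaseFold({hexes(ch)}) = {hexes(ch.casefold())}")
    # Full case folding applied after canonical decomposition.
    print(f"CaseFold(NFD({hexes(ch)})) = {hexes(nfd(ch).casefold())}")
```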

In all of these instances, because 1FD3 is canonically equivalent to 0390, the results of the folding are canonically equivalent. While there might not be any actual prohibition against adding a simple case folding of 1FD3 to 0390 explicitly in CaseFolding.txt, I don't see that it buys anybody anything. This is roughly the
same problem as, for example:

simpleCaseFold(00E1) = 00E1
simpleCaseFold(0061 0301) = 0061 0301

simpleCaseFold(NFD(00E1)) = 0061 0301
simpleCaseFold(NFD(0061 0301)) = 0061 0301

and noting that the results of the simpleCaseFold of those two different
sources are canonically equivalent, even if you don't do the normalization
before the case folding. An application which is doing case folding, but
which isn't checking for canonical equivalence, is kinda out to lunch anyway,
as this example demonstrates.
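
As an illustration of a comparison that does check canonical equivalence, here is a minimal Python sketch. It follows the Unicode Standard's definition of a canonical caseless match, NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y))), but uses full folding via str.casefold() rather than simple folding, since that is what Python provides. The function name canonical_caseless_match is made up here.

```python
import unicodedata


def canonical_caseless_match(x, y):
    """True if x and y match caselessly up to canonical equivalence."""
    def key(s):
        decomposed = unicodedata.normalize("NFD", s)
        # Folding can produce text that is no longer in NFD, so normalize again.
        return unicodedata.normalize("NFD", decomposed.casefold())
    return key(x) == key(y)


precomposed = "\u00E1"   # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # a + COMBINING ACUTE ACCENT

# Folding alone leaves the two spellings as different code point sequences.
print(precomposed.casefold() == decomposed.casefold())
# Adding normalization makes them compare equal.
print(canonical_caseless_match(precomposed, decomposed))
print(canonical_caseless_match("\u1FD3", "\u0390"))
```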

So while I don't quite understand Mark's claim that "they can't be added", I
would say that I agree at least that I don't see any point to adding them.

I'm not sure whether the FB05/FB06 instance is important enough to add
or not. Neither of those compatibility ligatures should ordinarily be used
in text, anyway, and it is hard to see that an algorithmic neatness argument
buys much here in the way of actual utility.

--Ken

