On Wed, 16 May 2012 09:17:51 -0700 Markus Scherer <[email protected]> wrote:
> On Wed, May 16, 2012 at 1:24 AM, Richard Wordingham
> <[email protected]> wrote:
>
> > Section 5.1 of the UCA says that one may have a parametric
> > normalisation tailoring.

> Section 5.1 is about runtime parameters/attributes applied
> orthogonally to the specification of a Collation Element Table.

For the tailorings 'strength', 'alternate', 'backwards', 'caseLevel',
and variableTop, I agree. For 'numeric' I expect you're right, but I
can imagine complications.

The tailoring 'locale' is not orthogonal.

The tailoring 'caseFirst' rather reshuffles the tertiary weights. I am
not entirely convinced it is orthogonal, and I'm not sure how it should
interact with the ordering of Danish 'aa', 'AA', 'Aa', 'aA', 'Å' and
'å'. It makes sense if all tailorings start with the DUCET (which is
the only case that the UCA definition cares about) and it is applied
before any reorderings of characters, but I am not sure that it is
orthogonal. What if I choose to have A <<< a < b <<< B on top of a
fixed ordering for the other pairs? Is it still orthogonal?

Similar remarks apply to 'reorder'. What if I move 'Q' and 'q' into the
Cyrillic sequence? (I've a recollection that this letter is used in
Kurdish written in Cyrillic.) I have been wondering if U+0078 LATIN
SMALL LETTER X should be made common script because of its use for
displaying Lao vowels, but perhaps the principle of separation of
scripts should lead to LAO LETTER SMALL X.

I can conceive of complications for hiraganaQuaternary if one
individually tailors quaternary weights! (The tertiary equality of some
of the mathematical letters feels wrong to me, though there may be
better ways of sorting that anomaly out than playing with quaternary
weights.)

> There is not as much to it as you seem to think. What
> normalization=off does is turn off the first step of the UCA
> algorithm, namely NFD normalization, and you should only do so if you
> know or assume that your text is already normalized so step 1 would
> be a no-op.
> Then it recommends that an implementation that offers
> this get the correct results if text is in any form of FCD.

I presume the UCA and the Unicode Locale Data Markup Language (LDML)
are meant to be aligned. In the LDML definition
(http://unicode.org/reports/tr35/#Collation_Elements), it says, "If
'on', then the normal [UCA] algorithm is used. If 'off', then all
strings that are in [FCD] will sort correctly, but others will not
necessarily sort correctly". 'Will' is stronger than 'should'.

> The UCA conformance statement does not explicitly cover behavior under
> these parameters, I believe, but if an implementation gets bad
> results for input for which it purports to get good ones, then that's
> a bug.

The way I am now reading this is that if a collation is tailored with
normalisation 'off', then it is the responsibility of the user to use
only FCD strings, and if he does not he cannot rely on its definitions
being honoured. How a tailorable implementation implements this is up
to it. It might choose to ignore the optimisation opportunity and
always perform the NFD normalisation, it might decompose but not
reorder, or it might use some subtler technique such as decomposing
Tibetan vowels and applying singleton decompositions (e.g. 212B
ANGSTROM SIGN) and decomposing characters whose decomposition starts
with 'A' or 'a' (for Danish sanity!). The mechanism chosen would have
to depend on the tailorings applied.

Is this interpretation correct? No-one has confirmed that a tailoring
of normalisation need not have any effect.

> > (I don't dispute that solving 8052 is
> > likely to solve 9323.) However, it would surprise me if the
> > collation behaviour of <U+0FB2 U+0334 U+0F81> were changed. In so
> > far as it is linguistically meaningful, it is an error in DUCET
> > that it doesn't sort the same as <U+0FB2 U+0F81 U+0334>. (Of
> > course, Tibetan collation in DUCET is already very wrong for
> > Tibetan script languages.)

Correction: "would NOT surprise me".

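For anyone who wants to check the two Tibetan sequences quoted above,
here is a minimal Python sketch. It is only an illustration: the helper
name fcd_ok is hypothetical, it relies solely on the standard
unicodedata module, and it assumes the usual statement of the FCD test
(the leading combining class of each character's canonical
decomposition, if non-zero, must not be lower than the trailing
combining class of the previous character's decomposition).

```python
import unicodedata


def fcd_ok(s):
    """Rough FCD check (hypothetical helper, not taken from any library)."""
    prev_trail = 0
    for ch in s:
        # Full canonical decomposition of this one character.
        d = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(d[0])
        # A non-zero leading class lower than the previous trailing class
        # means NFD would have to reorder across this boundary.
        if lead != 0 and lead < prev_trail:
            return False
        prev_trail = unicodedata.combining(d[-1])
    return True


a = "\u0FB2\u0334\u0F81"  # <U+0FB2 U+0334 U+0F81>
b = "\u0FB2\u0F81\u0334"  # <U+0FB2 U+0F81 U+0334>

# Canonically equivalent: both have the same NFD form.
same_nfd = unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
print(same_nfd, fcd_ok(a), fcd_ok(b))
```

The sketch compares the NFD forms of the two strings and reports
whether each one passes this FCD test; it says nothing about how any
particular collation implementation handles them.
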
> It's not "an error in DUCET" because UCA step 1 is to apply NFD in
> which case these will become the same string. And
> <U+0FB2 U+0F81 U+0334> does not pass the FCD test, so it will get at
> least reordered (and maybe decomposed). The only problem is if we
> compare <U+0FB2 U+0334 U+0F81> with <U+0FB2 U+0334 U+0F71 U+0F80>
> where an FCD-based implementation should find the contraction in the
> first string (if it checks for "leading ccc" not ccc) but not in the
> second. The addition of the two missing prefix contractions requested
> in L2/12-131R will fix that.

And the absence of those prefix contractions is the error. (The problem
arises because 0F71 has non-zero ccc.) An irritating consequence of
adding a contraction for <U+0FB2 U+0F71> is that we THEN also need the
hitherto redundant contractions of that prefix with the short vowels
U+0F72 TIBETAN VOWEL SIGN I and U+0F74 TIBETAN VOWEL SIGN U to get
subjoined RA plus non-consonantal long vowels to collate properly.

> I took another look at allkeys.txt. As far as I can tell, the
> problematic characters (trailing parts 0F72, 0F74, 0F80 of the
> one-higher composite vowels) occur only in contractions that
> correspond to the composites themselves and in contractions like 0FB2
> 0F80 which are ok: If we get input like 0FB2 0F81 we need not match
> the second part of 0F81 because 0FB2 0F81 itself (and 0FB2 0F71 0F80)
> is also a DUCET contraction.
>
> So probably the simplest way to deal with contractions that contain
> 0F72, 0F74, 0F80 is to either forbid them in tailorings or to require
> that there also be contractions that instead contain 0F73, 0F75, 0F81
> respectively.

I think you mean the other way round. And I would read a contraction
containing a Tibetan long vowel II, REVERSED II, UU, RR or LL as
containing the length mark U+0F71 and the corresponding short vowel.

Richard.

