On Sun, 10 Feb 2013 12:21:05 +0100 Philippe Verdy <verd...@wanadoo.fr> wrote:
> 2013/2/7 Richard Wordingham <richard.wording...@ntlworld.com>:
> > You said, on 5 February,
> >
> > "A process can be FULLY conforming by preserving the canonical
> > equivalence and treating ALL strings that are canonically
> > equivalent, without having to normalize them in any recommended
> > form, or performing any reordering in its backing store, or it can
> > choose to normalize to any other form that is convenient for that
> > process (so it could be NFC or NFD, or something else)"
> >
> > There's no qualification there disqualifying collation at the
> > secondary level from being a 'process' which may or may not be
> > conforming.
>
> Citing this email, the restriction to primary level was included
> before this sentence, and implied.

The first mention of any restriction to the primary level was in the
paragraph *following* the one I quoted above.

> You just did not quote it along with this. Be careful about taking
> sentences out of their contexts, when the whole thread started by
> speaking about primary level only for basic searches.
>
> OK there are some pathological cases but they are really constructed
> and not made for modern languages (except a few Indic ones as you
> noted), but none of them that concern the Latin script (your <TILDE+V>
> example collating like <N> is not an effective true example, it is
> fully constructed and not found in the CLDR).

'Pathological' = not amenable to naïve processing. Tibetan isn't in the
CLDR yet, and several scripts have no representative yet, although the
default collation is inappropriate for the major languages. I also note
that there is as yet no Sanskrit collation! In short, CLDR is far from
complete so far as collation is concerned. The example was <TILDE+v>
collating like <nv>.
> If you just consider the initial question, having to decompose letters
> to "recompose" them in defective ways just to create rare single
> collation elements remains a very borderline case for applications
> like browsers that just perform plain-text search at primary level on
> a web page. Even if the implementation really uses a full
> decomposition, I doubt it even has any implemented tailoring that
> would recognize those defective collation elements

You're now making me wonder whether Danish <U+0061 LATIN SMALL LETTER A,
U+00E1 LATIN SMALL LETTER A WITH ACUTE> and <U+00E1, U+0061> would get
the correct primary matching! Note that the acute accent serves as a
punctuation mark in Danish. 'Defective' collation elements should not be
a problem if one can force decomposition.

What seems odd to me is the UCA rule that, for Danish, the string "aar",
composed of collating elements "aa" and "r", should have a match in
"baaar", which consists of collating elements "b", "aa", "a" and "r", in
that order.

There are two problems that NFD addresses - the merger of base character
and mark into one character, and the order of combining marks.

For primary collation, merger becomes a problem whenever characters need
to be split between collating elements. In Danish, "aaa" is a problem
because one has to choose between the collating element sequences "aa"
then "a" on one hand and "a" then "aa" on the other. The issue becomes
clearer when one replaces "aa" with "å", which is distinguished from it
only at the tertiary level: "aå" is a challenge for formally correct
Danish collation if one does not decompose the characters. At a formal
level one can solve this problem by adding many more collating elements.
In general, however, one cannot solve the problem just by adding
finitely many more collating elements.

Order is a problem when one has collating elements composed of multiple
characters of different non-zero canonical combining classes.
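The "aaa"/"baaar" splitting ambiguity above can be sketched with a toy
greedy scanner. The contraction table and the primary weights below are
invented for illustration; they are not the real DUCET data or the CLDR
Danish tailoring, and real UCA implementations are considerably more
involved:

```python
import unicodedata

# Toy primary-weight table with a Danish-style "aa" contraction.
# Hypothetical values: "aa" shares its primary weight with "å".
PRIMARY = {
    "aa": 0x2900,
    "å":  0x2900,
    "a":  0x2100,
    "b":  0x2200,
    "r":  0x2300,
}

def collation_elements(s):
    """Greedily split s into collating elements, longest match first."""
    s = unicodedata.normalize("NFC", s)  # so decomposed "å" still matches
    out, i = [], 0
    while i < len(s):
        for n in (2, 1):  # try the two-character contraction first
            if s[i:i + n] in PRIMARY:
                out.append(s[i:i + n])
                i += n
                break
        else:
            raise ValueError(f"no collating element for {s[i]!r}")
    return out

# "baaar" parses as "b", "aa", "a", "r" -- but note that a greedy scan
# of "aaa" commits to ("aa", "a") and never considers ("a", "aa"),
# which is exactly the splitting ambiguity discussed above.
print(collation_elements("baaar"))  # ['b', 'aa', 'a', 'r']
print(collation_elements("aaa"))    # ['aa', 'a']
```

A greedy longest-match scan picks one of the possible parses; a matcher
that must honour canonical equivalence and contractions at the same
time cannot always get away with that single choice.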
In practice this could be solved by adding more collating elements, but
in theory the number of combinations to be considered could be
unbounded. The UCA defines the interpretation of a string in terms of
its NFD form, and occasionally it is necessary to reduce strings to NFD
form to determine this interpretation. Having to consider only primary
weights can reduce this problem, but it does not always remove it.

Richard.
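As a postscript, the two problems NFD addresses can be seen directly
with Python's standard-library unicodedata module; this is just an
illustration of canonical equivalence, not part of any collation
implementation:

```python
import unicodedata

# Problem 1: merger of base character and mark. The precomposed and
# decomposed spellings of á are canonically equivalent, and NFD
# reduces both to the same sequence.
composed = "\u00E1"     # á as a single code point
decomposed = "a\u0301"  # a + U+0301 COMBINING ACUTE ACCENT
assert (unicodedata.normalize("NFD", composed)
        == unicodedata.normalize("NFD", decomposed))

# Problem 2: order of combining marks. Marks with different non-zero
# canonical combining classes (acute = ccc 230, dot below = ccc 220)
# may appear in either order in input; NFD puts them into canonical
# order, lower combining class first.
s1 = "a\u0301\u0323"    # a + acute + dot below
s2 = "a\u0323\u0301"    # a + dot below + acute
assert unicodedata.normalize("NFD", s1) == unicodedata.normalize("NFD", s2)
print(ascii(unicodedata.normalize("NFD", s1)))  # 'a\u0323\u0301'
```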