> Here's what happens exactly: Note the rules in CaseFolding.txt:
0049; C; 0069; # CAPITAL (dotless) I -> SMALL (soft-dotted) I
0049; T; 0131; # CAPITAL (dotless) I -> SMALL DOTLESS I
0130; F; 0069 0307; # CAPITAL I WITH DOT -> SMALL (soft-dotted) I, DOT
0130; T; 0069; # CAPITAL I WITH DOT -> SMALL (soft-dotted) I

Note also that the other 'i's are mapped to themselves by default. No explicit case folding is defined for them, so we currently also have these defaults:

0069; C; 0069; # SMALL (soft-dotted) I -> SMALL (soft-dotted) I
0130; C; 0130; # CAPITAL I WITH DOT -> CAPITAL I WITH DOT
0131; C; 0131; # SMALL DOTLESS I -> SMALL DOTLESS I

And we also have the explicitly dotted Turkic lowercase i, whose folding is defined by the 5th of all rules above (fortunately, there's no canonical equivalence between 0069 0307 and 0069):

0069 0307; C; 0069 0307; # SMALL (soft-dotted) I, DOT -> SMALL (soft-dotted) I, DOT

And for the decomposition of the Turkic dotted uppercase I, case folding is defined by the 1st or 2nd of all rules above (note that 0049 0307 and 0130 should be canonically equivalent, and should produce the same case foldings as the 3rd or 4th rules above, to preserve canonical equivalence):

0049 0307; C; 0069 0307; # CAPITAL (dotless) I, DOT -> SMALL (soft-dotted) I, DOT
0049 0307; T; 0131 0307; # CAPITAL (dotless) I, DOT -> SMALL DOTLESS I, DOT

********************************************************

Now let's look at each CaseFolding type and at its results:

------------------------------------
(1) Mappings for Simple CaseFolding:
------------------------------------

(1.1) First class of source strings:
0131; C; 0131; # SMALL DOTLESS I -> SMALL DOTLESS I

(1.2) Second class of source strings:
0049; C; 0069; # CAPITAL (dotless) I -> SMALL (soft-dotted) I
0069; C; 0069; # SMALL (soft-dotted) I -> SMALL (soft-dotted) I

(1.3) Third class of source strings:
0130; C; 0130; # CAPITAL I WITH DOT -> CAPITAL I WITH DOT

(1.4) Fourth class of source strings:
0049 0307; C; 0069 0307; # CAPITAL (dotless) I, DOT -> SMALL (soft-dotted) I, DOT
0069 0307; C; 0069 0307; # SMALL (soft-dotted) I, DOT -> SMALL (soft-dotted) I, DOT

Are these classes stable (do they neither merge nor split) under the uppercase/titlecase or lowercase mappings?

(1.1) 0131;      lower=0131     ; upper/title=0131
(1.2) 0049;      lower=0069     ; upper/title=0049
(1.2) 0069;      lower=0069     ; upper/title=0049
(1.3) 0130;      lower=0130     ; upper/title=0130
(1.4) 0049 0307; lower=0069 0307; upper/title=0049 0307
(1.4) 0069 0307; lower=0069 0307; upper/title=0049 0307

OK, there's no merge, so there is no problem with Simple CaseFolding, which is stable under case mappings.

------------------------------------
(2) Mappings for Turkic CaseFolding:
------------------------------------

(2.1) First class of source strings:
0131; C; 0131; # SMALL DOTLESS I -> SMALL DOTLESS I
0049; T; 0131; # CAPITAL (dotless) I -> SMALL DOTLESS I

(2.2) Second class of source strings:
0069; C; 0069; # SMALL (soft-dotted) I -> SMALL (soft-dotted) I
0130; T; 0069; # CAPITAL I WITH DOT -> SMALL (soft-dotted) I

(2.3) Third class of source strings:
0049 0307; T; 0131 0307; # CAPITAL (dotless) I, DOT -> SMALL DOTLESS I, DOT

(2.4) Fourth class of source strings:
0069 0307; C; 0069 0307; # SMALL (soft-dotted) I, DOT -> SMALL (soft-dotted) I, DOT

Are these classes stable (do they neither merge nor split) under the common uppercase/titlecase or lowercase mappings?

(2.1) 0131;      C; lower=0131     ; upper/title=0131
(2.1) 0049;      C; lower=0069     ; upper/title=0049
(2.2) 0069;      C; lower=0069     ; upper/title=0049
(2.2) 0130;      C; lower=0130     ; upper/title=0130
(2.3) 0049 0307; C; lower=0069 0307; upper/title=0049 0307
(2.4) 0069 0307; C; lower=0069 0307; upper/title=0049 0307

Problem here: the uppercase mappings do not follow the case folding rules. We would also need Turkic-specific mappings for upper/title case:

(2.1) 0131;      T; upper/title=0049
(2.1) 0049;      C; upper/title=0049
(2.2) 0069;      T; upper/title=0130
(2.2) 0130;      C; upper/title=0130
(2.3) 0049 0307; T; upper/title=0049 0307 (=0130 ?)
(2.4) 0069 0307; T; upper/title=0130 0307 (=0130 ?)

But we would then need to define a canonical equivalence between 0130, 0049 0307 and 0130 0307 for these mappings to preserve canonical equivalence...

So Turkic CaseFoldings would be broken, unless we say that Turkish texts should NOT be encoded with 0307, but only with 0049, 0069, 0130 or 0131. Turkic CaseFolding rules should therefore also avoid generating any 0307, whose behavior is not clear.

If we just remove any 0307 from Turkic texts, there is absolutely no problem with Turkic CaseFolding, provided that we also define Turkic-specific uppercase mappings as done above, and don't use the default locale-neutral uppercase mappings of the UCD.

------------------------------------
(3) Mappings for Full CaseFolding:
------------------------------------

(3.1) First class of source strings:
0131; C; 0131; # SMALL DOTLESS I -> SMALL DOTLESS I

(3.2) Second class of source strings:
0049; C; 0069; # CAPITAL (dotless) I -> SMALL (soft-dotted) I
0069; C; 0069; # SMALL (soft-dotted) I -> SMALL (soft-dotted) I

(3.3) Third class of source strings:
0130; F; 0069 0307; # CAPITAL I WITH DOT -> SMALL (soft-dotted) I, DOT
0049 0307; C; 0069 0307; # CAPITAL (dotless) I, DOT -> SMALL (soft-dotted) I, DOT
0069 0307; C; 0069 0307; # SMALL (soft-dotted) I, DOT -> SMALL (soft-dotted) I, DOT

Are these classes stable (do they neither merge nor split) under the common uppercase/titlecase or lowercase mappings?

(3.1) 0131;      C; lower=0131     ; upper/title=0131
(3.2) 0049;      C; lower=0069     ; upper/title=0049
(3.2) 0069;      C; lower=0069     ; upper/title=0049
(3.3) 0130;      C; lower=0130     ; upper/title=0130
(3.3) 0049 0307; C; lower=0069 0307; upper/title=0049 0307
(3.3) 0069 0307; C; lower=0069 0307; upper/title=0049 0307

Here the Full CaseFolding rules seem to be broken, as they are not stable under the uppercase mappings.

There is only one way in which the Full CaseFolding rules would be valid: if the uppercase mappings were also altered, so that the uppercase of 0130 (which is already uppercase) is 0049 0307. That is impossible, as uppercase mappings in the UCD are restricted to 1 character. The only remaining way to achieve it is to make the two canonical equivalents, both representing an uppercase dotted I.

Fortunately, we find this in the UCD, which defines exactly that canonical equivalence:

0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;

Good. Full CaseFoldings are not broken, but they require support for the canonical equivalence of the decompositions of the dotted uppercase I. Using Full CaseMapping correctly requires being able to normalize its output.

However, care must be taken, because Turkic text may have been converted to uppercase in the past using Turkic rules, and this information is lost if the language is not clearly identifiable.
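
To make the class tables above easier to check, here is a minimal Python sketch that regroups the six source strings by their folded result. It is only an illustration, under some assumptions: full_fold relies on Python's built-in str.casefold() (which applies the C and F rules) and normalizes to NFD before and after folding; turkic_fold, TURKIC, SOURCES and classes are hypothetical names of mine, with the two T rules copied from the CaseFolding.txt lines quoted above. No normalization is applied in turkic_fold, so that it reproduces the four Turkic classes as listed.

```python
import unicodedata
from collections import defaultdict

# The six source strings discussed in the post.
SOURCES = ["\u0131", "\u0049", "\u0069", "\u0130", "\u0049\u0307", "\u0069\u0307"]

# The two Turkic (T) rules quoted from CaseFolding.txt.
TURKIC = {"\u0049": "\u0131", "\u0130": "\u0069"}


def full_fold(s):
    """Full case folding (C + F rules), with NFD applied before and after."""
    folded = unicodedata.normalize("NFD", s).casefold()
    return unicodedata.normalize("NFD", folded)


def turkic_fold(s):
    """Hypothetical Turkic folding: T rules first, common rules otherwise."""
    return "".join(TURKIC.get(ch, ch.casefold()) for ch in s)


def classes(fold):
    """Group SOURCES by folded result; return a sorted list of sorted groups."""
    groups = defaultdict(list)
    for s in SOURCES:
        groups[fold(s)].append(s)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    for name, fold in (("full", full_fold), ("turkic", turkic_fold)):
        print(name)
        for group in classes(fold):
            # Print each class as space-separated code point sequences.
            print("  ", [" ".join(f"{ord(c):04X}" for c in s) for s in group])
```

With full_fold, 0130 and 0049 0307 land in the same class as 0069 0307, which is what the canonical equivalence found in the UCD requires. With turkic_fold, 0049 0307 and 0069 0307 remain in separate classes of their own, which is the reason given above for keeping 0307 out of Turkic texts.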

