On 29 December 2012, James Cloos asked:

>> Given (just) the data in 10646, Unicode and cldr, are there any
>> locales where a case-insensitive match should be different than a
>> case-preserving match of the results of lower-casing the two
>> strings?
On Mon, 31 Dec 2012 23:29:48, "Whistler, Ken" <[email protected]> wrote:

> 3. Regarding LDML and CLDR, somebody with specific expertise on CLDR
> may have to jump in here, but while locales clearly *are* in the
> scope of LDML and CLDR, there is currently little if anything they
> have to say about specific case mapping rules.

Mark Davis has answered this in part. However, there is one set of differences that has not been mentioned at all - digraphs treated as letters, e.g. in Welsh and Danish. The key problem with these, especially with "ng" in Welsh (where g < ng < h), is that sometimes the sequence is a digraph and sometimes not.

With camel case words (and a good case for Welsh is Scottish surnames such as McHenry - 'ch' is a digraph in Welsh, but obviously not in this name), digraphs do not (exceptions, anyone?) straddle the case-marked boundaries. Accordingly, in Welsh we have 'ce' < 'ci' < 'ch', 'Ce' < 'Ci' < 'Ch', 'CE' < 'CI' < 'CH', but 'cE' < 'cH' < 'cI'.

A solution, if you care greatly about correctness (CLDR does not), is to preprocess each sequence of a lower case letter followed by an upper case letter by inserting CGJ, i.e. U+034F COMBINING GRAPHEME JOINER, between the two. As far as I am aware, this only affects sequences of general category Ll followed by Lu. (I haven't checked CLDR for special collation rules for any sequences of Ll followed by Lu - do check before using my proposed solution.)

For most languages, there are two practical problems: CGJ is not provided on keyboards, and CGJ is misrendered by old rendering systems.

Richard.
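
The preprocessing step described above could be sketched in Python roughly as follows. The function name is hypothetical, and the sketch assumes the input is in NFC, so that the Ll and Lu code points are directly adjacent. It does not consult CLDR for any tailoring that deliberately covers an Ll + Lu sequence, so that check still has to be done separately.

```python
import unicodedata

CGJ = "\u034F"  # U+034F COMBINING GRAPHEME JOINER


def insert_cgj_at_case_boundaries(text: str) -> str:
    """Insert CGJ between each lower case letter (Ll) that is
    immediately followed by an upper case letter (Lu).

    Hypothetical helper: assumes NFC input and ignores titlecase
    letters (Lt) and any intervening combining marks.
    """
    out = []
    prev_cat = None
    for ch in text:
        cat = unicodedata.category(ch)
        # A case-marked boundary: lower case followed by upper case.
        if prev_cat == "Ll" and cat == "Lu":
            out.append(CGJ)
        out.append(ch)
        prev_cat = cat
    return "".join(out)


if __name__ == "__main__":
    for word in ("McHenry", "cH", "ch", "Ch", "CH"):
        result = insert_cgj_at_case_boundaries(word)
        # Show the code points so the invisible CGJ is visible.
        print(word, [f"U+{ord(c):04X}" for c in result])
```

The output would then be handed to a Welsh-tailored collator, with the intent that the CGJ keeps 'c' + 'H' from being matched as the digraph 'ch'. Whether a particular collation implementation honours that is something to verify before relying on it.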

