What are the definitions of upper and lower case for the caseFirst tailoring for the UCA and for LDML? I can't find any obvious definition.
My suspicion is that they are defined by the assignment of the DUCET tertiary weights, UTS#10 Issue 23 (Version 6.1.0) Section 7.2. Although these largely correspond to the UCD, many weights look as though they have been corrected manually. For example, the <square> attribute must often be deduced from the appearance or the name of the character. A few weights I cannot fathom. For example, U+2122 TRADE MARK SIGN has a <super> compatibility decomposition to upper case letters, but has the tertiary weights for a non-uppercase decomposition of type <super>.
If the DUCET tertiary weights are the definition, there is still no statement of which values correspond to lowercase. ICU appears to treat as lowercase the weights for the decomposition types NONE, <wide>, <compat>, <font>, <circle>, <super>, <sub>, <final> and <square> (hex values 02, 03, 04, 05, 06, 14, 15, 19 and 1C) that have no entry in the 'Case or kana subtype' column. Even the interpretation of uppercase in terms of weights is not certain, for the ISO/IEC 14651:2007 example of a tailoring for uppercase first does not adjust the collation elements with a tertiary weight of 1C, although they are listed as uppercase in Section 7.2 of the UCA and the standard collation table of ISO/IEC 14651 calls the weight MISCCAP.
There are a few out-and-out anomalies in the tertiary weights of primary non-ignorables, even allowing for intelligent hand correction of the decompositions in UnicodeData.txt. Is a compliant implementation free to classify a character as lowercase or uppercase depending on its appearance? For example, when caseFirst is set to uppercase, ICU orders U+1D34 MODIFIER LETTER CAPITAL H before U+0068 LATIN SMALL LETTER H, but anomalously orders U+A7F8 MODIFIER LETTER CAPITAL H WITH STROKE *after* U+0127 LATIN SMALL LETTER H WITH STROKE, because the tertiary weight of U+A7F8 identifies it as <super> with no entry in the 'Case or kana subtype' column. Is this behaviour required by the UCA + DUCET?
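To make the question concrete, here is a minimal Python sketch of the classification I am guessing at. It is not ICU's code and not anything the UCA states: the set of "lowercase" weights is only the list of hex values given above, the function names are hypothetical, and the value 0x1D used for U+1D34 is an assumption standing in for "some tertiary weight outside that set".

```python
# Tertiary weights that ICU *appears* to treat as lowercase (the hex values
# listed above). This is an observation, not a normative definition.
APPARENT_LOWER = {0x02, 0x03, 0x04, 0x05, 0x06, 0x14, 0x15, 0x19, 0x1C}


def apparently_lower(tertiary):
    """Return True if the tertiary weight is in the guessed lowercase set."""
    return tertiary in APPARENT_LOWER


def upper_first_key(tertiary):
    """Hypothetical sort key for caseFirst=upper at the tertiary level.

    Weights outside the guessed lowercase set sort first; ties are broken
    by the raw weight. Kana subtypes and other non-case weights are not
    modelled here.
    """
    return (1 if apparently_lower(tertiary) else 0, tertiary)


# U+0127 has the plain weight 0x02 and U+A7F8 the <super> weight 0x14,
# so both count as "lower" and U+A7F8 is not moved in front.
h_stroke_pair = sorted([0x14, 0x02], key=upper_first_key)

# Assumed weight 0x1D for U+1D34: outside the set, so it moves before 0x02.
h_pair = sorted([0x02, 0x1D], key=upper_first_key)
```

Under these assumptions the sketch reproduces the two orderings described above: the modifier capital H moves ahead of h, while the modifier capital H with stroke stays behind h with stroke.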
What are the logical weights for the caseLevel level? My guess is that they are 'ignore' (value 0), 'lower', 'upper' and 'neutral'. It appears that kana have case values: setting caseFirst (with caseLevel off) persuades ICU to order U+32DA CIRCLED KATAKANA SA and U+1F202 SQUARED KATAKANA SA before U+3055 HIRAGANA LETTER SA and U+30B5 KATAKANA LETTER SA instead of after them.
Richard.