CaseFirst and CaseLevel Tailorings of UCA and LDML

Richard Wordingham Mon, 21 May 2012 16:49:33 -0700

What are the definitions of upper and lower case for the caseFirst
tailoring for the UCA and for LDML?  I can't find any obvious
definition.


My suspicion is that they are defined by assignment of the DUCET
tertiary weights, UTS#10 Issue 23 (Version 6.1.0) Section 7.2.
Although these largely correspond to the UCD, many weights look as
though they have been corrected manually.  For example, the <square>
attribute must often be deduced from the appearance or the name of the
character.  A few weights I cannot fathom - for example, U+2122 TRADE
MARK SIGN has a <super> compatibility decomposition as upper case
letters, but has the tertiary weights for a non-uppercase decomposition
of type <super>.

If DUCET tertiary weights are the definition, there is
still no statement of which values correspond to lowercase.  ICU
appears to believe that the lower case weights include the
decomposition types of NONE, <wide>, <compat>, <font>, <circle>,
<super>, <sub>, <final> and <square> (hex values 02, 03, 04, 05, 06,
14, 15, 19 and 1C) with no entry in the 'Case or kana subtype' column.
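
To make that list concrete, here is a minimal sketch that encodes only the values quoted above. The names ICU_LOWERCASE_TERTIARY and is_treated_as_lowercase are hypothetical, chosen for illustration. The table records what ICU appears to do as described in this message; it is not taken from the ICU sources.

```python
# Tertiary weights (hex) that ICU appears to treat as lowercase, paired
# with the decomposition type quoted for each.  Transcribed from the list
# in this message, not from the ICU source code.
ICU_LOWERCASE_TERTIARY = {
    0x02: "NONE",
    0x03: "<wide>",
    0x04: "<compat>",
    0x05: "<font>",
    0x06: "<circle>",
    0x14: "<super>",
    0x15: "<sub>",
    0x19: "<final>",
    0x1C: "<square>",
}


def is_treated_as_lowercase(tertiary: int) -> bool:
    """Return True if the weight is in the apparent lowercase set (hypothetical helper)."""
    return tertiary in ICU_LOWERCASE_TERTIARY
```

For instance, is_treated_as_lowercase(0x14) is true, while 0x08 (the weight of an uppercase letter with no decomposition) is not in the set.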

Again, even the interpretation of uppercase in terms of weights is not
certain, for the ISO/IEC 14651:2007 example of a tailoring for
uppercase first does not adjust the collation elements with a tertiary
weight of 1C, although they are listed as uppercase in Section 7.2 of
the UCA and the standard collation table of ISO/IEC 14651 calls the
weight MISCCAP.

There are a few out-and-out anomalies in the tertiary weights of
primary non-ignorables, even allowing for intelligent hand correction
of the decompositions in UnicodeData.txt.  Is a compliant
implementation free to classify a character as lowercase or uppercase
depending on its appearance?  For example, when caseFirst is set to
uppercase, ICU orders U+1D34 MODIFIER LETTER CAPITAL H before U+0068
LATIN SMALL LETTER H, but anomalously orders U+A7F8 MODIFIER LETTER
CAPITAL H WITH STROKE *after* U+0127 LATIN SMALL LETTER H WITH STROKE,
because the tertiary weight of U+A7F8 identifies it as <super> with no
entry in the 'Case or kana subtype' column.  Is this behaviour required
by the UCA + DUCET?
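
The effect can be reproduced with a deliberately simplified model, which is an assumption made for illustration and not ICU's actual implementation. Each character that shares a primary weight is keyed by a case class derived from its tertiary weight, then by the tertiary weight itself. The names CASE_CLASS and upper_first_key are hypothetical. The weights are 08 for an uppercase letter with no decomposition, 02 for a lowercase one, and 14 for <super>, the weight this message reports for U+A7F8.

```python
# Case class per tertiary weight, limited to the three weights used here.
# 0x14 (<super>) is placed in the lowercase class, as ICU appears to do.
CASE_CLASS = {0x02: "lower", 0x08: "upper", 0x14: "lower"}

# With caseFirst=upper, the uppercase class sorts ahead of the lowercase class.
UPPER_FIRST_RANK = {"upper": 0, "lower": 1}


def upper_first_key(tertiary: int) -> tuple:
    """Sort key: case class first, then the raw tertiary weight."""
    return (UPPER_FIRST_RANK[CASE_CLASS[tertiary]], tertiary)


# Three characters sharing one primary weight, with their tertiary weights.
tertiary_of = {
    "\u0126": 0x08,  # LATIN CAPITAL LETTER H WITH STROKE
    "\u0127": 0x02,  # LATIN SMALL LETTER H WITH STROKE
    "\ua7f8": 0x14,  # MODIFIER LETTER CAPITAL H WITH STROKE
}

ordered = sorted(tertiary_of, key=lambda ch: upper_first_key(tertiary_of[ch]))
print([f"U+{ord(ch):04X}" for ch in ordered])
```

Under this model U+0126 moves ahead of U+0127, but U+A7F8 stays behind U+0127: it falls in the same case class, so only the raw tertiary weights (14 against 02) decide the order.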

What are the logical weights for the CaseLevel level?  My guess is that
they are 'ignore' (value 0), 'lower', 'upper' and 'neutral'.  It
appears that kana have case values - setting CaseFirst (with CaseLevel
off) persuades ICU to order U+32DA CIRCLED KATAKANA SA and U+1F202
SQUARED KATAKANA SA before U+3055 HIRAGANA LETTER SA and U+30B5
KATAKANA LETTER SA instead of after them.

Richard.
