"Whistler, Ken" <[email protected]> wrote: |Steffen Daode Nurpmeso observed: |> Hello, in UAX #44 i read |> |> Simple_Titlecase_Mapping ... |> Note: If this field is null, then the Simple_Titlecase_Mapping |> is the same as the Simple_Uppercase_Mapping for this character. |> |> So a parser has to be aware of this, automatically falling back to |> the uppercase mapping (index 12) when there is no explicit |> titlecase mapping (index 14). |> |> Given this the following surprised me: |> |> ?0[steffen@sherwood unicode]$ <UnicodeData.txt awk 'BEGIN{FS=";"}\ |> {if (length($15) && $15 = $13) print}' |wc -l |> 1051 |> ?0[steffen@sherwood unicode]$ <UnicodeData.txt awk 'BEGIN{FS=";"}\ |> {if (length($15) && $15 != $13) print}' |wc -l |> 12 |> |> (I.e., 1051 times the redundant mapping is defined.) | |Prior to Unicode 5.2, the relevant documentation (in UCD.html) used |to say: | |The simple titlecase may be omitted in the data file if the titlecase is the |same as the uppercase.
This is interesting -- in [1] `Simple_Uppercase_Mapping' had a note
stating

  Note: The simple uppercase is omitted in the data file if the
  uppercase is the same as the code point itself.

  [1] <http://www.unicode.org/Public/5.1.0/ucd/UCD.html>

Similar for `Simple_Lowercase_Mapping'.

|Someone correctly pointed out that that statement was ambiguous.
|It was corrected to the current note, which is both correct and states
|the intention of the simple titlecase mapping: that it be equivalent
|to the simple uppercase mapping unless it isn't, in which case a different
|explicit value will be in the field (the 12 cases you noted).
|
|The redundant titlecase mapping values were not *removed* from
|the data file, as there was a significant chance that that would disrupt
|parsers which had long been using conventions which expected
|explicit values in the field.

That is what i assumed to be the reason they are still there, without
knowing the history you have pointed out -- i became a bit curious.
Interestingly, for Unicode 3.2 ([2]) the titlecase is also defined as

  Note: This field is omitted if the titlecase is the same as
  field 12.  For full case mappings, see UAX #21 Case Mappings
  and SpecialCasing.txt.

  [2] <http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html>

For 3.0 ([3]) no such constraint is defined at all, for any of the
three case mappings.

  [3] <http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html>

Hmm.  To me, this raises the question why these constraints were
introduced at all.  Imho either one adds constraints due to solid
considerations, and enforces them after some period of backward
compatibility, or there simply should be no constraints.
There are parsers (i know of one) which use *only* UnicodeData.txt
for generating tables (using patterns like `SPACE' etc.
to join characters into sets; which seems to have been common
practice in the past -- as in [3], „Case Mappings“: „derivable from
the presence of the terms "CAPITAL" or "SMALL" in the character
name“).
If there is no such extensive guaranteed backward compatibility for
UnicodeData.txt content already today then that should be noted
(i wouldn't know where that is true?), but otherwise it cannot be
that labour-intensive to drop these constraints again, since nothing
had to be done at all?
I.e., are these parsers already broken today?
Just curious…

|--Ken

--steffen
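
P.S.  For illustration, the fallback a parser would need could be
sketched roughly like this.  This is only a sketch under assumptions,
not code from any existing parser: the function name
`effective_titlecase' is made up, and it assumes the usual
semicolon-separated UnicodeData.txt layout with the uppercase mapping
in awk field $13 (index 12) and the titlecase mapping in $15
(index 14).  A null uppercase is taken to mean the code point itself,
a null titlecase to mean the (effective) uppercase.

```shell
# Hypothetical helper: reads UnicodeData.txt-style lines on stdin and
# prints "code;upper;title" with the documented fallbacks applied.
effective_titlecase() {
    awk 'BEGIN { FS = ";"; OFS = ";" }
    {
        upper = $13                       # Simple_Uppercase_Mapping (index 12)
        title = $15                       # Simple_Titlecase_Mapping (index 14)
        if (upper == "") upper = $1       # null uppercase: code point itself
        if (title == "") title = upper    # null titlecase: same as uppercase
        print $1, upper, title
    }'
}
```

It would be used as `<UnicodeData.txt effective_titlecase'.  A parser
written this way gives the same result whether or not the redundant
titlecase values are present in the data file.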

