Note that tolower() and toupper() work only at the level of a single character, so they are not recommended for changing the case of plain text. Their use should be limited to cases where letters can be safely isolated from their context, for example when letters are handled as numbers (e.g. section numbering).
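As an illustration of a case where a letter is safely isolated from its context, here is a minimal sketch in Python (chosen only for illustration; the functions discussed here are the C ones). The helper name section_label is hypothetical. It treats a basic Latin letter as a number for section numbering, where a one-character case conversion cannot go wrong.

```python
def section_label(index: int, upper: bool = False) -> str:
    """Return the letter label for a zero-based section index (0 -> 'a')."""
    if not 0 <= index < 26:
        raise ValueError("only 26 basic Latin labels are available")
    # The letter is derived from a number, so it has no linguistic context:
    # converting its case one character at a time is safe here.
    letter = chr(ord("a") + index)
    return letter.upper() if upper else letter


labels = [section_label(i) for i in range(3)]
appendix = section_label(2, upper=True)
print(labels, appendix)
```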
For correct handling of locales, tolower and toupper should be replaced by strtolower and strtoupper (or their aliases), which can process character clusters and apply the contextual casing rules needed for a language or orthographic style (such as monotonic and polytonic Greek, or specific locales intended for medieval texts or old classical scriptures). strtoupper and strtolower can then perform more mappings than tolower and toupper, which are restricted to simple one-to-one mappings.

So precombined Greek letters with iota subscripts can only be converted by tolower() and toupper() in a way that preserves the iota subscript (for which islower() and isupper() are BOTH false when it is encoded separately and not precombined). When a Greek letter precombined with an iota subscript is found, the letter case of the iota subscript should be ignored and only the case of the base letter considered. This means that tolower() and toupper() can map only one orthographic style: the style that preserves the subscript, but not the classical Greek or modern monotonic style, which doesn't "know" anything about this "medieval" extension of the Greek alphabet, which was still in use at the beginning of the 1970s. Handling polytonic Greek with tolower() and toupper(), or with islower() and isupper(), will not produce the correct result.

Modern Greek does not use the iota subscript, so we are in the same situation as classical Greek (before the Christian era), except that modern Greek still uses a few accents (notably the "tonos", equivalent in Unicode to the acute accent, even if its placement on Greek capitals is preferably before the letter rather than above it, as its assigned combining class could suggest).
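To make the two capitalizations of a precombined letter concrete, here is a minimal sketch in Python (an assumption made for illustration; it is not the C API discussed above). Python's str.upper() and str.title() apply the Unicode full case mappings, so they can return more code points than they receive, which a one-character function cannot do.

```python
import unicodedata

# U+1F80 GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
small = "\u1f80"

# Full uppercase mapping: the iota subscript becomes a separate capital iota.
upper = small.upper()
# Full titlecase mapping: a single precombined letter that keeps the subscript.
title = small.title()

for name, text in (("upper", upper), ("title", title)):
    codes = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(name, text, codes)

# General category of the titlecase result.
print(unicodedata.category(title))
```

The uppercase result has two code points and the titlecase result has one, so a function that returns exactly one character per input character can produce at most one of the two styles.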
2014-11-07 12:32 GMT+01:00 Mike FABIAN <[email protected]>:

> Philippe Verdy <[email protected]> wrote:
>
> > this is a "feature" of the Greek alphabet that the lowercase iota subscript
> > can be capitalized in two different ways : either as a subscript below the
> > uppercase main letter, or as a standard iota capitalized. The subscript
> > form is a combining character, but not the non-subscript form.
>
> Laurentiu> All of the characters you enumerated are titlecase letters
> Laurentiu> (gc=Lt) rather than uppercase letters (gc=Lu),
>
> U+1F80 ᾀ is something like ἀι and could be capitalized as ἈΙ or as ᾈ.
> ᾈ is something like Ἀι so I understand now that ᾈ can be considered as
> titlecase (gc=Lt).

Note that for modern Greek there's still a difficulty with the special final form of the lowercase sigma: it is effectively lowercase (islower should return true), not titlecase, and toupper will map it to a standard capital Sigma. But the reverse conversion can only map the capital Sigma to a standard lowercase sigma, ignoring the final form. To handle the final form correctly, don't use tolower() character by character; use strtolower() with a decent library that supports contextual rules.

The same is true for the German ess-tsett, which is capitalized as two S, a mapping that is not reversible. An "uppercase" variant of ess-tsett was recently added to Unicode, but it is still extremely rarely used. It is extremely difficult to determine how to convert a double capital S back, and most libraries will only convert it to a double lowercase s. Some locales deliberately leave the lowercase ess-tsett unaltered by toupper or strtoupper; this is still correct if those libraries have not been updated to use the capital ess-tsett, which is supported in more recent versions of Unicode but not found in any legacy encoding.
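The difference between whole-string and per-character conversion can be sketched in Python (again an assumption for illustration, standing in for a library with contextual rules). Python's str.lower() applies the Unicode final-sigma rule when it sees the whole word, while lowercasing one character at a time has no context to work with. The sample words are only examples.

```python
word = "\u039f\u0394\u039f\u03a3"  # ΟΔΟΣ

# Whole-string lowercasing sees the context and picks the final sigma.
whole = word.lower()
# Per-character lowercasing sees each letter alone, so it never can.
per_char = "".join(ch.lower() for ch in word)

print(whole, per_char, whole == per_char)

# The final sigma maps up to the ordinary capital Sigma.
final_sigma_upper = "\u03c2".upper()

# Uppercasing the German sharp s is not reversible.
street_upper = "Stra\u00dfe".upper()
round_trip = street_upper.lower()
print(street_upper, round_trip)

# The capital sharp s (U+1E9E) does lowercase to the ordinary sharp s.
capital_sharp_s_lower = "\u1e9e".lower()
```

The round trip loses the original spelling because nothing in "STRASSE" records whether the source had a sharp s or a double s; deciding that would need a dictionary or other language knowledge.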
We still have a difficulty with the ampersand "&" because it has been encoded only as a symbol, on the assumption that in the most widely used locales it appears only in isolation, as an abbreviated form of a word. But in some locales it was still considered a letter and used everywhere "et" could be used, including in abbreviations like "etc." == "&c." or in the middle of words like "caret" == "car&" or "comm&tre" == "commettre". Since the modern use of the ampersand implies a word break before and after the symbol, we should have a separate encoding for "&" as a lowercase ligature, and we should even have an uppercase variant, as for the German ess-tsett: there are glyph variants of the ligature for uppercased titles where the modern "&" ampersand does not fit very well, or where it should be mapped to a non-ligatured "ET" letter pair, distinct from the mapping (with spaces around) to " ET " in French or to " AND " in English that is implied by the modern meaning of the current symbol as a separate word by itself. With a distinct encoding of the ligature, the common abbreviation "etc." ligatured as "&c." would correctly map to uppercase "&C." with the uppercase ligature, or to "ETC.", without adding any space.

Note that "&" was even considered an extra letter in some classical European alphabets (with letter forms showing more clearly its origin as an "et"/"ET" ligature), just like the German ess-tsett "ß", the French "œ"/"Œ" (distinguished semantically from the "oe"/"OE" letter pairs, which allow a syllable break in the middle and allow titlecasing as "Oe": in French the titlecased common term "Oeuf" is semantically and graphically incorrect; it should be "Œuf", where "Œ" is fully uppercase in the ligature and not mixed-case), the Latin "æ"/"Æ" ligature (also used in other classical European languages) or the Dutch ligature "ij"/"IJ".
_______________________________________________
Unicode mailing list
[email protected]
http://unicode.org/mailman/listinfo/unicode

