On 03.12.2024 at 02:51, Asmus Freytag via Unicode wrote:
Rather than getting hung up on details of parsing one particular
part of one sentence, it would be more useful from Unicode's
perspective if someone (Daniel?) could sum up in a short document
based on this discussion where Unicode is behind the curve and to
make sure the support in CLDR is up to actual current practice and
not what it was 10 or 15 years ago.

Thank you very much for the idea. I could certainly sum up the arguments of the discussion (though I’m too busy to do it right now, so you would need a few weeks’ patience), but I still haven’t understood where in the CLDR such casing information is stored. There are data subsets that have “casing” in the title, but they only say whether the days of the week, month names, language names, etc. are capitalized in a certain language. There is a field called “main exemplars” that contains all the small letters (for German, including ß) and another field called “index exemplars”, which for German does not even include Ä, Ö, and Ü. I surmise that this is only meant for numbering items using letters (where indeed you can have parts A, B, C, etc. of a book, but you would never have a “part Ä”). I cannot find any information saying something like a ↔ A, b ↔ B, etc.

For Turkish (https://www.unicode.org/cldr/charts/46/summary/tr.html), the “main letters” in the very first line are given as

[a b c ç d e f g ğ h ı iİ j k l m n o ö p r s ş t u ü v y z].

So there i and its capital counterpart İ are not separated by a space. But for German (https://www.unicode.org/cldr/charts/46/summary/de.html), the “main letters” are

[aä b c d e f g h i j k l m n oö p q r s ß t uü v w x y z],

where the missing space does not imply capitalization, so I guess changing this list to “… s ßẞ t …” would not automatically inform people that ß should be capitalized as ẞ.
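
The difference between the two lists can be checked mechanically. The following is a minimal Python sketch, not anything taken from CLDR itself: it splits the two lists quoted above at the spaces and asks, for every group of more than one letter, whether the group contains an uppercase letter (general category Lu). The helper names `multi_letter_groups` and `contains_uppercase` are hypothetical and exist only for this illustration.

```python
import unicodedata

# The two "main letters" lists as quoted above from the CLDR 46 summary charts.
TR = "a b c ç d e f g ğ h ı iİ j k l m n o ö p r s ş t u ü v y z"
DE = "aä b c d e f g h i j k l m n oö p q r s ß t uü v w x y z"

def multi_letter_groups(exemplars):
    """Return the space-separated groups holding more than one letter.

    Hypothetical helper; the text is normalized to NFC first so that each
    accented letter is a single code point.
    """
    text = unicodedata.normalize("NFC", exemplars)
    return [group for group in text.split() if len(group) > 1]

def contains_uppercase(group):
    """True if any character in the group has general category Lu."""
    return any(unicodedata.category(ch) == "Lu" for ch in group)

tr_groups = multi_letter_groups(TR)
de_groups = multi_letter_groups(DE)

tr_flags = [contains_uppercase(g) for g in tr_groups]
de_flags = [contains_uppercase(g) for g in de_groups]

print(tr_groups, tr_flags)
print(de_groups, de_flags)
```

In the Turkish list the only multi-letter group pairs a lowercase letter with an uppercase one, whereas every German group consists of two lowercase letters (a base letter and its umlaut), so the absence of a space cannot by itself be read as a case relation.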

In https://www.unicode.org/versions/Unicode16.0.0/UnicodeStandard-16.0.pdf on page 198 I find: “Examples of case tailorings which are not covered by data in SpecialCasing.txt include: […] Uppercasing of U+00DF ‘ß’ LATIN SMALL LETTER SHARP S to U+1E9E LATIN CAPITAL LETTER SHARP S[.] The preferred mechanism for defining tailored casing operations is the Unicode Common Locale Data Repository (CLDR), https://cldr.unicode.org, where tailorings such as these can be specified on a per-language basis, as needed.” So the idea is already there. On page 295 the problem with ß is addressed in detail, and right underneath it says, “Additional language-specific or orthography-specific contexts and casing behavior is specified in the Unicode Common Locale Data Repository (CLDR), https://cldr.unicode.org.” So does this already exist? Or where does it have to be added?
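
For comparison, the untailored behaviour is easy to observe. The sketch below uses Python, whose `str.upper()` applies the default Unicode full case mapping without any language tailoring. The function `upper_de_tailored` and the table `DE_TAILORING` are hypothetical: they only illustrate what a per-language tailoring of the kind described in the quoted passage would have to do, and they are not an existing CLDR or Python API.

```python
# Default (untailored) full case mapping: U+00DF expands to "SS".
default_upper = "Straße".upper()

# Hypothetical tailoring table: send U+00DF (ß) to U+1E9E (ẞ)
# before the default uppercasing is applied.
DE_TAILORING = str.maketrans({"\u00df": "\u1e9e"})

def upper_de_tailored(text):
    """Uppercase text, mapping ß to ẞ (assumed tailoring, not a CLDR API)."""
    return text.translate(DE_TAILORING).upper()

tailored_upper = upper_de_tailored("Straße")

# U+1E9E lowercases to U+00DF by default, so the tailored form round-trips.
round_trip = tailored_upper.lower()

print(default_upper, tailored_upper, round_trip)
```

The default mapping loses the distinction between ß and ss on the way back to lowercase, while the tailored form round-trips; this is the behaviour that a CLDR tailoring would need to express.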

Can anybody help?

Best wishes,

Daniel

--
Prof. Dr. Daniel Bunčić
===============================================================
Slavisches Institut der Universität zu Köln
Weyertal 137, D-50931 Köln
Telefon:       +49 (0)221  470-90535
Sprechstunden: https://uni.koeln/ENZEB
E-Mail:        [email protected] = [email protected]
Threema:       https://threema.id/8M375R5K
===============================================================
Homepage:      http://daniel.buncic.de/
Academia:      http://uni-koeln.academia.edu/buncic
ResearchGate:  https://researchgate.net/profile/Daniel-Buncic-2
===============================================================
