Re: Dealing with Georgian capitalization in programming languages
Martin,

On 10/9/2018 12:47 AM, Martin J. Dürst via Unicode wrote:
> - Using the 'capitalize' method to (try to) get the titlecase
>   property of a MTAVRULI character. (There's no other way
>   currently in Ruby to get the titlecase property.)
>
> There may be others. If you have some ideas, I'd appreciate to know
> about them.
>
> This lets me wonder why the UTC didn't simply declare the titlecase
> property of MTAVRULI to be mkhedruli. Was this considered or not? The
> way things are currently set up, there seems to be no benefit of
> MTAVRULI being its own titlecase, because in actual use, that requires
> additional processing.

Titlecasing for Georgian was not completely thought through before
Mtavruli was added. As I noted in my earlier comment on this thread, the
titlecase mapping values for Mkhedruli were added late in the process,
when it became clear that not doing so would result in inappropriate
outcomes for existing Mkhedruli text.

I don't think there is a fully worked-out position on this, but adding a
Simple_Titlecase mapping for Mtavruli to Mkhedruli would, I suspect,
just further muddy the waters for implementers, because it would in
effect be saying that an uppercase letter titlecases by shifting to its
lowercase mapping. A headscratcher, at the very least.

Note that with the current mappings as they are, Changes_When_Titlecased
is False for all Mkhedruli and for all Mtavruli characters, which I
think is the desired state of affairs. A titlecasing string operation on
Mtavruli that does something other than just leave the string alone
should, IMO, be documented as doing something extra and *should* have to
do additional processing.

--Ken
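A minimal sketch of the Changes_When_Titlecased point, assuming hand-written mapping tables rather than a real Unicode database. The code point ranges follow the Unicode 11.0 Georgian and Georgian Extended blocks; the names `simple_titlecase` and `changes_when_titlecased` are hypothetical helpers, and the check is a single-character approximation of the property.

```python
# Hand-written simple case mappings for Georgian, following the
# Unicode 11.0 layout. Assumption: a real implementation would read
# these from UnicodeData.txt instead of building them by offset.
MKHEDRULI = [*range(0x10D0, 0x10FB), *range(0x10FD, 0x1100)]
OFFSET = 0x1C90 - 0x10D0  # Mtavruli sits 0xBC0 above Mkhedruli

SIMPLE_UPPER = {chr(cp): chr(cp + OFFSET) for cp in MKHEDRULI}
SIMPLE_LOWER = {upper: lower for lower, upper in SIMPLE_UPPER.items()}
# The late addition mentioned above: Mkhedruli titlecases to itself.
SIMPLE_TITLE = {chr(cp): chr(cp) for cp in MKHEDRULI}


def simple_titlecase(ch):
    # An empty titlecase field falls back to the uppercase mapping,
    # and a character with neither mapping maps to itself.
    return SIMPLE_TITLE.get(ch, SIMPLE_UPPER.get(ch, ch))


def changes_when_titlecased(ch):
    # Single-character approximation of the property; the full
    # definition compares toTitlecase(NFD(X)) with NFD(X).
    return simple_titlecase(ch) != ch


georgian = list(SIMPLE_UPPER) + list(SIMPLE_LOWER)
print(any(changes_when_titlecased(ch) for ch in georgian))  # False
```

If the explicit `SIMPLE_TITLE` entries were left out, the fallback would titlecase every Mkhedruli letter to Mtavruli, which is the inappropriate outcome for existing text that the late table change avoided.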
Aw: Re: Dealing with Georgian capitalization in programming languages
The capital ẞ (U+1E9E) has been officially approved by the Council for
the German Language since July 2018. However, no word starts with ß, so
the character is only relevant for words written entirely in capitals.
It may only stand alone in spaced type, when there is no available
italic font style.

The Ruby bug tracker also has an issue with Dutch ij → IJ. The dedicated
ligatures IJ (U+0132) and ij (U+0133) are not recommended and thus never
used, but a leading ij must always be capitalized to IJ, as in
IJSBERG → ijsberg → IJsberg.

The actual problem is that the current capitalization algorithm is based
on a regular grammar (type 3). It has to be adjusted for a
context-sensitive (type 1) grammar.

Regards,
Marius

On 2018/10/09 09:47, Martin J. Dürst wrote:
> I have been thinking through this. It seems quite appealing.
>
> But I'm concerned there may be some edge cases. I have been able to come
> up with two so far:
>
> - Applying this to a string starting with upper-case SZ (U+1E9E).
>   This may change SZ → ß → Ss.
> - Using the 'capitalize' method to (try to) get the titlecase
>   property of a MTAVRULI character. (There's no other way
>   currently in Ruby to get the titlecase property.)
>
> There may be others. If you have some ideas, I'd appreciate to know
> about them.
>
> This lets me wonder why the UTC didn't simply declare the titlecase
> property of MTAVRULI to be mkhedruli. Was this considered or not? The
> way things are currently set up, there seems to be no benefit of
> MTAVRULI being its own titlecase, because in actual use, that requires
> additional processing.
>
> Regards, Martin.
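A minimal sketch of the Dutch case, assuming the input is a single word known to be Dutch and written with the two-letter sequence i + j rather than the ligature code points. The function name `capitalize_dutch` is hypothetical and is not taken from Ruby or any library.

```python
def capitalize_dutch(word):
    """Lowercase the word, then capitalize its start.

    A leading "ij" is treated as one unit and becomes "IJ";
    any other word only has its first letter uppercased.
    """
    lowered = word.lower()
    if lowered.startswith("ij"):
        return "IJ" + lowered[2:]
    return lowered[:1].upper() + lowered[1:]


for sample in ("IJSBERG", "ijsberg", "amsterdam"):
    print(sample, "->", capitalize_dutch(sample))
```

The rule looks at two characters of context instead of one, and it only applies when the language is known, so it cannot be expressed as a per-character mapping table.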
Re: Dealing with Georgian capitalization in programming languages
Hello Ken, others,

On 2018/10/03 06:43, Ken Whistler wrote:
> But it seems to me that the problem you are citing can be avoided if
> you simply rethink what your "capitalize" means. It really should be
> conceived of as first lowercasing the *entire* string, and then
> titlecasing the *eligible* letters -- i.e., usually the first letter.
> (Note that this allows for the concept that titlecasing might then be
> localized on a per-writing-system basis -- the issue would devolve to
> determining what the rules are for "eligible" letters.) But the simple
> default would just be to titlecase the initial letter of each "word"
> segment of a string.
>
> Note that conceived this way, for the Georgian mappings, where the
> titlecase mapping for Mkhedruli is simply the letter itself, this
> approach ends up with:
>
> capitalize(mkhedrulistring) --> mkhedrulistring
>
> capitalize(MTAVRULISTRING) ==> titlecase(lowercase(MTAVRULISTRING))
> --> mkhedrulistring
>
> Thus avoiding any mixed case.

I have been thinking through this. It seems quite appealing.

But I'm concerned there may be some edge cases. I have been able to come
up with two so far:

- Applying this to a string starting with upper-case SZ (U+1E9E).
  This may change SZ → ß → Ss.
- Using the 'capitalize' method to (try to) get the titlecase
  property of a MTAVRULI character. (There's no other way
  currently in Ruby to get the titlecase property.)

There may be others. If you have some ideas, I'd appreciate to know
about them.

This lets me wonder why the UTC didn't simply declare the titlecase
property of MTAVRULI to be mkhedruli. Was this considered or not? The
way things are currently set up, there seems to be no benefit of
MTAVRULI being its own titlecase, because in actual use, that requires
additional processing.

Regards, Martin.
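A minimal sketch of the first edge case, assuming two hand-written table entries: the lowercase mapping of U+1E9E from UnicodeData.txt and the titlecase mapping of U+00DF from SpecialCasing.txt. The function name `lower_then_title_first` is hypothetical; all other characters fall back to Python's built-in `str.lower` and `str.title`.

```python
# Explicit entries for the two characters under discussion.
FULL_LOWER = {"\u1E9E": "\u00DF"}  # capital sharp s -> small sharp s
FULL_TITLE = {"\u00DF": "Ss"}      # small sharp s -> "Ss" (two letters)


def lower_then_title_first(text):
    """Lowercase the whole string, then titlecase only its first character."""
    lowered = "".join(FULL_LOWER.get(ch, ch.lower()) for ch in text)
    if not lowered:
        return lowered
    first = FULL_TITLE.get(lowered[0], lowered[0].title())
    return first + lowered[1:]


result = lower_then_title_first("\u1E9E")
print(result, len(result))  # Ss 2
```

The single input character comes back as two letters, so the round trip through lowercase loses the original capital form.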
Re: Dealing with Georgian capitalization in programming languages
Ken, Markus,

Many thanks for your ideas, which I noted at
https://bugs.ruby-lang.org/issues/14839.

Regards, Martin.

On 2018/10/03 06:43, Ken Whistler wrote:
> On 10/2/2018 12:45 AM, Martin J. Dürst via Unicode wrote:
>> My questions here are:
>> - Has this been considered when Georgian Mtavruli was discussed in the
>>   UTC?
>
> Not explicitly, that I recall. The whole issue of titlecasing came up
> very late in the preparation of case mapping tables for Mtavruli and
> Mkhedruli for 11.0.
>
> But it seems to me that the problem you are citing can be avoided if
> you simply rethink what your "capitalize" means. It really should be
> conceived of as first lowercasing the *entire* string, and then
> titlecasing the *eligible* letters -- i.e., usually the first letter.
> (Note that this allows for the concept that titlecasing might then be
> localized on a per-writing-system basis -- the issue would devolve to
> determining what the rules are for "eligible" letters.) But the simple
> default would just be to titlecase the initial letter of each "word"
> segment of a string.
>
> Note that conceived this way, for the Georgian mappings, where the
> titlecase mapping for Mkhedruli is simply the letter itself, this
> approach ends up with:
>
> capitalize(mkhedrulistring) --> mkhedrulistring
>
> capitalize(MTAVRULISTRING) ==> titlecase(lowercase(MTAVRULISTRING))
> --> mkhedrulistring
>
> Thus avoiding any mixed case.
Re: Dealing with Georgian capitalization in programming languages
On 10/2/2018 12:45 AM, Martin J. Dürst via Unicode wrote:
> capitalize: uppercase (or title-case) the first character of the
> string, lowercase the rest
>
> When I say "cause problems", I mean producing mixed-case output. I
> originally thought that 'capitalize' would be fine. It is fine for
> lowercase input: It stays lowercase because Unicode Data indicates that
> titlecase for lowercase Georgian letters is the letter itself. But it
> will produce the apparently undesirable Mixed Case for ALL UPPERCASE
> input.
>
> My questions here are:
> - Has this been considered when Georgian Mtavruli was discussed in the
>   UTC?

Not explicitly, that I recall. The whole issue of titlecasing came up
very late in the preparation of case mapping tables for Mtavruli and
Mkhedruli for 11.0.

But it seems to me that the problem you are citing can be avoided if you
simply rethink what your "capitalize" means. It really should be
conceived of as first lowercasing the *entire* string, and then
titlecasing the *eligible* letters -- i.e., usually the first letter.
(Note that this allows for the concept that titlecasing might then be
localized on a per-writing-system basis -- the issue would devolve to
determining what the rules are for "eligible" letters.) But the simple
default would just be to titlecase the initial letter of each "word"
segment of a string.

Note that conceived this way, for the Georgian mappings, where the
titlecase mapping for Mkhedruli is simply the letter itself, this
approach ends up with:

capitalize(mkhedrulistring) --> mkhedrulistring

capitalize(MTAVRULISTRING) ==> titlecase(lowercase(MTAVRULISTRING))
--> mkhedrulistring

Thus avoiding any mixed case.

--Ken
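A minimal sketch contrasting the two readings of "capitalize" on Georgian text, assuming a hand-built Mtavruli-to-Mkhedruli table (Unicode 11.0 ranges) and handling Georgian letters only. The names `capitalize_naive` and `capitalize_lower_first` are hypothetical and do not correspond to Ruby's implementation.

```python
# Mtavruli (U+1C90..) to Mkhedruli (U+10D0..) by fixed offset.
# Assumption: only Georgian is handled; other characters pass through.
MKHEDRULI = [*range(0x10D0, 0x10FB), *range(0x10FD, 0x1100)]
OFFSET = 0x1C90 - 0x10D0
TO_LOWER = {chr(cp + OFFSET): chr(cp) for cp in MKHEDRULI}


def lowercase(text):
    return "".join(TO_LOWER.get(ch, ch) for ch in text)


def titlecase_char(ch):
    # Both Georgian cases titlecase to themselves in Unicode 11.0.
    return ch


def capitalize_naive(text):
    """Titlecase the first character, lowercase the rest."""
    return titlecase_char(text[:1]) + lowercase(text[1:])


def capitalize_lower_first(text):
    """Lowercase the entire string, then titlecase the first character."""
    lowered = lowercase(text)
    return titlecase_char(lowered[:1]) + lowered[1:]


mkhedruli = "\u10D7\u10D1\u10D8\u10DA\u10D8\u10E1\u10D8"  # "Tbilisi"
mtavruli = "".join(chr(ord(ch) + OFFSET) for ch in mkhedruli)

print(capitalize_naive(mtavruli) == mkhedruli)        # False (mixed case)
print(capitalize_lower_first(mtavruli) == mkhedruli)  # True
```

The naive version leaves the first letter in Mtavruli and lowercases the rest, which is the mixed-case output the thread wants to avoid; lowercasing first gives plain Mkhedruli for both inputs.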
Re: Dealing with Georgian capitalization in programming languages
I see no easy way to convert ALL UPPERCASE text to consistent casing, as
there is no rule, except by using dictionary lookups. In reality, data
should be input using default casing (as in dictionary entries),
independently of its position in sentences, paragraphs or titles, with
the contextual conversion of some or all characters to uppercase done
algorithmically. This is safe for conversion to ALL UPPERCASE, and quite
reliable for conversion to Title Case, with just a few dictionary
lookups for a small set of known words per language.

Note that title casing works differently in English (which most often
overdoes it by capitalizing every word), while most other languages
capitalize only selected words, or in French just the first selected
word (in addition to the possible first letter of non-selected words,
such as definite and indefinite articles, at the start of the sentence).

Capitalizing the initial of every word is wrong in German, which uses
capitalization even more strictly than French or Italian: when in doubt,
do not perform any titlecasing, and let the data provide the actual
capitalization of titles directly. It is OK and even recommended in
German to have section headings, or even book titles, written as if they
were in the middle of sentences; you capitalize only titles and headings
that are grammatically full sentences, not simple nominal groups.

So title casing should not even be promoted by the UCD standard (where
it in fact uses only very basic, simplistic rules); it is applicable
only in some applications, for some languages, and in specific technical
or rendering contexts.

On Tue, Oct 2, 2018 at 22:21, Markus Scherer via Unicode wrote:

> On Tue, Oct 2, 2018 at 12:50 AM Martin J. Dürst via Unicode <
> unicode@unicode.org> wrote:
>
>> ... The only
>> operation that can cause problems is 'capitalize'.
>>
>> When I say "cause problems", I mean producing mixed-case output. I
>> originally thought that 'capitalize' would be fine. It is fine for
>> lowercase input: It stays lowercase because Unicode Data indicates that
>> titlecase for lowercase Georgian letters is the letter itself. But it
>> will produce the apparently undesirable Mixed Case for ALL UPPERCASE
>> input.
>>
>> My questions here are:
>> - Has this been considered when Georgian Mtavruli was discussed in the
>>   UTC?
>> - How have any other implementers (ICU,...) addressed this, in
>>   particular the operation that's called 'capitalize' in Ruby?
>>
>
> By default, ICU toTitle() functions titlecase at word boundaries (with
> adjustment) and lowercase all else.
> That is, we implement Unicode chapter 3.13 Default Case Conversions R3
> toTitlecase(x), except that we modified the default boundary adjustment.
>
> You can customize the boundaries (e.g., only the start of the string).
> We have options for whether and how to adjust the boundaries (e.g., adjust
> to the next cased letter) and for copying, not lowercasing, the other
> characters.
> See C++ and Java class CaseMap and the relevant options.
>
> markus
>
Re: Dealing with Georgian capitalization in programming languages
On Tue, Oct 2, 2018 at 12:50 AM Martin J. Dürst via Unicode <
unicode@unicode.org> wrote:

> ... The only
> operation that can cause problems is 'capitalize'.
>
> When I say "cause problems", I mean producing mixed-case output. I
> originally thought that 'capitalize' would be fine. It is fine for
> lowercase input: It stays lowercase because Unicode Data indicates that
> titlecase for lowercase Georgian letters is the letter itself. But it
> will produce the apparently undesirable Mixed Case for ALL UPPERCASE
> input.
>
> My questions here are:
> - Has this been considered when Georgian Mtavruli was discussed in the
>   UTC?
> - How have any other implementers (ICU,...) addressed this, in
>   particular the operation that's called 'capitalize' in Ruby?
>

By default, ICU toTitle() functions titlecase at word boundaries (with
adjustment) and lowercase all else.
That is, we implement Unicode chapter 3.13 Default Case Conversions R3
toTitlecase(x), except that we modified the default boundary adjustment.

You can customize the boundaries (e.g., only the start of the string).
We have options for whether and how to adjust the boundaries (e.g.,
adjust to the next cased letter) and for copying, not lowercasing, the
other characters.
See C++ and Java class CaseMap and the relevant options.

markus
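A minimal sketch of titlecasing at word boundaries with the two kinds of options described above. This is plain Python, not ICU's API: the function `to_title` and its parameters `only_string_start` and `lowercase_rest` are hypothetical names, and the regular expression is a crude stand-in for real word segmentation.

```python
import re

# Runs of letters; an assumption standing in for proper word boundaries.
WORD = re.compile(r"[^\W\d_]+")


def to_title(text, only_string_start=False, lowercase_rest=True):
    """Titlecase the first letter of each word (or only of the string).

    Non-letters before a word are skipped, so the titlecased character
    is the first letter after the boundary. With lowercase_rest=False
    the remaining characters are copied unchanged.
    """
    pieces = []
    pos = 0
    first = True
    for match in WORD.finditer(text):
        pieces.append(text[pos:match.start()])
        word = match.group(0)
        head, tail = word[0], word[1:]
        if first or not only_string_start:
            head = head.title()
        elif lowercase_rest:
            head = head.lower()
        if lowercase_rest:
            tail = tail.lower()
        pieces.append(head + tail)
        pos = match.end()
        first = False
    pieces.append(text[pos:])
    return "".join(pieces)


print(to_title("hello WORLD"))                          # Hello World
print(to_title("hello WORLD", only_string_start=True))  # Hello world
print(to_title("hello WORLD", lowercase_rest=False))    # Hello WORLD
```

With `only_string_start=True` and the default `lowercase_rest=True`, the sketch behaves like the lowercase-then-titlecase reading of 'capitalize' discussed earlier in the thread.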