Re: Dealing with Georgian capitalization in programming languages

2018-10-09 Thread Ken Whistler via Unicode

Martin,

On 10/9/2018 12:47 AM, Martin J. Dürst via Unicode wrote:

> - Using the 'capitalize' method to (try to) get the titlecase
>   property of a MTAVRULI character. (There's no other way
>   currently in Ruby to get the titlecase property.)
>
> There may be others. If you have some ideas, I'd appreciate hearing
> about them.
>
> This makes me wonder why the UTC didn't simply declare the titlecase
> property of MTAVRULI to be mkhedruli. Was this considered or not? The
> way things are currently set up, there seems to be no benefit of
> MTAVRULI being its own titlecase, because in actual use, that requires
> additional processing.


Titlecasing for Georgian was not completely thought through before 
Mtavruli was added. As I noted in my earlier comment on this thread, the 
titlecase mapping values for Mkhedruli were added late in the process, 
when it became clear that not doing so would result in inappropriate 
outcomes for existing Mkhedruli text.


I don't think there is a fully worked-out position on this, but adding a 
Simple_Titlecase mapping for Mtavruli to Mkhedruli would, I suspect, 
just further muddy the waters for implementers, because it would be in 
effect saying that an uppercase letter titlecases by shifting to its 
lowercase mapping. A head-scratcher, at the very least.


Note that with the current mappings as they are, Changes_When_Titlecased 
is False for all Mkhedruli and for all Mtavruli characters, which I 
think is the desired state of affairs. A titlecasing string operation of 
Mtavruli that does something other than just leave the string alone 
should, IMO, be documented as doing something extra and *should* have to 
do additional processing.
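
To make the property concrete, here is a minimal Python sketch, not taken
from any implementation. It builds the Georgian simple titlecase mappings by
hand from the Unicode 11.0 code point ranges (so it does not depend on the
Unicode version of the Python runtime) and evaluates Changes_When_Titlecased
for single characters. Normalization is ignored, which is safe here because
these letters have no decompositions. The names titlecase_table and
changes_when_titlecased are illustrative only, and the second table models
the hypothetical Mtavruli-to-Mkhedruli mapping discussed above.

```python
# Mkhedruli U+10D0..U+10FA and U+10FD..U+10FF pair with Mtavruli
# U+1C90..U+1CBA and U+1CBD..U+1CBF at a constant offset (Unicode 11.0).
OFFSET = 0x1C90 - 0x10D0
MKHEDRULI = [chr(cp) for cp in (*range(0x10D0, 0x10FB), *range(0x10FD, 0x1100))]
MTAVRULI = [chr(ord(ch) + OFFSET) for ch in MKHEDRULI]


def titlecase_table(mtavruli_titlecases_to_mkhedruli=False):
    """Return a char -> simple titlecase mapping for the Georgian letters."""
    table = {ch: ch for ch in MKHEDRULI}  # Mkhedruli is its own titlecase
    for upper, lower in zip(MTAVRULI, MKHEDRULI):
        # Current data: Mtavruli is its own titlecase.
        # Hypothetical alternative: Mtavruli titlecases to Mkhedruli.
        table[upper] = lower if mtavruli_titlecases_to_mkhedruli else upper
    return table


def changes_when_titlecased(ch, table):
    """Single-character approximation of the derived property."""
    return table[ch] != ch


current = titlecase_table()
alternative = titlecase_table(mtavruli_titlecases_to_mkhedruli=True)

# With the current mappings nothing changes for either set of letters.
print(any(changes_when_titlecased(ch, current) for ch in MKHEDRULI + MTAVRULI))  # False
# With the hypothetical mapping every Mtavruli letter would change.
print(all(changes_when_titlecased(ch, alternative) for ch in MTAVRULI))  # True
```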


--Ken



Aw: Re: Dealing with Georgian capitalization in programming languages

2018-10-09 Thread Marius Spix via Unicode
The capital ẞ (U+1E9E) has been officially approved by the Council for the 
German Language since July 2018. However, no word starts with ß, so the 
character is only relevant for fully capitalized words. It may only stand 
alone in spaced type, when no italic font style is available.

In the Ruby bug tracker there is also an issue with Dutch ij → IJ. The 
dedicated ligatures IJ (U+0132) and ij (U+0133) are not recommended and thus 
never used, but a leading ij must always be capitalized to IJ, as in IJSBERG → 
ijsberg → IJsberg. The actual problem is that the current capitalization 
algorithm is based on a regular grammar (type 3). It has to be adjusted for a 
context-sensitive (type 1) grammar.
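
As a rough illustration of the Dutch case, here is a minimal Python sketch.
The helper name capitalize_dutch is made up for this example, and it assumes
that every leading "ij" is the digraph and that the ligature code points do
not occur in the input. It is not how Ruby or any library handles this.

```python
def capitalize_dutch(word):
    """Lowercase the word, then capitalize it; a leading "ij" becomes "IJ"."""
    lowered = word.lower()
    if lowered.startswith("ij"):
        # Both letters of the digraph are capitalized together.
        return "IJ" + lowered[2:]
    return lowered[:1].upper() + lowered[1:]


print(capitalize_dutch("IJSBERG"))    # IJsberg
print(capitalize_dutch("amsterdam"))  # Amsterdam
print("ijsberg".capitalize())         # Ijsberg (language-agnostic result)
```

The special case needs to look at two characters at once, which is why a
purely per-character mapping of the first letter cannot produce it.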

Regards,

Marius

 




Re: Dealing with Georgian capitalization in programming languages

2018-10-09 Thread Martin J. Dürst via Unicode

Hello Ken, others,

On 2018/10/03 06:43, Ken Whistler wrote:

> But it seems to me that the problem you are citing can be avoided if you
> simply rethink what your "capitalize" means. It really should be
> conceived of as first lowercasing the *entire* string, and then
> titlecasing the *eligible* letters -- i.e., usually the first letter.
> (Note that this allows for the concept that titlecasing might then be
> localized on a per-writing-system basis -- the issue would devolve to
> determining what the rules are for "eligible" letters.) But the simple
> default would just be to titlecase the initial letter of each "word"
> segment of a string.
>
> Note that conceived this way, for the Georgian mappings, where the
> titlecase mapping for Mkhedruli is simply the letter itself, this
> approach ends up with:
>
> capitalize(mkhedrulistring) --> mkhedrulistring
>
> capitalize(MTAVRULISTRING) ==> titlecase(lowercase(MTAVRULISTRING)) --> mkhedrulistring
>
> Thus avoiding any mixed case.


I have been thinking through this. It seems quite appealing.

But I'm concerned there may be some edge cases. I have been able to come 
up with two so far:


- Applying this to a string starting with upper-case SZ (U+1E9E).
  This may change SZ → ß → Ss.
- Using the 'capitalize' method to (try to) get the titlecase
  property of a MTAVRULI character. (There's no other way
  currently in Ruby to get the titlecase property.)
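
The first edge case can be reproduced with a small Python sketch of the
"lowercase everything, then titlecase the first letter" reading. The
function name capitalize_lower_first and the artificial input string are
assumptions for illustration; the ß to "Ss" titlecase mapping is written out
explicitly rather than relying on the runtime.

```python
# Full titlecase mapping for U+00DF (ß) as listed in SpecialCasing.txt.
TITLECASE_SPECIAL = {"\u00DF": "Ss"}


def capitalize_lower_first(text):
    """Lowercase the whole string, then titlecase its first character."""
    lowered = text.lower()
    if not lowered:
        return lowered
    first = lowered[0]
    # Fall back to the runtime's titlecase mapping for other characters.
    return TITLECASE_SPECIAL.get(first, first.title()) + lowered[1:]


# U+1E9E (ẞ) lowercases to U+00DF (ß), which then titlecases to "Ss".
print(capitalize_lower_first("\u1E9EABC"))  # Ssabc
print(capitalize_lower_first("STRASSE"))    # Strasse
```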

There may be others. If you have some ideas, I'd appreciate hearing 
about them.


This makes me wonder why the UTC didn't simply declare the titlecase 
property of MTAVRULI to be mkhedruli. Was this considered or not? The 
way things are currently set up, there seems to be no benefit of 
MTAVRULI being its own titlecase, because in actual use, that requires 
additional processing.


Regards,   Martin.


Re: Dealing with Georgian capitalization in programming languages

2018-10-04 Thread Martin J. Dürst via Unicode

Ken, Markus,

Many thanks for your ideas, which I noted at
https://bugs.ruby-lang.org/issues/14839.

Regards,   Martin.





Re: Dealing with Georgian capitalization in programming languages

2018-10-02 Thread Ken Whistler via Unicode



On 10/2/2018 12:45 AM, Martin J. Dürst via Unicode wrote:
> capitalize: uppercase (or title-case) the first character of the
> string, lowercase the rest
>
> When I say "cause problems", I mean producing mixed-case output. I
> originally thought that 'capitalize' would be fine. It is fine for
> lowercase input: it stays lowercase because Unicode Data indicates that
> titlecase for lowercase Georgian letters is the letter itself. But it
> will produce the apparently undesirable Mixed Case for ALL UPPERCASE
> input.
>
> My questions here are:
> - Has this been considered when Georgian Mtavruli was discussed in the
>   UTC?

Not explicitly, that I recall. The whole issue of titlecasing came up 
very late in the preparation of case mapping tables for Mtavruli and 
Mkhedruli for 11.0.


But it seems to me that the problem you are citing can be avoided if you 
simply rethink what your "capitalize" means. It really should be 
conceived of as first lowercasing the *entire* string, and then 
titlecasing the *eligible* letters -- i.e., usually the first letter. 
(Note that this allows for the concept that titlecasing might then be 
localized on a per-writing-system basis -- the issue would devolve to 
determining what the rules are for "eligible" letters.) But the simple 
default would just be to titlecase the initial letter of each "word" 
segment of a string.


Note that conceived this way, for the Georgian mappings, where the 
titlecase mapping for Mkhedruli is simply the letter itself, this 
approach ends up with:


capitalize(mkhedrulistring) --> mkhedrulistring

capitalize(MTAVRULISTRING) ==> titlecase(lowercase(MTAVRULISTRING)) --> mkhedrulistring


Thus avoiding any mixed case.
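
The difference between the two readings of "capitalize" can be sketched in
Python as follows. This is only an illustration under stated assumptions:
the Mtavruli to Mkhedruli mapping is hard-coded from the Unicode 11.0 ranges
so the result does not depend on the runtime's Unicode version, and the
function names (naive_capitalize, capitalize_lower_first) are invented here.

```python
# Mkhedruli U+10D0..U+10FA and U+10FD..U+10FF pair with Mtavruli
# U+1C90..U+1CBA and U+1CBD..U+1CBF at a constant offset (Unicode 11.0).
OFFSET = 0x1C90 - 0x10D0


def is_mtavruli(ch):
    cp = ord(ch)
    return 0x1C90 <= cp <= 0x1CBA or 0x1CBD <= cp <= 0x1CBF


def lowercase(text):
    # Explicit Mtavruli -> Mkhedruli mapping; everything else uses str.lower().
    return "".join(
        chr(ord(ch) - OFFSET) if is_mtavruli(ch) else ch.lower() for ch in text
    )


def titlecase_char(ch):
    # Mtavruli is its own titlecase; str.title() leaves Mkhedruli unchanged.
    return ch if is_mtavruli(ch) else ch.title()


def naive_capitalize(text):
    """Titlecase the first character as given, lowercase the rest."""
    return titlecase_char(text[0]) + lowercase(text[1:]) if text else text


def capitalize_lower_first(text):
    """Lowercase the entire string first, then titlecase the first letter."""
    lowered = lowercase(text)
    return titlecase_char(lowered[0]) + lowered[1:] if lowered else lowered


MKHEDRULI = "\u10D7\u10D1\u10D8\u10DA\u10D8\u10E1\u10D8"  # "Tbilisi" in Mkhedruli
MTAVRULI = "".join(chr(ord(ch) + OFFSET) for ch in MKHEDRULI)

# Naive reading: first letter stays Mtavruli, rest becomes Mkhedruli (mixed case).
print(naive_capitalize(MTAVRULI) == MTAVRULI[0] + MKHEDRULI[1:])  # True
# Lowercase-first reading: the result is plain Mkhedruli in both cases.
print(capitalize_lower_first(MTAVRULI) == MKHEDRULI)   # True
print(capitalize_lower_first(MKHEDRULI) == MKHEDRULI)  # True
```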

--Ken



Re: Dealing with Georgian capitalization in programming languages

2018-10-02 Thread Philippe Verdy via Unicode
I see no easy way to convert ALL UPPERCASE text to consistent casing, as
there's no rule, except by using dictionary lookups.
In reality, data should be input using default casing (as in dictionary
entries), independently of its position in sentences, paragraphs or
titles, with the contextual conversion of some or all characters to
uppercase being done algorithmically (this is safe for conversion to ALL
UPPERCASE, and quite reliable for conversion to Title Case, with just a few
dictionary lookups for a small set of known words per language).

Note that title casing works differently in English (where it is most often
abused by putting capitals on every word), while most other languages
capitalize only selected words, or just the first selected word in French
(in addition to the possible first letter of non-selected words such as
definite and indefinite articles at the start of the sentence). Capitalizing
the initial of every word is wrong in German, which uses capitalization even
more strictly than French or Italian: when in doubt, do not perform any
titlecasing, and allow the data to provide the actual capitalization of titles
directly (it is OK and even recommended in German to have section headings,
or even book titles, written as if they were in the middle of sentences,
and you capitalize only titles and headings that are full sentences
grammatically, but not simple nominal groups).

So title casing should not even be promoted by the UCD standard (where it
in fact uses only very basic, simplistic rules); it is applicable only in
some applications for some languages and in specific technical or rendering
contexts.
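
The dictionary-assisted conversion described in the first paragraph could
look roughly like the following Python sketch for English-style titles. The
function name title_case, the fixed_forms dictionary and the minor_words set
are all assumptions made for this illustration; real word lists would be
language-specific and much larger.

```python
def title_case(text, fixed_forms, minor_words):
    """Title-case text stored in default casing, using small lookup tables."""
    result = []
    for index, word in enumerate(text.split()):
        key = word.lower()
        if key in fixed_forms:
            # Known words keep their dictionary spelling (e.g. acronyms).
            result.append(fixed_forms[key])
        elif index > 0 and key in minor_words:
            # Articles, conjunctions etc. stay lowercase unless they come first.
            result.append(key)
        else:
            result.append(key[:1].upper() + key[1:])
    return " ".join(result)


print(title_case("the history of nasa and unicode",
                 fixed_forms={"nasa": "NASA"},
                 minor_words={"the", "of", "and"}))
# The History of NASA and Unicode
```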





Re: Dealing with Georgian capitalization in programming languages

2018-10-02 Thread Markus Scherer via Unicode
On Tue, Oct 2, 2018 at 12:50 AM Martin J. Dürst via Unicode <
unicode@unicode.org> wrote:

> ... The only
> operation that can cause problems is 'capitalize'.
>
> When I say "cause problems", I mean producing mixed-case output. I
> originally thought that 'capitalize' would be fine. It is fine for
> lowercase input: it stays lowercase because Unicode Data indicates that
> titlecase for lowercase Georgian letters is the letter itself. But it
> will produce the apparently undesirable Mixed Case for ALL UPPERCASE input.
>
> My questions here are:
> - Has this been considered when Georgian Mtavruli was discussed in the
>   UTC?
> - How have any other implementers (ICU,...) addressed this, in
>   particular the operation that's called 'capitalize' in Ruby?
>

By default, ICU toTitle() functions titlecase at word boundaries (with
adjustment) and lowercase all else.
That is, we implement Unicode chapter 3.13 Default Case Conversions R3
toTitlecase(x), except that we modified the default boundary adjustment.

You can customize the boundaries (e.g., only the start of the string).
We have options for whether and how to adjust the boundaries (e.g., adjust
to the next cased letter) and for copying, not lowercasing, the other
characters.
See C++ and Java class CaseMap and the relevant options.
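
For readers without ICU at hand, the options mentioned above can be
approximated in a few lines of Python. This is a sketch only and not ICU's
API: the function to_title and its parameters whole_string and
lowercase_rest are invented names, words are simply runs of non-whitespace
rather than proper word boundaries, and "cased" is tested crudely via the
runtime's own case mappings.

```python
import re


def to_title(text, whole_string=False, lowercase_rest=True):
    """Titlecase the first cased letter of each segment; lowercase or copy the rest."""

    def title_segment(segment):
        for i, ch in enumerate(segment):
            # Crude "cased letter" test: the character has some case mapping.
            if ch.lower() != ch or ch.upper() != ch:
                rest = segment[i + 1:]
                if lowercase_rest:
                    rest = rest.lower()
                # Characters before the first cased letter are copied as-is.
                return segment[:i] + ch.title() + rest
        return segment  # no cased letter found

    if whole_string:
        # Only the start of the string is treated as a boundary.
        return title_segment(text)
    # Assumption: a "word" is a maximal run of non-whitespace characters.
    return re.sub(r"\S+", lambda m: title_segment(m.group()), text)


print(to_title("hello WORLD"))                         # Hello World
print(to_title("hello WORLD", whole_string=True))      # Hello world
print(to_title("'hello WORLD", lowercase_rest=False))  # 'Hello WORLD
```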

markus