Re: Swapcase for Titlecase characters

2016-03-29 Thread Kent Karlsson

On 2016-03-19 17:40, "Doug Ewell" wrote:

> As one anecdote (which is even less like "data" than two anecdotes), I
> could not find any of the characters Ĳ ĳ Ǆ ǅ ǆ Ǉ ǈ ǉ Ǌ ǋ ǌ or their hex
(You missed the DZ "ligatures" (which aren't really ligatures).)

As mentioned, for the Ĳ ĳ here (which sometimes ARE shown as ligatures,
mostly in signage), there is no "titlecase" variant (and thus
no problem for "swapcase"). For casing they behave just like Œ œ and Æ æ.
While we are off-topic for this thread... (but still on-topic for
this list):

I still think ĳ should have the "soft-dotted" property (and that
that property should finally be implemented properly in various systems...).

> equivalents in any of the CLDR keyboard definitions.

I've heard that old typewriters used to have a key for Ĳ ĳ. Maybe it
should be reintroduced for Dutch computer keyboards, as well as used
(for Dutch) in autocorrects (IJ -> Ĳ, ij -> ĳ) or spell correctors
(looking at the whole word rather than just two letters, and then
not restricted to Dutch per se, but covering certain Dutch names regardless
of the language of the surrounding text). That, in turn, would
probably be a better approach than trying to have some special
handling of the sequence "ij" in case mapping (for Dutch alone).

/Kent K

> I'd imagine that 
> users just type the two characters separately, and that consequently
> most data in the real world is like that.
> 
> --
> Doug Ewell | http://ewellic.org | Thornton, CO  





Re: Swapcase for Titlecase characters

2016-03-21 Thread Doug Ewell

I wrote:


> As one anecdote (which is even less like "data" than two anecdotes), I
> could not find any of the characters Ĳ ĳ Ǆ ǅ ǆ Ǉ ǈ ǉ Ǌ ǋ ǌ or their
> hex equivalents in any of the CLDR keyboard definitions. I'd imagine
> that users just type the two characters separately, and that
> consequently most data in the real world is like that.


Some off-list messages have helped to remind me that in the context of 
titlecase and swapcase, I should not have included Ĳ and ĳ (U+0132 and 
U+0133) in that list. There is clearly no question about how swapcase 
should handle those. Sorry for the distraction.


--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Swapcase for Titlecase characters

2016-03-20 Thread Doug Ewell

Otto Stolz wrote:


>> [ ... ] I'd imagine that users just type the two characters
>> [IJ or ij] separately, and that consequently most data in the real
>> world is like that.
>
> For "IJ",
> cf. .


I can't make Edge or Acrobat Reader DC jump to the bookmark (suggestions 
off-list, please), but I guess Otto referred to this passage, which ends 
with the point I was trying to make:



Another pair of characters, U+0133 LATIN SMALL LIGATURE IJ and its
uppercase version, was provided to support the digraph "ij" in Dutch,
often termed a "ligature" in discussions of Dutch orthography. When
adding intercharacter spacing for line justification, the "ij" is kept
as a unit, and the space between the i and j does not increase. In
titlecasing, both the i and the j are uppercased, as in the word
"IJsselmeer." Using a single code point might simplify software
support for such features; however, because a vast amount of Dutch
data is encoded without this digraph character, under most
circumstances one will encounter an <i, j> sequence.
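
A small Python sketch of the titlecasing point in the passage above (not from the thread). `capitalize_dutch` is a hypothetical helper, and treating every word-initial "ij" as the Dutch digraph is an assumption; real text would need a word list or language tag to confirm it.

```python
import unicodedata


def capitalize_dutch(word: str) -> str:
    """Capitalize a word, uppercasing both letters of an initial "ij".

    Hypothetical helper: it assumes any word-initial i+j is the Dutch
    digraph, which is not true for every language or every name.
    """
    if word[:2].lower() == "ij":
        return "IJ" + word[2:].lower()
    return word.capitalize()


digraph_word = "\u0133sselmeer"  # starts with U+0133 LATIN SMALL LIGATURE IJ
plain_word = "ijsselmeer"        # starts with the two-letter <i, j> sequence

# The single code point is capitalized as a unit by the default mapping.
print(unicodedata.name(digraph_word.capitalize()[0]))  # LATIN CAPITAL LIGATURE IJ

# The two-letter sequence is not: only the "i" is uppercased.
print(plain_word.capitalize())       # Ijsselmeer
print(capitalize_dutch(plain_word))  # IJsselmeer
```

The contrast is the one the quoted passage describes: a single code point gets the right result from generic casing code, while the far more common two-letter spelling needs language-specific handling.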


--
Doug Ewell | http://ewellic.org | Thornton, CO  



Re: Swapcase for Titlecase characters

2016-03-20 Thread Otto Stolz

Hello,

On 19.03.2016 at 17:40, Doug Ewell wrote:

> As one anecdote (which is even less like "data" than two anecdotes), I
> could not find any of the characters Ĳ ĳ Ǆ ǅ ǆ Ǉ ǈ ǉ Ǌ ǋ ǌ or their hex
> equivalents in any of the CLDR keyboard definitions. I'd imagine that
> users just type the two characters separately, and that consequently
> most data in the real world is like that.


For »IJ«,
cf. .

Regards,
  Otto




Re: Swapcase for Titlecase characters

2016-03-19 Thread Marcel Schneider
On Fri, Mar 18, 2016, 08:43:56, Martin J. Dürst  wrote:

> I'm working on extending the case conversion methods for the programming 
> language Ruby from the current ASCII only to cover all of Unicode.
> 
> Ruby comes with four methods for case conversion. Three of them, upcase, 
> downcase, and capitalize, are quite clear. But we have hit a question 
> for the fourth method, swapcase.
> 
> What swapcase does is swap upper and lower case, so that e.g.
> 
> 'Unicode Standard'.swapcase => 'uNICODE sTANDARD'
> 
> I'm not sure myself where this method is actually used, but it also 
> exists in Python (and maybe Ruby got it from there).
> 
> 
> Now the question I have is: What to do for titlecase characters? Several 
> possibilities already have been floated:
> 
> a) Leave as is, because they are neither upper nor lower case.
> 
> b) Convert to upper (or lower), which may simplify implementation.
> 
> c) Decompose the character into upper and lower case components, and 
> apply swapcase to these.
> 
> 
> For example, 'ǅinsi' (jeans) would become 'ǅINSI' with a), 'ǄINSI' (or 
> 'ǆinsi') with b), and 'dŽINSI' with c). For another example, 'ᾨδή' would 
> become 'ᾨΔΉ' with a), 'ὨΙΔΉ' (or 'ᾠΔΉ') with b), and 'ὠΙΔΉ' with c).
> 
> It looks like Python 3 (3.4.3 in my case) is doing a). My guess is that 
> from a user expectation point of view, c) is best, so I'm tending to go 
> for c). There is no existing data from the Unicode Standard for this, 
> but it seems pretty straightforward.
> 
> But before I just implement something, I'd appreciate additional input, 
> in particular from users closer to the affected language communities.
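
For readers who want to try the three options quoted above, here is a rough Python sketch (not from the thread). The function name `swapcase_titlecase` and its `option` parameter are made up for illustration. The way option c) splits a titlecase character (full uppercase mapping, compatibility decomposition, then lowercasing only the first component) is just one possible reading of "decompose", not an established algorithm.

```python
import unicodedata


def swapcase_titlecase(text: str, option: str = "a") -> str:
    """Swap case, treating titlecase (Lt) characters per option a, b or c."""
    out = []
    for ch in text:
        if unicodedata.category(ch) != "Lt":
            out.append(ch.swapcase())      # ordinary upper/lower swap
        elif option == "a":
            out.append(ch)                 # a) leave as is
        elif option == "b":
            out.append(ch.upper())         # b) convert to upper
        elif option == "c":
            # c) assumption: take the full uppercase form, decompose it,
            # lowercase the first component and keep the rest uppercase.
            parts = unicodedata.normalize("NFKD", ch.upper())
            swapped = parts[0].lower() + parts[1:]
            out.append(unicodedata.normalize("NFC", swapped))
        else:
            raise ValueError("option must be 'a', 'b' or 'c'")
    return "".join(out)


def code_points(s: str) -> str:
    """Show a string as code points, so the output is unambiguous."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


word = "\u01c5insi"  # U+01C5 (titlecase DZ with caron) followed by "insi"
for option in "abc":
    print(option, code_points(swapcase_titlecase(word, option)))
```

Printing code points rather than the strings themselves avoids any doubt about whether a digraph on screen is one character or two.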


As far as I can tell from my limited experience, the swapcase method is used 
only to convert “inverted titlecase” to titlecase. I call “inverted titlecase” 
the state of text produced by keyboard input while the caps lock toggle is 
accidentally on, and those words are “inversely capitalized” where the user 
pressed the shift modifier. Therefore such examples would be most useful.

Having said that, I know that this never occurs on many keyboards of 
English-speaking users who remapped that key to perform another action such as 
backspace, compose, or kana lock. Living myself in a country where the caps 
lock toggle is indispensable, I may be considered part of the targeted user 
communities, though unfortunately I speak neither Croatian nor Greek.

Looking at your examples, I would add a case where swapcase is typically 
applied: ‘ᾠΔΉ’ (cited [erroneously] as a result of option b), which is to 
be converted to ‘ᾨδή’, and ‘ǆINSI’, which is to become ‘ǅinsi’.

As for decomposing digraphs and ypogegrammeni to apply swapcase: that 
would probably do no good, as itʼs unnecessary and users wonʼt expect it.

I hope that helps.

Kind regards,

Marcel



Re: Swapcase for Titlecase characters

2016-03-19 Thread Marcel Schneider
On Sat Mar 19, 2016 12:54:51, Martin J. Dürst  wrote:

> On 2016/03/19 04:33, Marcel Schneider wrote:
> > On Fri, Mar 18, 2016, 08:43:56, Martin J. Dürst wrote:
> 
> >> b) Convert to upper (or lower), which may simplify implementation.
> 
> >> For example, 'ǅinsi' (jeans) would become 'ǅINSI' with a), 'ǄINSI' (or
> >> 'ǆinsi') with b), and 'dŽINSI' with c). For another example, 'ᾨδή' would
> >> become 'ᾨΔΉ' with a), 'ὨΙΔΉ' (or 'ᾠΔΉ') with b), and 'ὠΙΔΉ' with c).
> 
> > Looking at your examples, I would add a case where swapcase is 
> > typically applied:
> 
> > ‘ᾠΔΉ’ (cited [erroneously] as a result of option b), which is to be converted 
> > to ‘ᾨδή’, and ‘ǆINSI’, which is to become ‘ǅinsi’.
> 
> First, what do you mean by "erroneously"?

The intent of that bracketed word was just to acknowledge the fact that 
when ‘ᾨδή’ is converted to lower case as assumed in option “b-lower”, it 
becomes ‘ᾠδή’, while ‘ᾠΔΉ’ is a typical candidate for swapcase, so I could 
reuse it “as is” to illustrate the fourth case.

> 
> Second, did I get this right that your additional case (let's call it 
> d)) would cycle through the three options where available:
> lower -> title -> upper -> lower.

I’m afraid that swapcase as I saw it is not a roundtrip method, therefore I got 
some awkward moments today when I thought about how to implement it. As far as 
I could see, there are two pairs:

I: lowercase → titlecase (needed to correct the initials where the user pressed 
the shift modifier)
II: uppercase → lowercase (needed to correct the body of the words input while 
caps lock was on)

That typically matches what happens when caps lock is accidentally on and the 
user writes normally―on a keyboard that includes digraphs and uses the SGCaps 
feature for them, like this:

Modifier        None     Shift
CapsLock off    Lower    Title
CapsLock on     Upper    Lower

Correcting keyboard input done with the wrong caps lock state is the only 
situation I can see where swapcase is needed and thus is likely to be used. 
This is why the swapcase method is implemented in word processors, as part of 
an optional autocorrect feature that neutralizes the effect of starting a 
sentence normally while caps lock is on: After completing the input of an 
uppercase word with an initial lowercase letter, the word is automatically 
swapcased and caps lock is turned off.
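
The two pairs and the keyboard table above can be turned into a short Python sketch (an illustration only, not taken from any word processor). `looks_inverted` and `fix_inverted_caps` are hypothetical names, and the detection rule is a deliberately simple assumption.

```python
def looks_inverted(word: str) -> bool:
    """Guess whether a word was typed with caps lock accidentally on:
    a lowercase initial followed by an all-uppercase remainder."""
    return len(word) > 1 and word[0].islower() and word[1:].isupper()


def fix_inverted_caps(word: str) -> str:
    """Apply pair I (lowercase -> titlecase) and pair II (uppercase -> lowercase)."""
    if not looks_inverted(word):
        return word
    fixed = []
    for ch in word:
        if ch.islower():
            fixed.append(ch.title())  # pair I: uses the titlecase mapping
        elif ch.isupper():
            fixed.append(ch.lower())  # pair II
        else:
            fixed.append(ch)          # uncased characters pass through
    return "".join(fixed)


print(fix_inverted_caps("hELLO"))  # Hello
print(fix_inverted_caps("Hello"))  # Hello (already fine, left alone)
```

Because pair I goes through the titlecase mapping rather than the uppercase one, an initial lowercase digraph such as U+01C6 comes out as the titlecase U+01C5 and not as the all-caps U+01C4.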

However now that I tested it with the digraph of the examples (input through 
the composer of the keyboard layout), it doesnʼt work at all in one word 
processor, while in another one it works but uppercases the initial lowercase 
digraph instead of titlecasing it. [That may be considered effects of 
“streamlined” implementations that drop the less frequent cases.]


I donʼt believe that it would be useful to make swapcase a roundtrip method, 
and anyway it would be weird because of the letters with three case forms. The 
case conversion cycle you draw above usually applies to words (and doesnʼt work 
correctly in either of the two tested word processors when an initial DZ 
digraph is present), while most letters have identical values for 
Titlecase_Mapping and Uppercase_Mapping, and usually there is no means to flag 
them with “Titlecase_State”. This might be one more reason why current 
implementations of swapcase donʼt match the expected behavior for digraphs.
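
To see how few letters actually have three distinct case forms, the following Python sketch (an added illustration, relying only on the standard `unicodedata` module) lists the characters of general category Lt and shows their lowercase, titlecase, and uppercase mappings. The helper names are hypothetical, and the exact list depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata


def titlecase_letters() -> list:
    """Return every character whose general category is Lt."""
    return [chr(cp) for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Lt"]


def case_forms(ch: str) -> tuple:
    """Return the (lowercase, titlecase, uppercase) forms of one character."""
    return ch.lower(), ch.title(), ch.upper()


def code_points(s: str) -> str:
    """Show a string as code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


for ch in titlecase_letters():
    lower, title, upper = case_forms(ch)
    print(f"{unicodedata.name(ch)}: lower={code_points(lower)} "
          f"title={code_points(title)} upper={code_points(upper)}")
```

For an ordinary letter such as "a", the titlecase and uppercase forms coincide; only for the digraphs and the Greek letters with iota subscript do they differ.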


> 
> > As for decomposing digraphs and ypogegrammeni to apply swapcase: that 
> > would probably do no good,
> > as itʼs unnecessary and users wonʼt expect it.
> 
> Why do you say "users won't expect it"? For those users not aware of the 
> encoding internals, I'd indeed guess that's what users would expect, at 
> least in the Croatian case.

That depends on what the expected result is. If the swapcase method is to 
correct inverted casing, users wouldnʼt like to see the digraphs decomposed, 
all the less so because in the languages considered, the DZ digraph is a part 
of the alphabet between ‘D’ and ‘Đ’, so that users are really aware of it.

> For Greek, it may be different; it depends 
> on the extent to which the iota is seen as a letter vs. seen as a mark.

Here again the user inputs a precomposed letter with iota subscript because he 
just wants a capitalized word, not an uppercase one. And here again the 
autocorrect doesnʼt work in one word processor, while the other one applies 
uppercasing with an uppercase iota adscript (while the rest of the word is 
lowercase) instead of capitalization with a lowercase iota adscript or an iota 
subscript, depending on conventions and preferences.

Letʼs take that as proof of how hard it is to implement swapcase with digraph 
support.

I canʼt conclude this reply better than with Asmus Freytagʼs words of Fri, 1st 
Jan 2016 12:09:13 -0800: [1]

> Unicode aims to be expressive enough to model all plain text. That means, it 
> inherits the non-reducible complexity of text. Even the insight that the 
> complexity is non-reducible would be a big step forward.

Regards,

Marcel

[1] Re: 

Re: Swapcase for Titlecase characters

2016-03-19 Thread Doug Ewell

Martin J. Dürst wrote:


> Now the question I have is: What to do for titlecase characters?
> [ ... ]
> For example, 'ǅinsi' (jeans) would become 'ǅINSI' with a), 'ǄINSI' (or
> 'ǆinsi') with b), and 'dŽINSI' with c).


For the Latin letters at least, my 0.02 cents' worth (you read that 
right) is that they are probably so infrequently used that option (b) 
would be just fine.


As one anecdote (which is even less like "data" than two anecdotes), I 
could not find any of the characters Ĳ ĳ Ǆ ǅ ǆ Ǉ ǈ ǉ Ǌ ǋ ǌ or their hex 
equivalents in any of the CLDR keyboard definitions. I'd imagine that 
users just type the two characters separately, and that consequently 
most data in the real world is like that.


--
Doug Ewell | http://ewellic.org | Thornton, CO  



Re: Swapcase for Titlecase characters

2016-03-19 Thread Martin J. Dürst

Thanks everybody for the feedback.

On 2016/03/19 04:33, Marcel Schneider wrote:

> On Fri, Mar 18, 2016, 08:43:56, Martin J. Dürst wrote:

>> b) Convert to upper (or lower), which may simplify implementation.

>> For example, 'ǅinsi' (jeans) would become 'ǅINSI' with a), 'ǄINSI' (or
>> 'ǆinsi') with b), and 'dŽINSI' with c). For another example, 'ᾨδή' would
>> become 'ᾨΔΉ' with a), 'ὨΙΔΉ' (or 'ᾠΔΉ') with b), and 'ὠΙΔΉ' with c).

> Looking at your examples, I would add a case where swapcase is
> typically applied:

> ‘ᾠΔΉ’ (cited [erroneously] as a result of option b), which is to be converted
> to ‘ᾨδή’, and ‘ǆINSI’, which is to become ‘ǅinsi’.

First, what do you mean by "erroneously"?

Second, did I get this right that your additional case (let's call it 
d)) would cycle through the three options where available:

lower -> title -> upper -> lower.

> As for decomposing digraphs and ypogegrammeni to apply swapcase: that
> would probably do no good,
> as itʼs unnecessary and users wonʼt expect it.


Why do you say "users won't expect it"? For those users not aware of the 
encoding internals, I'd indeed guess that's what users would expect, at 
least in the Croatian case. For Greek, it may be different; it depends 
on the extent to which the iota is seen as a letter vs. seen as a mark.


Regards,   Martin.


Re: Swapcase for Titlecase characters

2016-03-19 Thread Mark Davis ☕️
The 'swapcase' just sounds bizarre. What on earth is it for? My inclination
would be to just do the simplest possible implementation that has the
expected results for the 1:1 case pairs, and whatever falls out from the
algorithm for the others.
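
Read literally, the "simplest possible implementation" might look like the Python sketch below (an added illustration, not code from Ruby or Python themselves; `simple_swapcase` is a hypothetical name). Titlecase characters count as neither upper nor lower in Python's single-character `isupper()` and `islower()` tests, so what "falls out" here is that they are left unchanged.

```python
def simple_swapcase(text: str) -> str:
    """Swap case for characters that are clearly upper or lower;
    leave everything else (including titlecase characters) untouched."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append(ch.lower())
        elif ch.islower():
            out.append(ch.upper())
        else:
            out.append(ch)
    return "".join(out)


print(simple_swapcase("Unicode Standard"))  # uNICODE sTANDARD
```

With this approach a word beginning with U+01C5 keeps that character as is while the rest of the word is swapped, which corresponds to option a) in the original question.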


Mark

On Sat, Mar 19, 2016 at 4:11 AM, Asmus Freytag (t) 
wrote:

> On 3/18/2016 12:33 PM, Marcel Schneider wrote:
>
> As for decomposing digraphs and ypogegrammeni to apply swapcase: that 
> would probably do no good, as itʼs unnecessary and users wonʼt expect 
> it.
>
>
> That was my intuition as well, but based on a different line of argument.
> If you add a feature to match behavior somewhere else, it rarely pays to
> make that perform "better", because it just means it's now different and no
> longer matches.
>
> The exception is a feature for which you can establish unambiguously that
> there is a metric of correctness or a widely (universally?) shared
> expectation by users as to the ideal behavior. In that case, being
> compatible with a broken feature (or a random implementation of one) may in
> fact be counterproductive.
>
> The mere fact that you needed to ask here made me think that this would be
> unlikely to be one of those exceptions: because in that case, you would
> easily have been able to tap into a consensus that tells you what "better"
> means. (And the feature would probably have been more widely
> implemented).
>
> This one is pretty bizarre on the face of it, but I like Marcel's
> suggestion as to its putative purpose.
>
> A./
>


Re: Swapcase for Titlecase characters

2016-03-19 Thread Asmus Freytag (t)

  
  
On 3/18/2016 12:33 PM, Marcel Schneider wrote:

> As for decomposing digraphs and ypogegrammeni to apply swapcase: that
> would probably do no good, as itʼs unnecessary and users wonʼt expect it.

That was my intuition as well, but based on a different line of argument.
If you add a feature to match behavior somewhere else, it rarely pays to
make that perform "better", because it just means it's now different and no
longer matches.

The exception is a feature for which you can establish unambiguously that
there is a metric of correctness or a widely (universally?) shared
expectation by users as to the ideal behavior. In that case, being
compatible with a broken feature (or a random implementation of one) may in
fact be counterproductive.

The mere fact that you needed to ask here made me think that this would be
unlikely to be one of those exceptions: because in that case, you would
easily have been able to tap into a consensus that tells you what "better"
means. (And the feature would probably have been more widely
implemented).

This one is pretty bizarre on the face of it, but I like Marcel's
suggestion as to its putative purpose.

A./