Re: Uppercase ß

2018-05-29 Thread Werner LEMBERG via Unicode


> Overlooked in this discussion is the fact that the revised
> orthography of 1996 introduces for the first time a systematic
> difference in pronunciation for the vowel preceding SS vs. ẞ (short
> vs. long).  As users of the old orthography age out, I would not be
> surprised if the SS fallback were to become less acceptable over
> time because it would be at odds with how the word is to be
> pronounced. I'm also confidently expecting the use of ALL CAPS to
> become (somewhat) more prevalent under the continued influence of
> English usage.

It's not that simple.

* `ß' is never used in Switzerland; it's always `ss' (and `SS').  Even
  ambiguous cases like `Masse' are always written like that.  This
  means that for Swiss users `ẞ' is even more alien than for most
  German and Austrian users.  In particular, there doesn't exist a
  `unity SS' in Swiss German at all!  For example, the word `Maße' if
  capitalized to `MASSE' is hyphenated as `MA-SSE' in Germany and
  Austria (since `SS' is treated in this case as a unity).  However,
  the word is hyphenated as `MAS-SE' in Switzerland, since `ss', as a
  replacement for `ß', is *not* treated as a unity.

* There are dialectal differences between northern and southern
  Germany (and Austria).  Example: `Geschoß' vs. `Geschoss', which
  mean exactly the same – and both orthographies are allowed.  For
  such cases, `GESCHOSS' is a much better uppercase version since it
  covers both dialectal forms.

I very much dislike the approach that just for the sake of `simplistic
standardization for uppercase' the use of `ẞ' should be enforced in
German.  It's not the job of a language to fit computer usage.  It's
rather the job of computers to fit language usage.


Werner



Re: Uppercase ß

2018-05-29 Thread Otto Stolz via Unicode

Hello,
am 2018-05-29 um 10:15 Uhr hat Hans Åberg geschrieben:

Duden used one in 1957, but stated in 1984 that there is no uppercase version 
[1].


There used to be two different orthographic dictionaries,
both called “Duden”:
► The Duden from Leipzig (DDR) had a capital “ß” on the cover page
  of its 1957 edition.
► The Duden from Mannheim (FRG) has never featured a capital “ß”, IIRC.


So it would be interesting with a reference to an official version.


Neither Duden has been anything like an “official version” – never ever.
Until 1996, the only official German orthography was somewhat loosely
defined by a common decision of the Ministers of Education of the FRG,
with an additional remark saying: “In case of doubt, the spelling of the
latest edition of the Duden (i. e. the Mannheim version) will take
effect.”

Nowadays, the official version of the orthographic rules
can be found in:
;
the uppercase-ß rule, in particular, is discussed in
,
under §25(E3); the latest version of the rule reads as follows:

E3: Bei Schreibung mit Großbuchstaben schreibt man SS.
Daneben ist auch die Verwendung des Großbuchstabens ẞ 
möglich. 

which means:
  When writing in all-caps, you write SS.
  Alternatively, the capital ẞ may be used.

So, the normal upper-case equivalent of German sharp-S
still is the double-S. The recently introduced capital sharp-S
is an optional alternative, but not the normal way of
uppercasing the sharp-S.
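
For what it is worth, the default case mappings that most software exposes follow the same order of preference. The lines below are only a small sketch using Python's built-in `str` methods (an assumption about the reader's toolchain, not something the rule itself prescribes); the variable names are illustrative.

```python
# Default (locale-independent) case mappings as exposed by Python's str methods.
word = "Ma\u00dfe"                 # "Maße" (dimensions)
upper = word.upper()               # full case mapping turns ß into SS
back = upper.lower()               # the ß is not recovered on the way back

print(upper)                       # MASSE
print(back)                        # masse

# The capital ẞ (U+1E9E) lowercases to ß, but ß does not uppercase to it by default.
capital_lowers_to_small = "\u1e9e".lower() == "\u00df"
small_uppers_to_capital = "\u00df".upper() == "\u1e9e"
print(capital_lowers_to_small, small_uppers_to_capital)
```

The round trip `Maße` to `MASSE` to `masse` is exactly the ambiguity between `Maße' and `Masse' that comes up elsewhere in this thread.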

Best wishes,
   Otto Stolz




Re: Uppercase ß

2018-05-29 Thread Asmus Freytag (c) via Unicode

On 5/29/2018 12:15 PM, Werner LEMBERG wrote:

>> Overlooked in this discussion is the fact that the revised
>> orthography of 1996 introduces for the first time a systematic
>> difference in pronunciation for the vowel preceding SS vs. ẞ (short
>> vs. long).  As users of the old orthography age out, I would not be
>> surprised if the SS fallback were to become less acceptable over
>> time because it would be at odds with how the word is to be
>> pronounced. I'm also confidently expecting the use of ALL CAPS to
>> become (somewhat) more prevalent under the continued influence of
>> English usage.
>
> It's not that simple.
>
> * `ß' is never used in Switzerland; it's always `ss' (and `SS').  Even
>    ambiguous cases like `Masse' are always written like that.  This
>    means that for Swiss users `ẞ' is even more alien than for most
>    German and Austrian users.  In particular, there doesn't exist a
>    `unity SS' in Swiss German at all!  For example, the word `Maße' if
>    capitalized to `MASSE' is hyphenated as `MA-SSE' in Germany and
>    Austria (since `SS' is treated in this case as a unity).  However,
>    the word is hyphenated as `MAS-SE' in Switzerland, since `ss', as a
>    replacement for `ß', is *not* treated as a unity.


So the Swiss don't have that issue. What do they do for names?



> * There are dialectic differences between northern and southern
>    Germany (and Austria).  Example: `Geschoß' vs. `Geschoss', which
>    means exactly the same – and both orthographies are allowed.  For
>    such cases, `GESCHOSS' is a much better uppercase version since it
>    covers both dialectic forms.

I don't see the claimed benefit; if you allow two different spellings
in lowercase to track the phonetic difference, then that would rather
seem to support my argument that there is now a tension in the
orthography (for standard German) that may well resolve itself by
greater use of the distinct uppercase form.

Users who will end up "resolving" this would be those who grew up only
with the revised orthography. Older users are used to a different
principle of selecting between SS and ß and that isn't tied to
pronunciation of preceding vowel.



> I very much dislike the approach that just for the sake of `simplistic
> standardization for uppercase' the use of `ẞ' should be enforced in
> German.  It's not the job of a language to fit computer usage.  It's
> rather the job of computers to fit language usage.

Hmm, don't see anyone calling for that in this discussion.

A./



 Werner





Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-29 Thread Richard Wordingham via Unicode
On Mon, 28 May 2018 16:13:43 -0600
Doug Ewell via Unicode  wrote:

> Richard Wordingham wrote:
> 
> > The effects of virama that spring to mind are:
> >
> > (a) Causing one or both letters on either side to change or combine
> > to indicate combination;
> >
> > (b) Appearing as a mark only if it does not affect one of the
> > letters on either side;
> >
> > (c) Causing a left matra to appear on the left of the sequence of
> > consonants joined by a sequence of non-visible viramas.  
> 
> Most of these don't apply to Tamil, of course.

They all apply to க்ஷே   TAMIL
SYLLABLE KSSEE.  There are four other named syllables where they all
apply.
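
For readers following along, the syllable can be taken apart with Python's standard `unicodedata` module. This is only a sketch: it assumes the syllable shown above is the sequence U+0B95 U+0BCD U+0BB7 U+0BC7, and it uses `str.isalpha()`, which tests General_Category L* and is not the same thing as the Unicode Alphabetic property.

```python
import unicodedata

# Assumption: the syllable under discussion is this four-code-point sequence.
kssee = "\u0b95\u0bcd\u0bb7\u0bc7"

# One row per code point: (code point, name, General_Category, isalpha()).
rows = [
    (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch), ch.isalpha())
    for ch in kssee
]

for row in rows:
    print(*row)

# The string as a whole fails the naive per-character test, because the
# virama and the vowel sign are combining marks (category M*), not letters.
print(kssee.isalpha())
```

The virama sits invisibly between the two consonants here, and the vowel sign is stored last although it is displayed on the left, which is why a per-character test says little about the syllable as a whole.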

Richard



Re: Uppercase ß

2018-05-29 Thread Asmus Freytag (c) via Unicode

On 5/29/2018 2:46 PM, Werner LEMBERG wrote:

>>> I very much dislike the approach that just for the sake of
>>> `simplistic standardization for uppercase' the use of `ẞ' should be
>>> enforced in German.  [...]
>>
>> Hmm, don't see anyone calling for that in this discussion.
>
> Well, I hear an implicit ”Great, there is now an `ẞ' character!  Let's
> use it as the uppercase version of `ß' everywhere so that this nasty
> German peculiarity is finally gone.“


The ALL-CAPS "SS" really has little to recommend it, intrinsically. It 
is de-facto a fall-back; one that competed with "SZ" as used in 
telegrams (while they still were a thing). Not being able to know how to 
hyphenate MASSE without knowing the meaning of the word is also not 
something that I consider a "benefit".


Uppercase forms for `ß' have been kicking around in fonts for a long
time, as was documented around the time that the character was encoded.
It is possible mainly because running text in ALL CAPS is indeed
uncommon (and in the time of Fraktur was effectively not viable because
the Fraktur capitals don't lend themselves to it). If SS had ever
occurred in Title-Case, I doubt it would have survived as long, other
than the "Swiss solution" of making it the only form, also in lower case.


Saving an uppercase form for a non-initial letter was a godsend on 
typewriters -- adding to the factors that made the "SS" solution 
acceptable. But sign writers, type designers and typesetters did not 
find it so universally attractive - also documented exhaustively.


With changing environment (starting with influence from Anglo-Saxon use
of type and not ending with the way the character is treated in relation
to phonetics) I've been expecting to see usage evolving; and not
necessarily driven by software engineers.


A./



Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-29 Thread Richard Wordingham via Unicode
On Tue, 29 May 2018 07:27:21 -0700
Ken Whistler via Unicode  wrote:

> On 5/29/2018 12:49 AM, Richard Wordingham via Unicode wrote:
> > How would one know that they are misapplied?  And what if the
> > author of the text has broken your rules? Are such texts never to
> > be transcribed to pukka Unicode?  
> 
> Applying Tamil -ii (0BC0, Script=Tamil) to the Latin letter a (0061, 
> Script=Latin) doesn't automatically make the Tamil vowel "inherit"
> the Latin script property value, nor should it.

It's the sort of process that gave us U+0310 COMBINING CANDRABINDU.
However, I see adding SE Asian dependent vowels to Latin letter x
(U+0078, Script=Latin) as rather tending to make 'x' Script=Common.
Others have disagreed quite vehemently.  I see the view that the base
character is U+00D7 MULTIPLICATION SIGN (InSC=Consonant_Placeholder) has
prevailed.  Serifed U+00D7 is quite common in manually typewritten
material; I remember it from school.  I'm not sure what script the
sequence  belongs to in OpenType
layout. I ought to find out for the benefit of Tai Tham fonts.

> That said, if someone decides they want that sequence, and their text
> has "broken my rules", so be it. I'm just not going to assume anything
> particular about that text. Note that in terms of trying to determine 
> whether such a string is (naively) alphabetic, such a sequence
> doesn't interfere with the determination. On the other hand, a
> process concerned about text runs, script assignment, validity for
> domains, or other such issues *will* be sensitive to such a boundary
> -- and should not be overruled by some generic determination that
> combining marks inherit all the properties of their base.

When it comes to script runs for rendering, such a rule feels
oppressive; it is widely unenforced.  For example, I have found that
if my font treats U+0E4A THAI CHARACTER MAI TRI as a Tai Tham
character, it will generally render satisfactorily on a Tai Tham
character.  Presumably I can now use a few examples of the same
Northern Thai syllable on the same page in a published language-teaching
book as evidence for adding its clone to the Tai Tham script.  There
should also be some examples of U+0ECA LAO TONE MAI TI on Lao Tai Tham
syllables, but I haven't found any yet.  See the chart at the end of
"Exemple d’écriture ignorée par Unicode : l’écriture tham du Laos"
http://www.laosoftware.com/download/articleTALN.pdf for an implicit
claim of existence.

> > Even without knowing exactly what is wanted, it looks to me as
> > though it isn't.  If he wants to allow  as a
> > substring, which he should, then that fails because there is no
> > overlap between p{extender} and p{gc=Cf} or between p{diacritic}
> > and p{gc=Cf}.  
> 
> Yes, so if you are working with strings for Indic scripts (or for
> that matter, Arabic), you add Join_Control to the mix:
> 
> Alphabetic  ∪ Diacritic ∪ Extender ∪ Join_Control
> 
> gets you a decent approximation of what is (naively) expected to fall 
> within an "alphabetic" string for most scripts.

but won't work for collatable Welsh 'Llan͏gollen'!  (There's a CGJ
between the 'n' and the 'g'.)


One also needs Join_Control for fraktur German and, to my mind,
English 'Ca‍esar'.

> For those following along, Alphabetic is roughly meant to cover the
> ABC, かきくけこ,... plus ideographic elements of most scripts.
> Diacritic picks up most of the applied combining marks, including
> nuktas, viramas, and tone marks. Extender picks up spacing elements
> that indicate length, reduplication, iteration, etc. And joiners are,
> well, joiners.

'Diacritic' mostly includes marks with secondary collation weight;
those with primary weights, such as Indic dependent vowels, are mopped
up in Alphabetic.  (Removing diacritics is very much not the same
as removing combining marks.)

> If one wants finer categorization specifically for Indic scripts,
> then I would suggest turning to the Indic_Syllabic_Category property
> instead of a union of PropList.txt properties and/or some twiddling
> with General_Category values.

You'd still need to add gc=L to catch things like U+0971 DEVANAGARI SIGN
HIGH SPACING DOT (which starts syllables) and U+A8F4 DEVANAGARI SIGN
DOUBLE CANDRABINDU VIRAMA.  And you'd still miss U+0303 COMBINING TILDE
and U+0331 COMBINING MACRON BELOW from Thai script Pattani Malay - I
need to make another attempt to get them appropriate Indic syllabic
category values.

Richard.



Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-29 Thread Doug Ewell via Unicode
Richard Wordingham wrote:

>>> The effects of virama that spring to mind are:
>>>
>>> (a) Causing one or both letters on either side to change or combine
>>> to indicate combination;
>>>
>>> (b) Appearing as a mark only if it does not affect one of the
>>> letters on either side;
>>>
>>> (c) Causing a left matra to appear on the left of the sequence of
>>> consonants joined by a sequence of non-visible viramas.
>>
>> Most of these don't apply to Tamil, of course.
>
> They all apply to க்ஷே  TAMIL
> SYLLABLE KSSEE. There are four other named syllables where they all
> apply.

And several others where they do not. TUS explains that visible
puḷḷi is the general rule in Tamil, and conjunct ligatures are the
exception.

I should have written "These mostly don't apply to Tamil, of course."

In any case, Ken has answered the real underlying question: a process
that checks whether each character in a sequence is "alphabetic" is
inappropriate for determining whether the sequence constitutes a word.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-29 Thread Richard Wordingham via Unicode
On Tue, 29 May 2018 14:03:25 -0700
Doug Ewell via Unicode  wrote:

> In any case, Ken has answered the real underlying question: a process
> that checks whether each character in a sequence is "alphabetic" is
> inappropriate for determining whether the sequence constitutes a word.

Back in the second post of the thread, I made the point that a
conformant Unicode process cannot always give a yes/no answer to the
question of whether all characters in a string are alphabetic.

What we seem to have established is that Unicode properties are not set
up to facilitate the identification of words.  Given that
spell-checkers work, we have taken a wrong turn.  Perhaps we should
reconsider "b⃝e⃝", which consists of two letters each inside its own
enclosing circle.  The spell-checker I'm using considers it a
misspelt word, rather than two symbols side by side.
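
To make the two positions in this thread concrete (combining marks are "transparent", but enclosing marks make symbols), here is a rough sketch using only Python's standard `unicodedata` module. The function name and the exact policy are assumptions for illustration; it works from General_Category alone and deliberately ignores joiners, CGJ and spaces inside words.

```python
import unicodedata

def looks_alphabetic(text: str) -> bool:
    """Letters pass; nonspacing/spacing marks pass only after a letter;
    enclosing marks (gc=Me) and everything else fail."""
    have_base = False
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("L"):
            have_base = True
        elif cat in ("Mn", "Mc") and have_base:
            continue            # mark is treated as inheriting from its base
        else:
            return False        # Me, digits, spaces, punctuation, stray marks
    return have_base

tamil_word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"   # a Tamil word ending in a visible pulli
circled = "b\u20dde\u20dd"                       # two letters, each with U+20DD

print(looks_alphabetic(tamil_word))
print(looks_alphabetic(circled))
```

Under this policy the Tamil word is accepted and the circled letters are rejected, which is the opposite of what the spell-checker mentioned above does with the latter.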

Richard.



Re: Uppercase ß

2018-05-29 Thread Werner LEMBERG via Unicode


>> * `ß' is never used in Switzerland; it's always `ss' (and `SS').
>>   [...]
> 
> So the Swiss don't have that issue. What do they do for names?

Foreign names containing `ß' are treated as-is, AFAIK.  It's similar
to using, say, accents in some foreign names in English.

>>   For such cases, `GESCHOSS' is a much better uppercase version
>>   since it covers both dialectic forms.

... and Swiss people would use the same uppercase version...

> I don't see the claimed benefit; [...]
>
> Users who will end up "resolving" this would be those who grew up
> only with the revised orthography.

Indeed.

>> I very much dislike the approach that just for the sake of
>> `simplistic standardization for uppercase' the use if `ẞ' should be
>> enforced in German.  [...]
>
> Hmm, don't see anyone calling for that in this discussion.

Well, I hear an implicit ”Great, there is now an `ẞ' character!  Let's
use it as the uppercase version of `ß' everywhere so that this nasty
German peculiarity is finally gone.“

Maybe it's only me...


Werner



Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-29 Thread Richard Wordingham via Unicode
On Mon, 28 May 2018 22:02:15 -0700
Ken Whistler via Unicode  wrote:

> On 5/28/2018 9:44 PM, Asmus Freytag via Unicode wrote:
> > One of the general principles is that combining marks inherit the 
> > property of their base character.
> >
> > Normally, "inherited" should be the only property value for
> > combining marks.
> >
> > There have been some deviations from this over the years, for
> > various reasons, and there are some properties (such as general
> > category) where it is necessary to recognize the character as
> > combining, but the general principle still holds.
> >
> > Therefore, if you are trying to see whether a string is alphabetic, 
> > combining marks should be "transparent" to such an algorithm.  
> 
> Generally, good advice. But there are clear exceptions. For example,
> the enclosing combining marks for symbols are intended (basically) to
> make symbols of a sort. And many combining marks have explicit script 
> assignments, so they cannot simply willy-nilly inherit the script of a
> base letter if they are misapplied, for example.

How would one know that they are misapplied?  And what if the author of
the text has broken your rules? Are such texts never to be transcribed
to pukka Unicode?

> This is why I recommend simply adding the Diacritic property into the 
> mix for testing a string. That is a closer approximation to the kind
> of naive "Is this string alphabetic?" question that SunaraRaman was
> asking about -- it picks up the correct subset of combining marks to
> union with the set of actual isAlphabetic characters, to produce more
> expected results. (Including, of course, the correct classification
> of all the viramas, stackers, and killers, as well as picking up all
> the nuktas.).
> 
> Folks, please examine the set of characters for Diacritic and for
> Extender in:
> 
> http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
> 
> to see what I'm talking about. The stuff you are looking for is
> already there.

Even without knowing exactly what is wanted, it looks to me as though
it isn't.  If he wants to allow  as a substring, which
he should, then that fails because there is no overlap between
p{extender} and p{gc=Cf} or between p{diacritic} and p{gc=Cf}.  U+034F
COMBINING GRAPHEME JOINER is also missing, apparently deliberately in
the case of 'diacritic'. If one uses the definition of words in the
word break algorithm, one will end up accepting combinations of letter
plus enclosing circle or keycap.  (A fix to the word break algorithm
for that would be unpleasant.)

One hopes that the requirement doesn't include accepting all single
words.  Every properly spelt word containing U+0E46 THAI CHARACTER
MAIYAMOK will be rejected, as it will contain a space before the
U+0E46.  (I assume there are such words; certainly there are
dictionary entries with no corresponding entries without U+0E46,
such as "ตึ้ก ๆ".) At a lesser level, even English has a very few
words with spaces in them, and there is no solution but to list them.

Richard.



Re: Unicode characters unification

2018-05-29 Thread Richard Wordingham via Unicode
On Mon, 28 May 2018 21:40:49 -0700
Asmus Freytag via Unicode  wrote:

> But such exceptions prove the rule, which leads back to where we 
> started: the default position is that Unicode encodes a character 
> identity that is not the same as encoding the concept that said 
> character is used to represent in writing.

And the problem remains that of determining the 'identity'.  It is
rather like distinguishing species - biologists have dozens of
different concepts.

Richard.


Re: Uppercase ß

2018-05-29 Thread Hans Åberg via Unicode


> On 29 May 2018, at 07:30, Asmus Freytag via Unicode  
> wrote:
> 
> On 5/28/2018 6:30 AM, Hans Åberg via Unicode wrote:
>>> Unifying these would make a real mess of lower casing!
>>> 
>> German has a special sign ß for "ss", without upper capital version.
>> 
>> 
> You may want to retract the second part of that sentence.
> 
> An uppercase exists and it has formally been ruled as acceptable way to write 
> this letter (mostly an issue for ALL CAPS as ß does not occur in word-initial 
> position). 
> A./

Duden used one in 1957, but stated in 1984 that there is no uppercase version
[1]. So a reference to an official version would be interesting.

1. https://en.wikipedia.org/wiki/ß





Re: Uppercase ß

2018-05-29 Thread Werner LEMBERG via Unicode


> The ALL-CAPS "SS" really has little to recommend it, intrinsically.
> It is de-facto a fall-back; one that competed with "SZ" as used in
> telegrams (while they still were a thing).

Well, the status of `ß' is indeed complicated, and the radical
solution used in Switzerland has certainly benefits.

> Not being able to know how to hyphenate MASSE without knowing the
> meaning of the word is also not something that I consider a
> "benefit".

I don't see much difference from the English example of `re-cord'
vs. `rec-ord'.  And Swiss people won't start to use `ẞ' just for
getting the right meaning...

> Uppercase forms for `ß' have been kicking around in fonts for a long
> time as was documented around the time that the character was
> encoded.

Yes, and it was never successful.  The introduction of `ẞ' into
Unicode a few years ago was mainly driven by experts, not something
that had big popularity before.

> With changing environment (starting with influence from Anglo-Saxon
> use of type and not ending with the way the character is treated in
> relation to phonetics) I've been expecting to see usage evolving;
> and not necessarily driven by software engineers.

Yes, let's see how everything will evolve.  Regardless of that,
software should support the status quo as well as possible.


Werner



Re: Uppercase ß

2018-05-29 Thread Werner LEMBERG via Unicode


> When looking for the lowercase ß LATIN SMALL LETTER SHARP S U+00DF
> in a MacOS Character Viewer, it does not give the uppercase version,
> for some reason.

Yes, and it will stay so, AFAIK.  The uppercase variant of `ß' is
`SS'.  `ẞ' is to be used mainly for names that contain `ß', and which
must be printed uppercase, for example in passports.  Here the
distinction is important, cf.

  Strauß vs. Strauss  →  STRAUẞ vs. STRAUSS

Since uppercasing is not common in typesetting German text (in
particular headers), the need to distinguish between words like
`Masse' (mass) and `Maße' (dimensions) when written uppercase rarely
arises, because the meaning can usually be deduced from context.
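
A minimal sketch of how software could offer both behaviours, with `SS' as the default and `ẞ' as an opt-in for names. The helper `upper_de` and its `keep_eszett` flag are hypothetical names made up for this illustration; only Python's built-in `str` methods are used.

```python
def upper_de(text: str, keep_eszett: bool = False) -> str:
    """Uppercase German text.

    keep_eszett=False: default mapping, ß becomes SS.
    keep_eszett=True:  ß becomes ẞ (U+1E9E), e.g. for names on documents.
    """
    if keep_eszett:
        # Swap ß for ẞ first; str.upper() leaves ẞ unchanged.
        text = text.replace("\u00df", "\u1e9e")
    return text.upper()

# Default: both names collapse to the same uppercase form.
print(upper_de("Strau\u00df"), upper_de("Strauss"))

# Opt-in: the two names stay distinct.
print(upper_de("Strau\u00df", keep_eszett=True), upper_de("Strauss", keep_eszett=True))
```

Keeping the flag off by default matches the view that `SS' is the normal uppercase form, while the opt-in covers the passport case.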


Werner



Re: Uppercase ß

2018-05-29 Thread Hans Åberg via Unicode


> On 29 May 2018, at 10:54, Martin J. Dürst  wrote:
> 
> On 2018/05/29 17:15, Hans Åberg via Unicode wrote:
>>> On 29 May 2018, at 07:30, Asmus Freytag via Unicode  
>>> wrote:
> 
>>> An uppercase exists and it has formally been ruled as acceptable way to 
>>> write this letter (mostly an issue for ALL CAPS as ß does not occur in 
>>> word-initial position).
>>> A./
>> Duden used one in 1957, but stated in 1984 that there is no uppercase 
>> version [1]. So it would be interesting with a reference to an official 
>> version.
>> 1. https://en.wikipedia.org/wiki/ß
> 
> The English wikipedia may not be fully up to date.
> See https://de.wikipedia.org/wiki/Großes_ß (second paragraph):
> 
> "Seit dem 29. Juni 2017 ist das ẞ Bestandteil der amtlichen deutschen 
> Rechtschreibung.[2][3]"
> 
> Translated to English: "Since June 29, 2017, the ẞ is part of the official 
> German orthography."
> 
> (As far as I remember the discussion (on this list?) last year, the ẞ 
> (uppercase ß) is allowed, but not required.)

And it is already in Unicode as ẞ LATIN CAPITAL LETTER SHARP S U+1E9E. When
looking for the lowercase ß LATIN SMALL LETTER SHARP S U+00DF in a MacOS
Character Viewer, it does not give the uppercase version, for some reason.

The equivalence with "ss" shows up in ICU regular expressions, which do
case-insensitive matching where the cases have different lengths, so they
should do that for the new character too, I gather.
  http://userguide.icu-project.org/strings/regexp
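
The same length-changing equivalence can be seen outside ICU. The sketch below uses Python's `str.casefold()`, which applies Unicode full case folding; it says nothing about ICU's own API, and the helper name is made up for the example.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    # Full case folding maps both ß (U+00DF) and ẞ (U+1E9E) to "ss".
    return a.casefold() == b.casefold()

print(same_ignoring_case("Stra\u00dfe", "STRASSE"))
print(same_ignoring_case("STRAU\u1e9e", "strauss"))

# One character in, two characters out.
print(len("\u00df"), len("\u00df".casefold()))
```

A side effect worth keeping in mind is that `Strauß' and `Strauss' compare equal under such matching, so it cannot be used where the two names must be told apart.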





Re: Uppercase ß

2018-05-29 Thread Martin J. Dürst via Unicode

On 2018/05/29 17:15, Hans Åberg via Unicode wrote:

>> On 29 May 2018, at 07:30, Asmus Freytag via Unicode  wrote:
>>
>> An uppercase exists and it has formally been ruled as acceptable way to write
>> this letter (mostly an issue for ALL CAPS as ß does not occur in word-initial
>> position).
>> A./
>
> Duden used one in 1957, but stated in 1984 that there is no uppercase version
> [1]. So it would be interesting with a reference to an official version.
>
> 1. https://en.wikipedia.org/wiki/ß


The English wikipedia may not be fully up to date.
See https://de.wikipedia.org/wiki/Großes_ß (second paragraph):

"Seit dem 29. Juni 2017 ist das ẞ Bestandteil der amtlichen deutschen 
Rechtschreibung.[2][3]"


Translated to English: "Since June 29, 2017, the ẞ is part of the 
official German orthography."


(As far as I remember the discussion (on this list?) last year, the ẞ 
(uppercase ß) is allowed, but not required.)


Regards,   Martin.



Re: Uppercase ß

2018-05-29 Thread Hans Åberg via Unicode


> On 29 May 2018, at 11:17, Werner LEMBERG  wrote:
> 
>> When looking for the lowercase ß LATIN SMALL LETTER SHARP S U+00DF
>> in a MacOS Character Viewer, it does not give the uppercase version,
>> for some reason.
> 
> Yes, and it will stay so, AFAIK.  The uppercase variant of `ß' is
> `SS'.  `ẞ' is to be used mainly for names that contain `ß', and which
> must be printed uppercase, for example in passports.  Here the
> distinction is important, cf.
> 
>  Strauß vs. Strauss  →  STRAUẞ vs. STRAUSS
> 
> Since uppercasing is not common in typesetting German text (in
> particular headers), the need to distinguish between words like
> `Masse' (mass) and `Maße' (dimensions) when written uppercase rarely
> arises, because the meaning can usually be deduced from context.

If uppercasing is not common, one would think that setting it to ẞ would pose
no problems, now that it is available.




Re: Uppercase ß

2018-05-29 Thread Hans Åberg via Unicode


> On 29 May 2018, at 12:55, Arthur Reutenauer  wrote:
> 
>> If uppercasing is not common, one would think that setting it to ẞ would
>> pose no problems, now that it is available.
> 
>  It would, for reasons of stability.

The main point is what users of ẞ and ß would think, and Unicode should adjust
accordingly.




Re: Uppercase ß

2018-05-29 Thread Hans Åberg via Unicode


> On 29 May 2018, at 14:47, Arthur Reutenauer  wrote:
> 
>> The main point is what users of ẞ and ß would think, and Unicode to adjust 
>> accordingly.
> 
>  Since users of ß would think that in the vast majority of cases, it
> ought to be uppercased to SS, I think you’re missing the main point.

No, you missed the point.





Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-29 Thread Ken Whistler via Unicode




On 5/29/2018 12:49 AM, Richard Wordingham via Unicode wrote:

> How would one know that they are misapplied?  And what if the author of
> the text has broken your rules? Are such texts never to be transcribed
> to pukka Unicode?


Applying Tamil -ii (0BC0, Script=Tamil) to the Latin letter a (0061, 
Script=Latin) doesn't automatically make the Tamil vowel "inherit" the 
Latin script property value, nor should it.


That said, if someone decides they want that sequence, and their text has
"broken my rules", so be it. I'm just not going to assume anything
particular about that text. Note that in terms of trying to determine 
whether such a string is (naively) alphabetic, such a sequence doesn't 
interfere with the determination. On the other hand, a process concerned 
about text runs, script assignment, validity for domains, or other such 
issues *will* be sensitive to such a boundary -- and should not be 
overruled by some generic determination that combining marks inherit all 
the properties of their base.





> Even without knowing exactly what is wanted, it looks to me as though
> it isn't.  If he wants to allow  as a substring, which
> he should, then that fails because there is no overlap between
> p{extender} and p{gc=Cf} or between p{diacritic} and p{gc=Cf}.


Yes, so if you are working with strings for Indic scripts (or for that 
matter, Arabic), you add Join_Control to the mix:


Alphabetic  ∪ Diacritic ∪ Extender ∪ Join_Control

gets you a decent approximation of what is (naively) expected to fall 
within an "alphabetic" string for most scripts.


For those following along, Alphabetic is roughly meant to cover the ABC, 
かきくけこ,... plus ideographic elements of most scripts. Diacritic picks up 
most of the applied combining marks, including nuktas, viramas, and tone 
marks. Extender picks up spacing elements that indicate length, 
reduplication, iteration, etc. And joiners are, well, joiners.
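
A small sketch of the union above, for anyone who wants to try it. Python's standard library does not expose Diacritic, Extender or Join_Control, so the code parses lines in the `code ; Property` layout used by the UCD files. The sample lines are hand-written for illustration (an assumption, not a copy of PropList.txt or DerivedCoreProperties.txt), and the function names are made up.

```python
def parse_ucd_lines(lines):
    """Parse 'XXXX ; Prop' or 'XXXX..YYYY ; Prop' lines into {prop: set of code points}."""
    props = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        span, prop = (field.strip() for field in line.split(";")[:2])
        first, _, last = span.partition("..")
        lo, hi = int(first, 16), int(last or first, 16)
        props.setdefault(prop, set()).update(range(lo, hi + 1))
    return props

# Hand-written sample in the UCD layout (illustrative only).
SAMPLE = """
0BA4        ; Alphabetic   # TAMIL LETTER TA
0BAE        ; Alphabetic   # TAMIL LETTER MA
0BB4        ; Alphabetic   # TAMIL LETTER LLLA
0BBF        ; Alphabetic   # TAMIL VOWEL SIGN I
0BCD        ; Diacritic    # TAMIL SIGN VIRAMA
200C..200D  ; Join_Control # ZWNJ..ZWJ
""".splitlines()

WANTED = ("Alphabetic", "Diacritic", "Extender", "Join_Control")

def naively_alphabetic(text, props):
    # Union of the four properties; a property missing from props counts as empty.
    allowed = set().union(*(props.get(name, set()) for name in WANTED))
    return bool(text) and all(ord(ch) in allowed for ch in text)

props = parse_ucd_lines(SAMPLE)
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"        # Tamil word ending in a pulli
print(naively_alphabetic(word, props))
print(naively_alphabetic(word, {"Alphabetic": props["Alphabetic"]}))
```

With the real data files in place of the sample, the same two functions would give the approximation described above; with Alphabetic alone, the final pulli makes the word fail.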


If one wants finer categorization specifically for Indic scripts, then I 
would suggest turning to the Indic_Syllabic_Category property instead of 
a union of PropList.txt properties and/or some twiddling with 
General_Category values.


--Ken





Re: Uppercase ß

2018-05-29 Thread Arthur Reutenauer via Unicode
> If uppercasing is not common, one would think that setting it to ẞ would
> pose no problems, now that it is available.

  It would, for reasons of stability.

Arthur


Re: Uppercase ß

2018-05-29 Thread Arthur Reutenauer via Unicode
> The main point is what users of ẞ and ß would think, and Unicode to adjust 
> accordingly.

  Since users of ß would think that in the vast majority of cases, it
ought to be uppercased to SS, I think you’re missing the main point.

Arthur


Re: Uppercase ß

2018-05-29 Thread Asmus Freytag via Unicode

  
  
On 5/29/2018 5:57 AM, Hans Åberg via Unicode wrote:

>> On 29 May 2018, at 14:47, Arthur Reutenauer  wrote:
>>
>>> The main point is what users of ẞ and ß would think, and Unicode to
>>> adjust accordingly.
>>
>> Since users of ß would think that in the vast majority of cases, it
>> ought to be uppercased to SS, I think you’re missing the main point.
>
> No, you missed the point.

Overlooked in this discussion is the fact that the revised
orthography of 1996 introduces for the first time a systematic
difference in pronunciation for the vowel preceding SS vs. ẞ (short
vs. long).  As users of the old orthography age out, I would not be
surprised if the SS fallback were to become less acceptable over
time because it would be at odds with how the word is to be
pronounced. I'm also confidently expecting the use of ALL CAPS to
become (somewhat) more prevalent under the continued influence of
English usage.

Languages and writing systems evolve.

A./



  



Re: Unicode characters unification

2018-05-29 Thread Asmus Freytag (c) via Unicode

On 5/29/2018 1:08 AM, Richard Wordingham wrote:

> On Mon, 28 May 2018 21:40:49 -0700
> Asmus Freytag via Unicode  wrote:
>
>> But such exceptions prove the rule, which leads back to where we
>> started: the default position is that Unicode encodes a character
>> identity that is not the same as encoding the concept that said
>> character is used to represent in writing.
>
> And the problem remains that of determining the 'identity'.  It is
> rather like distinguishing species - biologists have dozens of
> different concepts.
>
> Richard.


Totally. Never said that encoding is a simple algorithmic process. :)

A./