Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols

2018-05-28 Thread Asmus Freytag via Unicode

  
  
On 5/28/2018 6:30 AM, Hans Åberg via Unicode wrote:


  
Unifying these would make a real mess of lower casing!

  
  German has a special sign ß for "ss", without upper capital version.



You may want to retract the second part of
that sentence.
An uppercase form exists, and it has formally been
ruled an acceptable way to write this letter (mostly an issue
for ALL CAPS, as ß does not occur in word-initial
position).

A./
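
To make the point above concrete, here is a minimal Python sketch using only the standard library. It shows that U+1E9E LATIN CAPITAL LETTER SHARP S exists and lowercases to ß, while the default full case mapping still turns ß into "SS". The helper upper_keep_sharp_s is a hypothetical name introduced only for this illustration; it is not part of any library.

```python
import unicodedata

SHARP_S = "\u00df"          # ß LATIN SMALL LETTER SHARP S
CAPITAL_SHARP_S = "\u1e9e"  # ẞ LATIN CAPITAL LETTER SHARP S

def upper_keep_sharp_s(text):
    """Hypothetical helper: uppercase text, mapping ß to ẞ instead of "SS"."""
    return text.replace(SHARP_S, CAPITAL_SHARP_S).upper()

word = "stra\u00dfe"  # "straße"

# The character name confirms that an uppercase form is encoded.
print(unicodedata.name(CAPITAL_SHARP_S))

# Default behaviour: ß expands to SS when uppercased.
print(word.upper())

# Opt-in behaviour for ALL CAPS text that should keep the letter.
print(upper_keep_sharp_s(word))
```

Which of the two uppercase forms is appropriate depends on the audience, so a real application would probably make this a user-level choice rather than a default.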

  



Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread Ken Whistler via Unicode




On 5/28/2018 9:44 PM, Asmus Freytag via Unicode wrote:
One of the general principles is that combining marks inherit the 
property of their base character.


Normally, "inherited" should be the only property value for combining 
marks.


There have been some deviations from this over the years, for various 
reasons, and there are some properties (such as general category) 
where it is necessary to recognize the character as combining, but the 
general principle still holds.


Therefore, if you are trying to see whether a string is alphabetic, 
combining marks should be "transparent" to such an algorithm.


Generally, good advice. But there are clear exceptions. For example, the 
enclosing combining marks for symbols are intended (basically) to make 
symbols of a sort. And many combining marks have explicit script 
assignments, so they cannot simply willy-nilly inherit the script of a 
base letter if they are misapplied, for example.


This is why I recommend simply adding the Diacritic property into the 
mix for testing a string. That is a closer approximation to the kind of 
naive "Is this string alphabetic?" question that SundaraRaman was asking 
about -- it picks up the correct subset of combining marks to union with 
the set of actual isAlphabetic characters, to produce more expected 
results. (Including, of course, the correct classification of all the 
viramas, stackers, and killers, as well as picking up all the nuktas.)


Folks, please examine the set of characters for Diacritic and for 
Extender in:


http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

to see what I'm talking about. The stuff you are looking for is already 
there.


--Ken

P.S. And please don't start an argument about the fact that a "virama" 
isn't really a "diacritic". We know that, too. ;-)
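
For readers who want to look at those sets programmatically, the following is a rough Python sketch of reading data in the PropList.txt layout (code point or range, semicolon, property name, then a "#" comment). The SAMPLE lines are hand-copied for illustration and may not match the current file exactly; real code would read the full file from the URL above. The function name parse_proplist is an assumption made for this example.

```python
# Hand-copied lines in PropList.txt layout, for illustration only.
SAMPLE = """\
00B7          ; Extender # Po       MIDDLE DOT
064B..0652    ; Diacritic # Mn   [8] ARABIC FATHATAN..ARABIC SUKUN
093C          ; Diacritic # Mn       DEVANAGARI SIGN NUKTA
094D          ; Diacritic # Mn       DEVANAGARI SIGN VIRAMA
0BCD          ; Diacritic # Mn       TAMIL SIGN VIRAMA
"""

def parse_proplist(text):
    """Return {property name: set of code points} from PropList-style text."""
    props = {}
    for raw in text.splitlines():
        data = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank or comment-only line
        cp_range, prop = (field.strip() for field in data.split(";"))
        first, _, last = cp_range.partition("..")
        lo, hi = int(first, 16), int(last or first, 16)
        props.setdefault(prop, set()).update(range(lo, hi + 1))
    return props

props = parse_proplist(SAMPLE)
print(sorted(props))            # property names found in the sample
print(0x0BCD in props["Diacritic"])
```

The resulting sets can then be unioned with whatever "alphabetic" test a language already provides.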





Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread Ken Whistler via Unicode



On 5/28/2018 9:23 PM, Martin J. Dürst via Unicode wrote:

Hello Sundar,

On 2018/05/28 04:27, SundaraRaman R via Unicode wrote:

Hi,

In languages like Ruby or Java
(https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)),
functions to check if a character is alphabetic do that by looking for
the 'Alphabetic'  property (defined true if it's in one of the L
categories, or Nl, or has 'Other_Alphabetic' property). When parsing
Tamil text, this works out well for independent vowels and consonants
(which are in Lo), and for most dependent signs (which are in Mc or Mn
but have the 'Other_Alphabetic' property), but the very common pulli (VIRAMA)
is neither in Lo nor has 'Other_Alphabetic', and so leads to
concluding any string containing it to be non-alphabetic.

This doesn't make sense to me since the Virama  “◌்” is as much of an
alphabetic character as any of the "Dependent Vowel" characters which
have been given the 'Other_Alphabetic' property. Is there a rationale
behind this difference, or is it an oversight to be corrected?


I suggest submitting an error report via 
https://www.unicode.org/reporting.html. I haven't studied the issue in 
detail (sorry, just no time this week), but it sounds reasonable to 
give the VIRAMA the 'Other_Alphabetic' property.


Please don't. This is not an error in the Unicode property assignments, 
which have been stable in scope for Alphabetic for some time now.


The problem is in assuming that the Java or Ruby isAlphabetic() API, 
which simply reports the Unicode property value Alphabetic for a 
character, suffices for identifying a string as somehow "wordlike". It 
doesn't.


The approximation you are looking for is to add Diacritic to Alphabetic. 
That will automatically pull in all the nuktas and viramas/killers for 
Brahmi-derived scripts. It also will pull in the harakat for Arabic and 
similar abjads, which are also not Alphabetic in the property values. 
And it will pull in tone marks for various writing systems.


For good measure, also add Extender, which will pick up length marks and 
iteration marks.
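

A minimal Python sketch of that approximation follows. Python's standard library does not expose the Alphabetic, Diacritic, or Extender properties, so the three sets below are small hand-picked samples (an assumption for illustration, not the full property sets), and is_alphabetic follows the simplified description quoted in this thread (an L category, Nl, or Other_Alphabetic). The function names are hypothetical.

```python
import unicodedata

# Hand-picked samples only; real code would load the full sets from PropList.txt.
OTHER_ALPHABETIC = {0x0BBE, 0x0BBF, 0x0BC0, 0x0BC1, 0x0BC2}  # some Tamil vowel signs
DIACRITIC = {0x093C, 0x094D, 0x0BCD}  # Devanagari nukta and virama, Tamil pulli
EXTENDER = {0x00B7}                   # middle dot

def is_alphabetic(ch):
    """Simplified Alphabetic test: L* or Nl category, or Other_Alphabetic."""
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nl" or ord(ch) in OTHER_ALPHABETIC

def is_wordlike(text):
    """Alphabetic plus Diacritic plus Extender, checked per code point."""
    extra = DIACRITIC | EXTENDER
    return bool(text) and all(is_alphabetic(c) or ord(c) in extra for c in text)

tamil = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # the word "Tamil", ending in pulli

print(all(is_alphabetic(c) for c in tamil))  # Alphabetic alone rejects the pulli
print(is_wordlike(tamil))                    # adding Diacritic accepts it
```

This stays a per-character approximation; it says nothing about where words begin or end.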


Please do not assume that the Alphabetic property just automatically 
equates to "what I would write in a word". Or that it should be adjusted 
to somehow make that happen. It would be highly advisable to study *all* 
the UCD properties in more depth, before starting to report bugs in one 
or another simply because using a single property doesn't produce the 
string classification one assumes should be correct in a particular case.


Of course, to get a better approximation of what actually constitutes a 
"word" in a particular writing system, instead of using raw property 
APIs, one should be using a WordBreak iterator, preferably one tailored 
for the language in question.


--Ken




I'd recommend to mention examples other than Tamil in your report 
(assuming they exist).


BTW, what's the method you are using in Ruby? If there's a problem in 
Ruby (which I don't think; it's just using Unicode data), then please 
make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I 
should be able to follow up on that.


Regards,   Martin.





Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread Asmus Freytag via Unicode
One of the general principles is that combining marks inherit the 
property of their base character.


Normally, "inherited" should be the only property value for combining marks.

There have been some deviations from this over the years, for various 
reasons, and there are some properties (such as general category) where 
it is necessary to recognize the character as combining, but the general 
principle still holds.


Therefore, if you are trying to see whether a string is alphabetic, 
combining marks should be "transparent" to such an algorithm.


A./


On 5/28/2018 9:23 PM, Martin J. Dürst via Unicode wrote:

Hello Sundar,

On 2018/05/28 04:27, SundaraRaman R via Unicode wrote:

Hi,

In languages like Ruby or Java
(https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)),
functions to check if a character is alphabetic do that by looking for
the 'Alphabetic'  property (defined true if it's in one of the L
categories, or Nl, or has 'Other_Alphabetic' property). When parsing
Tamil text, this works out well for independent vowels and consonants
(which are in Lo), and for most dependent signs (which are in Mc or Mn
but have the 'Other_Alphabetic' property), but the very common pulli (VIRAMA)
is neither in Lo nor has 'Other_Alphabetic', and so leads to
concluding any string containing it to be non-alphabetic.

This doesn't make sense to me since the Virama  “◌்” is as much of an
alphabetic character as any of the "Dependent Vowel" characters which
have been given the 'Other_Alphabetic' property. Is there a rationale
behind this difference, or is it an oversight to be corrected?


I suggest submitting an error report via 
https://www.unicode.org/reporting.html. I haven't studied the issue in 
detail (sorry, just no time this week), but it sounds reasonable to 
give the VIRAMA the 'Other_Alphabetic' property.


I'd recommend to mention examples other than Tamil in your report 
(assuming they exist).


BTW, what's the method you are using in Ruby? If there's a problem in 
Ruby (which I don't think; it's just using Unicode data), then please 
make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I 
should be able to follow up on that.


Regards,   Martin.





Re: Unicode characters unification

2018-05-28 Thread Asmus Freytag via Unicode
In the discussion leading up to this it has been implied that Unicode 
encodes / should encode concepts or pure shape. And there's been some 
confusion as to where concerns about sorting or legacy encodings fit in. 
Time to step back a bit:


Primarily the Unicode Standard encodes by character identity - something 
that is different from either the pure shape or the "concept denoted by 
the character".


For example, for most alphabetic characters, you could say that they 
stand for a more-or-less well-defined phonetic value. But Unicode does 
not encode such values directly, instead it encodes letters - which in 
turn get re-purposed for different sound values in each writing system.


Likewise, the various uses of period or comma are not separately encoded 
- potentially these marks are given mappings to specific functions for 
each writing system or notation using them.


Clearly these are not encoded to represent a single mapping to an 
external concept, and, as we will see, they are not necessarily encoded 
directly by shape.


Instead, the Unicode Standard encodes character identity; but there are 
a number of principled and some ad-hoc deviations from a purist 
implementation of that approach.


The first one is that of forcing a disunification by script. What 
constitutes a script can be argued over, especially as they all seem to 
have evolved from (or been created based on) predecessor scripts, so 
there are always pairs of scripts that have a lot in common. While an 
"Alpha" and an "A" do have much in common, it is best to recognize that 
their membership in different scripts leads to important differences so 
that it's not a stretch to say that they no longer share the same identity.


The next principled deviation is that of requiring case pairs to be 
unique. Bicameral scripts (and some of the characters in them) 
acquired their lowercase at different times, so the relations 
between the upper cases and the lower cases differ across 
scripts and give rise to some exceptional cases inside certain scripts.


This is one of the reasons to disunify certain bicameral scripts. But 
even inside scripts, there are case pairs that may share lowercase forms 
or may share uppercase forms, but said forms are disunified to make the 
pairs separate. The first two principles match users' expectations in 
that case changes (largely) work as expected in plain text and that 
sorting also (largely) matches user expectation by default.


The third principle is to disunify characters based on line-breaking or 
line-layout properties. Implicit in that is the idea that plain text, 
and not markup, is the place to influence basic algorithms such as 
line-breaking and bidi layout (hence two sets of Arabic-Indic digits). 
One can argue with that decision, but the fact is, there are too many 
places where text exists without the ability to apply markup to go 
entirely without that support.


The fourth principle is that of differential variability of appearance. 
For letters proper, their identity can be associated with a wide range 
of appearances from sparse to fanciful glyphs. If an entire piece of 
text (or even a given word) is set using a particular font style, 
context will enable the reader to identify the underlying letter, even 
if the shape is almost unrelated to the "archetypical shape" documented 
in the Standard.


When letters or marks get re-used in notational systems, though, the 
permissible range of variability changes dramatically - variations that 
do not change the meaning of a word in styled text, suddenly change the 
meaning of text in a certain notational system. Hence the disunification 
of certain letters or marks (but not all of them) in support of 
mathematical notation.


The fifth principle appears to be to disunify only as far as and only 
when necessary. The biggest downside of this principle is that it leads 
to "late" disunifications; some characters get disunified as the 
committee becomes aware of some issue, leading to the problem of legacy 
data. But it has usefully somewhat limited the further proliferation of 
characters of identical appearance.


The final principle is compatibility. This covers being able to 
round-trip from certain legacy encodings. This principle may force some 
disunifications that otherwise might not have happened, but it also 
isn't a panacea: there are legacy encodings that are mutually 
incompatible, so that one needs to make a choice which one to support. 
TeX, being a "glyph based" system, loses out here in comparison to legacy 
plain-text character encoding systems such as the 8859 series of ISO/IEC 
standards.


Some unifications among punctuation marks in particular seem to have been 
made on a more ad-hoc basis. This issue is exacerbated by the fact that 
many such systems lack both the wide familiarity of standard writing 
systems (with their tolerance for glyph variation) and the rigor of 
something like mathematical notation. This 

Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread Martin J. Dürst via Unicode

Hello Sundar,

On 2018/05/28 04:27, SundaraRaman R via Unicode wrote:

Hi,

In languages like Ruby or Java
(https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)),
functions to check if a character is alphabetic do that by looking for
the 'Alphabetic'  property (defined true if it's in one of the L
categories, or Nl, or has 'Other_Alphabetic' property). When parsing
Tamil text, this works out well for independent vowels and consonants
(which are in Lo), and for most dependent signs (which are in Mc or Mn
but have the 'Other_Alphabetic' property), but the very common pulli (VIRAMA)
is neither in Lo nor has 'Other_Alphabetic', and so leads to
concluding any string containing it to be non-alphabetic.

This doesn't make sense to me since the Virama  “◌்” is as much of an
alphabetic character as any of the "Dependent Vowel" characters which
have been given the 'Other_Alphabetic' property. Is there a rationale
behind this difference, or is it an oversight to be corrected?


I suggest submitting an error report via 
https://www.unicode.org/reporting.html. I haven't studied the issue in 
detail (sorry, just no time this week), but it sounds reasonable to give 
the VIRAMA the 'Other_Alphabetic' property.


I'd recommend to mention examples other than Tamil in your report 
(assuming they exist).


BTW, what's the method you are using in Ruby? If there's a problem in 
Ruby (which I don't think; it's just using Unicode data), then please 
make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I 
should be able to follow up on that.


Regards,   Martin.


Re: Unicode characters unification

2018-05-28 Thread Richard Wordingham via Unicode
On Mon, 28 May 2018 21:14:58 +0200
Hans Åberg via Unicode  wrote:

> > On 28 May 2018, at 21:01, Richard Wordingham via Unicode
> >  wrote:
> > 
> > On Mon, 28 May 2018 20:19:09 +0200
> > Hans Åberg via Unicode  wrote:
> >   
> >> Indistinguishable math styles Latin and Greek uppercase letters
> >> have been added, even though that was not so in for example TeX,
> >> and thus no encoding legacy to consider.  
> > 
> > They sort differently - one can have vaguely alphabetical indexes of
> > mathematical symbols.  They also have quite different compatibility
> > decompositions.
> > 
> > Does sorting offer an argument for encoding these symbols
> > differently? I'm not sure it's a strong argument - how likely is
> > one to have a list where the difference matters?  
> 
> The main point is that they are not likely to be distinguishable when
> used side-by-side in the same formula. They could be of significance
> if using Greek names instead of letters, of length greater than one,
> then. But it is not wrong to add them, because it is easier than
> having to think through potential uses.

By these symbols, I meant the quarter-tone symbols.  Capital em and
capital mu, as symbols, need to be encoded separately for proper
sorting.

Richard. 



Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread Doug Ewell via Unicode

SundaraRaman R wrote:


but the very common pulli (VIRAMA)
is neither in Lo nor has 'Other_Alphabetic', and so leads to
concluding any string containing it to be non-alphabetic.


Is this definition part of Unicode? I thought the use of General 
Category to answer questions like "this sequence is a word" or "this 
string is alphabetic" was much more complex than that. (I'm not even 
sure what the latter means, for any script with any sort of combining 
mark.)


Richard Wordingham wrote:


The effects of virama that spring to mind are:

(a) Causing one or both letters on either side to change or combine to
indicate combination;

(b) Appearing as a mark only if it does not affect one of the letters
on either side;

(c) Causing a left matra to appear on the left of the sequence of
consonants joined by a sequence of non-visible viramas.


Most of these don't apply to Tamil, of course.

--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Unicode characters unification

2018-05-28 Thread Hans Åberg via Unicode


> On 28 May 2018, at 21:38, Richard Wordingham 
>  wrote:
> 
> On Mon, 28 May 2018 21:14:58 +0200
> Hans Åberg via Unicode  wrote:
> 
>>> On 28 May 2018, at 21:01, Richard Wordingham via Unicode
>>>  wrote:
>>> 
>>> On Mon, 28 May 2018 20:19:09 +0200
>>> Hans Åberg via Unicode  wrote:
>>> 
>>>> Indistinguishable math styles Latin and Greek uppercase letters
>>>> have been added, even though that was not so in for example TeX,
>>>> and thus no encoding legacy to consider.  
>>> 
>>> They sort differently - one can have vaguely alphabetical indexes of
>>> mathematical symbols.  They also have quite different compatibility
>>> decompositions.
>>> 
>>> Does sorting offer an argument for encoding these symbols
>>> differently? I'm not sure it's a strong argument - how likely is
>>> one to have a list where the difference matters?  
>> 
>> The main point is that they are not likely to be distinguishable when
>> used side-by-side in the same formula. They could be of significance
>> if using Greek names instead of letters, of length greater than one,
>> then. But it is not wrong to add them, because it is easier than
>> having to think through potential uses.
> 
> By these symbols, I meant the quarter-tone symbols.  Capital em and
> capital mu, as symbols, need to be encoded separately for proper
> sorting.

Some of the math style letters are out of order for legacy reasons, so sorting 
may not work well.

SMuFL have different fonts for text and music engraving, but I can't think of 
any use of sorting them.





Re: Unicode characters unification

2018-05-28 Thread Hans Åberg via Unicode


> On 28 May 2018, at 21:01, Richard Wordingham via Unicode 
>  wrote:
> 
> On Mon, 28 May 2018 20:19:09 +0200
> Hans Åberg via Unicode  wrote:
> 
>> Indistinguishable math styles Latin and Greek uppercase letters have
>> been added, even though that was not so in for example TeX, and thus
>> no encoding legacy to consider.
> 
> They sort differently - one can have vaguely alphabetical indexes of
> mathematical symbols.  They also have quite different compatibility
> decompositions.
> 
> Does sorting offer an argument for encoding these symbols differently?
> I'm not sure it's a strong argument - how likely is one to have a list
> where the difference matters?

The main point is that they are not likely to be distinguishable when used 
side-by-side in the same formula. They could be of significance if using Greek 
names instead of letters, of length greater than one, then. But it is not wrong 
to add them, because it is easier than having to think through potential uses.





Re: Unicode characters unification

2018-05-28 Thread Richard Wordingham via Unicode
On Mon, 28 May 2018 20:19:09 +0200
Hans Åberg via Unicode  wrote:


> Indistinguishable math styles Latin and Greek uppercase letters have
> been added, even though that was not so in for example TeX, and thus
> no encoding legacy to consider.

They sort differently - one can have vaguely alphabetical indexes of
mathematical symbols.  They also have quite different compatibility
decompositions.

Does sorting offer an argument for encoding these symbols differently?
I'm not sure it's a strong argument - how likely is one to have a list
where the difference matters?
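
The remark about compatibility decompositions can be checked with Python's standard unicodedata module. This is only a sketch: the two characters are looked up by name, and the choice of the italic math style is an arbitrary example, not something taken from the discussion.

```python
import unicodedata

# Two math-style capitals that look alike in most fonts.
math_m = unicodedata.lookup("MATHEMATICAL ITALIC CAPITAL M")
math_mu = unicodedata.lookup("MATHEMATICAL ITALIC CAPITAL MU")

for ch in (math_m, math_mu):
    folded = unicodedata.normalize("NFKC", ch)
    print(
        f"U+{ord(ch):04X}",
        unicodedata.decomposition(ch),  # compatibility mapping, e.g. "<font> ..."
        "->",
        unicodedata.name(folded),
    )

# Distinct code points also give distinct positions in a code point sort.
print(sorted([math_mu, math_m]) == [math_m, math_mu])
```

Compatibility normalization therefore sends the two symbols to letters of different scripts, even though the glyphs are indistinguishable.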

Richard.

 




Re: Unicode characters unification

2018-05-28 Thread Hans Åberg via Unicode


> On 28 May 2018, at 19:18, Richard Wordingham via Unicode 
>  wrote:
> 
> On Mon, 28 May 2018 17:54:47 +0200
> Hans Åberg via Unicode  wrote:
> 
>>> On 28 May 2018, at 17:00, Richard Wordingham via Unicode
>>>  wrote:
>>> 
>>> On Mon, 28 May 2018 15:30:55 +0200
>>> Hans Åberg via Unicode  wrote:
> 
>>>> German has a special sign ß for "ss", without upper capital
>>>> version.  
>>> 
>>> That doesn't prevent upper-casing - you just have to know your
>>> audience.
>> 
>> That would be the same if the Greek and Latin uppercase letters would
>> have been unified: One would need to know the context.
> 
> I've seen a commutation diagram with both U+004D LATIN CAPITAL LETTER
> M and U+039C GREEK CAPITAL LETTER MU on it.  I only knew the difference
> because I listened to what the lecturer said.

Indistinguishable math styles Latin and Greek uppercase letters have been 
added, even though that was not so in for example TeX, and thus no encoding 
legacy to consider.





Re: Unicode characters unification

2018-05-28 Thread Richard Wordingham via Unicode
On Mon, 28 May 2018 17:54:47 +0200
Hans Åberg via Unicode  wrote:

> > On 28 May 2018, at 17:00, Richard Wordingham via Unicode
> >  wrote:
> > 
> > On Mon, 28 May 2018 15:30:55 +0200
> > Hans Åberg via Unicode  wrote:

> >> German has a special sign ß for "ss", without upper capital
> >> version.  
> > 
> > That doesn't prevent upper-casing - you just have to know your
> > audience.
> 
> That would be the same if the Greek and Latin uppercase letters would
> have been unified: One would need to know the context.

I've seen a commutation diagram with both U+004D LATIN CAPITAL LETTER
M and U+039C GREEK CAPITAL LETTER MU on it.  I only knew the difference
because I listened to what the lecturer said.

> > For the
> > same reason, there are two utter confusables in THE Latin SCRIPT for
> > 00D0 LATIN CAPITAL LETTER ETH.  

> The stuff is likely added for computer legacy, if there were separate
> encodings for those.

Unlikely.  U+00F0 LATIN SMALL LETTER ETH and U+0256 LATIN SMALL LETTER
D WITH TAIL contrast in the IPA.  The difference between U+0111 LATIN
SMALL LETTER D WITH STROKE and U+00F0 LATIN SMALL LETTER ETH may have
been debated.

Richard.



Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread Richard Wordingham via Unicode
On Mon, 28 May 2018 20:03:11 +0530
SundaraRaman R via Unicode  wrote:

> Hi, thanks for your reply.
> 
> > There is only one character with a canonical combining class of 9
> > that is included as other_alphabetic, namely U+0E3A THAI CHARACTER
> > PHINTHU. That last had any of the other properties of viramas back
> > in Unicode 1.0; the characters that triggered such behaviours were
> > permanently removed in Unicode 1.1.  
> 
> I didn't understand the second sentence here, could you clarify?

Sorry, I messed that system up.  It should have read, "The last time
that that had any of the other properties of viramas back
in Unicode 1.0;"

> What
> do you mean by "any of the other properties" here?

The effects of virama that spring to mind are:

(a) Causing one or both letters on either side to change or combine to
indicate combination;

(b) Appearing as a mark only if it does not affect one of the letters
on either side;

(c) Causing a left matra to appear on the left of the sequence of
consonants joined by a sequence of non-visible viramas.

> And "triggered such
> behaviours" seems to imply having them in other_alphabetic had
> negative consequences, could you give an example of what that might
> be?

Nowadays, the Thai syllable ไตร, normatively pronounced /trai/, is
only encoded , and the character
U+0E3A is always visible when used; for most routine purposes it is
little different to U+0E38 THAI CHARACTER SARA U.  However, in Unicode
1.0, while  was rendered as at present, the same
visible string could also be encoded as   - no glyph would be
rendered for U+0E3A. If one wanted the official Sanskritised Pali
version, one could type ไตฺร  as at
present.  One could also encode it as .

Weirdly, I couldn't have used the phonetically ordered vowel to type a
monk's name ending in มฺโม , as  would have been rendered as
โมฺม.

As the non-phonetic virama-like behaviours of U+0E3A are only mentioned
under the heading 'Alternate Ordering', I can only presume that they
were triggered by the phonetic order vowel signs, U+0E70 to U+0E74.

It is possible that U+0E3A acquired the alphabetic property because it
ceased to behave like a virama.  Alternatively, it may have acquired
the alphabetic property because of its use in the compound vowels of
minority languages.

> But in the case of Tamil, I'm curious why most other combining Tamil
> marks go in class 0, whereas pulli goes in 9. Even U+0B82 Anusvara, a
> character barely used in Tamil text, has combining class 0 and is
> included in Other_Alphabetic, but the visually similar and  similarly
> positioned pulli is not. In this particular case, is it a historical
> accident that these got assigned this way, or is there a rationale
> behind these? Would it at all be possible to get this changed in the
> upcoming Unicode standard?

Tamil has usually been treated as just another Indian Indic script.
U+0E3A is the only virama-like character with the property of being
'alphabetic'.

I can't see a change making it into Unicode 11.0.  It requires too much
careful thought.  Besides, anything that considered  as
alphabetic should also consider  as alphabetic - they
should be mostly interchangeable in Tamil.

> > I fear that the correct test for what you want is to split text into
> > words and check that each word begins with an alphabetic
> > character.  
> 
> Do you mean "each grapheme cluster begins with an alphabetic
> character" here? It seems to me (in my very limited Unicode knowledge)
> that such a test, going through grapheme clusters and checking the
> first codepoint in each, would also ensure the text is fully
> alphabetic.

Not directly.  Is the string "mark2mark" alphabetic?  It constitutes a
single word.  My suggested simplification would say 'no', as it
contains '2'; perhaps my simplification is wrong.

> And it has the advantage that more languages have a
> (relatively) easy way for splitting text into grapheme clusters, than
> for checking minor Unicode properties like WordBreak, so this one
> might be easier to implement. Is this test anywhere in the ballpark
> of being right?

Yes, it's close to being right.  Note that simple approximations for SE
Asian word-breaking (e.g. treating SE Asian characters as
alphabetic) should work well for your application.
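
A very rough Python sketch of the cluster-based test discussed above. It does not implement real grapheme cluster segmentation: as a stated simplification, any code point whose general category starts with "M" is treated as continuing the previous cluster, and every other code point is treated as the start of a new cluster. The function name is hypothetical.

```python
import unicodedata

def is_alphabetic_by_cluster(text):
    """Check that every approximate cluster starts with a letter (L* category)."""
    if not text:
        return False
    if unicodedata.category(text[0]).startswith("M"):
        return False  # a mark with no base character to attach to
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("M"):
            continue  # combining mark: part of the current cluster
        if not cat.startswith("L"):
            return False  # cluster starts with a digit, symbol, space, etc.
    return True

tamil = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # ends in TAMIL SIGN VIRAMA (pulli)
print(is_alphabetic_by_cluster(tamil))
print(is_alphabetic_by_cluster("mark2mark"))
```

Under this simplification "mark2mark" is rejected because of the digit, and format characters such as ZWJ and ZWNJ would be rejected too, so a production version would need the real segmentation rules.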

Richard.



Re: Unicode characters unification

2018-05-28 Thread Hans Åberg via Unicode

> On 28 May 2018, at 17:00, Richard Wordingham via Unicode 
>  wrote:
> 
> On Mon, 28 May 2018 15:30:55 +0200
> Hans Åberg via Unicode  wrote:
> 
>>> On 28 May 2018, at 15:10, Richard Wordingham via Unicode
>>>  wrote:
>>> 
>>> On Mon, 28 May 2018 10:08:30 +0200
>>> Hans Åberg via Unicode  wrote:
>>> 
>>>> It is not about precision, but concepts. Like B, Β, and В, which
>>>> could have been unified, but are not.  
>>> 
>>> Unifying these would make a real mess of lower casing!  
>> 
>> German has a special sign ß for "ss", without upper capital version.
> 
> That doesn't prevent upper-casing - you just have to know your
> audience.  

That would be the same if the Greek and Latin uppercase letters would have been 
unified: One would need to know the context.

> The three letters like 'B' have very different lower case
> forms, and very few would agree that they were the same letter.  

They were the same in the Uncial script, but evolved to be viewed as different. 
That is common with math symbols: something available evolving into separate 
symbols.

> For the
> same reason, there are two utter confusables in THE Latin SCRIPT for
> 00D0 LATIN CAPITAL LETTER ETH.

The stuff is likely added for computer legacy, if there were separate encodings 
for those.

> More notably though, one just has to run
> the risk of getting a culturally incorrect upper case when rendering
> U+014A LATIN CAPITAL LETTER ENG; whether the three alternatives are the
> same letter is debatable.

Unified CJK Ideographs differ by stroke order.





Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols

2018-05-28 Thread Richard Wordingham via Unicode
On Mon, 28 May 2018 15:30:55 +0200
Hans Åberg via Unicode  wrote:

> > On 28 May 2018, at 15:10, Richard Wordingham via Unicode
> >  wrote:
> > 
> > On Mon, 28 May 2018 10:08:30 +0200
> > Hans Åberg via Unicode  wrote:
> >   
> >> It is not about precision, but concepts. Like B, Β, and В, which
> >> could have been unified, but are not.  
> > 
> > Unifying these would make a real mess of lower casing!  
> 
> German has a special sign ß for "ss", without upper capital version.

That doesn't prevent upper-casing - you just have to know your
audience.  The three letters like 'B' have very different lower case
forms, and very few would agree that they were the same letter.  For the
same reason, there are two utter confusables in THE Latin SCRIPT for
00D0 LATIN CAPITAL LETTER ETH. More notably though, one just has to run
the risk of getting a culturally incorrect upper case when rendering
U+014A LATIN CAPITAL LETTER ENG; whether the three alternatives are the
same letter is debatable.
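
The lower-casing point is easy to see in a short Python sketch that uses only built-in case mapping and the standard unicodedata module. The dictionary name LOOKALIKES is just a label chosen for this example.

```python
import unicodedata

# Three capitals that share a shape but belong to different scripts.
LOOKALIKES = {
    "Latin": "\u0042",     # B
    "Greek": "\u0392",     # Β
    "Cyrillic": "\u0412",  # В
}

for script, upper in LOOKALIKES.items():
    lower = upper.lower()
    print(script, unicodedata.name(upper), "->", unicodedata.name(lower))

lowercase_forms = {upper.lower() for upper in LOOKALIKES.values()}
print(len(lowercase_forms))  # number of distinct lowercase letters
```

Had the three capitals been unified, a plain-text lowercase operation could not know which of the three results to produce without extra context.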

Richard.



Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread SundaraRaman R via Unicode
Hi, thanks for your reply.

> There is only one character with a canonical combining class of 9 that
> is included as other_alphabetic, namely U+0E3A THAI CHARACTER PHINTHU.
> That last had any of the other properties of viramas back in Unicode
> 1.0; the characters that triggered such behaviours were permanently
> removed in Unicode 1.1.

I didn't understand the second sentence here, could you clarify? What
do you mean by "any of the other properties" here? And "triggered such
behaviours" seems to imply having them in other_alphabetic had
negative consequences, could you give an example of what that might
be?

> There are some notable absences from the combining marks included.
> Significant absences include ZWJ, ZWNJ and CGJ.
>
> However, a non-erroneous *conformant* Unicode process cannot
> always determine whether a string, given only that it is a string, is
> composed only of alphabetic characters.  The answer would be 'yes' for
>  but 'no' for the canonically
> equivalent !
> (U+0327 is not included as alphabetic either.)
>
> There is at least one combination of Latin letter and combining mark
> that occurs in the normal orthography of a natural language and does not
> have a precomposed equivalent.

Ah, that's somewhat unfortunate that such a quick and easy alphabetic
check is not possible in the general case, but I can understand how it
might be weird to give the Alphabetic property to a ZWJ or ZWNJ.
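
To illustrate the canonical-equivalence problem, here is a small Python sketch. The pair chosen (U+00E7 versus c followed by U+0327) is an assumption made for the example, picked because U+0327 is mentioned above; letters_only is a hypothetical helper that uses the general category as a stand-in for an alphabetic test.

```python
import unicodedata

def letters_only(text):
    """Naive test: every code point has a letter (L*) general category."""
    return all(unicodedata.category(ch).startswith("L") for ch in text)

precomposed = "\u00e7"  # LATIN SMALL LETTER C WITH CEDILLA
decomposed = "c\u0327"  # c + COMBINING CEDILLA, canonically equivalent

print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(letters_only(precomposed))  # one letter
print(letters_only(decomposed))   # the combining mark is not a letter
print(letters_only(unicodedata.normalize("NFC", decomposed)))
```

Normalizing to NFC first hides the difference here, but only because a precomposed form exists; for combinations without one, the mark still has to be handled explicitly.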

But in the case of Tamil, I'm curious why most other combining Tamil
marks go in class 0, whereas pulli goes in 9. Even U+0B82 Anusvara, a
character barely used in Tamil text, has combining class 0 and is
included in Other_Alphabetic, but the visually similar and  similarly
positioned pulli is not. In this particular case, is it a historical
accident that these got assigned this way, or is there a rationale
behind these? Would it at all be possible to get this changed in the
upcoming Unicode standard?

(By the way, I'm happy to get a link to read through for any of my
questions here. I just find it quite hard to search for and find past
discussions and decision rationales regarding these, not knowing how
and where to search for them.)


> I fear that the correct test for what you want is to split text into
> words and check that each word begins with an alphabetic character.

Do you mean "each grapheme cluster begins with an alphabetic
character" here? It seems to me (in my very limited Unicode knowledge)
that such a test, going through grapheme clusters and checking the
first codepoint in each, would also ensure the text is full
alphabetic. And it has the advantage that more languages have a
(relatively) easy way for splitting text into grapheme clusters, than
for checking minor Unicode properties like WordBreak, so this one
might be easier to implement. Does this test anywhere in the ballpark
of being right?
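
The grapheme-cluster test proposed above could be sketched roughly as
follows. This is only an illustration under stated assumptions: it
relies on the \X construct (extended grapheme cluster) that
java.util.regex supports from Java 9 on, and the names ClusterCheck and
clustersStartAlphabetic are invented for the example. Whether this test
is actually the right one is exactly the open question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ClusterCheck {
    // \X matches one extended grapheme cluster (Java 9 or later).
    private static final Pattern CLUSTER = Pattern.compile("\\X");

    // True if the string is non-empty and the first code point of every
    // grapheme cluster has the Alphabetic property.
    static boolean clustersStartAlphabetic(String s) {
        if (s.isEmpty()) {
            return false;
        }
        Matcher m = CLUSTER.matcher(s);
        while (m.find()) {
            int first = m.group().codePointAt(0);
            if (!Character.isAlphabetic(first)) {
                return false;
            }
        }
        return true;
    }
}
```

Note that a space or a digit forms a cluster of its own, so a string
containing either is rejected by this sketch.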

Regards,
Sundar


Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols

2018-05-28 Thread Hans Åberg via Unicode

> On 28 May 2018, at 15:10, Richard Wordingham via Unicode wrote:
> 
> On Mon, 28 May 2018 10:08:30 +0200
> Hans Åberg via Unicode  wrote:
> 
>> It is not about precision, but concepts. Like B, Β, and В, which
>> could have been unified, but are not.
> 
> Unifying these would make a real mess of lower casing!

German has a special sign ß for "ss", without upper capital version.

> What is the context in which the Arab use would benefit from having a
> different encoding?

Maybe if they decide to change the glyph, then what already is encoded would 
get the right appearance. But SMuFL might have had other reasons: the glyphs 
should probably be designed together. And it is simple, as one does not need to 
investigate their uses too much. For example, the Turkish AEU sharps are 
microtonal, not the ordinary ones. So if the Turkish accidentals have their own 
code points, one can change that later.




Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols

2018-05-28 Thread Richard Wordingham via Unicode
On Mon, 28 May 2018 10:08:30 +0200
Hans Åberg via Unicode  wrote:

> > On 28 May 2018, at 03:39, Garth Wallace  wrote:
> > The fact that they do not denote the same width in cents in Arabic
> > music as they do in Western modern classical does not matter. That
> > sort of precision is not inherent to the written symbols.  
> 
> It is not about precision, but concepts. Like B, Β, and В, which
> could have been unified, but are not.

Unifying these would make a real mess of lower casing!

What is the context in which the Arab use would benefit from having a
different encoding?

Richard.



Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread Richard Wordingham via Unicode
On Mon, 28 May 2018 00:57:03 +0530
SundaraRaman R via Unicode  wrote:

> Hi,
> 
> In languages like Ruby or Java
> (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)),
> functions to check if a character is alphabetic do that by looking for
> the 'Alphabetic'  property (defined true if it's in one of the L
> categories, or Nl, or has 'Other_Alphabetic' property). When parsing
> Tamil text, this works out well for independent vowels and consonants
> (which are in Lo), and for most dependent signs (which are in Mc or Mn
> but have the 'Other_Alphabetic' property), but the very common pulli
> (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to
> concluding any string containing it to be non-alphabetic.
> 
> This doesn't make sense to me since the Virama  “◌்” as much of an
> alphabetic character as any of the "Dependent Vowel" characters which
> have been given the 'Other_Alphabetic' property. Is there a rationale
> behind this difference, or is it an oversight to be corrected?
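
The behaviour described in the quoted message can be reproduced with a
minimal Java sketch. It uses only java.lang.Character from the JDK; the
class name PulliDemo and the helper allAlphabetic are invented for this
illustration, and the naive per-code-point check shown is an assumption
about how such a function is typically used, not anyone's actual code.

```java
final class PulliDemo {
    // Naive check: every code point must have the Alphabetic property.
    static boolean allAlphabetic(String s) {
        return !s.isEmpty() && s.codePoints().allMatch(Character::isAlphabetic);
    }

    // One line of diagnostics for a code point: its name, its general
    // category (as Java's numeric type) and its Alphabetic status.
    static String describe(int cp) {
        return String.format("U+%04X %s type=%d alphabetic=%b",
                cp, Character.getName(cp), Character.getType(cp),
                Character.isAlphabetic(cp));
    }
}
```

Calling describe on U+0B95 (KA), U+0BBE (vowel sign AA), U+0B82
(anusvara) and U+0BCD (pulli) shows the pulli as the only one of the
four that is not Alphabetic, which is why allAlphabetic rejects a word
such as "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD" (Tamil) while accepting the
same word without its final pulli.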

There is only one character with a canonical combining class of 9 that
is included as other_alphabetic, namely U+0E3A THAI CHARACTER PHINTHU.
That last had any of the other properties of viramas back in Unicode
1.0; the characters that triggered such behaviours were permanently
removed in Unicode 1.1.

There are some notable absences from the combining marks included.
Significant absences include ZWJ, ZWNJ and CGJ.

However, a non-erroneous *conformant* Unicode process cannot
always determine whether a string, given only that it is a string, is
composed only of alphabetic characters.  The answer would be 'yes' for
 but 'no' for the canonically
equivalent !
(U+0327 is not included as alphabetic either.)

There is at least one combination of Latin letter and combining mark
that occurs in the normal orthography of a natural language and does not
have a precomposed equivalent.

I fear that the correct test for what you want is to split text into
words and check that each word begins with an alphabetic character.
That test can be made by a conformant process.  I think, but have not
checked, that the test can be simplified to:

(a) Check that the first character is alphabetic.

(b) Ignore every character with a WordBreak property of Extend or ZWJ

(c) Check that all other characters are alphabetic.
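
Steps (a) to (c) might be sketched in Java as below. The JDK does not
expose the WordBreak property directly, so step (b) is approximated
here, as a loudly stated assumption, by "any combining mark (general
category Mn, Mc or Me), U+200C ZWNJ or U+200D ZWJ"; that is close to,
but not identical with, WordBreak=Extend or ZWJ. The names
AlphabeticCheck, isIgnorable and looksAlphabetic are invented for the
example.

```java
final class AlphabeticCheck {
    // ASSUMPTION: stands in for "WordBreak is Extend or ZWJ", which the
    // JDK does not expose; combining marks plus ZWNJ/ZWJ approximate it.
    static boolean isIgnorable(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK
                || cp == 0x200C   // ZERO WIDTH NON-JOINER
                || cp == 0x200D;  // ZERO WIDTH JOINER
    }

    static boolean looksAlphabetic(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isAlphabetic(first)) {
            return false;                       // step (a)
        }
        int i = Character.charCount(first);
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (isIgnorable(cp)) {
                continue;                       // step (b)
            }
            if (!Character.isAlphabetic(cp)) {
                return false;                   // step (c)
            }
        }
        return true;
    }
}
```

With this sketch a Tamil word ending in pulli is accepted, while a
string that starts with a bare pulli or contains a space or digit is
rejected.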

Richard.



Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols

2018-05-28 Thread Hans Åberg via Unicode

> On 28 May 2018, at 11:05, Julian Bradfield via Unicode wrote:
> 
> On 2018-05-28, Hans Åberg via Unicode  wrote:
>>> On 28 May 2018, at 03:39, Garth Wallace  wrote:
>>>> On Sun, May 27, 2018 at 3:36 PM, Hans Åberg  wrote:
>>>> The flats and sharps of Arabic music are semantically the same as in 
>>>> Western music, departing from Pythagorean tuning, then, but the microtonal 
>>>> accidentals are different: they simply reused some that were available.
> ...
>>> The fact that they do not denote the same width in cents in Arabic music as 
>>> they do in Western modern classical does not matter. That sort of precision 
>>> is not inherent to the written symbols.
>> 
>> It is not about precision, but concepts. Like B, Β, and В, which could have 
>> been unified, but are not.
> 
> Latin, Greek, Cyrillic etc. could not have been unified, because of the
> requirement to have round-trip compatibility with previous encodings.

Indeed, in Unicode because of that, which I pointed out.

> It is also, of course, convenient for many reasons to have the notion
> of "script" hard-coded into unicode code-points, instead of in
> higher-level mark-up where it arguably belongs - just as, when
> copyright finally expires, it will be convenient to have Tolkien's
> runes disunified from historical runes (which is the line taken by the
> proposal waiting for that day). Whether it is so convenient to have a
> "music script" notion hard-coded is presumably what this argument is
> about. It's not obvious to me that musical notation is something that
> carries the "script" baggage in the same way that writing systems do.

Indeed, that is what I also pointed out. So I suggested contacting the SMuFL 
people, who might explain the underlying reasoning, and then making a 
decision about what might be suitable for Unicode. They probably keep them 
separate for the same reason as for scripts: originally different font 
encodings. But those are not official, and in addition it is for music 
engraving, not for writing in text files.





Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols

2018-05-28 Thread Julian Bradfield via Unicode
On 2018-05-28, Hans Åberg via Unicode  wrote:
>> On 28 May 2018, at 03:39, Garth Wallace  wrote:
>>> On Sun, May 27, 2018 at 3:36 PM, Hans Åberg  wrote:
>>> The flats and sharps of Arabic music are semantically the same as in 
>>> Western music, departing from Pythagorean tuning, then, but the microtonal 
>>> accidentals are different: they simply reused some that were available.
...
>> The fact that they do not denote the same width in cents in Arabic music as 
>> they do in Western modern classical does not matter. That sort of precision 
>> is not inherent to the written symbols.
>
> It is not about precision, but concepts. Like B, Β, and В, which could have 
> been unified, but are not.

Latin, Greek, Cyrillic etc. could not have been unified, because of the
requirement to have round-trip compatibility with previous encodings.

It is also, of course, convenient for many reasons to have the notion
of "script" hard-coded into unicode code-points, instead of in
higher-level mark-up where it arguably belongs - just as, when
copyright finally expires, it will be convenient to have Tolkien's
runes disunified from historical runes (which is the line taken by the
proposal waiting for that day). Whether it is so convenient to have a
"music script" notion hard-coded is presumably what this argument is
about. It's not obvious to me that musical notation is something that
carries the "script" baggage in the same way that writing systems do.

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols

2018-05-28 Thread Hans Åberg via Unicode

> On 28 May 2018, at 03:39, Garth Wallace  wrote:
> 
>> On Sun, May 27, 2018 at 3:36 PM, Hans Åberg  wrote:
>> The flats and sharps of Arabic music are semantically the same as in Western 
>> music, departing from Pythagorean tuning, then, but the microtonal 
>> accidentals are different: they simply reused some that were available.
>> 
> But they aren't different! They are the same symbols. They are, as you 
> yourself say, reused.

Historically, yes, but not necessarily now.

> The fact that they do not denote the same width in cents in Arabic music as 
> they do in Western modern classical does not matter. That sort of precision 
> is not inherent to the written symbols.

It is not about precision, but concepts. Like B, Β, and В, which could have 
been unified, but are not.

> By contrast, Persian music notation invented new microtonal accidentals, 
> called the koron and sori, and my impression is that their average value, as 
> measured by Hormoz Farhat in his thesis, is also usable in Arabic music. For 
> comparison, I have posted the Arabic maqam in Helmholtz-Ellis notation [1] 
> using this value; note that one actually needs two extra microtonal 
> accidentals—Arabic microtonal notation is in fact not complete.
> 
> The E24 exact quarter-tones are suitable for making a piano sound badly out 
> of tune. Compare that with the accordion in [2], Farid El Atrache - 
> "Noura-Noura".
> 
> 1. https://lists.gnu.org/archive/html/lilypond-user/2016-02/msg00607.html
> 2. https://www.youtube.com/watch?v=fvp6fo7tfpk
> 
> I don't really see how this is relevant. Nobody is claiming that the koron 
> and sori accidentals are the same symbols as the Arabic half-sharp and flat 
> with crossbar. They look entirely different. 

Arabic music simply happens to use Western style accidentals for concepts 
similar to Persian music rather than Western music.