Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols
On 5/28/2018 6:30 AM, Hans Åberg via Unicode wrote: Unifying these would make a real mess of lower casing! German has a special sign ß for "ss", without upper capital version. You may want to retract the second part of that sentence. An uppercase exists and it has formally been ruled as an acceptable way to write this letter (mostly an issue for ALL CAPS as ß does not occur in word-initial position). A./
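A small Python sketch of the casing point above. It assumes Python 3's built-in str.upper and str.lower, which follow the Unicode case mappings. The capital form is assumed here to be U+1E9E LATIN CAPITAL LETTER SHARP S; the message itself does not name a code point.

```python
import unicodedata

SMALL_SHARP_S = "\u00DF"    # ß
CAPITAL_SHARP_S = "\u1E9E"  # assumed to be the uppercase form referred to above


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# The default full uppercase mapping of ß is the two-letter sequence "SS" ...
default_upper = SMALL_SHARP_S.upper()
# ... while the capital form lowercases back to ß.
round_trip = CAPITAL_SHARP_S.lower()

print(describe(SMALL_SHARP_S), "->", default_upper)
print(describe(CAPITAL_SHARP_S), "->", round_trip)
```

So an uppercase letter exists, but default case conversion does not produce it; software that wants the capital form has to opt in.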
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
On 5/28/2018 9:44 PM, Asmus Freytag via Unicode wrote: One of the general principles is that combining marks inherit the property of their base character. Normally, "inherited" should be the only property value for combining marks. There have been some deviations from this over the years, for various reasons, and there are some properties (such as general category) where it is necessary to recognize the character as combining, but the general principle still holds. Therefore, if you are trying to see whether a string is alphabetic, combining marks should be "transparent" to such an algorithm. Generally, good advice. But there are clear exceptions. For example, the enclosing combining marks for symbols are intended (basically) to make symbols of a sort. And many combining marks have explicit script assignments, so they cannot simply willy-nilly inherit the script of a base letter if they are misapplied, for example. This is why I recommend simply adding the Diacritic property into the mix for testing a string. That is a closer approximation to the kind of naive "Is this string alphabetic?" question that SundaraRaman was asking about -- it picks up the correct subset of combining marks to union with the set of actual isAlphabetic characters, to produce more expected results. (Including, of course, the correct classification of all the viramas, stackers, and killers, as well as picking up all the nuktas.) Folks, please examine the set of characters for Diacritic and for Extender in: http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt to see what I'm talking about. The stuff you are looking for is already there. --Ken P.S. And please don't start an argument about the fact that a "virama" isn't really a "diacritic". We know that, too. ;-)
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
On 5/28/2018 9:23 PM, Martin J. Dürst via Unicode wrote: Hello Sundar, On 2018/05/28 04:27, SundaraRaman R via Unicode wrote: Hi, In languages like Ruby or Java (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)), functions to check if a character is alphabetic do that by looking for the 'Alphabetic' property (defined true if it's in one of the L categories, or Nl, or has 'Other_Alphabetic' property). When parsing Tamil text, this works out well for independent vowels and consonants (which are in Lo), and for most dependent signs (which are in Mc or Mn but have the 'Other_Alphabetic' property), but the very common pulli (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to concluding any string containing it to be non-alphabetic. This doesn't make sense to me since the Virama “◌்” as much of an alphabetic character as any of the "Dependent Vowel" characters which have been given the 'Other_Alphabetic' property. Is there a rationale behind this difference, or is it an oversight to be corrected? I suggest submitting an error report via https://www.unicode.org/reporting.html. I haven't studied the issue in detail (sorry, just no time this week), but it sounds reasonable to give the VIRAMA the 'Other_Alphabetic' property. Please don't. This is not an error in the Unicode property assignments, which have been stable in scope for Alphabetic for some time now. The problem is in assuming that the Java or Ruby isAlphabetic() API, which simply reports the Unicode property value Alphabetic for a character, suffices for identifying a string as somehow "wordlike". It doesn't. The approximation you are looking for is to add Diacritic to Alphabetic. That will automatically pull in all the nuktas and viramas/killers for Brahmi-derived scripts. It also will pull in the harakat for Arabic and similar abjads, which are also not Alphabetic in the property values. And it will pull in tone marks for various writing systems. 
For good measure, also add Extender, which will pick up length marks and iteration marks. Please do not assume that the Alphabetic property just automatically equates to "what I would write in a word". Or that it should be adjusted to somehow make that happen. It would be highly advisable to study *all* the UCD properties in more depth, before starting to report bugs in one or another simply because using a single property doesn't produce the string classification one assumes should be correct in a particular case. Of course, to get a better approximation of what actually constitutes a "word" in a particular writing system, instead of using raw property API's, one should be using a WordBreak iterator, preferably one tailored for the language in question. --Ken I'd recommend to mention examples other than Tamil in your report (assuming they exist). BTW, what's the method you are using in Ruby? If there's a problem in Ruby (which I don't think; it's just using Unicode data), then please make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I should be able to follow up on that. Regards, Martin.
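The "Alphabetic plus Diacritic plus Extender" approximation described above could be sketched in Python roughly as follows. This is only an illustration under several assumptions: Python's unicodedata module does not expose these properties, so the sketch parses text in the PropList.txt layout; the embedded SAMPLE_PROPLIST is an abbreviated stand-in for the real file linked earlier in the thread; and parse_proplist, is_alphabetic and is_wordlike are hypothetical helper names. Alphabetic is approximated the way the original question describes it (L categories, Nl, or Other_Alphabetic).

```python
import unicodedata

# Abbreviated sample in the PropList.txt layout ("range ; Property # comment").
# Assumption: these few entries stand in for the full file.
SAMPLE_PROPLIST = """\
0BCD          ; Diacritic # TAMIL SIGN VIRAMA
02D0..02D1    ; Extender # MODIFIER LETTER TRIANGULAR COLON..HALF TRIANGULAR COLON
3005          ; Extender # IDEOGRAPHIC ITERATION MARK
"""


def parse_proplist(text):
    """Map each property name to the set of code points listed for it."""
    props = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        rng, name = (part.strip() for part in line.split(";", 1))
        first, _, last = rng.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        props.setdefault(name, set()).update(range(lo, hi + 1))
    return props


def is_alphabetic(ch, props):
    """Rough stand-in for Alphabetic: L* or Nl or Other_Alphabetic."""
    cat = unicodedata.category(ch)
    return (cat.startswith("L") or cat == "Nl"
            or ord(ch) in props.get("Other_Alphabetic", set()))


def is_wordlike(s, props):
    """Alphabetic, widened by Diacritic and Extender as suggested above."""
    extra = props.get("Diacritic", set()) | props.get("Extender", set())
    return bool(s) and all(is_alphabetic(ch, props) or ord(ch) in extra
                           for ch in s)


PROPS = parse_proplist(SAMPLE_PROPLIST)
KAL = "\u0B95\u0BB2\u0BCD"  # Tamil KA, LA, pulli

print(all(is_alphabetic(ch, PROPS) for ch in KAL))  # pulli fails this check
print(is_wordlike(KAL, PROPS))                      # pulli accepted via Diacritic
```

With the full PropList.txt loaded in place of the sample, the same two sets would also cover the nuktas, harakat and tone marks mentioned above. As the message says, a word-break iterator remains the better tool for finding actual words.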
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
One of the general principles is that combining marks inherit the property of their base character. Normally, "inherited" should be the only property value for combining marks. There have been some deviations from this over the years, for various reasons, and there are some properties (such as general category) where it is necessary to recognize the character as combining, but the general principle still holds. Therefore, if you are trying to see whether a string is alphabetic, combining marks should be "transparent" to such an algorithm. A./ On 5/28/2018 9:23 PM, Martin J. Dürst via Unicode wrote: Hello Sundar, On 2018/05/28 04:27, SundaraRaman R via Unicode wrote: Hi, In languages like Ruby or Java (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)), functions to check if a character is alphabetic do that by looking for the 'Alphabetic' property (defined true if it's in one of the L categories, or Nl, or has 'Other_Alphabetic' property). When parsing Tamil text, this works out well for independent vowels and consonants (which are in Lo), and for most dependent signs (which are in Mc or Mn but have the 'Other_Alphabetic' property), but the very common pulli (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to concluding any string containing it to be non-alphabetic. This doesn't make sense to me since the Virama “◌்” as much of an alphabetic character as any of the "Dependent Vowel" characters which have been given the 'Other_Alphabetic' property. Is there a rationale behind this difference, or is it an oversight to be corrected? I suggest submitting an error report via https://www.unicode.org/reporting.html. I haven't studied the issue in detail (sorry, just no time this week), but it sounds reasonable to give the VIRAMA the 'Other_Alphabetic' property. I'd recommend to mention examples other than Tamil in your report (assuming they exist). BTW, what's the method you are using in Ruby? 
If there's a problem in Ruby (which I don't think; it's just using Unicode data), then please make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I should be able to follow up on that. Regards, Martin.
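The idea of making combining marks "transparent" to an alphabetic test might look like the following Python sketch. It is an illustration only: it uses General_Category from the standard unicodedata module (L* and Nl as the letter test, M* as the marks to skip) rather than the actual Alphabetic property, and is_alphabetic_ignoring_marks is a hypothetical helper name.

```python
import unicodedata


def is_alphabetic_ignoring_marks(s: str) -> bool:
    """True if every non-mark character is a letter (L*) or letter number (Nl).

    Combining marks (general category M*) are skipped, so they inherit
    the verdict of the base characters around them.
    """
    saw_letter = False
    for ch in s:
        cat = unicodedata.category(ch)
        if cat.startswith("M"):
            continue  # transparent: marks neither pass nor fail the test
        if not (cat.startswith("L") or cat == "Nl"):
            return False
        saw_letter = True
    return saw_letter


TAMIL = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"  # the word "Tamil", ending in a pulli

print(unicodedata.category("\u0BCD"), unicodedata.combining("\u0BCD"))
print(TAMIL.isalpha())                      # naive per-character test
print(is_alphabetic_ignoring_marks(TAMIL))  # marks treated as transparent
```

Skipping every mark is a blunt rule. Another reply in this thread notes exceptions such as enclosing marks and marks with explicit script assignments, so a production check would want a narrower set of marks.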
Re: Unicode characters unification
In the discussion leading up to this it has been implied that Unicode encodes / should encode concepts or pure shape. And there's been some confusion as to where concerns about sorting or legacy encodings fit in. Time to step back a bit: Primarily the Unicode Standard encodes by character identity - something that is different from either the pure shape or the "concept denoted by the character". For example, for most alphabetic characters, you could say that they stand for a more-or-less well-defined phonetic value. But Unicode does not encode such values directly; instead it encodes letters - which in turn get re-purposed for different sound values in each writing system. Likewise, the various uses of period or comma are not separately encoded - potentially these marks are given mappings to specific functions for each writing system or notation using them. Clearly these are not encoded to represent a single mapping to an external concept, and, as we will see, they are not necessarily encoded directly by shape. Instead, the Unicode Standard encodes character identity; but there are a number of principled and some ad-hoc deviations from a purist implementation of that approach. The first one is that of forcing a disunification by script. What constitutes a script can be argued over, especially as they all seem to have evolved from (or been created based on) predecessor scripts, so there are always pairs of scripts that have a lot in common. While an "Alpha" and an "A" do have much in common, it is best to recognize that their membership in different scripts leads to important differences so that it's not a stretch to say that they no longer share the same identity. The next principled deviation is that of requiring case pairs to be unique. 
Bicameral scripts (and some of the characters in them) acquired their lowercase at different times, so that the relation between the upper cases and the lower cases is different across scripts, and gives rise to some exceptional cases inside certain scripts. This is one of the reasons to disunify certain bicameral scripts. But even inside scripts, there are case pairs that may share lowercase forms or may share uppercase forms, but said forms are disunified to make the pairs separate. The first two principles match users' expectations in that case changes (largely) work as expected in plain text and that sorting also (largely) matches user expectation by default. The third principle is to disunify characters based on line-breaking or line-layout properties. Implicit in that is the idea that plain text, and not markup, is the place to influence basic algorithms such as line-breaking and bidi layout (hence two sets of Arabic-Indic digits). One can argue with that decision, but the fact is, there are too many places where text exists without the ability to apply markup to go entirely without that support. The fourth principle is that of differential variability of appearance. For letters proper, their identity can be associated with a wide range of appearances from sparse to fanciful glyphs. If an entire piece of text (or even a given word) is set using a particular font style, context will enable the reader to identify the underlying letter, even if the shape is almost unrelated to the "archetypical shape" documented in the Standard. When letters or marks get re-used in notational systems, though, the permissible range of variability changes dramatically - variations that do not change the meaning of a word in styled text suddenly change the meaning of text in a certain notational system. Hence the disunification of certain letters or marks (but not all of them) in support of mathematical notation. 
The fifth principle appears to be to disunify only as far as and only when necessary. The biggest downside of this principle is that it leads to "late" disunifications; some characters get disunified as the committee becomes aware of some issue, leading to the problem of legacy data. But it has usefully somewhat limited the further proliferation of characters of identical appearance. The final principle is compatibility. This covers being able to round-trip from certain legacy encodings. This principle may force some disunifications that otherwise might not have happened, but it also isn't a panacea: there are legacy encodings that are mutually incompatible, so that one needs to make a choice which one to support. TeX, being a "glyph based" system, loses out here in comparison to legacy plain-text character encoding systems such as the 8859 series of ISO/IEC standards. Some unifications among punctuation marks in particular seem to have been made on a more ad-hoc basis. This issue is exacerbated by the fact that many such systems lack both the wide familiarity of standard writing systems (with their tolerance for glyph variation) and the rigor of something like mathematical notation. This
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
Hello Sundar, On 2018/05/28 04:27, SundaraRaman R via Unicode wrote: Hi, In languages like Ruby or Java (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)), functions to check if a character is alphabetic do that by looking for the 'Alphabetic' property (defined true if it's in one of the L categories, or Nl, or has 'Other_Alphabetic' property). When parsing Tamil text, this works out well for independent vowels and consonants (which are in Lo), and for most dependent signs (which are in Mc or Mn but have the 'Other_Alphabetic' property), but the very common pulli (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to concluding any string containing it to be non-alphabetic. This doesn't make sense to me since the Virama “◌்” as much of an alphabetic character as any of the "Dependent Vowel" characters which have been given the 'Other_Alphabetic' property. Is there a rationale behind this difference, or is it an oversight to be corrected? I suggest submitting an error report via https://www.unicode.org/reporting.html. I haven't studied the issue in detail (sorry, just no time this week), but it sounds reasonable to give the VIRAMA the 'Other_Alphabetic' property. I'd recommend to mention examples other than Tamil in your report (assuming they exist). BTW, what's the method you are using in Ruby? If there's a problem in Ruby (which I don't think; it's just using Unicode data), then please make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I should be able to follow up on that. Regards, Martin.
Re: Unicode characters unification
On Mon, 28 May 2018 21:14:58 +0200 Hans Åberg via Unicode wrote: > > On 28 May 2018, at 21:01, Richard Wordingham via Unicode > > wrote: > > > > On Mon, 28 May 2018 20:19:09 +0200 > > Hans Åberg via Unicode wrote: > > > >> Indistinguishable math styles Latin and Greek uppercase letters > >> have been added, even though that was not so in for example TeX, > >> and thus no encoding legacy to consider. > > > > They sort differently - one can have vaguely alphabetical indexes of > > mathematical symbols. They also have quite different compatibility > > decompositions. > > > > Does sorting offer an argument for encoding these symbols > > differently. I'm not sure it's a strong arguments - how likely is > > one to have a list where the difference matters? > > The main point is that they are not likely to be distinguishable when > used side-by-side in the same formula. They could be of significance > if using Greek names instead of letters, of length greater than one, > then. But it is not wrong to add them, because it is easier than > having to think through potential uses. By these symbols, I meant the quarter-tone symbols. Capital em and capital mu, as symbols, need to be encoded separately for proper sorting. Richard.
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
SundaraRaman R wrote: but the very common pulli (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to concluding any string containing it to be non-alphabetic. Is this definition part of Unicode? I thought the use of General Category to answer questions like "this sequence is a word" or "this string is alphabetic" was much more complex than that. (I'm not even sure what the latter means, for any script with any sort of combining mark.) Richard Wordingham wrote: The effects of virama that spring to mind are: (a) Causing one or both letters on either side to change or combine to indicate combination; (b) Appearing as a mark only if it does not affect one of the letters on either side; (c) Causing a left matra to appear on the left of the sequence of consonants joined by a sequence of non-visible viramas. Most of these don't apply to Tamil, of course. -- Doug Ewell | Thornton, CO, US | ewellic.org
Re: Unicode characters unification
> On 28 May 2018, at 21:38, Richard Wordingham > wrote: > > On Mon, 28 May 2018 21:14:58 +0200 > Hans Åberg via Unicode wrote: > >>> On 28 May 2018, at 21:01, Richard Wordingham via Unicode >>> wrote: >>> >>> On Mon, 28 May 2018 20:19:09 +0200 >>> Hans Åberg via Unicode wrote: >>> Indistinguishable math styles Latin and Greek uppercase letters have been added, even though that was not so in for example TeX, and thus no encoding legacy to consider. >>> >>> They sort differently - one can have vaguely alphabetical indexes of >>> mathematical symbols. They also have quite different compatibility >>> decompositions. >>> >>> Does sorting offer an argument for encoding these symbols >>> differently. I'm not sure it's a strong arguments - how likely is >>> one to have a list where the difference matters? >> >> The main point is that they are not likely to be distinguishable when >> used side-by-side in the same formula. They could be of significance >> if using Greek names instead of letters, of length greater than one, >> then. But it is not wrong to add them, because it is easier than >> having to think through potential uses. > > By these symbols, I meant the quarter-tone symbols. Capital em and > capital mu, as symbols, need to be encoded separately for proper > sorting. Some of the math style letters are out of order for legacy reasons, so sorting may not work well. SMuFL have different fonts for text and music engraving, but I can't think of any use of sorting them.
Re: Unicode characters unification
> On 28 May 2018, at 21:01, Richard Wordingham via Unicode > wrote: > > On Mon, 28 May 2018 20:19:09 +0200 > Hans Åberg via Unicode wrote: > >> Indistinguishable math styles Latin and Greek uppercase letters have >> been added, even though that was not so in for example TeX, and thus >> no encoding legacy to consider. > > They sort differently - one can have vaguely alphabetical indexes of > mathematical symbols. They also have quite different compatibility > decompositions. > > Does sorting offer an argument for encoding these symbols differently. > I'm not sure it's a strong arguments - how likely is one to have a list > where the difference matters? The main point is that they are not likely to be distinguishable when used side-by-side in the same formula. They could be of significance if using Greek names instead of letters, of length greater than one, then. But it is not wrong to add them, because it is easier than having to think through potential uses.
Re: Unicode characters unification
On Mon, 28 May 2018 20:19:09 +0200 Hans Åberg via Unicode wrote: > Indistinguishable math styles Latin and Greek uppercase letters have > been added, even though that was not so in for example TeX, and thus > no encoding legacy to consider. They sort differently - one can have vaguely alphabetical indexes of mathematical symbols. They also have quite different compatibility decompositions. Does sorting offer an argument for encoding these symbols differently? I'm not sure it's a strong argument - how likely is one to have a list where the difference matters? Richard.
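The point about compatibility decompositions can be checked with Python's standard unicodedata module. The sketch below picks the bold math style as one example (my choice; the message does not single out a style) and looks the characters up by name rather than hard-coding code points. The final sort is by raw code point, which is only a crude stand-in for real collation.

```python
import unicodedata

bold_m = unicodedata.lookup("MATHEMATICAL BOLD CAPITAL M")
bold_mu = unicodedata.lookup("MATHEMATICAL BOLD CAPITAL MU")


def compat(ch: str) -> str:
    """Compatibility decomposition (NFKD) of a single character."""
    return unicodedata.normalize("NFKD", ch)


for ch in (bold_m, bold_mu):
    base = compat(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
          f"U+{ord(base):04X} {unicodedata.name(base)}")

# Code point order keeps the Latin-derived and Greek-derived symbols apart,
# even though the glyphs may be indistinguishable in a formula.
by_code_point = sorted([bold_mu, bold_m, "\u039C", "M"])
print([f"U+{ord(ch):04X}" for ch in by_code_point])
```

The two visually similar symbols fold to a Latin letter and a Greek letter respectively, which is what lets an index of symbols order them differently.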
Re: Unicode characters unification
> On 28 May 2018, at 19:18, Richard Wordingham via Unicode > wrote: > > On Mon, 28 May 2018 17:54:47 +0200 > Hans Åberg via Unicode wrote: > >>> On 28 May 2018, at 17:00, Richard Wordingham via Unicode >>> wrote: >>> >>> On Mon, 28 May 2018 15:30:55 +0200 >>> Hans Åberg via Unicode wrote: > German has a special sign ß for "ss", without upper capital version. >>> >>> That doesn't prevent upper-casing - you just have to know your >>> audience. >> >> That would be the same if the Greek and Latin uppercase letters would >> have been unified: One would need to know the context. > > I've seen a commutation diagram with both U+004D LATIN CAPITAL LETTER > M and U+039C GREEK CAPITAL LETTER MU on it. I only knew the difference > because I listened to what the lecturer said. Indistinguishable math styles Latin and Greek uppercase letters have been added, even though that was not so in for example TeX, and thus no encoding legacy to consider.
Re: Unicode characters unification
On Mon, 28 May 2018 17:54:47 +0200 Hans Åberg via Unicode wrote: > > On 28 May 2018, at 17:00, Richard Wordingham via Unicode > > wrote: > > > > On Mon, 28 May 2018 15:30:55 +0200 > > Hans Åberg via Unicode wrote: > >> German has a special sign ß for "ss", without upper capital > >> version. > > > > That doesn't prevent upper-casing - you just have to know your > > audience. > > That would be the same if the Greek and Latin uppercase letters would > have been unified: One would need to know the context. I've seen a commutation diagram with both U+004D LATIN CAPITAL LETTER M and U+039C GREEK CAPITAL LETTER MU on it. I only knew the difference because I listened to what the lecturer said. > > For the > > same reason, there are two utter confusables in THE Latin SCRIPT for > > 00D0 LATIN CAPITAL LETTER ETH. > The stuff is likely added for computer legacy, if there were separate > encodings for those. Unlikely. U+00F0 LATIN SMALL LETTER ETH and U+0256 LATIN SMALL LETTER D WITH TAIL contrast in the IPA. The difference between U+0111 LATIN SMALL LETTER D WITH STROKE and U+00F0 LATIN SMALL LETTER ETH may have been debated. Richard.
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
On Mon, 28 May 2018 20:03:11 +0530 SundaraRaman R via Unicode wrote: > Hi, thanks for your reply. > > > There is only one character with a canonical combining class of 9 > > that is included as other_alphabetic, namely U+0E3A THAI CHARACTER > > PHINTHU. That last had any of the other properties of viramas back > > in Unicode 1.0; the characters that triggered such behaviours were > > permanently removed in Unicode 1.1. > > I didn't understand the second sentence here, could you clarify? Sorry, I messed that system up. It should have read, "The last time that that had any of the other properties of viramas back in Unicode 1.0;" > What > do you mean by "any of the other properties" here? The effects of virama that spring to mind are: (a) Causing one or both letters on either side to change or combine to indicate combination; (b) Appearing as a mark only if it does not affect one of the letters on either side; (c) Causing a left matra to appear on the left of the sequence of consonants joined by a sequence of non-visible viramas. > And "triggered such > behaviours" seems to imply having them in other_alphabetic had > negative consequences, could you give an example of what that might > be? Nowadays, the Thai syllable ไตร, normatively pronounced /trai/, is only encoded , and the character U+0E3A is always visible when used; for most routine purposes it is little different to U+0E38 THAI CHARACTER SARA U. However, in Unicode 1.0, while was rendered as at present, the same visible string could also be encoded as - no glyph would be rendered for U+0E3A. If one wanted the official Sanskritised Pali version, one could type ไตฺร as at present. One could also encode it as . Weirdly, I couldn't have used the phonetically ordered vowel to type a monk's name ending in มฺโม , as would have been rendered as โมฺม. 
As the non-phonetic virama-like behaviours of U+0E3A are only mentioned under the heading 'Alternate Ordering', I can only presume that they were triggered by the phonetic order vowel signs, U+0E70 to U+0E74. It is possible that U+0E3A acquired the alphabetic property because it ceased to behave like a virama. Alternatively, it may have acquired the alphabetic property because of its use in the compound vowels of minority languages. > But in the case of Tamil, I'm curious why most other combining Tamil > marks go in class 0, whereas pulli goes in 9. Even u0B82 Anusvara, a > character barely used in Tamil text, has combining class 0 and is > included in Other_Alphabetic, but the visually similar and similarly > positioned pulli is not. In this particular case, is it a historical > accident that these got assigned this way, or is there a rationale > behind these? Would it at all be possible to get this changed in the > upcoming Unicode standard? Tamil has usually been treated as just another Indian Indic script. U+0E3A is the only virama-like character with the property of being 'alphabetic'. I can't see a change making it into Unicode 11.0. It requires too much careful thought. Besides, anything that considered as alphabetic should also considerer as alphabetic - they should be mostly interchangeable in Tamil. > > I fear that the correct test for what you want is to split text into > > words and check that each word begins with an alphabetic > > character. > > Do you mean "each grapheme cluster begins with an alphabetic > character" here? It seems to me (in my very limited Unicode knowledge) > that such a test, going through grapheme clusters and checking the > first codepoint in each, would also ensure the text is full > alphabetic. Not directly. Is the string "mark2mark" alphabetic? It constitutes a single word. My suggested simplification would say 'no', as it contains '2'; perhaps my simplification is wrong. 
> And it has the advantage that more languages have a > (relatively) easy way for splitting text into grapheme clusters, than > for checking minor Unicode properties like WordBreak, so this one > might be easier to implement. Does this test anywhere in the ballpark > of being right? Yes, it's close to being right. Note that simple approximations for SE Asian word-breaking (e.g. treating SE Asian characters as alphabetic) should work well for your application. Richard.
Re: Unicode characters unification
> On 28 May 2018, at 17:00, Richard Wordingham via Unicode > wrote: > > On Mon, 28 May 2018 15:30:55 +0200 > Hans Åberg via Unicode wrote: > >>> On 28 May 2018, at 15:10, Richard Wordingham via Unicode >>> wrote: >>> >>> On Mon, 28 May 2018 10:08:30 +0200 >>> Hans Åberg via Unicode wrote: >>> It is not about precision, but concepts. Like B, Β, and В, which could have been unified, but are not. >>> >>> Unifying these would make a real mess of lower casing! >> >> German has a special sign ß for "ss", without upper capital version. > > That doesn't prevent upper-casing - you just have to know your > audience. That would be the same if the Greek and Latin uppercase letters would have been unified: One would need to know the context. > The three letters like 'B' have very different lower case > forms, and very few would agree that they were the same letter. They were the same in the Uncial script, but evolved to be viewed as different. That is common with math symbols: something available evolving into separate symbols. > For the > same reason, there are two utter confusables in THE Latin SCRIPT for > 00D0 LATIN CAPITAL LETTER ETH. The stuff is likely added for computer legacy, if there were separate encodings for those. > More notably though, one just has to run > the risk of getting a culturally incorrect upper case when rendering > U+014A LATIN CAPITAL LETTER ENG; whether the three alternatives are the > same letter is debatable. Unified CJK Ideographs differ by stroke order.
Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols
On Mon, 28 May 2018 15:30:55 +0200 Hans Åberg via Unicode wrote: > > On 28 May 2018, at 15:10, Richard Wordingham via Unicode > > wrote: > > > > On Mon, 28 May 2018 10:08:30 +0200 > > Hans Åberg via Unicode wrote: > > > >> It is not about precision, but concepts. Like B, Β, and В, which > >> could have been unified, but are not. > > > > Unifying these would make a real mess of lower casing! > > German has a special sign ß for "ss", without upper capital version. That doesn't prevent upper-casing - you just have to know your audience. The three letters like 'B' have very different lower case forms, and very few would agree that they were the same letter. For the same reason, there are two utter confusables in THE Latin SCRIPT for 00D0 LATIN CAPITAL LETTER ETH. More notably though, one just has to run the risk of getting a culturally incorrect upper case when rendering U+014A LATIN CAPITAL LETTER ENG; whether the three alternatives are the same letter is debatable. Richard.
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
Hi, thanks for your reply. > There is only one character with a canonical combining class of 9 that > is included as other_alphabetic, namely U+0E3A THAI CHARACTER PHINTHU. > That last had any of the other properties of viramas back in Unicode > 1.0; the characters that triggered such behaviours were permanently > removed in Unicode 1.1. I didn't understand the second sentence here, could you clarify? What do you mean by "any of the other properties" here? And "triggered such behaviours" seems to imply having them in other_alphabetic had negative consequences, could you give an example of what that might be? > There are some notable absences from the combining marks included. > Significant absences include ZWJ, ZWNJ and CGJ. > > However, a non-erroneous *conformant* Unicode process cannot > always determine whether a string, given only that it is a string, is > composed only of alphabetic characters. The answer would be 'yes' for > but 'no' for the canonically > equivalent ! > (U+0327 is not included as alphabetic either.) > > There is at least one combination of Latin letter and combining mark > that occurs in the normal orthography of a natural language and does not > have a precomposed equivalent. Ah, that's somewhat unfortunate that such a quick and easy alphabetic check is not possible in the general case, but I can understand how it might be weird to give the Alphabetic property to a ZWJ or ZWNJ. But in the case of Tamil, I'm curious why most other combining Tamil marks go in class 0, whereas pulli goes in 9. Even u0B82 Anusvara, a character barely used in Tamil text, has combining class 0 and is included in Other_Alphabetic, but the visually similar and similarly positioned pulli is not. In this particular case, is it a historical accident that these got assigned this way, or is there a rationale behind these? Would it at all be possible to get this changed in the upcoming Unicode standard? 
(By the way, I'm happy to get a link to read through for any of my questions here. I just find it quite hard to search for and find past discussions and decision rationales regarding these, not knowing how and where to search for them.) > I fear that the correct test for what you want is to split text into > words and check that each word begins with an alphabetic character. Do you mean "each grapheme cluster begins with an alphabetic character" here? It seems to me (in my very limited Unicode knowledge) that such a test, going through grapheme clusters and checking the first codepoint in each, would also ensure the text is full alphabetic. And it has the advantage that more languages have a (relatively) easy way for splitting text into grapheme clusters, than for checking minor Unicode properties like WordBreak, so this one might be easier to implement. Does this test anywhere in the ballpark of being right? Regards, Sundar
Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols
> On 28 May 2018, at 15:10, Richard Wordingham via Unicode > wrote: > > On Mon, 28 May 2018 10:08:30 +0200 > Hans Åberg via Unicode wrote: > >> It is not about precision, but concepts. Like B, Β, and В, which >> could have been unified, but are not. > > Unifying these would make a real mess of lower casing! German has a special sign ß for "ss", without upper capital version. > What is the context in which the Arab use would benefit from having a > different encoding? Maybe if they decide to change the glyph, then what already is encoded would get the right appearance. But SMuFL might have had other reasons: the glyphs should probably be designed together. And it is simple, as one does not need to investigate their uses too much. For example, the Turkish AEU sharps are microtonal, not the ordinary ones. So if the Turkish accidentals have their own code points, one can change that later.
Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols
On Mon, 28 May 2018 10:08:30 +0200
Hans Åberg via Unicode wrote:

> > On 28 May 2018, at 03:39, Garth Wallace wrote:
> > The fact that they do not denote the same width in cents in Arabic
> > music as they do in Western modern classical does not matter. That
> > sort of precision is not inherent to the written symbols.
>
> It is not about precision, but concepts. Like B, Β, and В, which
> could have been unified, but are not.

Unifying these would make a real mess of lower casing!

What is the context in which the Arab use would benefit from having a different encoding?

Richard.
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
On Mon, 28 May 2018 00:57:03 +0530
SundaraRaman R via Unicode wrote:

> Hi,
>
> In languages like Ruby or Java
> (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)),
> functions to check if a character is alphabetic do that by looking for
> the 'Alphabetic' property (defined true if it's in one of the L
> categories, or Nl, or has 'Other_Alphabetic' property). When parsing
> Tamil text, this works out well for independent vowels and consonants
> (which are in Lo), and for most dependent signs (which are in Mc or Mn
> but have the 'Other_Alphabetic' property), but the very common pulli
> (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to
> concluding any string containing it to be non-alphabetic.
>
> This doesn't make sense to me since the Virama “◌்” as much of an
> alphabetic character as any of the "Dependent Vowel" characters which
> have been given the 'Other_Alphabetic' property. Is there a rationale
> behind this difference, or is it an oversight to be corrected?

There is only one character with a canonical combining class of 9 that is included as other_alphabetic, namely U+0E3A THAI CHARACTER PHINTHU. That last had any of the other properties of viramas back in Unicode 1.0; the characters that triggered such behaviours were permanently removed in Unicode 1.1.

There are some notable absences from the combining marks included. Significant absences include ZWJ, ZWNJ and CGJ.

However, a non-erroneous *conformant* Unicode process cannot always determine whether a string, given only that it is a string, is composed only of alphabetic characters. The answer would be 'yes' for but 'no' for the canonically equivalent ! (U+0327 is not included as alphabetic either.)

There is at least one combination of Latin letter and combining mark that occurs in the normal orthography of a natural language and does not have a precomposed equivalent.
I fear that the correct test for what you want is to split text into words and check that each word begins with an alphabetic character. That test can be made by a conformant process. I think, but have not checked, that the test can be simplified to:

(a) Check that the first character is alphabetic.
(b) Ignore every character with a WordBreak property of Extend or ZWJ.
(c) Check that all other characters are alphabetic.

Richard.
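[Editorial sketch, not part of the original message.] Steps (a) to (c) above might look roughly like the following in Python. This is only an approximation under stated assumptions: Python's standard `unicodedata` module exposes neither the Alphabetic property nor WordBreak, so the hypothetical helper `is_letter_like` stands in for "alphabetic" using general categories L* and Nl only, and the hypothetical helper `is_skippable` stands in for "WordBreak of Extend or ZWJ" using the mark categories Mn, Mc and Me plus ZWJ and ZWNJ. A faithful version would read the real property data from the UCD.

```python
import unicodedata

ZWJ = "\u200d"
ZWNJ = "\u200c"
MARK_CATEGORIES = {"Mn", "Mc", "Me"}


def is_letter_like(ch):
    """Assumed stand-in for Alphabetic: general category L* or Nl only."""
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nl"


def is_skippable(ch):
    """Assumed stand-in for WordBreak=Extend or ZWJ: combining marks, ZWJ, ZWNJ."""
    return ch in (ZWJ, ZWNJ) or unicodedata.category(ch) in MARK_CATEGORIES


def looks_alphabetic(text):
    # (a) The first character must itself be alphabetic.
    if not text or not is_letter_like(text[0]):
        return False
    # (b) Skip marks and joiners; (c) everything else must be alphabetic.
    return all(is_letter_like(ch) for ch in text[1:] if not is_skippable(ch))
```

With these stand-ins, a Tamil word ending in pulli passes, and a precomposed letter and its canonically equivalent decomposed form (for example U+00E7 versus U+0063 U+0327, chosen here purely as an illustration) give the same answer, while a string that starts with a combining mark is rejected.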
Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols
> On 28 May 2018, at 11:05, Julian Bradfield via Unicode wrote:
>
> On 2018-05-28, Hans Åberg via Unicode wrote:
>>> On 28 May 2018, at 03:39, Garth Wallace wrote:
>>>> On Sun, May 27, 2018 at 3:36 PM, Hans Åberg wrote:
>>>> The flats and sharps of Arabic music are semantically the same as in
>>>> Western music, departing from Pythagorean tuning, then, but the microtonal
>>>> accidentals are different: they simply reused some that were available.
> ...
>>> The fact that they do not denote the same width in cents in Arabic music as
>>> they do in Western modern classical does not matter. That sort of precision
>>> is not inherent to the written symbols.
>>
>> It is not about precision, but concepts. Like B, Β, and В, which could have
>> been unified, but are not.
>
> Latin, Greek, Cyrillic etc. could not have been unified, because of the
> requirement to have round-trip compatibility with previous encodings.

Indeed, in Unicode because of that, which I pointed out.

> It is also, of course, convenient for many reasons to have the notion
> of "script" hard-coded into unicode code-points, instead of in
> higher-level mark-up where it arguably belongs - just as, when
> copyright finally expires, it will be convenient to have Tolkien's
> runes disunified from historical runes (which is the line taken by the
> proposal waiting for that day). Whether it is so convenient to have a
> "music script" notion hard-coded is presumably what this argument is
> about. It's not obvious to me that musical notation is something that
> carries the "script" baggage in the same way that writing systems do.

Indeed, that is what I also pointed out. So I suggested contacting the SMuFL people, who might explain the underlying reasoning, and then making a decision about what might be suitable for Unicode.

They probably have them separate for the same reason as for scripts: originally different font encodings, but those are not official, and in addition it is for music engraving, and not writing in text files.
Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols
On 2018-05-28, Hans Åberg via Unicode wrote:
>> On 28 May 2018, at 03:39, Garth Wallace wrote:
>>> On Sun, May 27, 2018 at 3:36 PM, Hans Åberg wrote:
>>> The flats and sharps of Arabic music are semantically the same as in
>>> Western music, departing from Pythagorean tuning, then, but the microtonal
>>> accidentals are different: they simply reused some that were available.
...
>> The fact that they do not denote the same width in cents in Arabic music as
>> they do in Western modern classical does not matter. That sort of precision
>> is not inherent to the written symbols.
>
> It is not about precision, but concepts. Like B, Β, and В, which could have
> been unified, but are not.

Latin, Greek, Cyrillic etc. could not have been unified, because of the requirement to have round-trip compatibility with previous encodings.

It is also, of course, convenient for many reasons to have the notion of "script" hard-coded into unicode code-points, instead of in higher-level mark-up where it arguably belongs - just as, when copyright finally expires, it will be convenient to have Tolkien's runes disunified from historical runes (which is the line taken by the proposal waiting for that day). Whether it is so convenient to have a "music script" notion hard-coded is presumably what this argument is about. It's not obvious to me that musical notation is something that carries the "script" baggage in the same way that writing systems do.

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols
> On 28 May 2018, at 03:39, Garth Wallace wrote:
>
>> On Sun, May 27, 2018 at 3:36 PM, Hans Åberg wrote:
>> The flats and sharps of Arabic music are semantically the same as in Western
>> music, departing from Pythagorean tuning, then, but the microtonal
>> accidentals are different: they simply reused some that were available.
>>
> But they aren't different! They are the same symbols. They are, as you
> yourself say, reused.

Historically, yes, but not necessarily now.

> The fact that they do not denote the same width in cents in Arabic music as
> they do in Western modern classical does not matter. That sort of precision
> is not inherent to the written symbols.

It is not about precision, but concepts. Like B, Β, and В, which could have been unified, but are not.

> By contrast, Persian music notation invented new microtonal accidentals,
> called the koron and sori, and my impression is that their average value, as
> measured by Hormoz Farhat in his thesis, is also usable in Arabic music. For
> comparison, I have posted the Arabic maqam in Helmholtz-Ellis notation [1]
> using this value; note that one actually needs two extra microtonal
> accidentals - Arabic microtonal notation is in fact not complete.
>
> The E24 exact quarter-tones are suitable for making a piano sound badly out
> of tune. Compare that with the accordion in [2], Farid El Atrache -
> "Noura-Noura".
>
> 1. https://lists.gnu.org/archive/html/lilypond-user/2016-02/msg00607.html
> 2. https://www.youtube.com/watch?v=fvp6fo7tfpk
>
> I don't really see how this is relevant. Nobody is claiming that the koron
> and sori accidentals are the same symbols as the Arabic half-sharp and flat
> with crossbar. They look entirely different.

Arabic music simply happens to use Western style accidentals for concepts similar to Persian music rather than Western music.