Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara
On Sat, 4 Jan 2020 22:15:59 + James Kass via Unicode wrote:

> For the Grantha examples above, Grantha (1) displays much better
> here. It seems daft to put a spacing character between a base
> character and any mark which is supposed to combine with the base
> character.

Although it's not related to this issue, that happens in the USE scheme. It puts vowels before vowel modifiers, which has this problem if any of the vowel modifiers precede a vowel in visual order, as happens in Thai and closely related writing systems.

Richard.
Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara
On 2020-01-04 12:50 PM, Richard Wordingham via Unicode wrote:

> dev2: कः꣡
> dev3: क꣡ः
> Grantha: (1) 𑌕𑍧𑌃 (2) 𑌕𑌃𑍧
>
> The second Grantha spelling is enabled by a Harfbuzz-only change to the USE categorisations. It treats Grantha visarga and spacing anusvara as though inpc=Top rather than inpc=Right. As I am using Ubuntu 16.04, this override isn't supported in applications that use the system HarfBuzz library, such as my email client.
>
> We are now establishing incompatible Devanagari font-specific encodings fully compliant with TUS!

This seems to be a very bad approach. And apparently it isn't limited to the Devanagari script.

For the Grantha examples above, Grantha (1) displays much better here. It seems daft to put a spacing character between a base character and any mark which is supposed to combine with the base character.
Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara
On Thu, 2 Jan 2020 20:20:34 + Richard Wordingham via Unicode wrote:

> There's a project whose basis I can't find to convert Indian Indic
> rendering at least to use the USE. Now, according to the
> specification of the USE, visarga, anusvara and cantillation marks
> are all classified as vowel modifiers, and are so ordered relative to
> one another in the Indian Indic order: left, top, bottom, right. So,
> the problem should already be solved for Grantha, and, if the plans
> come to fruition, will work with a font whose Devanagari script tag
> is 'dev3'. However, I may have overlooked a set of overrides to the
> USE categorisations.

I've now knocked up a partial* representation* of a Devanagari dev3 and a Grantha font (which I'm dubbing 'Mock Indic 3'). The supported orders of COMBINING DIGIT ONE and VISARGA, as in Firefox on Linux, are:

dev2: कः꣡
dev3: क꣡ः
Grantha: (1) 𑌕𑍧𑌃 (2) 𑌕𑌃𑍧

The second Grantha spelling is enabled by a Harfbuzz-only change to the USE categorisations. It treats Grantha visarga and spacing anusvara as though inpc=Top rather than inpc=Right. As I am using Ubuntu 16.04, this override isn't supported in applications that use the system HarfBuzz library, such as my email client.

We are now establishing incompatible Devanagari font-specific encodings fully compliant with TUS!

Richard.

* Partial = much is not handled
* Representation = glyphs are wrong, merely showing arrangement. (I've actually re-used a Tai Tham font.)
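Editorial sketch: the ordering rule described above (vowel modifiers ordered left, top, bottom, right) and the effect of the HarfBuzz override can be modelled in a few lines of Python. This is not how any shaper is implemented. The `INPC` table is hand-copied for only the four marks in the examples and should be checked against the Unicode data files, the code points are my reading of the example strings, and the function name `modifiers_in_order` is hypothetical.

```python
# Rank of each positional class in the "Indian Indic" order quoted above.
POSITION_RANK = {"Left": 0, "Top": 1, "Bottom": 2, "Right": 3}

# Assumed positional categories for the marks used in the examples.
INPC = {
    "\u0903": "Right",      # DEVANAGARI SIGN VISARGA
    "\uA8E1": "Top",        # COMBINING DEVANAGARI DIGIT ONE
    "\U00011303": "Right",  # GRANTHA SIGN VISARGA
    "\U00011367": "Top",    # COMBINING GRANTHA DIGIT ONE
}

# The override described in the message: Grantha visarga treated as Top.
INPC_WITH_OVERRIDE = {**INPC, "\U00011303": "Top"}


def modifiers_in_order(text, inpc=INPC):
    """Return True if the known modifiers in text appear in rank order.

    Marks of equal rank are accepted in either order; characters that
    are not in the table (such as the base consonant) are ignored.
    """
    ranks = [POSITION_RANK[inpc[ch]] for ch in text if ch in inpc]
    return ranks == sorted(ranks)


grantha_1 = "\U00011315\U00011367\U00011303"  # KA + digit one + visarga
grantha_2 = "\U00011315\U00011303\U00011367"  # KA + visarga + digit one

if __name__ == "__main__":
    print(modifiers_in_order(grantha_1))                      # spelling (1)
    print(modifiers_in_order(grantha_2))                      # spelling (2)
    print(modifiers_in_order(grantha_2, INPC_WITH_OVERRIDE))  # (2) with override
```

Under these assumptions, spelling (1) passes without the override and spelling (2) passes only with it, which matches the behaviour reported in the message.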
Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara
On Thu, 2 Jan 2020 15:07:04 -0800 Norbert Lindenberg wrote:

>> On Jan 2, 2020, at 12:20, Richard Wordingham via Unicode
>> wrote:
>> So, the problem should already be solved for Grantha, and,
>> if the plans come to fruition, will work with a font whose
>> Devanagari script tag is 'dev3'. However, I may have overlooked a
>> set of overrides to the USE categorisations.

> You can create Indic 3 fonts that get processed by the USE today, and
> use them with Harfbuzz (Chrome, Firefox, Android, …) and with
> CoreText (Apple platforms). I don’t know if anybody has already
> created such fonts.

> https://lindenbergsoftware.com/en/notes/brahmic-script-support-in-opentype/

Is there a script tag registry, or is it now a free-for-all as with font names? (I suppose it is implicitly constrained by what the individual renderers recognise.) The nearest to a registry I can find is at https://docs.microsoft.com/en-us/typography/opentype/spec/ttoreg, but that appears to be limited to what Microsoft supports - "The tag registry defines the OpenType Layout tags that Microsoft supports". None of the Indic 3 script tags are there.

Richard.
Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara
On Thu, 2 Jan 2020 07:52:55 + James Kass via Unicode wrote:

> > I've been looking at Microsoft's specification of Devanagari
> > character order. In
> > https://docs.microsoft.com/en-us/typography/script-development/devanagari,
> > the consonant syllable ends
> >
> > [N]+[A] + [< H+[] | {M}+[N]+[H]>]+[SM]+[(VD)]
> >
> > where
> > N is nukta
> > A is anudatta (U+0952)
> > H is halant/virama
> > M is matra
> > SM is syllable modifier signs
> > VD is vedic
> >
> > "Syllable modifier signs" and "vedic" are not defined. It appears
> > that SM includes U+0903 DEVANAGARI SIGN VISARGA.
>
> What action should Microsoft take to satisfy the needs of the user
> community?
> 1. No action, maintain status quo.
> 2. Swap SM and VD in the specs ordering.
> 3. Make new category PS (post-syllable) and move VISARGA/ANUSVARA
> there.
> 4. ?

There's a project whose basis I can't find to convert Indian Indic rendering at least to use the USE. Now, according to the specification of the USE, visarga, anusvara and cantillation marks are all classified as vowel modifiers, and are so ordered relative to one another in the Indian Indic order: left, top, bottom, right. So, the problem should already be solved for Grantha, and, if the plans come to fruition, will work with a font whose Devanagari script tag is 'dev3'. However, I may have overlooked a set of overrides to the USE categorisations.

> What kind of impact would there be on existing data if Microsoft
> revised the ordering?

A good question that *I* can't answer.

> Or should Unicode encode a new character like ZERO-WIDTH INVISIBLE
> DOTTED CIRCLE so that users can suppress unwanted and unexpected
> dotted circles by adding superfluous characters to the text stream?

It would be useful to be able to suppress inappropriate dotted circles without disrespecting the character identity of U+25CC. (Doable in HarfBuzz, but not in OpenType.) There's actually been a suggestion that dotted circles should be applied after global substitutions have been applied, so as to prevent the overcoming of renderer faults.

On Sat, 21 Dec 2019 11:57:53 +0530 Shriramana Sharma via Unicode wrote:

> This is all the more so since in some Vedic contexts (Sama Gana) the
> visarga is far separated from the syllable by other syllables like
> digits (themselves carrying combining marks) or spacing anusvara, as
> seen in examples from my Grantha proposal L2/09-372 p 40.

I presume you are referring to the middle picture. I'm having difficulty reading it. Could you please tell us its transcription and encoding?

A minimal change would be to extend the range of base characters to include digits - I'm surprised matras don't frequently get added to them.

Richard.
Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara
On 2020-01-02 1:04 AM, Richard Wordingham wrote in a thread deriving from this one,

> Have you found a definition of the ISCII handling of Vedic characters?

No. It would be helpful. ISCII apparently wasn't really used much. It would also be helpful to know the encoding order in any legacy ISCII data using the Vedic characters with respect to VISARGA/ANUSVARA. Although such legacy data seems unlikely, I'd expect VISARGA/ANUSVARA to be entered/stored post-syllable.

> I've been looking at Microsoft's specification of Devanagari character
> order. In
> https://docs.microsoft.com/en-us/typography/script-development/devanagari,
> the consonant syllable ends
>
> [N]+[A] + [< H+[] | {M}+[N]+[H]>]+[SM]+[(VD)]
>
> where
> N is nukta
> A is anudatta (U+0952)
> H is halant/virama
> M is matra
> SM is syllable modifier signs
> VD is vedic
>
> "Syllable modifier signs" and "vedic" are not defined. It appears that
> SM includes U+0903 DEVANAGARI SIGN VISARGA.

What action should Microsoft take to satisfy the needs of the user community?

1. No action, maintain status quo.
2. Swap SM and VD in the specs ordering.
3. Make new category PS (post-syllable) and move VISARGA/ANUSVARA there.
4. ?

What kind of impact would there be on existing data if Microsoft revised the ordering?

Or should Unicode encode a new character like ZERO-WIDTH INVISIBLE DOTTED CIRCLE so that users can suppress unwanted and unexpected dotted circles by adding superfluous characters to the text stream?

> I note that even ग॒ः is
> given a dotted circle by HarfBuzz.

Same on Win 7. And (गः॒) breaks the mark positioning as expected.
Re: One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)
On Wed, 1 Jan 2020 20:11:04 + James Kass via Unicode wrote:

> On 2020-01-01 11:17 AM, Richard Wordingham via Unicode wrote:
>
> > That's exactly the sort of mess that jack-booted renderers are
> > trying to minimise. Their principle is that there should be only
> > one encoding per shape, though to be fair:
> >
> > 1) some renderers accept canonical equivalents.
> > 2) tolerance may be allowed for ligating (ZWJ, ZWNJ, CGJ),
> > collating (CGJ, SHY) and line-breaking controls (SHY, ZWSP, WJ).
> > 3) Superseded chillu encodings are still supported.
>
> There was never any need for atomic chillu form characters. The
> principle of only one encoding per shape is best achieved when every
> shape gets an atomic encoding.

I should have written per-word shape. I should also have added that most renderers attempt to handle Mongolian, despite its encoding Middle Mongolian phonetics rather than characters. Also, they don't attempt to sort the Arabic script per-language subsets out, which leads to a bad mess at Wiktionary when Unicode characters differ only in a few forms.

> Glyph-based encoding is incompatible
> with Unicode character encoding principles.

Visual encoding sometimes works - phonetic order for Thai is so complicated that it is unsurprising that its definition is partly missing from Unicode 1.0. The official history hides behind incompatibility with the Thai national standard, but phonetic order was simply too complicated for Thai. Additionally, Thais don't agree on where preposed vowels go relative to Pali consonant clusters - they don't agree that all of them should appear in the middle of the cluster. (I suppose the positioning rule could have been made a stylistic feature of fonts.)

An analogue is Lao collation. While syllable boundaries can overwhelmingly be discerned in modern Lao, Lao collations are too complicated to be accepted for ICU if they are to support anything but single syllables. CLDR collation (interpreted as a specification with the normal use of specification language for the form of definitions) can just cope, whereas the UCA can't, but the tables are huge.

Richard.
Re: One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)
On Wed, 1 Jan 2020 23:09:49 + James Kass via Unicode wrote:

> On 2020-01-01 8:11 PM, James Kass wrote:
> > It’s too bad that ISCII didn’t accommodate the needs of Vedic
> > Sanskrit, but here we are.
>
> Sorry, that might be wrong to say. It's possible that it's Unicode's
> adaptation of ISCII that hinders Vedic Sanskrit.

Have you found a definition of the ISCII handling of Vedic characters?

The problem lies in Unicode's failure to standardise the encoding of Devanagari text. But for the consistent failure to include a standardisation of text in a script in TUS, one might wonder if the original idea was to duck the issue by resorting to canonical equivalence.

I've been looking at Microsoft's specification of Devanagari character order. In https://docs.microsoft.com/en-us/typography/script-development/devanagari, the consonant syllable ends

[N]+[A] + [< H+[] | {M}+[N]+[H]>]+[SM]+[(VD)]

where
N is nukta
A is anudatta (U+0952)
H is halant/virama
M is matra
SM is syllable modifier signs
VD is vedic

"Syllable modifier signs" and "vedic" are not defined. It appears that SM includes U+0903 DEVANAGARI SIGN VISARGA.

I note that even ग॒ः is given a dotted circle by HarfBuzz. Now, this might not be an entirely fair test; I suspect anudatta is assigned this position because originally the Sindhi implosives were encoded as consonant plus nukta and anudatta, though rendering still fails with HarfBuzz when nukta is inserted (ग़॒ः).

Richard.
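Editorial sketch: to make the consequence of the quoted syllable-tail pattern concrete, the following Python regex approximates it. It is a simplified reading and not Microsoft's implementation. The memberships of `SM` and `VD` are guesses (the message itself notes that the specification does not define them), the matra range is only a subset, the joiner slot after H is left out, and `tail_is_accepted` is a hypothetical name.

```python
import re

NUKTA = "\u093C"     # N
ANUDATTA = "\u0952"  # A
HALANT = "\u094D"    # H
MATRA = "\u093E-\u094C"                 # M: assumed subset of the matras
SM = "\u0901-\u0903"                    # assumed: candrabindu, anusvara, visarga
VD = "\u0951\u0953\u0954\uA8E0-\uA8F1"  # assumed: a few Vedic/accent marks

# [N]+[A] + [< H | {M}+[N]+[H] >] + [SM] + [(VD)], with the joiner slot omitted.
TAIL = re.compile(
    f"{NUKTA}?{ANUDATTA}?"
    f"(?:{HALANT}|[{MATRA}]*{NUKTA}?{HALANT}?)?"
    f"[{SM}]?[{VD}]?"
)


def tail_is_accepted(marks: str) -> bool:
    """Return True if the marks after the base consonant fit the pattern."""
    return TAIL.fullmatch(marks) is not None


if __name__ == "__main__":
    print(tail_is_accepted("\u0903\u0951"))  # visarga, then U+0951
    print(tail_is_accepted("\u0951\u0903"))  # U+0951, then visarga
    print(tail_is_accepted("\u0952\u0903"))  # anudatta, then visarga
```

Under these assumptions the pattern accepts visarga followed by U+0951 and rejects the reverse order, which is the order the thread argues for. It also accepts anudatta followed by visarga, the sequence reported above as still drawing a dotted circle in HarfBuzz.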
Re: One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)
On 2020-01-01 8:11 PM, James Kass wrote:

> It’s too bad that ISCII didn’t accommodate the needs of Vedic Sanskrit, but here we are.

Sorry, that might be wrong to say. It's possible that it's Unicode's adaptation of ISCII that hinders Vedic Sanskrit.
One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)
On 2020-01-01 11:17 AM, Richard Wordingham via Unicode wrote:

> That's exactly the sort of mess that jack-booted renderers are trying
> to minimise. Their principle is that there should be only one encoding
> per shape, though to be fair:
>
> 1) some renderers accept canonical equivalents.
> 2) tolerance may be allowed for ligating (ZWJ, ZWNJ, CGJ), collating
> (CGJ, SHY) and line-breaking controls (SHY, ZWSP, WJ).
> 3) Superseded chillu encodings are still supported.

There was never any need for atomic chillu form characters.

The principle of only one encoding per shape is best achieved when every shape gets an atomic encoding. Glyph-based encoding is incompatible with Unicode character encoding principles.

It’s too bad that ISCII didn’t accommodate the needs of Vedic Sanskrit, but here we are.
Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara
On Wed, 1 Jan 2020 01:19:02 + James Kass via Unicode wrote:

> A workaround until some kind of satisfactory adjustment is made might
> be to simply use COLON for VISARGA. Or...
>
> VISARGA ⇒ U+02F8 MODIFIER LETTER RAISED COLON
> ANUSVARA ⇒ U+02D9 DOT ABOVE
>
> ...as long as the font(s) included both those characters.
>
> य॑ यॆ॑
>
> य॑ं -- anusvara last
> यॆ॑ं -- "
>
> य॑: -- colon last
> यॆ॑: -- "
>
> य॑˸ -- raised colon modifier last
> यॆ॑˸ -- "
>
> य॑˙ -- spacing dot above last
> यॆ॑˙ -- "

That's exactly the sort of mess that jack-booted renderers are trying to minimise. Their principle is that there should be only one encoding per shape, though to be fair:

1) some renderers accept canonical equivalents.
2) tolerance may be allowed for ligating (ZWJ, ZWNJ, CGJ), collating (CGJ, SHY) and line-breaking controls (SHY, ZWSP, WJ).
3) Superseded chillu encodings are still supported.

Richard.
Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara
A workaround until some kind of satisfactory adjustment is made might be to simply use COLON for VISARGA. Or...

VISARGA ⇒ U+02F8 MODIFIER LETTER RAISED COLON
ANUSVARA ⇒ U+02D9 DOT ABOVE

...as long as the font(s) included both those characters.

य॑ यॆ॑

य॑ं -- anusvara last
यॆ॑ं -- "

य॑: -- colon last
यॆ॑: -- "

य॑˸ -- raised colon modifier last
यॆ॑˸ -- "

य॑˙ -- spacing dot above last
यॆ॑˙ -- "
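Editorial sketch: the substitution proposed above amounts to a two-entry character mapping, shown here in Python purely as an illustration. The function name `apply_workaround` is hypothetical. Note that the result no longer contains VISARGA or ANUSVARA at all, so searching, collation and any later conversion would see different characters.

```python
# Mapping taken from the message: VISARGA -> U+02F8, ANUSVARA -> U+02D9.
WORKAROUND = str.maketrans({
    "\u0903": "\u02F8",  # DEVANAGARI SIGN VISARGA -> MODIFIER LETTER RAISED COLON
    "\u0902": "\u02D9",  # DEVANAGARI SIGN ANUSVARA -> DOT ABOVE
})


def apply_workaround(text: str) -> str:
    """Replace visarga/anusvara with the spacing look-alikes from the message."""
    return text.translate(WORKAROUND)


if __name__ == "__main__":
    original = "\u092F\u0951\u0903"  # YA + U+0951 + VISARGA
    replaced = apply_workaround(original)
    print([f"U+{ord(ch):04X}" for ch in replaced])
```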
Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara
On 2019-12-21 6:27 AM, Shriramana Sharma via Unicode wrote:

> However, even the simplest Vedic sequence (not involving Sama Vedic or multiple tone marker combinations) like दे॒वेभ्य॑ः throws up a dotted circle, and one is expected (see developer feedback in that bug report) to input the visarga before tone markers, hoping the software is intelligent enough to skip over the visarga (or anusvara) and place the tone marker over the preceding syllable correctly. Why it is necessary to put the visarga first in input only to have to skip over it in shaping is beyond me.

य॔ यॆ॔

य॔ः -- visarga last
यॆ॔ः -- "

यः॔ -- visarga before accent (U+0954)
यॆः॔ -- "

य॑ यॆ॑

य॑ः -- visarga last
यॆ॑ः -- "

यः॑ -- visarga before svarita (U+0951)
यॆः॑ -- "

U+0951 and U+0954 have a canonical combining class of 230. Putting VISARGA (CCC=0) after those CCC=230 marks generates the dotted circle for VISARGA. Putting VISARGA before those CCC=230 marks generates the dotted circle for U+0954 but drops the dotted circle for U+0951. In both cases where VISARGA comes before, the mark positioning is broken. (Mangal font, Win 7)

As far as I can tell, the simplest solution would be for the Indic shaping engines to suppress the dotted circle for VISARGA (or ANUSVARA) where appropriate. Entering/storing VISARGA or ANUSVARA at the end of the syllable makes sense since that's where it goes, visually and logically.
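Editorial sketch: the combining classes cited above can be checked with Python's standard `unicodedata` module. The snippet also shows that the two orders are not canonically equivalent, so normalisation will not turn one into the other. The helper names are hypothetical, and the labels follow the usage in this thread.

```python
import unicodedata

MARKS = {
    "svarita (U+0951)": "\u0951",
    "accent (U+0954)": "\u0954",
    "visarga (U+0903)": "\u0903",
    "anusvara (U+0902)": "\u0902",
}


def combining_classes(marks=MARKS):
    """Map each label to the canonical combining class of its character."""
    return {label: unicodedata.combining(ch) for label, ch in marks.items()}


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms are equal."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


visarga_last = "\u092F\u0951\u0903"   # YA + U+0951 + VISARGA
visarga_first = "\u092F\u0903\u0951"  # YA + VISARGA + U+0951

if __name__ == "__main__":
    print(combining_classes())
    print(canonically_equivalent(visarga_last, visarga_first))
```

Because VISARGA has combining class 0, canonical reordering never moves a tone mark across it. The choice between the two orders therefore has to be made by the encoding convention or by the shaping engine.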
Long standing problem with Vedic tone markers and post-base visarga/anusvara
https://github.com/harfbuzz/harfbuzz/issues/2017 should provide the context for this.

Ever since the early days of Devanagari Unicode, scholars like me dealing with Vedic Sanskrit orthography have been experiencing this problem, but chalked it up to early days and consequent insufficient support for Vedic sequences. Even now, Vedic support even on the font side is quite limited, and we also find limitations on the software side. So I hope it's time to fix them one by one. The issue I would like to discuss now is as follows:

# SEMANTIC DISSOCIATION OF THE VISARGA FROM THE SYLLABLE

In Vedic, syllables that carry tone markers – which are mostly above-base or below-base – often have to take a visarga, which is always post-base. In this case, the sequence intuitive to native scholars like me is:

<syllable> + <tone marker> + <visarga>

This is because the tone marker indicates the tone of the syllable (or its vowel) and the visarga is a separate aspirated sound *after* the syllable to which the tone marker doesn't apply.

In fact, the only reason the visarga sign is analysed as a combining mark rather than a separate letter is that it is not used in isolation without a preceding syllable. Otherwise, i.e. linguistically, it doesn't modify the preceding syllable in any way.

Anyhow, the point is that the tone marker should come before the visarga because it semantically applies to the preceding syllable and not the visarga.

This is all the more so since in some Vedic contexts (Sama Gana) the visarga is far separated from the syllable by other syllables like digits (themselves carrying combining marks) or spacing anusvara, as seen in examples from my Grantha proposal L2/09-372 p 40.

So the visarga is semantically quite dissociated from the preceding syllable, unlike the tone marker which is intimately associated with it.

# SAME APPLICABLE TO THE ANUSVARA

The same argument is also applicable to the anusvara as it also represents a nasal sound separate from the preceding syllable.
(The candrabindu OTOH nasalises the preceding syllable itself.)

The above Grantha proposal page also shows an example where an anusvara is orthographically separated from the preceding syllable by three characters: a tone marker + avagraha + digit. L2/15-178 shows that in equivalent contexts of Devanagari the digit 0 is used as a substitute since the Devanagari anusvara is non-spacing.

All this goes to show the dissociation from the syllable of the anusvara – just like the visarga – compared to tone markers.

So to be consistent, even in the case of Devanagari (or a similar script) where the anusvara is non-spacing, the sequence when a tone marker is also involved puts the tone marker first, as mentioned before:

<syllable> + <tone marker> + <anusvara>

# CURRENT SITUATION INCOMPATIBLE WITH ABOVE

However, even the simplest Vedic sequence (not involving Sama Vedic or multiple tone marker combinations) like दे॒वेभ्य॑ः throws up a dotted circle, and one is expected (see developer feedback in that bug report) to input the visarga before tone markers, hoping the software is intelligent enough to skip over the visarga (or anusvara) and place the tone marker over the preceding syllable correctly. Why it is necessary to put the visarga first in input only to have to skip over it in shaping is beyond me.

So it makes sense neither from a linguistic nor from a technological perspective to push the tone markers to the end of the syllable.

Even the developers acknowledge that non-spacing marks are normally (i.e. outside Indic) input before spacing ones. However, they say “we can't support that in this particular case because this is how Microsoft does it and we have to follow suit to ensure people get the same shaping for the same input”, notwithstanding the fact that the expectation to put the visarga/anusvara first is nonsensical as explained above.

So everyone is looking to Microsoft Uniscribe (or whatever its successor is) to fix things first before they can follow. I figured that if this is discussed and decided here, everyone can fix it at the same time.

--
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा 𑀰𑁆𑀭𑀻𑀭𑀫𑀡𑀰𑀭𑁆𑀫𑀸
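Editorial sketch: the two competing orders discussed in this message differ only in whether the visarga/anusvara comes before or after the tone marker, so text can be converted between them mechanically. The Python below illustrates that point and is not a recommended fix. The set of tone markers is a small guess, and it deliberately leaves out anudatta (U+0952), which the Microsoft pattern quoted elsewhere in the thread places earlier in the syllable. Both function names are hypothetical.

```python
import re

TONE = "\u0951\u0953\u0954\uA8E0-\uA8F1"  # assumed set of tone markers
POST = "\u0902\u0903"                     # anusvara, visarga

_TONE_THEN_POST = re.compile(f"([{TONE}]+)([{POST}])")
_POST_THEN_TONE = re.compile(f"([{POST}])([{TONE}]+)")


def to_renderer_order(text: str) -> str:
    """Move visarga/anusvara in front of the tone markers that precede it."""
    return _TONE_THEN_POST.sub(r"\2\1", text)


def to_logical_order(text: str) -> str:
    """Move tone markers in front of the visarga/anusvara that precedes them."""
    return _POST_THEN_TONE.sub(r"\2\1", text)


# The example word from the message, in the order the author considers natural:
# DA, E, anudatta, VA, E, BHA, virama, YA, U+0951, visarga.
word = "\u0926\u0947\u0952\u0935\u0947\u092D\u094D\u092F\u0951\u0903"

if __name__ == "__main__":
    swapped = to_renderer_order(word)
    print([f"U+{ord(ch):04X}" for ch in swapped[-3:]])
    print(to_logical_order(swapped) == word)
```

A conversion like this only moves the inconsistency around. Stored text would still exist in both orders, which is why the message asks for agreement on a single order.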