Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2020-01-04 Thread Richard Wordingham via Unicode
On Sat, 4 Jan 2020 22:15:59 +
James Kass via Unicode  wrote:

> For the Grantha examples above, Grantha (1) displays much better
> here. It seems daft to put a spacing character between a base
> character and any mark which is supposed to combine with the base
> character.

Although it's not related to this issue, that happens in the USE scheme.
It puts vowels before vowel modifiers, which has this problem if any of
the vowel modifiers precede a vowel in visual order, as happens in Thai
and closely related writing systems. 

Richard.


Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2020-01-04 Thread James Kass via Unicode



On 2020-01-04 12:50 PM, Richard Wordingham via Unicode wrote:

> dev2: कः꣡
>
> dev3: क꣡ः
> Grantha: (1) 𑌕𑍧𑌃
>          (2) 𑌕𑌃𑍧
> The second Grantha spelling is enabled by a Harfbuzz-only change to
> the USE categorisations.  It treats Grantha visarga and spacing
> anusvara as though inpc=Top rather than inpc=Right.  As I am using
> Ubuntu 16.04, this override isn't supported in applications that use the
> system HarfBuzz library, such as my email client.
>
> We are now establishing incompatible Devanagari font-specific
> encodings fully compliant with TUS!
This seems to be a very bad approach.  And apparently it isn't limited 
to the Devanagari script.


For the Grantha examples above, Grantha (1) displays much better here.  
It seems daft to put a spacing character between a base character and 
any mark which is supposed to combine with the base character.





Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2020-01-04 Thread Richard Wordingham via Unicode
On Thu, 2 Jan 2020 20:20:34 +
Richard Wordingham via Unicode  wrote:

> There's a project whose basis I can't find to convert Indian Indic
> rendering at least to use the USE.  Now, according to the
> specification of the USE, visarga, anusvara and cantillation marks
> are all classified as vowel modifiers, and are so ordered relative to
> one another in the Indian Indic order: left, top, bottom, right.  So,
> the problem should already be solved for Grantha, and, if the plans
> come to fruition, will work with a font whose Devanagari script tag
> is 'dev3'.  However, I may have overlooked a set of overrides to the
> USE categorisations.

I've now knocked up a partial* representation* of a Devanagari dev3 and
a Grantha font (which I'm dubbing 'Mock Indic 3').  The supported
orders of COMBINING DIGIT ONE and VISARGA, as in Firefox on
Linux, are:

dev2: कः꣡ 

dev3: क꣡ः  
Grantha: (1) 𑌕𑍧𑌃 
 (2) 𑌕𑌃𑍧 
The second Grantha spelling is enabled by a Harfbuzz-only change to
the USE categorisations.  It treats Grantha visarga and spacing
anusvara as though inpc=Top rather than inpc=Right.  As I am using
Ubuntu 16.04, this override isn't supported in applications that use the
system HarfBuzz library, such as my email client.
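
The effect of that override can be sketched roughly as follows. This is an illustration only, not HarfBuzz's code: it assumes a shaper that accepts vowel modifiers only in non-decreasing left, top, bottom, right order, and the names ORDER, DEFAULT, OVERRIDE and accepts() are hypothetical, as is the assignment of Top to the combining digit.

```python
# Illustrative sketch; the tables and names are assumptions, not taken
# from the USE specification or from HarfBuzz.
ORDER = {"Left": 0, "Top": 1, "Bottom": 2, "Right": 3}

GRANTHA_DIGIT_ONE = "\U00011367"  # COMBINING GRANTHA DIGIT ONE
GRANTHA_VISARGA = "\U00011303"    # GRANTHA SIGN VISARGA

# Assumed default positions, and the same table with visarga treated as Top.
DEFAULT = {GRANTHA_DIGIT_ONE: "Top", GRANTHA_VISARGA: "Right"}
OVERRIDE = {**DEFAULT, GRANTHA_VISARGA: "Top"}

def accepts(marks, table):
    """True if the marks are in non-decreasing left/top/bottom/right order."""
    ranks = [ORDER[table[m]] for m in marks]
    return ranks == sorted(ranks)

spelling_1 = GRANTHA_DIGIT_ONE + GRANTHA_VISARGA  # digit, then visarga
spelling_2 = GRANTHA_VISARGA + GRANTHA_DIGIT_ONE  # visarga, then digit

print(accepts(spelling_1, DEFAULT), accepts(spelling_2, DEFAULT))    # True False
print(accepts(spelling_1, OVERRIDE), accepts(spelling_2, OVERRIDE))  # True True
```

Under these assumptions only spelling (1) passes with the default table, while both pass once visarga is treated as Top.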

We are now establishing incompatible Devanagari font-specific
encodings fully compliant with TUS!

Richard.

* Partial = much is not handled
* Representation = glyphs are wrong, merely showing arrangement.  (I've
  actually re-used a Tai Tham font.)



Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2020-01-02 Thread Richard Wordingham via Unicode
On Thu, 2 Jan 2020 15:07:04 -0800
Norbert Lindenberg  wrote:

>> On Jan 2, 2020, at 12:20, Richard Wordingham via Unicode
>>  wrote:  

>> So, the problem should already be solved for Grantha, and,
>> if the plans come to fruition, will work with a font whose
>> Devanagari script tag is 'dev3'.  However, I may have overlooked a
>> set of overrides to the USE categorisations.  

> You can create Indic 3 fonts that get processed by the USE today, and
> use them with Harfbuzz (Chrome, Firefox, Android, …) and with
> CoreText (Apple platforms). I don’t know if anybody has already
> created such fonts.
> https://lindenbergsoftware.com/en/notes/brahmic-script-support-in-opentype/

Is there a script tag registry, or is it now a free-for-all as with font
names?  (I suppose it is implicitly constrained by what the individual
renderers recognise.)

The nearest to a registry I can find is at
https://docs.microsoft.com/en-us/typography/opentype/spec/ttoreg, but
that appears to be limited to what Microsoft supports - "The tag
registry defines the OpenType Layout tags that Microsoft supports".
None of the Indic 3 script tags are there.

Richard.




Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2020-01-02 Thread Richard Wordingham via Unicode
On Thu, 2 Jan 2020 07:52:55 +
James Kass via Unicode  wrote:

>  > I've been looking at Microsoft's specification of Devanagari
>  > character order.  In
>  > https://docs.microsoft.com/en-us/typography/script-development/devanagari,
>  > the consonant syllable ends
>  >
>  > [N]+[A] + [< H+[] | {M}+[N]+[H]>]+[SM]+[(VD)]
>  >
>  > where
>  > N is nukta
>  > A is anudatta (U+0952)
>  > H is halant/virama
>  > M is matra
>  > SM is syllable modifier signs
>  > VD is vedic
>  >
>  > "Syllable modifier signs" and "vedic" are not defined.  It appears
>  > that SM includes U+0903 DEVANAGARI SIGN VISARGA.  
> 
> What action should Microsoft take to satisfy the needs of the user 
> community?
> 1.  No action, maintain status quo.
> 2.  Swap SM and VD in the specs ordering.
> 3.  Make new category PS (post-syllable) and move VISARGA/ANUSVARA
> there.
> 4.  ?

There's a project whose basis I can't find to convert Indian Indic
rendering at least to use the USE.  Now, according to the specification
of the USE, visarga, anusvara and cantillation marks are all classified
as vowel modifiers, and are so ordered relative to one another in the
Indian Indic order: left, top, bottom, right.  So, the problem should
already be solved for Grantha, and, if the plans come to fruition, will
work with a font whose Devanagari script tag is 'dev3'.  However, I may
have overlooked a set of overrides to the USE categorisations.

> What kind of impact would there be on existing data if Microsoft
> revised the ordering?

A good question that *I* can't answer.

> Or should Unicode encode a new character like ZERO-WIDTH INVISIBLE 
> DOTTED CIRCLE so that users can suppress unwanted and unexpected
> dotted circles by adding superfluous characters to the text stream?

It would be useful to be able to suppress inappropriate dotted circles
without disrespecting the character identity of U+25CC.  (Doable
in HarfBuzz, but not in OpenType.)  There's actually been a suggestion
that dotted circles should be applied after global substitutions have
been applied, so as to prevent the overcoming of renderer faults.

On Sat, 21 Dec 2019 11:57:53 +0530
Shriramana Sharma via Unicode  wrote:

> This is all the more so since in some Vedic contexts (Sama Gana) the
> visarga is far separated from the syllable by other syllables like
> digits (themselves carrying combining marks) or spacing anusvara, as
> seen in examples from my Grantha proposal L2/09-372 p 40.

I presume you are referring to the middle picture.  I'm having difficulty
reading it.  Could you please tell us its transcription and encoding?

A minimal change would be to extend the range of base characters to
include digits - I'm surprised matras don't frequently get added to
them.

Richard.



Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2020-01-01 Thread James Kass via Unicode



On 2020-01-02 1:04 AM, Richard Wordingham wrote in a thread deriving 
from this one,


> Have you found a definition of the ISCII handling of Vedic characters?

No.  It would be helpful.  ISCII apparently wasn't really used much.  It 
would also be helpful to know the encoding order in any legacy ISCII 
data using the Vedic characters with respect to VISARGA/ANUSVARA.  
Although such legacy data seems unlikely, I'd expect VISARGA/ANUSVARA to 
be entered/stored post-syllable.


> I've been looking at Microsoft's specification of Devanagari character
> order.  In
> https://docs.microsoft.com/en-us/typography/script-development/devanagari,
> the consonant syllable ends
>
> [N]+[A] + [< H+[] | {M}+[N]+[H]>]+[SM]+[(VD)]
>
> where
> N is nukta
> A is anudatta (U+0952)
> H is halant/virama
> M is matra
> SM is syllable modifier signs
> VD is vedic
>
> "Syllable modifier signs" and "vedic" are not defined.  It appears that
> SM includes U+0903 DEVANAGARI SIGN VISARGA.

What action should Microsoft take to satisfy the needs of the user 
community?

1.  No action, maintain status quo.
2.  Swap SM and VD in the specs ordering.
3.  Make new category PS (post-syllable) and move VISARGA/ANUSVARA there.
4.  ?

What kind of impact would there be on existing data if Microsoft revised 
the ordering?


Or should Unicode encode a new character like ZERO-WIDTH INVISIBLE 
DOTTED CIRCLE so that users can suppress unwanted and unexpected dotted 
circles by adding superfluous characters to the text stream?


> I note that even ग॒ः  is
> given a dotted circle by HarfBuzz.

Same on Win 7.  And  (गः॒) 
breaks the mark positioning as expected.




Re: One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)

2020-01-01 Thread Richard Wordingham via Unicode
On Wed, 1 Jan 2020 20:11:04 +
James Kass via Unicode  wrote:

> On 2020-01-01 11:17 AM, Richard Wordingham via Unicode wrote:
> 
>  > That's exactly the sort of mess that jack-booted renderers are
>  > trying to minimise.  Their principle is that there should be only
>  > one encoding per shape, though to be fair:
>  >
>  > 1) some renderers accept canonical equivalents.
>  > 2) tolerance may be allowed for ligating (ZWJ, ZWNJ, CGJ),
>  > collating (CGJ, SHY) and line-breaking controls (SHY, ZWSP, WJ).
>  > 3) Superseded chillu encodings are still supported.  
> 
> There was never any need for atomic chillu form characters.  

> The 
> principle of only one encoding per shape is best achieved when every 
> shape gets an atomic encoding.

I should have written per-word shape.  I should also have added that
most renderers attempt to handle Mongolian, despite its encoding
Middle Mongolian phonetics rather than characters. Also, they don't
attempt to sort the Arabic script per-language subsets out, which
leads to a bad mess at Wiktionary when Unicode characters differ only in
a few forms.

> Glyph-based encoding is incompatible 
> with Unicode character encoding principles.

Visual encoding sometimes works - phonetic order for Thai is so
complicated that it is unsurprising that its definition is partly
missing from Unicode 1.0.  The official history hides behind
incompatibility with the Thai national standard, but phonetic order was
simply too complicated for Thai.  Additionally, Thais don't agree on
where preposed vowels go relative to Pali consonant clusters - they
don't agree that all of them should appear in the middle of the
cluster.  (I suppose the positioning rule could have been made a
stylistic feature of fonts.)

An analogue is Lao collation.  While syllable boundaries can
overwhelmingly be discerned in modern Lao, Lao collations are too
complicated to be accepted for ICU if they are to support anything but
single syllables.  CLDR collation (interpreted as a specification with
the normal use of specification language for the form of definitions)
can just cope, whereas the UCA can't, but the tables are huge. 

Richard.



Re: One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)

2020-01-01 Thread Richard Wordingham via Unicode
On Wed, 1 Jan 2020 23:09:49 +
James Kass via Unicode  wrote:

> On 2020-01-01 8:11 PM, James Kass wrote:
> > It’s too bad that ISCII didn’t accommodate the needs of Vedic
> > Sanskrit, but here we are.  
> 
> Sorry, that might be wrong to say.  It's possible that it's Unicode's 
> adaptation of ISCII that hinders Vedic Sanskrit.

Have you found a definition of the ISCII handling of Vedic characters?

The problem lies in Unicode's failure to standardise the encoding of
Devanagari text.  But for the consistent failure to include a
standardisation of text in a script in TUS, one might wonder if the
original idea was to duck the issue by resorting to canonical
equivalence.

I've been looking at Microsoft's specification of Devanagari character
order.  In
https://docs.microsoft.com/en-us/typography/script-development/devanagari,
the consonant syllable ends

[N]+[A] + [< H+[] | {M}+[N]+[H]>]+[SM]+[(VD)]

where
N is nukta
A is anudatta (U+0952)
H is halant/virama
M is matra
SM is syllable modifier signs
VD is vedic

"Syllable modifier signs" and "vedic" are not defined.  It appears that
SM includes U+0903 DEVANAGARI SIGN VISARGA.
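
A rough way to see what this ordering implies is to turn the formula into a regular expression. The sketch below is an assumption-laden illustration: the memberships of SM and VD are guesses (the specification does not define them), the empty "[]" is assumed to be an optional ZWNJ/ZWJ, and SYLLABLE_TAIL and tail_ok() are hypothetical names.

```python
import re

# Character classes.  SM and VD are NOT defined in the specification;
# the memberships below are guesses for illustration only.
N = "\u093C"                            # nukta
A = "\u0952"                            # anudatta
H = "\u094D"                            # halant/virama
M = "\u093E-\u094C"                     # matras (simplified range)
SM = "\u0901-\u0903"                    # assumed: candrabindu, anusvara, visarga
VD = "\u0951\u0953\u0954\uA8E0-\uA8F1"  # assumed: some Vedic tone marks
JOINER = "\u200C\u200D"                 # assumed content of "[]": ZWNJ, ZWJ

# [N]+[A] + [< H+[joiner] | {M}+[N]+[H] >] + [SM] + [(VD)]
SYLLABLE_TAIL = re.compile(
    f"[{N}]?[{A}]?(?:[{H}][{JOINER}]?|[{M}]*[{N}]?[{H}]?)[{SM}]?[{VD}]?"
)

def tail_ok(marks: str) -> bool:
    """True if the marks after the base consonant fit the pattern."""
    return SYLLABLE_TAIL.fullmatch(marks) is not None

print(tail_ok("\u0903\u0951"))  # visarga, then svarita  -> True
print(tail_ok("\u0951\u0903"))  # svarita, then visarga  -> False
print(tail_ok("\u0952\u0903"))  # anudatta, then visarga -> True
```

With these assumed classes, a tone mark in VD can only follow the visarga, whereas anudatta followed by visarga fits the pattern as written.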

I note that even ग॒ः  is
given a dotted circle by HarfBuzz.  Now, this might not be an entirely
fair test; I suspect anudatta is assigned this position because
originally the Sindhi implosives were encoded as consonant plus nukta
and anudatta, though rendering still fails with HarfBuzz when nukta is
inserted (ग़॒ः).

Richard.




Re: One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)

2020-01-01 Thread James Kass via Unicode



On 2020-01-01 8:11 PM, James Kass wrote:
> It’s too bad that ISCII didn’t accommodate the needs of Vedic Sanskrit,
> but here we are.


Sorry, that might be wrong to say.  It's possible that it's Unicode's 
adaptation of ISCII that hinders Vedic Sanskrit.


One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)

2020-01-01 Thread James Kass via Unicode



On 2020-01-01 11:17 AM, Richard Wordingham via Unicode wrote:

> That's exactly the sort of mess that jack-booted renderers are trying
> to minimise.  Their principle is that there should be only one encoding
> per shape, though to be fair:
>
> 1) some renderers accept canonical equivalents.
> 2) tolerance may be allowed for ligating (ZWJ, ZWNJ, CGJ), collating
> (CGJ, SHY) and line-breaking controls (SHY, ZWSP, WJ).
> 3) Superseded chillu encodings are still supported.

There was never any need for atomic chillu form characters.  The 
principle of only one encoding per shape is best achieved when every 
shape gets an atomic encoding.  Glyph-based encoding is incompatible 
with Unicode character encoding principles.


It’s too bad that ISCII didn’t accommodate the needs of Vedic Sanskrit, 
but here we are.




Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2020-01-01 Thread Richard Wordingham via Unicode
On Wed, 1 Jan 2020 01:19:02 +
James Kass via Unicode  wrote:

> A workaround until some kind of satisfactory adjustment is made might
> be to simply use COLON for VISARGA.  Or...
> 
>   VISARGA  ⇒ U+02F8 MODIFIER LETTER RAISED COLON
>   ANUSVARA ⇒ U+02D9 DOT ABOVE
> 
> ...as long as the font(s) included both those characters.
> 
> य॑ यॆ॑
> 
> य॑ं -- anusvara last
> यॆ॑ं -- "
> 
> य॑: -- colon last
> यॆ॑: -- "
> 
> य॑˸ -- raised colon modifier last
> यॆ॑˸ -- "
> 
> य॑˙ -- spacing dot above last
> यॆ॑˙ -- "
> 

That's exactly the sort of mess that jack-booted renderers are trying
to minimise.  Their principle is that there should be only one encoding
per shape, though to be fair:

1) some renderers accept canonical equivalents.
2) tolerance may be allowed for ligating (ZWJ, ZWNJ, CGJ), collating
(CGJ, SHY) and line-breaking controls (SHY, ZWSP, WJ).
3) Superseded chillu encodings are still supported.
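
Point 1 can be checked mechanically. The following is a minimal sketch using Python's standard unicodedata module; canonically_equivalent() is a hypothetical helper name. It compares NFD forms: nukta (ccc=7) and anudatta (ccc=220) are reordered by normalization, whereas visarga (ccc=0) is a starter and blocks reordering.

```python
import unicodedata

GA, NUKTA, ANUDATTA, VISARGA = "\u0917", "\u093C", "\u0952", "\u0903"

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms are equal."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Nukta and anudatta in either order normalize to the same sequence.
print(canonically_equivalent(GA + NUKTA + ANUDATTA, GA + ANUDATTA + NUKTA))      # True

# Visarga has combining class 0, so swapping it with anudatta changes the text.
print(canonically_equivalent(GA + ANUDATTA + VISARGA, GA + VISARGA + ANUDATTA))  # False
```

So a renderer that accepts canonical equivalents would still see the two visarga orders as distinct spellings.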

Richard.



Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2019-12-31 Thread James Kass via Unicode



A workaround until some kind of satisfactory adjustment is made might be 
to simply use COLON for VISARGA.  Or...


  VISARGA  ⇒ U+02F8 MODIFIER LETTER RAISED COLON
  ANUSVARA ⇒ U+02D9 DOT ABOVE

...as long as the font(s) included both those characters.

य॑ यॆ॑

य॑ं -- anusvara last
यॆ॑ं -- "

य॑: -- colon last
यॆ॑: -- "

य॑˸ -- raised colon modifier last
यॆ॑˸ -- "

य॑˙ -- spacing dot above last
यॆ॑˙ -- "
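
For anyone who wants to experiment with this substitution, it could be sketched in Python as below. WORKAROUND and apply_workaround() are hypothetical names; only the mapping itself comes from this message. Note that the result no longer contains the Devanagari characters, so searching and collation would see different text.

```python
# Hypothetical helper; the mapping is the one suggested in this message.
WORKAROUND = str.maketrans({
    "\u0903": "\u02F8",  # DEVANAGARI SIGN VISARGA  -> MODIFIER LETTER RAISED COLON
    "\u0902": "\u02D9",  # DEVANAGARI SIGN ANUSVARA -> DOT ABOVE
})

def apply_workaround(text: str) -> str:
    """Replace visarga/anusvara with the spacing look-alikes."""
    return text.translate(WORKAROUND)

# YA + svarita (U+0951) + visarga becomes YA + svarita + raised colon.
sample = "\u092F\u0951\u0903"
print([f"U+{ord(c):04X}" for c in apply_workaround(sample)])
```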



Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2019-12-31 Thread James Kass via Unicode



On 2019-12-21 6:27 AM, Shriramana Sharma via Unicode wrote:

> However, even the simplest Vedic sequence (not involving Sama Vedic or
> multiple tone marker combinations) like दे॒वेभ्य॑ः throws up a dotted
> circle, and one is expected (see developer feedback in that bug
> report) to input the visarga before tone markers, hoping the software
> is intelligent enough to skip over the visarga (or anusvara) and place the
> tone marker over the preceding syllable correctly. Why it is necessary
> to put the visarga first in input only to have to skip over it in
> shaping is beyond me.

य॔ यॆ॔
य॔ः -- visarga last
यॆ॔ः -- "
यः॔ -- visarga before accent (U+0954)
यॆः॔ -- "

य॑ यॆ॑
य॑ः -- visarga last
यॆ॑ः -- "
यः॑ -- visarga before svarita (U+0951)
यॆः॑ -- "

U+0951 and U+0954 have canonical combining class of 230.  Putting 
VISARGA (CCC=0) after those CCC=230 marks generates the dotted circle 
for VISARGA.  Putting VISARGA before those CCC=230 marks generates the 
dotted circle for U+0954 but drops the dotted circle for U+0951.  In 
both cases where VISARGA comes before, the mark positioning is broken.  
(Mangal font, Win 7)
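
The combining classes mentioned here can be inspected with Python's standard unicodedata module. This is a small sketch; describe() is a hypothetical helper name.

```python
import unicodedata

def describe(text: str) -> list[tuple[str, int, str]]:
    """Return (code point, canonical combining class, name) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.combining(ch), unicodedata.name(ch, "?"))
        for ch in text
    ]

# YA + U+0951 + VISARGA, i.e. the "visarga last" order.
for row in describe("\u092F\u0951\u0903"):
    print(row)

print(unicodedata.combining("\u0951"))  # 230
print(unicodedata.combining("\u0954"))  # 230
print(unicodedata.combining("\u0903"))  # 0
```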


As far as I can tell, the simplest solution would be for the Indic 
shaping engines to suppress the dotted circle for VISARGA (or ANUSVARA) 
where appropriate.  Entering/storing VISARGA or ANUSVARA at the end of 
the syllable makes sense since that's where it goes, visually and logically.




Long standing problem with Vedic tone markers and post-base visarga/anusvara

2019-12-20 Thread Shriramana Sharma via Unicode
https://github.com/harfbuzz/harfbuzz/issues/2017 should provide the
context for this.

Ever since the early days of Devanagari Unicode, scholars like me
dealing with Vedic Sanskrit orthography have been experiencing this
problem, but chalked it up to early days and consequent insufficient
support for Vedic sequences. Even now, Vedic support even on the font
side is quite limited, and we also find limitations on the software
side. So I hope it's time to fix them one by one.

The issue I would like to discuss now is as follows:

# SEMANTIC DISSOCIATION OF THE VISARGA FROM THE SYLLABLE

In Vedic, syllables that carry tone markers – which are mostly
above-base or below-base – often have to take a visarga, which is
always post-base. In this case, the sequence intuitive to native
scholars like me is:

 syllable + tone marker + visarga

This is because the tone marker indicates the tone of the syllable (or
its vowel) and the visarga is a separate aspirated sound *after* the
syllable to which the tone marker doesn't apply.

In fact, the only reason the visarga sign is analysed as a combining
mark rather than a separate letter is that it is not used in isolation
without a preceding syllable. Otherwise, i.e. linguistically, it doesn't
modify the preceding syllable in any way.

Anyhow, the point is that the tone marker should come before the
visarga because it semantically applies to the preceding syllable and
not the visarga.

This is all the more so since in some Vedic contexts (Sama Gana) the
visarga is far separated from the syllable by other syllables like
digits (themselves carrying combining marks) or spacing anusvara, as
seen in examples from my Grantha proposal L2/09-372 p 40.

So the visarga is semantically quite dissociated from the preceding
syllable unlike the tone marker which is intimately associated with
it.

# SAME APPLICABLE TO THE ANUSVARA

The same argument is also applicable to the anusvara as it also
represents a nasal sound separate from the preceding syllable. (The
candrabindu OTOH nasalises the preceding syllable itself.)

The above Grantha proposal page also shows an example where an
anusvara is orthographically separated from the preceding syllable by
three characters: a tone marker + avagraha + digit. L2/15-178 shows
that in equivalent contexts of Devanagari the digit 0 is used as a
substitute since the Devanagari anusvara is non-spacing.

All this goes to the dissociation from the syllable of the anusvara –
just like the visarga – compared to tone markers. So to be consistent,
even in case of Devanagari (or such script) where the anusvara is
non-spacing, the sequence when a tone marker is also involved puts the
tone marker first, as mentioned before:

 syllable + tone marker + anusvara

# CURRENT SITUATION INCOMPATIBLE WITH ABOVE

However, even the simplest Vedic sequence (not involving Sama Vedic or
multiple tone marker combinations) like दे॒वेभ्य॑ः throws up a dotted
circle, and one is expected (see developer feedback in that bug
report) to input the visarga before tone markers, hoping the software
is intelligent enough to skip over the visarga (or anusvara) and place the
tone marker over the preceding syllable correctly. Why it is necessary
to put the visarga first in input only to have to skip over it in
shaping is beyond me.
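
To make the expected input order concrete, here is a minimal sketch of converting text stored in the order argued for above into the visarga-first order the shapers currently expect. The set of tone marks is an illustrative assumption, and TONE, FINAL and to_renderer_order() are hypothetical names.

```python
import re

# Assumed, incomplete set of tone marks (anudatta U+0952 deliberately omitted).
TONE = "\u0951\u0953\u0954\uA8E0-\uA8F1"
FINAL = "\u0902\u0903"  # anusvara, visarga

_LOGICAL = re.compile(f"([{TONE}]+)([{FINAL}])")

def to_renderer_order(text: str) -> str:
    """Move a visarga/anusvara in front of the tone marks that precede it."""
    return _LOGICAL.sub(r"\2\1", text)

# YA + svarita + visarga  ->  YA + visarga + svarita
converted = to_renderer_order("\u092F\u0951\u0903")
print([f"U+{ord(c):04X}" for c in converted])  # ['U+092F', 'U+0903', 'U+0951']
```

Such a conversion is lossy in the sense that the two orders are not canonically equivalent, so it would have to be applied consistently to stored data.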

So it makes sense neither from a linguistic nor from a technological
perspective to push the tone markers to the end of the syllable. Even the
developers acknowledge that non-spacing marks are normally (i.e. outside
Indic) input before spacing ones.

However, they say “we can't support that in this particular case
because this is how Microsoft does it and we have to follow suit to
ensure people get the same shaping for the same input”,
notwithstanding the fact that the expectation to put the
visarga/anusvara first is nonsensical as explained above.

So everyone is looking to Microsoft Uniscribe (or whatever its
successor is) to fix things first before they can follow. I figured
that if this is discussed and decided here, everyone can fix it at the
same time.

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा 𑀰𑁆𑀭𑀻𑀭𑀫𑀡𑀰𑀭𑁆𑀫𑀸