Re: Amiguity(?) in Sinhala named sequences

2016-10-17 Thread Asmus Freytag

  
  
On 10/17/2016 7:58 AM, Martin Jansche wrote:


> Thanks for the pointer to the 2011 version of SLS 1134. After reading that
> and discussing further with Cibu, here's a tentative proposal:
>
> * The most logical[*] interpretation of the sequence 0DBB 0DCA 200D 0DBA is
>   as Repaya+Ya. A standard (Unicode and/or SLS) should call this out
>   explicitly. ([*]Logical: In other scripts, including Devanagari, Myanmar,
>   etc. similar types of modifiers that logically precede a letter are
>   represented in this way, sometimes without ZWJ or with a different
>   character in lieu of ZWJ. Also this interpretation plays well alongside a
>   hypothetical alternative encoding of Yansaya using a single codepoint.)
>
> * A standard (Unicode and/or SLS) should specify how Ra+Yansaya should be
>   encoded. SLS 1134 points out that Ra+Yansaya is an incorrect spelling, yet
>   in order to make this point it has to show the glyph sequence for
>   Ra+Yansaya. So there is clearly some need to be able to render this, even
>   if it's only at this meta-linguistic level. Plus SLS 1134 is very explicit
>   that e.g. keyboarding should allow for letter combinations to be entered
>   even if they are not practically useful. One possible way of encoding
>   Ra+Yansaya is 0DBB 200C 0DCA 200D 0DBA, i.e. Ra ZWNJ Yansaya. This renders
>   as intended in HarfBuzz with NotoSansSinhala, but not with
>   LBhashitaComplex. If we had a clear directive regarding how Ra+Yansaya
>   should be represented, we could work on getting fonts updated.
  


There are some didactic needs that aren't directly catered to by the
standard. That is as it should be, especially if you intend to show
things that "shouldn't exist".

  


> * Everything about 0DBB 0DCA 200D 0DBA also applies to 0DBB 0DCA 200D 0DBB.
>   This is much less relevant in practice, but the same arguments about
>   ambiguity apply and should be resolved in the same way.
>
> Regards,
> -- martin
  
  
> On Mon, Oct 17, 2016 at 12:15 AM, Harshula wrote:
>
> > Hi Martin,
> >
> > On 15/10/16 04:07, Martin Jansche wrote:
> > > For Sinhala, the following named sequences are defined (for good
> > > reasons):
> > >
> > > SINHALA CONSONANT SIGN YANSAYA;0DCA 200D 0DBA
> > > SINHALA CONSONANT SIGN RAKAARAANSAYA;0DCA 200D 0DBB
> > > SINHALA CONSONANT SIGN REPAYA;0DBB 0DCA 200D
> > >
> > > I'll abbreviate these as Yansaya, Rakaransaya, and Repaya, and I'll
> > > write Ya for 0DBA and Ra for 0DBB.
> > >
> > > Note that these give rise to two potentially ambiguous codepoint
> > > strings, namely
> > >
> > >   0DBB 0DCA 200D 0DBA
> > >   0DBB 0DCA 200D 0DBB
> > >
> > > I'll concentrate on the first, as all arguments apply to the second one
> > > analogously.
> > >
> > > At a first glance, the sequence 0DBB 0DCA 200D 0DBA has two possible
> > > parses:
> > >
> > >   0DBB + 0DCA 200D 0DBA, i.e. Ra + Yansaya
> > >   0DBB 0DCA 200D + 0DBA, i.e. Repaya + Ya
> > >
> > > First question: Does the standard give any guidance as to which one is
> > > the intended parse? The section on Sinhala in the Unicode Standard is
> > > silent about this. Is there a general principle I'm missing?
> > >
> > > Sri Lanka Standard SLS 1134 (2004 draft) states that Ra+Yansaya is not
> > > used and is considered incorrect, suggesting that the second parse
> > > (Repaya+Ya) should be the default interpretation of this sequence.
> > > However, SLS 1134 does not address the potential ambiguity of this
> > > sequence explicitly and the description there could be read as
> > > informative, not normative.
> >
> > 1) re: 0DBB 0DCA 200D 0DBA
> >
> > SLS 1134 was updated in 2011 (The latest public version I could find is
> > v3.41. This extract is the same in v3.6.):
> > https://sourceforge.net/p/sinhala/mailman/attachment/4d957c56.5050...@cse.mrt.ac.lk/1/
> >
> > "1.   The yansaya is not used following the le

Re: Amiguity(?) in Sinhala named sequences

2016-10-17 Thread Martin Jansche
Thanks for the pointer to the 2011 version of SLS 1134. After reading that
and discussing further with Cibu, here's a tentative proposal:

* The most logical[*] interpretation of the sequence 0DBB 0DCA 200D 0DBA is
as Repaya+Ya. A standard (Unicode and/or SLS) should call this out
explicitly. ([*]Logical: In other scripts, including Devanagari, Myanmar,
etc. similar types of modifiers that logically precede a letter are
represented in this way, sometimes without ZWJ or with a different
character in lieu of ZWJ. Also this interpretation plays well alongside a
hypothetical alternative encoding of Yansaya using a single codepoint.)

* A standard (Unicode and/or SLS) should specify how Ra+Yansaya should be
encoded. SLS 1134 points out that Ra+Yansaya is an incorrect spelling, yet
in order to make this point it has to show the glyph sequence for
Ra+Yansaya. So there is clearly some need to be able to render this, even
if it's only at this meta-linguistic level. Plus SLS 1134 is very explicit
that e.g. keyboarding should allow for letter combinations to be entered
even if they are not practically useful. One possible way of encoding
Ra+Yansaya is 0DBB 200C 0DCA 200D 0DBA, i.e. Ra ZWNJ Yansaya. This renders
as intended in HarfBuzz with NotoSansSinhala, but not with
LBhashitaComplex. If we had a clear directive regarding how Ra+Yansaya
should be represented, we could work on getting fonts updated.

* Everything about 0DBB 0DCA 200D 0DBA also applies to 0DBB 0DCA 200D 0DBB.
This is much less relevant in practice, but the same arguments about
ambiguity apply and should be resolved in the same way.
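
To make the first two bullets concrete, here is a minimal Python sketch of a left-to-right tokenizer that behaves the way the tentative proposal would have it: Repaya is tried first, so 0DBB 0DCA 200D 0DBA comes out as Repaya+Ya, and a ZWNJ after Ra blocks Repaya so that Yansaya can follow. The function name `tokenize` and the output labels are made up for illustration; none of this is specified by Unicode or SLS 1134, it only mirrors the proposal above.

```python
# Sketch only: encodes the *proposed* default, not any published rule.
ZWJ, ZWNJ = 0x200D, 0x200C
AL_LAKUNA = 0x0DCA          # SINHALA SIGN AL-LAKUNA
YA, RA = 0x0DBA, 0x0DBB     # Ya and Ra as abbreviated in this thread


def tokenize(cps):
    """Split a list of code points into named sequences and single code points.

    Repaya is checked before Yansaya/Rakaransaya, which is what makes
    Repaya+Ya the default reading of RA AL-LAKUNA ZWJ YA.
    """
    cps = list(cps)
    out = []
    i = 0
    while i < len(cps):
        window = tuple(cps[i:i + 3])
        if window == (RA, AL_LAKUNA, ZWJ):
            out.append("Repaya")
            i += 3
        elif window == (AL_LAKUNA, ZWJ, YA):
            out.append("Yansaya")
            i += 3
        elif window == (AL_LAKUNA, ZWJ, RA):
            out.append("Rakaransaya")
            i += 3
        elif cps[i] == ZWNJ:
            out.append("ZWNJ")
            i += 1
        else:
            out.append(f"U+{cps[i]:04X}")
            i += 1
    return out


print(tokenize([RA, AL_LAKUNA, ZWJ, YA]))        # default reading
print(tokenize([RA, ZWNJ, AL_LAKUNA, ZWJ, YA]))  # explicit Ra + Yansaya
```

Whether a font or shaper actually honours the ZWNJ form is a separate matter, as the NotoSansSinhala vs. LBhashitaComplex difference shows.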

Regards,
-- martin

On Mon, Oct 17, 2016 at 12:15 AM, Harshula wrote:

> Hi Martin,
>
> On 15/10/16 04:07, Martin Jansche wrote:
> > For Sinhala, the following named sequences are defined (for good
> reasons):
> >
> > SINHALA CONSONANT SIGN YANSAYA;0DCA 200D 0DBA
> > SINHALA CONSONANT SIGN RAKAARAANSAYA;0DCA 200D 0DBB
> > SINHALA CONSONANT SIGN REPAYA;0DBB 0DCA 200D
> >
> > I'll abbreviate these as Yansaya, Rakaransaya, and Repaya, and I'll
> > write Ya for 0DBA and Ra for 0DBB.
> >
> > Note that these give rise to two potentially ambiguous codepoint
> > strings, namely
> >
> >   0DBB 0DCA 200D 0DBA
> >   0DBB 0DCA 200D 0DBB
> >
> > I'll concentrate on the first, as all arguments apply to the second one
> > analogously.
> >
> > At a first glance, the sequence 0DBB 0DCA 200D 0DBA has two possible
> parses:
> >
> >   0DBB + 0DCA 200D 0DBA, i.e. Ra + Yansaya
> >   0DBB 0DCA 200D + 0DBA, i.e. Repaya + Ya
> >
> > First question: Does the standard give any guidance as to which one is
> > the intended parse? The section on Sinhala in the Unicode Standard is
> > silent about this. Is there a general principle I'm missing?
> >
> > Sri Lanka Standard SLS 1134 (2004 draft) states that Ra+Yansaya is not
> > used and is considered incorrect, suggesting that the second parse
> > (Repaya+Ya) should be the default interpretation of this sequence.
> > However, SLS 1134 does not address the potential ambiguity of this
> > sequence explicitly and the description there could be read as
> > informative, not normative.
>
> 1) re: 0DBB 0DCA 200D 0DBA
>
> SLS 1134 was updated in 2011 (The latest public version I could find is
> v3.41. This extract is the same in v3.6.):
> https://sourceforge.net/p/sinhala/mailman/attachment/
> 4d957c56.5050...@cse.mrt.ac.lk/1/
>
> "1.   The yansaya is not used following the letter ර. e.g.: the spelling
> කාර‍්‍ය is incorrect."
>
> If the above is insufficient, it's best to discuss the issue with Harsha
> (CC'd) and Ruvan (CC'd).
>
> 2) re: 0DBB 0DCA 200D 0DBB
>
> Harsha & Ruvan can clarify this too.
>
> cya,
> #
>
>
> > Second question: Given that one parse of this sequence should be the
> > default, how does one represent the non-default parse?
> >
> > In most cases one can guess what the intended meaning is, but I suspect
> > this is somewhat of a gray area. In practice, trying to render these
> > problematic sequences and their neighbors in HarfBuzz with a variety of
> > fonts results in a variety of outcomes (including occasionally
> > unexpected glyph choices). If the meaning of these sequences is not well
> > defined, that would partly explain the variation across fonts.
> >
> > Am I missing something fundamental? If not, it seems this issue should
> > be called out explicitly in some part of the standard.
> >
> > Regards,
> > -- martin
>


Re: Amiguity(?) in Sinhala named sequences

2016-10-16 Thread സിബു ‌
Hi Martin,

Isn't this question analogous to asking whether the layout engine should
use the C1-conjoining form or the C2-conjoining form for a C1 + Virama + C2
sequence in any Indic script? That is, whether C1 + Virama should form a
glyph while C2 keeps its independent form, or vice versa. (Potentially
there can be more forms, namely the full ligature and the explicit Virama
form.) If the question you asked is equivalent, then the answer is
traditionally left to the font to decide.

BTW, even for a given C1 and C2 in a given script, a font can potentially
choose a different answer based on its purpose/character, like a font
for Malayalam traditional script vs. a font for reformed script.

regards,
Cibu

On Mon, Oct 17, 2016 at 12:15 AM, Harshula wrote:

> Hi Martin,
>
> On 15/10/16 04:07, Martin Jansche wrote:
> > For Sinhala, the following named sequences are defined (for good
> reasons):
> >
> > SINHALA CONSONANT SIGN YANSAYA;0DCA 200D 0DBA
> > SINHALA CONSONANT SIGN RAKAARAANSAYA;0DCA 200D 0DBB
> > SINHALA CONSONANT SIGN REPAYA;0DBB 0DCA 200D
> >
> > I'll abbreviate these as Yansaya, Rakaransaya, and Repaya, and I'll
> > write Ya for 0DBA and Ra for 0DBB.
> >
> > Note that these give rise to two potentially ambiguous codepoint
> > strings, namely
> >
> >   0DBB 0DCA 200D 0DBA
> >   0DBB 0DCA 200D 0DBB
> >
> > I'll concentrate on the first, as all arguments apply to the second one
> > analogously.
> >
> > At a first glance, the sequence 0DBB 0DCA 200D 0DBA has two possible
> parses:
> >
> >   0DBB + 0DCA 200D 0DBA, i.e. Ra + Yansaya
> >   0DBB 0DCA 200D + 0DBA, i.e. Repaya + Ya
> >
> > First question: Does the standard give any guidance as to which one is
> > the intended parse? The section on Sinhala in the Unicode Standard is
> > silent about this. Is there a general principle I'm missing?
> >
> > Sri Lanka Standard SLS 1134 (2004 draft) states that Ra+Yansaya is not
> > used and is considered incorrect, suggesting that the second parse
> > (Repaya+Ya) should be the default interpretation of this sequence.
> > However, SLS 1134 does not address the potential ambiguity of this
> > sequence explicitly and the description there could be read as
> > informative, not normative.
>
> 1) re: 0DBB 0DCA 200D 0DBA
>
> SLS 1134 was updated in 2011 (The latest public version I could find is
> v3.41. This extract is the same in v3.6.):
> https://sourceforge.net/p/sinhala/mailman/attachment/
> 4d957c56.5050...@cse.mrt.ac.lk/1/
>
> "1.   The yansaya is not used following the letter ර. e.g.: the spelling
> කාර‍්‍ය is incorrect."
>
> If the above is insufficient, it's best to discuss the issue with Harsha
> (CC'd) and Ruvan (CC'd).
>
> 2) re: 0DBB 0DCA 200D 0DBB
>
> Harsha & Ruvan can clarify this too.
>
> cya,
> #
>
>
> > Second question: Given that one parse of this sequence should be the
> > default, how does one represent the non-default parse?
> >
> > In most cases one can guess what the intended meaning is, but I suspect
> > this is somewhat of a gray area. In practice, trying to render these
> > problematic sequences and their neighbors in HarfBuzz with a variety of
> > fonts results in a variety of outcomes (including occasionally
> > unexpected glyph choices). If the meaning of these sequences is not well
> > defined, that would partly explain the variation across fonts.
> >
> > Am I missing something fundamental? If not, it seems this issue should
> > be called out explicitly in some part of the standard.
> >
> > Regards,
> > -- martin
>


Re: Amiguity(?) in Sinhala named sequences

2016-10-16 Thread Harshula
Hi Martin,

On 15/10/16 04:07, Martin Jansche wrote:
> For Sinhala, the following named sequences are defined (for good reasons):
> 
> SINHALA CONSONANT SIGN YANSAYA;0DCA 200D 0DBA
> SINHALA CONSONANT SIGN RAKAARAANSAYA;0DCA 200D 0DBB
> SINHALA CONSONANT SIGN REPAYA;0DBB 0DCA 200D
> 
> I'll abbreviate these as Yansaya, Rakaransaya, and Repaya, and I'll
> write Ya for 0DBA and Ra for 0DBB.
> 
> Note that these give rise to two potentially ambiguous codepoint
> strings, namely
> 
>   0DBB 0DCA 200D 0DBA
>   0DBB 0DCA 200D 0DBB
> 
> I'll concentrate on the first, as all arguments apply to the second one
> analogously.
> 
> At a first glance, the sequence 0DBB 0DCA 200D 0DBA has two possible parses:
> 
>   0DBB + 0DCA 200D 0DBA, i.e. Ra + Yansaya
>   0DBB 0DCA 200D + 0DBA, i.e. Repaya + Ya
> 
> First question: Does the standard give any guidance as to which one is
> the intended parse? The section on Sinhala in the Unicode Standard is
> silent about this. Is there a general principle I'm missing?
> 
> Sri Lanka Standard SLS 1134 (2004 draft) states that Ra+Yansaya is not
> used and is considered incorrect, suggesting that the second parse
> (Repaya+Ya) should be the default interpretation of this sequence.
> However, SLS 1134 does not address the potential ambiguity of this
> sequence explicitly and the description there could be read as
> informative, not normative.

1) re: 0DBB 0DCA 200D 0DBA

SLS 1134 was updated in 2011 (The latest public version I could find is
v3.41. This extract is the same in v3.6.):
https://sourceforge.net/p/sinhala/mailman/attachment/4d957c56.5050...@cse.mrt.ac.lk/1/

"1.   The yansaya is not used following the letter ර. e.g.: the spelling
කාර‍්‍ය is incorrect."

If the above is insufficient, it's best to discuss the issue with Harsha
(CC'd) and Ruvan (CC'd).

2) re: 0DBB 0DCA 200D 0DBB

Harsha & Ruvan can clarify this too.

cya,
#


> Second question: Given that one parse of this sequence should be the
> default, how does one represent the non-default parse?
> 
> In most cases one can guess what the intended meaning is, but I suspect
> this is somewhat of a gray area. In practice, trying to render these
> problematic sequences and their neighbors in HarfBuzz with a variety of
> fonts results in a variety of outcomes (including occasionally
> unexpected glyph choices). If the meaning of these sequences is not well
> defined, that would partly explain the variation across fonts.
> 
> Am I missing something fundamental? If not, it seems this issue should
> be called out explicitly in some part of the standard.
> 
> Regards,
> -- martin


Re: Amiguity(?) in Sinhala named sequences

2016-10-14 Thread Asmus Freytag

This is an interesting question.

It seems the task of parsing a text into sequences depends on the 
purpose. Not all sequences of interest are named and, in the general 
case, not all attempts at parsing may be unique. In this case, it looks 
like the named sequences would correspond to a specific (ligated) glyph 
that matches a user-perceived unit of the writing system.


Such a parsing task is akin to scanning, for example, strings using the 
Latin script for ligatures - while trying to emulate the rules that were 
in effect during days of hot metal typesetting for certain languages. 
For example, it wasn't enough to know that a certain cluster of letters 
might have a ligature glyph, one would also have to know whether the 
cluster straddled a (compound) word boundary or not. Just knowing the 
specification of ligated sequences alone would not be enough to identify 
a correct parse.


Such rules, however, are usually not part of the Unicode standard.

The situation here is similar; the standard simply specifies that a 
certain sequence of code points has a collective name. In case of 
ambiguities, you'll have to turn to external sources to resolve them.
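
The ambiguity can be reproduced mechanically. The following is a minimal Python sketch that enumerates every way of splitting a code point string into the three named sequences quoted below plus the bare letters Ya and Ra. The function name `parses` and the short labels are illustrative only; the sequence data is copied from the message below, and nothing here says which parse is correct.

```python
# Named sequences as quoted in this thread, with the thread's short labels.
NAMED = {
    "Yansaya": (0x0DCA, 0x200D, 0x0DBA),
    "Rakaransaya": (0x0DCA, 0x200D, 0x0DBB),
    "Repaya": (0x0DBB, 0x0DCA, 0x200D),
}
LETTERS = {0x0DBA: "Ya", 0x0DBB: "Ra"}


def parses(cps):
    """Return every segmentation of cps into single letters and named sequences."""
    cps = tuple(cps)
    if not cps:
        return [[]]
    results = []
    # Option 1: consume one bare letter.
    if cps[0] in LETTERS:
        for rest in parses(cps[1:]):
            results.append([LETTERS[cps[0]]] + rest)
    # Option 2: consume one whole named sequence.
    for name, seq in NAMED.items():
        if cps[:len(seq)] == seq:
            for rest in parses(cps[len(seq):]):
                results.append([name] + rest)
    return results


for cps in ([0x0DBB, 0x0DCA, 0x200D, 0x0DBA],
            [0x0DBB, 0x0DCA, 0x200D, 0x0DBB]):
    print(" ".join(f"{cp:04X}" for cp in cps), "->", parses(cps))
```

Both strings come back with exactly two segmentations, which is the situation the named-sequence list alone cannot resolve.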


Now, if this is the only such ambiguity (or one of a very small number) 
and if identification of the correct sequence is essential for selecting 
the correct rendering, I don't see why the script description for 
Sinhala couldn't be augmented to discuss that issue.


In which case, the way to proceed is to assemble the full set of facts 
and submit them to the UTC using the reporting form on the website.


A./


On 10/14/2016 10:07 AM, Martin Jansche wrote:

For Sinhala, the following named sequences are defined (for good reasons):

SINHALA CONSONANT SIGN YANSAYA;0DCA 200D 0DBA
SINHALA CONSONANT SIGN RAKAARAANSAYA;0DCA 200D 0DBB
SINHALA CONSONANT SIGN REPAYA;0DBB 0DCA 200D

I'll abbreviate these as Yansaya, Rakaransaya, and Repaya, and I'll 
write Ya for 0DBA and Ra for 0DBB.


Note that these give rise to two potentially ambiguous codepoint 
strings, namely


  0DBB 0DCA 200D 0DBA
  0DBB 0DCA 200D 0DBB

I'll concentrate on the first, as all arguments apply to the second 
one analogously.


At a first glance, the sequence 0DBB 0DCA 200D 0DBA has two possible 
parses:


  0DBB + 0DCA 200D 0DBA, i.e. Ra + Yansaya
  0DBB 0DCA 200D + 0DBA, i.e. Repaya + Ya

First question: Does the standard give any guidance as to which one is 
the intended parse? The section on Sinhala in the Unicode Standard is 
silent about this. Is there a general principle I'm missing?


Sri Lanka Standard SLS 1134 (2004 draft) states that Ra+Yansaya is not 
used and is considered incorrect, suggesting that the second parse 
(Repaya+Ya) should be the default interpretation of this sequence. 
However, SLS 1134 does not address the potential ambiguity of this 
sequence explicitly and the description there could be read as 
informative, not normative.


Second question: Given that one parse of this sequence should be the 
default, how does one represent the non-default parse?


In most cases one can guess what the intended meaning is, but I 
suspect this is somewhat of a gray area. In practice, trying to render 
these problematic sequences and their neighbors in HarfBuzz with a 
variety of fonts results in a variety of outcomes (including 
occasionally unexpected glyph choices). If the meaning of these 
sequences is not well defined, that would partly explain the variation 
across fonts.
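
For anyone who wants to repeat the rendering comparison, the hex notation used in this thread can be turned into test strings with a few lines of Python. This is only a sketch for building and inspecting the input strings; the helper names `from_hex` and `describe` and the case labels are hypothetical, and the shaping step itself (HarfBuzz plus a font) is not shown.

```python
import unicodedata


def from_hex(spec):
    """Convert a string like '0DBB 0DCA 200D 0DBA' into the corresponding text."""
    return "".join(chr(int(tok, 16)) for tok in spec.split())


def describe(text):
    """List each code point with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


# The two ambiguous strings from this message.
CASES = {
    "Ra+Yansaya or Repaya+Ya": "0DBB 0DCA 200D 0DBA",
    "Ra+Rakaransaya or Repaya+Ra": "0DBB 0DCA 200D 0DBB",
}

for label, spec in CASES.items():
    text = from_hex(spec)
    print(label)
    for line in describe(text):
        print("  ", line)
```

The resulting strings can then be passed to whatever shaping tool is being tested, once per font.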


Am I missing something fundamental? If not, it seems this issue should 
be called out explicitly in some part of the standard.


Regards,
-- martin