Re: Amiguity(?) in Sinhala named sequences
On 10/17/2016 7:58 AM, Martin Jansche wrote:

> Thanks for the pointer to the 2011 version of SLS 1134. After reading that and discussing further with Cibu, here's a tentative proposal:
>
> * The most logical[*] interpretation of the sequence 0DBB 0DCA 200D 0DBA is as Repaya+Ya. A standard (Unicode and/or SLS) should call this out explicitly.
>
>   ([*]Logical: In other scripts, including Devanagari, Myanmar, etc. similar types of modifiers that logically precede a letter are represented in this way, sometimes without ZWJ or with a different character in lieu of ZWJ. Also this interpretation plays well alongside a hypothetical alternative encoding of Yansaya using a single codepoint.)
>
> * A standard (Unicode and/or SLS) should specify how Ra+Yansaya should be encoded. SLS 1134 points out that Ra+Yansaya is an incorrect spelling, yet in order to make this point it has to show the glyph sequence for Ra+Yansaya. So there is clearly some need to be able to render this, even if it's only at this meta-linguistic level. Plus SLS 1134 is very explicit that e.g. keyboarding should allow for letter combinations to be entered even if they are not practically useful. One possible way of encoding Ra+Yansaya is 0DBB 200C 0DCA 200D 0DBA, i.e. Ra ZWNJ Yansaya. This renders as intended in HarfBuzz with NotoSansSinhala, but not with LBhashitaComplex. If we had a clear directive regarding how Ra+Yansaya should be represented, we could work on getting fonts updated.

There are some didactic needs that aren't directly catered to by the standard. That is as it should be, especially, if you are intending to show things that "shouldn't exist".

> * Everything about 0DBB 0DCA 200D 0DBA also applies to 0DBB 0DCA 200D 0DBB. This is much less relevant in practice, but the same arguments about ambiguity apply and should be resolved in the same way.
> Regards,
> -- martin
>
> On Mon, Oct 17, 2016 at 12:15 AM, Harshula wrote:
>
> > Hi Martin,
> >
> > On 15/10/16 04:07, Martin Jansche wrote:
> > > For Sinhala, the following named sequences are defined (for good reasons):
> > >
> > >   SINHALA CONSONANT SIGN YANSAYA;0DCA 200D 0DBA
> > >   SINHALA CONSONANT SIGN RAKAARAANSAYA;0DCA 200D 0DBB
> > >   SINHALA CONSONANT SIGN REPAYA;0DBB 0DCA 200D
> > >
> > > I'll abbreviate these as Yansaya, Rakaransaya, and Repaya, and I'll
> > > write Ya for 0DBA and Ra for 0DBB.
> > >
> > > Note that these give rise to two potentially ambiguous codepoint
> > > strings, namely
> > >
> > >   0DBB 0DCA 200D 0DBA
> > >   0DBB 0DCA 200D 0DBB
> > >
> > > I'll concentrate on the first, as all arguments apply to the second one
> > > analogously.
> > >
> > > At a first glance, the sequence 0DBB 0DCA 200D 0DBA has two possible parses:
> > >
> > >   0DBB + 0DCA 200D 0DBA, i.e. Ra + Yansaya
> > >   0DBB 0DCA 200D + 0DBA, i.e. Repaya + Ya
> > >
> > > First question: Does the standard give any guidance as to which one is
> > > the intended parse? The section on Sinhala in the Unicode Standard is
> > > silent about this. Is there a general principle I'm missing?
> > >
> > > Sri Lanka Standard SLS 1134 (2004 draft) states that Ra+Yansaya is not
> > > used and is considered incorrect, suggesting that the second parse
> > > (Repaya+Ya) should be the default interpretation of this sequence.
> > > However, SLS 1134 does not address the potential ambiguity of this
> > > sequence explicitly and the description there could be read as
> > > informative, not normative.
> >
> > 1) re: 0DBB 0DCA 200D 0DBA
> >
> > SLS 1134 was updated in 2011 (The latest public version I could find is
> > v3.41. This extract is the same in v3.6.):
> > https://sourceforge.net/p/sinhala/mailman/attachment/4d957c56.5050...@cse.mrt.ac.lk/1/
> >
> > "1. The yansaya is not used following the le
Re: Amiguity(?) in Sinhala named sequences
Thanks for the pointer to the 2011 version of SLS 1134. After reading that and discussing further with Cibu, here's a tentative proposal:

* The most logical[*] interpretation of the sequence 0DBB 0DCA 200D 0DBA is as Repaya+Ya. A standard (Unicode and/or SLS) should call this out explicitly.

  ([*]Logical: In other scripts, including Devanagari, Myanmar, etc. similar types of modifiers that logically precede a letter are represented in this way, sometimes without ZWJ or with a different character in lieu of ZWJ. Also this interpretation plays well alongside a hypothetical alternative encoding of Yansaya using a single codepoint.)

* A standard (Unicode and/or SLS) should specify how Ra+Yansaya should be encoded. SLS 1134 points out that Ra+Yansaya is an incorrect spelling, yet in order to make this point it has to show the glyph sequence for Ra+Yansaya. So there is clearly some need to be able to render this, even if it's only at this meta-linguistic level. Plus SLS 1134 is very explicit that e.g. keyboarding should allow for letter combinations to be entered even if they are not practically useful. One possible way of encoding Ra+Yansaya is 0DBB 200C 0DCA 200D 0DBA, i.e. Ra ZWNJ Yansaya. This renders as intended in HarfBuzz with NotoSansSinhala, but not with LBhashitaComplex. If we had a clear directive regarding how Ra+Yansaya should be represented, we could work on getting fonts updated.

* Everything about 0DBB 0DCA 200D 0DBA also applies to 0DBB 0DCA 200D 0DBB. This is much less relevant in practice, but the same arguments about ambiguity apply and should be resolved in the same way.
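To make the proposal above concrete, here is a minimal Python sketch of a tokenizer that applies it: Repaya is matched first, so 0DBB 0DCA 200D 0DBA reads as Repaya+Ya, and a ZWNJ (200C) after Ra is treated as the signal for Ra+Yansaya. The function name `tokenize`, the constant names, and the decision to drop ZWNJ from the output are assumptions made for illustration only; nothing here is specified by Unicode or SLS 1134.

```python
# Code points used in the thread.
ZWNJ, ZWJ = 0x200C, 0x200D
AL_LAKUNA = 0x0DCA          # SINHALA SIGN AL-LAKUNA
YA, RA = 0x0DBA, 0x0DBB     # SINHALA LETTER YAYANNA / RAYANNA

# Short labels for the two letters discussed; anything else is shown as U+XXXX.
LETTERS = {YA: "Ya", RA: "Ra"}


def tokenize(cps):
    """Greedy left-to-right tokenizer for the tentative proposal (hypothetical).

    Repaya is tried before Yansaya/Rakaransaya, so an unmarked
    Ra + al-lakuna + ZWJ always becomes Repaya.
    """
    out, i = [], 0
    while i < len(cps):
        window = tuple(cps[i:i + 3])
        if window == (RA, AL_LAKUNA, ZWJ):
            out.append("Repaya")
            i += 3
        elif window == (AL_LAKUNA, ZWJ, YA):
            out.append("Yansaya")
            i += 3
        elif window == (AL_LAKUNA, ZWJ, RA):
            out.append("Rakaransaya")
            i += 3
        elif cps[i] == ZWNJ:
            # Assumption: ZWNJ only blocks Repaya formation and yields no token.
            i += 1
        else:
            out.append(LETTERS.get(cps[i], f"U+{cps[i]:04X}"))
            i += 1
    return out


print(tokenize([0x0DBB, 0x0DCA, 0x200D, 0x0DBA]))          # default reading
print(tokenize([0x0DBB, 0x200C, 0x0DCA, 0x200D, 0x0DBA]))  # proposed Ra ZWNJ Yansaya
```

Whether a shaping engine or font actually behaves this way is a separate matter; as noted above, the ZWNJ form currently renders as intended with some fonts and not with others.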
Regards, -- martin On Mon, Oct 17, 2016 at 12:15 AM, Harshula wrote: > Hi Martin, > > On 15/10/16 04:07, Martin Jansche wrote: > > For Sinhala, the following named sequences are defined (for good > reasons): > > > > SINHALA CONSONANT SIGN YANSAYA;0DCA 200D 0DBA > > SINHALA CONSONANT SIGN RAKAARAANSAYA;0DCA 200D 0DBB > > SINHALA CONSONANT SIGN REPAYA;0DBB 0DCA 200D > > > > I'll abbreviate these as Yansaya, Rakaransaya, and Repaya, and I'll > > write Ya for 0DBA and Ra for 0DBB. > > > > Note that these give rise to two potentially ambiguous codepoint > > strings, namely > > > > 0DBB 0DCA 200D 0DBA > > 0DBB 0DCA 200D 0DBB > > > > I'll concentrate on the first, as all arguments apply to the second one > > analogously. > > > > At a first glance, the sequence 0DBB 0DCA 200D 0DBA has two possible > parses: > > > > 0DBB + 0DCA 200D 0DBA, i.e. Ra + Yansaya > > 0DBB 0DCA 200D + 0DBA, i.e. Repaya + Ya > > > > First question: Does the standard give any guidance as to which one is > > the intended parse? The section on Sinhala in the Unicode Standard is > > silent about this. Is there a general principle I'm missing? > > > > Sri Lanka Standard SLS 1134 (2004 draft) states that Ra+Yansaya is not > > used and is considered incorrect, suggesting that the second parse > > (Repaya+Ya) should be the default interpretation of this sequence. > > However, SLS 1134 does not address the potential ambiguity of this > > sequence explicitly and the description there could be read as > > informative, not normative. > > 1) re: 0DBB 0DCA 200D 0DBA > > SLS 1134 was updated in 2011 (The latest public version I could find is > v3.41. This extract is the same in v3.6.): > https://sourceforge.net/p/sinhala/mailman/attachment/ > 4d957c56.5050...@cse.mrt.ac.lk/1/ > > "1. The yansaya is not used following the letter ර. e.g.: the spelling > කාර්ය is incorrect." > > If the above is insufficient, it's best to discuss the issue with Harsha > (CC'd) and Ruvan (CC'd). 
> > 2) re: 0DBB 0DCA 200D 0DBB > > Harsha & Ruvan can clarify this too. > > cya, > # > > > > Second question: Given that one parse of this sequence should be the > > default, how does one represent the non-default parse? > > > > In most cases one can guess what the intended meaning is, but I suspect > > this is somewhat of a gray area. In practice, trying to render these > > problematic sequences and their neighbors in HarfBuzz with a variety of > > fonts results in a variety of outcomes (including occasionally > > unexpected glyph choices). If the meaning of these sequences is not well > > defined, that would partly explain the variation across fonts. > > > > Am I missing something fundamental? If not, it seems this issue should > > be called out explicit in some part of the standard. > > > > Regards, > > -- martin >
Re: Amiguity(?) in Sinhala named sequences
Hi Martin,

Isn't this question analogous to asking whether the layout engine should use the C1-conjoining form or the C2-conjoining form for a sequence in any Indic script? That is, whether C1 should form a conjoining glyph while C2 keeps its independent form, or vice versa. (Potentially there can be more forms, that is, a full ligature and an explicit virama form.)

If the question you asked is equivalent, then the answer is traditionally left to the font to decide.

BTW, even for a given C1 and C2 in a given script, a font can potentially choose a different answer based on its purpose or character, like a font for the traditional Malayalam script vs. a font for the reformed script.

regards,
Cibu

On Mon, Oct 17, 2016 at 12:15 AM, Harshula wrote: > Hi Martin, > > On 15/10/16 04:07, Martin Jansche wrote: > > For Sinhala, the following named sequences are defined (for good > reasons): > > > > SINHALA CONSONANT SIGN YANSAYA;0DCA 200D 0DBA > > SINHALA CONSONANT SIGN RAKAARAANSAYA;0DCA 200D 0DBB > > SINHALA CONSONANT SIGN REPAYA;0DBB 0DCA 200D > > > > I'll abbreviate these as Yansaya, Rakaransaya, and Repaya, and I'll > > write Ya for 0DBA and Ra for 0DBB. > > > > Note that these give rise to two potentially ambiguous codepoint > > strings, namely > > > > 0DBB 0DCA 200D 0DBA > > 0DBB 0DCA 200D 0DBB > > > > I'll concentrate on the first, as all arguments apply to the second one > > analogously. > > > > At a first glance, the sequence 0DBB 0DCA 200D 0DBA has two possible > parses: > > > > 0DBB + 0DCA 200D 0DBA, i.e. Ra + Yansaya > > 0DBB 0DCA 200D + 0DBA, i.e. Repaya + Ya > > > > First question: Does the standard give any guidance as to which one is > > the intended parse? The section on Sinhala in the Unicode Standard is > > silent about this. Is there a general principle I'm missing? > > > > Sri Lanka Standard SLS 1134 (2004 draft) states that Ra+Yansaya is not > > used and is considered incorrect, suggesting that the second parse > > (Repaya+Ya) should be the default interpretation of this sequence.
> > However, SLS 1134 does not address the potential ambiguity of this > > sequence explicitly and the description there could be read as > > informative, not normative. > > 1) re: 0DBB 0DCA 200D 0DBA > > SLS 1134 was updated in 2011 (The latest public version I could find is > v3.41. This extract is the same in v3.6.): > https://sourceforge.net/p/sinhala/mailman/attachment/ > 4d957c56.5050...@cse.mrt.ac.lk/1/ > > "1. The yansaya is not used following the letter ර. e.g.: the spelling > කාර්ය is incorrect." > > If the above is insufficient, it's best to discuss the issue with Harsha > (CC'd) and Ruvan (CC'd). > > 2) re: 0DBB 0DCA 200D 0DBB > > Harsha & Ruvan can clarify this too. > > cya, > # > > > > Second question: Given that one parse of this sequence should be the > > default, how does one represent the non-default parse? > > > > In most cases one can guess what the intended meaning is, but I suspect > > this is somewhat of a gray area. In practice, trying to render these > > problematic sequences and their neighbors in HarfBuzz with a variety of > > fonts results in a variety of outcomes (including occasionally > > unexpected glyph choices). If the meaning of these sequences is not well > > defined, that would partly explain the variation across fonts. > > > > Am I missing something fundamental? If not, it seems this issue should > > be called out explicit in some part of the standard. > > > > Regards, > > -- martin >
Re: Amiguity(?) in Sinhala named sequences
Hi Martin,

On 15/10/16 04:07, Martin Jansche wrote:
> For Sinhala, the following named sequences are defined (for good reasons):
>
>   SINHALA CONSONANT SIGN YANSAYA;0DCA 200D 0DBA
>   SINHALA CONSONANT SIGN RAKAARAANSAYA;0DCA 200D 0DBB
>   SINHALA CONSONANT SIGN REPAYA;0DBB 0DCA 200D
>
> I'll abbreviate these as Yansaya, Rakaransaya, and Repaya, and I'll
> write Ya for 0DBA and Ra for 0DBB.
>
> Note that these give rise to two potentially ambiguous codepoint
> strings, namely
>
>   0DBB 0DCA 200D 0DBA
>   0DBB 0DCA 200D 0DBB
>
> I'll concentrate on the first, as all arguments apply to the second one
> analogously.
>
> At a first glance, the sequence 0DBB 0DCA 200D 0DBA has two possible parses:
>
>   0DBB + 0DCA 200D 0DBA, i.e. Ra + Yansaya
>   0DBB 0DCA 200D + 0DBA, i.e. Repaya + Ya
>
> First question: Does the standard give any guidance as to which one is
> the intended parse? The section on Sinhala in the Unicode Standard is
> silent about this. Is there a general principle I'm missing?
>
> Sri Lanka Standard SLS 1134 (2004 draft) states that Ra+Yansaya is not
> used and is considered incorrect, suggesting that the second parse
> (Repaya+Ya) should be the default interpretation of this sequence.
> However, SLS 1134 does not address the potential ambiguity of this
> sequence explicitly and the description there could be read as
> informative, not normative.

1) re: 0DBB 0DCA 200D 0DBA

SLS 1134 was updated in 2011 (The latest public version I could find is v3.41. This extract is the same in v3.6.):
https://sourceforge.net/p/sinhala/mailman/attachment/4d957c56.5050...@cse.mrt.ac.lk/1/

"1. The yansaya is not used following the letter ර. e.g.: the spelling කාර්ය is incorrect."

If the above is insufficient, it's best to discuss the issue with Harsha (CC'd) and Ruvan (CC'd).

2) re: 0DBB 0DCA 200D 0DBB

Harsha & Ruvan can clarify this too.

cya,
#

> Second question: Given that one parse of this sequence should be the
> default, how does one represent the non-default parse?
>
> In most cases one can guess what the intended meaning is, but I suspect
> this is somewhat of a gray area. In practice, trying to render these
> problematic sequences and their neighbors in HarfBuzz with a variety of
> fonts results in a variety of outcomes (including occasionally
> unexpected glyph choices). If the meaning of these sequences is not well
> defined, that would partly explain the variation across fonts.
>
> Am I missing something fundamental? If not, it seems this issue should
> be called out explicit in some part of the standard.
>
> Regards,
> -- martin
Re: Amiguity(?) in Sinhala named sequences
This is an interesting question. It seems the task of parsing a text into sequences depends on the purpose. Not all sequences of interest are named and, in the general case, not all attempts at parsing may be unique.

In this case, it looks like the named sequences would correspond to a specific (ligated) glyph that matches a user-perceived unit of the writing system.

Such a parsing task is akin to scanning, for example, strings using the Latin script for ligatures while trying to emulate the rules that were in effect during the days of hot metal typesetting for certain languages. For example, it wasn't enough to know that a certain cluster of letters might have a ligature glyph; one would also have to know whether the cluster straddled a (compound) word boundary or not. Just knowing the specification of ligated sequences alone would not be enough to identify a correct parse.

Such rules, however, are usually not part of the Unicode Standard. The situation here is similar; the standard simply specifies that a certain sequence of code points has a collective name. In case of ambiguities, you'll have to turn to external sources to resolve them.

Now, if this is the only such ambiguity (or one of a very small number) and if identification of the correct sequence is essential for selecting the correct rendering, I don't see why the script description for Sinhala couldn't be augmented to discuss that issue.

In that case, the way to proceed is to assemble the full set of facts and submit them to the UTC using the reporting form on the website.

A./

On 10/14/2016 10:07 AM, Martin Jansche wrote:

> For Sinhala, the following named sequences are defined (for good reasons):
>
>   SINHALA CONSONANT SIGN YANSAYA;0DCA 200D 0DBA
>   SINHALA CONSONANT SIGN RAKAARAANSAYA;0DCA 200D 0DBB
>   SINHALA CONSONANT SIGN REPAYA;0DBB 0DCA 200D
>
> I'll abbreviate these as Yansaya, Rakaransaya, and Repaya, and I'll write Ya for 0DBA and Ra for 0DBB.
>
> Note that these give rise to two potentially ambiguous codepoint strings, namely
>
>   0DBB 0DCA 200D 0DBA
>   0DBB 0DCA 200D 0DBB
>
> I'll concentrate on the first, as all arguments apply to the second one analogously.
>
> At a first glance, the sequence 0DBB 0DCA 200D 0DBA has two possible parses:
>
>   0DBB + 0DCA 200D 0DBA, i.e. Ra + Yansaya
>   0DBB 0DCA 200D + 0DBA, i.e. Repaya + Ya
>
> First question: Does the standard give any guidance as to which one is the intended parse? The section on Sinhala in the Unicode Standard is silent about this. Is there a general principle I'm missing?
>
> Sri Lanka Standard SLS 1134 (2004 draft) states that Ra+Yansaya is not used and is considered incorrect, suggesting that the second parse (Repaya+Ya) should be the default interpretation of this sequence. However, SLS 1134 does not address the potential ambiguity of this sequence explicitly and the description there could be read as informative, not normative.
>
> Second question: Given that one parse of this sequence should be the default, how does one represent the non-default parse?
>
> In most cases one can guess what the intended meaning is, but I suspect this is somewhat of a gray area. In practice, trying to render these problematic sequences and their neighbors in HarfBuzz with a variety of fonts results in a variety of outcomes (including occasionally unexpected glyph choices). If the meaning of these sequences is not well defined, that would partly explain the variation across fonts.
>
> Am I missing something fundamental? If not, it seems this issue should be called out explicit in some part of the standard.
>
> Regards,
> -- martin
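The ambiguity described in the original message can be reproduced mechanically. The following Python sketch enumerates every way of splitting a code point string into the three named sequences and the two bare letters; the names `parses`, `from_hex`, `NAMED_SEQUENCES` and `LETTERS` are made up for this illustration, and the code says nothing about which parse a renderer should prefer.

```python
# The three named sequences quoted in the message, as tuples of code points.
NAMED_SEQUENCES = {
    "Yansaya": (0x0DCA, 0x200D, 0x0DBA),
    "Rakaransaya": (0x0DCA, 0x200D, 0x0DBB),
    "Repaya": (0x0DBB, 0x0DCA, 0x200D),
}
# The two letters abbreviated in the message.
LETTERS = {0x0DBA: "Ya", 0x0DBB: "Ra"}


def parses(cps):
    """Return every split of cps into named sequences and single letters."""
    if not cps:
        return [[]]
    results = []
    # Option 1: consume one bare letter.
    if cps[0] in LETTERS:
        results += [[LETTERS[cps[0]]] + rest for rest in parses(cps[1:])]
    # Option 2: consume one named sequence.
    for name, seq in NAMED_SEQUENCES.items():
        if tuple(cps[:len(seq)]) == seq:
            results += [[name] + rest for rest in parses(cps[len(seq):])]
    return results


def from_hex(text):
    """Turn a string like '0DBB 0DCA 200D 0DBA' into a list of ints."""
    return [int(tok, 16) for tok in text.split()]


for s in ("0DBB 0DCA 200D 0DBA", "0DBB 0DCA 200D 0DBB"):
    for p in parses(from_hex(s)):
        print(s, "->", " + ".join(p))
```

Each of the two strings yields exactly two parses (Ra + Yansaya and Repaya + Ya for the first, Ra + Rakaransaya and Repaya + Ra for the second), which is the ambiguity the thread is about.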