Hi Martin,

On 15/10/16 04:07, Martin Jansche wrote:
> For Sinhala, the following named sequences are defined (for good reasons):
>
> SINHALA CONSONANT SIGN YANSAYA;0DCA 200D 0DBA
> SINHALA CONSONANT SIGN RAKAARAANSAYA;0DCA 200D 0DBB
> SINHALA CONSONANT SIGN REPAYA;0DBB 0DCA 200D
>
> I'll abbreviate these as Yansaya, Rakaransaya, and Repaya, and I'll
> write Ya for 0DBA and Ra for 0DBB.
>
> Note that these give rise to two potentially ambiguous codepoint
> strings, namely
>
> 0DBB 0DCA 200D 0DBA
> 0DBB 0DCA 200D 0DBB
>
> I'll concentrate on the first, as all arguments apply to the second one
> analogously.
>
> At a first glance, the sequence 0DBB 0DCA 200D 0DBA has two possible parses:
>
> 0DBB + 0DCA 200D 0DBA, i.e. Ra + Yansaya
> 0DBB 0DCA 200D + 0DBA, i.e. Repaya + Ya
>
> First question: Does the standard give any guidance as to which one is
> the intended parse? The section on Sinhala in the Unicode Standard is
> silent about this. Is there a general principle I'm missing?
>
> Sri Lanka Standard SLS 1134 (2004 draft) states that Ra+Yansaya is not
> used and is considered incorrect, suggesting that the second parse
> (Repaya+Ya) should be the default interpretation of this sequence.
> However, SLS 1134 does not address the potential ambiguity of this
> sequence explicitly and the description there could be read as
> informative, not normative.
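To make the two readings concrete, here is a minimal sketch that enumerates every way a codepoint string can be split into the three named sequences quoted above plus the bare letters Ya and Ra. The names `NAMED_SEQUENCES`, `LETTERS` and `parses` are hypothetical helpers for illustration only; nothing here comes from the standard or from any shaping engine, and it only shows that both parses exist, not which one is intended.

```python
# Named sequences as quoted in the message above, using its abbreviations.
NAMED_SEQUENCES = {
    "Yansaya": (0x0DCA, 0x200D, 0x0DBA),
    "Rakaransaya": (0x0DCA, 0x200D, 0x0DBB),
    "Repaya": (0x0DBB, 0x0DCA, 0x200D),
}

# Shorthand from the message for the two bare letters.
LETTERS = {0x0DBA: "Ya", 0x0DBB: "Ra"}


def parses(cps):
    """Return every split of the tuple `cps` into named sequences and bare letters."""
    if not cps:
        return [[]]
    results = []
    # Try to consume a named sequence at the start of the string.
    for name, seq in NAMED_SEQUENCES.items():
        if cps[:len(seq)] == seq:
            results += [[name] + rest for rest in parses(cps[len(seq):])]
    # Try to consume a single bare letter instead.
    if cps[0] in LETTERS:
        results += [[LETTERS[cps[0]]] + rest for rest in parses(cps[1:])]
    return results


if __name__ == "__main__":
    for cps in [(0x0DBB, 0x0DCA, 0x200D, 0x0DBA),
                (0x0DBB, 0x0DCA, 0x200D, 0x0DBB)]:
        label = " ".join(f"{cp:04X}" for cp in cps)
        for p in parses(cps):
            print(label, "=", " + ".join(p))
```

Each of the two strings comes back with exactly two splits, which is the ambiguity in question. A rule such as the SLS 1134 statement would amount to discarding the split that begins with Ra followed by Yansaya.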

1) re: 0DBB 0DCA 200D 0DBA

SLS 1134 was updated in 2011 (The latest public version I could find is
v3.41. This extract is the same in v3.6.):
https://sourceforge.net/p/sinhala/mailman/attachment/[email protected]/1/

"1. The yansaya is not used following the letter ර.
e.g.: the spelling කාර්ය is incorrect."

If the above is insufficient, it's best to discuss the issue with Harsha
(CC'd) and Ruvan (CC'd).

2) re: 0DBB 0DCA 200D 0DBB

Harsha & Ruvan can clarify this too.

cya,
#

> Second question: Given that one parse of this sequence should be the
> default, how does one represent the non-default parse?
>
> In most cases one can guess what the intended meaning is, but I suspect
> this is somewhat of a gray area. In practice, trying to render these
> problematic sequences and their neighbors in HarfBuzz with a variety of
> fonts results in a variety of outcomes (including occasionally
> unexpected glyph choices). If the meaning of these sequences is not well
> defined, that would partly explain the variation across fonts.
>
> Am I missing something fundamental? If not, it seems this issue should
> be called out explicit in some part of the standard.
>
> Regards,
> -- martin

