Re: Ambiguity(?) in Sinhala named sequences

Asmus Freytag Fri, 14 Oct 2016 11:14:57 -0700

This is an interesting question.

It seems the task of parsing a text into sequences depends on the purpose. Not all sequences of interest are named and, in the general case, not all attempts at parsing may be unique. In this case, it looks like the named sequences would correspond to a specific (ligated) glyph that matches a user-perceived unit of the writing system.

Such a parsing task is akin to scanning, for example, strings using the Latin script for ligatures, while trying to emulate the rules that were in effect during the days of hot metal typesetting for certain languages. For example, it wasn't enough to know that a certain cluster of letters might have a ligature glyph; one would also have to know whether the cluster straddled a (compound) word boundary or not. Just knowing the specification of ligated sequences alone would not be enough to identify a correct parse.


Such rules, however, are usually not part of the Unicode standard.

The situation here is similar; the standard simply specifies that a certain sequence of code points has a collective name. In case of ambiguities, you'll have to turn to external sources to resolve them.

Now, if this is the only such ambiguity (or one of a very small number) and if identification of the correct sequence is essential for selecting the correct rendering, I don't see why the script description for Sinhala couldn't be augmented to discuss that issue.

In which case, the way to proceed is to assemble the full set of facts and submit them to the UTC using the reporting form on the website.


A./


On 10/14/2016 10:07 AM, Martin Jansche wrote:

For Sinhala, the following named sequences are defined (for good reasons):

SINHALA CONSONANT SIGN YANSAYA;0DCA 200D 0DBA
SINHALA CONSONANT SIGN RAKAARAANSAYA;0DCA 200D 0DBB
SINHALA CONSONANT SIGN REPAYA;0DBB 0DCA 200D
I'll abbreviate these as Yansaya, Rakaransaya, and Repaya, and I'll write Ya for 0DBA and Ra for 0DBB.
Note that these give rise to two potentially ambiguous code point strings, namely
  0DBB 0DCA 200D 0DBA
  0DBB 0DCA 200D 0DBB
I'll concentrate on the first, as all arguments apply to the second one analogously.
At first glance, the sequence 0DBB 0DCA 200D 0DBA has two possible parses:
  0DBB + 0DCA 200D 0DBA, i.e. Ra + Yansaya
  0DBB 0DCA 200D + 0DBA, i.e. Repaya + Ya
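A minimal sketch of how these two parses arise, assuming the only units of interest are the three named sequences listed above plus the single letters Ya and Ra. The table names and the `parses` helper are hypothetical and chosen for illustration; they are not part of any standard or library.

```python
# Named sequences as listed above, stored as tuples of code points.
NAMED_SEQUENCES = {
    "Yansaya": (0x0DCA, 0x200D, 0x0DBA),
    "Rakaransaya": (0x0DCA, 0x200D, 0x0DBB),
    "Repaya": (0x0DBB, 0x0DCA, 0x200D),
}
# Single letters, using the abbreviations from this message.
LETTERS = {"Ya": (0x0DBA,), "Ra": (0x0DBB,)}


def parses(codepoints, units=None):
    """Return every way to split `codepoints` into the known units."""
    if units is None:
        units = {**LETTERS, **NAMED_SEQUENCES}
    codepoints = tuple(codepoints)
    if not codepoints:
        return [[]]  # exactly one way to parse the empty string
    results = []
    for name, seq in units.items():
        # Try each unit as a prefix, then parse the remainder recursively.
        if codepoints[:len(seq)] == seq:
            for rest in parses(codepoints[len(seq):], units):
                results.append([name] + rest)
    return results


for parse in parses([0x0DBB, 0x0DCA, 0x200D, 0x0DBA]):
    print(" + ".join(parse))
```

For the first string this lists Ra + Yansaya and Repaya + Ya; nothing in the sequence definitions alone prefers one over the other.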
First question: Does the standard give any guidance as to which one is the intended parse? The section on Sinhala in the Unicode Standard is silent about this. Is there a general principle I'm missing?
Sri Lanka Standard SLS 1134 (2004 draft) states that Ra+Yansaya is not used and is considered incorrect, suggesting that the second parse (Repaya+Ya) should be the default interpretation of this sequence. However, SLS 1134 does not address the potential ambiguity of this sequence explicitly and the description there could be read as informative, not normative.
Second question: Given that one parse of this sequence should be the default, how does one represent the non-default parse?
In most cases one can guess what the intended meaning is, but I suspect this is somewhat of a gray area. In practice, trying to render these problematic sequences and their neighbors in HarfBuzz with a variety of fonts results in a variety of outcomes (including occasionally unexpected glyph choices). If the meaning of these sequences is not well defined, that would partly explain the variation across fonts.
Am I missing something fundamental? If not, it seems this issue should be called out explicitly in some part of the standard.
Regards,
-- martin
