Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-08 Thread Richard Wordingham via Unicode
On Thu, 8 Aug 2019 00:33:47 +
Andrew Glass via Unicode  wrote:

> I agree and understand that accurate representation is important in
> this case. It would be good to understand how widespread the issue is
> in order to begin to justify the work to retrofit shaping with
> normalization. The number of problematic strings may be small but the
> risk of regression in this case might be quite large.

Well, you could always reverse engineer HarfBuzz!

Just a reminder though.  You would be using a permutation of the
canonical combining classes - for Tai Tham, U+1A60 should be treated as
ccc=254, not ccc=0, and for Tibetan you would need to ensure that the
vowels below (ccc=132) came before the vowels above (ccc=130).
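
A minimal Python sketch of that idea, assuming the shaper does canonical-ordering-style sorting with its own table of class values. The function name `reorder_marks` and the `TIBETAN_OVERRIDE` values are hypothetical and chosen only to illustrate the Tibetan case; they are not taken from the HarfBuzz source.

```python
import unicodedata

def reorder_marks(text, ccc_override=None):
    """Canonical-ordering-style stable sort of combining marks.

    ccc_override maps a standard ccc value to the value the shaper
    should use instead (hypothetical table, supplied by the caller).
    """
    override = ccc_override or {}

    def ccc(ch):
        standard = unicodedata.combining(ch)
        return override.get(standard, standard)

    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:
            # A starter ends the current run of marks; sorted() is stable,
            # so marks with equal class keep their relative order.
            out.extend(sorted(run, key=ccc))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)

# Illustrative permutation only: vowels below (132) sort before vowels above (130).
TIBETAN_OVERRIDE = {130: 132, 132: 131}

ka_i_u = "\u0F40\u0F72\u0F74"  # KA, VOWEL SIGN I (ccc=130), VOWEL SIGN U (ccc=132)
standard_order = reorder_marks(ka_i_u)
shaping_order = reorder_marks(ka_i_u, TIBETAN_OVERRIDE)
```

With the standard classes the I sign stays ahead of the U sign; with the permuted classes the U sign (below) is moved ahead of the I sign (above).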

Richard.



Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-08 Thread Asmus Freytag via Unicode

  
  
On 8/8/2019 1:06 AM, Richard Wordingham via Unicode wrote:

> This is not compliant with Unicode, but
> neither is deliberately treating canonically equivalent forms
> differently.

That.

A./

  



Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-08 Thread Richard Wordingham via Unicode
On Wed, 7 Aug 2019 14:19:26 -0700
Asmus Freytag via Unicode  wrote:

> What about text that must exist normalized for other purposes?
> 
> Domain names must be normalized to NFC, for example. Will such
> strings display correctly if passed to USE?

One solution, of course, is to minimise the use of Microsoft
products.  (The trick
is to apply the normalisation algorithm using a permutation of the
positive ccc values.)  The latest version of HarfBuzz renders
subscripted final consonants; it's slowly recovering its pre-USE
rendering capabilities. 

> On 8/7/2019 1:39 PM, Andrew Glass via Unicode wrote:
> That's correct, the Microsoft implementation of USE spec does not
> normalize as part of the shaping process. Why? Because the ccc system
> for non-Latin scripts is not a good mechanism for handling complex
> requirements for these writing systems and the effects of ccc-based
> normalization can disrupt authors intent. Unfortunately, because we
> cannot fix ccc values, shaping engines at Microsoft have ignored
> them. Therefore, recommendation for passing text to USE is to not
> normalize.

HarfBuzz solved the problem of  by choosing a
suitable normalisation; it uses the same technique for Hebrew, where
the normalisation classes are also unfriendly to renderers.  

> By the way, at the current time, I do not have a final consensus from
> Tai Tham experts and community on the changes required to support Tai
> Tham in USE. Therefore, I've not been able to make the changes
> proposed in this thread.

Grammatical denazification is one solution.  Another one is to delegate
matters to the font.  Give us a script type that will implement a GSUB
feature by default, and font writers can take it from there. At present
I have a conundrum on how to render the accusative singular of the
cruciform form of the word for enlightenment without using chained
syllables, _bodhiṃ_.  The obvious visual encoding is .  This combination is very
unusual, perhaps unique to this word.  (Pali 'o' is ). However, a very common combination, because the UTC refused Tai
Tham the character SIGN AM, is SIGN AA, MAI KANG, so for the USE, SIGN
AA and MAI KANG have to be in the same character class.  (Alternatively,
we split the syllable before SIGN AA.)  MAI KANG has InSc=bindu, while
SIGN AA is a right matra. Unfortunately, there is a strong temptation
for many to write what would have been 'SIGN AM' as MAI KANG, SIGN AA,
which is to be rendered quite differently from 'SIGN AM' outside
Northern Thailand, e.g. in NE Thailand.  (Northern Thailand has both
styles; it is quite diverse.)  If I understand the principles of USE,
allowing both '... MAI KANG, SIGN AA...' and '... SIGN AA, MAI
KANG ', which immediately after a consonant have the same rendering
in some fonts and very confusable renderings in many others, is
considered highly undesirable.

For Microsoft applications, another solution is for fonts to delete
dotted circles between Tai Tham characters.  (I try to be more
selective, but this results in a complicated set of lookups to
ensure that deletion only occurs when the renderer has inserted
inappropriate dotted circles.)  This is not compliant with Unicode, but
neither is deliberately treating canonically equivalent forms
differently.

Richard.



Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-07 Thread Asmus Freytag via Unicode

  
  
On 8/7/2019 5:33 PM, Andrew Glass via Unicode wrote:

> I agree and understand that accurate representation is important in
> this case. It would be good to understand how widespread the issue is
> in order to begin to justify the work to retrofit shaping with
> normalization. The number of problematic strings may be small but the
> risk of regression in this case might be quite large.

Not sure how to quantify this. Potentially every URL (assuming that
local users eventually migrate to non-ASCII domains). Then again, not
all of these will be normalized in the document.

I don't know the precise behavior of address bar / status bar. I know
that when you type in an uppercase ASCII domain name, it will resolve,
but the lower case name is echoed.

Can't tell immediately whether that means that for names that are
normalized for lookup, you also get the canonical name displayed. If so,
then every single (local) URL in those scripts is potentially affected.

A./




  

 

RE: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-07 Thread Andrew Glass via Unicode
I agree and understand that accurate representation is important in this case. 
It would be good to understand how widespread the issue is in order to begin to 
justify the work to retrofit shaping with normalization. The number of 
problematic strings may be small but the risk of regression in this case might 
be quite large.

Cheers,

Andrew











Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-07 Thread Asmus Freytag (c) via Unicode

On 8/7/2019 5:08 PM, Andrew Glass wrote:

> Shaping domain names is a new requirement. It would be good to
> understand the specific cases that are falling in the gap here.


Domain names are simply strings, but the protocol enforces normalization 
to NFC. In some situations, it might be possible for a browser, for 
example, to have access to the user-provided string, but I can see any 
number of situations where the actual string (as stored in the DNS) 
would need to be displayed.


For the scenario, it does not matter whether it's NFC or NFD, what 
matters is that some particular un-normalized state would be lost; and 
therefore it would be bad if the result is that the string can no longer 
be rendered correctly.


In particular, as the strings in question would be identifiers, where 
accurate recognition is prime.


A./






RE: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-07 Thread Andrew Glass via Unicode
Shaping domain names is a new requirement. It would be good to understand the 
specific cases that are falling in the gap here.









Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-07 Thread Asmus Freytag via Unicode

  
  
What about text that must exist normalized for other purposes?

Domain names must be normalized to NFC, for example. Will such strings
display correctly if passed to USE?

A./









  



RE: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-07 Thread Andrew Glass via Unicode
That's correct, the Microsoft implementation of USE spec does not normalize as 
part of the shaping process.
Why? Because the ccc system for non-Latin scripts is not a good mechanism for 
handling complex requirements for these writing systems and the effects of 
ccc-based normalization can disrupt authors' intent. Unfortunately, because we 
cannot fix ccc values, shaping engines at Microsoft have ignored them. 
Therefore, the recommendation for passing text to USE is to not normalize.

By the way, at the current time, I do not have a final consensus from Tai Tham 
experts and community on the changes required to support Tai Tham in USE. 
Therefore, I've not been able to make the changes proposed in this thread.

Cheers,

Andrew




Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-07 Thread Richard Wordingham via Unicode
On Tue, 14 May 2019 03:08:04 +0100
Richard Wordingham via Unicode  wrote:

> On Tue, 14 May 2019 00:58:07 +
> Andrew Glass via Unicode  wrote:
> 
> > Here is the essence of the initial changes needed to support CV+C.
> > Open to feedback.
> > 
> > 
> >   *   Create new SAKOT class
> > SAKOT (Sk) based on UISC = Invisible_Stacker
> >   *   Reduced HALANT class
> > Now only HALANT (H) based on UISC = Virama
> >   *   Updated Standard cluster mode
> > 
> > [< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)* (< < H | Sk > B | SUB >
> > [VS] (CMAbv)* (CMBlw)*)* [MPre] [MAbv] [MBlw] [MPst] (VPre)*
> > (VAbv)* (VBlw)* (VPst)* (VMPre)* (VMAbv)* (VMBlw)* (VMPst)* (Sk
> > B)* (FAbv)* (FBlw)* (FPst)* [FM]

> This next question does not, I believe, affect HarfBuzz.  Will NFC
> code render as well as unnormalised code?  In the first example above,
>  normalises to , which
> does not match any portion of the regular expression.

Could someone answer this question, please?  The USE documentation
("CGJ handling will need to be updated if USE is modified to support
normalization") still implies that the USE does not respect canonical
equivalence.

Richard.


Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-06-23 Thread 梁海 Liang Hai via Unicode

> (1) When can we anticipate that the USE spec will be updated to provide 
> support for subjoined consonants below vowels (as required for TAI THAM) ?

• The exact scope is actually about allowing conjoined consonant forms (either 
encoded with a stacker, or encoded atomically?) after vowel signs in an encoded 
cluster.

> ** A good use case is the Tai Tham word U+1A27 U+1A6A U+1A60 U+1A37 , 
> transcribed to Central Thai script as จูบ, (to kiss). Currently, people are 
> writing this as U+1A27 U+1A60 U+1A37 U+1A6A ("จบู") which violates the 
> "phonetic ordering" but is the current workaround because USE is still broken 
> for TAI THAM.

• I agree with Richard that this is really not a good use case. This word 
(as long as it is written with the vowel sign Uu either under or after the 
conjoined consonant sign B) should really be encoded as , according to our 
best understanding today.

• The “phonetic ordering” principle of Unicode is a frequently misinterpreted 
one. Note that when there are multiple ways of interpreting the phonetic order 
of a written structure, we try to stick to the more graphically apparent order, 
in order to have a stable encoding order.
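
A small Python sketch of why the choice of order matters, using the two code point sequences quoted above for the word transcribed as จูบ. The helper name `canonically_equivalent` is hypothetical. Because the vowel sign U+1A6A has ccc=0, normalization never moves it across SAKOT and the following consonant, so the two spellings remain distinct strings for searching and comparison.

```python
import unicodedata

# The two spellings quoted above (code points as given in the quoted text).
vowel_then_subjoined = "\u1A27\u1A6A\u1A60\u1A37"
subjoined_then_vowel = "\u1A27\u1A60\u1A37\u1A6A"

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

same = canonically_equivalent(vowel_then_subjoined, subjoined_then_vowel)

# Print the combining class of each code point in the first spelling.
for ch in vowel_then_subjoined:
    print(f"U+{ord(ch):04X}", unicodedata.combining(ch))
```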

> An example of the contrast is shown in the attached files luynam.png, with 
> first orthographic syllable , and yukya.png, with 
> the first orthographic syllable .

• Right. I was always wondering to what extent this distinction happens as an 
orthographically conscious choice.

• Generally I feel that when at least one of the interacting signs (usually a 
consonant one and a vowel one) has inline advance, it should be safe to take a 
graphic order approach. The “6th preliminary recommendation” doesn’t take the 
luynam vs yukya case into consideration, mostly because we weren’t sure what 
good attestations there are.

> * Create new SAKOT class SAKOT (Sk) based on UISC = Invisible_Stacker
> * Reduced HALANT class Now only HALANT (H) based on UISC = Virama

• This feels like an undesirable Tham-specific relaxation. Note the artificial 
distinction between UISC Invisible_Stacker and Virama has nothing to do with 
whether graphically writing a consonant sign after a vowel sign is attested for 
a script. (কা)

• At least we need to look into USE-applicable (existing and future) scripts 
encoded with a Virama and see if any of them does need the relaxation.

> * Updated Standard cluster mode [< R | CS >] < B | GB > [VS] (CMAbv)* 
> (CMBlw)* (< < H | Sk > B | SUB > [VS] (CMAbv)* (CMBlw)*)* [MPre] [MAbv] [MBlw] 
> [MPst] (VPre)* (VAbv)* (VBlw)* (VPst)* (VMPre)* (VMAbv)* (VMBlw)* (VMPst)* 
> (Sk B)* (FAbv)* (FBlw)* (FPst)* [FM]


• I’m still trying to think about the possibility of only relaxing the cluster 
when either/both of  has post-base advance…

• The artificial distinction made between < H | Sk > B, SUB, and CM really 
needs to be resolved together with the relaxation.

> * Updated Halant-terminated cluster [< R | CS >] < B | GB > [VS] (CMAbv)* 
> (CMBlw)* (< < H | Sk > B | SUB > [VS] (CMAbv)* (CMBlw)*)* < H | Sk >


• So, the intention of allowing Sk at the end is only about allowing the glyph 
of Sk to be positioned on the preceding character(s), right?

> * New Sakot-terminated cluster [< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)* 
> (< < H | Sk > B | SUB > [VS] (CMAbv)* (CMBlw)*)* [MPre] [MAbv] [MBlw] [MPst] 
> (VPre)* (VAbv)* (VBlw)* (VPst)* (VMPre)* (VMAbv)* (VMBlw)* (VMPst)* (Sk B 
> [VS] (CMAbv)* (CMBlw)) Sk


• The “(Sk B [VS] (CMAbv)* (CMBlw)) Sk” part doesn’t seem to align with the 
updated Standard cluster’s “(Sk B)*”?

> I trust you'll be reclassifying U+1A55 TAI THAM CONSONANT SIGN MEDIAL RA and 
> U+1A56 TAI THAM CONSONANT SIGN MEDIAL LA into the category SUB so that we can 
> write about bananas forever (ᨠᩖ᩠ᩅ᩠᩶ᨿᨲᩕ᩠ᩃᩬᨯ):
>
>  TONE-2, SAKOT, LOW YA> /kluai/ 'banana'
>
>  OA BELOW, DA> /tʰalɔːt/ 'for ever'
>
> The issues here are that WA in a medial rôle is indistinguishable from a 
> coda ('sakot') consonant and that MEDIAL RA can act as a consonant aspirator.

• The issues here are:

• Medial consonant sign characters of Tham are not encoded based on a 
clear phono-orthographical distinction.

• Tham allows syllable chaining that does not rely on a preceding 
inline coda letter.

• Consonant sign Medial Ra being a consonant aspirator here is not relevant to 
its appearance before a non-medial consonant sign here.

Best,
梁海 Liang Hai
https://lianghai.github.io



Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-05-14 Thread Richard Wordingham via Unicode
On Tue, 14 May 2019 03:08:04 +0100
Richard Wordingham via Unicode  wrote:

> Together,
> these call for (Sk B)* to be replaced by ().

Correction:
Together, these call for (Sk B)* to be replaced by ()*.

Richard.


Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-05-13 Thread Richard Wordingham via Unicode
On Tue, 14 May 2019 00:58:07 +
Andrew Glass via Unicode  wrote:

> Here is the essence of the initial changes needed to support CV+C.
> Open to feedback.
> 
> 
>   *   Create new SAKOT class
> SAKOT (Sk) based on UISC = Invisible_Stacker
>   *   Reduced HALANT class
> Now only HALANT (H) based on UISC = Virama
>   *   Updated Standard cluster mode
> 
> [< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)* (< < H | Sk > B | SUB >
> [VS] (CMAbv)* (CMBlw)*)* [MPre] [MAbv] [MBlw] [MPst] (VPre)*
> (VAbv)* (VBlw)* (VPst)* (VMPre)* (VMAbv)* (VMBlw)* (VMPst)* (Sk B)*
> (FAbv)* (FBlw)* (FPst)* [FM]

This comes a lot closer to supporting Tai Tham monosyllabic clusters.

Although this shouldn't affect Tai Tham, some of those medials need to
be made repeatable; I believe this has already been done in HarfBuzz.

I trust you'll be reclassifying U+1A55 TAI THAM CONSONANT SIGN MEDIAL RA
and U+1A56 TAI THAM CONSONANT SIGN MEDIAL LA into the category SUB so
that we can write about bananas forever (ᨠᩖ᩠ᩅ᩠᩶ᨿᨲᩕ᩠ᩃᩬᨯ):

 /kluai/ 'banana'

 /tʰalɔːt/ 'for ever'

The issues here are that WA in a medial rôle is indistinguishable from
a coda ('sakot') consonant and that MEDIAL RA can act as a consonant
aspirator.

Unfortunately, we didn't define a consonant HIGH RATTHA with a
canonical decomposition to .  The problem is that 'HIGH RATTHA', widely seen as an alternative
form of HIGH RATHA, can act as a subscript coda consonant.  There are
also a couple of words in the Northern Thai Dictionary of Palm-Leaf
Manuscripts where MEDIAL LA acts as a coda consonant.  Together,
these call for (Sk B)* to be replaced by ().

This next question does not, I believe, affect HarfBuzz.  Will NFC
code render as well as unnormalised code?  In the first example above,
 normalises to , which
does not match any portion of the regular expression.
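
The reordering can be reproduced with Python's `unicodedata` module. This is only a sketch: the code point sequence in `KLUAI` is an assumption, reconstructed from the /kluai/ word as written above (HIGH KA, MEDIAL LA, SAKOT, WA, TONE-2, SAKOT, LOW YA), and `describe` is a hypothetical helper. The point it shows is that a tone mark (ccc=230) typed before SAKOT (ccc=9) is moved after it by NFC.

```python
import unicodedata

# Assumed spelling of /kluai/; TONE-2 (U+1A76) precedes the second SAKOT (U+1A60).
KLUAI = "\u1A20\u1A56\u1A60\u1A45\u1A76\u1A60\u1A3F"

def describe(text):
    """List (code point, canonical combining class) pairs for a string."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]

nfc = unicodedata.normalize("NFC", KLUAI)

# Show each position before and after normalization.
for before, after in zip(describe(KLUAI), describe(nfc)):
    print(before, "->", after)
```

After normalization the SAKOT no longer sits directly before LOW YA's predecessor in typing order, so a cluster model that expects the mark sequence in typing order sees a different class sequence.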

Richard.



RE: What is the time frame for USE shapers to provide support for CV+C ?

2019-05-13 Thread Andrew Glass via Unicode
Here is the essence of the initial changes needed to support CV+C. Open to 
feedback.


  *   Create new SAKOT class
SAKOT (Sk) based on UISC = Invisible_Stacker
  *   Reduced HALANT class
Now only HALANT (H) based on UISC = Virama
  *   Updated Standard cluster mode

[< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)* (< < H | Sk > B | SUB > [VS] 
(CMAbv)* (CMBlw)*)* [MPre] [MAbv] [MBlw] [MPst] (VPre)* (VAbv)* (VBlw)* (VPst)* 
(VMPre)* (VMAbv)* (VMBlw)* (VMPst)* (Sk B)* (FAbv)* (FBlw)* (FPst)* [FM]


The only required component of a standard cluster is a BASE or BASE_OTHER. A 
cluster may optionally begin with a REPH or CONS_WITH_STACKER. A BASE or 
BASE_OTHER may be followed immediately by a VARIATION_SELECTOR and/or multiple 
CONS_MOD characters in the order CONS_MOD_ABOVE CONS_MOD_BELOW. Multiple 
sequences of a HALANT BASE or SAKOT BASE with optional VARIATION_SELECTOR or 
optional CONS_MOD can occur. The sequence can continue with zero or one 
CONS_MED for each cardinal position (Pre, Above, Below, Post); zero to many 
VOWEL characters in each cardinal position; zero to many VOWEL_MODs in each 
cardinal position; zero to many sequences of SAKOT BASE; zero to many 
CONS_FINALs in each of Above, Below, and Post; and lastly, an optional 
FINAL_MOD.
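
A minimal sketch of how the Standard cluster pattern above could be tried out,
assuming each character has already been mapped to its USE class name. The
helper names (tok, opt, star, is_standard_cluster) and the class assignments
in the examples are assumptions made for this sketch; they are not taken from
any shaper's code.

```python
import re


def tok(*names):
    """Match exactly one class token; tokens are serialised as 'NAME '."""
    return "(?:(?:" + "|".join(names) + ") )"


def opt(pattern):
    """Zero or one occurrence, i.e. [X] in the notation above."""
    return "(?:" + pattern + ")?"


def star(pattern):
    """Zero or more occurrences, i.e. (X)* in the notation above."""
    return "(?:" + pattern + ")*"


# [VS] (CMAbv)* (CMBlw)*
TAIL = opt(tok("VS")) + star(tok("CMAbv")) + star(tok("CMBlw"))

STANDARD = re.compile(
    opt(tok("R", "CS"))
    + tok("B", "GB") + TAIL
    + star("(?:" + tok("H", "Sk") + tok("B") + "|" + tok("SUB") + ")" + TAIL)
    + "".join(opt(tok(m)) for m in ("MPre", "MAbv", "MBlw", "MPst"))
    + "".join(star(tok(v)) for v in ("VPre", "VAbv", "VBlw", "VPst"))
    + "".join(star(tok(v)) for v in ("VMPre", "VMAbv", "VMBlw", "VMPst"))
    + star(tok("Sk") + tok("B"))
    + "".join(star(tok(f)) for f in ("FAbv", "FBlw", "FPst"))
    + opt(tok("FM"))
)


def is_standard_cluster(classes):
    """True if the whole list of class names forms one Standard cluster."""
    return STANDARD.fullmatch("".join(c + " " for c in classes)) is not None


# Assumed classes: consonant = B, SIGN UU = VBlw, SAKOT = Sk.
phonetic_order = ["B", "VBlw", "Sk", "B"]    # U+1A27 U+1A6A U+1A60 U+1A37
workaround_order = ["B", "Sk", "B", "VBlw"]  # U+1A27 U+1A60 U+1A37 U+1A6A

print(is_standard_cluster(phonetic_order))
print(is_standard_cluster(workaround_order))
```

Under these assumptions both spellings of the word cited in the original
question are accepted as a single cluster, since the trailing (Sk B)* group
admits a consonant after the vowel.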



  *   Updated Halant-terminated cluster
[< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)* (< < H | Sk > B | SUB > [VS] 
(CMAbv)* (CMBlw)*)* < H | Sk >



This is similar to the Standard cluster but terminates in a final HALANT or 
SAKOT after a BASE, BASE_OTHER, or CONS_MOD. When such a HALANT or SAKOT 
follows a BASE, BASE_OTHER, or CONS_MOD it will form a cluster. When any 
character other than a BASE or BASE_OTHER follows the HALANT or SAKOT there 
will be a cluster break between the HALANT or SAKOT and the following 
character. Multiple sequences of a HALANT BASE or SAKOT BASE with optional 
VARIATION_SELECTOR or optional CONS_MOD can occur. A CONS_SUBJ is equivalent 
to the sequence HALANT BASE.



  *   New Sakot-terminated cluster

[< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)* (< < H | Sk > B | SUB > [VS] 
(CMAbv)* (CMBlw)*)*

[MPre] [MAbv] [MBlw] [MPst]

(VPre)* (VAbv)* (VBlw)* (VPst)*

(VMPre)* (VMAbv)* (VMBlw)* (VMPst)*

(Sk B [VS] (CMAbv)* (CMBlw)*)* Sk



This is similar to the Standard cluster but terminates in a final SAKOT after a 
VOWEL or VOWEL_MOD. When such a SAKOT follows a VOWEL or VOWEL_MOD it will form 
a cluster. When any character other than a BASE or BASE_OTHER follows this 
SAKOT there will be a cluster break between the SAKOT and the following 
character. Multiple sequences of a SAKOT BASE with optional VARIATION_SELECTOR 
or optional CONS_MOD can occur. A CONS_SUBJ is equivalent to the sequence 
HALANT BASE.

This would allow a consonant to follow a vowel when joined with a Sakot. It 
would support multiple final consonants. It would not support polysyllabic 
chaining of CV+CV+CV etc.
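
The three claims above (a consonant after a vowel, multiple final consonants,
no polysyllabic chaining) can be checked with a small sketch along the
following lines. It assumes the input is a list of USE class names; the helper
names and the example class sequences are assumptions for illustration only.

```python
import re


def tok(*names):
    """Match exactly one class token; tokens are serialised as 'NAME '."""
    return "(?:(?:" + "|".join(names) + ") )"


def opt(pattern):
    return "(?:" + pattern + ")?"


def star(pattern):
    return "(?:" + pattern + ")*"


# [VS] (CMAbv)* (CMBlw)*
TAIL = opt(tok("VS")) + star(tok("CMAbv")) + star(tok("CMBlw"))

# Shared opening: everything up to and including the vowel modifiers.
HEAD = (
    opt(tok("R", "CS"))
    + tok("B", "GB") + TAIL
    + star("(?:" + tok("H", "Sk") + tok("B") + "|" + tok("SUB") + ")" + TAIL)
    + "".join(opt(tok(m)) for m in ("MPre", "MAbv", "MBlw", "MPst"))
    + "".join(star(tok(v)) for v in ("VPre", "VAbv", "VBlw", "VPst"))
    + "".join(star(tok(v)) for v in ("VMPre", "VMAbv", "VMBlw", "VMPst"))
)

STANDARD = re.compile(
    HEAD
    + star(tok("Sk") + tok("B"))
    + "".join(star(tok(f)) for f in ("FAbv", "FBlw", "FPst"))
    + opt(tok("FM"))
)

# (Sk B [VS] (CMAbv)* (CMBlw)*)* Sk
SAKOT_TERMINATED = re.compile(HEAD + star(tok("Sk") + tok("B") + TAIL) + tok("Sk"))


def matches(pattern, classes):
    """True if the whole list of class names matches the given pattern."""
    return pattern.fullmatch("".join(c + " " for c in classes)) is not None


cv_c = ["B", "VBlw", "Sk", "B"]              # consonant after a vowel
cv_c_c = ["B", "VBlw", "Sk", "B", "Sk", "B"]  # two final consonants
cv_cv = ["B", "VBlw", "Sk", "B", "VBlw"]      # chained second syllable
cv_sk = ["B", "VBlw", "Sk"]                   # dangling SAKOT after a vowel

for name, seq in [("CV+C", cv_c), ("CV+C+C", cv_c_c), ("CV+CV", cv_cv)]:
    print(name, matches(STANDARD, seq))
print("CV+", matches(SAKOT_TERMINATED, cv_sk))
```

In this sketch the chained sequence is rejected because nothing in either
pattern allows a vowel after the trailing SAKOT BASE group.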

Cheers,

Andrew


From: Behdad Esfahbod 
Sent: 10 May 2019 11:32
To: Ed Trager 
Cc: Andrew Glass ; Unicode Mailing List 

Subject: Re: What is the time frame for USE shapers to provide support for CV+C ?

I'm open to doing that if there's consensus on how it should be done.

On Thu, May 9, 2019 at 8:55 AM Ed Trager wrote:
Hi, Andrew and Behdad,

Prompted by a conversation I had with Liang Hai yesterday, I am just curious to 
get some idea about the following:

(1) When can we anticipate that the USE spec will be updated to provide support 
for subjoined consonants below vowels (as required for TAI THAM) ?

(2) Once the USE spec is updated, how much lag time can we expect until 
Microsoft actually releases an implementation with said support for CV+C ?

(3a) And the related question —for Behdad and the HarfBuzz development group— 
is when can we expect to see CV+C support (at least for TAI THAM) in HarfBuzz ?

(3b) Would the HarfBuzz team consider providing CV+C support for TAI THAM even 
before the USE spec gets updated, so that we could test things out ? * **

---
* PLEASE AND THANKYOU?

** A good use case is the Tai Tham word U+1A27 U+1A6A U+1A60 U+1A37 , 
transcribed to Central Thai script as จูบ, (to kiss). Currently, people are 
writing this as U+1A27 U+1A60 U+1A37 U+1A6A ("จบู") which violates the 
"phonetic ordering" but is the current workaround because USE is still broken 
for TAI THAM.

REFERENCE DOCUMENT:
http://www.unicode.org/L2/L2018/18332-tai-tham-ad-hoc-report.pdf




--
behdad
http://behdad.org/

Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-05-09 Thread Richard Wordingham via Unicode
On Thu, 9 May 2019 11:55:23 -0400
Ed Trager via Unicode  wrote:
 
> ** A good use case is the Tai Tham word U+1A27 U+1A6A U+1A60 U+1A37 ,
> transcribed to Central Thai script as จูบ, (*to kiss*). Currently,
> people are writing this as U+1A27 U+1A60 U+1A37 U+1A6A ("จบู") which
> violates the "phonetic ordering" but is the current workaround
> because USE is still broken for TAI THAM.
> 
> REFERENCE DOCUMENT:
> http://www.unicode.org/L2/L2018/18332-tai-tham-ad-hoc-report.pdf

How is this a good test case?  The 6th preliminary recommendation
reads, "To represent a cluster, regardless of the phonetic order CCV or
CVC, a consonant sign should always be encoded before the vowel sign,
unless the vowel sign has inline advance and is apparently followed by
the consonant sign".  If this recommendation is adopted, then the
spelling "U+1A27 U+1A6A U+1A60 U+1A37" will be wrong.

Now, SIGN U and SIGN UU before subscript BA, HIGH PA and LOW YA aren't
always written as though they followed the subscript consonants in
phonetic order.  Sometimes the vowel sign is written in the bottom left
of the syllable.  Presumably we'll need 3 or 4 new signs:

TAI THAM UNAMBIGUOUS UB

TAI THAM UNAMBIGUOUS UUB

TAI THAM UNAMBIGUOUS UY

TAI THAM UNAMBIGUOUS UUY (?)

I'm not sure that the fourth one can occur.

An example of the contrast is shown in the attached files luynam.png,
with first orthographic syllable , and
yukya.png, with the first orthographic syllable . 

I wonder how we'd be supposed to encode ᩉᩖᩩ᩠᩶ᨿ (currently  'to crawl'?  The simplest
way would be to encode it as , which currently encodes
the unlikely ᩉᩖ᩠ᨿᩩ᩶. Will good fonts be expected to move the vowel left
and down from the subscript LOW YA to the MEDIAL LA?  Or will we need to
encode it with *TAI THAM UNAMBIGUOUS UY?

Richard.