Re: PUA (BMP) planned characters HTML tables

2019-08-10 Thread Richard Wordingham via Unicode
On Sun, 11 Aug 2019 00:07:05 -0400
Robert Wheelock via Unicode wrote:

> I remember a website that has tables for certain PUA precomposed
> accented characters that aren’t yet in Unicode (things like:
> Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital
> H-underbar, acute accented Cyrillic vowels, Cyrillic
> ER/er-caron, ...).  Where was it at?!  I still want to get the
> information.  Thank You!

You may mean https://www.eki.ee/letter.  Once there, you'll want to make
a query by Unicode range, e.g. e000-f8ff.  It doesn't seem to refer to
the relevant agreement.  You could start hunting for agreements at
https://scripts.sil.org/cms/scripts/page.php?item_id=VendorUseOfPUA

Most of the characters you mention are scheduled to be assigned their
own codepoint on the Greek kalends.  They are precluded by policy
because they would need to be composition exclusions to avoid making
text in NFC cease to be in NFC.
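
The NFC point can be checked with Python's standard unicodedata module.
This is only a minimal sketch: the variable names are illustrative, and
M-cedilla is used as the example because it has no precomposed code
point. Existing text spelt as M + COMBINING CEDILLA is already in NFC, so
a newly encoded precomposed letter would have to be excluded from
composition for that text to remain in NFC.

```python
import unicodedata

# e + COMBINING ACUTE ACCENT has a precomposed form, so NFC composes it.
e_acute = unicodedata.normalize("NFC", "e\u0301")

# M + COMBINING CEDILLA has no precomposed form, so this two-code-point
# sequence is already the NFC form and must stay so in later versions.
m_cedilla = unicodedata.normalize("NFC", "M\u0327")

print(len(e_acute), len(m_cedilla))
```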

I first thought of the SIL PUA at
https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi=PUA_home ,
but they knew better than to include most of them.

Richard.



RE: PUA (BMP) planned characters HTML tables

2019-08-10 Thread Robert Wheelock via Unicode
Hello!
I remember a website that has tables for certain PUA precomposed
accented characters that aren’t yet in Unicode (things like: Marshallese
M/m-cedilla, H/h-acute, capital T-dieresis, capital H-underbar, acute
accented Cyrillic vowels, Cyrillic ER/er-caron, ...).  Where was it at?!  I
still want to get the information.  Thank You!

Robert Lloyd Wheelock


Re: Fonts and Canonical Equivalence

2019-08-10 Thread Richard Wordingham via Unicode
On Sat, 10 Aug 2019 16:37:48 +0100
Andrew West via Unicode wrote:

> On Sat, 10 Aug 2019 at 15:46, Richard Wordingham via Unicode wrote:

> > Does vowel above before vowel below yield a dotted circle?  
> 
> Yes. Attached are screenshots for two real world examples, one which
> is logically spelled as i + u, and one as u + i:
> 
> 1. ཉིུ <0F49 0F72 0F74> [nyiu] as a contraction for ཉི་ཤུ [nyi shu]
> "twenty"
> 
> 2. བཅིུག <0F56 0F45 0F74 0F72 0F42> [bcuig] as a contraction for
> བཅུ་གཅིག [bcu gcig] "eleven"

Thanks for the clarification.  I must have done something wrong when I
tried to break Tibetan rendering by an above-below sequence - unless MS
Edge denormalises Tibetan text so that it will render.

However, we may be able to redress the balance between the renderers by
inserting CGJ between the vowels to preserve the order when the strings
are copied:

nyiu ཉི͏ུ 0F49 0F72 034F 0F74

bcuig བཅུ͏ིག  0F56 0F45 0F74 034F 0F72 0F42

On my machine they display without dotted circles in Claws-Mail and
LibreOffice, but I may be using too old a version of HarfBuzz.  However,
the ligaturing is missing in _nyiu_ with CGJ. LibreOffice at least is
using Tibetan Machine Uni.  However, in a snapshot of HarfBuzz I pulled
in the past few days, both were rendered with dotted circles. This issue
is apparently being worked on
(https://github.com/harfbuzz/harfbuzz/issues/483).

The forms without CGJ render fine in the two applications.
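
The effect of CGJ on canonical reordering can be checked with Python's
unicodedata module. This is a minimal sketch using the _bcuig_ example;
the helper name code_points is just for illustration. It only shows what
normalisation does to the strings, not how any renderer treats them.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, ccc=0

# u + i order, as in the real-world spelling of "bcuig"
bcuig = "\u0F56\u0F45\u0F74\u0F72\u0F42"
# the same spelling with CGJ between the two vowels
bcuig_cgj = "\u0F56\u0F45\u0F74" + CGJ + "\u0F72\u0F42"

def code_points(s):
    """Format a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(c):04X}" for c in s)

# Without CGJ, normalisation swaps the vowels (ccc 130 sorts before 132).
normalised_plain = unicodedata.normalize("NFC", bcuig)
# With CGJ, the vowels are not adjacent marks, so the order is kept.
normalised_cgj = unicodedata.normalize("NFC", bcuig_cgj)

print(code_points(normalised_plain))
print(code_points(normalised_cgj))
```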

Richard.



Re: Fonts and Canonical Equivalence

2019-08-10 Thread Andrew West via Unicode
On Sat, 10 Aug 2019 at 15:46, Richard Wordingham via Unicode wrote:
>
> > Just retested on Windows 10 with
> > a Tibetan font that supports both sequences of vowels, and both
> > sequences display correctly under Harfbuzz (as expected), but only
> > vowel-below followed by vowel-above displays correctly when using
> > built-in Windows rendering.
>
> Does vowel above before vowel below yield a dotted circle?

Yes. Attached are screenshots for two real world examples, one which
is logically spelled as i + u, and one as u + i:

1. ཉིུ <0F49 0F72 0F74> [nyiu] as a contraction for ཉི་ཤུ [nyi shu] "twenty"

2. བཅིུག <0F56 0F45 0F74 0F72 0F42> [bcuig] as a contraction for
བཅུ་གཅིག [bcu gcig] "eleven"

Andrew


Re: Fonts and Canonical Equivalence

2019-08-10 Thread Richard Wordingham via Unicode
On Sat, 10 Aug 2019 11:22:01 +0100
Andrew West via Unicode wrote:

> On Sat, 10 Aug 2019 at 08:29, Richard Wordingham via Unicode wrote:
> >
> > There are similar issues with Tibetan; some fonts do not work
> > properly if a vowel below (ccc=132) is separated from the base of
> > the consonant stack by a vowel above (ccc=130).  
> 
> It's not that the fonts don't work, it's that some of the rendering
> engines do not apply the OpenType features in the font that support
> both sequences of vowels (vowel-above followed by vowel-below, and
> vowel-below followed by vowel-above).

My observation was based on a Tibetan font that failed when pre-USE
HarfBuzz added or changed the normalisation for Tibetan.

> Just retested on Windows 10 with
> a Tibetan font that supports both sequences of vowels, and both
> sequences display correctly under Harfbuzz (as expected), but only
> vowel-below followed by vowel-above displays correctly when using
> built-in Windows rendering.

Does vowel above before vowel below yield a dotted circle?

According to the documentation - and the USE may have been improved in
undocumented ways - the blwf feature will not apply across a
Tibetan sequence of vowel above (VAbv) followed by vowel below (VBlw
or CMBlw), but the blws feature will, even if a dotted circle has been
added at the boundary.

> It is very frustrating that Windows cannot correctly support the
> display of Tibetan in normalized form, yet Harfbuzz does not have any
> problems. Personally, I think USE is a failed experiment, and I wish
> Microsoft would simply adopt Harfbuzz as the default rendering engine.

From what I've seen from discussions on HarfBuzz, the USE seems to work
well for non-Indic scripts and Devanagari clones - possibly even
for Bengali clones.  It's also a definition that HarfBuzz can fall back
on.  The problem is that it doesn't address the quirks of scripts, and
its anti-spoofing measures are draconian and overdone.

There may well be an issue of funding for the USE - for all I know, it
may in part be charity work.

If Microsoft gave up on rendering engines, who would write the
rendering specifications for HarfBuzz?

I was wondering how the USE might be modified to handle canonical
equivalence.  The simplest way may be to permute the canonical
combining classes, normalise (NFD) according to these classes, and
process the rearranged string.  That's roughly what HarfBuzz does.
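
A rough sketch of that first idea in Python follows. The permutation
table is purely hypothetical (it swaps the Tibetan classes 130 and 132 so
that a vowel below sorts before a vowel above), and the function names
are made up for illustration; this is not HarfBuzz's actual code or
table.

```python
import unicodedata

# Hypothetical permutation of canonical combining classes: swap 130
# (Tibetan vowels above) with 132 (Tibetan vowel below).
CLASS_PERMUTATION = {130: 132, 132: 130}

def permuted_ccc(ch):
    """Return the combining class after applying the permutation."""
    ccc = unicodedata.combining(ch)
    return CLASS_PERMUTATION.get(ccc, ccc)

def renderer_order(text):
    """Decompose to NFD, then reorder each run of marks by permuted class."""
    out, run = [], []
    for ch in unicodedata.normalize("NFD", text):
        if permuted_ccc(ch) == 0:
            # A starter ends the current run of marks; sorted() is stable,
            # so marks of equal class keep their relative order.
            out.extend(sorted(run, key=permuted_ccc))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=permuted_ccc))
    return "".join(out)

nyiu_iu = "\u0F49\u0F72\u0F74"  # vowel above, then vowel below
nyiu_ui = "\u0F49\u0F74\u0F72"  # vowel below, then vowel above
print(renderer_order(nyiu_iu) == renderer_order(nyiu_ui))
```

Both canonically equivalent spellings end up as the same string, with the
vowel below first, so whatever processes the result sees one form only.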

Another technique would be to derive regular expressions that would
match any string canonically equivalent to a string matching the
original regular expressions and use them instead.  (It may be
simpler to derive a regular expression that finds matches from amongst
normalised strings - that's what my canonical equivalence respecting
regular expression does.) Using a different canonical equivalent to the
present one could 'break' fonts whose sets of properly handled strings
were not closed under canonical equivalence - which is why I asked the
original question.
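
As a toy illustration of the first regular-expression technique, in
Python: the cluster grammar below is a made-up miniature (one Tibetan
letter, then an optional vowel below, then an optional vowel above), not
the USE's real expressions, and the pattern names are hypothetical. The
second pattern is widened by hand so that it also accepts the canonically
equivalent order.

```python
import re

# Toy original grammar: base letter, optional U+0F74, optional U+0F72.
ORIGINAL = re.compile("^[\u0F40-\u0F6C]\u0F74?\u0F72?$")

# Widened grammar: additionally accepts U+0F72 followed by U+0F74, which
# is the canonically ordered (NFC/NFD) spelling of the same cluster.
CLOSED = re.compile("^[\u0F40-\u0F6C](?:\u0F74?\u0F72?|\u0F72\u0F74)$")

nyiu_ui = "\u0F49\u0F74\u0F72"  # order the toy grammar was written for
nyiu_iu = "\u0F49\u0F72\u0F74"  # canonical order of the same cluster

print(bool(ORIGINAL.match(nyiu_iu)), bool(CLOSED.match(nyiu_iu)))
```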

Richard.



Re: Fonts and Canonical Equivalence

2019-08-10 Thread Andrew West via Unicode
On Sat, 10 Aug 2019 at 08:29, Richard Wordingham via Unicode wrote:
>
> There are similar issues with Tibetan; some fonts do not work properly
> if a vowel below (ccc=132) is separated from the base of the
> consonant stack by a vowel above (ccc=130).

It's not that the fonts don't work, it's that some of the rendering
engines do not apply the OpenType features in the font that support
both sequences of vowels (vowel-above followed by vowel-below, and
vowel-below followed by vowel-above). Just retested on Windows 10 with
a Tibetan font that supports both sequences of vowels, and both
sequences display correctly under Harfbuzz (as expected), but only
vowel-below followed by vowel-above displays correctly when using
built-in Windows rendering.

It is very frustrating that Windows cannot correctly support the
display of Tibetan in normalized form, yet Harfbuzz does not have any
problems. Personally, I think USE is a failed experiment, and I wish
Microsoft would simply adopt Harfbuzz as the default rendering engine.

Andrew


Fonts and Canonical Equivalence

2019-08-10 Thread Richard Wordingham via Unicode
I've spun this question off from the issue of what the USE should do
when confronted with the NFC canonical equivalent of a string it would
accept, where that equivalent fails to match the USE's regular
expressions because they are applied to strings of characters rather
than to canonical equivalence classes of strings.

What sort of guidance is there on the streams of characters to be
supported by a font with respect to canonical equivalence?  For example,
one might think it would suffice for a font to support NFD strings
only, but sometimes it seems that the only canonical equivalent that
needs be supported is not the Unicode-defined canonical form, but a
renderer-defined canonical form.

For example, when a Tai Tham renderer supports subscripted final
consonants, should the font support both the sequences  and , or just the one chosen by the
rendering engine? The HarfBuzz SEA engine would present the font with
the former; font designers had seen rendering failures when Tai Tham
text belatedly started being canonically normalised.

There are similar issues with Tibetan; some fonts do not work properly
if a vowel below (ccc=132) is separated from the base of the
consonant stack by a vowel above (ccc=130).
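
The combining classes involved can be inspected with Python's
unicodedata module. A minimal sketch, with illustrative variable names,
using U+0F72 (vowel sign I, above) and U+0F74 (vowel sign U, below) on
the base U+0F49:

```python
import unicodedata

VOWEL_ABOVE = "\u0F72"  # TIBETAN VOWEL SIGN I
VOWEL_BELOW = "\u0F74"  # TIBETAN VOWEL SIGN U

ccc_above = unicodedata.combining(VOWEL_ABOVE)
ccc_below = unicodedata.combining(VOWEL_BELOW)

# The two spellings differ only in mark order, so they are canonically
# equivalent and normalise to the same string, vowel above first.
below_first = unicodedata.normalize("NFD", "\u0F49" + VOWEL_BELOW + VOWEL_ABOVE)
above_first = unicodedata.normalize("NFD", "\u0F49" + VOWEL_ABOVE + VOWEL_BELOW)

print(ccc_above, ccc_below, below_first == above_first)
```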

TUS sees a rendering engine plus a font file (or a set of them) as a
single entity, so I don't think it offers much guidance here.  It seems
tolerant of the loss of precision in placement when a Latin character
is rendered as base plus diacritic rather than as a precomposed glyph.
One can also pedantically argue that a font is a data file rather than
a 'process'.  (Additionally, a lot of us get confused by the mens rea
aspect of Unicode compliance.)

Richard.