Re: PUA (BMP) planned characters HTML tables
On Sun, 11 Aug 2019 00:07:05 -0400 Robert Wheelock via Unicode wrote:

> I remember a website that has tables for certain PUA precomposed
> accented characters that aren’t yet in Unicode (things like:
> Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital
> H-underbar, acute accented Cyrillic vowels, Cyrillic
> ER/er-caron, ...). Where was it at?! I still want to get the
> information. Thank You!

You may mean https://www.eki.ee/letter. Once there, you'll want to make a query by Unicode range, e.g. e000-f8ff. It doesn't seem to refer to the relevant agreement. You could start hunting for agreements at https://scripts.sil.org/cms/scripts/page.php?item_id=VendorUseOfPUA

Most of the characters you mention are scheduled to be assigned their own codepoint on the Greek kalends. They are precluded by policy because they would need to be composition exclusions to avoid making text in NFC cease to be in NFC.

I first thought of the SIL PUA at https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi=PUA_home , but they knew better than to include most of them.

Richard.
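As a rough illustration of the NFC point above, here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `nfc_codepoints` is made up for this example. It shows a sequence that NFC composes, one of the requested letters that has no precomposed form, and an existing composition exclusion (U+0958) that NFC always leaves decomposed.

```python
import unicodedata

def nfc_codepoints(text: str) -> list[str]:
    """Return the NFC form of text as a list of hex code points."""
    return [f"{ord(ch):04X}" for ch in unicodedata.normalize("NFC", text)]

# c + COMBINING CEDILLA has a precomposed form, so NFC composes it.
print(nfc_codepoints("c\u0327"))   # ['00E7']

# m + COMBINING CEDILLA (Marshallese) has no precomposed form,
# so the two-character sequence is already in NFC.
print(nfc_codepoints("m\u0327"))   # ['006D', '0327']

# U+0958 DEVANAGARI LETTER QA is a composition exclusion:
# it decomposes under NFC and is never recomposed.
print(nfc_codepoints("\u0958"))    # ['0915', '093C']
```

A newly encoded m-cedilla with a canonical decomposition would have to behave like the last case, since existing NFC text already contains the two-character sequence.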
RE: PUA (BMP) planned characters HTML tables
Hello!

I remember a website that has tables for certain PUA precomposed accented characters that aren’t yet in Unicode (things like: Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital H-underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron, ...). Where was it at?! I still want to get the information.

Thank You!

Robert Lloyd Wheelock
Re: Fonts and Canonical Equivalence
On Sat, 10 Aug 2019 16:37:48 +0100 Andrew West via Unicode wrote:

> On Sat, 10 Aug 2019 at 15:46, Richard Wordingham via Unicode
> wrote:
> > Does vowel above before vowel below yield a dotted circle?
>
> Yes. Attached are screenshots for two real world examples, one which
> is logically spelled as i + u, and one as u + i:
>
> 1. ཉིུ <0F49 0F72 0F74> [nyiu] as a contraction for ཉི་ཤུ [nyi shu]
> "twenty"
>
> 2. བཅིུག <0F56 0F45 0F74 0F72 0F42> [bcuig] as a contraction for
> བཅུ་གཅིག [bcu gcig] "eleven"

Thanks for the clarification. I must have done something wrong when I tried to break Tibetan rendering by an above-below sequence - unless MS Edge denormalises Tibetan text so that it will render.

However, we may be able to redress the balance between the renderers by inserting CGJ between the vowels to preserve the order when the strings are copied:

nyiu ཉི͏ུ 0F49 0F72 034F 0F74
bcuig བཅུ͏ིག 0F56 0F45 0F74 034F 0F72 0F42

On my machine they display without dotted circles in Claws-Mail and LibreOffice, but I may be using too old a version of HarfBuzz. However, the ligaturing is missing in _nyiu_ with CGJ. LibreOffice at least is using Tibetan Machine Uni.

However, in a snapshot of HarfBuzz I pulled in the past few days, both were rendered with dotted circles. This issue is apparently being worked on (https://github.com/harfbuzz/harfbuzz/issues/483).

The forms without CGJ render fine in the two applications.

Richard.
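The effect of CGJ on normalisation can be checked with a short Python sketch. It uses only the standard `unicodedata` module, and the helper name `cps` is invented for the example. It says nothing about how any renderer treats the result, only what normalisation does to the code point order.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER, canonical combining class 0

def cps(text: str) -> str:
    """Format a string as space-separated hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

# bcuig spelled u + i: vowel below (ccc=132) before vowel above (ccc=130).
bcuig = "\u0f56\u0f45\u0f74\u0f72\u0f42"
bcuig_cgj = "\u0f56\u0f45\u0f74" + CGJ + "\u0f72\u0f42"

# Without CGJ, canonical reordering swaps the two vowels.
print(cps(unicodedata.normalize("NFC", bcuig)))      # 0F56 0F45 0F72 0F74 0F42

# With CGJ between them, the vowels are in separate runs and keep their order.
print(cps(unicodedata.normalize("NFC", bcuig_cgj)))  # 0F56 0F45 0F74 034F 0F72 0F42
```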
Re: Fonts and Canonical Equivalence
On Sat, 10 Aug 2019 at 15:46, Richard Wordingham via Unicode wrote:
>
> > Just retested on Windows 10 with
> > a Tibetan font that supports both sequences of vowels, and both
> > sequences display correctly under Harfbuzz (as expected), but only
> > vowel-below followed by vowel-above displays correctly when using
> > built-in Windows rendering.
>
> Does vowel above before vowel below yield a dotted circle?

Yes. Attached are screenshots for two real world examples, one which is logically spelled as i + u, and one as u + i:

1. ཉིུ <0F49 0F72 0F74> [nyiu] as a contraction for ཉི་ཤུ [nyi shu] "twenty"

2. བཅིུག <0F56 0F45 0F74 0F72 0F42> [bcuig] as a contraction for བཅུ་གཅིག [bcu gcig] "eleven"

Andrew
Re: Fonts and Canonical Equivalence
On Sat, 10 Aug 2019 11:22:01 +0100 Andrew West via Unicode wrote:

> On Sat, 10 Aug 2019 at 08:29, Richard Wordingham via Unicode
> wrote:
> >
> > There are similar issues with Tibetan; some fonts do not work
> > properly if a vowel below (ccc=132) is separated from the base of
> > the consonant stack by a vowel above (ccc=130).
>
> It's not that the fonts don't work, it's that some of the rendering
> engines do not apply the OpenType features in the font that support
> both sequences of vowels (vowel-above followed by vowel-below, and
> vowel-below followed by vowel-above).

My observation was based on a Tibetan font that failed when pre-USE HarfBuzz added or changed the normalisation for Tibetan.

> Just retested on Windows 10 with
> a Tibetan font that supports both sequences of vowels, and both
> sequences display correctly under Harfbuzz (as expected), but only
> vowel-below followed by vowel-above displays correctly when using
> built-in Windows rendering.

Does vowel above before vowel below yield a dotted circle? According to the documentation - and the USE may have been improved in undocumented ways - the blwf feature will not apply across a Tibetan sequence of vowel above (VBlw) followed by vowel below (VAbv or CMBlw), but the blws feature will, even if a dotted circle has been added at the boundary.

> It is very frustrating that Windows cannot correctly support the
> display of Tibetan in normalized form, yet Harfbuzz does not have any
> problems. Personally, I think USE is a failed experiment, and I wish
> Microsoft would simply adopt Harfbuzz as the default rendering engine.

From what I've seen from discussions on HarfBuzz, the USE seems to work well for non-Indic scripts and Devanagari clones - possibly even for Bengali clones. It's also a definition that HarfBuzz can fall back on. The problem is that it doesn't address the quirks of scripts, and its anti-spoofing measures are draconian and overdone.

There may well be an issue of funding for the USE - for all I know, it may in part be charity work. If Microsoft gave up on rendering engines, who would write the rendering specifications for HarfBuzz?

I was wondering how the USE might be modified to handle canonical equivalence. The simplest way may be to permute the canonical combining classes, normalise (NFD) according to these classes, and process the rearranged string. That's roughly what HarfBuzz does.

Another technique would be to derive regular expressions that would match any string canonically equivalent to a string matching the original regular expressions and use them instead. (It may be simpler to derive a regular expression that finds matches from amongst normalised strings - that's what my canonical equivalence respecting regular expression does.)

Using a different canonical equivalent to the present one could 'break' fonts whose sets of properly handled strings were not closed under canonical equivalence - which is why I asked the original question.

Richard.
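The "permute the classes, then normalise" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not what HarfBuzz or the USE actually implements. The table `RENDER_CLASS` and the function names are hypothetical, and the single swap shown (Tibetan 130 and 132) is chosen just to make the effect visible.

```python
import unicodedata

# Hypothetical permutation of canonical combining classes: swap 130 and 132
# so that a Tibetan vowel below is ordered before a vowel above.
RENDER_CLASS = {130: 132, 132: 130}

def render_key(ch: str) -> int:
    """Combining class of ch after applying the hypothetical permutation."""
    ccc = unicodedata.combining(ch)
    return RENDER_CLASS.get(ccc, ccc)

def renderer_normalise(text: str) -> str:
    """NFD, then stably re-sort each run of non-starters by permuted class."""
    out, run = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            run.append(ch)          # collect a run of combining marks
        else:
            out.extend(sorted(run, key=render_key))  # sorted() is stable
            run = []
            out.append(ch)
    out.extend(sorted(run, key=render_key))
    return "".join(out)

nyiu = "\u0f49\u0f72\u0f74"               # i + u, already in NFD order
bcuig = "\u0f56\u0f45\u0f74\u0f72\u0f42"  # u + i
print([f"{ord(c):04X}" for c in renderer_normalise(nyiu)])
print([f"{ord(c):04X}" for c in renderer_normalise(bcuig)])
```

Because the mapping is a permutation of the classes, marks that share an original class keep their relative order, so the output stays canonically equivalent to the input. Both spellings of a vowel pair end up in one renderer-defined order.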
Re: Fonts and Canonical Equivalence
On Sat, 10 Aug 2019 at 08:29, Richard Wordingham via Unicode wrote:
>
> There are similar issues with Tibetan; some fonts do not work properly
> if a vowel below (ccc=132) is separated from the base of the
> consonant stack by a vowel above (ccc=130).

It's not that the fonts don't work, it's that some of the rendering engines do not apply the OpenType features in the font that support both sequences of vowels (vowel-above followed by vowel-below, and vowel-below followed by vowel-above).

Just retested on Windows 10 with a Tibetan font that supports both sequences of vowels, and both sequences display correctly under Harfbuzz (as expected), but only vowel-below followed by vowel-above displays correctly when using built-in Windows rendering.

It is very frustrating that Windows cannot correctly support the display of Tibetan in normalized form, yet Harfbuzz does not have any problems. Personally, I think USE is a failed experiment, and I wish Microsoft would simply adopt Harfbuzz as the default rendering engine.

Andrew
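To see which of the two vowel orders is the normalised one, the combining classes can be looked up directly. The following Python sketch uses only the standard `unicodedata` module. The base letter U+0F49 and the helper names are chosen purely for illustration.

```python
import unicodedata

VOWEL_I = "\u0f72"  # TIBETAN VOWEL SIGN I, written above the stack
VOWEL_U = "\u0f74"  # TIBETAN VOWEL SIGN U, written below the stack

def ccc(ch: str) -> int:
    """Canonical combining class of a single character."""
    return unicodedata.combining(ch)

def is_nfc(text: str) -> bool:
    """True if text is unchanged by NFC normalisation."""
    return unicodedata.normalize("NFC", text) == text

above_then_below = "\u0f49" + VOWEL_I + VOWEL_U
below_then_above = "\u0f49" + VOWEL_U + VOWEL_I

print(ccc(VOWEL_I), ccc(VOWEL_U))                          # 130 132
print(is_nfc(above_then_below), is_nfc(below_then_above))  # True False
```

So the vowel-above followed by vowel-below order is the one that normalisation produces.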
Fonts and Canonical Equivalence
I've spun this question off from the issue of what the USE is to do when confronted with the NFC canonical equivalent of a string it will accept when this equivalent does not match its regular expressions when they are applied to strings of characters rather than canonical equivalence classes of strings.

What sort of guidance is there on the streams of characters to be supported by a font with respect to canonical equivalence?

For example, one might think it would suffice for a font to support NFD strings only, but sometimes it seems that the only canonical equivalent that needs to be supported is not the Unicode-defined canonical form, but a renderer-defined canonical form.

For example, when a Tai Tham renderer supports subscripted final consonants, should the font support both the sequences and , or just the one chosen by the rendering engine? The HarfBuzz SEA engine would present the font with the former; font designers had seen rendering failures when Tai Tham text belatedly started being canonically normalised.

There are similar issues with Tibetan; some fonts do not work properly if a vowel below (ccc=132) is separated from the base of the consonant stack by a vowel above (ccc=130).

TUS sees a rendering engine plus a font file (or a set of them) as a single entity, so I don't think it offers much guidance here. It seems tolerant of the loss of precision in placement when a Latin character is rendered as base plus diacritic rather than as a precomposed glyph. One can also pedantically argue that a font is a data file rather than a 'process'. (Additionally, a lot of us get confused by the mens rea aspect of Unicode compliance.)

Richard.
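One way to explore the question is to enumerate the canonically equivalent orderings of a string and test each against a font. The Python sketch below is a rough aid under stated assumptions. The function names are invented, it only reorders runs of combining marks (it does not generate composed or decomposed variants), and it is brute force, so it suits short test strings only.

```python
import unicodedata
from itertools import permutations, product

def split_runs(text: str) -> list[str]:
    """Split text into single starters and maximal runs of combining marks."""
    pieces, run = [], ""
    for ch in text:
        if unicodedata.combining(ch):
            run += ch
        else:
            if run:
                pieces.append(run)
                run = ""
            pieces.append(ch)
    if run:
        pieces.append(run)
    return pieces

def mark_reorderings(text: str) -> list[str]:
    """All mark reorderings of text that are canonically equivalent to it."""
    target = unicodedata.normalize("NFD", text)
    options = []
    for piece in split_runs(text):
        if unicodedata.combining(piece[0]):
            options.append({"".join(p) for p in permutations(piece)})
        else:
            options.append({piece})
    candidates = ("".join(parts) for parts in product(*options))
    return sorted(c for c in candidates
                  if unicodedata.normalize("NFD", c) == target)

# Tibetan vowel above (ccc=130) and vowel below (ccc=132) may be swapped.
for s in mark_reorderings("\u0f49\u0f72\u0f74"):
    print(" ".join(f"{ord(c):04X}" for c in s))
```

A set of strings that a font handles properly would be closed under this kind of reordering only if every string returned for each member is also in the set.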