Re: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Richard Wordingham via Unicode
On Sat, 21 Mar 2020 13:33:18 -0600
Doug Ewell via Unicode  wrote:

> Eli Zaretskii wrote:

> > Emacs uses some of that for supporting charsets that cannot be
> > mapped into Unicode.  GB18030 is one example of such charsets.  The
> > internal representation of characters in Emacs is UTF-8, so it uses
> > 5-byte UTF-8-like sequences to represent such characters.  

> When 137,468 private-use characters aren't enough?

But they aren't private use!  I haven't made any agreement with anyone
about using them.

Additionally, just as some people seem to think that stray UTF-16 code
units should be supported (occasionally declaring UTF-8 implementations
of Unicode standard algorithms to be automatically non-compliant),
there is a case for supporting stray UTF-8 code units.  Emacs supports
the full range of 8-bit byte values - the 128 unified with ASCII and
the other 128 with the high bit set.
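
For the curious, the 'UTF-8 like' extension is mechanical: the classic
UTF-8 bit pattern simply keeps going into 5-byte sequences, which is
room enough for codepoints well beyond U+10FFFF.  A sketch in Python
(my own illustration; the particular internal codepoints Emacs reserves
for raw bytes and unmappable charsets are not shown here):

    def encode_extended_utf8(cp: int) -> bytes:
        # Classic UTF-8 pattern, continued past the 4-byte/U+10FFFF limit.
        if cp < 0x80:
            return bytes([cp])
        for nbytes, lead in ((2, 0xC0), (3, 0xE0), (4, 0xF0), (5, 0xF8)):
            if cp < 1 << (5 * nbytes + 1):   # 2 bytes: 11 bits, ..., 5 bytes: 26 bits
                trail = [0x80 | ((cp >> (6 * i)) & 0x3F)
                         for i in range(nbytes - 2, -1, -1)]
                return bytes([lead | (cp >> (6 * (nbytes - 1)))] + trail)
        raise ValueError('needs more than 5 bytes')

    print(encode_extended_utf8(0x20AC).hex())    # 'e282ac', ordinary UTF-8
    print(encode_extended_utf8(0x3FFF80).hex())  # 'f88fbfbe80', a 5-byte sequence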

> What characters exist in GB18030 that don't
> exist in Unicode, and have they been proposed for Unicode yet, and
> why was none of the PUA space considered appropriate for that in the
> meantime?

Doesn't GB18030 appropriate some of the PUA for Tibetan (and quite
possibly other complex scripts)?  I haven't looked up how Emacs
handles this. 

Richard.


Re: Is the binaryness/textness of a data format a property?

2020-03-20 Thread Richard Wordingham via Unicode
On Fri, 20 Mar 2020 13:46:25 +0100
Adam Borowski via Unicode  wrote:

> On Fri, Mar 20, 2020 at 12:21:26PM +, Costello, Roger L. via
> Unicode wrote:
> > [Definition] Property: an attribute, quality, or characteristic of
> > something.
> > 
> > JPEG is a binary data format.
> > CSV is a text data format.
> > 
> > Question #1: Is the binaryness/textness of a data format a
> > property? 
> > 
> > Question #2: If the answer to Question #1 is yes, then what is the
> > name of this binaryness/textness property?  

I'd suggest 'texthood' as the correct English term.

> I'm afraid this question is too fuzzy to have a proper answer.
> 
> For example, most Unix-heads will tell you that UTF16LE is a binary
> rather than text format.  Microsoft employees and some members of
> this list will disagree.

Some files change type on changing operating system.  Digital's old RMS
included among its basic text formats files in which each record
(roughly a line) started with a binary 2-byte length field.  Text
records on magnetic tape typically started with an ASCII length count!

> Then you have Postscript -- nothing but basic ASCII, yet utterly
> unreadable for a (sane) human.

No worse than a hex dump - in fact, a lot more readable.  Indeed, are
you not aware of the concept of a write-only programming language? 

> If you want _my_ definition of a file being _technically_ text, it's:
> * no bytes 0..31 other than newlines and tabs (even form feeds are out
>   nowadays)
> * correctly encoded for the expected charset (and nowadays, if that's
> not UTF-8 Unicode, you're doing it wrong)
> * no invalid characters

Unassigned characters are perfectly reasonable in a text file.  Surely
you aren't saying that a text file using the characters new to Unicode
13.0 should, at present, usually be regarded as a binary file?
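
Adam's criteria are nonetheless mechanical enough to check; a minimal
sketch in Python, assuming UTF-8 and (my addition) allowing CR for
CRLF files:

    def is_technically_text(data: bytes) -> bool:
        # No control bytes other than newlines and tabs.
        if any(b < 0x20 and b not in (0x09, 0x0A, 0x0D) for b in data):
            return False
        # Correctly encoded UTF-8 with no ill-formed sequences.  Note
        # that decoding accepts unassigned code points, so text using
        # characters new to Unicode 13.0 still counts as text.
        try:
            data.decode('utf-8')
        except UnicodeDecodeError:
            return False
        return True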

> But besides this narrow technical meaning -- is a Word document
> "text"? And if it is, why not Powerpoint?  This all falls apart.

Well, a .docx file isn't text - it's a variety of ZIP file, which is
binary.  Indeed, as Word files naturally include pictures, it very much
isn't a text file.  A .doc file is more like an image dump of a file
system.  A .rtf file, on the other hand, probably is a text file -
though I've a feeling there are variants that aren't *A*SCII.

Richard.


Re: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Richard Wordingham via Unicode
On Fri, 21 Feb 2020 15:53:52 +
"Costello, Roger L. via Unicode"  wrote:

> Based on a private correspondence, I now realize that this statement:
> 
> > Text files do not contain binary  
> 
> is not correct.
> 
> Text files may indeed contain binary (i.e., bytes that are not
> interpretable as characters). Namely, text files may contain
> newlines, tabs, and some other invisible things.
> 
> Question: "characters" are defined as only the visible things, right?

No, white space (e.g. spaces, tabs and newlines) is normally considered
to be composed of characters.  And then there are much harder to discern
things, such as zero-width spaces, line-break suppressors such as
U+2060 WORD JOINER, and soft hyphens (interpreted as line-break
opportunities).
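
That all of these invisible things are ordinary characters, with
ordinary property entries, is easy to confirm; for instance with
Python's unicodedata:

    import unicodedata

    for ch in '\t', '\n', ' ', '\u00AD', '\u200B', '\u2060':
        print('U+%04X' % ord(ch),
              unicodedata.category(ch),
              unicodedata.name(ch, '(unnamed control)'))
    # Tab and newline are Cc, the space Zs, and the soft hyphen,
    # zero-width space and word joiner are Cf - characters all.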

Richard.


Re: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread Richard Wordingham via Unicode
On Thu, 13 Feb 2020 20:15:07 +
Shawn Steele via Unicode  wrote:

> I confess that even though I know nothing about Hieroglyphs, I find
> it fascinating that such a thoroughly dead script might still be
> living in some way, even if it's only a little bit.

Plenty of people have learnt how to write their name in hieroglyphs.
However, it is rare enough that my initials suffice to label my milk at
work.

What's more striking is the implication that people are still
exchanging messages in Middle Egyptian.

Richard.


Re: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread Richard Wordingham via Unicode
On Thu, 13 Feb 2020 10:18:40 +0100
Hans Åberg via Unicode  wrote:

> > On 13 Feb 2020, at 00:26, Shawn Steele 
> > wrote: 
> >> From the point of view of Unicode, it is simpler: If the character
> >> is in use or have had use, it should be included somehow.  
> > 
> > That bar, to me, seems too low.  Many things are only used briefly
> > or in a private context that doesn't really require encoding.  
> 
> That is a private use area for more special use.

Writing the plural ('Egyptologists') with the plural strokes below
the glyph could be difficult if the renderer won't include them in the
same script run.

Richard.



Re: Combining Marks and Variation Selectors

2020-02-02 Thread Richard Wordingham via Unicode
On Sun, 2 Feb 2020 16:20:07 -0800
Eric Muller via Unicode  wrote:

> That would imply some coordination among variations sequences on
> different code points, right?
> 
> E.g. <0B48> ≡ <0B47, 0B56>, so a variation sequence on 0B56 (Mn,
> ccc=0) would imply the existence of a variation sequence on 0B48 with
> the same variation selector, and the same effect.

That particular case oughtn't to be impossible, as in NFD everything in
sight has ccc=0.  However, TUS 12.0 Section 23.4 does contain an
additional prohibition against meaningfully applying a variation
selector to a 'canonical decomposable character'.  (Scare quotes because
'ly' seems to be missing from the phrase.)
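
The coordination Eric describes is easy to see mechanically; a Python
sketch (the variation sequence on U+0B56 is hypothetical - none is
defined):

    import unicodedata

    # U+0B48 ORIYA VOWEL SIGN AI decomposes canonically to <U+0B47, U+0B56>.
    assert unicodedata.normalize('NFD', '\u0B48') == '\u0B47\u0B56'
    # A selector following U+0B56 ends up following U+0B48 under NFC,
    # so the two variation sequences could not be defined independently.
    assert unicodedata.normalize('NFC', '\u0B47\u0B56\uFE00') == '\u0B48\uFE00'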

Richard.

> On 2/2/2020 11:43 AM, Mark Davis ☕️ via Unicode wrote:
> I don't think there is a technical reason for disallowing variation
> selectors after any starters (ccc=000); the normalization algorithm
> doesn't care about the general category of characters.
> 
> Mark



Re: Combining Marks and Variation Selectors

2020-02-02 Thread Richard Wordingham via Unicode
On Sun, 2 Feb 2020 07:51:56 -0800
Ken Whistler via Unicode  wrote:

> What it comes down to is avoidance of conundrums involving canonical 
> reordering for normalization. The effect of variation selectors is 
> defined in terms of an immediate adjacency. If you allowed variation 
> selectors to be defined for combining marks of ccc!=0, then 
> normalization of sequences could, in principle, move the two apart.
> That would make implementation of the intended rendering much more
> difficult.

I can understand that for non-starters.  However, a lot of non-spacing
combining marks are starters (i.e. ccc=0), so they would not be a
problem.   is an unbreakable block in
canonical equivalence-preserving changes.  Is this restriction therefore
just a holdover from when canonical equivalence could be corrected?
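
For non-starters, Ken's conundrum is easy to reproduce; a Python sketch
with a hypothetical selector on U+0323 COMBINING DOT BELOW (ccc=220):

    import unicodedata

    s = 'a\u0301\u0323\uFE00'   # acute (ccc=230), then <U+0323, VS1>
    assert unicodedata.normalize('NFD', s) == 'a\u0323\u0301\uFE00'
    # Canonical reordering moves U+0323 forward but leaves the selector
    # (ccc=0) in place, so it now follows U+0301 instead of its mark.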

Richard.


Re: Combining Marks and Variation Selectors

2020-02-01 Thread Richard Wordingham via Unicode
On Sat, 1 Feb 2020 17:59:57 -0800
Roozbeh Pournader via Unicode  wrote:

> They are actually allowed on combining marks of ccc=0. We even define
> one such variation sequence for Myanmar, IIRC.
> 
> On Sat, Feb 1, 2020, 2:12 PM Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:  
> 
> > Why are variation selectors not allowed for combining marks?  I can
> > see a reason for them not being allowed on characters with non-zero
> > canonical combining classes, but not for them being prohibited for
> > combining marks that are starters, i.e. have ccc=0.

Ah, I missed that change from Version 5.0, where the restriction was,
'The base character in a variation sequence is never a combining
character or a decomposable character'.  I now need to rephrase the
question.  Why are marks other than spacing marks prohibited?

Richard. 



Combining Marks and Variation Selectors

2020-02-01 Thread Richard Wordingham via Unicode
Why are variation selectors not allowed for combining marks?  I can see
a reason for them not being allowed on characters with non-zero
canonical combining classes, but not for them being prohibited for
combining marks that are starters, i.e. have ccc=0.

Richard.


Adding Experimental Control Characters for Tai Tham

2020-01-25 Thread Richard Wordingham via Unicode
This topic is very similar to the recent topic "How to make custom
combining diacritical marks for arabic letters?".

There is a suggestion that the encoding of Tai Tham syllables be
changed
(https://www.unicode.org/L2/L2019/19365-tai-tham-structure.pdf, by
Martin Hosken), and there is a strong desire to experiment with it.
However, unless it is to proscribe good rendering, it needs at least
two extra 'control' characters, which have been suggested as:

1A8E TAI THAM SIGN INITIAL
1A8F TAI THAM SIGN FINAL

These would follow a subscript character.  In simple cases, they
would indicate whether the subscript is part of the onset or part of
the coda of a syllable.

The idea that has been floated is that the experimentation be done by
changing the renderer, which is invoked by various applications.

However, there is the problem of script runs - these characters are not
yet in the Tai Tham script, and most applications lack a mechanism
for assigning PUA characters to a script.

However, there is a set of inherited characters which in a Tai Tham
context have not yet been assigned any meaning - the variation
selectors.  I have experimented with them, and at least in the older
versions of the HarfBuzz renderer (near Version 1.2.7), they do not
cause any problems with the implementation of the USE - no dotted
characters arise, and they can interact in shaping as suggested by a
font.

How inappropriate would it be to usurp a pair of variation selectors
for this purpose?  For mnemonic purposes, I would suggest usurping

FE0E VARIATION SELECTOR-15 for *1A8E TAI THAM SIGN INITIAL
FE0F VARIATION SELECTOR-16 for *1A8F TAI THAM SIGN FINAL

I can think of the following relevant factors:

(a) It is a maxim of English law that a person intends the reasonable
foreseeable consequences of his actions.  By allowing grapheme cluster
boundaries between script changes, the UTC can hardly complain
loudly about inherited characters being usurped.

(b) Most subscript consonants are defined by SAKOT plus a base
consonant, and therefore the suggested control characters have the
nature of variation sequences.  The effect of these characters is,
though, mostly on how other characters are positioned relative to them,
rather than directly on the subscript characters themselves.

(c) There are 7 subscript consonants that are represented by single
characters:

U+1A55 TAI THAM CONSONANT SIGN MEDIAL RA
This seems not to need marking for position relative to the nucleus.
If it did, the marking up of logical order ᩉᩕ᩠ᩅ᩠᩶ᨿ  /huai/  'brook' as semi-visual
order 
would not be so simple, as SIGN FINAL should not apply to the leftmost
character, MEDIAL RA.

U+1A56 TAI THAM CONSONANT SIGN MEDIAL LA
This will have to be excluded from the experiment.  It is very rare as
a final consonant, and I suspect its exclusion will have no effect on
the experiment.

U+1A57 TAI THAM CONSONANT SIGN LA TANG LAI
This appears to be restricted to a single word, so its exclusion should
not matter at all.

U+1A5B TAI THAM CONSONANT SIGN HIGH RATHA OR LOW PA
Bizarrely, L2-19/365 treats this as a consonant modifier!  As the USE
does not require consonant modifiers to be applied to the base
consonant, this ought to have no adverse effects.  The combination
 frequently acts as a single consonant trespassing
on the territory of HIGH RATHA, but my suggestion that the sequence be
encoded as a precomposed character was rejected.

As far as I can tell, U+1A5B is always part of the phonetic onset.  The
only case where one might need these control characters would be an
implausible contraction *ᩁᩢ᩠ᨭᩛᩣ /rat tʰaː/ logical order  parallel to Lao contraction
ᨣᩢ᩠ᩅᩣ /kʰan waː/ 'if' logical order  undisambiguated semi-visual order , which for Lao is rendered differently to ᨣ᩠ᩅᩢᩣ /kʰwaːk/ logical
order .  Now, the
disambiguated semi-visual order encoding for *ᩁᩢ᩠ᨭᩛᩣ is .  This is consistent with the USE if SIGN FINAL
is a variation selector, but is a seemingly needless flaw in L2-19/365
Section 5.1.1.

U+1A5C TAI THAM CONSONANT SIGN MA
This character seems only to occur immediately following
akshara-initial MA, so I think there are no issues.

U+1A5D TAI THAM CONSONANT SIGN BA
This sign is of very limited occurrence in Northern Thai.  In Lao, it
can occur as the subscript of a base consonant acting as a mater
lectionis, but I cannot see any scope for needing to mark the role of
the mark for proper rendering. 

U+1A5E TAI THAM CONSONANT SIGN SA
As this is a non-spacing mark principally used as a coda consonant, it
seems unlikely that we would need to mark the role at the experimental
stage.

(d) This scheme does not address the representation of the sequences
 and .  The best ideas I
have are the totally hacky sequences  and .

Richard.



Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2020-01-04 Thread Richard Wordingham via Unicode
On Sat, 4 Jan 2020 22:15:59 +
James Kass via Unicode  wrote:

> For the Grantha examples above, Grantha (1) displays much better
> here. It seems daft to put a spacing character between a base
> character and any mark which is supposed to combine with the base
> character.

Although it's not related to this issue, that happens in the USE scheme.
It puts vowels before vowel modifiers, which has this problem if any of
the vowel modifiers precede a vowel in visual order, as happens in Thai
and closely related writing systems. 

Richard.


Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2020-01-04 Thread Richard Wordingham via Unicode
On Thu, 2 Jan 2020 20:20:34 +
Richard Wordingham via Unicode  wrote:

> There's a project, whose basis I can't find, to convert at least
> Indian Indic rendering to use the USE.  Now, according to the
> specification of the USE, visarga, anusvara and cantillation marks
> are all classified as vowel modifiers, and are so ordered relative to
> one another in the Indian Indic order: left, top, bottom, right.  So,
> the problem should already be solved for Grantha, and, if the plans
> come to fruition, will work with a font whose Devanagari script tag
> is 'dev3'.  However, I may have overlooked a set of overrides to the
> USE categorisations.

I've now knocked up a partial* representation* of a Devanagari dev3 and
a Grantha font (which I'm dubbing 'Mock Indic 3').  The supported
orders of COMBINING DIGIT ONE and VISARGA, as in Firefox on
Linux, are:

dev2:    कः꣡
dev3:    क꣡ः
Grantha: (1) ጕ፧ጃ
         (2) ጕጃ፧

The second Grantha spelling is enabled by a Harfbuzz-only change to
the USE categorisations.  It treats Grantha visarga and spacing
anusvara as though inpc=Top rather than inpc=Right.  As I am using
Ubuntu 16.04, this override isn't supported in applications that use the
system HarfBuzz library, such as my email client.

We are now establishing incompatible Devanagari font-specific
encodings fully compliant with TUS!

Richard.

* Partial = much is not handled
* Representation = glyphs are wrong, merely showing arrangement.  (I've
  actually re-used a Tai Tham font.)



Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2020-01-02 Thread Richard Wordingham via Unicode
On Thu, 2 Jan 2020 15:07:04 -0800
Norbert Lindenberg  wrote:

>> On Jan 2, 2020, at 12:20, Richard Wordingham via Unicode
>>  wrote:  

>> So, the problem should already be solved for Grantha, and,
>> if the plans come to fruition, will work with a font whose
>> Devanagari script tag is 'dev3'.  However, I may have overlooked a
>> set of overrides to the USE categorisations.  

> You can create Indic 3 fonts that get processed by the USE today, and
> use them with Harfbuzz (Chrome, Firefox, Android, …) and with
> CoreText (Apple platforms). I don’t know if anybody has already
> created such fonts.
> https://lindenbergsoftware.com/en/notes/brahmic-script-support-in-opentype/

Is there a script tag registry, or is it now a free-for-all as with font
names?  (I suppose it is implicitly constrained by what the individual
renderers recognise.)

The nearest to a registry I can find is at
https://docs.microsoft.com/en-us/typography/opentype/spec/ttoreg, but
that appears to be limited to what Microsoft supports - "The tag
registry defines the OpenType Layout tags that Microsoft supports".
None of the Indic 3 script tags are there.

Richard.




Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2020-01-02 Thread Richard Wordingham via Unicode
On Thu, 2 Jan 2020 07:52:55 +
James Kass via Unicode  wrote:

>  > I've been looking at Microsoft's specification of Devanagari
>  > character order.  In
>  >   
> https://docs.microsoft.com/en-us/typography/script-development/devanagari,
>  > the consonant syllable ends
>  >
>  > [N]+[A] + [< H+[] | {M}+[N]+[H]>]+[SM]+[(VD)]
>  >
>  > where
>  > N is nukta
>  > A is anudatta (U+0952)
>  > H is halant/virama
>  > M is matra
>  > SM is syllable modifier signs
>  > VD is vedic
>  >
>  > "Syllable modifier signs" and "vedic" are not defined.  It appears
>  > that SM includes U+0903 DEVANAGARI SIGN VISARGA.  
> 
> What action should Microsoft take to satisfy the needs of the user 
> community?
> 1.  No action, maintain status quo.
> 2.  Swap SM and VD in the specs ordering.
> 3.  Make new category PS (post-syllable) and move VISARGA/ANUSVARA
> there.
> 4.  ?

There's a project, whose basis I can't find, to convert at least Indian
Indic rendering to use the USE.  Now, according to the specification
of the USE, visarga, anusvara and cantillation marks are all classified
as vowel modifiers, and are so ordered relative to one another in the
Indian Indic order: left, top, bottom, right.  So, the problem should
already be solved for Grantha, and, if the plans come to fruition, will
work with a font whose Devanagari script tag is 'dev3'.  However, I may
have overlooked a set of overrides to the USE categorisations.

> What kind of impact would there be on existing data if Microsoft
> revised the ordering?

A good question that *I* can't answer.

> Or should Unicode encode a new character like ZERO-WIDTH INVISIBLE 
> DOTTED CIRCLE so that users can suppress unwanted and unexpected
> dotted circles by adding superfluous characters to the text stream?

It would be useful to be able to suppress inappropriate dotted circles
without disrespecting the character identity of U+25CC.  (Doable
in HarfBuzz, but not in OpenType.)  There's actually been a suggestion
that dotted circles should be applied after global substitutions have
been applied, so as to prevent the overcoming of renderer faults.

On Sat, 21 Dec 2019 11:57:53 +0530
Shriramana Sharma via Unicode  wrote:

> This is all the more so since in some Vedic contexts (Sama Gana) the
> visarga is far separated from the syllable by other syllables like
> digits (themselves carrying combining marks) or spacing anusvara, as
> seen in examples from my Grantha proposal L2/09-372 p 40.

I presume you're referring to the middle picture.  I'm having difficulty
reading it.  Could you please tell us its transcription and encoding?

A minimal change would be to extend the range of base characters to
include digits - I'm surprised matras don't frequently get added to
them.

Richard.



Re: One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)

2020-01-01 Thread Richard Wordingham via Unicode
On Wed, 1 Jan 2020 20:11:04 +
James Kass via Unicode  wrote:

> On 2020-01-01 11:17 AM, Richard Wordingham via Unicode wrote:
> 
>  > That's exactly the sort of mess that jack-booted renderers are
>  > trying to minimise.  Their principle is that there should be only
>  > one encoding per shape, though to be fair:
>  >
>  > 1) some renderers accept canonical equivalents.
>  > 2) tolerance may be allowed for ligating (ZWJ, ZWNJ, CGJ),
>  > collating (CGJ, SHY) and line-breaking controls (SHY, ZWSP, WJ).
>  > 3) Superseded chillu encodings are still supported.  
> 
> There was never any need for atomic chillu form characters.  

> The 
> principle of only one encoding per shape is best achieved when every 
> shape gets an atomic encoding.

I should have written per-word shape.  I should also have added that
most renderers attempt to handle Mongolian, despite its encoding
Middle Mongolian phonetics rather than characters. Also, they don't
attempt to sort out the Arabic script's per-language subsets, which
leads to a bad mess at Wiktionary when Unicode characters differ only in
a few forms.

> Glyph-based encoding is incompatible 
> with Unicode character encoding principles.

Visual encoding sometimes works - phonetic order for Thai is so
complicated that it is unsurprising that its definition is partly
missing from Unicode 1.0.  The official history hides behind
incompatibility with the Thai national standard, but phonetic order was
simply too complicated for Thai.  Additionally, Thais don't agree on
where preposed vowels go relative to Pali consonant clusters - they
don't agree that all of them should appear in the middle of the
cluster.  (I suppose the positioning rule could have been made a
stylistic feature of fonts.)

An analogue is Lao collation.  While syllable boundaries can
overwhelmingly be discerned in modern Lao, Lao collations are too
complicated to be accepted for ICU if they are to support anything but
single syllables.  CLDR collation (interpreted as a specification with
the normal use of specification language for the form of definitions)
can just cope, whereas the UCA can't, but the tables are huge. 

Richard.



Re: One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)

2020-01-01 Thread Richard Wordingham via Unicode
On Wed, 1 Jan 2020 23:09:49 +
James Kass via Unicode  wrote:

> On 2020-01-01 8:11 PM, James Kass wrote:
> > It’s too bad that ISCII didn’t accomodate the needs of Vedic
> > Sanskrit, but here we are.  
> 
> Sorry, that might be wrong to say.  It's possible that it's Unicode's 
> adaptation of ISCII that hinders Vedic Sanskrit.

Have you found a definition of the ISCII handling of Vedic characters?

The problem lies in Unicode's failure to standardise the encoding of
Devanagari text.  But for the consistent failure to include a
standardisation of text in a script in TUS, one might wonder if the
original idea was to duck the issue by resorting to canonical
equivalence.

I've been looking at Microsoft's specification of Devanagari character
order.  In
https://docs.microsoft.com/en-us/typography/script-development/devanagari,
the consonant syllable ends

[N]+[A] + [< H+[] | {M}+[N]+[H]>]+[SM]+[(VD)]

where
N is nukta
A is anudatta (U+0952)
H is halant/virama
M is matra
SM is syllable modifier signs
VD is vedic

"Syllable modifier signs" and "vedic" are not defined.  It appears that
SM includes U+0903 DEVANAGARI SIGN VISARGA.

I note that even ग॒ः  is
given a dotted circle by HarfBuzz.  Now, this might not be an entirely
fair test; I suspect anudatta is assigned this position because
originally the Sindhi implosives were encoded as consonant plus nukta
and anudatta, though rendering still fails with HarfBuzz when nukta is
inserted (ग़॒ः).
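
Whichever position a spec assigns, the two orders are distinct strings
even after normalization, because the visarga is itself a starter and
blocks reordering; a quick check (my sketch):

    import unicodedata

    ga, anudatta, visarga = '\u0917', '\u0952', '\u0903'   # ccc 0, 220, 0
    a = ga + anudatta + visarga
    b = ga + visarga + anudatta
    assert unicodedata.normalize('NFC', a) == a   # both orders are already
    assert unicodedata.normalize('NFC', b) == b   # in normal form,
    assert a != b                                 # and they stay distinct.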

Richard.




Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2020-01-01 Thread Richard Wordingham via Unicode
On Wed, 1 Jan 2020 01:19:02 +
James Kass via Unicode  wrote:

> A workaround until some kind of satisfactory adjustment is made might
> be to simply use COLON for VISARGA.  Or...
> 
>   VISARGA ⇒ U+02F8 MODIFIER LETTER RAISED COLON
>   ANUSVARA ⇒ U+02D9 DOT ABOVE
> 
> ...as long as the font(s) included both those characters.
> 
> य॑ यॆ॑
> 
> य॑ं -- anusvara last
> यॆ॑ं -- "
> 
> य॑: -- colon last
> यॆ॑: -- "
> 
> य॑˸ -- raised colon modifier last
> यॆ॑˸ -- "
> 
> य॑˙ -- spacing dot above last
> यॆ॑˙ -- "
> 

That's exactly the sort of mess that jack-booted renderers are trying
to minimise.  Their principle is that there should be only one encoding
per shape, though to be fair:

1) some renderers accept canonical equivalents.
2) tolerance may be allowed for ligating (ZWJ, ZWNJ, CGJ), collating
(CGJ, SHY) and line-breaking controls (SHY, ZWSP, WJ).
3) Superseded chillu encodings are still supported.

Richard.



Re: NBSP supposed to stretch, right?

2019-12-20 Thread Richard Wordingham via Unicode
On Fri, 20 Dec 2019 17:25:17 +0530
Shriramana Sharma via Unicode  wrote:

> So I never asked for NBSP to disappear. I said I want it to *stretch*.
> And to my mind "stretch" means to become wider than one's normal
> width. It doesn't include decreasing or disappearing width.

Don't spaces sometimes shrink?  I thought they did in some 'show codes'
modes.

> I don't expect NBSP to ever disappear, because spaces disappear only
> at linebreaks, and NBSP simply doesn't stand at linebreaks.

I can certainly imagine someone writing "".

Richard.



Re: NBSP supposed to stretch, right?

2019-12-17 Thread Richard Wordingham via Unicode
On Tue, 17 Dec 2019 06:20:39 +0530
Shriramana Sharma via Unicode  wrote:

> Hello. I've just tested LibreOffice, Google Docs and MS Office on
> Linux, Android and Windows, and it seems that NBSP doesn't get
> stretched like the normal space character when justified alignment
> requires it.
> 
> Let me explain. I'm creating a document with the following text
> typeset in 12 pt Lohit Tamil with justified alignment on an A5 page
> with 0.5" margin all around:
> 
> ஶ்ரீமத் மஹாபாரதம் என்பது நமது தேசத்தின் பெரும் இதிஹாஸமாகும். இதனை
> இயற்றியவர் ஶ்ரீ வேத வ்யாஸர். அவரால் அனுக்ரஹிக்கப்பட்டவையான நூல்கள் பல.
> 
> The screenshot
> https://sites.google.com/site/jamadagni/files/temp/nbsp-not-expanding.png
> may be useful to illustrate the situation. Readers may try such
> similar sentences in any software/platform of their choice and report
> as to what happens.
> 
> Here the problem arises with the phrase ஶ்ரீ வேத வ்யாஸர். The word
> ஶ்ரீ is a honorific applying to the following name of the sage வேத
> வ்யாஸர், so it would seem unsightly to the reader if it goes to the
> previous line, so I insert an NBSP between it and the name. (Isn't
> there such a stylistic convention in English where Mr doesn't stand at
> the end of a line? I don't know.)

It's not widely taught in so far as it exists.  I would avoid
placing the word at the end in wide columns, just as I suppress line
breaks in 'Figure 7' and '17 December', but I only apply it to short
adjuncts. However, I would find the use of narrower spacing somewhere
between acceptable and desirable.  Thai has a similar rule, where there
is generally no space between title and forename, but an obligatory
space between forename and surname.  To me, this is a continuation of
the principle that line-breaks within phrases make them more difficult
to understand.

> However, the phrase is shortly followed by a long word
> அனுக்ரஹிக்கப்பட்டவையான, which is too long to fit on the same line and
> hence goes to the next line, thereby increasing the inter-word spacing
> on its previous line significantly. But the NBSP after the honorific
> doesn't stretch, making the word layout unsightly.

The strategies to deal with this general problem in English are
hyphenation and abandoning justification.  In this particular case,
your text would benefit from using Knuth's algorithm for justification.

> IIUC, no-break space is just that: a space that doesn't permit a line
> break. This says nothing about it being fixed width.
> 
> Unicode 12.0 §2.3 on p 27 (55 of PDF) says:

You're assuming that TUS is a standard.  It's much more a collection of
influential recommendations.

Richard.



Re: Proposal to add Roman transliteration schemes to ISO 15924.

2019-12-03 Thread Richard Wordingham via Unicode
On Tue, 3 Dec 2019 17:35:14 +0530
विश्वासो वासुकिजः (Vishvas Vasuki) via Unicode 
wrote:

> On Tue, Dec 3, 2019 at 3:48 PM Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:  

> > On Tue, 3 Dec 2019 02:05:35 +
> > Richard Wordingham via Unicode  wrote:  

> > The text in IAST that I encounter seems not to have anusvara before
> > stop consonants.  

> That's typical.
> Whatever the source script (if there is one), IAST tends to be used by
> people who follow the sanskrit devanAgarI conventions pretty strictly
> (so ends up being transcription rather than transliteration.)
 
> > I believe 'sa' would naturally expand (are there
> > non-void prescribed rules on this?) as sa-Deva-IN, so perhaps the
> > sa-Latn I usually see is unusual as sa-t-m0-iast and the description
> > should be expanded to at least sa-t-m0-sa-150-iast if sa-Latn is not
> > precise enough.

> Not sure what 150 is doing there..

I read, albeit in an old book, that when Sanskrit was printed in
Devanagari, clusters phonetically composed of nasal plus plosive were
written in Europe using the nasal consonant, but in India were printed using
anusvara.  The Sanskrit version of the UN Declaration of Human Rights
at Unicode (https://unicode.org/udhr/d/udhr_san.html) conforms to this
pattern by using anusvara instead of clusters, but I don't know where
the translation actually came from.

Accordingly, I thought that to get clusters instead of anusvara before
plosives, I should select Sanskrit as used in Europe, as opposed to
Sanskrit as used in India.  '150' is the region code for Europe.

Richard.




Re: Proposal to add Roman transliteration schemes to ISO 15924.

2019-12-03 Thread Richard Wordingham via Unicode
I think the 'Latn' in sa-Latn-t-sa-m0-iast is unnecessary, though it
partly depends on the range of the IAST transform.  If the
transformation can only convert to the Roman script then 'Latn' is
superfluous; I'm not sure if the extension is formally enough to rule
out Devanagari.  On the other hand, some people seem to think that
there is an IAST transformation to Cyrillic. 
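
For readers unfamiliar with extension T, such a tag splits mechanically
into a target tag and a transform part; a toy sketch (not a validator):

    def split_t_extension(tag: str):
        # Crude split at the -t- singleton; not a full RFC 6497 parser.
        parts = tag.lower().split('-')
        i = parts.index('t')
        return parts[:i], parts[i + 1:]

    main, t = split_t_extension('sa-Latn-t-sa-m0-iast')
    print(main)   # ['sa', 'latn']       - target: Sanskrit, Latin script
    print(t)      # ['sa', 'm0', 'iast'] - source 'sa', mechanism m0 = iast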

However, as a locale for generated text, I feel it is inadequate.
Wouldn't the expansion rules generate saṃti from संति rather than santi
from सन्ति for 'they are'? Or have better fonts changed Indian practice?

Richard.



Re: Proposal to add Roman transliteration schemes to ISO 15924.

2019-12-03 Thread Richard Wordingham via Unicode
On Tue, 3 Dec 2019 02:05:35 +
Richard Wordingham via Unicode  wrote:


> I'm still trying to work out what to do for IAST.  Is it just:
> 
> sa-t-m0-iast
> 
> if one finds that
> 
> sa-Latn
> 
> allows too much latitude?

For material that is a transcription rather than a transliteration, are
there regional preferences for the homorganic nasals when writing in
the writing systems generated by IAST?

> How does one choose between anusvara and specific consonants
> for homorganic nasals? Is it sa-150-t-m0-iast v. sa-IN-t-m0-iast?

As these are, strictly speaking, locale definitions, I think I put the
region in the wrong place.  Perhaps they should be:

sa-t-m0-sa-150-Deva-iast v. sa-t-m0-sa-IN-Deva-iast

As a locale, is the latter the same as sa-t-m0-sa-IN-Mlym?  I'm not
sure how the preference for writing homorganic nasals varies by region
and by script.  What is the scope of IAST?  Does sa-t-m0-sa-Thai
exist?  sa-Thai seems to prefer the nasal stops to anusvara before
oral stops.

The text in IAST that I encounter seems not to have anusvara before
stop consonants.  I believe 'sa' would naturally expand (are there
non-void prescribed rules on this?) as sa-Deva-IN, so perhaps the
sa-Latn I usually see is unusual as sa-t-m0-iast and the description
should be expanded to at least sa-t-m0-sa-150-iast if sa-Latn is not
precise enough.

Can someone advise?

Richard.


Re: Proposal to add Roman transliteration schemes to ISO 15924.

2019-12-02 Thread Richard Wordingham via Unicode
On Tue, 3 Dec 2019 01:27:39 +
Richard Wordingham  wrote:

> On Mon, 2 Dec 2019 09:09:02 -0800
> Markus Scherer via Unicode  wrote:
> 
> > On Mon, Dec 2, 2019 at 8:42 AM Roozbeh Pournader via Unicode <  
> > unicode@unicode.org> wrote:
> >   
> > > You don't need an ISO 15924 script code. You need to think in
> > > terms of BCP 47. Sanskrit in Latin would be sa-Latn.
> > >
> > 
> > Right!
> > 
> > Now, if you want to distinguish the different transcription systems
> > for  
> > > writing Sanskrit in Latin, you can apply to registry a BCP 47
> > > variant. There are also BCP 47 extension T, which may also be
> > > useful to you:
> > >
> > > https://tools.ietf.org/html/rfc6497
> > >
> > 
> > And that extension is administered by Unicode, with documentation
> > and data here:
> > http://www.unicode.org/reports/tr35/tr35.html#t_Extension  
> 
> But that says that the definitions are at
> https://github.com/unicode-org/cldr/releases/tag/latest/common/bcp47/transform.xml ,
> but all one currently gets from that is an error message 'XML Parsing
> Error: no element found'.

A working URI is
https://github.com/unicode-org/cldr/blob/master/common/bcp47/transform.xml .

I'm still trying to work out what to do for IAST.  Is it just:

sa-t-m0-iast

if one finds that

sa-Latn

allows too much latitude?

How does one choose between anusvara and specific consonants
for homorganic nasals? Is it sa-150-t-m0-iast v. sa-IN-t-m0-iast?

Richard.


Re: Proposal to add Roman transliteration schemes to ISO 15924.

2019-12-02 Thread Richard Wordingham via Unicode
On Mon, 2 Dec 2019 09:09:02 -0800
Markus Scherer via Unicode  wrote:

> On Mon, Dec 2, 2019 at 8:42 AM Roozbeh Pournader via Unicode <
> unicode@unicode.org> wrote:  
> 
> > You don't need an ISO 15924 script code. You need to think in terms
> > of BCP 47. Sanskrit in Latin would be sa-Latn.
> >  
> 
> Right!
> 
> Now, if you want to distinguish the different transcription systems
> for
> > writing Sanskrit in Latin, you can apply to registry a BCP 47
> > variant. There are also BCP 47 extension T, which may also be
> > useful to you:
> >
> > https://tools.ietf.org/html/rfc6497
> >  
> 
> And that extension is administered by Unicode, with documentation and
> data here:
> http://www.unicode.org/reports/tr35/tr35.html#t_Extension

But that says that the definitions are at
https://github.com/unicode-org/cldr/releases/tag/latest/common/bcp47/transform.xml ,
but all one currently gets from that is an error message 'XML Parsing
Error: no element found'.


Re: A neat description of encoding characters

2019-12-02 Thread Richard Wordingham via Unicode
On Mon, 2 Dec 2019 12:01:52 +
"Costello, Roger L. via Unicode"  wrote:

> From the book titled "Computer Power and Human Reason" by Joseph
> Weizenbaum, p.74-75
> 
> Suppose that the alphabet with which we wish to concern ourselves
> consists of 256 distinct symbols...

Why should I wish to concern myself with only one alphabet?

Richard.


Re: Is the Unicode Standard "The foundation for all modern software and communications around the world"?

2019-11-20 Thread Richard Wordingham via Unicode
On Tue, 19 Nov 2019 20:02:55 +
James Kass via Unicode  wrote:

> On 2019-11-19 6:59 PM, Costello, Roger L. via Unicode wrote:
> > Today I received an email from the Unicode organization. The email
> > said this: (italics and yellow highlighting are mine)
> >
> > The Unicode Standard is the foundation for all modern software and
> > communications around the world, including all modern operating
> > systems, browsers, laptops, and smart phones-plus the Internet and
> > Web (URLs, HTML, XML, CSS, JSON, etc.).
> >
> > That is a remarkable statement! But is it entirely true? Isn't it
> > assuming that everything is text? What about binary information
> > such as JPEG, GIF, MPEG, WAV; those are pretty core items to the
> > Web, right? The Unicode Standard is silent about them, right? Isn't
> > the above quote a bit misleading? 
> A bit, perhaps.  But think of it as a press release.
> 
> The statement smacks of hyperbole at first blush, but "foundation"
> can mean basis or starting point.  File names (and URLs) of *.WAV,
> *.MPG, etc. are stored and exchanged via Unicode.  Likewise, the tags 
> (metadata) for audio/video files are stored (and displayed) via 
> Unicode.  So fields such as Title, Artist, Comments/Notes, Release
> Date, Label, Composer, and so forth aren't limited to ASCII data.

But file names, URLs and syntax tags are still mostly in ASCII.  It's
only when you come to text data that you get to Unicode; the usual
unreliable assumption is that the recipient has the means to display
that text.  Now, a feature of a *modern* system is that file names and
(sometimes) syntax tags can be in Unicode.  But have the nightmares
of file names and canonical equivalence come to an end?  And remember
that canonical equivalence isn't just a matter of precomposed letters.

Moving away from communications, I still find that if I use 'sort -u' to
eliminate repeated lines in unordered lines of text, I have to ensure
that I'm using binary identity for comparison - too many collations
still treat unknown characters as identical.  And this is with a
distribution that has UTF-8 as its basic encoding.

There's now a looming threat to passwords in truly complex scripts.
Keyboards are coming that will prevent certain sequences of characters
- Thais have long faced such constraints.  Some people may discover that
an upgrade of their keyboards renders them unable to type their
passwords!

Richard.



Re: Grapheme clusters & backspace (was: Unicode Digest, Vol 70, Issue 17)

2019-10-23 Thread Richard Wordingham via Unicode
On Wed, 23 Oct 2019 02:31:09 +
Ben Morphett via Unicode  wrote:

> It totally depends on the editor.  In Notepad++, when I backspace
> over "Man Teacher: Dark Skin Tone", I get "Man Teacher: Dark Skin
> Tone" => ""Man: Dark Skin Tone" => gone.

In MS Word 2016 on Windows 10, I get an intermediate stage of “Man:
Dark Skin ZWJ”, which is comparable to my suggestion that only the
consonant be deleted from a sequence of Indic stacker + consonant, even
though it be very similar to a unitary consonant sign.  The main
difference in the Indic pair is that there is a (misplaced) grapheme
cluster boundary in the former.

Mark Davis has proclaimed that all these emoji behaviours are WRONG.

What is wrong is that the ZWJ may go missing with copy and paste, as I
found between Word and plain Notepad.

Richard.



Re: Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji)

2019-10-23 Thread Richard Wordingham via Unicode
On Tue, 22 Oct 2019 23:15:57 +
Martin J. Dürst via Unicode  wrote:

> I think this to some extent is a question of "reality in the users' 
> minds". But to a very large extent, this is an issue of muscle
> memory. If a user works with a keyboard/input method that deletes a
> whole combination, their muscles will get used to that the same way
> they will get used to the other case.

The issue is one of being able to edit the cluster.  Large clusters
call out for editing rather than replacement.

Richard.



Re: Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji)

2019-10-22 Thread Richard Wordingham via Unicode
On Tue, 22 Oct 2019 23:27:27 +0200
Daniel Bünzli via Unicode  wrote:

> Thanks for you answer.
> 
> > The compromise that has generally been reached is that 'delete'
> > deletes a grapheme cluster and 'backspace' deletes a scalar value.
> > (There are good editors like Emacs that delete only a single
> > character.)  
> 
> Just to make things clear. When you say character in your message,
> you consistently mean scalar value right ?

Yes.

I find it hard to imagine that having to type them doesn't endow them
with some sort of reality in the users' minds, though some, such as
invisible stackers, are probably envisaged as control characters.

One does come across some odd entry methods, such as typing an Indic
akshara using the Latin script and then entering it as a whole.  That
is no more conducive to seeing the constituents as characters than is
typing wab- to get the hieroglyph ヂ.

Richard. 




Re: Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji)

2019-10-22 Thread Richard Wordingham via Unicode
On Tue, 22 Oct 2019 11:04:01 +0200
Daniel Bünzli via Unicode  wrote:

> On 22 October 2019 at 09:37:22, Richard Wordingham via Unicode
> (unicode@unicode.org) wrote:
> 
> > When it comes to the second sentence of the text of Slide 7
> > 'Grapheme Clusters', my overwhelming reaction is one of extreme
> > anger. Slide 8 does nothing to lessen the offence. The problem is
> > that it gives the impression that in general it is acceptable for
> > backspace to delete the whole grapheme cluster.  
> 
> Let's turn extreme anger into knowledge. 
> 
> I'm not very knowledgable in ligature heavy scripts (I suspect that's
> what you refer to) and what you describe is the first thing I went
> with for a readline editor data structure. 

Not necessarily ligature-heavy, but heavy in combining characters.
Examples at the light end include IPA and pointed Hebrew.  The Thai
script is another fairly well-known one but Siamese itself doesn't use
more than two marks on a consonant.  (The vowel marks before and after
don't count - they work like letters.)

> Would maybe care to expand when exactly you think it's not acceptable
> and what kind of tools or standard I can find the Unicode toolbox to
> implement an acceptable behaviour for backspace on general Unicode
> text. 

The compromise that has generally been reached is that 'delete' deletes
a grapheme cluster and 'backspace' deletes a scalar value.  (There are
good editors like Emacs that delete only a single character.)  The
rationale for this is that backspace undoes the effect of a
keystroke. For a perfect match, the keyboard would need to handle the
backspace - and everyone editing the text would have to use compatible
keyboards!  That's not a very plausible scenario for a Wikipedia
article.

Now, deleting the last character is not very Unicode compliant; there
is a family of keyboard designs in development that by default deletes
the last character in NFC form if it is precomposed and otherwise the
last character in NFD form.  UTS#35 Issue 36 Part 7 Section 5.21
allows for more elaborate behaviours.  I would contend that deleting
the last character is the best simple approximation.  However, it's not
impossible for a dead key implementation to decide that dead acute plus
'e' should be emitted as two characters, even though it's more usual for
it to be emitted as a single character.
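
The 'last character of the NFD form' behaviour is itself only a few
lines; a rough sketch:

    import unicodedata

    def delete_backward(text: str) -> str:
        # Decompose the final character and drop its last scalar value.
        # (The remainder is left decomposed; a fuller version would
        # renormalize to taste.)
        if not text:
            return text
        last = unicodedata.normalize('NFD', text[-1])
        return text[:-1] + last[:-1]

    print(delete_backward('caf\u00E9'))   # 'cafe': the acute goes, the e stays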

Now, there are cases where one may be unlikely to type a single
character.  I can imagine a variation sequence being implemented as
a 'ligature', i.e. a single stroke (or IME selection action) yielding
the entry of a base character plus variation selector.  Emoji may be
another, though I must say I would probably enter a regional indicator
pair as two characters, and expect to be able to delete just the last
if I made an error, contra Davis 2019.

While stacker + consonant might be expected to be a unit, the original
designs envisaged them being a sequence.  Additionally, I would expect
an edit to change the subscripted consonant rather than remove it.  In
this case, delete last character and delete grapheme cluster agree for
the language-independent rules.

Richard.



Re: Annoyances from Implementation of Canonical Equivalence

2019-10-18 Thread Richard Wordingham via Unicode
On Fri, 18 Oct 2019 09:45:14 +0300
Eli Zaretskii via Unicode  wrote:

> > Date: Thu, 17 Oct 2019 21:58:50 +0100
> > From: Richard Wordingham via Unicode 
> >   
> > > Sounds arbitrary to me.  How do we know that all the users will
> > > want that?  
> > 
> > If the change from codepoint by codepoint matching is just canonical
> > equivalence, then there is no way that the ‘n’ of ‘na’ will be
> > matched by the ‘n’ within ‘ñ’.  
> 
> "Just canonical equivalence" is also quite arbitrary, for the user's
> POV.  At least IME.

Here's a similar issue.  If I do an incremental search in Welsh text,
entering bac (on the way to entering bach) will find words like "bach"
and  "bachgen" even though their third letter is 'ch', not 'c'.

'Canonical equivalence' is 'DTRT', unless you're working with systems
too lazy or too primitive to DTRT.  It involves treating sequences of
character sequences declared to be identical in signification
identically.
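
Doing the right thing is cheap; a minimal sketch of equivalence-aware
comparison:

    import unicodedata

    def canonically_equal(s: str, t: str) -> bool:
        # Identical in signification: compare canonical decompositions.
        return (unicodedata.normalize('NFD', s)
                == unicodedata.normalize('NFD', t))

    assert canonically_equal('\u00F1', 'n\u0303')   # ñ, one codepoint or two
    assert not canonically_equal('n', 'n\u0303')    # 'n' is not 'ñ'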

The only pleasant justification for treating canonical sequences
inequivalently that I can think of is to treat the difference as a way
of recording how the text was typed.  Quite a few editing systems erase
that information, and I doubt people care how someone else typed the
text.

Richard.



Re: Collation Grapheme Clusters and Canonical Equivalence

2019-10-18 Thread Richard Wordingham via Unicode
On Thu, 17 Oct 2019 23:11:55 +0100
Richard Wordingham via Unicode  wrote:

> There seems to be a Unicode non-compliance (C6) issue in the
> definition of collation grapheme clusters (defined in UTS#10 Section
> 9.9).  Using the DUCET collation, the canonically equivalent strings
> รู้  U+0E49 THAI CHARACTER MAI THO> and รัู 
> decompose into collation grapheme clusters in two different ways.
> The first decomposes into  and  and the
> second decomposes into  and .  

Correction:

One has to take the collating elements in NFD order, so the tone mark
(secondary weight) and the vowel (primary weight) also form a cluster,
so the division into clusters is , .  This
split respects canonical equivalence.

Replacement:

Now, one form of typo one may see in Thai is where the
vowel is typed twice.  Thai fonts often lack mark-to-mark positioning
for sequences that should not occur, so the two copies of the vowel may
be overlaid.  Proof-reading will not spot the mistake if the font or
layout engine does not assist.

Thus we can get  (417,000 raw Google
hits, the first 10 all good).  That splits into *three* collation
grapheme clusters - ,  and .  Its
canonical equivalent  splits into two
grapheme clusters: to form a sequence of collating elements
without skipping, starting at the U+0E49, one must take all three
characters.  Overall, we end up with *two* collation grapheme clusters,
 and .

> Thus UTS#18 RL3.2 'Tailored Grapheme Clusters', namely "To meet this
> requirement, an implementation shall provide for collation grapheme
> clusters matches based on a locale's collation order", requires
> canonically equivalent sequences to be interpreted differently.

Richard.



Collation Grapheme Clusters and Canonical Equivalence

2019-10-17 Thread Richard Wordingham via Unicode
There seems to be a Unicode non-compliance (C6) issue in the definition
of collation grapheme clusters (defined in UTS#10 Section 9.9).  Using
the DUCET collation, the canonically equivalent strings รู้  and รัู  decompose into collation
grapheme clusters in two different ways.  The first decomposes into
 and  and the second decomposes into  and .

Thus UTS#18 RL3.2 'Tailored Grapheme Clusters', namely "To meet this
requirement, an implementation shall provide for collation grapheme
clusters matches based on a locale's collation order", requires
canonically equivalent sequences to be interpreted differently.

Is this a known issue?

Should I report it against UTS#10 or UTS#18?

Is the phrase 'collation order' intended to preclude the use of search
collations?  Search collations allow one to find a collation grapheme
cluster starting with U+0E15 THAI CHARACTER TO TAO in its exemplifying
word เต่า .  DUCET splits it into , , but most (all?) CLDR search collations split
it into , , , matching the division
into grapheme clusters.

If we accept that in the Latin script Vietnamese tone marks have
primary weights (this only shows up with strings more than one
syllable long), I can produce more egregious examples based on the
various sequences canonically equivalent to U+1EAD LATIN SMALL LETTER A
WITH CIRCUMFLEX AND DOT BELOW or to U+1EDB LATIN SMALL LETTER O WITH
HORN AND ACUTE.

The root of the problem is the desire to match only contiguous
substrings.  This does not play nicely with canonical equivalence.

Richard.



Re: Annoyances from Implementation of Canonical Equivalence

2019-10-17 Thread Richard Wordingham via Unicode
On Thu, 17 Oct 2019 10:42:19 +0300
Eli Zaretskii via Unicode  wrote:

> > Date: Thu, 17 Oct 2019 02:26:35 +0100
> > From: Richard Wordingham 
> > Cc: Eli Zaretskii 
> > 
> > (c) A search for 'n' finding 'ñ'.
> > 
> > When it comes to canonical equivalence, one answer to (c) is that as
> > soon as one adds the next letter, e.g. 'na', the search will
> > no longer match 'ñ'.  
> 
> Sounds arbitrary to me.  How do we know that all the users will want
> that?

If the change from codepoint by codepoint matching is just canonical
equivalence, then there is no way that the ‘n’ of ‘na’ will be matched
by the ‘n’ within ‘ñ’.
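
The point is visible with a plain substring search over NFD text:

    import unicodedata

    text = unicodedata.normalize('NFD', 'se\u00F1or')  # ñ becomes n + U+0303
    print('n' in text)    # True: the decomposed ñ contains a literal 'n'
    print('no' in text)   # False: U+0303 intervenes once 'o' is added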

> > (This doesn't apply to diacritic-ignoring folding.)  
> But the issue _was_ diacritic-ignoring folding.

Then we don't seem to have any evidence of user discontent arising from
supporting canonical equivalence.

> > That argument doesn't work with the Polish letter 'ń' though, as it
> > can be word-final.  

> It actually doesn't work in general, and one factor is indeed
> different languages.  The problem with ñ was raised by
> Spanish-speaking users, and only they were very much against folding
> in this case.

I'm not talking about folding.  I'm talking about canonical
equivalence, which largely but not solely consists of treating
precomposed characters as the same as their *canonical* decompositions. 

> > In many cases, the answer might be a search by collation graphemes,
> > but that has other issues besides language sensitivity.  

> It is also unworkable, because search has to work in contexts where
> the text is not displayed at all, and graphemes only exist at display
> time.

The definition of a grapheme cluster is given in Section 9.9 of UTS#10,
which is currently at Version 12.1.0.  It is only connected to display
at a deep level, so display time is irrelevant.  Formally, it depends
on a collation, though the sorting aspect is irrelevant and is removed
for many 'search' collations in the CLDR.

So, if one were using a Spanish collation, on typing 'n' into the
incremental search string (and having it committed), the search wouldn't
consider a match with 'ñ'. Then, on further typing the combining tilde,
it would reject the matches it had found and choose those matches with
'ñ', whether one codepoint or two.  Would that behaviour cause serious
grief for incremental search?  As I use an XSAMPA-based input method
implemented in quail that attempts to generate text in form NFC, I would
type 'n~' to get the Spanish character, and so would never get an
intermediate state where the incremental search was searching for 'n'.
(At least, not in Emacs 25.3.1.)

Richard.



Re: Annoyances from Implementation of Canonical Equivalence

2019-10-16 Thread Richard Wordingham via Unicode
On Wed, 16 Oct 2019 09:33:38 +0300
Eli Zaretskii via Unicode  wrote:

> > These are complaints about primary-level searches, not canonical
> > equivalence.  
> 
> Not sure what you call primary-level searches, but if you deduced the
> complaints were only about searches for base characters, then that's
> not so.  They are long discussions with many sub-threads, so it might
> be hard to find the specific details you are looking for.

The nearest I've found to complaints about including canonical
equivalences are:

(a) an observation that very occasionally one would need to switch
canonical equivalence off.  In such cases, one is not concerned with
the text as such, but rather with how Unicode non-compliant processes
will handle it.  Compliant processes are often built out of
non-compliant processes.

(b) just possibly

"What we have seen is that the behavior that comes from that Unicode
data does not please the users very much.  Users seem to have many
different ideas of what folding is useful, and disagree with each
other greatly." -
https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg01359.html

I can't tell what (b) was talking about; it may well have been about
folding or asymmetric search, as opposed to supporting canonical
equivalence.

(c) A search for 'n' finding 'ñ'.

When it comes to canonical equivalence, one answer to (c) is that as
soon as one adds the next letter, e.g. 'na', the search will no
longer match 'ñ'.  (This doesn't apply to diacritic-ignoring folding.)
That argument doesn't work with the Polish letter 'ń' though, as it can
be word-final.

In programming, one might be able to prevent the issue
by using 'n\b{g}', but that is a requirement of RL2.2, which doesn't
seem to be high on the list of implementers' priorities, especially as
it depends on properties outwith the UCD, defined in a non-ASCII file
to boot.  A better supported solution is probably 'n\P{Mn}'.
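
A sketch of that better-supported option, using the third-party regex
module (the standard library re lacks \p{..} support):

    import unicodedata
    import regex

    text = unicodedata.normalize('NFD', 'se\u00F1or nato')
    # 'n' followed by a non-mark skips the n inside the decomposed ñ;
    # a word-final 'n' at the very end of the text would be missed too,
    # which is the ń problem noted above.
    print(regex.findall(r'n\P{Mn}', text))   # ['na']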

In many cases, the answer might be a search by collation graphemes, but
that has other issues besides language sensitivity.

Richard.



Re: Annoyances from Implementation of Canonical Equivalence (was: Pure Regular Expression Engines and Literal Clusters)

2019-10-15 Thread Richard Wordingham via Unicode
On Tue, 15 Oct 2019 09:43:23 +0300
Eli Zaretskii via Unicode  wrote:

> > Date: Tue, 15 Oct 2019 00:23:59 +0100
> > From: Richard Wordingham via Unicode 
> >   
> > > I'm well aware of the official position.  However, when we
> > > attempted to implement it unconditionally in Emacs, some people
> > > objected, and brought up good reasons.  You can, of course, elect
> > > to disregard this experience, and instead learn it from your
> > > own.  
> > 
> > Is there a good record of these complaints anywhere?  
> 
> You could look up these discussions:
> 
>   https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00189.html
>   https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00506.html

These are complaints about primary-level searches, not canonical
equivalence.

> > (It would occasionally be useful to have an easily issued command
> > like 'delete preceding NFD codepoint'.)  
> 
> I agree.  Emacs commands that delete characters backward (usually
> invoked by the Backspace key) do that automatically, if the text
> before cursor was produced by composing several codepoints.

That's pretty standard, though it looks as though GTK has chosen to
reject the principle that backwards deletion deletes the last character
entered.

> Sure.  There's an Emacs command (C-u C-x =) which shows that
> information for the text at a given position.

Or commands what-cursor-position and describe-char if an emulator
gets in the way.  Having forward-char-intrusive would make it perfect.

Richard.


Annoyances from Implementation of Canonical Equivalence (was: Pure Regular Expression Engines and Literal Clusters)

2019-10-14 Thread Richard Wordingham via Unicode
On Mon, 14 Oct 2019 21:41:19 +0300
Eli Zaretskii via Unicode  wrote:

> > Date: Mon, 14 Oct 2019 19:29:39 +0100
> > From: Richard Wordingham via Unicode 

> > The official position is that text that is canonically
> > equivalent is the same.  There are problem areas where traditional
> > modes of expression require that canonically equivalent text be
> > treated differently.  For these, it is useful to have tools that
> > treat them differently.  However, the normal presumption should be
> > that canonically equivalent text is the same.  

> I'm well aware of the official position.  However, when we attempted
> to implement it unconditionally in Emacs, some people objected, and
> brought up good reasons.  You can, of course, elect to disregard this
> experience, and instead learn it from your own.

Is there a good record of these complaints anywhere?  It is annoying
when a text entry function does not keep the text as one enters it, but
it would be interesting to know what the other complaints were.  (It
would occasionally be useful to have an easily issued command like
'delete preceding NFD codepoint'.)  I did mention above that
occasionally one needs to know what codepoints were used and in what
order.

Richard.


Re: Pure Regular Expression Engines and Literal Clusters

2019-10-14 Thread Richard Wordingham via Unicode
On Mon, 14 Oct 2019 10:05:49 +0300
Eli Zaretskii via Unicode  wrote:

> > Date: Mon, 14 Oct 2019 01:10:45 +0100
> > From: Richard Wordingham via Unicode 

> > They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*,
> > and were expecting normalisation (even to NFC) to be a possible
> > cure.  They had begun to realise that converting expressions to
> > match all or none of a set of canonical equivalents was hard; the
> > issue of non-contiguous matches wasn't mentioned.  

> I think these are two separate issues: whether search should normalize
> (a.k.a. performs character folding) should be a user option.  You are
> talking only about canonical equivalence, but there's also
> compatibility decomposition, so, for example, searching for "1" should
> perhaps match ¹ and ①.

HERETIC!

The official position is that text that is canonically
equivalent is the same.  There are problem areas where traditional
modes of expression require that canonically equivalent text be treated
differently.  For these, it is useful to have tools that treat them
differently.  However, the normal presumption should be that
canonically equivalent text is the same.

The party line seems to be that most searching should actually be done
using a 'collation', which brings with it different levels of
'folding'.  In multilingual use, a collation used for searching should
be quite different to one used for sorting.

Now, there is a case for being able to switch off normalisation and
canonical equivalence generally, e.g. when dealing with ISO 10646 text
instead of Unicode text.  This of course still leaves the question of
what character classes defined by Unicode properties then mean.

If one converts the regular expression so that what it matches is
closed under canonical equivalence, then visibly normalising the
searched text becomes irrelevant.  For working with Unicode traces, I
actually do both.  I convert the text to NFD but report matches in terms
of the original code point sequence; working this way simplifies the
conversion of the regular expression, which I do as part of its
compilation.  For traces, it seems only natural to treat precomposed
characters as syntactic sugar for the NFD decompositions.  (They
have no place in the formal theory of traces.)  However, I go further
and convert the decomposed text all the way to NFD, applying canonical
reordering as well.  (Recall that conversion to NFD can change the
stored order of combining marks.)
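A quick illustration of that reordering with Python's unicodedata (the
characters are my choice):

    import unicodedata

    s = 'a\u0301\u0323'   # a, COMBINING ACUTE (ccc=230), COMBINING DOT BELOW (ccc=220)
    nfd = unicodedata.normalize('NFD', s)
    print(' '.join(f'{ord(c):04X}' for c in nfd))   # 0061 0323 0301 - marks reordered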

One of the simplifications I get is that straight runs of text in the
regular expression then match, even in the middle of the text, just by
converting that run and the searched string in the same way.

For the concatenation of expressions A and B, once I am looking at the
possible interleaving of two traces, I am dealing with NFA states of
the form states(A) × {1..254} × states(B), so that for an element (a,
n, b), a corresponds to starts of words with a match in A, b
corresponds to starts of _words_ with a match in B, and n is the ccc
of the last character used to advance to b.  The element n blocks
non-starters that can't belong to a word matching A.  If I didn't
(internally) convert the searched text to NFD, the element n would have
to be a set of blocked canonical combining classes, changing the number
of possible values from 54 to 2^54 - 1.

While aficionados of regular languages may object that converting the
searched text to NFD is cheating, there is a theorem that if I have a
finite automaton that recognises a family of NFD strings, there is
another finite automaton that will recognise all their canonical
equivalents.

Richard.



Re: Pure Regular Expression Engines and Literal Clusters

2019-10-14 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 21:28:34 -0700
Mark Davis ☕️ via Unicode  wrote:

> The problem is that most regex engines are not written to handle some
> "interesting" features of canonical equivalence, like discontinuity.
> Suppose that X is canonically equivalent to AC.
> 
>- A query /X/ can match the separated A and C in the target string
>"AbC". So if I have code do [replace /X/ in "AbC" by "pq"], how
> should it behave? "pqb", "pbq", "bpq"?

If A contains a non-starter, pqbC.
If C contains a non-starter, Abpq.
Otherwise, if the results are canonically inequivalent, it should
raise an exception for attempting a process that is either ill-defined
or not Unicode-compliant. 

> If the input was in NFD (for
> example), should the output be rearranged/decomposed so that it is
> NFD? and so on.

That is not a new issue.  It exists already.

>- A query /A/ can match *part* of the X in the target string
> "aXb". So if I have code to do [replace /A/ in "aXb" by "pq"], what
> should result: "apqCb"?

Yes, unless raising an exception is appropriate (see above).

> The syntax and APIs for regex engines are not built to handle these
> features. It introduces enough complications in the code, syntax,
> and semantics that no major implementation has seen fit to do it. We
> used to have a section in the spec about this, but were convinced
> that it was better off handled at a higher level.

What higher level?  If anything, I would say that the handler is at a
lower level (character fragments and the like).

The potential requirement should be restored, but not subsumed in
Levels 1 to 3.  It is a sufficiently different level of endeavour.

Richard.



Re: Pure Regular Expression Engines and Literal Clusters

2019-10-14 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 20:25:25 -0700
Asmus Freytag via Unicode  wrote:

> On 10/13/2019 6:38 PM, Richard Wordingham via Unicode wrote:
> On Sun, 13 Oct 2019 17:13:28 -0700

>> Yes.  There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so
>> [:Lu:] should not match <M, COMBINING CIRCUMFLEX ACCENT>. 

> Why does it matter if it is precomposed? Why should it? (For anyone
> other than a character coding maven).

Because general_category is a property of characters, not strings.  It
matters to anyone who intends to conform to a standard.

>> Now, I could invent a string
>> property so that \p{xLu} meant (?:\p{Lu}\p{Mn}*). 

No, I shouldn't!  \p{xLu} is infinite, which would not be allowed for
a Unicode set.  I'd have to resort to a wordy definition for it to be a
property.

Richard.


Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 17:13:28 -0700
Asmus Freytag via Unicode  wrote:

> On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
> Besides invalidating complexity metrics, the issue was what \p{Lu}
> should match.  For example, with PCRE syntax, GNU grep Version 2.25
> \p{Lu} matches U+0100 but not <U+0041, U+0304>.  When I'm respecting
> canonical equivalence, I want both to match [:Lu:], and that's what I
> do. [:Lu:] can then match a sequence of up to 4 NFD characters.
> 
> Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*;
> instead of formally handling NFD, you could extend the syntax to
> handle "inherited" properties across combining sequences.
> 
> Am I missing anything?

Yes.  There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so [:Lu:]
should not match <M, COMBINING CIRCUMFLEX ACCENT>.  Now, I could
invent a string property so that \p{xLu} meant (?:\p{Lu}\p{Mn}*).

I don't entirely understand what you said; you may have missed the
distinction between "[:Lu:] can then match" and "[:Lu:] will then
match".  I think only Greek letters expand to 4 characters in NFD.

When I'm respecting canonical equivalence/working with traces, I want
[:insc=vowel_dependent:][:insc=tone_mark:] to match both <vowel sign,
tone mark> and its canonical
equivalent <tone mark, vowel sign>.  The canonical closure of that
sequence can be messy even within scripts.  Some pairs commute: others
don't, usually for good reasons.

Regards,

Richard.


Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Mon, 14 Oct 2019 00:22:36 +0200
Hans Åberg via Unicode  wrote:

> > On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode
> >  wrote:

>> Besides invalidating complexity metrics, the issue was what \p{Lu}
>> should match.  For example, with PCRE syntax, GNU grep Version 2.25
>> \p{Lu} matches U+0100 but not <U+0041, U+0304>.  When I'm respecting
>> canonical equivalence, I want both to match [:Lu:], and that's what
>> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.  
 
> Hopefully some experts here can tune in, explaining exactly what
> regular expressions they have in mind.

The best indication lies at
https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_Equivalents
(2008), which is the last version before support for canonical
equivalence was dropped as a requirement.

It's not entirely coherent, as the authors don't seem to find an
expression like

\p{L}\p{gcb=extend}*

a natural thing to use, as the second factor is mostly sequences of
non-starters.  At that point, I would say they weren't expecting
\p{Lu} to not match <U+0041, U+0304>, as they were still expecting [ä] to
match both "ä" and "a\u0308".

They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*, and
were expecting normalisation (even to NFC) to be a possible cure.  They
had begun to realise that converting expressions to match all or none
of a set of canonical equivalents was hard; the issue of non-contiguous
matches wasn't mentioned.

When I say 'hard', I'm thinking of the problem that concatenation may
require dissolution of the two constituent expressions and involve the
temporary creation of 54-fold (if text is handled as NFD) or 2^54-fold
(no normalisation) sets of extra states.  That's what's driven me to
write my own regular expression engine for traces.

Regards,

Richard.



Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 22:14:10 +0200
Hans Åberg via Unicode  wrote:

> > On 13 Oct 2019, at 21:17, Richard Wordingham via Unicode
> >  wrote:

> > Incidentally, at least some of the sizes and timings I gave seem to
> > be wrong even for strings.  They won't work with numeric
> > quantifiers, as in /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/.  

> > One gets lesser issues in quantifying complexity if one wants "Å" to
> > match \p{Lu} when working in NFD - potentially a different state for
> > each prefix of the capital letters.  (It's also the case except for
> > UTF-32 if characters are treated as sequences of code units.)
> > Perhaps 'upper case letter that Unicode happens to have encoded as
> > a single character' isn't a concept that regular expressions need
> > to support concisely. What's needed is to have a set somewhere
> > between [\p{Lu}&\p{isNFD}] and [\p{Lu}], though perhaps it should be
> > extended to include "ff" - there are English surnames like
> > "ffrench".  

The point about these examples is that the estimate of one state per
character becomes a severe underestimate.  For example, after
processing 20 a's, the NFA for /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/ can
be in any of about 50 states.  The number of possible states is not
linear in the length of the expression.  While a 'loop iteration' can
keep the size of the compiled regex down, it doesn't prevent the
proliferation of states - just add zeroes to my example. 

> I made some C++ templates that translate Unicode code point character
> classes into UTF-8/32 regular expressions. So anything that can be
> reduced to actual regular expressions would work. 

Besides invalidating complexity metrics, the issue was what \p{Lu}
should match.  For example, with PCRE syntax, GNU grep Version 2.25
\p{Lu} matches U+0100 but not <U+0041, U+0304>.  When I'm respecting
canonical equivalence, I want both to match [:Lu:], and that's what I
do. [:Lu:] can then match a sequence of up to 4 NFD characters.
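The same behaviour is easy to reproduce outside grep - a minimal
sketch with Python's third-party regex module (my example):

    import unicodedata
    import regex   # third-party module with \p{...} support

    nfc = '\u0100'                              # Ā, precomposed
    nfd = unicodedata.normalize('NFD', nfc)     # 'A' + U+0304
    print(bool(regex.fullmatch(r'\p{Lu}', nfc)))   # True
    print(bool(regex.fullmatch(r'\p{Lu}', nfd)))   # False: U+0304 is gc=Mn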

Regards,

Richard.



Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 15:29:04 +0200
Hans Åberg via Unicode  wrote:

> > On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode
> > I'm now beginning to wonder what you are claiming.  

> I start with an NFA with no empty transitions and apply the subset DFA
> construction dynamically for a given string along with some reverse
> NFA-data that is enough to traverse backwards when a final state
> arrives. The result is an NFA where every traversal is a match of the
> string at that position.

And then the speed comparison depends on how quickly one can extract
the match information required from that data structure.

Incidentally, at least some of the sizes and timings I gave seem to be
wrong even for strings.  They won't work with numeric quantifiers, as
in /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/.

One gets lesser issues in quantifying complexity if one wants "Å" to
match \p{Lu} when working in NFD - potentially a different state for
each prefix of the capital letters.  (It's also the case except for
UTF-32 if characters are treated as sequences of code units.)  Perhaps
'upper case letter that Unicode happens to have encoded as a single
character' isn't a concept that regular expressions need to support
concisely. What's needed is to have a set somewhere between
[\p{Lu}&\p{isNFD}] and [\p{Lu}], though perhaps it should be extended to
include "ff" - there are English surnames like "ffrench".

Regards,

Richard.



Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 10:04:34 +0200
Hans Åberg via Unicode  wrote:

> > On 13 Oct 2019, at 00:37, Richard Wordingham via Unicode
> >  wrote:
> > 
> > On Sat, 12 Oct 2019 21:36:45 +0200
> > Hans Åberg via Unicode  wrote:
> >   
> >>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode
> >>>  wrote:
> >>> 
> >>> But remember that 'having longer first' is meaningless for a
> >>> non-deterministic finite automaton that does a single pass through
> >>> the string to be searched.
> >> 
> >> It is possible to identify all submatches deterministically in
> >> linear time without backtracking — I made an algorithm for
> >> that.  
> > 
> > That's impressive, as the number of possible submatches for
> > a*(a*)a* is quadratic in the string length.  
> 
> That is probably after the possibilities in the matching graph have
> been expanded, which can even be exponential. As an analogy, think of
> a polynomial product, I compute the product, not the expansion.

I'm now beginning to wonder what you are claiming. One thing one can
do without backtracking is to determine which capture groups capture
something, and which combinations of capturing or not occur.  That's
a straightforward extension of doing the overall 'recognition' in linear
time - at least, linear in length (n) of the searched string.  (I say
straightforward, but it would mess up my state naming algorithm.)  The
time can also depend on the complexity of the regular expression, which
can be bounded by the length (m) of the expression if working with mere
strings, giving time O(mn) if one doesn't undertake the worst case
O(2^m) task of converting the non-deterministic FSM to a deterministic
FSM.
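For concreteness, the O(mn) bound comes from advancing a *set* of NFA
states one input symbol at a time - a toy sketch (mine), with a
hand-built NFA for (a|b)*abb:

    # transitions of an NFA for (a|b)*abb; state 3 is accepting
    TRANS = {
        (0, 'a'): {0, 1}, (0, 'b'): {0},
        (1, 'b'): {2},
        (2, 'b'): {3},
    }

    def match(s):
        states = {0}
        for ch in s:
            states = set().union(*(TRANS.get((q, ch), set()) for q in states))
        return 3 in states

    print(match('ababb'))  # True
    print(match('abab'))   # False

Each of the n input symbols touches at most m states, with no
backtracking.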

Using m as a complexity measure for traces may be misleading, and I
think plain wrong; for moderate m, the complexity can easily go up as
fast as m^10, and I think higher powers are possible.  Strings
exercising the higher complexities are linguistically implausible.

Regards,

Richard.



Re: Pure Regular Expression Engines and Literal Clusters

2019-10-12 Thread Richard Wordingham via Unicode
On Sat, 12 Oct 2019 21:36:45 +0200
Hans Åberg via Unicode  wrote:

> > On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode
> >  wrote:
> > 
> > But remember that 'having longer first' is meaningless for a
> > non-deterministic finite automaton that does a single pass through
> > the string to be searched.  
> 
> It is possible to identify all submatches deterministically in linear
> time without backtracking — I made an algorithm for that.

That's impressive, as the number of possible submatches for a*(a*)a* is
quadratic in the string length.
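A quick count of the candidate capture spans (my arithmetic):

    n = 8
    # Any contiguous run of a's can be the capture of (a*) in a*(a*)a*,
    # so a span is any pair 0 <= i <= j <= n:
    spans = {(i, j) for i in range(n + 1) for j in range(i, n + 1)}
    print(len(spans), (n + 1) * (n + 2) // 2)   # 45 45 - quadratic in n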

> A selection among different submatches then requires additional rules.

Regards,

Richard.



Re: Pure Regular Expression Engines and Literal Clusters

2019-10-12 Thread Richard Wordingham via Unicode
On Fri, 11 Oct 2019 12:39:56 +0200
Elizabeth Mattijsen via Unicode  wrote:


> Furthermore, Perl 6 uses Normalization Form Grapheme for matching:
> https://docs.perl6.org/type/Cool#index-entry-Grapheme

This approach does address the issue Mark Davis mentioned about regex
engines working at the wrong level.  Perhaps you can put my mind at
rest about whether it works at all with scripts that subordinate
vowels.

If I wanted to find the occurrences of the Pali word _pacati_ 'to cook'
in Latin script text using form NFG, I could use a Perl regular
expression like /\b(?:a|pa)?p[aā]c(?:\B.)*/.  (At least,

grep -P '\b(?:a|pa)?p[aā]c\p{Ll}*' file.txt

works on text in NFC.  I couldn't work out the command-line expression
to display a list of matches from Perl, and the PCRE \B is broken beyond
ASCII in GNU grep 2.25.)

How would I do such a search in an Indic script using form NFG?

The main issue is that the single character 'c' would have to expand to
a list of all but one of the Pali grapheme clusters whose initial
consonant transliterates to 'c'.  Have you a notation for such a class?

Regards,

Richard.



Re: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode?

2019-10-12 Thread Richard Wordingham via Unicode
On Sat, 12 Oct 2019 18:15:38 +0800
Fred Brennan via Unicode  wrote:

> Indeed - it is extremely unfortunate that users will need to wait
> until 2021(!) to get it into Unicode so Google will finally add it to
> the Noto fonts.

> If that's just how things are done, fine, I certainly can't change
> the whole system. But imagine if you had to wait two years to even
> have a chance of using a letter you desperately need to write your
> language?

Update me on what the problem with using the character *now* is.  If the
character is so important, why do you need to wait for Noto fonts?

I can imagine a much bigger problem - you could have the problem that
the Baybayin script is 'supported'.  This could result in dotted circles
between RA and the combining marks.  It took ages between the addition
of U+0BB6 TAMIL LETTER SHA to Unicode and obtaining a renderer that
acknowledged it as a Tamil letter.  You should be (or are you?)
badgering HarfBuzz to speculatively support it.  (There may be other
problems in the system.)

> Imagine if the letter "Q" was unencoded and Noto refused to
> add it for two more years?

On private PCs, having Noto support for a script can actually be an
unmitigated disaster.

Richard.


Re: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode?

2019-10-12 Thread Richard Wordingham via Unicode
On Sat, 12 Oct 2019 18:15:38 +0800
Fred Brennan via Unicode  wrote:

> Indeed - it is extremely unfortunate that users will need to wait
> until 2021(!) to get it into Unicode so Google will finally add it to
> the Noto fonts.

> There seems to be no conscionable reason for such a long delay after
> the approval.

The UTC's accepting a character does not mean it will make it into
Unicode.  In the ISO process it may yet be rejected, renumbered or
renamed.  These things have certainly happened for new scripts.

Richard.


Re: Pure Regular Expression Engines and Literal Clusters

2019-10-12 Thread Richard Wordingham via Unicode
On Fri, 11 Oct 2019 18:37:18 -0700
Mark Davis ☕️ via Unicode  wrote:

> >
> > You claimed the order of alternatives mattered.  That is an
> > important issue for anyone rash enough to think that the standard
> > is fit to be used as a specification.
> >  
> 
> Regex engines differ in how they handle the interpretation of the
> matching of alternatives, and it is not possible for us to wave a
> magic wand to change them.

But you are close to waving a truncheon to deprecate some of them.  And
even if you do not wave the truncheon, you will provide other people a
stick to beat them with.

> What we can do is specify how the interpretation of the properties of
> strings works. By specifying that they behave like alternation AND
> adding the extra constraint of having longer first, we minimize the
> differences across regex engines.

But remember that 'having longer first' is meaningless for a
non-deterministic finite automaton that does a single pass through the
string to be searched.

> > I'm still not entirely clear what a regular
> > expression /[\u00c1\u00e1]/ can mean.  If the system uses NFD to
> > simulate Unicode conformance, shall the expression then be
> > converted to /[{A\u0301}{a\u0301}]/?  Or should it simply fail to
> > match any NFD string?  I've been implementing the view that all or
> > none of the canonical equivalents of a string match.  (I therefore
> > support mildly discontiguous substrings, though I don't support
> > splitting undecomposable characters.) 
> 
> We came to the conclusion years ago that regex engines cannot
> reasonably be expected to implement canonical equivalence; they are
> really working at a lower level.

So does a lot of text processing.  The issue should simply be that the
change is too complicated for straightforward implementation:

(1) One winds up with slightly discontiguous substrings: the
non-starters at the beginning and end may not be contiguous.

(2) If one does not work with NFD, one ends up with parts of characters
in substrings.

(3) If one does not work with NFD (thereby formally avoiding the
issue of Unicode equivalence), replacing a non-starter by a character
of a different ccc is in general not a Unicode-compliant process.  (This
avoidance technique can be necessary for the Unicode Collation
Algorithm.)

(4) The algorithm for recognising concatenation and iteration (more
precisely, their closures under canonical equivalence) need to be
significantly rewritten.  One needs to be careful with optimisation -
some approaches could lead to reducing an FSM with over 2^54 states.

The issue of concatenation and iteration is largely solved in the
theory of traces and regular expressions, though there is still the
issue of when the iteration (Kleene star) of a regular expression (for
traces) is itself regular.  In the literature, this issue is called the
'star problem'.  One practical answer is that the Kleene star is itself
regular if it is generated from the set of strings matching the regular
expression that either contain NFD non-starters or all of whose
characters have the same ccc.  An unchecked requirement that
Kleene stars all be of this form would probably not be too great
a problem - one could probably dress this up by 'only fully supporting
Kleene star that is the same as the "concurrent star"'. Another one is
that recognition algorithms do not need to restrict themselves to
*regular* expressions - back references are not 'regular' either.

/\u0F73*/ is probably the simplest example of a non-regular Kleene star
in the Unicode strings under canonical equivalence.  (That character is
a problem character for ICU collation.)
However, /[[:Tibetan:]&[:insc=vowel_dependent:]]*/ is regular, as
removing U+0F73 from the Unicode set does not change its iteration.
Contrariwise, there might be a formal issue with giving <U+0F71, U+0F72> preference over <U+0F72, U+0F71> if one used the iteration algorithm for
regular-only Kleene star.
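For reference, the decomposition and classes behind that example (a
check with Python's unicodedata; the non-regularity gloss is mine):

    import unicodedata

    ii = '\u0F73'   # TIBETAN VOWEL SIGN II
    print(unicodedata.decomposition(ii))        # 0F71 0F72
    print(unicodedata.combining('\u0F71'),
          unicodedata.combining('\u0F72'))      # 129 130 - two non-starters
    # With distinct non-zero ccc the two parts commute, so the strings
    # equivalent to some (U+0F73)^n are exactly those with equal numbers
    # of U+0F71 and U+0F72 - an 'equal counts' language, not a regular one.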

> So you see the advice we give at
> http://unicode.org/reports/tr18/#Canonical_Equivalents. (Again, no
> magic wand.)

So who's got tools for converting the USE's expression for a
'standard cluster' into a regular expression that catches all NFD
equivalents of the original expression?  There may be perfection issues
- the star problem may be unsolved for sequences of Unicode strings
under canonical equivalence.  Annoyingly, I can't find any text but my
own that relates traces to Unicode!

The trick of converting strings to NFD before searching them is
certainly useful.  Even with an engine respecting canonical
equivalence, it cuts the 2^54 I mentioned down to 54, the
number of non-zero canonical combining classes currently in use.  Of
course, such a reduction is not fully consistent with the spirit of a
finite state machine.
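That count can be read off the UCD with a quick scan (a sketch in
Python; the exact figure depends on the Unicode version of your build -
the figure quoted above is 54):

    import sys
    import unicodedata

    cccs = {unicodedata.combining(chr(cp)) for cp in range(sys.maxunicode + 1)}
    cccs.discard(0)
    print(len(cccs))   # distinct non-zero ccc values in use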

Richard.




Re: Pure Regular Expression Engines and Literal Clusters

2019-10-11 Thread Richard Wordingham via Unicode
On Fri, 11 Oct 2019 14:35:33 -0700
Markus Scherer via Unicode  wrote:

> > > [c \q{ch}]h should work like (ch|c)h. Note that the order matters
> > > in the alternation -- so this works equivalently if longer
> > > strings are sorted first.  

> > Does conformance to UTS#18 level 2 mandate the choice of matching
> > substring? This would appear to prohibit compliance to POSIX rules,
> > where the length of overall match counts.

> The idea is currently to specify properties-of-strings (and I think a
> range/class with "clusters") behaving like an alternation where the
> longest strings are first, and leaving it up to the regex engine
> exactly what that means.
> 
> In general, UTS #18 offers a lot of things that regex implementers
> may or may not adopt.

> If you have specific ideas, please send them as PRI feedback.
> (Discussion on the list is good and useful, but does not guarantee
> that it gets looked at when it counts.)

You claimed the order of alternatives mattered.  That is an important
issue for anyone rash enough to think that the standard is fit to be
used as a specification.

I'm still not entirely clear what a regular expression /[\u00c1\u00e1]/
can mean.  If the system uses NFD to simulate Unicode conformance,
shall the expression then be converted to /[{A\u0301}{a\u0301}]/?  Or
should it simply fail to match any NFD string?  I've been implementing
the view that all or none of the canonical equivalents of a string
match.  (I therefore support mildly discontiguous substrings, though I
don't support splitting undecomposable characters.)

Richard.


Re: Pure Regular Expression Engines and Literal Clusters

2019-10-11 Thread Richard Wordingham via Unicode
On Thu, 10 Oct 2019 15:23:00 -0700
Markus Scherer via Unicode  wrote:

> [c \q{ch}]h should work like (ch|c)h. Note that the order matters in
> the alternation -- so this works equivalently if longer strings are
> sorted first.

Thanks for answering the question.

Does conformance to UTS#18 level 2 mandate the choice of matching
substring? This would appear to prohibit compliance to POSIX rules,
where the length of overall match counts.

Richard.


Re: Pure Regular Expression Engines and Literal Clusters

2019-10-11 Thread Richard Wordingham via Unicode
On Fri, 11 Oct 2019 12:39:56 +0200
Elizabeth Mattijsen via Unicode  wrote:

> Furthermore, Perl 6 uses Normalization Form Grapheme for matching:
> https://docs.perl6.org/type/Cool#index-entry-Grapheme

I seriously doubt that a Thai considers each combination of consonant
(44), non-spacing vowel (7) and tone mark (4) a different character.
Moreover, if what you say is correct, perl6 will be useless for
finding such combinations in correctly spelled text.  The regular
expression

\p{insc=consonant}\p{insc=vowel_dependent}\p{insc=tone_mark}

would find only misspellings because in correct Thai spelling, matching
sequences constitute grapheme clusters.  I trust perl6 will actually
continue to support analyses of strings as sequences of codepoints. 

Richard.


Re: Pure Regular Expression Engines and Literal Clusters

2019-10-10 Thread Richard Wordingham via Unicode
On Tue, 8 Oct 2019 15:25:34 +0100
Richard Wordingham via Unicode  wrote:

> An example UTS#18 gives for matching a literal cluster can be
> simplified to, in its notation:
> 
> [c \q{ch}]
> 
> This is interpreted as 'match against "ch" if possible, otherwise
> against "c".  Thus the strings "ca" and "cha" would both match the
> expression
> 
> [c \q{ch}]a
> 
> while "chh" but not "ch" would match against
> 
> [c \q{ch}]h
> 
> Or have I got this wrong?

After comparing this with the Perl behaviour of /(:?ch|c)
and /(:?ch|c)h, I've come to the conclusion that I've got the
interpretation wrong.  The former may match "ch" or "c", and I
conclude that the only funny meaning of \q is to indicate a preference
for the sequence of two characters - if the engine yields all matches,
it has no meaning.

This greatly simplifies matters.
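The same check in Python's re (my example):

    import re

    # A backtracking engine falls back from 'ch' to 'c' when the
    # following 'h' fails, so the order only expresses a preference:
    print(bool(re.fullmatch(r'(?:ch|c)h', 'ch')))    # True, via the 'c' branch
    print(bool(re.fullmatch(r'(?:ch|c)h', 'chh')))   # True, via the 'ch' branch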

Richard.


Pure Regular Expression Engines and Literal Clusters

2019-10-08 Thread Richard Wordingham via Unicode
I've been puzzling over how a pure regular expression engine that works
via a non-deterministic finite automaton can be bent to accommodate
'literal clusters' as in Requirement RL2.2 'Extended Grapheme Clusters'
of UTS#18 'Unicode Regular Expressions' - "To meet this requirement, an
implementation shall provide a mechanism for matching against an
arbitrary extended grapheme cluster, a literal cluster, and matching
extended grapheme cluster boundaries."  It works from a regular
expression by stitching together the FSMs corresponding to its elements.

An example UTS#18 gives for matching a literal cluster can be simplified
to, in its notation:

[c \q{ch}]

This is interpreted as 'match against "ch" if possible, otherwise
against "c".  Thus the strings "ca" and "cha" would both match the
expression

[c \q{ch}]a

while "chh" but not "ch" would match against

[c \q{ch}]h

Or have I got this wrong?

Thus, while "[c \q{ch}]" may be a regex, it is clearly not any notation
for a regular expression in the mathematical sense.

It seems to me that this expression requires backtracking, which is
totally alien to the design of the regular expression engine.  One
problem then is that the engine supports both the union and
intersection of regular languages.  While algebraic manipulation might
raise union to the highest level, eliminating intersection is an
expensive operation which I have deliberately avoided.  While
backtracking is feasible if state progression has been restricted to
the FSM for a literal cluster, it is far more difficult if multiple
FSMs have been running in parallel.

As the engine fully respects canonical equivalence (with the result
that it can find an accented letter of the Vietnamese alphabet even if
it bears a subscript tone mark), concatenated subexpressions can
divide the input streams between them.  Consequently, the
backtracking mechanism gets complicated.

May I correctly argue instead that matching against literal clusters
would be satisfied by supporting, for this example, the regular
subexpression "(c|ch)" or the UnicodeSet expression "[c{ch}]"?

Richard.



Re: Manipuri/Meitei customary writing system

2019-10-04 Thread Richard Wordingham via Unicode
On Fri, 4 Oct 2019 07:12:59 +
Martin J. Dürst via Unicode  wrote:

> On 2019/10/04 15:35, Martin J. Dürst via Unicode wrote:
> > Hello Markus,
> > 
> > On 2019/10/04 01:53, Markus Scherer via Unicode wrote:  
> >> Dear Unicoders,
> >>
> >> Is Manipuri/Meitei customarily written in Bangla/Bengali script or
> >> in Meitei script?
> >>
> >> I am looking at
> >> https://en.wikipedia.org/wiki/Meitei_language#Writing_systems
> >> which seems to describe writing practice in transition, and I
> >> can't quite tell where it stands.
> >>
> >> Is the use of the Meitei script aspirational or customary?
> >> Which script is being used for major newspapers, popular books,
> >> and video captions?  
> > 
> > This may give you some more information:
> > https://www.atypi.org/conferences/tokyo-2019/programme/activity?a=906  
> 
> Sorry, this should have been two separate URIs (about the same talk).
> 
> > https://www.youtube.com/watch?v=S8XxVZkfUkk
> > 
> > It's a recent talk at ATypI in Tokyo (sponsored by Google, among
> > others).

So newspaper sales tell us that the Bengali script is still the *usual*
script for the language.  Is that a different question to
what the 'customary' script is?

Richard.



Re: On the lack of a SQUARE TB glyph

2019-09-30 Thread Richard Wordingham via Unicode
On Mon, 30 Sep 2019 01:32:02 -0700
Asmus Freytag via Unicode  wrote:

> On 9/30/2019 1:01 AM, Andre Schappo via Unicode wrote:
> 
> On Sep 27, 1 Reiwa, at 08:17, Julian Bradfield via Unicode
>  wrote:
> 
> Or one could allow IDS to have leaf components that are any
> characters, not just ideographic characters, and then one could have
> all sorts of fun.
> 
> I do like this idea.
> 
> Note: This is a modified repost as I previously forgot to credit
> Julian as the originator

Egyptian hieroglyphs have lay-out operators that are meant to control
actual lay-out, whereas IDS operators are only meant as descriptions,
and a compliant implementation need not perform the lay-out described.

Richard.


Re: Proposing mostly invisible characters

2019-09-13 Thread Richard Wordingham via Unicode
On Fri, 13 Sep 2019 08:56:02 +0300
Henri Sivonen via Unicode  wrote:

> On Thu, Sep 12, 2019, 15:53 Christoph Päper via Unicode
>  wrote:
> 
> > ISHY/SIHY is especially useful for encoding (German) noun compounds
> > in wrapped titles, e.g. on product labeling, where hyphens are often
> > suppressed for stylistic reasons, e.g. orthographically correct
> > _Spargelsuppe_, _Spargel-Suppe_ (U+002D) or _Spargel‐Suppe_
> > (U+2010) may be rendered as _Spargel␤Suppe_ and could then be
> > encoded as _SpargelSuppe_.
> >  
> 
> Why should this stylistic decision be encoded in the text content as
> opposed to being a policy applied on the CSS (or conceptually
> equivalent) layer?

How would you define such a property?

Richard.




Re: Proposing mostly invisible characters

2019-09-12 Thread Richard Wordingham via Unicode
On Thu, 12 Sep 2019 14:53:45 +0200 (CEST)
Christoph Päper via Unicode  wrote:

> Dear Unicoders
> 
> There are some characters that have no precedent in existing
> encodings and are also hard to attest directly from printed sources.
> Can one still make a solid case for encoding those in Unicode? 
> 
> I am thinking of characters that are either invisible (most of the
> time) or can become invisible under certain circumstances.
> - INVISIBLE HYPHEN (IHY) or ZERO-WIDTH HYPHEN (ZWH)  
>   is *never* rendered as a hyphen,  
>   *but* the word it appears in is treated as if it contained one at
> its position. 

SOFT HYPHEN is supposed to be rendered in the manner appropriate to the
writing system, not necessarily like a HYPHEN.  In some writing
systems, such as, I gather, most very modern Lao writing systems, it
has no visual indication.  TUS claims that I was hallucinating when I
saw word wrapping hyphens in non-scriptio continua Pali in the Tai
Tham script in a Lao book.  (To put it less provocatively, one needs
user-level control of the rendering of soft hyphens.)

So, to make a proper case for INVISIBLE HYPHEN, you at least need
evidence of a contrast between soft-hyphen and an invisible hyphen.
Even then, you run the risk of being told that you should use a higher
level protocol which you will have to implement yourself.  Also, so
long as you don't need your text to be automatically split into words,
you can use ZWSP for the function.

Richard.



Re: LDML Keyboard Descriptions and Normalisation

2019-09-10 Thread Richard Wordingham via Unicode
On Sat, 7 Sep 2019 20:41:34 +0100
Richard Wordingham via Unicode  wrote:

> I don't think the model will run with Python Version 2.7. 

I was wrong.  It does run under Version 2.7.

Richard.


Re: LDML Keyboard Descriptions and Normalisation

2019-09-07 Thread Richard Wordingham via Unicode
On Sat, 7 Sep 2019 20:02:09 +0100
Cibu via Unicode  wrote:

> Slightly off topic: Is there a CLDR tool to try out transformations
> specified in a keyboard spec?

No CLDR tool, nor, so far as I am aware, any CLDR-endorsed tool.  Martin
Hosken has put together a reference model in Python at
https://github.com/keymanapp/ldml-keyboards-dev , and it seems highly
likely that the model is consistent with the 'specification'.  I get
the strong feeling that the new sections of LDML Part 7 are an
inadequate description of this model.

I don't think the model will run with Python Version 2.7. 

Richard.


Re: LDML Keyboard Descriptions and Normalisation

2019-09-07 Thread Richard Wordingham via Unicode
On Tue, 3 Sep 2019 18:03:18 +
Andrew Glass via Unicode  wrote:

> Hi Richard,
> 
> This is a good point. A keyboard that is doing transforms should
> specify which type of normalization it has been designed to do. I've
> filed a ticket to track this.

The ticket is https://unicode-org.atlassian.net/browse/CLDR-13273 .

My question was whether the recording capability is already there in
LDML. It doesn't even need transforms - there are enough keys to support
Latin-1 in NFC without any transforms.

Richard.


LDML Keyboard Descriptions and Normalisation

2019-09-02 Thread Richard Wordingham via Unicode
I'm getting conflicting indications about how the LDML keyboard
description handles issues of canonical equivalence.  I have one
simple question which some people may be able to answer.

Is the keyboard specification intended to distinguish between
keyboards that generally output:

(a) NFC text;
(b) NFD text; or
(c) Deliberately unnormalised texts?

For example, when documenting my own keyboards, I would want to
distinguish between a keyboard that went to great trouble to output
text in precomposed characters as opposed to one that took the easy
route of outputting text in fully decomposed characters.  For a
Tibetan keyboard, it would matter whether contractions were compatible
with the USE (so generally *not* NFC or NFD) or in NFC or NFD.
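The distinction is invisible on screen but not in the data - a quick
check in Python (my example):

    import unicodedata

    nfc = '\u00E9'      # é as one codepoint
    nfd = 'e\u0301'     # e + COMBINING ACUTE ACCENT
    print(nfc == nfd)                                 # False
    print(unicodedata.normalize('NFC', nfd) == nfc)   # True - equivalent text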

Richard.


Re: The native name of Tai Viet script and language(s)

2019-08-27 Thread Richard Wordingham via Unicode
On Tue, 27 Aug 2019 04:56:35 +
Peter Constable via Unicode  wrote:

> The script _is_ related to Thai script, but I’m not sure I would say
> it has “the same origin as that of Thai language/script used in
> Thailand”, as that is too simplistic a view of the historic
> connections: it suggests that Thai script and Tai Viet developed
> directly from the same precursor, which isn’t really accurate.

Can you elaborate on that?  There seems to be a chasm when we reach
back beyond the Sukhothai script, which embodies a failed reform.
(There seems to be evidence that the writing system is not a 19th
century fake - motive and opportunity had seemed available.)
Incidentally, is there a consensus view on whether the Sukhothai script
is mostly encoded, and if so, in which Unicode script(s)?

What is true is that both Thai and Tai Viet use consonants to record
the difference between two sets of three tones (though later mergers
and splits can result in 3 + 3 = 5 or 3 + 3 = 7 = 6 = 5); this seems to
be a register difference as in Cham and still in a few Khmer dialects,
going back to an ancient voicing difference.

@Eli: Ideally, you need to check that default font and language are
consistent.  There are some regional differences which make it
necessary to calibrate the writing system, and one word may not suffice.

Richard.



Re: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar

2019-08-21 Thread Richard Wordingham via Unicode
On Tue, 20 Aug 2019 22:43:43 +
Andrew Glass via Unicode  wrote:

> The order of medials in Myanmar clusters is constrained by UTN
> #11. So yes, you do need to
> follow the preferred order for Myanmar even if the sequence does not
> match phonetic order.

If the spelling matters, there may be a partial solution.  It depends
on being able to have <U+1039, U+101D>.  This
combination is intended for use in Pali and Sanskrit when the shape of
Burmese <U+103D MEDIAL WA> is unacceptable - see 
http://www.unicode.org/L2/L2006/06077-n3043r-myanmar.pdf for the
justification for adding U+103D as an *alternative* to <1039, 101D>.

Now, UTN#11, which is explicitly not endorsed by the UTC, does not
allow the sequence <U+1039, U+101D, U+103B> and the Padauk font does not
support it.  I can't find anything in the Myanmar rendering
description
https://docs.microsoft.com/en-gb/typography/script-development/myanmar
that indicates that that renderer might reject the combination.
Consequently, you may be able to find an ordinary font that will render
the sequence <U+1012, U+1039, U+101D, U+103B> to give you an
acceptable rendering of 'dvya' that will be analysed as having the right
spelling.

I can't see any way of helping you to get a renderable unambiguous
Myanmar script spelling for 'trya' - unless you're prepared to supply
alternative renderers.  (I presume Zawgyi fonts are not an acceptable
alternative.)

Richard.


Re: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar

2019-08-21 Thread Richard Wordingham via Unicode
On Wed, 21 Aug 2019 02:47:28 +
James Kass via Unicode  wrote:

> > Are we are allowed to write Llangollen as the definition of the
> > Unicode Collation Algorithm implies we should, with an invisible CGJ
> > between the 'n' and the 'g', so that it will collate correctly in
> > Welsh?  That CGJ is necessary so that it will collate*after*
> > Llanberis. (The problem is that the letter 'ng' comes before the
> > letter 'n'.)
> If 'ng' comes before 'n', shouldn't Llangollen 
> collate *before* Llanberis in a Welsh listing?

I'm not quite sure of the question.  There are two possible answers:

(a) I used the English spelling because, so far as I am aware, the US
keyboard lacks CGJ.  (I'm using a US keyboard layout so as to get
the keycaps engraved with Thai.)

(b) No, because 'Llangollen' doesn't contain the letter
'ng'. It's spelt 'll', 'a', 'n', 'g', 'o', 'll', 'e', 'n' (8 letters)
not 'll', 'a', 'ng', 'o', 'll', 'e', 'n' (7 letters).  There are a very
few look-alikes, where one is spelt with 'ng' and the other with 'n',
'g'.

Richard.



Re: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar

2019-08-21 Thread Richard Wordingham via Unicode
On Wed, 21 Aug 2019 02:40:09 +
James Kass via Unicode  wrote:

> On 2019-08-21 2:08 AM, Richard Wordingham via Unicode wrote:
> > Are we are allowed to write Llangollen as the definition of the
> > Unicode Collation Algorithm implies we should, with an invisible CGJ
> > between the 'n' and the 'g', so that it will collate correctly in
> > Welsh?  That CGJ is necessary so that it will collate*after*
> > Llanberis. (The problem is that the letter 'ng' comes before the
> > letter 'n'.)  
> So that it won't collate correctly in anything other than Welsh?

CGJ has zero weight in most, if not all standard UCA or
CLDR-like collations.

Richard.



Re: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar

2019-08-20 Thread Richard Wordingham via Unicode
On Tue, 20 Aug 2019 22:43:43 +
Andrew Glass via Unicode  wrote:

> The order of medials in Myanmar clusters is constrained by UTN
> #11. So yes, you do need to
> follow the preferred order for Myanmar even if the sequence does not
> match phonetic order.

Are we are allowed to write Llangollen as the definition of the
Unicode Collation Algorithm implies we should, with an invisible CGJ
between the 'n' and the 'g', so that it will collate correctly in
Welsh?  That CGJ is necessary so that it will collate *after*
Llanberis. (The problem is that the letter 'ng' comes before the letter
'n'.)

Welsh is a European language, so I believe it has the right to
strive to have its words collated correctly.  But perhaps I'm wrong.

Richard.


Re: PUA (BMP) planned characters HTML tables

2019-08-15 Thread Richard Wordingham via Unicode
On Wed, 14 Aug 2019 23:32:37 +
James Kass via Unicode  wrote:

> U+0149 has a compatibility decomposition.  It has been deprecated and
> is not rendered identically on my system.
> 'n ʼn
> ( ’n )

Compatibility decompositions are quite a mix, but are generally
expected to render differently.  If they were expected to render the
same, they would normally be canonical decompositions.

U+0149 and its decomposition naturally render very differently with a
monospaced font.  The same goes for the Roman numerals that the Far
East gave us.
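Checking with Python's unicodedata (my example):

    import unicodedata

    print(unicodedata.decomposition('\u0149'))      # <compat> 02BC 006E
    print(unicodedata.normalize('NFC', '\u0149'))   # unchanged: no canonical mapping
    print(unicodedata.normalize('NFKC', '\u0149'))  # two code points: U+02BC + n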

Richard.



Re: PUA (BMP) planned characters HTML tables

2019-08-14 Thread Richard Wordingham via Unicode
On Wed, 14 Aug 2019 09:05:02 +
James Kass via Unicode  wrote:

> The solution is to deprecate "LATIN LOWER CASE I WITH HEART".  It's
> only in there because of legacy.  Its presence guarantees
> round-tripping with legacy data but it isn't needed for modern data
> or display.  Urge Groups One and Two to encode their data with the
> desired combiner and educate font engine developers about the
> deprecation.  As the rendering engines get updated, the system
> substitution of the wrongly named precomposed glyph will go away.

I think you'd also have to change the reference glyph of LATIN LOWER
CASE I WITH HEART to show a heart.  That's valid because the UCD trumps
the code charts, and no Unicode-compliant process may deliberately
render <i, combining heart> differently from LATIN LOWER CASE I WITH
HEART. 

Richard.



Re: PUA (BMP) planned characters HTML tables

2019-08-12 Thread Richard Wordingham via Unicode
On Mon, 12 Aug 2019 01:21:42 +
James Kass via Unicode  wrote:

> There was a time when populating the PUA with precomposed glyphs was 
> necessary for printing or display, but that time has passed.

There is still the issue that in pure X one can't put sequences of
characters on a key; if the application doesn't invoke an input method
one is stuck.  Useful 20-year old proprietary code may be totally unable
to use modern font capabilities.  Don't forget the Cobol Y10k joke.

On Ubuntu at least, there was a period when Emacs couldn't access
X-based input methods from an English locale. The work-around: Use a
Japanese locale plus the vanilla lack of internationalisation in the
interface, or Emacs's very convenient alternative keyboard capability
for text input as opposed to commands.  The bug turned out to be in the
definition of the locales, i.e. in privileged data beyond the purview
of Emacs.

As to the need for the PUA, writing fonts to cope with Tai Tham
rendering engines is not easy, and it's no surprise that the PUA is used
on line for a newspaper that uses the Tai Tham script.  The USE is too
user-hostile for it to have helped if it had been available earlier.
(It just ignored the regular expression published in 2007; it's in
L2/07-007R in the UTC document register, ISO/IEC
JTC1/SC2/WG2/N3207R on ISO land.) Indeed, perhaps I should be
researching the PUA encoding for Tai Tham. (My Tai Tham font Da Lekh
started as proof of principle, for there is already an unpleasant
amount of glyph sequence changing, some style-dependent. I couldn't see
how to get rendering engine support even when it might be added.  I was
pleasantly surprised at how far from impossible Tai Tham layout was
until the USE came along and made everything harder.  I now have to work
out which glyph instances have already been Indicly rearranged when I
repair the clustering.)

Oh, and I seem to need some PUA codepoints for vowels that get stranded
when line-breaks occur between the columns of an akshara.  The
proposals show this phenomenon in old(?) Pali text.  Or is there any
chance of getting them encoded?

Richard.


Re: PUA (BMP) planned characters HTML tables

2019-08-10 Thread Richard Wordingham via Unicode
On Sun, 11 Aug 2019 00:07:05 -0400
Robert Wheelock via Unicode  wrote:

> I remember that a website that has tables for certain PUA precomposed
> accented characters that aren’t yet in Unicode (thing like:
> Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital
> H-underbar, acute accented Cyrillic vowels, Cyrillic
> ER/er-caron, ...).  Where was it at?!  I still want to get the
> information.  Thank You!

You may mean https://www.eki.ee/letter.  Once there, you'll want to make
a query by Unicode range, e.g. e000-f8ff.  It doesn't seem to refer to
the relevant agreement.  You could start hunting for agreements at
https://scripts.sil.org/cms/scripts/page.php?item_id=VendorUseOfPUA

Most of the characters you mention are scheduled to be assigned their
own codepoint on the Greek kalends.  They are precluded by policy
because they would need to be composition exclusions to avoid making
text in NFC cease to be in NFC.

I first thought of the SIL PUA at
https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=PUA_home ,
but they knew better than to include most of them.

Richard.



Re: Fonts and Canonical Equivalence

2019-08-10 Thread Richard Wordingham via Unicode
On Sat, 10 Aug 2019 16:37:48 +0100
Andrew West via Unicode  wrote:

> On Sat, 10 Aug 2019 at 15:46, Richard Wordingham via Unicode
>  wrote:

> > Does vowel above before vowel below yield a dotted circle?  
> 
> Yes. Attached are screenshots for two real world examples, one which
> is logically spelled as i + u, and one as u + i:
> 
> 1. ཉིུ <0F49 0F72 0F74> [nyiu] as a contraction for ཉི་ཤུ [nyi shu]
> "twenty"
> 
> 2. བཅིུག <0F56 0F45 0F74 0F72 0F42> [bcuig] as a contraction for
> བཅུ་གཅིག [bcu gcig] "eleven"

Thanks for the clarification.  I must have done something wrong when I
tried to break Tibetan rendering by an above-below sequence - unless MS
Edge denormalises Tibetan text so that it will render.

However, we may be able to redress the balance between the renderers by
inserting CGJ between the vowels to preserve the order when the strings
are copied:

nyiu ཉི͏ུ 0F49 0F72 034F 0F74

bcuig བཅུ͏ིག  0F56 0F45 0F74 034F 0F72 0F42
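The normalisation behaviour, if not the rendering, can be verified - a
check with Python's unicodedata (my sketch):

    import unicodedata

    bcuig = '\u0F56\u0F45\u0F74\u034F\u0F72\u0F42'   # with CGJ
    plain = '\u0F56\u0F45\u0F74\u0F72\u0F42'         # without CGJ
    # Without CGJ, canonical ordering swaps u (ccc=132) and i (ccc=130):
    print(unicodedata.normalize('NFC', plain)
          == '\u0F56\u0F45\u0F72\u0F74\u0F42')            # True
    # CGJ (ccc=0) blocks the reordering, so the typed order survives:
    print(unicodedata.normalize('NFC', bcuig) == bcuig)   # True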

On my machine they display without dotted circles in Claws-Mail and
LibreOffice, but I may be using too old a version of HarfBuzz.  However,
the ligaturing is missing in _nyiu_ with CGJ. LibreOffice at least is
using Tibetan Machine Uni.  However, in a snapshot of HarfBuzz I pulled
in the past few days, both were rendered with dotted circles. This issue
is apparently being worked on - 
(https://github.com/harfbuzz/harfbuzz/issues/483).

The forms without CGJ render fine in the two applications

Richard.



Re: Fonts and Canonical Equivalence

2019-08-10 Thread Richard Wordingham via Unicode
On Sat, 10 Aug 2019 11:22:01 +0100
Andrew West via Unicode  wrote:

> On Sat, 10 Aug 2019 at 08:29, Richard Wordingham via Unicode
>  wrote:
> >
> > There are similar issues with Tibetan; some fonts do not work
> > properly if a vowel below (ccc=132) is separated from the base of
> > the consonant stack by a vowel above (ccc=130).  
> 
> It's not that the fonts don't work, it's that some the rendering
> engines do not apply the OpenType features in the font that support
> both sequences of vowels (vowel-above followed by vowel-below, and
> vowel-below followed by vowel-above).

My observation was based on a Tibetan font that failed when pre-USE
HarfBuzz added or changed the normalisation for Tibetan.

> Just retested on Windows 10 with
> a Tibetan font that supports both sequences of vowels, and both
> sequences display correctly under Harfbuzz (as expected), but only
> vowel-below followed by vowel-above displays correctly when using
> built-in Windows rendering.

Does vowel above before vowel below yield a dotted circle?

According to the documentation - and the USE may have been improved in
undocumented ways - the blwf feature will not apply across a
Tibetan sequence of vowel above (VAbv) followed by vowel below (VBlw
or CMBlw), but the blws feature will, even if a dotted circle has been
added at the boundary.

> It is very frustrating that Windows cannot correctly support the
> display of Tibetan in normalized form, yet Harfbuzz does not have any
> problems. Personally, I think USE is a failed experiment, and I wish
> Microsoft would simply adopt Harfbuzz as the default rendering engine.

From what I've seen from discussions on HarfBuzz, the USE seems to work
well for non-Indic scripts and Devanagari clones - possibly even
for Bengali clones.  It's also a definition that HarfBuzz can fall back
on.  The problems is that it doesn't address the quirks of scripts, and
its anti-spoofing measures are draconian and overdone.

There may well be an issue of funding for the USE - for all I know, it
may in part be charity work.

If Microsoft gave up on rendering engines, who would write the
rendering specifications for HarfBuzz?

I was wondering how the USE might be modified to handle canonical
equivalence.  The simplest way may be to permute the canonical
combining classes, normalise (NFD) according to these classes, and
process the rearranged string.  That's roughly what HarfBuzz does.

Another technique would be to derive regular expressions that would
match any string canonically equivalent to a string matching the
original regular expressions and use them instead.  (It may be
simpler to derive a regular expression that finds matches from amongst
normalised strings - that's what my canonical equivalence respecting
regular expression does.) Using a different canonical equivalent to the
present one could 'break' fonts whose sets of properly handled strings
were not closed under canonical equivalence - which is why I asked the
original question.

Richard.



Fonts and Canonical Equivalence

2019-08-10 Thread Richard Wordingham via Unicode
I've spun this question off from the issue of what the USE is to do when
confronted with the NFC canonical equivalent of a string it will accept
when this equivalent does not match its regular expressions when they
are applied to strings of characters rather than canonical equivalence
classes of strings.

What sort of guidance is there on the streams of characters to be
supported by a font with respect to canonical equivalence?  For example,
one might think it would suffice for a font to support NFD strings
only, but sometimes it seems that the only canonical equivalent that
needs be supported is not the Unicode-defined canonical form, but a
renderer-defined canonical form.

For example, when a Tai Tham renderer supports subscripted final
consonants, should the font support both of the canonically
equivalent orderings of the marks involved, or just the one chosen by
the rendering engine?  The HarfBuzz SEA engine would present the font
with only one of them; font designers had seen rendering failures when
Tai Tham
text belatedly started being canonically normalised.

There are similar issues with Tibetan; some fonts do not work properly
if a vowel below (ccc=132) is separated from the base of the
consonant stack by a vowel above (ccc=130).

TUS sees a rendering engine plus a font file (or a set of them) as a
single entity, so I don't think it's much guidance here.  It seems
tolerant of the loss of precision in placement when a Latin character
is rendered as base plus diacritic rather than as a precomposed glyph.
One can also pedantically argue that a font is a data file rather than
a 'process'.  (Additionally, a lot of us get confused by the mens rea
aspect of Unicode compliance.)

Richard.



Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-08 Thread Richard Wordingham via Unicode
On Thu, 8 Aug 2019 00:33:47 +
Andrew Glass via Unicode  wrote:

> I agree and understand that accurate representation is important in
> this case. It would be good to understand how widespread the issue is
> in order to begin to justify the work to retrofit shaping with
> normalization. The number of problematic strings may be small but the
> risk of regression in this case might be quite large.

Well, you could always reverse engineer HarfBuzz!

Just a reminder though.  You would be using a permutation of the
canonical combining classes - for Tai Tham, U+1A60 should be treated as
ccc=254, not ccc=0, and for Tibetan you would need to ensure that the
vowels below (ccc=132) came before the vowels above (ccc=130).
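A toy sketch of that idea (mine) - the reordering half of the
normalisation algorithm only, driven by an assumed permutation of the
classes; a real engine would key the permutation per script, and for
Tai Tham would also send U+1A60's class to 254:

    import unicodedata

    PERMUTE = {130: 132, 132: 130}   # illustrative: Tibetan below before above

    def key(ch):
        ccc = unicodedata.combining(ch)
        return PERMUTE.get(ccc, ccc)

    def reorder(s):
        out, run = [], []
        for ch in s:
            if key(ch) == 0:
                run.sort(key=key)   # stable sort of each combining run
                out.extend(run)
                out.append(ch)
                run = []
            else:
                run.append(ch)
        run.sort(key=key)
        return ''.join(out + run)

    s = '\u0F56\u0F45\u0F74\u0F72\u0F42'   # ..., u (ccc=132), i (ccc=130), ...
    print(reorder(s) == s)   # True: the permuted classes keep u first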

Richard.



Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-08 Thread Richard Wordingham via Unicode
On Wed, 7 Aug 2019 14:19:26 -0700
Asmus Freytag via Unicode  wrote:

> What about text that must exist normalized for other purposes?
> 
> Domain names must be normalized to NFC, for example. Will such
> strings display correctly if passed to USE?

One solution, of course, is to minimise the use of Microsoft
products.  (The trick
is to apply the normalisation algorithm using a permutation of the
positive ccc values.)  The latest version of HarfBuzz renders
subscripted final consonants; it's slowly recovering its pre-USE
rendering capabilities. 

> On 8/7/2019 1:39 PM, Andrew Glass via Unicode wrote:
> That's correct, the Microsoft implementation of USE spec does not
> normalize as part of the shaping process. Why? Because the ccc system
> for non-Latin scripts is not a good mechanism for handling complex
> requirements for these writing systems and the effects of ccc-based
> normalization can disrupt authors intent. Unfortunately, because we
> cannot fix ccc values, shaping engines at Microsoft have ignored
> them. Therefore, recommendation for passing text to USE is to not
> normalize.

HarfBuzz solved the problem of  by choosing a
suitable normalisation; it uses the same technique for Hebrew, where
the normalisation classes are also unfriendly to renderers.  

> By the way, at the current time, I do not have a final consensus from
> Tai Tham experts and community on the changes required to support Tai
> Tham in USE. Therefore, I've not been able to make the changes
> proposed in this thread.

Grammatical denazification is one solution.  Another one is to delegate
matters to the font.  Give us a script type that will implement a GSUB
feature by default, and font writers can take it from there. At present
I have a conundrum on how to render the accusative singular of the
cruciform form of the word for enlightenment without using chained
syllables, _bodhiṃ_.  The obvious visual encoding is .  This combination is very
unusual, perhaps unique to this word.  (Pali 'o' is ). However, a very common combination, because the UTC refused Tai
Tham the character SIGN AM, is SIGN AA, MAI KANG, so for the USE, SIGN
AA and MAI KANG have to be in the same character class.  (Alternatively,
we split the syllable before SIGN AA.)  MAI KANG has InSc=bindu, while
SIGN AA is a right matra. Unfortunately, there is a strong temptation
for many to write what would have been 'SIGN AM' as MAI KANG, SIGN AA,
which is to be rendered quite differently from 'SIGN AM' outside
Northern Thailand, e.g. in NE Thailand.  (Northern Thailand has both
styles; it is quite diverse.)  If I understand the principles of USE,
allowing both '... MAI KANG, SIGN AA...' and '... SIGN AA, MAI
KANG ', which immediately after a consonant have the same rendering
in some fonts and very confusable renderings in many others, is
considered highly undesirable.

For Microsoft applications, another solution is for fonts to delete
dotted circles between Tai Tham characters.  (I try to be more
selective, but this results in a complicated set of lookups to
ensure that deletion only occurs when the renderer has inserted
inappropriate dotted circles.)  This is not compliant with Unicode, but
neither is deliberately treating canonically equivalent forms
differently.

Richard.



Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-07 Thread Richard Wordingham via Unicode
On Tue, 14 May 2019 03:08:04 +0100
Richard Wordingham via Unicode  wrote:

> On Tue, 14 May 2019 00:58:07 +
> Andrew Glass via Unicode  wrote:
> 
> > Here is the essence of the initial changes needed to support CV+C.
> > Open to feedback.
> > 
> > 
> >   *   Create new SAKOT class
> > SAKOT (Sk) based on UISC = Invisible_Stacker
> >   *   Reduced HALANT class
> > Now only HALANT (H) based on UISC = Virama
> >   *   Updated Standard cluster mode
> > 
> > [< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)*
> > (< < H | Sk > B | SUB > [VS] (CMAbv)* (CMBlw)*)*
> > [MPre] [MAbv] [MBlw] [MPst] (VPre)* (VAbv)* (VBlw)* (VPst)*
> > (VMPre)* (VMAbv)* (VMBlw)* (VMPst)* (Sk B)* (FAbv)* (FBlw)* (FPst)* [FM]

> This next question does not, I believe, affect HarfBuzz.  Will NFC
> code render as well as unnormalised code?  In the first example above,
>  normalises to , which
> does not match any portion of the regular expression.

Could someone answer this question, please?  The USE documentation
("CGJ handling will need to be updated if USE is modified to support
normalization") still implies that the USE does not respect canonical
equivalence.

Richard.


Re: Akkha script (used by Eastern Magar language) in ISO 15924?

2019-07-23 Thread Richard Wordingham via Unicode
On Mon, 22 Jul 2019 17:42:37 -0700
Anshuman Pandey via Unicode  wrote:

> As I pointed out in L2/11-144, the “Magar Akkha” script is an
> appropriation of Brahmi, renamed to link it to the primordialist
> daydreams of an ethno-linguistic community in Nepal. I have never
> seen actual usage of the script by Magars. If things have changed
> since 2011, I would very much welcome such information. Otherwise,
> the so-called “Magar Akkha” is not suitable for encoding. The Brahmi
> encoding that we have should suffice.

How would mere usage qualify it as a separate script?

Richard.



Re: Displaying Lines of Text as Line-Broken by a Human

2019-07-22 Thread Richard Wordingham via Unicode
On Sun, 21 Jul 2019 20:53:19 -0700
Asmus Freytag via Unicode  wrote:

> There's really no inherent need for many spacing combining marks to
> have a base character. At least the ones that do not reorder and that
> don't overhang the base character's glyph.

We are in agreement here.

> As far as I can  tell, it's largely a convention that originally
> helped identify clusters and other lack of break opportunities. But
> now that we have separate properties for segmentation, it's not
> strictly necessary to overload the combining property for that
> purpose.

Which relates to the separate question I asked about breaking at
grapheme boundaries.  Interestingly, I'm not seeing breaks next to an
invisible stacker, but that may be because Pali subscript consonants
only slightly increase the width of the cluster.

The need for a base makes sense for reordering spacing marks, but its
purpose should be to detect editing errors, not to rule out deliberate
effects.  An unreordered reordering mark plus consonant is visually
ambiguous with consonant plus reordering mark.

> In your example, why do you need the ZWJ and dotted circle?

The user- and application-supplied text would be
.

> Originally, just applying a combining mark to a NBSP should normally
> show the mark by itself. If a font insists on inserting a dotted
> circle glyph, that's not required from a conformance perspective -
> just something that's seen as helpful (to most users).

It's not the font that inserts the dotted circle, it's the rendering
engine.  That's why the USE set Tai Tham rendering back several
years.  Now, there is at least one renderer (HarfBuzz) for which a
cunning font can work out whether the renderer has introduced the
dotted circle glyph rather than it being in the text to be rendered.  I
am looking for a general font-level solution to the problem that would
even work on Windows 10.

The ZWJ seems a reasonable hint that the space should be rendered with
zero width.  Do you think it is reasonable for  to
have zero width contribution from the NBSP when the spacing mark has a
non-overhanging glyph? It seems to be an unstandardised area, but zero
width might be considered to violate the character identity of NBSP.

I also have the problem of visually line-final U+1A6E TAI THAM VOWEL
SIGN E, which needs to be separated from a preceding consonant in the
backing store.  It seems to be particularly common before the holes
(two per page) for the string that holds the pages together.   Perhaps
the scribe tried to avoid line-final U+1A6E.

There are examples of these issues in Figure 9b of
http://www.unicode.org/L2/L2007/07007r-n3207r-lanna.pdf .  The last
syllable of _cattāro_ 'four' straddles lines 2 and 3, with its first
glyph (corresponding to SIGN E) ending line 2, and 
starting line 3.

The antepenultimate syllable of _sammodamānehi_ (misspelt
_samoddamānehi_) 'pleasing' is split between lines 7 and 8, with line 7
ending in MA and line 8 starting in SIGN AA.

I am looking for advice on what is the least bad readily achievable
solution. I can then adapt that to cope with the messier issue of the
non-spacing character U+1A58 TAI THAM SIGN MAI KANG LAI, which acts
like Burmese kinzi in the Pali text I am working on.  (If one does not
know the font well, one should not put a line break next to it unless
all other options are exhausted.)  Figure 9b also has an example of this
issue.  The initial consonant of saṅkhepaṃ (misspelt saṅkheppaṃ)
'collection, summary' is on line 9, while the rest of the word,
starting , is on line 10. 

There is a weird hack that currently helps with LibreOffice - inserting
CGJ turns off some parts of Indic shaping in the rest of the run.  Or
have I missed some new specification of Indic encoding?  This helps
with visually line-final SIGN E.

Richard.



Displaying Lines of Text as Line-Broken by a Human

2019-07-21 Thread Richard Wordingham via Unicode
I've been transcribing some Pali text written on palm leaf in the
Tai Tham script.  I'm looking for a way of reflecting the line
boundaries in a manuscript in a transcription.  The problem is that
lines sometimes start or end with an isolated spacing mark.  I want
my text to be searchable and therefore encoded in Unicode.  (I
appreciate that there is a trade-off between searchability and showing
line boundaries.  The unorthodox spelling is also a problem.)

How unreasonable is it for a font to render



as just the spacing mark?  Some rendering systems give the font no way
of distinguishing dotted circles in the backing store from dotted
circles added by the renderer, so this technique is not Unicode
compliant.

An alternative solution is to have a parallel font (or, more neatly, a
feature) that renders some base character (or sequence) as a zero-width
non-inking character.  This, however, would violate that character's
identity.  I suspect there is no Unicode-compliant solution.

Richard.


Breaking lines at Grapheme Boundaries

2019-07-19 Thread Richard Wordingham via Unicode
If a renderer claims to support a writing system, should it render the
text reasonably if its client breaks lines at extended grapheme
cluster boundaries?

The writing system itself has no compunction about
breaking lines between legacy grapheme clusters, though I've no idea
how I should get a mere word-processor to implement some of these line
breaks.  (The big problem here is that Indic reordering would be
required around the line break.)

Richard.


Re: ISO 15924 : missing indication of support for Syriac variants

2019-07-18 Thread Richard Wordingham via Unicode
On Wed, 17 Jul 2019 21:01:30 -0700
Asmus Freytag via Unicode  wrote:

> On 7/17/2019 6:03 PM, Richard Wordingham via Unicode wrote:

>> A significant issue is that the hieratic script is right to left but
>> Unicode only standardises the encoding of left-to-right
>> transcriptions.  I don't recall the difference between retrograde v.
>> normal text being declared a style difference.
 
> Use directional overrides. Those have been in the standard forever.

How do they help distinguish normal right-to-left text and
right-to-left retrograde text?  As I understand it, the implementer has
to guess which way characters in an ancient script face when the
direction of the text is overridden.

Unicode used to define the orientation, but that got withdrawn a few
years ago.

Richard.


Re: ISO 15924 : missing indication of support for Syriac variants

2019-07-17 Thread Richard Wordingham via Unicode
On Thu, 18 Jul 2019 01:54:52 +0200
Philippe Verdy via Unicode  wrote:

> In fact the ligatures system for the "cursive" Egyptian Hieratic is so
> complex (and may also have its own variants showing its progression
> from Hieroglyphs to Demotic or Old Coptic), that probably Hieratic
> should no longer be considered "unified" with Hieroglyphs, and its
> existing ISO 15924 code is then not represented at all in Unicode.

Writing hieroglyphic text as plain text has only been supported since
Unicode 12.0, so it may take a little while to explore workable encoding
conventions.

A significant issue is that the hieratic script is right to left but
Unicode only standardises the encoding of left-to-right
transcriptions.  I don't recall the difference between retrograde v.
normal text being declared a style difference.

For comparison, we still have no guidance on how to encode sexagesimal
Mesopotamian cuneiform numbers, e.g. '610' v. '20' written using the U
graphic element.

Richard.


Re: Unicode "no-op" Character?

2019-07-03 Thread Richard Wordingham via Unicode
On Wed, 3 Jul 2019 17:51:29 -0400
"Mark E. Shoulson via Unicode"  wrote:

> I think the idea being considered at the outset was not so complex as 
> these (and indeed, the point of the character was to avoid making
> these kinds of decisions).

Shawn Steele appeared to be claiming that there was no good, interesting
reason for separating base character and combining mark.  I was
refuting that notion.  Natural text boundaries can get very messy -
some languages have word boundaries that can be *within* an
indecomposable combining mark.

Richard.


Re: Unicode "no-op" Character?

2019-06-23 Thread Richard Wordingham via Unicode
On Sat, 22 Jun 2019 21:10:08 -0400
Sławomir Osipiuk via Unicode  wrote:

> In fact, that might be the best description: It's not just an
> "ignorable", it's a "discardable". Unicode doesn't have that, does it?

No, though the byte order mark at the start of a file comes close.
Discardables are a security risk, as security filters may find it hard
to take them into account.
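
A toy Python illustration of the risk; the blacklist and the choice of
U+200B as a stand-in for a discardable are invented for the example:

    # A naive filter that compares strings code point by code point.
    FORBIDDEN = {"admin", "root"}

    name = "ad\u200Bmin"      # ZERO WIDTH SPACE hidden inside "admin"
    print(name in FORBIDDEN)  # False - the filter is bypassed

    # The check only works if the ignorable is stripped first; a truly
    # discardable character would make such stripping mandatory everywhere.
    print(name.replace("\u200B", "") in FORBIDDEN)  # True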

Richard.



Re: Unicode "no-op" Character?

2019-06-23 Thread Richard Wordingham via Unicode
On Sat, 22 Jun 2019 23:56:50 +
Shawn Steele via Unicode  wrote:

> + the list.  For some reason the list's reply header is confusing.
> 
> From: Shawn Steele
> Sent: Saturday, June 22, 2019 4:55 PM
> To: Sławomir Osipiuk 
> Subject: RE: Unicode "no-op" Character?
> 
> The original comment about putting it between the base character and
> the combining diacritic seems peculiar.  I'm having a hard time
> visualizing how that kind of markup could be interesting?

There are a number of possible interesting scenarios:

1) Chopping the string into user perceived characters.  For example,
the Khmer sequences of COENG plus letter are named sequences.  Akin to
this is identifying resting places for a simple cursor, e.g. allowing it
to be positioned between a base character and a spacing, unreordered
subscript.  (This last possibility overlaps with rendering.)

2) Chopping the string into collating elements.  (This can require
renormalisation, and may raise a rendering issue with HarfBuzz, where
renormalisation is required to get marks into a suitable order for
shaping.  I suspect no-op characters would disrupt this
renormalisation; CGJ may legitimately be used to affect rendering this
way, even though it is supposed to have no other effect* on rendering.)

3) Chopping the string into default grapheme clusters.  That
separates a coeng from the following character with which it
interacts.  (A sketch of this kind of split follows below.)

*Is a Unicode-compliant *renderer* allowed to distinguish diaeresis
from the umlaut mark?
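
A rough Python sketch of scenario 3, using the third-party regex
module, whose \X is meant to match extended grapheme clusters per
UAX #29; U+2060 WORD JOINER stands in for the proposed no-op and is my
own assumption:

    import regex  # third-party module: pip install regex

    s = "q\u0301"                    # q + COMBINING ACUTE ACCENT
    print(regex.findall(r"\X", s))   # one cluster

    # A format character between base and mark has
    # Grapheme_Cluster_Break=Control, which forces breaks on both sides
    # and splits the user-perceived character.
    t = "q\u2060\u0301"
    print(len(regex.findall(r"\X", t)))  # 3, not 1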

Richard. 



Re: Unicode "no-op" Character?

2019-06-22 Thread Richard Wordingham via Unicode
On Sat, 22 Jun 2019 23:56:11 +
Shawn Steele via Unicode  wrote:

> Assuming you were using any of those characters as "markup", how
> would you know when they were intentionally in the string and not
> part of your marking system?

If they're conveying an invisible message, one would have to strip out
original ZWNBSP/WJ/ZWSP that didn't affect line-breaking.  The weak
point is that that assumes that line-break opportunities are
well-defined.  For example, they aren't for SE Asian text.
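
A sketch of that stripping step in Python; the carrier set and the Thai
sample are my own assumptions:

    CARRIERS = {"\u200B", "\u2060", "\uFEFF"}  # ZWSP, WJ, ZWNBSP

    def strip_carriers(text: str) -> str:
        # Removes every candidate carrier - including any ZWSP that was
        # legitimately marking a break opportunity, so the original and
        # the marked-up string become indistinguishable.
        return "".join(ch for ch in text if ch not in CARRIERS)

    print(strip_carriers("ไป\u200Bมา") == strip_carriers("ไปมา"))  # True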

Richard.


Re: Unicode "no-op" Character?

2019-06-22 Thread Richard Wordingham via Unicode
On Sat, 22 Jun 2019 17:50:49 -0400
Sławomir Osipiuk via Unicode  wrote:

> If faced with the same problem today, I’d
> probably just go with U+FEFF (really only need a single char, not a
> whole delimited substring) or a different C0 control (maybe SI/LS0)
> and clean up the string if it needs to be presented to the user.

You'd really want an intelligent choice between U+FEFF (ZWNBSP) (better
U+2060 WJ) and U+200B (ZWSP).  

> I still think an “idle”/“null tag”/“noop”  character would be a neat
> addition to Unicode, but I doubt I can make a convincing enough case
> for it.

You'd still only be able to insert it between characters, not between
code units, unless you were using UTF-32.
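
A Python sketch of why, with U+1F600 as an arbitrary supplementary
character: splicing anything between its two UTF-16 code units yields
invalid UTF-16:

    # A supplementary character is two UTF-16 code units (a surrogate pair).
    pair = "\U0001F600".encode("utf-16-le")  # 4 bytes
    noop = "\u2060".encode("utf-16-le")      # the would-be no-op, 2 bytes

    # Inject the no-op between the high and the low surrogate:
    spliced = pair[:2] + noop + pair[2:]
    try:
        spliced.decode("utf-16-le")
    except UnicodeDecodeError as err:
        print("invalid UTF-16:", err)        # the high surrogate is unpaired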

Richard.



Re: What is the best way to work around the current USE CV+C limitation in Tai Tham?

2019-05-22 Thread Richard Wordingham via Unicode
On Wed, 22 May 2019 00:14:57 -0400
Ed Trager  wrote:

> I'm hoping one or both of you can provide me some guidance on this,
> thank you! Unfortunately, my OpenType skills are not at the "ninja"
> level required to get around all of the limitations in USE ...

If blind copying of Lamphun or Da Lekh, which I allow and encourage in
this respect, is not possible, then one can reduce the skill level in
two ways.  One is to indiscriminately eliminate dotted circles that
follow marks; that would simplify the conditions in those fonts.  (I
know that eliminating dotted circles present in the original string is
wrong - it's collateral damage in opposing oppression.)

Unfortunately, it's not as simple as that.  If the USE is still
misclassifying the InSC medial consonants as USE-medial consonants,
then they can still leave one with the need to do Indic reordering in
the font, e.g. with the reflexes of /ria/, to be encoded
, as the dotted circles
prevent SIGN E reordering to the start of the cluster. 

The second way is to attack individual cases.  For example, one can
probably get away with special substitutions to repair the HarfBuzz
and Windows corruptions of ᨯᩪᩕᩣ.  I don't know if CoreText has yet
another corruption.

Richard.



Re: Lao Sign Pali Virama and vowels above

2019-05-21 Thread Richard Wordingham via Unicode
On Tue, 21 May 2019 00:36:33 +
Andrew Glass via Unicode  wrote:

> This is because the sequences include U+0EBA which was added in
> Unicode 12.0. Edge has not updated for Unicode 12 at this time.

That suspicion was why I was hoping it was a temporary aberration.  When
it is so updated, will it support these sequences? Yes, no or undecided?
The Lao section of Microsoft Typography has not yet been updated.  I've
raised a formal issue at
https://github.com/MicrosoftDocs/typography-issues/issues/238 .

Richard.


Re: Lao Sign Pali Virama and vowels above

2019-05-20 Thread Richard Wordingham via Unicode
On Mon, 20 May 2019 22:53:36 +0100
Richard Wordingham via Unicode  wrote:

> MS Edge is currently giving me dotted circles for the sequences
>  and  UU>.  I trust this is just a temporary aberration.  

Also with the sequence , as in the
nominative singular ສັນທິຕ຺ຖ຺ິໂກ of ສັນທິຕ຺ຖ຺ິກະ, which transliterates
as sandiṭṭhika.  This last example displays perfectly well on HarfBuzz
renderers. 

Richard.



Lao Sign Pali Virama and vowels above

2019-05-20 Thread Richard Wordingham via Unicode
When a consonant bears both U+0EBA LAO SIGN PALI VIRAMA (acting as a
nukta) and a vowel above, is there or is there intended to be any
constraint on their relative order?  While U+0EBA has canonical
combining class 9, the vowels above have canonical combining class 0,
so the order makes a difference.
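
A quick check in Python (it needs a unicodedata based on Unicode 12.0
or later, i.e. Python 3.8+, for U+0EBA); U+0E9B PO is just an arbitrary
base:

    import unicodedata

    virama = "\u0EBA"  # LAO SIGN PALI VIRAMA, ccc=9
    vowel = "\u0EB4"   # LAO VOWEL SIGN I (a vowel above), ccc=0
    print(unicodedata.combining(virama), unicodedata.combining(vowel))  # 9 0

    # Because the vowel above is a starter (ccc=0), normalisation never
    # reorders the pair: the two orders stay distinct code point sequences
    # and are not canonically equivalent.
    a = unicodedata.normalize("NFC", "\u0E9B" + virama + vowel)
    b = unicodedata.normalize("NFC", "\u0E9B" + vowel + virama)
    print(a == b)  # False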

Typographically, these marks don't interfere, but renderers may
consider that to be a problem. 

The example of nukta and vowel below has now gone up on Wiktionary at
https://en.wiktionary.org/wiki/ວິຍ຺ຍ຺ູ .  The rendering worry arises
with the other form of the instrumental plural masculine.

MS Edge is currently giving me dotted circles for the sequences
 and .  I trust this is just a temporary aberration.

Richard.



Re: Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?

2019-05-15 Thread Richard Wordingham via Unicode
On Wed, 15 May 2019 05:56:54 -0700
Asmus Freytag via Unicode  wrote:

> On 5/15/2019 4:22 AM, Costello, Roger L. via Unicode wrote:
> Hello Unicode experts!
> 
> Which is correct:
> 
> (a) The input file contains a string. The string is encoded using
> UTF-8.
> 
> (b) The input file contains a string. The string is encoded with
> UTF-8.
> 
> (c) The input file contains a string. The string is encoded in UTF-8.
> 
> (d) Something else (what?)
> 
> /Roger
> 
> 
> I would say I've seen all three uses about equally.
> 
> If you search for each phrase, though, "in" comes up as the most
> frequent one.
> 
> That would make the last one, or simply "in UTF-8" (that is, without
> the "encoded") good choices for general audiences.

Additionally, the latter is about the current form of the string; the
others refer to its history, suggesting it might once have been
represented in some other way.

Richard.


Lao Nukta

2019-05-14 Thread Richard Wordingham via Unicode
I was looking through Maha Sena's textbook on Tai Tham for Pali, and I
noticed that he had a Lao script Pali section that made use of a nukta
that seems to me to be indistinguishable from U+0EBA LAO SIGN PALI
VIRAMA.  Is it therefore in order to use that character for this nukta,
just as U+0E3A THAI CHARACTER PHINTHU functions as a nukta?

Now the nukta and the vowels below slightly interact, with the nukta on
the left and the vowel below on the right.  As U+0EBA has ccc=9 and the
Lao vowels below have ccc=118, this seems to be fine.  (Of course, I
may have to wait to find a font that arranges them correctly.)
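
A Python check of that claim (again needing Unicode 12.0 data, so
Python 3.8+); NYO is the base consonant from the attached example:

    import unicodedata

    nyo = "\u0E8D"    # LAO LETTER NYO
    nukta = "\u0EBA"  # LAO SIGN PALI VIRAMA as a nukta, ccc=9
    below = "\u0EB9"  # LAO VOWEL SIGN UU, ccc=118

    # Two distinct non-zero classes: canonical ordering fixes the order,
    # so either input order normalises to <nukta, vowel below>.
    a = unicodedata.normalize("NFC", nyo + nukta + below)
    b = unicodedata.normalize("NFC", nyo + below + nukta)
    print(a == b)                    # True
    print([hex(ord(c)) for c in a])  # ['0xe8d', '0xeba', '0xeb9']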

I attach an example of the word "viññūhīti".

Richard.


Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-05-14 Thread Richard Wordingham via Unicode
On Tue, 14 May 2019 03:08:04 +0100
Richard Wordingham via Unicode  wrote:

> Together,
> these call for (Sk B)* to be replaced by ().

Correction:
Together, these call for (Sk B)* to be replaced by ()*.

Richard.


Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-05-13 Thread Richard Wordingham via Unicode
On Tue, 14 May 2019 00:58:07 +
Andrew Glass via Unicode  wrote:

> Here is the essence of the initial changes needed to support CV+C.
> Open to feedback.
> 
> 
>   *   Create new SAKOT class
> SAKOT (Sk) based on UISC = Invisible_Stacker
>   *   Reduced HALANT class
> Now only HALANT (H) based on UISC = Virama
>   *   Updated Standard cluster mode
> 
> [< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)*
> (< < H | Sk > B | SUB > [VS] (CMAbv)* (CMBlw)*)*
> [MPre] [MAbv] [MBlw] [MPst] (VPre)* (VAbv)* (VBlw)* (VPst)*
> (VMPre)* (VMAbv)* (VMBlw)* (VMPst)* (Sk B)* (FAbv)* (FBlw)* (FPst)* [FM]

This comes a lot closer to supporting Tai Tham monosyllabic clusters.

Although this shouldn't affect Tai Tham, some of those medials need to
be made repeatable; I believe this has already been done in HarfBuzz.

I trust you'll be reclassifying U+1A55 TAI THAM CONSONANT SIGN MEDIAL RA
and U+1A56 TAI THAM CONSONANT SIGN MEDIAL LA into the category SUB so
that we can write about bananas forever (ᨠᩖ᩠ᩅ᩠᩶ᨿᨲᩕ᩠ᩃᩬᨯ):

 /kluai/ 'banana'

 /tʰalɔːt/ 'for ever'

The issues here are that WA in a medial rôle is indistinguishable from
a coda ('sakot') consonant and that MEDIAL RA can act as a consonant
aspirator.

Unfortunately, we didn't define a consonant HIGH RATTHA with a
canonical decomposition to .  The problem is that 'HIGH RATTHA', widely seen as an alternative
form of HIGH RATHA, can act as a subscript coda consonant.  There are
also a couple of words in the Northern Thai Dictionary of Palm-Leaf
Manuscripts where MEDIAL LA acts as a coda consonant.  Together,
these call for (Sk B)* to be replaced by ().

This next question does not, I believe, affect HarfBuzz.  Will NFC
code render as well as unnormalised code?  In the first example above,
 normalises to , which
does not match any portion of the regular expression.
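
An illustrative Python demonstration of the effect, with characters of
my own choosing rather than the exact example:

    import unicodedata

    wa, tone2, sakot, low_ya = "\u1A45", "\u1A76", "\u1A60", "\u1A3F"

    s = wa + tone2 + sakot + low_ya  # SAKOT immediately precedes its base
    nfc = unicodedata.normalize("NFC", s)
    print([hex(ord(c)) for c in nfc])
    # ['0x1a45', '0x1a60', '0x1a76', '0x1a3f']: SAKOT (ccc=9) now sorts
    # before TONE-2 (ccc=230), so it is no longer adjacent to the base it
    # subjoins and the (Sk B) term cannot match.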

Richard.



Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-05-09 Thread Richard Wordingham via Unicode
On Thu, 9 May 2019 11:55:23 -0400
Ed Trager via Unicode  wrote:
 
> ** A good use case is the Tai Tham word U+1A27 U+1A6A U+1A60 U+1A37 ,
> transcribed to Central Thai script as จูบ, (*to kiss*). Currently,
> people are writing this as U+1A27 U+1A60 U+1A37 U+1A6A ("จบู") which
> violates the "phonetic ordering" but is the current workaround
> because USE is still broken for TAI THAM.
> 
> REFERENCE DOCUMENT:
> http://www.unicode.org/L2/L2018/18332-tai-tham-ad-hoc-report.pdf

How is this a good test case?  The 6th preliminary recommendation
reads, "To represent a cluster, regardless of the phonetic order CCV or
CVC, a consonant sign should always be encoded before the vowel sign,
unless the vowel sign has inline advance and is apparently followed by
the consonant sign".  If this recommendation is adopted, then the
spelling "U+1A27 U+1A6A U+1A60 U+1A37" will be  wrong.

Now, SIGN U and SIGN UU before subscript BA, HIGH PA and LOW YA aren't
always written as though they followed the subscript consonants in
phonetic order.  Sometimes the vowel sign is written in the bottom left
of the syllable.  Presumably we'll need 3 or 4 new signs:

TAI THAM UNAMBIGUOUS UB

TAI THAM UNAMBIGUOUS UUB

TAI THAM UNAMBIGUOUS UY

TAI THAM UNAMBIGUOUS UUY (?)

I'm not sure that the fourth one can occur.

An example of the contrast is shown in the attached files luynam.png,
with first orthographic syllable , and
yukya.png, with the first orthographic syllable . 

I wonder how we'd be supposed to encode ᩉᩖᩩ᩠᩶ᨿ (currently  'to crawl'?  The simplest
way would be to encode it as , which currently encodes
the unlikely ᩉᩖ᩠ᨿᩩ᩶. Will good fonts be expected to move the vowel left
and down from the subscript LOW YA to the MEDIAL LA?  Or will we need to
encode it with *TAI THAM UNAMBIGUOUS UY?

Richard.


Choice between Identical Tai Tham Characters

2019-05-06 Thread Richard Wordingham via Unicode
What authoritative recommendations or injunctions have been given for
choosing between the encodings  and  for the
subscript character known natively as 'hang ba'?  The choice has no
implication as to glyph shape or the pronunciation of the character, and
the only difference in Unicode-associated properties is a primary
difference in the DUCET default and CLDR root collations.

It is quite conceivable that a prescribed choice may be intended to
distinguish homophonous homographs, e.g. ᩈᩣ᩠ᨷ 'bad smell' v. 'curse',
which are usually spelt differently in Northern Thai in the Thai script
and are spelt differently in Thai (สาบ v. สาป).

This subscript consonant is used in all the languages that regularly use
the script.

I can think of some common sense rules such as, "A Pali writing system
should use only one of U+1A37 and U+1A38", but it's not impossible that
even this has been overridden.

The Khmer script has a similar issue with COENG DA and COENG TA, but
between them they represent two different sounds, and TUS recommends
that the encoding be chosen on the basis of the sound.

Richard.




Re: asking advice of the Unicode community on new character proposal

2019-05-03 Thread Richard Wordingham via Unicode
On Fri, 3 May 2019 11:01:33 +0300
Jack Rueter via Unicode  wrote:

> The additional Latin characters to be proposed include Latin capital
> and small letters C, D, L, S, T and  ɜ with descenders. They also
> include a number of Cyrillic letters, capital and small Ukrainian IE
> (in Komi a hard affricate CHA) and Soft Sign (in Komi a high central
> unrounded vowel), used together with Latin letters.  Could/should
> these (four) be encoded as Latin characters (which would clearly add
> to confusables) or how could the mix of scripts be best handled?

The latter pair may already be encoded as U+0184/U+0185 LATIN
CAPITAL/SMALL LETTER TONE 6, which was once intended to use the glyph of
the Cyrillic soft sign.

The Ukrainian IE used in Latin script is certainly eligible, by the
principle of separation of scripts. The only challenge I can see would
be a claim that it was already encoded as U+0190/U+025B LATIN
CAPITAL/SMALL LETTER OPEN E.

Richard.


