Re: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Richard Wordingham via Unicode
On Sat, 21 Mar 2020 13:33:18 -0600 Doug Ewell via Unicode wrote: > Eli Zaretskii wrote: > > Emacs uses some of that for supporting charsets that cannot be > > mapped into Unicode. GB18030 is one example of such charsets. The > > internal representation of characters in Emacs is UTF-8, so it

Re: Is the binaryness/textness of a data format a property?

2020-03-20 Thread Richard Wordingham via Unicode
On Fri, 20 Mar 2020 13:46:25 +0100 Adam Borowski via Unicode wrote: > On Fri, Mar 20, 2020 at 12:21:26PM +, Costello, Roger L. via > Unicode wrote: > > [Definition] Property: an attribute, quality, or characteristic of > > something. > > > > JPEG is a binary data format. > > CSV is a text

Re: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Richard Wordingham via Unicode
On Fri, 21 Feb 2020 15:53:52 + "Costello, Roger L. via Unicode" wrote: > Based on a private correspondence, I now realize that this statement: > > > > > Text files do not contain binary > > > > is not correct. > > > > Text files may indeed contain binary (i.e., bytes that are not

Re: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread Richard Wordingham via Unicode
On Thu, 13 Feb 2020 20:15:07 + Shawn Steele via Unicode wrote: > I confess that even though I know nothing about Hieroglyphs, that I > find it fascinating that such a thoroughly dead script might still be > living in some way, even if it's only a little bit. Plenty of people have learnt how

Re: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread Richard Wordingham via Unicode
On Thu, 13 Feb 2020 10:18:40 +0100 Hans Åberg via Unicode wrote: > > On 13 Feb 2020, at 00:26, Shawn Steele > > wrote: > >> From the point of view of Unicode, it is simpler: If the character > >> is in use or has had use, it should be included somehow. > > > > That bar, to me, seems too

Re: Combining Marks and Variation Selectors

2020-02-02 Thread Richard Wordingham via Unicode
On Sun, 2 Feb 2020 16:20:07 -0800 Eric Muller via Unicode wrote: > That would imply some coordination among variations sequences on > different code points, right? > > E.g. <0B48> ≡ <0B47, 0B56>, so a variation sequence on 0B56 (Mn, > ccc=0) would imply the existence of a variation sequence on
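The equivalence quoted above can be checked against the Unicode Character Database. A minimal sketch using Python's standard `unicodedata` module; the variable names are illustrative only, and the result depends on the UCD version bundled with the interpreter:

```python
import unicodedata

AI_SIGN = "\u0B48"         # ORIYA VOWEL SIGN AI
E_SIGN = "\u0B47"          # ORIYA VOWEL SIGN E
AI_LENGTH_MARK = "\u0B56"  # ORIYA AI LENGTH MARK

# Canonical decomposition of U+0B48.
decomposed = unicodedata.normalize("NFD", AI_SIGN)

# Properties of the second element of the decomposition.
mark_category = unicodedata.category(AI_LENGTH_MARK)  # general category
mark_ccc = unicodedata.combining(AI_LENGTH_MARK)      # canonical combining class

print([f"U+{ord(c):04X}" for c in decomposed], mark_category, mark_ccc)
```

So a variation sequence on U+0B56 would apply to the last element of the decomposed form of U+0B48, which seems to be the coordination concern raised in the quoted message.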

Re: Combining Marks and Variation Selectors

2020-02-02 Thread Richard Wordingham via Unicode
On Sun, 2 Feb 2020 07:51:56 -0800 Ken Whistler via Unicode wrote: > What it comes down to is avoidance of conundrums involving canonical > reordering for normalization. The effect of variation selectors is > defined in terms of an immediate adjacency. If you allowed variation > selectors to

Re: Combining Marks and Variation Selectors

2020-02-01 Thread Richard Wordingham via Unicode
On Sat, 1 Feb 2020 17:59:57 -0800 Roozbeh Pournader via Unicode wrote: > They are actually allowed on combining marks of ccc=0. We even define > one such variation sequence for Myanmar, IIRC. > > On Sat, Feb 1, 2020, 2:12 PM Richard Wordingham via Unicode < > unicode@

Combining Marks and Variation Selectors

2020-02-01 Thread Richard Wordingham via Unicode
Why are variation selectors not allowed for combining marks? I can see a reason for them not being allowed on characters with non-zero canonical combining classes, but not for them being prohibited for combining marks that are starters, i.e. have ccc=0. Richard.
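The distinction the question draws, between combining marks in general and combining marks that are starters (ccc=0), can be sketched as follows. The helper names are hypothetical, and the example only classifies characters; it does not say which variation sequences are actually permitted:

```python
import unicodedata

def is_combining_mark(ch: str) -> bool:
    """True for general categories Mn, Mc and Me."""
    return unicodedata.category(ch).startswith("M")

def is_starter(ch: str) -> bool:
    """True when the canonical combining class is 0."""
    return unicodedata.combining(ch) == 0

def is_starter_mark(ch: str) -> bool:
    """A combining mark that canonical reordering never moves."""
    return is_combining_mark(ch) and is_starter(ch)

for ch in ("\u0B56", "\u0301", "a"):
    print(f"U+{ord(ch):04X}", is_combining_mark(ch), is_starter(ch))
```

U+0301 COMBINING ACUTE ACCENT has a non-zero combining class and can be reordered during normalization, whereas a mark such as U+0B56 cannot.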

Adding Experimental Control Characters for Tai Tham

2020-01-25 Thread Richard Wordingham via Unicode
This topic is very similar to the recent topic "How to make custom combining diacritical marks for arabic letters?". There is a suggestion that the encoding of Tai Tham syllables be changed (https://www.unicode.org/L2/L2019/19365-tai-tham-structure.pdf, by Martin Hosken), and there is a strong

Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2020-01-04 Thread Richard Wordingham via Unicode
On Sat, 4 Jan 2020 22:15:59 + James Kass via Unicode wrote: > For the Grantha examples above, Grantha (1) displays much better > here. It seems daft to put a spacing character between a base > character and any mark which is supposed to combine with the base > character. Although it's not

Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2020-01-04 Thread Richard Wordingham via Unicode
On Thu, 2 Jan 2020 20:20:34 + Richard Wordingham via Unicode wrote: > There's a project whose basis I can't find to convert Indian Indic > rendering at least to use the USE. Now, according to the > specification of the USE, visarga, anusvara and cantillation marks > are al

Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2020-01-02 Thread Richard Wordingham via Unicode
On Thu, 2 Jan 2020 15:07:04 -0800 Norbert Lindenberg wrote: >> On Jan 2, 2020, at 12:20, Richard Wordingham via Unicode >> wrote: >> So, the problem should already be solved for Grantha, and, >> if the plans come to fruition, will work with a font whose >>

Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2020-01-02 Thread Richard Wordingham via Unicode
On Thu, 2 Jan 2020 07:52:55 + James Kass via Unicode wrote: > > I've been looking at Microsoft's specification of Devanagari > > character order.  In > > > https://docs.microsoft.com/en-us/typography/script-development/devanagari, > > the consonant syllable ends > > > > [N]+[A] + [<

Re: One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)

2020-01-01 Thread Richard Wordingham via Unicode
On Wed, 1 Jan 2020 20:11:04 + James Kass via Unicode wrote: > On 2020-01-01 11:17 AM, Richard Wordingham via Unicode wrote: > > > That's exactly the sort of mess that jack-booted renderers are > > trying to minimise.  Their principle is that there should be only >

Re: One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)

2020-01-01 Thread Richard Wordingham via Unicode
On Wed, 1 Jan 2020 23:09:49 + James Kass via Unicode wrote: > On 2020-01-01 8:11 PM, James Kass wrote: > > It’s too bad that ISCII didn’t accommodate the needs of Vedic > > Sanskrit, but here we are. > > Sorry, that might be wrong to say.  It's possible that it's Unicode's > adaptation of

Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2020-01-01 Thread Richard Wordingham via Unicode
On Wed, 1 Jan 2020 01:19:02 + James Kass via Unicode wrote: > A workaround until some kind of satisfactory adjustment is made might > be to simply use COLON for VISARGA.  Or... > >  VISARGA ⇒ U+02F8 MODIFIER LETTER RAISED COLON > ANUSVARA⇒U+02D9 DOT ABOVE > > ...as long as the font(s)

Re: NBSP supposed to stretch, right?

2019-12-20 Thread Richard Wordingham via Unicode
On Fri, 20 Dec 2019 17:25:17 +0530 Shriramana Sharma via Unicode wrote: > So I never asked for NBSP to disappear. I said I want it to *stretch*. > And to my mind "stretch" means to become wider than one's normal > width. It doesn't include decreasing or disappearing width. Don't spaces

Re: NBSP supposed to stretch, right?

2019-12-17 Thread Richard Wordingham via Unicode
On Tue, 17 Dec 2019 06:20:39 +0530 Shriramana Sharma via Unicode wrote: > Hello. I've just tested LibreOffice, Google Docs and MS Office on > Linux, Android and Windows, and it seems that NBSP doesn't get > stretched like the normal space character when justified alignment > requires it. > >

Re: Proposal to add Roman transliteration schemes to ISO 15924.

2019-12-03 Thread Richard Wordingham via Unicode
On Tue, 3 Dec 2019 17:35:14 +0530 विश्वासो वासुकिजः (Vishvas Vasuki) via Unicode wrote: > On Tue, Dec 3, 2019 at 3:48 PM Richard Wordingham via Unicode < > unicode@unicode.org> wrote: > > On Tue, 3 Dec 2019 02:05:35 + > > Richard Wordingham via Unicode wrote:

Re: Proposal to add Roman transliteration schemes to ISO 15924.

2019-12-03 Thread Richard Wordingham via Unicode
I think the 'Latn' in sa-Latn-t-sa-m0-iast is unnecessary, though it partly depends on the range of the IAST transform. If the transformation can only convert to the Roman script then 'Latn' is superfluous; I'm not sure if the extension is formally enough to rule out Devanagari. On the other
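To make the structure of the tag under discussion easier to see, here is a rough sketch that splits a language tag at its `t` extension. The function name `split_t_extension` is hypothetical, and the parsing is deliberately simplified: it assumes the `t` extension is the last extension in the tag and does not validate subtags against BCP 47 or the registry.

```python
import re

def split_t_extension(tag: str):
    """Split a tag into (base subtags, source-language subtags, fields).

    Simplified: anything after the 't' singleton is treated as part of
    the transform extension.
    """
    subtags = tag.lower().split("-")
    if "t" not in subtags:
        return subtags, [], {}
    i = subtags.index("t")
    base, ext = subtags[:i], subtags[i + 1:]
    source, fields, key = [], {}, None
    for subtag in ext:
        if re.fullmatch(r"[a-z][0-9]", subtag):  # field key such as 'm0'
            key = subtag
            fields[key] = []
        elif key is None:
            source.append(subtag)                # part of the source language
        else:
            fields[key].append(subtag)           # value of the current field
    return base, source, fields

print(split_t_extension("sa-Latn-t-sa-m0-iast"))
print(split_t_extension("sa-t-m0-iast"))
```

In the first tag the script subtag sits in the base part, which is the piece the message suggests may be superfluous.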

Re: Proposal to add Roman transliteration schemes to ISO 15924.

2019-12-03 Thread Richard Wordingham via Unicode
On Tue, 3 Dec 2019 02:05:35 + Richard Wordingham via Unicode wrote: > I'm still trying to work out what to do for IAST. Is it just: > > sa-t-m0-iast > > if one finds that > > sa-Latn > > allows too much latitude? For material that is a transcription ra

Re: Proposal to add Roman transliteration schemes to ISO 15924.

2019-12-02 Thread Richard Wordingham via Unicode
On Tue, 3 Dec 2019 01:27:39 + Richard Wordingham wrote: > On Mon, 2 Dec 2019 09:09:02 -0800 > Markus Scherer via Unicode wrote: > > > On Mon, Dec 2, 2019 at 8:42 AM Roozbeh Pournader via Unicode < > > unicode@unicode.org> wrote: > >

Re: Proposal to add Roman transliteration schemes to ISO 15924.

2019-12-02 Thread Richard Wordingham via Unicode
On Mon, 2 Dec 2019 09:09:02 -0800 Markus Scherer via Unicode wrote: > On Mon, Dec 2, 2019 at 8:42 AM Roozbeh Pournader via Unicode < > unicode@unicode.org> wrote: > > > You don't need an ISO 15924 script code. You need to think in terms > > of BCP 47. Sanskrit in Latin would be sa-Latn. > >

Re: A neat description of encoding characters

2019-12-02 Thread Richard Wordingham via Unicode
On Mon, 2 Dec 2019 12:01:52 + "Costello, Roger L. via Unicode" wrote: > From the book titled "Computer Power and Human Reason" by Joseph > Weizenbaum, p.74-75 > > Suppose that the alphabet with which we wish to concern ourselves > consists of 256 distinct symbols... Why should I wish to

Re: Is the Unicode Standard "The foundation for all modern software and communications around the world"?

2019-11-20 Thread Richard Wordingham via Unicode
On Tue, 19 Nov 2019 20:02:55 + James Kass via Unicode wrote: > On 2019-11-19 6:59 PM, Costello, Roger L. via Unicode wrote: > > Today I received an email from the Unicode organization. The email > > said this: (italics and yellow highlighting are mine) > > > > The Unicode Standard is the

Re: Grapheme clusters & backspace (was: Unicode Digest, Vol 70, Issue 17)

2019-10-23 Thread Richard Wordingham via Unicode
On Wed, 23 Oct 2019 02:31:09 + Ben Morphett via Unicode wrote: > It totally depends on the editor. In Notepad++, when I backspace > over "Man Teacher: Dark Skin Tone", I get "Man Teacher: Dark Skin > Tone" => ""Man: Dark Skin Tone" => gone. In MS Word 2016 on Windows 10, I get an

Re: Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji)

2019-10-23 Thread Richard Wordingham via Unicode
On Tue, 22 Oct 2019 23:15:57 + Martin J. Dürst via Unicode wrote: > I think this to some extent is a question of "reality in the users' > minds". But to a very large extent, this is an issue of muscle > memory. If a user works with a keyboard/input method that deletes a > whole combination,

Re: Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji)

2019-10-22 Thread Richard Wordingham via Unicode
On Tue, 22 Oct 2019 23:27:27 +0200 Daniel Bünzli via Unicode wrote: > Thanks for your answer. > > > The compromise that has generally been reached is that 'delete' > > deletes a grapheme cluster and 'backspace' deletes a scalar value. > > (There are good editors like Emacs that delete only a
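The compromise described in the quoted text can be sketched roughly as below. Both function names are hypothetical, and the cluster detection is a crude approximation (one character plus any following characters of general category M*), not the extended grapheme cluster rules of UAX #29; it would not handle emoji ZWJ sequences, for instance.

```python
import unicodedata

def backspace(text: str, pos: int) -> str:
    """Delete the single scalar value before the cursor position."""
    if pos <= 0:
        return text
    return text[:pos - 1] + text[pos:]

def delete_forward(text: str, pos: int) -> str:
    """Delete one approximate cluster after the cursor: a character
    plus any combining marks that follow it."""
    if pos >= len(text):
        return text
    end = pos + 1
    while end < len(text) and unicodedata.category(text[end]).startswith("M"):
        end += 1
    return text[:pos] + text[end:]

s = "e\u0301x"                # 'e' + COMBINING ACUTE ACCENT + 'x'
print(backspace(s, 2))        # removes only the accent
print(delete_forward(s, 0))   # removes 'e' together with its accent
```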

Re: Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji)

2019-10-22 Thread Richard Wordingham via Unicode
On Tue, 22 Oct 2019 11:04:01 +0200 Daniel Bünzli via Unicode wrote: > On 22 October 2019 at 09:37:22, Richard Wordingham via Unicode > (unicode@unicode.org) wrote: > > > When it comes to the second sentence of the text of Slide 7 > > 'Grapheme Clusters', my overwhe

Re: Annoyances from Implementation of Canonical Equivalence

2019-10-18 Thread Richard Wordingham via Unicode
On Fri, 18 Oct 2019 09:45:14 +0300 Eli Zaretskii via Unicode wrote: > > Date: Thu, 17 Oct 2019 21:58:50 +0100 > > From: Richard Wordingham via Unicode > > > > > Sounds arbitrary to me. How do we know that all the users will > > > want that?

Re: Collation Grapheme Clusters and Canonical Equivalence

2019-10-18 Thread Richard Wordingham via Unicode
On Thu, 17 Oct 2019 23:11:55 +0100 Richard Wordingham via Unicode wrote: > There seems to be a Unicode non-compliance (C6) issue in the > definition of collation grapheme clusters (defined in UTS#10 Section > 9.9). Using the DUCET collation, the canonically equivalent strings > รู้

Collation Grapheme Clusters and Canonical Equivalence

2019-10-17 Thread Richard Wordingham via Unicode
There seems to be a Unicode non-compliance (C6) issue in the definition of collation grapheme clusters (defined in UTS#10 Section 9.9). Using the DUCET collation, the canonically equivalent strings รู้ and รัู decompose into collation grapheme clusters in two different ways. The first
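The canonical equivalence of the two Thai strings can be checked with Python's `unicodedata`. This sketch assumes the two strings are RO RUA followed by SARA UU and MAI THO in either order; the code points are written out explicitly because the rendered text does not show them unambiguously. It demonstrates only the equivalence, not the collation grapheme cluster segmentation.

```python
import unicodedata

RO_RUA = "\u0E23"
SARA_UU = "\u0E39"   # vowel below
MAI_THO = "\u0E49"   # tone mark

vowel_first = RO_RUA + SARA_UU + MAI_THO
tone_first = RO_RUA + MAI_THO + SARA_UU

# Both marks have non-zero, distinct combining classes, so canonical
# reordering puts them into a single fixed order.
ccc_vowel = unicodedata.combining(SARA_UU)
ccc_tone = unicodedata.combining(MAI_THO)

same_nfd = (unicodedata.normalize("NFD", vowel_first)
            == unicodedata.normalize("NFD", tone_first))
print(ccc_vowel, ccc_tone, same_nfd)
```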

Re: Annoyances from Implementation of Canonical Equivalence

2019-10-17 Thread Richard Wordingham via Unicode
On Thu, 17 Oct 2019 10:42:19 +0300 Eli Zaretskii via Unicode wrote: > > Date: Thu, 17 Oct 2019 02:26:35 +0100 > > From: Richard Wordingham > > Cc: Eli Zaretskii > > > > (c) A search for 'n' finding 'ñ'. > > > > When it comes to canonical equivale

Re: Annoyances from Implementation of Canonical Equivalence

2019-10-16 Thread Richard Wordingham via Unicode
On Wed, 16 Oct 2019 09:33:38 +0300 Eli Zaretskii via Unicode wrote: > > These are complaints about primary-level searches, not canonical > > equivalence. > > Not sure what you call primary-level searches, but if you deduced the > complaints were only about searches for base characters, then

Re: Annoyances from Implementation of Canonical Equivalence (was: Pure Regular Expression Engines and Literal Clusters)

2019-10-15 Thread Richard Wordingham via Unicode
On Tue, 15 Oct 2019 09:43:23 +0300 Eli Zaretskii via Unicode wrote: > > Date: Tue, 15 Oct 2019 00:23:59 +0100 > > From: Richard Wordingham via Unicode > > > > > I'm well aware of the official position. However, when we > > > attempted to implement it u

Annoyances from Implementation of Canonical Equivalence (was: Pure Regular Expression Engines and Literal Clusters)

2019-10-14 Thread Richard Wordingham via Unicode
On Mon, 14 Oct 2019 21:41:19 +0300 Eli Zaretskii via Unicode wrote: > > Date: Mon, 14 Oct 2019 19:29:39 +0100 > > From: Richard Wordingham via Unicode > > The official position is that text that is canonically > > equivalent is the same. There are problem areas whe

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-14 Thread Richard Wordingham via Unicode
On Mon, 14 Oct 2019 10:05:49 +0300 Eli Zaretskii via Unicode wrote: > > Date: Mon, 14 Oct 2019 01:10:45 +0100 > > From: Richard Wordingham via Unicode > > They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*, > > and were expecting normalisatio

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-14 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 21:28:34 -0700 Mark Davis ☕️ via Unicode wrote: > The problem is that most regex engines are not written to handle some > "interesting" features of canonical equivalence, like discontinuity. > Suppose that X is canonically equivalent to AB. > >- A query /X/ can match the
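One way to picture the discontinuity problem is with plain substring search. This is only a sketch under the assumption that "discontinuity" refers to a match whose parts are separated by another combining mark; the example strings are illustrative and not taken from the thread.

```python
import unicodedata

def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

query = "\u00F1"                  # precomposed n with tilde
text_contiguous = "man\u0303ana"  # 'n' + COMBINING TILDE, decomposed
text_interrupted = "n\u0323\u0303"  # 'n' + DOT BELOW + TILDE

# Code point search misses canonically equivalent text.
naive_hit = query in text_contiguous

# Normalizing both sides fixes the contiguous case.
normalized_hit = nfd(query) in nfd(text_contiguous)

# The dot below (ccc=220) sorts before the tilde (ccc=230), so the
# parts of the query are no longer adjacent even after normalization.
interrupted_hit = nfd(query) in nfd(text_interrupted)

print(naive_hit, normalized_hit, interrupted_hit)
```

Whether the last case ought to count as a match is exactly the kind of decision a regex engine working on code points cannot make by normalization alone.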

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-14 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 20:25:25 -0700 Asmus Freytag via Unicode wrote: > On 10/13/2019 6:38 PM, Richard Wordingham via Unicode wrote: > On Sun, 13 Oct 2019 17:13:28 -0700 >> Yes. There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so >> [:Lu:] should not match > COMBIN

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 17:13:28 -0700 Asmus Freytag via Unicode wrote: > On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote: > Besides invalidating complexity metrics, the issue was what \p{Lu} > should match. For example, with PCRE syntax, GNU grep Version 2.25 > \p{Lu} ma

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Mon, 14 Oct 2019 00:22:36 +0200 Hans Åberg via Unicode wrote: > > On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode > > wrote: >> Besides invalidating complexity metrics, the issue was what \p{Lu} >> should match. For example, with PCRE syntax, GNU gr

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 22:14:10 +0200 Hans Åberg via Unicode wrote: > > On 13 Oct 2019, at 21:17, Richard Wordingham via Unicode > > wrote: > > Incidentally, at least some of the sizes and timings I gave seem to > > be wrong even for strings. They won't work with

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 15:29:04 +0200 Hans Åberg via Unicode wrote: > > On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode > > I'm now beginning to wonder what you are claiming. > I start with a NFA with no empty transitions and apply the subset DFA > construction dynam

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 10:04:34 +0200 Hans Åberg via Unicode wrote: > > On 13 Oct 2019, at 00:37, Richard Wordingham via Unicode > > wrote: > > > > On Sat, 12 Oct 2019 21:36:45 +0200 > > Hans Åberg via Unicode wrote: > > > >>> On 12 Oc

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-12 Thread Richard Wordingham via Unicode
On Sat, 12 Oct 2019 21:36:45 +0200 Hans Åberg via Unicode wrote: > > On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode > > wrote: > > > > But remember that 'having longer first' is meaningless for a > > non-deterministic finite automaton that does a sing

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-12 Thread Richard Wordingham via Unicode
On Fri, 11 Oct 2019 12:39:56 +0200 Elizabeth Mattijsen via Unicode wrote: > Furthermore, Perl 6 uses Normalization Form Grapheme for matching: > https://docs.perl6.org/type/Cool#index-entry-Grapheme This approach does address the issue Mark Davis mentioned about regex engines working at

Re: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode?

2019-10-12 Thread Richard Wordingham via Unicode
On Sat, 12 Oct 2019 18:15:38 +0800 Fred Brennan via Unicode wrote: > Indeed - it is extremely unfortunate that users will need to wait > until 2021(!) to get it into Unicode so Google will finally add it to > the Noto fonts. > If that's just how things are done, fine, I certainly can't change >

Re: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode?

2019-10-12 Thread Richard Wordingham via Unicode
On Sat, 12 Oct 2019 18:15:38 +0800 Fred Brennan via Unicode wrote: > Indeed - it is extremely unfortunate that users will need to wait > until 2021(!) to get it into Unicode so Google will finally add it to > the Noto fonts. > There seems to be no conscionable reason for such a long delay after

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-12 Thread Richard Wordingham via Unicode
On Fri, 11 Oct 2019 18:37:18 -0700 Mark Davis ☕️ via Unicode wrote: > > > > You claimed the order of alternatives mattered. That is an > > important issue for anyone rash enough to think that the standard > > is fit to be used as a specification. > > > > Regex engines differ in how they

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-11 Thread Richard Wordingham via Unicode
On Fri, 11 Oct 2019 14:35:33 -0700 Markus Scherer via Unicode wrote: > > > [c \q{ch}]h should work like (ch|c)h. Note that the order matters > > > in the alternation -- so this works equivalently if longer > > > strings are sorted first. > > Does conformance to UTS#18 level 2 mandate the

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-11 Thread Richard Wordingham via Unicode
On Thu, 10 Oct 2019 15:23:00 -0700 Markus Scherer via Unicode wrote: > [c \q{ch}]h should work like (ch|c)h. Note that the order matters in > the alternation -- so this works equivalently if longer strings are > sorted first. Thanks for answering the question. Does conformance to UTS#18 level
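The remark that order matters in the alternation can be observed in a backtracking engine. Python's `re` module does not support the `\q{...}` notation, so this sketch uses the rewritten alternations directly; it illustrates leftmost-first alternation in one particular engine, not the behaviour UTS#18 requires.

```python
import re

longer_first = re.compile(r"(ch|c)h")
shorter_first = re.compile(r"(c|ch)h")

# On "chh" the two orders give different matches.
m_long = longer_first.match("chh").group()    # 'ch' then 'h'
m_short = shorter_first.match("chh").group()  # 'c' then 'h'

# On "ch" the longer-first pattern backtracks to the shorter branch.
m_backtrack = longer_first.match("ch").group()

print(m_long, m_short, m_backtrack)
```

An engine that simulates a non-deterministic finite automaton and reports all matches has no notion of which alternative was tried first, which is the difficulty raised elsewhere in the thread.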

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-11 Thread Richard Wordingham via Unicode
On Fri, 11 Oct 2019 12:39:56 +0200 Elizabeth Mattijsen via Unicode wrote: > Furthermore, Perl 6 uses Normalization Form Grapheme for matching: > https://docs.perl6.org/type/Cool#index-entry-Grapheme I seriously doubt that a Thai considers each combination of consonant (44), non-spacing

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-10 Thread Richard Wordingham via Unicode
On Tue, 8 Oct 2019 15:25:34 +0100 Richard Wordingham via Unicode wrote: > An example UTS#18 gives for matching a literal cluster can be > simplified to, in its notation: > > [c \q{ch}] > > This is interpreted as 'match against "ch" if possible, otherwise > aga

Pure Regular Expression Engines and Literal Clusters

2019-10-08 Thread Richard Wordingham via Unicode
I've been puzzling over how a pure regular expression engine that works via a non-deterministic finite automaton can be bent to accommodate 'literal clusters' as in Requirement RL2.2 'Extended Grapheme Clusters' of UTS#18 'Unicode Regular Expressions' - "To meet this requirement, an implementation

Re: Manipuri/Meitei customary writing system

2019-10-04 Thread Richard Wordingham via Unicode
On Fri, 4 Oct 2019 07:12:59 + Martin J. Dürst via Unicode wrote: > On 2019/10/04 15:35, Martin J. Dürst via Unicode wrote: > > Hello Markus, > > > > On 2019/10/04 01:53, Markus Scherer via Unicode wrote: > >> Dear Unicoders, > >> > >> Is Manipuri/Meitei customarily written in

Re: On the lack of a SQUARE TB glyph

2019-09-30 Thread Richard Wordingham via Unicode
On Mon, 30 Sep 2019 01:32:02 -0700 Asmus Freytag via Unicode wrote: > On 9/30/2019 1:01 AM, Andre Schappo via Unicode wrote: > > On Sep 27, 1 Reiwa, at 08:17, Julian Bradfield via Unicode > wrote: > > Or one could allow IDS to have leaf components that are any > characters, not just

Re: Proposing mostly invisible characters

2019-09-13 Thread Richard Wordingham via Unicode
On Fri, 13 Sep 2019 08:56:02 +0300 Henri Sivonen via Unicode wrote: > On Thu, Sep 12, 2019, 15:53 Christoph Päper via Unicode > wrote: > > > ISHY/SIHY is especially useful for encoding (German) noun compounds > > in wrapped titles, e.g. on product labeling, where hyphens are often > >

Re: Proposing mostly invisible characters

2019-09-12 Thread Richard Wordingham via Unicode
On Thu, 12 Sep 2019 14:53:45 +0200 (CEST) Christoph Päper via Unicode wrote: > Dear Unicoders > > There are some characters that have no precedent in existing > encodings and are also hard to attest directly from printed sources. > Can one still make a solid case for encoding those in Unicode?

Re: LDML Keyboard Descriptions and Normalisation

2019-09-10 Thread Richard Wordingham via Unicode
On Sat, 7 Sep 2019 20:41:34 +0100 Richard Wordingham via Unicode wrote: > I don't think the model will run with Python Version 2.7. I was wrong. It does run under Version 2.7. Richard.

Re: LDML Keyboard Descriptions and Normalisation

2019-09-07 Thread Richard Wordingham via Unicode
On Sat, 7 Sep 2019 20:02:09 +0100 Cibu via Unicode wrote: > Slightly off topic: Is there a CLDR tool to try out transformations > specified in a keyboard spec? No CLDR tool, or so far as I am aware, CLDR-endorsed tool. Martin Hosken has put together a reference model in Python at

Re: LDML Keyboard Descriptions and Normalisation

2019-09-07 Thread Richard Wordingham via Unicode
On Tue, 3 Sep 2019 18:03:18 + Andrew Glass via Unicode wrote: > Hi Richard, > > This is a good point. A keyboard that is doing transforms should > specify which type of normalization it has been designed to do. I've > filed a ticket to track this. The ticket is

LDML Keyboard Descriptions and Normalisation

2019-09-02 Thread Richard Wordingham via Unicode
I'm getting conflicting indications about how the LDML keyboard description handles issues of canonical equivalence. I have one simple question which some people may be able to answer. Is the keyboard specification intended to distinguish between keyboards that generally output: (a) NFC text;

Re: The native name of Tai Viet script and language(s)

2019-08-27 Thread Richard Wordingham via Unicode
On Tue, 27 Aug 2019 04:56:35 + Peter Constable via Unicode wrote: > The script _is_ related to Thai script, but I’m not sure I would say > it has “the same origin as that of Thai language/script used in > Thailand”, as that is too simplistic a view of the historic > connections: it suggests

Re: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar

2019-08-21 Thread Richard Wordingham via Unicode
On Tue, 20 Aug 2019 22:43:43 + Andrew Glass via Unicode wrote: > The order of medials in Myanmar clusters is constrained by UTN > #11. So yes, you do need to > follow the preferred order for Myanmar even if the sequence does not > match phonetic order. If

Re: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar

2019-08-21 Thread Richard Wordingham via Unicode
On Wed, 21 Aug 2019 02:47:28 + James Kass via Unicode wrote: > > Are we allowed to write Llangollen as the definition of the > > Unicode Collation Algorithm implies we should, with an invisible CGJ > > between the 'n' and the 'g', so that it will collate correctly in > > Welsh?  That CGJ

Re: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar

2019-08-21 Thread Richard Wordingham via Unicode
On Wed, 21 Aug 2019 02:40:09 + James Kass via Unicode wrote: > On 2019-08-21 2:08 AM, Richard Wordingham via Unicode wrote: > > Are we allowed to write Llangollen as the definition of the > > Unicode Collation Algorithm implies we should, with an invisible CGJ >

Re: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar

2019-08-20 Thread Richard Wordingham via Unicode
On Tue, 20 Aug 2019 22:43:43 + Andrew Glass via Unicode wrote: > The order of medials in Myanmar clusters is constrained by UTN > #11. So yes, you do need to > follow the preferred order for Myanmar even if the sequence does not > match phonetic order.

Re: PUA (BMP) planned characters HTML tables

2019-08-15 Thread Richard Wordingham via Unicode
On Wed, 14 Aug 2019 23:32:37 + James Kass via Unicode wrote: > U+0149 has a compatibility decomposition.  It has been deprecated and > is not rendered identically on my system. > 'n ʼn > ( ’n ) Compatibility decompositions are quite a mix, but are generally expected to render differently.
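The difference between a compatibility decomposition and a canonical one can be seen for U+0149 with Python's `unicodedata`. A small sketch; the variable names are illustrative:

```python
import unicodedata

N_APOSTROPHE = "\u0149"  # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE

# The decomposition field carries a <compat> tag, so it is not canonical.
mapping = unicodedata.decomposition(N_APOSTROPHE)

# Canonical normalization leaves the character alone ...
nfd_form = unicodedata.normalize("NFD", N_APOSTROPHE)

# ... while compatibility normalization replaces it with U+02BC + 'n'.
nfkd_form = unicodedata.normalize("NFKD", N_APOSTROPHE)

print(mapping, nfd_form == N_APOSTROPHE, [hex(ord(c)) for c in nfkd_form])
```

Because only the compatibility forms change the text, there is no requirement that the two spellings render identically.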

Re: PUA (BMP) planned characters HTML tables

2019-08-14 Thread Richard Wordingham via Unicode
On Wed, 14 Aug 2019 09:05:02 + James Kass via Unicode wrote: > The solution is to deprecate "LATIN LOWER CASE I WITH HEART".  It's > only in there because of legacy.  Its presence guarantees > round-tripping with legacy data but it isn't needed for modern data > or display.  Urge Groups One

Re: PUA (BMP) planned characters HTML tables

2019-08-12 Thread Richard Wordingham via Unicode
On Mon, 12 Aug 2019 01:21:42 + James Kass via Unicode wrote: > There was a time when populating the PUA with precomposed glyphs was > necessary for printing or display, but that time has passed. There is still the issue that in pure X one can't put sequences of characters on a key; if the

Re: PUA (BMP) planned characters HTML tables

2019-08-10 Thread Richard Wordingham via Unicode
On Sun, 11 Aug 2019 00:07:05 -0400 Robert Wheelock via Unicode wrote: > I remember that a website that has tables for certain PUA precomposed > accented characters that aren’t yet in Unicode (things like: > Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital > H-underbar, acute

Re: Fonts and Canonical Equivalence

2019-08-10 Thread Richard Wordingham via Unicode
On Sat, 10 Aug 2019 16:37:48 +0100 Andrew West via Unicode wrote: > On Sat, 10 Aug 2019 at 15:46, Richard Wordingham via Unicode > wrote: > > Does vowel above before vowel below yield a dotted circle? > > Yes. Attached are screenshots for two real world examples, one wh

Re: Fonts and Canonical Equivalence

2019-08-10 Thread Richard Wordingham via Unicode
On Sat, 10 Aug 2019 11:22:01 +0100 Andrew West via Unicode wrote: > On Sat, 10 Aug 2019 at 08:29, Richard Wordingham via Unicode > wrote: > > > > There are similar issues with Tibetan; some fonts do not work > > properly if a vowel below (ccc=132) is separated from the b
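The Tibetan reordering at issue can be reproduced with Python's `unicodedata`. This sketch assumes U+0F74 as the vowel below and U+0F72 as the vowel above, chosen for illustration; it shows only what normalization does to the character order, not how any font or shaper renders the result.

```python
import unicodedata

KA = "\u0F40"           # TIBETAN LETTER KA
VOWEL_BELOW = "\u0F74"  # TIBETAN VOWEL SIGN U
VOWEL_ABOVE = "\u0F72"  # TIBETAN VOWEL SIGN I

typed = KA + VOWEL_BELOW + VOWEL_ABOVE

ccc_below = unicodedata.combining(VOWEL_BELOW)
ccc_above = unicodedata.combining(VOWEL_ABOVE)

# Canonical ordering sorts marks by combining class, so the vowel
# above ends up between the base and the vowel below.
normalized = unicodedata.normalize("NFC", typed)

print(ccc_below, ccc_above, [f"U+{ord(c):04X}" for c in normalized])
```

A renderer that expects the vowel below to sit next to the base therefore sees a different, though canonically equivalent, sequence after normalization.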

Fonts and Canonical Equivalence

2019-08-10 Thread Richard Wordingham via Unicode
I've spun this question off from the issue of what the USE is to do when confronted with the NFC canonical equivalent of a string it will accept when this equivalent does not match its regular expressions when they are applied to strings of characters rather than canonical equivalence classes of

Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-08 Thread Richard Wordingham via Unicode
On Thu, 8 Aug 2019 00:33:47 + Andrew Glass via Unicode wrote: > I agree and understand that accurate representation is important in > this case. It would be good to understand how widespread the issue is > in order to begin to justify the work to retrofit shaping with > normalization. The

Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-08 Thread Richard Wordingham via Unicode
On Wed, 7 Aug 2019 14:19:26 -0700 Asmus Freytag via Unicode wrote: > What about text that must exist normalized for other purposes? > > Domain names must be normalized to NFC, for example. Will such > strings display correctly if passed to USE? One solution, of course, is to minimise the use

Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-07 Thread Richard Wordingham via Unicode
On Tue, 14 May 2019 03:08:04 +0100 Richard Wordingham via Unicode wrote: > On Tue, 14 May 2019 00:58:07 + > Andrew Glass via Unicode wrote: > > > Here is the essence of the initial changes needed to support CV+C. > > Open to feedback. > > > > > >

Re: Akkha script (used by Eastern Magar language) in ISO 15924?

2019-07-23 Thread Richard Wordingham via Unicode
On Mon, 22 Jul 2019 17:42:37 -0700 Anshuman Pandey via Unicode wrote: > As I pointed out in L2/11-144, the “Magar Akkha” script is an > appropriation of Brahmi, renamed to link it to the primordialist > daydreams of an ethno-linguistic community in Nepal. I have never > seen actual usage of the

Re: Displaying Lines of Text as Line-Broken by a Human

2019-07-22 Thread Richard Wordingham via Unicode
On Sun, 21 Jul 2019 20:53:19 -0700 Asmus Freytag via Unicode wrote: > There's really no inherent need for many spacing combining marks to > have a base character. At least the ones that do not reorder and that > don't overhang the base character's glyph. We are in agreement here. > As far as I

Displaying Lines of Text as Line-Broken by a Human

2019-07-21 Thread Richard Wordingham via Unicode
I've been transcribing some Pali text written on palm leaf in the Tai Tham script. I'm looking for a way of reflecting the line boundaries in a manuscript in a transcription. The problem is that lines sometimes start or end with an isolated spacing mark. I want my text to be searchable and

Breaking lines at Grapheme Boundaries

2019-07-19 Thread Richard Wordingham via Unicode
If a renderer claims to support a writing system, should it render the text reasonably if its client breaks lines at extended grapheme cluster boundaries? The writing system itself has no compunction about breaking lines between legacy grapheme clusters, though I've no idea how I should get a

Re: ISO 15924 : missing indication of support for Syriac variants

2019-07-18 Thread Richard Wordingham via Unicode
On Wed, 17 Jul 2019 21:01:30 -0700 Asmus Freytag via Unicode wrote: > On 7/17/2019 6:03 PM, Richard Wordingham via Unicode wrote: >> A significant issue is that the hieratic script is right to left but >> Unicode only standardises the encoding of left-to-right >> transcript

Re: ISO 15924 : missing indication of support for Syriac variants

2019-07-17 Thread Richard Wordingham via Unicode
On Thu, 18 Jul 2019 01:54:52 +0200 Philippe Verdy via Unicode wrote: > In fact the ligatures system for the "cursive" Egyptian Hieratic is so > complex (and may also have its own variants showing its progression > from Hieroglyphs to Demotic or Old Coptic), that probably Hieratic > should no

Re: Unicode "no-op" Character?

2019-07-03 Thread Richard Wordingham via Unicode
On Wed, 3 Jul 2019 17:51:29 -0400 "Mark E. Shoulson via Unicode" wrote: > I think the idea being considered at the outset was not so complex as > these (and indeed, the point of the character was to avoid making > these kinds of decisions). Shawn Steele appeared to be claiming that there was

Re: Unicode "no-op" Character?

2019-06-23 Thread Richard Wordingham via Unicode
On Sat, 22 Jun 2019 21:10:08 -0400 Sławomir Osipiuk via Unicode wrote: > In fact, that might be the best description: It's not just an > "ignorable", it's a "discardable". Unicode doesn't have that, does it? No, though the byte order mark at the start of a file comes close. Discardables are a
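
As a rough illustration of the byte order mark behaving almost like a "discardable": a minimal sketch using Python's standard codecs, where the sample text and variable names are made up for the example. The plain UTF-8 decoder keeps the leading U+FEFF, while the `utf-8-sig` decoder drops it.

```python
import codecs

# Hypothetical file contents: a UTF-8 BOM followed by ordinary text.
raw = codecs.BOM_UTF8 + "text".encode("utf-8")

# The plain UTF-8 decoder treats U+FEFF as an ordinary character and keeps it.
decoded_plain = raw.decode("utf-8")

# The "utf-8-sig" decoder discards a single leading BOM, if one is present.
decoded_sig = raw.decode("utf-8-sig")

print(repr(decoded_plain))
print(repr(decoded_sig))
```

Only a BOM at the very start is dropped this way; a U+FEFF elsewhere in the data is kept by both decoders.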

Re: Unicode "no-op" Character?

2019-06-23 Thread Richard Wordingham via Unicode
On Sat, 22 Jun 2019 23:56:50 + Shawn Steele via Unicode wrote: > + the list. For some reason the list's reply header is confusing. > > From: Shawn Steele > Sent: Saturday, June 22, 2019 4:55 PM > To: Sławomir Osipiuk > Subject: RE: Unicode "no-op" Character? > > The original comment

Re: Unicode "no-op" Character?

2019-06-22 Thread Richard Wordingham via Unicode
On Sat, 22 Jun 2019 23:56:11 + Shawn Steele via Unicode wrote: > Assuming you were using any of those characters as "markup", how > would you know when they were intentionally in the string and not > part of your marking system? If they're conveying an invisible message, one would have to

Re: Unicode "no-op" Character?

2019-06-22 Thread Richard Wordingham via Unicode
On Sat, 22 Jun 2019 17:50:49 -0400 Sławomir Osipiuk via Unicode wrote: > If faced with the same problem today, I’d > probably just go with U+FEFF (really only need a single char, not a > whole delimited substring) or a different C0 control (maybe SI/LS0) > and clean up the string if it needs to

Re: What is the best way to work around the current USE CV+C limitation in Tai Tham?

2019-05-22 Thread Richard Wordingham via Unicode
On Wed, 22 May 2019 00:14:57 -0400 Ed Trager wrote: > I'm hoping one or both of you can provide me some guidance on this, > thank you! Unfortunately, my OpenType skills are not at the "ninja" > level required to get around all of the limitations in USE ... If blind copying of Lamphun or Da

Re: Lao Sign Pali Virama and vowels above

2019-05-21 Thread Richard Wordingham via Unicode
On Tue, 21 May 2019 00:36:33 + Andrew Glass via Unicode wrote: > This is because the sequences include U+0EBA which was added in > Unicode 12.0. Edge has not updated for Unicode 12 at this time. That suspicion was why I was hoping it was a temporary aberration. When it is so updated, will

Re: Lao Sign Pali Virama and vowels above

2019-05-20 Thread Richard Wordingham via Unicode
On Mon, 20 May 2019 22:53:36 +0100 Richard Wordingham via Unicode wrote: > MS Edge is currently giving me dotted circles for the sequences > and UU>. I trust this is just a temporary aberration. Also with the sequence , as in the nominative singular ສັນທິຕ຺ຖ຺ິໂກ of ສັນທິຕ຺ຖ຺ິ

Lao Sign Pali Virama and vowels above

2019-05-20 Thread Richard Wordingham via Unicode
When a consonant bears both U+0EBA LAO SIGN PALI VIRAMA (acting as a nukta) and a vowel above, is there or is there intended to be any constraint on their relative order? While U+0EBA has canonical combining class 9, the vowels above have canonical combining class 0, so the order makes a
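
The point about combining classes can be checked with a short sketch. It uses the Thai analogue U+0E3A THAI CHARACTER PHINTHU (also class 9) with U+0E34 THAI CHARACTER SARA I (class 0), on the assumption that the Lao case behaves the same way; U+0EBA itself needs Unicode 12.0 data, which an older Python may lack. The helper names are hypothetical.

```python
import unicodedata

def ccc_profile(text):
    """List each code point with its canonical combining class."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Thai analogue of the Lao question: consonant + class-9 mark + vowel above.
mark_first = "\u0E01\u0E3A\u0E34"   # KO KAI, PHINTHU, SARA I
vowel_first = "\u0E01\u0E34\u0E3A"  # KO KAI, SARA I, PHINTHU

print(ccc_profile(mark_first))
print(canonically_equivalent(mark_first, vowel_first))

# The Lao sequences would be "\u0E81\u0EBA\u0EB4" and "\u0E81\u0EB4\u0EBA",
# assuming the interpreter's Unicode data is version 12.0 or later.
```

Because the vowel has class 0, normalization never moves the class-9 mark across it, so the two orders remain distinct strings.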

Re: Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?

2019-05-15 Thread Richard Wordingham via Unicode
On Wed, 15 May 2019 05:56:54 -0700 Asmus Freytag via Unicode wrote: > On 5/15/2019 4:22 AM, Costello, Roger L. via Unicode wrote: > Hello Unicode experts! > > Which is correct: > > (a) The input file contains a string. The string is encoded using > UTF-8. > > (b) The input file contains a

Lao Nukta

2019-05-14 Thread Richard Wordingham via Unicode
I was looking through Maha Sena's textbook on Tai Tham for Pali, and I noticed that he had a Lao script Pali section that made use of a nukta that seems to me to be indistinguishable from U+0EBA LAO SIGN PALI VIRAMA. Is it therefore in order to use that character for this nukta, just as U+0E3A

Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-05-14 Thread Richard Wordingham via Unicode
On Tue, 14 May 2019 03:08:04 +0100 Richard Wordingham via Unicode wrote: > Together, > these call for (Sk B)* to be replaced by (). Correction: Together, these call for (Sk B)* to be replaced by ()*. Richard.

Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-05-13 Thread Richard Wordingham via Unicode
On Tue, 14 May 2019 00:58:07 + Andrew Glass via Unicode wrote: > Here is the essence of the initial changes needed to support CV+C. > Open to feedback. > > > * Create new SAKOT class > SAKOT (Sk) based on UISC = Invisible_Stacker > * Reduced HALANT class > Now only HALANT (H) based

Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-05-09 Thread Richard Wordingham via Unicode
On Thu, 9 May 2019 11:55:23 -0400 Ed Trager via Unicode wrote: > ** A good use case is the Tai Tham word U+1A27 U+1A6A U+1A60 U+1A37 , > transcribed to Central Thai script as จูบ, (*to kiss*). Currently, > people are writing this as U+1A27 U+1A60 U+1A37 U+1A6A ("จบู") which > violates the
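
The two spellings quoted above can be compared mechanically. This is only a sketch: the variable names are made up, and it shows nothing about how any shaper renders either sequence, only that normalization does not reconcile them.

```python
import unicodedata

# The two encodings of the Tai Tham word quoted in the post.
phonetic_order = "\u1A27\u1A6A\u1A60\u1A37"    # U+1A27 U+1A6A U+1A60 U+1A37
workaround_order = "\u1A27\u1A60\u1A37\u1A6A"  # U+1A27 U+1A60 U+1A37 U+1A6A

def describe(text):
    """Print code point, combining class and character name for each character."""
    for ch in text:
        name = unicodedata.name(ch, "<unnamed in this Unicode version>")
        print(f"U+{ord(ch):04X}  ccc={unicodedata.combining(ch):3d}  {name}")

describe(phonetic_order)
print()
describe(workaround_order)

# Same characters, different order, and not canonically equivalent.
same_characters = sorted(phonetic_order) == sorted(workaround_order)
equivalent = (unicodedata.normalize("NFD", phonetic_order)
              == unicodedata.normalize("NFD", workaround_order))
print(same_characters, equivalent)
```

A search for one spelling will therefore not find the other unless the searching software folds them together itself.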

Choice between Identical Tai Tham Characters

2019-05-06 Thread Richard Wordingham via Unicode
What authoritative recommendations or injunctions have been given for choosing between the encodings and for the subscript character known natively as 'hang ba'? The choice has no implication as to glyph shape or the pronunciation of the character, and the only difference in Unicode-associated

Re: asking advice of the Unicode community on new character proposal

2019-05-03 Thread Richard Wordingham via Unicode
On Fri, 3 May 2019 11:01:33 +0300 Jack Rueter via Unicode wrote: > The additional Latin characters to be proposed include Latin capital > and small letters C, D, L, S, T and ɜ with descenders. They also > include a number of Cyrillic letters, capital and small Ukrainian IE > (in Komi a hard
