Re: Annoyances from Implementation of Canonical Equivalence (was: Pure Regular Expression Engines and Literal Clusters)

2019-10-16 Thread Eli Zaretskii via Unicode
> Date: Tue, 15 Oct 2019 20:52:15 +0100 > From: Richard Wordingham via Unicode > > > > > I'm well aware of the official position. However, when we > > > > attempted to implement it unconditionally in Emacs, some people > > > > objected, and brought up good reasons. You can, of course, elect >

Re: Annoyances from Implementation of Canonical Equivalence (was: Pure Regular Expression Engines and Literal Clusters)

2019-10-15 Thread Richard Wordingham via Unicode
On Tue, 15 Oct 2019 09:43:23 +0300 Eli Zaretskii via Unicode wrote: > > Date: Tue, 15 Oct 2019 00:23:59 +0100 > > From: Richard Wordingham via Unicode > > > > > I'm well aware of the official position. However, when we > > > attempted to implement it unconditionally in Emacs, some people >

Re: Annoyances from Implementation of Canonical Equivalence (was: Pure Regular Expression Engines and Literal Clusters)

2019-10-15 Thread Eli Zaretskii via Unicode
> Date: Tue, 15 Oct 2019 00:23:59 +0100 > From: Richard Wordingham via Unicode > > > I'm well aware of the official position. However, when we attempted > > to implement it unconditionally in Emacs, some people objected, and > > brought up good reasons. You can, of course, elect to disregard

Annoyances from Implementation of Canonical Equivalence (was: Pure Regular Expression Engines and Literal Clusters)

2019-10-14 Thread Richard Wordingham via Unicode
On Mon, 14 Oct 2019 21:41:19 +0300 Eli Zaretskii via Unicode wrote: > > Date: Mon, 14 Oct 2019 19:29:39 +0100 > > From: Richard Wordingham via Unicode > > The official position is that text that is canonically > > equivalent is the same. There are problem areas where traditional > > modes of

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-14 Thread Eli Zaretskii via Unicode
> Date: Mon, 14 Oct 2019 19:29:39 +0100 > From: Richard Wordingham via Unicode > > On Mon, 14 Oct 2019 10:05:49 +0300 > Eli Zaretskii via Unicode wrote: > > > I think these are two separate issues: whether search should normalize > > (a.k.a. perform character folding) should be a user option.

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-14 Thread Richard Wordingham via Unicode
On Mon, 14 Oct 2019 10:05:49 +0300 Eli Zaretskii via Unicode wrote: > > Date: Mon, 14 Oct 2019 01:10:45 +0100 > > From: Richard Wordingham via Unicode > > They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*, > > and were expecting normalisation (even to NFC) to be a possible > >

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-14 Thread Hans Åberg via Unicode
> On 14 Oct 2019, at 02:10, Richard Wordingham via Unicode > wrote: > > On Mon, 14 Oct 2019 00:22:36 +0200 > Hans Åberg via Unicode wrote: > >>> On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode >>> wrote: > >>> Besides invalidating complexity metrics, the issue was what \p{Lu}

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-14 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 21:28:34 -0700 Mark Davis ☕️ via Unicode wrote: > The problem is that most regex engines are not written to handle some > "interesting" features of canonical equivalence, like discontinuity. > Suppose that X is canonically equivalent to AB. > >- A query /X/ can match the

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-14 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 20:25:25 -0700 Asmus Freytag via Unicode wrote: > On 10/13/2019 6:38 PM, Richard Wordingham via Unicode wrote: > On Sun, 13 Oct 2019 17:13:28 -0700 >> Yes. There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so >> [:Lu:] should not match > COMBINING CIRCUMFLEX ACCENT>.

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-14 Thread Eli Zaretskii via Unicode
> Date: Mon, 14 Oct 2019 01:10:45 +0100 > From: Richard Wordingham via Unicode > > >> Besides invalidating complexity metrics, the issue was what \p{Lu} > >> should match. For example, with PCRE syntax, GNU grep Version 2.25 > >> \p{Lu} matches U+0100 but not . When I'm respecting > >>

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Mark Davis ☕️ via Unicode
The problem is that most regex engines are not written to handle some "interesting" features of canonical equivalence, like discontinuity. Suppose that X is canonically equivalent to AB. - A query /X/ can match the separated A and C in the target string "AbC". So if I have code do [replace
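A minimal sketch of the discontinuity described above, using only Python's standard unicodedata module. The concrete characters are assumptions chosen for illustration and are not the example from the message: X is taken to be U+1E0D (d with dot below), which is canonically equivalent to <U+0064, U+0323>, and the intervening mark is U+0307 (combining dot above). The helper nfd_contains is hypothetical and is only a naive substring test on NFD text, not a full canonical-equivalence matcher.

```python
import unicodedata

QUERY = "\u1E0D"          # assumed X: LATIN SMALL LETTER D WITH DOT BELOW
TARGET = "d\u0307\u0323"  # d, COMBINING DOT ABOVE, COMBINING DOT BELOW


def nfd(text: str) -> str:
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


def nfd_contains(query: str, target: str) -> bool:
    """Naive test: does NFD(query) occur as a substring of NFD(target)?"""
    return nfd(query) in nfd(target)


# In the raw target, the d and the dot below are separated by the dot above,
# so neither X nor its decomposition occurs as a contiguous substring.
raw_hit = QUERY in TARGET or nfd(QUERY) in TARGET

# Canonical reordering moves the dot below (ccc 220) in front of the
# dot above (ccc 230), so the normalised target does contain NFD(X).
normalised_hit = nfd_contains(QUERY, TARGET)

print(raw_hit, normalised_hit)
```

The match found after normalisation covers the first and third code points of the original target but not the second, which is why it is unclear what a replace operation on the original string should produce.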

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Asmus Freytag via Unicode
On 10/13/2019 6:38 PM, Richard Wordingham via Unicode wrote: On Sun, 13 Oct 2019 17:13:28 -0700 Asmus Freytag via Unicode wrote: On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote: Besides invalidating complexity metrics, the issue was

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 17:13:28 -0700 Asmus Freytag via Unicode wrote: > On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote: > Besides invalidating complexity metrics, the issue was what \p{Lu} > should match. For example, with PCRE syntax, GNU grep Version 2.25 > \p{Lu} matches U+0100

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Asmus Freytag via Unicode
On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote: Besides invalidating complexity metrics, the issue was what \p{Lu} should match. For example, with PCRE syntax, GNU grep Version 2.25 \p{Lu} matches U+0100 but not . When I'm respecting canonical
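The behaviour under discussion can be seen at the level of individual code points with Python's unicodedata module. This is a sketch under one assumption: that the sequence missing from the excerpt is the canonical decomposition of U+0100, namely <U+0041, U+0304>. The helper name categories is hypothetical. Python's re module has no \p{Lu}, so the General_Category values are inspected directly.

```python
import unicodedata


def categories(text: str) -> list:
    """Return the General_Category of every code point in text."""
    return [unicodedata.category(ch) for ch in text]


precomposed = "\u0100"  # LATIN CAPITAL LETTER A WITH MACRON
decomposed = unicodedata.normalize("NFD", precomposed)  # assumed elided sequence

# A matcher that tests \p{Lu} one code point at a time sees a single Lu
# character in the first case, and an Lu character plus a separate Mn
# combining mark in the second.
print(categories(precomposed))
print(categories(decomposed))

# There is no precomposed M with circumflex, so NFC leaves this pair alone
# and normalising to NFC before matching does not sidestep the question.
m_circumflex = "M\u0302"
print(unicodedata.normalize("NFC", m_circumflex) == m_circumflex)
```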

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Mon, 14 Oct 2019 00:22:36 +0200 Hans Åberg via Unicode wrote: > > On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode > > wrote: >> Besides invalidating complexity metrics, the issue was what \p{Lu} >> should match. For example, with PCRE syntax, GNU grep Version 2.25 >> \p{Lu}

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Hans Åberg via Unicode
> On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode > wrote: > > The point about these examples is that the estimate of one state per > character becomes a severe underestimate. For example, after > processing 20 a's, the NFA for /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/ can > be in any of
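To make the point about state counts concrete, here is a small simulation of the quoted pattern. It is a sketch only: it models the NFA as (segment, count) configurations, where count is the number of characters consumed in the current bounded repetition, and the names SEGMENTS, closure, step and active_after are all hypothetical. The number of configurations it reports depends on this modelling choice and is not necessarily the figure given in the message.

```python
# /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/ as (character set, min, max) segments.
SEGMENTS = [("ab", 0, 20), ("ac", 10, 20), ("ad", 0, 20)]
END = len(SEGMENTS)  # segment index meaning "now expecting the final e"


def closure(configs):
    """Add configurations reachable without consuming input."""
    seen = set(configs)
    work = list(configs)
    while work:
        seg, count = work.pop()
        # A segment may be left once its minimum repeat count is reached.
        if seg < END and count >= SEGMENTS[seg][1]:
            nxt = (seg + 1, 0)
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return seen


def step(configs, ch):
    """Advance every active configuration over one input character."""
    out = set()
    for seg, count in configs:
        if seg < END:
            chars, _low, high = SEGMENTS[seg]
            if ch in chars and count < high:
                out.add((seg, count + 1))
        elif seg == END and ch == "e":
            out.add((END + 1, 0))  # accepting configuration
    return closure(out)


def active_after(text):
    """Return the set of configurations active after reading text."""
    configs = closure({(0, 0)})
    for ch in text:
        configs = step(configs, ch)
    return configs


print(len(active_after("a" * 20)))
```

Because 'a' belongs to all three character classes, the 20 characters read so far can be split between the segments in many ways, and each split leaves a different counter value active at the same time.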

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 22:14:10 +0200 Hans Åberg via Unicode wrote: > > On 13 Oct 2019, at 21:17, Richard Wordingham via Unicode > > wrote: > > Incidentally, at least some of the sizes and timings I gave seem to > > be wrong even for strings. They won't work with numeric > > quantifiers, as in

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Hans Åberg via Unicode
> On 13 Oct 2019, at 21:17, Richard Wordingham via Unicode > wrote: > > On Sun, 13 Oct 2019 15:29:04 +0200 > Hans Åberg via Unicode wrote: > >>> On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode >>> I'm now beginning to wonder what you are claiming. > >> I start with a NFA with no

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 15:29:04 +0200 Hans Åberg via Unicode wrote: > > On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode > > I'm now beginning to wonder what you are claiming. > I start with a NFA with no empty transitions and apply the subset DFA > construction dynamically for a given
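A rough sketch of what applying the subset construction dynamically can look like: DFA states are sets of NFA states, and each DFA transition is computed only when the input first requires it, then cached. The NFA below, for (a|b)*abb, and the names NFA, START, ACCEPTING and LazyDFA are assumptions made for illustration and are not taken from the thread.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical NFA without empty transitions for (a|b)*abb.
NFA: Dict[Tuple[int, str], Set[int]] = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
ACCEPTING = {3}


class LazyDFA:
    """Subset construction performed on demand, one transition at a time."""

    def __init__(self) -> None:
        # Maps (set of NFA states, character) to the next set of NFA states.
        self.cache: Dict[Tuple[FrozenSet[int], str], FrozenSet[int]] = {}

    def step(self, dstate: FrozenSet[int], ch: str) -> FrozenSet[int]:
        key = (dstate, ch)
        if key not in self.cache:
            targets: Set[int] = set()
            for state in dstate:
                targets |= NFA.get((state, ch), set())
            self.cache[key] = frozenset(targets)
        return self.cache[key]

    def matches(self, text: str) -> bool:
        """Return True if the whole of text is accepted."""
        dstate = frozenset({START})
        for ch in text:
            dstate = self.step(dstate, ch)
        return bool(dstate & ACCEPTING)


dfa = LazyDFA()
print(dfa.matches("aabb"), dfa.matches("abab"), len(dfa.cache))
```

Only the DFA states that a given input actually visits are ever built, which is the attraction of the dynamic approach when the full DFA would be large.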

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Hans Åberg via Unicode
> On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode > wrote: > >>> On Sat, 12 Oct 2019 21:36:45 +0200 >>> Hans Åberg via Unicode wrote: >>> > On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode > wrote: > > But remember that 'having longer first' is meaningless

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 10:04:34 +0200 Hans Åberg via Unicode wrote: > > On 13 Oct 2019, at 00:37, Richard Wordingham via Unicode > > wrote: > > > > On Sat, 12 Oct 2019 21:36:45 +0200 > > Hans Åberg via Unicode wrote: > > > >>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode > >>>

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Hans Åberg via Unicode
> On 13 Oct 2019, at 00:37, Richard Wordingham via Unicode > wrote: > > On Sat, 12 Oct 2019 21:36:45 +0200 > Hans Åberg via Unicode wrote: > >>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode >>> wrote: >>> >>> But remember that 'having longer first' is meaningless for a >>>

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-12 Thread Richard Wordingham via Unicode
On Sat, 12 Oct 2019 21:36:45 +0200 Hans Åberg via Unicode wrote: > > On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode > > wrote: > > > > But remember that 'having longer first' is meaningless for a > > non-deterministic finite automaton that does a single pass through > > the string to

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-12 Thread Richard Wordingham via Unicode
On Fri, 11 Oct 2019 12:39:56 +0200 Elizabeth Mattijsen via Unicode wrote: > Furthermore, Perl 6 uses Normalization Form Grapheme for matching: > https://docs.perl6.org/type/Cool#index-entry-Grapheme This approach does address the issue Mark Davis mentioned about regex engines working at

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-12 Thread Hans Åberg via Unicode
> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode > wrote: > > But remember that 'having longer first' is meaningless for a > non-deterministic finite automaton that does a single pass through the > string to be searched. It is possible to identify all submatches deterministically

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-12 Thread Richard Wordingham via Unicode
On Fri, 11 Oct 2019 18:37:18 -0700 Mark Davis ☕️ via Unicode wrote: > > > > You claimed the order of alternatives mattered. That is an > > important issue for anyone rash enough to think that the standard > > is fit to be used as a specification. > > > > Regex engines differ in how they

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-11 Thread Mark Davis ☕️ via Unicode
> > You claimed the order of alternatives mattered. That is an important > issue for anyone rash enough to think that the standard is fit to be > used as a specification. > Regex engines differ in how they handle the interpretation of the matching of alternatives, and it is not possible for us

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-11 Thread Richard Wordingham via Unicode
On Fri, 11 Oct 2019 14:35:33 -0700 Markus Scherer via Unicode wrote: > > > [c \q{ch}]h should work like (ch|c)h. Note that the order matters > > > in the alternation -- so this works equivalently if longer > > > strings are sorted first. > > Does conformance to UTS#18 level 2 mandate the

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-11 Thread Markus Scherer via Unicode
On Fri, Oct 11, 2019 at 12:05 PM Richard Wordingham via Unicode < unicode@unicode.org> wrote: > On Thu, 10 Oct 2019 15:23:00 -0700 > Markus Scherer via Unicode wrote: > > > [c \q{ch}]h should work like (ch|c)h. Note that the order matters in > > the alternation -- so this works equivalently if

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-11 Thread Richard Wordingham via Unicode
On Thu, 10 Oct 2019 15:23:00 -0700 Markus Scherer via Unicode wrote: > [c \q{ch}]h should work like (ch|c)h. Note that the order matters in > the alternation -- so this works equivalently if longer strings are > sorted first. Thanks for answering the question. Does conformance to UTS#18 level
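The remark that order matters in the alternation can be checked in any backtracking engine that tries alternatives left to right. The sketch below uses Python's re module purely as one such engine; it says nothing about what UTS#18 requires, and the helper name first_match is hypothetical. Python's re has no \q{...} syntax, so only the alternation forms are shown.

```python
import re


def first_match(pattern: str, text: str) -> str:
    """Return the text matched at the start of text, or '' if none."""
    m = re.match(pattern, text)
    return m.group(0) if m else ""


LONGER_FIRST = r"(?:ch|c)h"   # the form the quoted message equates with [c \q{ch}]h
SHORTER_FIRST = r"(?:c|ch)h"  # same alternatives, opposite order

# On "ch" both orders agree: with the longer alternative first the engine
# tries "ch", fails to find the trailing h, and backtracks to "c".
print(first_match(LONGER_FIRST, "ch"), first_match(SHORTER_FIRST, "ch"))

# On "chh" they differ: the shorter-first pattern is satisfied by "c" + "h"
# and never tries the literal cluster "ch".
print(first_match(LONGER_FIRST, "chh"), first_match(SHORTER_FIRST, "chh"))
```

An engine that reports the longest overall match, as a single-pass automaton can, would return the same result for both orders, which is the distinction the rest of the thread turns on.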

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-11 Thread Richard Wordingham via Unicode
On Fri, 11 Oct 2019 12:39:56 +0200 Elizabeth Mattijsen via Unicode wrote: > Furthermore, Perl 6 uses Normalization Form Grapheme for matching: > https://docs.perl6.org/type/Cool#index-entry-Grapheme I seriously doubt that a Thai considers each combination of consonant (44), non-spacing

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-11 Thread Elizabeth Mattijsen via Unicode
> On 11 Oct 2019, at 00:23, Markus Scherer via Unicode > wrote: > > On Tue, Oct 8, 2019 at 7:28 AM Richard Wordingham via Unicode > wrote: > An example UTS#18 gives for matching a literal cluster can be simplified > to, in its notation: > > [c \q{ch}] > > This is interpreted as 'match

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-10 Thread Markus Scherer via Unicode
On Tue, Oct 8, 2019 at 7:28 AM Richard Wordingham via Unicode < unicode@unicode.org> wrote: > An example UTS#18 gives for matching a literal cluster can be simplified > to, in its notation: > > [c \q{ch}] > > This is interpreted as 'match against "ch" if possible, otherwise > against "c"'. Thus

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-10 Thread Richard Wordingham via Unicode
On Tue, 8 Oct 2019 15:25:34 +0100 Richard Wordingham via Unicode wrote: > An example UTS#18 gives for matching a literal cluster can be > simplified to, in its notation: > > [c \q{ch}] > > This is interpreted as 'match against "ch" if possible, otherwise > against "c"'. Thus the strings "ca"

Pure Regular Expression Engines and Literal Clusters

2019-10-08 Thread Richard Wordingham via Unicode
I've been puzzling over how a pure regular expression engine that works via a non-deterministic finite automaton can be bent to accommodate 'literal clusters' as in Requirement RL2.2 'Extended Grapheme Clusters' of UTS#18 'Unicode Regular Expressions' - "To meet this requirement, an implementation