> Date: Tue, 15 Oct 2019 20:52:15 +0100
> From: Richard Wordingham via Unicode
>
> > > > I'm well aware of the official position. However, when we
> > > > attempted to implement it unconditionally in Emacs, some people
> > > > objected, and brought up good reasons. You can, of course, elect
>
On Tue, 15 Oct 2019 09:43:23 +0300
Eli Zaretskii via Unicode wrote:
> > Date: Tue, 15 Oct 2019 00:23:59 +0100
> > From: Richard Wordingham via Unicode
> >
> > > I'm well aware of the official position. However, when we
> > > attempted to implement it unconditionally in Emacs, some people
>
> Date: Tue, 15 Oct 2019 00:23:59 +0100
> From: Richard Wordingham via Unicode
>
> > I'm well aware of the official position. However, when we attempted
> > to implement it unconditionally in Emacs, some people objected, and
> > brought up good reasons. You can, of course, elect to disregard
On Mon, 14 Oct 2019 21:41:19 +0300
Eli Zaretskii via Unicode wrote:
> > Date: Mon, 14 Oct 2019 19:29:39 +0100
> > From: Richard Wordingham via Unicode
> > The official position is that text that is canonically
> > equivalent is the same. There are problem areas where traditional
> > modes of
> Date: Mon, 14 Oct 2019 19:29:39 +0100
> From: Richard Wordingham via Unicode
>
> On Mon, 14 Oct 2019 10:05:49 +0300
> Eli Zaretskii via Unicode wrote:
>
> > I think these are two separate issues: whether search should normalize
> > (a.k.a. performs character folding) should be a user option.
On Mon, 14 Oct 2019 10:05:49 +0300
Eli Zaretskii via Unicode wrote:
> > Date: Mon, 14 Oct 2019 01:10:45 +0100
> > From: Richard Wordingham via Unicode
> > They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*,
> > and were expecting normalisation (even to NFC) to be a possible
> >
> On 14 Oct 2019, at 02:10, Richard Wordingham via Unicode
> wrote:
>
> On Mon, 14 Oct 2019 00:22:36 +0200
> Hans Åberg via Unicode wrote:
>
>>> On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode
>>> wrote:
>
>>> Besides invalidating complexity metrics, the issue was what \p{Lu}
On Sun, 13 Oct 2019 21:28:34 -0700
Mark Davis ☕️ via Unicode wrote:
> The problem is that most regex engines are not written to handle some
> "interesting" features of canonical equivalence, like discontinuity.
> Suppose that X is canonically equivalent to AB.
>
>- A query /X/ can match the
On Sun, 13 Oct 2019 20:25:25 -0700
Asmus Freytag via Unicode wrote:
> On 10/13/2019 6:38 PM, Richard Wordingham via Unicode wrote:
> On Sun, 13 Oct 2019 17:13:28 -0700
>> Yes. There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so
>> [:Lu:] should not match <LATIN CAPITAL LETTER M, COMBINING CIRCUMFLEX ACCENT>.
> Date: Mon, 14 Oct 2019 01:10:45 +0100
> From: Richard Wordingham via Unicode
>
> >> Besides invalidating complexity metrics, the issue was what \p{Lu}
> >> should match. For example, with PCRE syntax, GNU grep Version 2.25
> >> \p{Lu} matches U+0100 but not <U+0041, U+0304>. When I'm respecting
> >>
The problem is that most regex engines are not written to handle some
"interesting" features of canonical equivalence, like discontinuity.
Suppose that X is canonically equivalent to AB.
- A query /X/ can match the separated A and C in the target string
"AbC". So if I have code do [replace
On 10/13/2019 6:38 PM, Richard
Wordingham via Unicode wrote:
On Sun, 13 Oct 2019 17:13:28 -0700
Asmus Freytag via Unicode wrote:
On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
Besides invalidating complexity metrics, the issue was
On Sun, 13 Oct 2019 17:13:28 -0700
Asmus Freytag via Unicode wrote:
> On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
> Besides invalidating complexity metrics, the issue was what \p{Lu}
> should match. For example, with PCRE syntax, GNU grep Version 2.25
> \p{Lu} matches U+0100
On 10/13/2019 2:54 PM, Richard
Wordingham via Unicode wrote:
Besides invalidating complexity metrics, the issue was what \p{Lu}
should match. For example, with PCRE syntax, GNU grep Version 2.25
\p{Lu} matches U+0100 but not <U+0041, U+0304>. When I'm respecting
canonical
On Mon, 14 Oct 2019 00:22:36 +0200
Hans Åberg via Unicode wrote:
> > On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode
> > wrote:
>> Besides invalidating complexity metrics, the issue was what \p{Lu}
>> should match. For example, with PCRE syntax, GNU grep Version 2.25
>> \p{Lu}
> On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode
> wrote:
>
> The point about these examples is that the estimate of one state per
> character becomes a severe underestimate. For example, after
> processing 20 a's, the NFA for /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/ can
> be in any of
On Sun, 13 Oct 2019 22:14:10 +0200
Hans Åberg via Unicode wrote:
> > On 13 Oct 2019, at 21:17, Richard Wordingham via Unicode
> > wrote:
> > Incidentally, at least some of the sizes and timings I gave seem to
> > be wrong even for strings. They won't work with numeric
> > quantifiers, as in
> On 13 Oct 2019, at 21:17, Richard Wordingham via Unicode
> wrote:
>
> On Sun, 13 Oct 2019 15:29:04 +0200
> Hans Åberg via Unicode wrote:
>
>>> On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode
>>> I'm now beginning to wonder what you are claiming.
>
>> I start with a NFA with no
On Sun, 13 Oct 2019 15:29:04 +0200
Hans Åberg via Unicode wrote:
> > On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode
> > I'm now beginning to wonder what you are claiming.
> I start with a NFA with no empty transitions and apply the subset DFA
> construction dynamically for a given
> On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode
> wrote:
>
>>> On Sat, 12 Oct 2019 21:36:45 +0200
>>> Hans Åberg via Unicode wrote:
>>>
> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode
> wrote:
>
> But remember that 'having longer first' is meaningless
On Sun, 13 Oct 2019 10:04:34 +0200
Hans Åberg via Unicode wrote:
> > On 13 Oct 2019, at 00:37, Richard Wordingham via Unicode
> > wrote:
> >
> > On Sat, 12 Oct 2019 21:36:45 +0200
> > Hans Åberg via Unicode wrote:
> >
> >>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode
> >>>
> On 13 Oct 2019, at 00:37, Richard Wordingham via Unicode
> wrote:
>
> On Sat, 12 Oct 2019 21:36:45 +0200
> Hans Åberg via Unicode wrote:
>
>>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode
>>> wrote:
>>>
>>> But remember that 'having longer first' is meaningless for a
>>>
On Sat, 12 Oct 2019 21:36:45 +0200
Hans Åberg via Unicode wrote:
> > On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode
> > wrote:
> >
> > But remember that 'having longer first' is meaningless for a
> > non-deterministic finite automaton that does a single pass through
> > the string to
On Fri, 11 Oct 2019 12:39:56 +0200
Elizabeth Mattijsen via Unicode wrote:
> Furthermore, Perl 6 uses Normalization Form Grapheme for matching:
> https://docs.perl6.org/type/Cool#index-entry-Grapheme
This approach does address the issue Mark Davis mentioned about regex
engines working at
> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode
> wrote:
>
> But remember that 'having longer first' is meaningless for a
> non-deterministic finite automaton that does a single pass through the
> string to be searched.
It is possible to identify all submatches deterministically
On Fri, 11 Oct 2019 18:37:18 -0700
Mark Davis ☕️ via Unicode wrote:
> >
> > You claimed the order of alternatives mattered. That is an
> > important issue for anyone rash enough to think that the standard
> > is fit to be used as a specification.
> >
>
> Regex engines differ in how they
>
> You claimed the order of alternatives mattered. That is an important
> issue for anyone rash enough to think that the standard is fit to be
> used as a specification.
>
Regex engines differ in how they handle the interpretation of the matching
of alternatives, and it is not possible for us
On Fri, 11 Oct 2019 14:35:33 -0700
Markus Scherer via Unicode wrote:
> > > [c \q{ch}]h should work like (ch|c)h. Note that the order matters
> > > in the alternation -- so this works equivalently if longer
> > > strings are sorted first.
> > Does conformance UTS#18 to level 2 mandate the
On Fri, Oct 11, 2019 at 12:05 PM Richard Wordingham via Unicode <
unicode@unicode.org> wrote:
> On Thu, 10 Oct 2019 15:23:00 -0700
> Markus Scherer via Unicode wrote:
>
> > [c \q{ch}]h should work like (ch|c)h. Note that the order matters in
> > the alternation -- so this works equivalently if
On Thu, 10 Oct 2019 15:23:00 -0700
Markus Scherer via Unicode wrote:
> [c \q{ch}]h should work like (ch|c)h. Note that the order matters in
> the alternation -- so this works equivalently if longer strings are
> sorted first.
Thanks for answering the question.
Does conformance UTS#18 to level
On Fri, 11 Oct 2019 12:39:56 +0200
Elizabeth Mattijsen via Unicode wrote:
> Furthermore, Perl 6 uses Normalization Form Grapheme for matching:
> https://docs.perl6.org/type/Cool#index-entry-Grapheme
I seriously doubt that a Thai considers each combination of consonant
(44), non-spacing
> On 11 Oct 2019, at 00:23, Markus Scherer via Unicode
> wrote:
>
> On Tue, Oct 8, 2019 at 7:28 AM Richard Wordingham via Unicode
> wrote:
> An example UTS#18 gives for matching a literal cluster can be simplified
> to, in its notation:
>
> [c \q{ch}]
>
> This is interpreted as 'match
On Tue, Oct 8, 2019 at 7:28 AM Richard Wordingham via Unicode <
unicode@unicode.org> wrote:
> An example UTS#18 gives for matching a literal cluster can be simplified
> to, in its notation:
>
> [c \q{ch}]
>
> This is interpreted as 'match against "ch" if possible, otherwise
> against "c". Thus
On Tue, 8 Oct 2019 15:25:34 +0100
Richard Wordingham via Unicode wrote:
> An example UTS#18 gives for matching a literal cluster can be
> simplified to, in its notation:
>
> [c \q{ch}]
>
> This is interpreted as 'match against "ch" if possible, otherwise
> against "c". Thus the strings "ca"
I've been puzzling over how a pure regular expression engine that works
via a non-deterministic finite automaton can be bent to accommodate
'literal clusters' as in Requirement RL2.2 'Extended Grapheme Clusters'
of UTS#18 'Unicode Regular Expressions' - "To meet this requirement, an
implementation