Re: Regex surprises

2022-09-12 Thread Brad Gilbert
Raku removed all of the regex cruft that has accumulated over the years.
(Much of that cruft was added by Perl.)

I'm not going to respond to the first part of your email, as I think it is
an implementation artifact.

On Mon, Sep 12, 2022 at 3:06 PM Sean McAfee wrote:

> Hello--
>
> I stumbled across a couple of bits of surprising regex behavior today.
>
> First, consider:
>
> S:g[ x { say 1 } ] = say 2 given "xx"
>
> I expected this to print 1, 2, 1, 2, but it prints 1, 1, 2, 2.  So it
> looks like, in a global substitution like this, Raku doesn't look for
> successive matches and then evaluate the replacements as it goes, but finds
> all of the matches *first* and then works through the substitutions.  In
> my actual problem I was mutating state in the regex code block, and then it
> didn't work because all of the mutations happened before even a single
> replacement was evaluated.  Is it really meant to work this way?
>



Now the following is intentional.

Raku treats regexes as a domain-specific sublanguage.
One consequence is that each subexpression, such as each branch of an
alternation, is independent of its siblings.

Would you expect `/ (a) | (b) /` to act significantly differently to `if
($a) {$0 = ...} elsif ($b) {$0 = ...}`?
(Where `$0` could be thought of as representing the first value on the
stack, or similar.)

> Next, consider:
>
> > "y" ~~ /(x)|(y)/
> 「y」
>  0 => 「y」
>
> y is in the second set of grouping parentheses, so I expected it to be in
> group 1, but it's in group 0.  So it looks like the group index starts from
> 0 in every branch of an alternation.  I do so much regex slinging I'm
> amazed it took me so long to discover this, if it's not a relatively recent
> change.  I'm accustomed to being able to determine which alternation branch
> was matched by checking which group is defined (in other languages too, not
> just Raku).  This kind of thing:
>
> S:g[(x)|(y)] = $0 ?? x-replacement !! y-replacement
>
> I guess instead I need to do this:
>
> S:g[x|y] = $/ eq 'x' ?? x-replacement !! y-replacement
>
> It seems very strange that I need to re-examine the match to know what
> matched.  The match should be able to tell me what matched.  Or is there
> perhaps some alternate way for me to tell which alternative matched?
>

Other languages, including Perl, have just added feature after feature to
regexes without thinking about the regex language as a whole.

Raku started over from scratch. Larry then took the knowledge learned over
decades of language design and applied it to regex.
Like I said, one of those things that Larry realized is that independent
sub expressions should be independent.

If you really want to know how to determine which alternation matched,
there are plenty of ways to do it.

/ $0 = (x) | $1 = (y) /

/ $<x> = x | $<y> = y /

/
  x
  :my $*alternation = 0;
|
  y
  :my $*alternation = 1;
/

/
  x
  :my $*replacement = ...;
|
  y
  :my $*replacement = ...;
/

That last one would allow you to remove the `??` `!!` from your code.

(I haven't been doing much with Raku for months, so there are likely some
other methods I'm not thinking of.)
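
For what it's worth, here is a minimal runnable sketch of two of the ideas above, assuming the branches are the literal x and y from the question. The input strings and the printed messages are made up for illustration. The second half is a variation on the `:my` form: it declares the dynamic variable outside the regex and assigns to it from a code block in each branch, so the value is still visible after the match.

```raku
# Sketch only: the inputs and messages here are illustrative.

# 1. Named captures: whichever name is set tells you which branch matched.
for <x y> -> $str {
    if $str ~~ / $<x>=(x) | $<y>=(y) / {
        say "$str matched the ", ($<x> ?? 'first' !! 'second'), ' branch';
    }
}

# 2. A dynamic variable assigned from a code block in each branch.
#    (A variation: declared outside the regex rather than with :my inside it.)
my $*alternation;
for <x y> -> $str {
    if $str ~~ / x { $*alternation = 0 } | y { $*alternation = 1 } / {
        say "$str matched branch ", $*alternation;
    }
}
```

Either way the match itself carries the answer, so the replacement side of an `S:g[...] = ...` does not have to compare `$/` against the text again.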


Regex surprises

2022-09-12 Thread Sean McAfee
Hello--

I stumbled across a couple of bits of surprising regex behavior today.

First, consider:

S:g[ x { say 1 } ] = say 2 given "xx"

I expected this to print 1, 2, 1, 2, but it prints 1, 1, 2, 2.  So it looks
like, in a global substitution like this, Raku doesn't look for successive
matches and then evaluate the replacements as it goes, but finds all of the
matches *first* and then works through the substitutions.  In my actual
problem I was mutating state in the regex code block, and then it didn't
work because all of the mutations happened before even a single replacement
was evaluated.  Is it really meant to work this way?

Next, consider:

> "y" ~~ /(x)|(y)/
「y」
 0 => 「y」

y is in the second set of grouping parentheses, so I expected it to be in
group 1, but it's in group 0.  So it looks like the group index starts from
0 in every branch of an alternation.  I do so much regex slinging I'm
amazed it took me so long to discover this, if it's not a relatively recent
change.  I'm accustomed to being able to determine which alternation branch
was matched by checking which group is defined (in other languages too, not
just Raku).  This kind of thing:

S:g[(x)|(y)] = $0 ?? x-replacement !! y-replacement

I guess instead I need to do this:

S:g[x|y] = $/ eq 'x' ?? x-replacement !! y-replacement

It seems very strange that I need to re-examine the match to know what
matched.  The match should be able to tell me what matched.  Or is there
perhaps some alternate way for me to tell which alternative matched?