Re: RFC 110 (v3) counting matches

2000-08-31 Thread Mark-Jason Dominus


> (mystery: how
> can filling in $& be a lot slower than filling in $1?)

It isn't.  It's the same.  $1 might even be more expensive than $&.

It appears that many people don't understand the problem with $&.  I
will try to explain.

Maintaining the information required by $1 or $& slows down the regex
match, possibly by as much as forty to sixty percent, or more.  (How
much depends on details of the regex and the target string.)

For this reason, Perl has an optimization in it so that if you never
use $& anywhere in your program, Perl never maintains the information,
and every regex in your program runs faster.

But if you do use $& somewhere, Perl cannot apply the optimization,
and it must compute the $& information for every regex in the program.
Every regex becomes much slower.

In particular, if you load a module whose author happened to use $&,
all your regexes get slower, which might be an unpleasant surprise,
since you might not be aware of the cause.

A regex with backreferences is *also* slow.  But using backreferences
in one regex does not make all the *other* regexes slow.  If you have

    /(...)/   # regex 1
    /.../     # regex 2

Perl knows that it must compute the backreference information for
regex 1, and knows that it can skip computing the backreference
information for regex 2, because regex 2 contains no parentheses.

If you use a module that contains regexes that use backreferences,
those regexes run slowly, but there is no effect on *your* regexes.

The cost is just as high for backreferences as for $&, but the
backreference cost is paid only by regexes that actually need it.

The $& cost is paid by every regex in the entire program, whether it
uses $& or not.  This is because Perl has no way to tell which regexes
use $& and which do not.
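To illustrate the difference in who pays: a minimal sketch, assuming you
only need the matched text for one particular regex.  Wrapping that one
pattern in an extra pair of parentheses makes the whole match available
as $1, so $& never has to appear in the program.  The helper name
whole_match and the sample string are hypothetical, not from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the text matched by
# $re inside $string, or undef if there is no match, without using $&.
sub whole_match {
    my ($string, $re) = @_;
    # The extra pair of parentheses captures the entire match as $1, so
    # the capture cost is confined to this one regex.
    return $string =~ /($re)/ ? $1 : undef;
}

my $date = whole_match("posted 2000-08-31 to the list", qr/\d{4}-\d\d-\d\d/);
print defined $date ? "matched: $date\n" : "no match\n";
```

The design choice here is simply to trade a global cost for a local one:
this regex carries a backreference, and no other regex is affected.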

One of Uri's suggestions in RFC 158 was to compute $& only for regexes
that have a /k modifier.  This would solve the $& problem because Perl
would compute $& only when asked to, and not for every other regex in
the rest of the program.




Re: RFC 110 (v3) counting matches

2000-08-31 Thread Joe McMahon

Jonathan Scott Duff wrote:
 
> How about something like this?
>
>   $re = qr/(\d\d)-(\d\d)-(\d\d)/g;
>   $re->onmatch_callback(push @list, makedate(^0,^1,^2));
>   $string =~ $re;
 
It's not bad, but it loses one thing that I was trying to keep from the 
SNOBOL model. If you have (again, improvised syntax - I *know* you want 
to use the $ variables, OK? This is just for discussion):

   
   /($pat1)($pat2)($pat3)(?{sub1(@\)}$pat4|?{sub2(@\)}$pat5|?{sub3(@\)})/

This would translate to "if pat1pat2pat3 matches, call sub1 with all the 
matches to that point if pat4 matches afterward; otherwise call sub2 
with all the matches if pat5 matches; else just call sub3." The key bit 
here is that you pass over the sub call, deferring it until you've 
decided if the whole match worked, then picking the one that succeeded 
and calling it. If you don't like the syntax, please feel free to 
propose another. @\ seemed a good mnemonic for "the array of 
backreferences I already matched".
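For comparison, here is a rough sketch of the same "decide first, then
call" behaviour in existing Perl 5 syntax.  It is only an approximation
of the idea, not the proposed feature: the patterns, the handlers sub1
through sub3, and the dispatch helper are all hypothetical, and ordinary
capture groups stand in for the improvised @\ array.

```perl
use strict;
use warnings;

# Hypothetical patterns and handlers, chosen only for illustration.
# None of the patterns contains its own parentheses, so the capture
# numbering below stays predictable.
my ($pat1, $pat2, $pat3) = (qr/\d\d/, qr/-\d\d/, qr/-\d\d/);
my ($pat4, $pat5)        = (qr/T/, qr/Z/);

sub sub1 { return "sub1 got: @_" }
sub sub2 { return "sub2 got: @_" }
sub sub3 { return "sub3 got: @_" }

sub dispatch {
    my ($string) = @_;
    # The empty third alternative always succeeds, which gives the
    # "else just call sub3" case.
    return unless $string =~ /($pat1)($pat2)($pat3)(?:($pat4)|($pat5)|)/;
    my @so_far = ($1, $2, $3);
    # Only now, after the whole match has succeeded, pick the handler.
    return defined $4 ? sub1(@so_far)
         : defined $5 ? sub2(@so_far)
         :              sub3(@so_far);
}

for my $input ("00-08-31T", "00-08-31Z", "00-08-31") {
    my $result = dispatch($input);
    print "$input => $result\n";
}
```

No handler runs during matching; the choice is made from which
alternative captured something once the match as a whole is known to
have worked.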

And, of course, if you assume that @\ keeps growing when you use /g, 
then doing a scalar @\ and dividing by the number of backreferences would 
give you a match count:

   $string =~ /(\d\d)-(\d\d)-(\d\d)/g;
   $hits = scalar(@\)/3;

Of course, multiple alternatives with different numbers of 
backreferences lead to a problem, so maybe this is all academic. Oh well.
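The same arithmetic can be tried in existing Perl 5, where a /g match in
list context already returns every capture from every match.  This is a
small sketch with a made-up sample string; the array @captures plays the
role of the improvised @\ here.

```perl
use strict;
use warnings;

# Sample data, made up for illustration.
my $string = "00-08-31 99-12-25 01-01-01";

# In list context, a /g match returns every captured substring from
# every successful match, flattened into one list.
my @captures = $string =~ /(\d\d)-(\d\d)-(\d\d)/g;

# Three capture groups per match, so divide the total by 3.
my $hits = @captures / 3;
print "hits: $hits\n";
```

As noted above, the division only works when every match contributes the
same number of captures.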
--- Joe M.




Re: RFC 72 (v2) The regexp engine should go backward as well as forward.

2000-08-31 Thread Mike Mulligan

From: "Peter Heslin" [EMAIL PROTECTED]
Sent: Thursday, August 31, 2000 10:51 PM

> I would propose that your version of the syntax might also function in
> the middle of a regexp: /GHI(?`=DEF)JKL(?`=^ABC)MNO/ would match the
> start of the alphabet (fixed-length example used for simplicity).

That's not what I had in mind; I would have the new look-behind look (in
terms of left-to-right placement) and act like the existing (?<=pat), except
that it would have non-zero width and create a back-reference.

The example above would (if we remove the ^ ) match GHIDEFJKLABCMNO.

Hmm, that non-zero width thing is screwy.  The zero-width analog,
/GHI(?<=DEF)JKL(?<=^ABC)MNO/, would NEVER match.  That asks for a GHI
immediately followed by a JKL immediately preceded by a DEF.

Without some better motivating examples, I'd rather keep the old and
proposed look-behinds working the same.  So let me retract what I said above
about matching GHIDEFJKLABCMNO.  A match would be had with
/GHI.*(?`=DEF)JKL.*(?`=ABC)MNO/

  mike mulligan