Justin Mason did write:
David Landgren writes:
[...]
That is, the assembled approach runs in under 1/500th of the time of
looping through the list of REs. A_C is even worse, but given that the
pattern is over twice as long, and chock full of metacharacters I
suppose this shouldn't come as a surprise but it does seem odd. I'll
check back with Yves and see if my methodology is sane there.
"perl5.9.4" -- that's using bleadperl, right?
Yes. I wanted to check out the Aho-Corasick matching. I haven't been
following super closely, but I'm pretty sure Yves said that it doesn't
deal well with meta-characters. These results tend to bear that out.
Can you post the output of "use re qw(debug); /....regexp..../"? I'd
like to see what the structure looks like from the R::A regexp -- 500
times is a stunning speedup.
The uncompressed output is 1.7Mb, so I've bzipped it. You can pick it up
here:
http://www.landgren.net/ re.debug.txt.bz2
(remove the space in the middle: I've done it that way to stop robots
from stumbling over it via a list archive and wasting bandwidth).
What may help you follow what's going on more easily is looking at
http://www.landgren.net/ re.indented.txt.bz2
(again, lose the space) which is the result of the assembly, indented.
You can put it directly into a m/.../x and it will work; it's legal Perl.
The speedup comes from the fact that once you hit the engine, for all
the legit hostnames like 'mail.example.com', only the first few
characters will be examined and then you hit a fail state. There is no
backtracking left on the stack, so the entire match fails at that point,
and 99% of the regop tree is never consulted. Whereas the
loop-over-a-list approach will try the pattern, and the next, and the
next...
To put it another way, it's not so much that it matches quickly, but
rather it fails quickly. Of course, patterns that contain .*? sequences
will require backtracking, but that's just the nature of the beast.
David
--
"It's overkill of course, but you can never have too much overkill."