Justin Mason did write:
David Landgren writes:
[...]

That is, the assembled approach runs in under 1/500th of the time of looping through the list of REs. A_C is even worse, but given that the pattern is over twice as long, and chock full of metacharacters I suppose this shouldn't come as a surprise but it does seem odd. I'll check back with Yves and see if my methodology is sane there.

"perl5.9.4" -- that's using bleadperl, right?

Yes. I wanted to check out the Aho-Corasick matching. I haven't been following super closely, but I'm pretty sure Yves said that it doesn't deal well with meta-characters. These results tend to bear that out.

Can you post the output of "use re qw(debug); /....regexp..../"?  I'd
like to see what the structure looks like from the R::A regexp -- 500
times is a stunning speedup.

The uncompressed output is 1.7Mb, so I've bzipped it. You can pick it up here:

  http://www.landgren.net/ re.debug.txt.bz2

(remove the space in the middle: I've done it that way to stop robots from stumbling over it via a list archive and wasting bandwidth).

What may help you follow what's going on more easily is looking at

  http://www.landgren.net/ re.indented.txt.bz2

(again, lose the space) which is the result of the assembly, indented. You can put it directly into a m/.../x and it will work; it's legal Perl.

The speedup comes from the fact that once you hit the engine, for all the legit hostnames like 'mail.example.com', only the first few characters will be examined and then you hit a fail state. There is no backtracking left on the stack, so the entire match fails at that point, and 99% of the regop tree is never consulted. Whereas the loop-over-a-list approach will try the pattern, and the next, and the next...

To put it another way, it's not so much that it matches quickly, but rather it fails quickly. Of course, patterns that contain .*? sequences will require backtracking, but that's just the nature of the beast.

David
--
"It's overkill of course, but you can never have too much overkill."

Reply via email to