On Tue, Aug 10, 2010 at 10:47:15AM +0100, Martin Gregorie wrote:
> On Tue, 2010-08-10 at 11:19 +0300, Henrik K wrote:
> > Runtime for different methods (memory used including Perl itself):
> >
> > - Single 70000 name regex, 20s (8MB)
> > - 7 regexes of 10000 names each, 141s (9MB)
> > - "Martin style", lookups from Perl hash, 8s (12MB)
> >
> Very interesting indeed. Thanks for trying it. I'm not surprised that
> the set of 7 regexes took longer than the one big one, but I am
> surprised that the time difference is so close to the factor of 7.

I guess the seven regexes contain lots of similar strings, so it's lots of
duplicate work compared to a single trie. Credit goes to the Perl 5.10
enhancements:

http://www.regex-engineer.org/slides/img38.html
http://taint.org/2006/07/07/184022a.html

I don't know if Python implements anything similar.

> Out of interest, did you leave the headers in your test messages? I did
> initially when I developed the generic name matches, but then removed
> them because most of the hits were in headers while the real-life
> scan-and-compare rule would only be applied to the body.

Just the body, as in print get_rendered_body_text_array().

For the record, matching wasn't as simple as one might think. A normal
"while (/foo bar/g)" won't work, since

=> word1 word2 word3 word4 ..

would result in only two matches, "word1 word2" and "word3 word4", but we
need to check "word2 word3" also. A big help was page 20+ of:

<http://web.archive.org/web/20050515221554/http://birmingham.pm.org/talks/YAPC-Europe-2003-Gems.pdf>

Basically you need to do something like:

  # Two adjacent words of 4-17 characters, optionally comma-separated
  $pat = qr/\b(([a-z][-a-z]{2,15}[a-z]),? ([a-z][-a-z]{2,15}[a-z]))\b/i;
  # Code block: remember the pair if it is in %names, in either order
  $check = qr/(?{ $found = $1 if defined $names{lc "$2,$3"} || defined $names{lc "$3,$2"} })/;
  while (<>) {
      $found = undef;
      # (?!) always fails, so the engine backtracks and retries from
      # every start position, running $check for each candidate pair
      /$pat$check(?!)/;
      print "$found\n" if defined $found;
  }

Hope this helps someone ;)

> One thing this experiment makes clear is that a rule containing a lot of
> alternates, such as one scanning the body for misspelt words, will
> perform better if it contains one long regex rather than a set of
> shorter regexes plus an OR meta to combine them - the latter is easier
> to maintain but slower running.
>
>
> In the past I used the second form but now I always use a single long
> regex that is built from a rule definition file with my 'portmanteau'
> script - its rule definition file is easy to maintain because it holds
> each alternate pattern on a separate line.

Yep, though I guess most rules are so simple that they don't create much
penalty.
Using sa-compile, the difference should be negligible, and it's easy to see
exactly which rule hit (of course you can also find the string with
debugging).
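
PS. For anyone who wants to play with the overlapping word-pair problem without the (?{ }) code block, here is a rough sketch of one alternative. It is not the code I benchmarked above. It puts the second word inside a lookahead, so a plain /g loop does not consume it and the next iteration can start there. The %names contents and the find_names helper are made up for illustration only.

```perl
use strict;
use warnings;

# Hypothetical name table, keyed "word1,word2" in lowercase
# (illustration only, not the real 70000-name list).
my %names = ( 'john,smith' => 1, 'mary,jones' => 1 );

# Return every adjacent word pair that is in %names, in either order.
sub find_names {
    my ($text) = @_;
    my @found;
    # The second word sits inside a lookahead, so it is captured but not
    # consumed: the next /g iteration can use it as the first word.
    while ($text =~ /\b([a-z][-a-z]{2,15}[a-z])(?=,? ([a-z][-a-z]{2,15}[a-z])\b)/gi) {
        my ($w1, $w2) = (lc $1, lc $2);
        push @found, "$w1 $w2" if $names{"$w1,$w2"} || $names{"$w2,$w1"};
    }
    return @found;
}

print "$_\n" for find_names("Dear Smith John, regards from Mary Jones today");
```

With the sample line this should print "smith john" and "mary jones". Unlike the loop above, it reports every hit on a line rather than a single one, and I have not timed it against the other methods.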