Re: List of "banned" words/bounce to sender

Henrik K Tue, 10 Aug 2010 03:36:27 -0700

On Tue, Aug 10, 2010 at 10:47:15AM +0100, Martin Gregorie wrote:
> On Tue, 2010-08-10 at 11:19 +0300, Henrik K wrote:
> > Runtime for different methods (memory used including Perl itself):
> > 
> > - Single 70000 name regex, 20s (8MB)
> > - 7 regexes of 10000 names each, 141s (9MB)
> > - "Martin style", lookups from Perl hash, 8s (12MB)
> > 
> Very interesting indeed. Thanks for trying it. I'm not surprised that
> the set of 7 regexes took longer than the one big one, but I am
> surprised that the time difference is so close to the factor of 7.


I guess the seven regexes contain lots of similar strings, so it's lots of
duplicate work compared to a single trie.

Credits to Perl 5.10 enhancements:

http://www.regex-engineer.org/slides/img38.html
http://taint.org/2006/07/07/184022a.html

I don't know if Python implements such an optimization.
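
To make the comparison concrete, here is a minimal sketch of the two regex variants: one combined alternation versus the same names split across several smaller regexes. The helper names (build_regex, chunk, matches_any) and the sample names are made up for illustration; the real test used a 70000-name list. Perl 5.10 and later can compile an alternation of literal strings into a trie, which is why the single big regex avoids repeating work.

```perl
use strict;
use warnings;

# Build one case-insensitive regex matching any of the given literal names.
sub build_regex {
    my @names = @_;
    my $alternation = join '|', map { quotemeta($_) } @names;
    return qr/\b(?:$alternation)\b/i;
}

# Split a list into chunks of at most $size items (returned as array refs).
sub chunk {
    my ($size, @items) = @_;
    my @chunks;
    push @chunks, [ splice @items, 0, $size ] while @items;
    return @chunks;
}

# True if any regex in the list matches the text.
sub matches_any {
    my ($text, @regexes) = @_;
    for my $re (@regexes) {
        return 1 if $text =~ $re;
    }
    return 0;
}

# Hypothetical sample data, standing in for the real name list.
my @names   = ('john smith', 'jane jones', 'mary brown', 'alan white');
my $single  = build_regex(@names);                        # one big regex
my @several = map { build_regex(@$_) } chunk(2, @names);  # two smaller ones
```

Both variants give the same answer; the difference is only in how many separate passes the regex engine makes over the text.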

> Out of interest, did you leave the headers in your test messages? I did
> initially when I developed the generic name matches, but then removed
> them because most of the hits were in headers while the real-life
> scan-and-compare rule would only be applied to the body. 

Just the body, as printed from get_rendered_body_text_array().

For the record, matching wasn't as simple as one could think..

A normal "while (/foo bar/g)" loop won't work, since:
=> word1 word2 word3 word4
.. would result in only two matches: "word1 word2" "word3 word4", but
we need to check "word2 word3" also.
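
The skipped pair is easy to reproduce. The sketch below is not the approach used in this message (that one relies on a code block plus forced backtracking); it only demonstrates the problem and, for comparison, a lookahead variant that leaves the second word unconsumed. The sub names are made up for illustration.

```perl
use strict;
use warnings;

# Plain global match: each match consumes both words, so pairs never overlap.
sub pairs_plain {
    my ($text) = @_;
    my @pairs;
    while ($text =~ /\b(\w+) (\w+)\b/g) {
        push @pairs, "$1 $2";
    }
    return @pairs;
}

# Lookahead variant: the second word is captured but not consumed,
# so the next match can start on it.
sub pairs_overlapping {
    my ($text) = @_;
    my @pairs;
    while ($text =~ /\b(\w+) (?=(\w+)\b)/g) {
        push @pairs, "$1 $2";
    }
    return @pairs;
}
```

Called on "word1 word2 word3 word4", pairs_plain returns two pairs and pairs_overlapping returns three, including "word2 word3".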

Big help was page 20+:
<http://web.archive.org/web/20050515221554/http://birmingham.pm.org/talks/YAPC-Europe-2003-Gems.pdf>

Basically you need to do something like:

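# %names is keyed on lowercased "word1,word2"; $found is a package global
$pat = qr/\b(([a-z][-a-z]{2,15}[a-z]),? ([a-z][-a-z]{2,15}[a-z]))\b/i;
# the code block runs for every candidate pair and tests both word orders
$check = qr/(?{ $found = $1 if defined $names{lc "$2,$3"}
                          || defined $names{lc "$3,$2"} })/;
while (<>) {
        $found = undef;
        # (?!) always fails, forcing backtracking through all pairs
        /$pat$check(?!)/;
        print "$found\n" if defined $found;
}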
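
For anyone who wants to try it without a name list or input file, here is a self-contained sketch of the same snippet. The %names contents and the find_name wrapper are made up for illustration, and the hash is assumed to be keyed on lowercased "word1,word2" as the lookups above imply.

```perl
use strict;
use warnings;

# Hypothetical sample data: keys are lowercased "word1,word2".
our %names = map { $_ => 1 } ('smith,john', 'jones,jane');
our $found;

my $pat   = qr/\b(([a-z][-a-z]{2,15}[a-z]),? ([a-z][-a-z]{2,15}[a-z]))\b/i;
my $check = qr/(?{ $found = $1 if defined $names{lc "$2,$3"}
                            || defined $names{lc "$3,$2"} })/;

# Return the last word pair in $line that is present in %names
# (in either order), or undef if there is none.
sub find_name {
    my ($line) = @_;
    $found = undef;
    # (?!) never matches, so the engine backtracks through every
    # candidate pair and the code block sees each of them.
    $line =~ /$pat$check(?!)/;
    return $found;
}
```

Note that each word must be at least four characters long to satisfy the pattern, so shorter names never reach the hash lookup.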

Hope this helps someone ;)

> One thing this experiment makes clear is that a rule containing a lot of
> alternates, such as one scanning the body for misspelt words, will
> perform better if it contains one long regex rather than a set of
> shorter regexes plus an OR meta to combine them - the latter is easier
> to maintain but slower running. 
>
>
> In the past I used the second form but now I always use a single long
> regex that is built from a rule definition file with my 'portmanteau'
> script - its rule definition file is easy to maintain because it holds
> each alternate pattern on a separate line.

Yep, though I guess most rules are so simple that they don't incur much of a
penalty. With sa-compile the difference should be negligible, and it's easy to
see exactly which rule hit (of course you can also find the string with
debugging).
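
As a rough idea of what such a rule builder could look like, here is a minimal sketch. It is not Martin's actual 'portmanteau' script, whose internals aren't shown in this thread; the build_rule name, the comment syntax and the sample rule name are all assumptions. It takes one alternate per line, skips blanks and '#' comments, and emits a single SpamAssassin body rule.

```perl
use strict;
use warnings;

# Turn definition lines (one alternate per line) into one body rule.
# Blank lines and lines starting with '#' are skipped (assumed convention).
sub build_rule {
    my ($rule_name, @lines) = @_;
    my @alternates;
    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;                 # trim surrounding whitespace
        next if $line eq '' || $line =~ /^#/;
        push @alternates, $line;                 # kept as-is: a regex fragment
    }
    return "body $rule_name /\\b(?:" . join('|', @alternates) . ")\\b/i";
}
```

For example, build_rule('MISSPELT', 'recieve', '# comment', 'seperate') yields one body rule whose regex holds both alternates.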
