On Mon, Aug 09, 2010 at 07:28:42AM -0500, Daniel McDonald wrote:
>
> This technique might cut down the number of rules by 93.5%, but then you
> have to do database lookups and some fancy parsing to verify the hit.
> Don't know if that would be worth it.

Nope, people constantly underestimate the power of regexes. Of course you
can easily make bad ones, but Perl can run huge lists of simple
alternations FAST.

I downloaded a pack of 10000 random names and made a quick hack to
regexify it with my favourite, Regexp::Assemble.

------------------------------
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
my ($cnt, $idx) = (0, 0);

while (<STDIN>) {
    chomp;
    # Read comma separated names from stdin: Firstname,Lastname
    my ($firstname, $lastname) = split(',', lc);

    # Firstname Lastname
    $ra->add("$firstname $lastname");
    # Lastname,? Firstname
    $ra->add("$lastname,? $firstname");

    # Print rule every 10000 names
    # (?:^| ) instead of \b since "Kate" would hit "Mary-Kate"
    if (++$cnt % 10000 == 0 || eof STDIN) {
        print 'body TEST_NAMES_' . ++$idx;
        print ' /(?:^| )' . $ra->as_string . '(?:$| )/i' . "\n";
    }
}
------------------------------

./names.pl < names.csv > names.cf

The resulting single 170000 byte rule did not affect SA in any way; there
was virtually no difference in my mass-check tests. Running the regex
through a file manually gives 80000 lines/second, on one 3 GHz core.
I think you could make rules/REs megabytes in size, but you would
probably gain nothing.

About ClamAV...

+ It would probably handle this even faster
+ Easy logging of the exact signature that got hit (single name per sig)
- It would also match any header like To:, From:, etc. (PRETTY BAD...)

I'd choose SA since it's way more flexible. I doubt performance is a
factor here, especially with outgoing mail.
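
On the (?:^| ) versus \b choice in the script above, here is a minimal sketch of the difference using only core Perl. The name "kate smith" and the sample strings are made up for illustration; the real rule would hold thousands of alternatives in place of the single name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical single name standing in for the assembled alternation.
my $name = 'kate smith';

# Word-boundary anchoring: "-" is a non-word character, so \b matches
# between "-" and "k" and the name is found inside "Mary-Kate Smith".
my $with_b = qr/\b${name}\b/i;

# Space-or-edge anchoring, built from the same pieces the script prints.
my $pat        = '(?:^| )' . $name . '(?:$| )';
my $with_space = qr/$pat/i;

# Return 1 if $text matches $re, else 0.
sub hits {
    my ($re, $text) = @_;
    return $text =~ $re ? 1 : 0;
}

for my $text ('Kate Smith called', 'call Kate Smith',
              'Mary-Kate Smith called', 'Kate Smith, please') {
    printf "%-24s \\b=%d  space=%d\n", $text,
        hits($with_b, $text), hits($with_space, $text);
}
```

The last sample shows the trade-off: requiring a space or the end of the string after the name also rejects a name followed directly by punctuation, so whether that is acceptable depends on the mail being scanned.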