On Wed, 2013-04-24 at 12:32 -0400, Andrew Talbot wrote:
> I have my customized deployment split up into a bunch of separate CF
> files (by category) and I have those further split up into rules based
> on score.
> 
I also use very long rules, mainly due to spamiferous mailing lists,
because all the headers are pretty much the same (apart from sender
names), so about all you're left with for spam recognition is the body
content. 

I found a problem with very long rules, where for me 'very long' means
"rules longer than the width of my editor's screen". I refer to these as
'portmanteau rules' (private slang). As I hate editing anything that's
longer than my editor's text line and find it particularly annoying to
deal with such a line containing a regex consisting of a lot of
alternates, I wrote a portmanteau rule generator to make their
maintenance a bit easier. It is a gawk script that assembles an
arbitrarily long rule from a file containing rule fragments (regexes,
etc) that are each placed on a separate line. Since it sounds as though you
may have a similar problem, you may also find it useful. You can find it
and its documentation here:
http://www.libelle-systems.com/free/portmanteau/portmanteau.tgz
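
As a rough illustration of the idea only (this is not the gawk script
above, whose input format and options may well differ), a generator of
this kind could be sketched in Python as follows. The function name, the
'#' comment convention and the single alternation group are all
assumptions made for the example:

```python
def build_portmanteau(name, fragment_lines, score=0.001):
    """Join one-regex-per-line fragments into one SpamAssassin body rule.

    Hypothetical helper: fragments are assumed to be valid regexes that
    contain no unescaped '/' characters.
    """
    fragments = []
    for line in fragment_lines:
        text = line.strip()
        # Skip blank lines and '#' comment lines (an assumed convention).
        if not text or text.startswith("#"):
            continue
        fragments.append(text)
    if not fragments:
        raise ValueError("no rule fragments supplied")
    # Combine the fragments as alternates inside a non-capturing group.
    regex = "(?:" + "|".join(fragments) + ")"
    return [
        f"body {name} /{regex}/i",
        f"score {name} {score}",
    ]


# Example with made-up fragments; in practice they would be read from a file.
rule = build_portmanteau(
    "SALES_PHRASES",
    ["# selling phrases", "buy now", "", r"limited\s+offer"],
)
print("\n".join(rule))
```

Keeping one fragment per line means the file stays easy to edit and diff,
while the generated rule can be as long as it needs to be.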

I find it particularly helpful to make the portmanteau rules fairly low
scoring and to combine them into higher scoring meta-rules, e.g. if I'm
trapping sales spiel I'll have a portmanteau rule listing selling
phrases, one containing monetary terms and another containing product
terms and names, all scored at 0.001. I'll also have a meta-rule that
ANDs these three rules together and scores around 5. This approach is
much better at distinguishing spam from ham than a series of higher
scoring non-meta rules and has the additional benefit of recognising
sales-related text from previously unseen combinations of elements in
the three rules.
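
The combination described above might be generated along these lines.
This is a minimal sketch: the rule names are invented for the example and
the score of 5.0 simply mirrors the "around 5" mentioned above.

```python
def build_meta(name, component_rules, score=5.0):
    """Emit a SpamAssassin meta rule that ANDs the named rules together."""
    if not component_rules:
        raise ValueError("a meta rule needs at least one component rule")
    # The meta rule fires only when every component rule has matched.
    expression = " && ".join(component_rules)
    return [
        f"meta {name} ({expression})",
        f"score {name} {score}",
    ]


# Hypothetical names for the three low-scoring portmanteau rules.
components = ["SALES_PHRASES", "MONEY_TERMS", "PRODUCT_TERMS"]
print("\n".join(build_meta("SALES_SPIEL", components)))
```

Because each component rule scores only 0.001, a message matching just
one or two of them is barely affected; the meaningful score comes from
the meta rule alone.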
 
BTW, I don't use Bayes: my mail volume is small, I have difficulty
collecting decent training corpora, and I find my current setup easier
to manage.


> They are WAY longer than that (and some of them include further
> nesting of the pipe), but that's the general idea.
 
> My question is: is it better performance-wise to have the rules set up
> like this, or to have each separate thing have its own separate rule?
> 
What JH said. When I was thinking of setting up this approach I asked
about performance and limits on the size of the generated rules and was
told that I shouldn't worry about rule size until they exceeded a
megabyte or two. Currently my longest rule is just over 9KB, with the
averages being just under 1KB and 51 alternates per rule.

Martin

 

