John, 

Thanks for your prompt response!

A lot of the rules are big jumbles of patterns that we generate in real time
and add to as things come in. As I said in my original question, we have
them split into separate cf files by category, and within those cf files
they are grouped by score. So we have absolutely gargantuan rules for (for
instance) sex words that we automatically assign a score of 5. There are
also lists of specific words and phrases that we see in real-time spam
(like the *$#ing garden hose spam).

We just tack new entries onto the end to keep the rules easy to read. Our
rules of the form (this|that|theother) work properly: they hit if any one
of the words matches.

Should we maybe have separate rules for all the phrases, since they're
longer strings? There are rules in there that look like:

RULE Subject =~ /you.have.(new|waiting|blah|blah).*(ecard|message|calendar.invite|blah|blah)|(garden|new|stretchy|bendy|whatever).*(hose|vacuum|other.thing) . . .


Etc. It goes on. My syntax is terrible, and obviously those aren't the
actual rules, but the point is that it's a bunch of "or" branches over some
really long strings. Should I separate them out and keep the long
(this|that|theother) rules for single words only?

Alternatively, should I separate out the rules with embedded pipes in them
(like in the example above)?
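
For concreteness, the split being asked about might look roughly like the
sketch below, with one rule per phrase family. The rule names and scores
are hypothetical placeholders, the "blah" and "other.thing" fillers are
left out, and this only shows the shape of the split. It is not a
recommendation either way.

```
# Hypothetical names and placeholder scores, for illustration only.
# Phrase family 1: "you have new/waiting ... ecard/message/calendar invite"
header   LOCAL_SUBJ_YOU_HAVE  Subject =~ /you.have.(?:new|waiting).*(?:ecard|message|calendar.invite)/i
score    LOCAL_SUBJ_YOU_HAVE  1
describe LOCAL_SUBJ_YOU_HAVE  Subject looks like a you-have-a-message lure

# Phrase family 2: "garden/new/stretchy/bendy ... hose/vacuum"
header   LOCAL_SUBJ_HOSE      Subject =~ /(?:garden|new|stretchy|bendy).*(?:hose|vacuum)/i
score    LOCAL_SUBJ_HOSE      1
describe LOCAL_SUBJ_HOSE      Subject looks like a garden hose pitch
```

Note that a subject matching both families would then collect both scores,
where the single combined rule would score only once.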


-----Original Message-----
From: John Hardin [mailto:jhar...@impsec.org] 
Sent: Wednesday, April 24, 2013 12:58 PM
To: users@spamassassin.apache.org
Subject: Re: More longer rules or fewer shorter ones?

On Wed, 24 Apr 2013, Andrew Talbot wrote:

> Hey, all -
>
> I have my customized deployment split up into a bunch of separate CF 
> files (by category) and I have those further split up into rules based on
> score.
>
> So, I have a bunch of stuff like:
>
> header RULE_1 Subject =~ /\b(this|that|theother|blah|blah)/i
> score RULE_1 1
> describe RULE_1 Rule 1
>
> header RULE_2 Subject =~ /\b(foo|bar|etc)/i
> score RULE_2 2
> describe RULE_2 Rule 2
>
> They are WAY longer than that (and some of them include further 
> nesting of the pipe), but that's the general idea.
>
> My question is: is it better performance-wise to have the rules set up 
> like this, or to have each separate thing have its own separate rule?

For performance, with simple lists of variant values and no repetition
applied across the list (e.g. no quantifier as in (x|y|z){n,m}), a "big" rule
with the most-likely variants listed first will, generally speaking, do less
work than a set of individual rules for each variant. The big rule stops
trying as soon as a match for one variant is found, whereas all of the
individual rules must be tried regardless of what other rules may have hit.
RULE_1 won't try matching "that", "theother", "blah", etc. if "this" matches.
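
As a rough illustration of the ordering point, here is a sketch of one
combined rule. The rule name LOCAL_SUBJ_WORDS_1 is hypothetical, and the
assumption that "this" is the most frequent hit is invented for the
example. (?:...) is simply the non-capturing form of the same group.

```
# Hypothetical rule name; the word order assumes "this" is seen most often.
# Alternatives are tried left to right at each position, so the words
# expected to hit most often are listed first.
header   LOCAL_SUBJ_WORDS_1  Subject =~ /\b(?:this|that|theother|blah)/i
score    LOCAL_SUBJ_WORDS_1  1
describe LOCAL_SUBJ_WORDS_1  Subject contains a score-1 word
```

Whether the ordering makes a measurable difference depends on the real word
lists and mail mix, so it is worth timing against a sample corpus.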

Ignoring performance, the alternatives are *not* semantically equivalent. 
Absent "tflags multiple", RULE_1 would hit only once on a subject containing
both "this" and "that" and "theother", but if you split it up into separate
rules *each* would hit. This likely would affect scoring.
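
A small sketch of that scoring difference, using hypothetical rule names
and placeholder scores. The last group shows one possible way (an
assumption, not something from the original rules) to keep the patterns
split while still scoring only once: unscored "__" sub-rules combined by a
meta rule.

```
# Combined: hits once no matter how many of the words appear.
header   LOCAL_ANY_WORD      Subject =~ /\b(?:this|that|theother)/i
score    LOCAL_ANY_WORD      1

# Split: every rule that matches adds its own score.
header   LOCAL_THIS          Subject =~ /\bthis/i
score    LOCAL_THIS          1
header   LOCAL_THAT          Subject =~ /\bthat/i
score    LOCAL_THAT          1

# For a subject of "this and that":
#   LOCAL_ANY_WORD contributes 1 point in total
#   LOCAL_THIS and LOCAL_THAT together contribute 2 points

# Split patterns, single score: names starting with "__" are sub-rules
# that carry no score of their own; the meta rule scores once.
header   __LOCAL_THIS        Subject =~ /\bthis/i
header   __LOCAL_THAT        Subject =~ /\bthat/i
meta     LOCAL_THIS_OR_THAT  (__LOCAL_THIS || __LOCAL_THAT)
score    LOCAL_THIS_OR_THAT  1
```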

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Vista "security improvements" consist of attempting to shift blame
   onto the user when things go wrong.
-----------------------------------------------------------------------
  328 days since the first successful private support mission to ISS (SpaceX)
