John,

Thanks for your prompt response!
A lot of the rules are big jumbles that we are generating in real time and adding to as things come in. Like I said in my original question, we have them separated into separate cf files by category, and within those cf files they are separated by score. So we have absolutely gargantuan rules for (for instance) sex words that we automatically assign a 5 to. There are also lists of specific words and phrases that we see in real-time spam (like the *$#ing garden hose spam). We are just tacking new rules on to the end to make them easier to read.

The (this|that|theother) rules work properly: they hit if any one of the words matches. Should we maybe have separate rules for all the phrases, since they're longer strings? There are rules in there that look like:

RULE Subject =~ /(you.have.(new|waiting|blah|blah).*(ecard|message|calendar.invite|blah|blah))|((garden|new|stretchy|bendy|whatever).*(hose|vacuum|other.thing)) . . . . . . etc.

It goes on. My syntax is terrible and obviously those aren't the actual rules, but the point is that it's a bunch of "or" for some really long strings.

Should I separate them out and have those long (this|that|theother) rules be only for single words? Alternatively, should I separate out the rules with embedded pipes in them (like in the example above)?

-----Original Message-----
From: John Hardin [mailto:jhar...@impsec.org]
Sent: Wednesday, April 24, 2013 12:58 PM
To: users@spamassassin.apache.org
Subject: Re: More longer rules or fewer shorter ones?

On Wed, 24 Apr 2013, Andrew Talbot wrote:

> Hey, all -
>
> I have my customized deployment split up into a bunch of separate CF
> files (by category) and I have those further split up into rules based on score.
>
> So, I have a bunch of stuff like:
>
> header RULE_1 Subject =~ /\b(this|that|theother|blah|blah)/i
> score RULE_1 1
> describe RULE_1 Rule 1
>
> header RULE_2 Subject =~ /\b(foo|bar|etc)/i
> score RULE_2 2
> describe RULE_2 Rule 2
>
> They are WAY longer than that (and some of them include further
> nesting of the pipe), but that's the general idea.
>
> My question is: is it better performance-wise to have the rules set up
> like this, or to have each separate thing have its own separate rule?

For performance, with simple lists of variant values having no repetition across the list e.g. (x|y|z){n,m}, if the most-likely variants are listed first a "big" rule will (generally-speaking) process less than a set of individual rules for each variant. The big rule will stop trying as soon as a match for one variant is found, whereas all of the individual rules must be tried regardless of what other rules may have hit. RULE_1 won't try matching "that", "theother", "blah", etc. if "this" matches.

Ignoring performance, the alternatives are *not* syntactically equivalent. Absent "tflags multiple", RULE_1 would hit only once on a subject containing both "this" and "that" and "theother", but if you split it up into separate rules *each* would hit. This likely would affect scoring.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Vista "security improvements" consist of attempting to shift blame
  onto the user when things go wrong.
-----------------------------------------------------------------------
 328 days since the first successful private support mission to ISS (SpaceX)
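
To make the scoring point in the reply above concrete, here is a rough Python sketch. It is not SpamAssassin and does not use its rule engine; it only mimics the hit-counting behaviour described (one combined rule hits at most once, split rules each hit on their own). The word list, the function names and the per-hit score of 1 are all made up for illustration.

```python
import re

# Hypothetical stand-ins for the variants in the RULE_1 example.
WORDS = ["this", "that", "theother"]
SCORE_PER_HIT = 1  # assumed score, same for every rule in this sketch


def combined_hits(subject):
    """One big alternation rule: counts as a single hit at most,
    no matter how many of the variants appear in the subject."""
    pattern = re.compile(r"\b(" + "|".join(WORDS) + ")", re.IGNORECASE)
    return 1 if pattern.search(subject) else 0


def split_hits(subject):
    """One rule per variant: every variant that matches is its own hit."""
    return sum(
        1
        for word in WORDS
        if re.search(r"\b" + re.escape(word), subject, re.IGNORECASE)
    )


if __name__ == "__main__":
    subject = "this and that and theother"
    print("combined rule score:", combined_hits(subject) * SCORE_PER_HIT)
    print("split rules score:  ", split_hits(subject) * SCORE_PER_HIT)
```

With a subject containing all three words, the combined version contributes one hit while the split version contributes three, so splitting a big rule without lowering the per-rule scores would raise the total for the same message.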