Matt's generally nailed it. I would say that it should be easy enough to write a plugin which reorders rule priorities into a desired order, then implements the "have_shortcircuited" plugin hook to return 1 at the desired point... so if anyone feels like trying it out to see if they can make an auto-shortcircuiting plugin which outperforms base SpamAssassin over a mixed corpus of 50:50 nonspam and spam, go for it ;)
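For anyone wanting to experiment first, here is a rough sketch of the control flow in Python rather than as a real Perl plugin. Everything in it is an assumption made for illustration: the rule names, priorities, scores, the 5.0 threshold, and the `have_shortcircuited` function, which only borrows the hook's name and does not reproduce its actual signature. Rules run in ascending priority order, and the short-circuit check fires once each priority group has finished.

```python
from itertools import groupby

# Hypothetical rules: (name, priority, score). Lower priority runs earlier.
RULES = [
    ("WHITELISTED", -100, -10.0),
    ("BAYES_99", -50, 3.5),
    ("BIG_SPAM_SIGN", -50, 4.0),
    ("BODY_CHECK", 0, 1.0),
    ("SLOW_NET_CHECK", 100, 2.0),
]


def have_shortcircuited(score, threshold=5.0):
    """Stand-in for the plugin hook: stop once the score reaches the threshold."""
    return score >= threshold


def scan(rules, matched):
    """Run rules by ascending priority; check for short-circuit after each group.

    `matched` is the set of rule names that would hit on this message.
    Returns the final score and the list of rules that were actually run.
    """
    score = 0.0
    ran = []
    ordered = sorted(rules, key=lambda rule: rule[1])
    for _priority, group in groupby(ordered, key=lambda rule: rule[1]):
        for name, _, points in group:
            ran.append(name)
            if name in matched:
                score += points
        # Every rule at this priority has contributed to the decision.
        if have_shortcircuited(score):
            break
    return score, ran


spam_score, spam_ran = scan(RULES, {"BAYES_99", "BIG_SPAM_SIGN", "SLOW_NET_CHECK"})
ham_score, ham_ran = scan(RULES, {"WHITELISTED", "BAYES_99", "BIG_SPAM_SIGN"})
```

In this toy setup the spam-like message stops after the second priority group, while the whitelisted one runs every rule, so the negative-scoring rule still gets its say before any bail-out.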
--j.

Matt Kettler writes:
> Crocomoth wrote:
> > Matt Kettler-3 wrote:
> >
> >>> 1. Using this method, admin must understand that the fate of every
> >>> message (for all users) will depend on the single rule.
> >>>
> >> Not if you set it up properly. You can have multiple rules run with a
> >> very early priority (low number), then have another one run with a
> >> semi-early priority which does shortcircuiting. All of the "very early"
> >> rules will be involved in the decision to shortcircuit or not.
> >>
> >
> > Yes, but low-numbered rules may not generate any points and the decision may
> > depend on one rule anyway. This does not change anything. And what is
> > more (see (2) with which you have agreed), in the default configuration, this
> > will be bayes, which generates only 3.5 points (not taking into account
> > white/black lists because they will not be set up properly in most cases).
> > And, I think, the number of people not wishing to reorder standard rules will
> > be much greater than the number of "semi-professional" admins.
> >
>
> True, but your automated method based on sorting them on "weight" would
> pretty much grind spamassassin to a screeching halt by increasing the
> average scan time due to forcing multiple passes through the message.
> Not to mention false positive problems if negative-scoring rules end up
> being considered "heavy" and don't get run.
>
> Your idea essentially ruins any benefits of memory caching that
> SpamAssassin currently exploits. Right now, rules are run in groups
> based on what part of the message they need. This lends speed to
> spamassassin by allowing that portion of the message to already be in
> cache for all but the first rule in the group.
>
> If you start jumping around all over the message for different rules,
> the processor memory cache quickly becomes full and pushes out parts
> that you're going to be looking at again. If you keep going
> back-and-forth header, body, header, body, header, body.. you wind up
> going out to RAM quite often, and that's painfully slow. (I don't care
> what high-speed dual-channel DDR2 memory setup you have, it's abysmally
> slow from the processor's perspective, generally 20 times slower than
> cache is.)
>
> Sure, some messages will bail out faster, but most messages will take
> much longer to scan. How is that better?
>
> I don't debate that the basic idea of having SA do this "automagically"
> would be a great thing. However, the reality of doing it efficiently is
> much trickier than you think.
>
> At one point, one idea was to run all the negative scoring rules, and
> then run the positive scoring ones, and bail out if the score went over
> the spam threshold during the positive phase.
>
> The end result of that test was abysmally slow, due to having to scan
> the message in two passes (negative header, negative body, positive
> header, positive body).
>
> > Sort order may be: negative rules, sorted positive common rules. Any
> > user-defined rules should be checked after negative ones and before
> > positives, if any exist. Of course, sorting should be performed once upon
> > load.
>
> Tested, as mentioned above. Resulted in horrible performance due to
> over-sorting.
>
> > Or, such a cut-off may work without any sorting; this is optional. Standard
> > priorities could be enough, if they are set up.
>
> I'd agree there. SA could exploit priorities better in the default
> config, but this kind of thing needs to be done very carefully to avoid
> thrashing the processor cache. Any simple "sort by.." is going to result
> in terrible performance.
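
To put a rough number on the header/body ping-pong Matt describes, here is a toy count in Python. It is not a cache measurement and not how SpamAssassin schedules rules; the rule names, parts, and scores are made up for illustration. It only counts how often consecutive rules need a different part of the message under two orderings.

```python
from itertools import groupby

# Hypothetical rules: (name, message part scanned, score).
RULES = [
    ("HDR_A", "header", 0.5),
    ("BODY_A", "body", 3.0),
    ("HDR_B", "header", 2.5),
    ("BODY_B", "body", 1.0),
    ("HDR_C", "header", 1.5),
    ("BODY_C", "body", 2.0),
]


def part_switches(rules):
    """Count how often consecutive rules need a different message part."""
    parts = [part for _, part, _ in rules]
    runs = [key for key, _ in groupby(parts)]  # collapse runs of the same part
    return len(runs) - 1


# Grouped by message part: one block of body rules, one block of header rules.
grouped = sorted(RULES, key=lambda rule: rule[1])

# Naive "sort by weight": heaviest-scoring rules first, ignoring the part.
by_weight = sorted(RULES, key=lambda rule: rule[2], reverse=True)

print("grouped by part:", part_switches(grouped), "switch(es)")
print("sorted by score:", part_switches(by_weight), "switch(es)")
```

With these made-up scores the weight-sorted order alternates between body and header on every rule, which is the access pattern blamed above for pushing data out of the cache.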