There are two problems with that. One is that a lot of people use more than
just spam/ham distinctions (some sort based on spam status so that in going
through looking for FPs they can see the most likely candidates first, some
have multiple thresholds of tag/refile/discard, etc). The other is that
negative scores can really mess with that, unless you try to be pretty clever.
You *can* get the right answer (spam/nonspam, at least) in spite of negative
rules with a bit of cleverness. I was thinking about this the other day, so
let me post some pseudocode for people to ignore.
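# initialization
use List::Util qw(sum0);
# %score maps rule name => score; rule_is_hit($msg, $rule) runs one test
my @neg_rules = sort { $score{$a} <=> $score{$b} }
                grep { $score{$_} < 0 } keys %score;    # -100, -99, etc.
my @pos_rules = sort { $score{$b} <=> $score{$a} }
                grep { $score{$_} > 0 } keys %score;    # 100, 99, etc.
# if it matched all possible negative (or positive) rules, what would it get?
my $total_negative = sum0 map { $score{$_} } @neg_rules;
my $total_positive = sum0 map { $score{$_} } @pos_rules;
my $threshold = 5;    # or appropriate

sub classify {
    my ($msg) = @_;
    my @neg = @neg_rules;    # per-message copies, so shift doesn't
    my @pos = @pos_rules;    # eat the master lists
    my $score    = 0;
    my $pos_left = $total_positive;
    my $neg_left = $total_negative;
    while (1) {
        # can't reach the threshold even if every remaining positive rule hits
        return "it's ham"  if $score + $pos_left < $threshold;
        # can't drop below it even if every remaining negative rule hits
        return "it's spam" if $score + $neg_left >= $threshold;
        my $rule;
        if ($score < $threshold) {    # currently ham: try to push it up
            $rule = shift @pos;
            $pos_left = @pos ? $pos_left - $score{$rule} : 0;
        } else {                      # currently spam: try to pull it down
            $rule = shift @neg;
            $neg_left = @neg ? $neg_left - $score{$rule} : 0;
        }
        $score += $score{$rule} if rule_is_hit($msg, $rule);
    }
}
print "$_: ", classify($_), "\n" foreach @messages;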
In English, we bounce back and forth, always trying to push the score to the
other side of the threshold, and when there aren't enough points left for that
to succeed even if everything else matched, we declare the side it's currently
on the winner. Actually, sorting the rules by decreasing absolute value doesn't
affect the answer, but it seems more likely to get a good approximation faster
(leaving the rules with small scores as tie breakers for messages that stay
close to the threshold). You could also sort them by something like "overall
hit% * score" so that those most likely to affect the score are tested first,
or you could delay slower, more complicated tests.
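To make the two orderings concrete, here is a rough sketch in Python (just for brevity). The rule names, scores, and hit rates are all invented for illustration, not real SpamAssassin rules or measured numbers, and both helper functions are hypothetical.

```python
# Hypothetical rule table: name -> (score, fraction of mail the rule hits).
# Every name and number here is made up for illustration.
RULES = {
    "BIG_RARE":     (4.0, 0.01),
    "MEDIUM_OFTEN": (1.5, 0.40),
    "SMALL_COMMON": (0.3, 0.90),
    "WHITELISTED":  (-100.0, 0.002),
}

def order_by_abs_score(rules):
    """Largest absolute score first."""
    return sorted(rules, key=lambda name: abs(rules[name][0]), reverse=True)

def order_by_expected_impact(rules):
    """Largest hit_rate * |score| first."""
    return sorted(
        rules,
        key=lambda name: rules[name][1] * abs(rules[name][0]),
        reverse=True,
    )
```

With these made-up numbers the first ordering starts with WHITELISTED and BIG_RARE, while the second starts with MEDIUM_OFTEN and SMALL_COMMON, because the big rules almost never fire. In the pseudocode you would apply whichever ordering you pick to the positive and negative lists separately.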
But all of that assumes you don't care what the final score is, merely whether
it's over or under some particular number.
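A small Python sketch of that trade-off, under some assumptions: the rule names and scores are invented, "spam" means score >= threshold, and a message is represented simply as the set of rule names that would hit it. It recomputes the remaining totals on each pass instead of tracking them incrementally, which is slower but keeps the example short.

```python
# Invented rules and scores, for illustration only.
SCORES = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "W": -3.0}

def full_score(hits, scores):
    """Classic approach: test everything, sum the scores of rules that hit."""
    return sum(s for rule, s in scores.items() if rule in hits)

def early_verdict(hits, scores, threshold=5.0):
    """Return (is_spam, partial_score, rules_tested) using the bounding idea."""
    # Positive rules in decreasing order, negative rules most negative first.
    pos = sorted((r for r in scores if scores[r] > 0), key=scores.get, reverse=True)
    neg = sorted((r for r in scores if scores[r] < 0), key=scores.get)
    score, tested = 0.0, 0
    while True:
        pos_left = sum(scores[r] for r in pos)  # most we could still gain
        neg_left = sum(scores[r] for r in neg)  # most we could still lose
        if score + pos_left < threshold:
            return False, score, tested         # can never reach the threshold
        if score + neg_left >= threshold:
            return True, score, tested          # can never drop below it
        # Test a rule that could push the score to the other side.
        rule = pos.pop(0) if score < threshold else neg.pop(0)
        tested += 1
        if rule in hits:
            score += scores[rule]
```

For a message hitting A, B, C and D, `early_verdict` says spam after testing 3 of the 5 rules with a partial score of 7.0, where `full_score` gives 10.0. For a message hitting A, B and W, it passes through 7.0 on the way but ends up ham at 4.0 after testing all 5 rules, which is exactly the case where simply stopping once you cross the threshold would get it wrong.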
On Wed, 17 Mar 2004, Charles Gregory wrote:
> Hiyo!
>
> I realize that this may run afoul of some other objectives, particularly
> those where people want to know all the tests that matched, and perform
> supplemental checks based on that info, but I have to wonder, could we
> improve the efficiency of SpamAssassin by having it make note of the
> 'HITS-REQUIRED' score and have it STOP TESTING after it surpasses that
> score?
>
> IE. Is there really any reason to keep testing a piece of mail once we
> know it is spam? When our threshold is somewhere between 3 and 10, and I
> see mail scoring 20 or 30, I realize that this mail probably passed that
> threshold less than half-way into the tests.
>
> This could, in theory, lower our processor 'cost' for spam considerably.
> Thoughts? Good idea? Stupid idea?
>
> - Charles
>
--
Adam Lopresto
http://cec.wustl.edu/~adam/
One for all and the rest for me!