Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

Matus UHLAR - fantomas Tue, 03 Mar 2009 08:42:44 -0800

On 03.03.09 08:32, Marc Perkel wrote:
> Spamassassin works by adding up points. Rule A is 2 points, Rule B is 2 
> points therefore the score is 4 points. But is this the best way to 
> score? I don't think so.
> 
> What I'm seeing in the real world is that it's the combinations of rules 
> in ways where Rule A + Rule B = 10 points rather than 4. Or maybe Rule A 
> + Rule B = -2 points because even though both are spam indicators 
> individually, together they are a ham indicator.


Yes, and that's what meta rules are for.
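
To make the idea concrete, here is a minimal sketch in Python of additive
scoring with a meta rule layered on top. It is not SpamAssassin code: the
rule names, the scores and the total_score() helper are all hypothetical,
chosen only so that Rule A + Rule B comes out as 10 instead of 4.

```python
# Hypothetical rule names and scores, for illustration only.
BASE_SCORES = {"RULE_A": 2.0, "RULE_B": 2.0}

# Each meta rule is (name, predicate over the set of hit rules, score).
# A negative score here would model "together they indicate ham".
META_RULES = [
    ("META_A_AND_B", lambda hits: "RULE_A" in hits and "RULE_B" in hits, 6.0),
]


def total_score(hits):
    """Sum the base scores of all hit rules, then add matching meta rules."""
    hits = set(hits)
    score = sum(BASE_SCORES.get(name, 0.0) for name in hits)
    for _name, predicate, meta_score in META_RULES:
        if predicate(hits):
            score += meta_score
    return score


print(total_score(["RULE_A"]))            # 2.0
print(total_score(["RULE_A", "RULE_B"]))  # 10.0
```

The scoring stays additive; the meta rule is simply one more rule whose
condition is a combination of other rules.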

> As an example, the following by themselves are weak indicators of spam:
> 
> Dynamic IP
> Bad HELO
> Hitting high-numbered MX records
> Not closing with QUIT
> 
> By themselves each would produce a LOT of false positives. But together 
> it's 100% definite it's a spam bot and not only can the message be 
> rejected, but the IP can be blacklisted. Another example. You do an RBL 
> lookup and the IP is listed in:
> 
> RBL-A 0.5
> RBL-B 0.5
> RBL-C 0.5
> 
> Score = 1.5 - NO - Score = 5! Usually multiple RBLs is a stronger 
> indicator than the sum of the scores. But suppose you find the IP listed 
> in the Hostkarma yellow list (yellow means mixed source of spam such as 
> yahoo, gmail, and hotmail) then the RBLs don't matter. In the above 
> example, say the 3 RBLs are US based and the spam is coming from 
> yahoo.fr. Most everything coming from yahoo France to American users is 
> spam and might get listed on low quality RBLs. But my point is that you 
> wouldn't want to assign a negative score for the yellow listing because 
> yellow doesn't mean it's not spam, it means it shouldn't be blacklisted. 
> The better logic is - if not yellow then add up the black scores. (A + B 
> + C) * !yellow. Better to look up yellow first and then skip the RBLs if 
> found.
> 
> The important point here is that SA needs to evolve beyond the concept 
> of using addition to compute scores. Ideally there should be more
> hard-coded rule combinations, or Bayesian statistics should be used to
> find rule combinations that are a more accurate indication than the
> individual rules themselves.
> 
> Anyhow - just throwing this out there for people to chew on and think about.
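
The (A + B + C) * !yellow logic quoted above could be sketched roughly as
follows. This is only an illustration in Python: rbl_score(), the lookup
callables and the list names are assumptions made up for the example, not
any real SpamAssassin or Hostkarma API.

```python
def rbl_score(ip, is_yellow, rbl_lookups):
    """Return the combined RBL score for ip, or 0.0 if it is yellow-listed.

    rbl_lookups maps a (hypothetical) RBL name to (lookup_function, score).
    The yellow check runs first, so no RBL lookup is made for yellow IPs.
    """
    if is_yellow(ip):
        return 0.0  # yellow adds nothing and subtracts nothing
    return sum(score for lookup, score in rbl_lookups.values() if lookup(ip))


# Stand-in lookups: every list reports the IP as listed.
rbls = {
    "RBL-A": (lambda ip: True, 0.5),
    "RBL-B": (lambda ip: True, 0.5),
    "RBL-C": (lambda ip: True, 0.5),
}

print(rbl_score("192.0.2.1", lambda ip: False, rbls))  # 1.5
print(rbl_score("192.0.2.1", lambda ip: True, rbls))   # 0.0
```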

I have already been thinking about the possibility of combining every pair
of rules and running a mass-check over them, then optionally repeating the
process on the results, skipping duplicates. Finally, we would gather all
combined rules that scored >= 0.5 or <= -0.5. We could end up with an
interesting ruleset here.

But that's going to be a HUGE ruleset. 
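
A rough Python sketch of the pairing step shows how fast it grows. The rule
names, the META_ naming scheme and both helper functions are hypothetical,
and the mass-check itself is not modeled; the scores passed to
keep_significant() would have to come from an actual mass-check run.

```python
from itertools import combinations
from math import comb


def pair_candidates(rule_names):
    """Yield (meta_name, expression) for every unordered pair of rules."""
    for a, b in combinations(sorted(rule_names), 2):
        yield f"META_{a}__{b}", f"({a} && {b})"


def keep_significant(scored, threshold=0.5):
    """Keep candidates whose score is >= threshold or <= -threshold."""
    return {name: s for name, s in scored.items() if abs(s) >= threshold}


rules = ["DYNAMIC_IP", "BAD_HELO", "HIGH_MX", "NO_QUIT"]
candidates = list(pair_candidates(rules))
print(len(candidates))  # 6 pairs from 4 rules
print(comb(1000, 2))    # 499500 pairs from 1000 rules
```

With n rules the first pass alone yields n*(n-1)/2 candidates, and a second
pass over the survivors multiplies that again.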

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
There's a long-standing bug relating to the x86 architecture that
allows you to install Windows.   -- Matthew D. Fuller
