On 03.03.09 08:32, Marc Perkel wrote:
> SpamAssassin works by adding up points. Rule A is 2 points, Rule B is 2
> points therefore the score is 4 points. But is this the best way to
> score? I don't think so.
>
> What I'm seeing in the real world is that it's the combinations of rules
> in ways where Rule A + Rule B = 10 points rather than 4. Or maybe Rule A
> + Rule B = -2 points because even though both are spam indicators
> individually, together they are a ham indicator.

Yes, and that's what meta rules are for.

> As an example. The following by themselves are weak indicators of spam.
>
> Dynamic IP
> Bad HELO
> Hitting high-numbered MX records
> Not closing with QUIT
>
> By themselves each would produce a LOT of false positives. But together
> it's 100% definite it's a spam bot and not only can the message be
> rejected, but the IP can be blacklisted. Another example. You do an RBL
> lookup and the IP is listed in:
>
> RBL-A 0.5
> RBL-B 0.5
> RBL-C 0.5
>
> Score = 1.5 - NO - Score = 5! Usually multiple RBLs is a stronger
> indicator than the sum of the scores. But suppose you find the IP listed
> in the Hostkarma yellow list (yellow means mixed source of spam such as
> yahoo, gmail, and hotmail) then the RBLs don't matter. In the above
> example, say the 3 RBLs are US based and the spam is coming from
> yahoo.fr. Most everything coming from yahoo France to American users is
> spam and might get listed on low quality RBLs. But my point is that you
> wouldn't want to assign a negative score for the yellow listing because
> yellow doesn't mean it's not spam, it means it shouldn't be blacklisted.
> The better logic is - if not yellow then add up the black scores. (A + B
> + C) * !yellow. Better to look up yellow first and then skip the RBLs if
> found.
>
> The important point here is that SA needs to evolve beyond the concept
> of using addition to compute scores. Ideally there should be more hard
> coded rule combinations or using Bayesian statistics to find rule
> combinations where the combinations are a more accurate indication than
> the rules themselves.
>
> Anyhow - just throwing this out there for people to chew on and think about.

I have already been thinking about the possibility of combining every two
rules and running a masscheck over them. Then, optionally, repeat that once
more, skipping duplicates. Finally, gather all rules that scored at least
0.5 or at most -0.5; we could end up with an interesting ruleset here.
But that's going to be a HUGE ruleset.
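
To give a feel for the size, here is a minimal Python sketch of the pairing
step only (no masscheck, no scoring). The rule names are placeholders based
on the example quoted above, not rules shipped with SpamAssassin, and the
PAIR_ naming scheme is made up for illustration. The generated lines use the
usual "meta NAME (A && B)" form.

```python
from itertools import combinations


def pair_meta_rules(rule_names):
    """Return one SpamAssassin-style 'meta' line per unordered pair of rules."""
    lines = []
    # sorted(set(...)) drops duplicate names and gives a stable order
    for a, b in combinations(sorted(set(rule_names)), 2):
        # Hypothetical naming scheme: PAIR_<first>_<second>
        lines.append(f"meta PAIR_{a}_{b} ({a} && {b})")
    return lines


# Placeholder rule names (assumptions, not real SpamAssassin rules)
rules = ["DYNAMIC_IP", "BAD_HELO", "HIGH_MX", "NO_QUIT"]
metas = pair_meta_rules(rules)

print(len(metas))  # number of pairs is n * (n - 1) / 2
for line in metas:
    print(line)
```

Four rules give 6 pairs. With 1000 rules the first round alone would be
499500 meta rules, before any second round of combining.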
-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
There's a long-standing bug relating to the x86 architecture that
allows you to install Windows.   -- Matthew D. Fuller