Re: Whitelists, not directly useful to spamassassin...
On 17.12.09 11:27, Jason Bertoch wrote:
> Warren Togami wrote:
>> While whitelists are not directly effective (statistically, when averaged across a large corpus), whitelists are powerful tools in indirect ways including:
>> * Pushing the score beyond the auto-learn threshold for things like Bayes to function without manual intervention.
>
> This does not sound like a positive thing to me. E-mail from any sender that is malformed enough to skip auto-learning should not be forced into Bayes as ham simply because some 3rd party promises, for their own monetary benefit, that the sender is a nice guy. Why should any sender that I have not intentionally added to my local whitelist get a break?

If you _want_ the mail and whitelist the sender, I think its characteristics should be pushed into Bayes. If you don't want the mail, then autolearning it as spam is the least of your problems.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Linux IS user friendly, it's just selective who its friends are...
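To make the auto-learn point concrete, here is a minimal Python sketch of how a negative whitelist score can move a message from "not learned" to "learned as ham". It is a simplification under stated assumptions, not SpamAssassin's actual logic: the two threshold values and all rule names are hypothetical, and the real auto-learner applies further conditions that are ignored here.

```python
# Simplified model of score-based auto-learning.
# ASSUMPTION: both thresholds are illustrative values chosen for this sketch.
HAM_LEARN_BELOW = 0.1      # learn as ham when the total is below this
SPAM_LEARN_AT_LEAST = 12.0  # learn as spam when the total reaches this


def autolearn_decision(rule_scores):
    """Return 'ham', 'spam', or 'skip' for a dict of rule name -> score."""
    total = sum(rule_scores.values())
    if total < HAM_LEARN_BELOW:
        return "ham"
    if total >= SPAM_LEARN_AT_LEAST:
        return "spam"
    return "skip"


# Hypothetical rule hits for a slightly malformed but wanted message.
malformed = {"HYPOTHETICAL_BAD_HEADER": 1.8, "HYPOTHETICAL_HTML_ONLY": 0.7}

# The same message when a (hypothetical) whitelist rule worth -5 also fires.
whitelisted = dict(malformed, HYPOTHETICAL_WHITELIST=-5.0)

print(autolearn_decision(malformed))    # total 2.5: between thresholds
print(autolearn_decision(whitelisted))  # total -2.5: below the ham threshold
```

In this model the whitelist hit is the only difference between the two messages, which is exactly the behaviour being argued over: whether a third-party listing should be enough to feed a message's tokens into Bayes as ham.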
Re: Whitelists, not directly useful to spamassassin...
Warren Togami wrote:
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c49
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c51
>
> It turns out that the ReturnPath and DNSWL whitelists have a statistically insignificant impact on spamassassin's ability to determine ham vs. spam. Meanwhile, both whitelists have high levels of accuracy. How can both of these statements be true?
>
> I suspect this is because the scores are balanced by the rescoring algorithm to be safe in the majority case where no whitelist rule has triggered. Thus whitelists are not needed or relied upon to prevent false positive classification.

I concur; that is what my analysis of HABEAS hits over the last four months showed too.

/Per Jessen, Zürich
Re: Whitelists, not directly useful to spamassassin...
Warren Togami wrote:
> While whitelists are not directly effective (statistically, when averaged across a large corpus), whitelists are powerful tools in indirect ways including:
> * Pushing the score beyond the auto-learn threshold for things like Bayes to function without manual intervention.

This does not sound like a positive thing to me. E-mail from any sender that is malformed enough to skip auto-learning should not be forced into Bayes as ham simply because some 3rd party promises, for their own monetary benefit, that the sender is a nice guy. Why should any sender that I have not intentionally added to my local whitelist get a break?

I've had enough problems with DNSWL, HABEAS, and JMF that they have all been disabled here. Unfortunately, that also means I have no recent data to add to the debate. Although I believe that whitelists should be included in the default install for those who want them, I also believe they should be disabled by default so that an admin must knowingly enable them after reading the manual and considering the consequences.

The argument has also been made that whitelists should be included simply because blacklists are. I think that argument is flawed. Blacklists are part of the spam-fighting community while whitelists are part of the bulk delivery community. Their goals and motives are completely different. First, blacklists will normally have evidence of abuse to support their listing. Whitelists only have policies and promises. Second, the scoring of whitelists is currently favored over blacklists, and will continue to be at the proposed settings for 3.3.0. Why can a whitelist override the score of a blacklist when it is the blacklist that has evidence of abuse?

After reading up on Bug 6247, I found that ReturnPath included interesting stats on their lists:

            Active  Suspended  Total
Certified     4407       1300   5707
Safe          6561        283   6844

The Certified list is supposedly difficult to get on, so I'm not sure how to interpret these results. Is 1/5 of the list suspended because of due diligence on the part of ReturnPath? If so, how did they get certified in the first place?

If whitelists are to be enabled by default, I believe their score should be moved considerably more toward zero.

/Jason
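For reference, the suspended share of each list can be worked out directly from the ReturnPath figures quoted above. This is just a small Python sketch of that arithmetic; the function and variable names are made up for illustration.

```python
# Counts as quoted in the message above (from the ReturnPath stats in Bug 6247).
lists = {
    "Certified": {"active": 4407, "suspended": 1300},
    "Safe": {"active": 6561, "suspended": 283},
}


def suspended_share(active, suspended):
    """Fraction of all listed senders (active + suspended) that are suspended."""
    return suspended / (active + suspended)


for name, counts in lists.items():
    print(f"{name}: {suspended_share(**counts):.1%} suspended")
# Certified: 22.8% suspended
# Safe: 4.1% suspended
```

So the "1/5" figure for the Certified list is, if anything, slightly low, while the Safe list has a much smaller suspended share.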
Re: Whitelists, not directly useful to spamassassin...
Thank you, Warren. That (finally) gives some real perspective to this mess, and gets some of the 'real' questions answered.

- C

On Wed, 16 Dec 2009, Warren Togami wrote:
> I made a discovery today that surprised even myself. Using the rescore masscheck and weekly masscheck logs while working on Bug #6247 I found some interesting details that throw a wrench into this lively debate.
>
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c49
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c51
>
> It turns out that the ReturnPath and DNSWL whitelists have a statistically insignificant impact on spamassassin's ability to determine ham vs. spam. Meanwhile, both whitelists have high levels of accuracy. How can both of these statements be true?
>
> I suspect this is because the scores are balanced by the rescoring algorithm to be safe in the majority case where no whitelist rule has triggered. Thus whitelists are not needed or relied upon to prevent false positive classification.
>
> While whitelists are not directly effective (statistically, when averaged across a large corpus), whitelists are powerful tools in indirect ways including:
> * Pushing the score beyond the auto-learn threshold for things like Bayes to function without manual intervention.
> * The albeit controversial method where some automated spam trap blacklists use whitelists to help determine if they really should list an IP address.
>
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6251
>
> spamassassin-3.3.0 has reduced the score impact of these whitelists to more modest levels, maxing out at -5 points. -5 is PLENTY for spamassassin, as 5 points is the level for which the scoreset is tuned. Mail from a whitelisted host would need greater than 10 points to be blocked, which is statistically very rare for ham. I believe that we are striking the right balance with these modest whitelist scores in this release.
>
> That being said, whitelists should be constantly policed to maintain their reputation and trust levels. For example, while I currently am impressed by DNSWL's performance, I am not pleased that they seem to lack automated trap-based enforcement. Relying only on manual reports and manual intervention requires too much effort in the long term for any organization, be it company- or volunteer-run.
>
> Warren Togami
> wtog...@redhat.com
Re: Whitelists, not directly useful to spamassassin...
On 12/17/2009 11:27 AM, Jason Bertoch wrote:
> If whitelists are to be enabled by default, I believe their score should be moved considerably more toward zero.
>
> /Jason

I don't necessarily disagree with this desire, now that we know the whitelists are making almost zero difference to spamassassin's results. We did at least reduce the scores from the default values in spamassassin-3.2.x as a reasonable compromise.

Warren
Re: Whitelists, not directly useful to spamassassin...
Very interesting data indeed -- and a testament to the accuracy of the SpamAssassin rules weighting process.

On Dec 16, 2009, at 4:10 PM, Warren Togami wrote:
> While whitelists are not directly effective (statistically, when averaged across a large corpus), whitelists are powerful tools in indirect ways including:
> * Pushing the score beyond the auto-learn threshold for things like Bayes to function without manual intervention.
> * The albeit controversial method where some automated spam trap blacklists use whitelists to help determine if they really should list an IP address.

Another indirect benefit (according to other users of our whitelists) is that when they implement a new spam-blocking method, the whitelists serve as kind of a safety valve to let legitimate mail through even when the new rule turns out to have false positives. Site-specific whitelists are important for this, too.

> That being said, whitelists should be constantly policed to maintain their reputation and trust levels.

Agreed.

--
J.D. Falk jdf...@returnpath.net
Return Path Inc
Whitelists, not directly useful to spamassassin...
I made a discovery today that surprised even myself. Using the rescore masscheck and weekly masscheck logs while working on Bug #6247 I found some interesting details that throw a wrench into this lively debate.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c49
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c51

It turns out that the ReturnPath and DNSWL whitelists have a statistically insignificant impact on spamassassin's ability to determine ham vs. spam. Meanwhile, both whitelists have high levels of accuracy. How can both of these statements be true?

I suspect this is because the scores are balanced by the rescoring algorithm to be safe in the majority case where no whitelist rule has triggered. Thus whitelists are not needed or relied upon to prevent false positive classification.

While whitelists are not directly effective (statistically, when averaged across a large corpus), whitelists are powerful tools in indirect ways including:
* Pushing the score beyond the auto-learn threshold for things like Bayes to function without manual intervention.
* The albeit controversial method where some automated spam trap blacklists use whitelists to help determine if they really should list an IP address.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6251

spamassassin-3.3.0 has reduced the score impact of these whitelists to more modest levels, maxing out at -5 points. -5 is PLENTY for spamassassin, as 5 points is the level for which the scoreset is tuned. Mail from a whitelisted host would need greater than 10 points to be blocked, which is statistically very rare for ham. I believe that we are striking the right balance with these modest whitelist scores in this release.

That being said, whitelists should be constantly policed to maintain their reputation and trust levels. For example, while I currently am impressed by DNSWL's performance, I am not pleased that they seem to lack automated trap-based enforcement. Relying only on manual reports and manual intervention requires too much effort in the long term for any organization, be it company- or volunteer-run.

Warren Togami
wtog...@redhat.com
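The score arithmetic in the message above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 5-point threshold and the -5 maximum whitelist score come from the message, while the function name and the rule that a message is flagged once its total reaches the threshold are assumptions of this sketch.

```python
SPAM_THRESHOLD = 5.0        # level the scoreset is tuned for, per the message
MAX_WHITELIST_SCORE = -5.0  # largest whitelist adjustment in 3.3.0, per the message


def is_flagged(other_rules_score, whitelisted):
    """Return True if the total score reaches the spam threshold.

    ASSUMPTION: a message is flagged when total >= threshold.
    """
    adjustment = MAX_WHITELIST_SCORE if whitelisted else 0.0
    return other_rules_score + adjustment >= SPAM_THRESHOLD


# A message whose other rules add up to 7.5 points:
print(is_flagged(7.5, whitelisted=False))  # 7.5 total, at or above 5
print(is_flagged(7.5, whitelisted=True))   # 2.5 total, below 5

# With the full -5 applied, the other rules must contribute about
# 10 points before a whitelisted message is flagged:
print(is_flagged(10.5, whitelisted=True))  # 5.5 total, at or above 5
```

In other words, a maximal whitelist hit doubles the number of points the remaining rules must contribute before the message is treated as spam.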