Re: Whitelists, not directly useful to spamassassin...

2009-12-21 Thread Matus UHLAR - fantomas
 Warren Togami wrote:
 While whitelists are not directly effective (statistically, when  
 averaged across a large corpus), whitelists are powerful tools in  
 indirect ways including:

 * Pushing the score beyond the auto-learn threshold for things like  
 Bayes to function without manual intervention.

On 17.12.09 11:27, Jason Bertoch wrote:
 This does not sound like a positive thing to me.  E-mail from any sender  
 that is malformed enough to skip auto-learning should not be forced into  
 Bayes as ham simply because some 3rd party promises, for their own  
 monetary benefit, that the sender is a nice guy.  Why should any sender  
 that I have not intentionally added to my local whitelist get a break?

If you _want_ the mail and whitelist the sender, I think its characteristics
should be pushed into the bayes.
If you don't want the mail, then autolearning it as spam is the least of your
problems.

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Linux IS user friendly, it's just selective who its friends are...


Re: Whitelists, not directly useful to spamassassin...

2009-12-17 Thread Per Jessen
Warren Togami wrote:

 https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c49
 https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c51
 It turns out that the ReturnPath and DNSWL whitelists have a
 statistically insignificant impact on spamassassin's ability to
 determine ham vs. spam.  Meanwhile, both whitelists have high levels
 of accuracy.
 
 How can both of these statements be true?  I suspect this is because
 the scores are balanced by the rescoring algorithm to be safe in the
 majority case where no whitelist rule has triggered.  Thus whitelists
 are not needed or relied upon to prevent false positive
 classification.

I concur, that is what my analysis of HABEAS hits over the last four
months showed too. 


/Per Jessen, Zürich



Re: Whitelists, not directly useful to spamassassin...

2009-12-17 Thread Jason Bertoch

Warren Togami wrote:



While whitelists are not directly effective (statistically, when 
averaged across a large corpus), whitelists are powerful tools in 
indirect ways including:


* Pushing the score beyond the auto-learn threshold for things like 
Bayes to function without manual intervention.


This does not sound like a positive thing to me.  E-mail from any sender 
that is malformed enough to skip auto-learning should not be forced into 
Bayes as ham simply because some 3rd party promises, for their own 
monetary benefit, that the sender is a nice guy.  Why should any sender 
that I have not intentionally added to my local whitelist get a break?


I've had enough problems with DNSWL, HABEAS, and JMF that they have all 
been disabled here.  Unfortunately, that also means I have no recent 
data to add to the debate.  Although I believe that whitelists should be 
included in the default install for those that want them, I also believe 
they should be disabled by default so that an admin must knowingly 
enable them after reading the manual and considering the consequences.


The argument has also been made that whitelists should be included 
simply because blacklists are.  I think that argument is flawed. 
Blacklists are part of the spam fighting community while whitelists are 
part of the bulk delivery community.  Their goals and motives are 
completely different.  For one, blacklists will normally have evidence 
of abuse to support their listing.  Whitelists only have policies and 
promises.  Second, the scoring of whitelists is currently favored over 
blacklists, and will continue to be at the proposed settings for 3.3.0. 
 Why can a whitelist override the score of a blacklist when it is the 
blacklist that has evidence of abuse?



After reading up on Bug6247, I found that ReturnPath included 
interesting stats on their lists:


Certified
Active: 4407
Suspended: 1300
Total: 5707

Safe
Active: 6561
Suspended: 283
Total: 6844


The Certified list is supposedly difficult to get on so I'm not sure how 
to interpret these results.  Is 1/5 of the list suspended because of due 
diligence on the part of ReturnPath?  If so, how did they get certified 
in the first place?
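As a rough check of the "1/5" estimate, the suspended share of each list can be worked out from the counts quoted above. This is only a sketch: the dictionary layout and the helper name `suspended_share` are made up for illustration, and the numbers are the ones cited from Bug 6247, nothing more.

```python
# Counts exactly as quoted in the message (ReturnPath figures from Bug 6247).
# The structure and names below are illustrative assumptions.
lists = {
    "Certified": {"active": 4407, "suspended": 1300},
    "Safe": {"active": 6561, "suspended": 283},
}


def suspended_share(active, suspended):
    """Fraction of all listed senders (active + suspended) that are suspended."""
    return suspended / (active + suspended)


for name, counts in lists.items():
    total = counts["active"] + counts["suspended"]
    share = suspended_share(counts["active"], counts["suspended"])
    print(f"{name}: {counts['suspended']} of {total} suspended ({share:.1%})")
```

On these figures the Certified list comes out a little above one fifth suspended, while the Safe list is a few percent.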


If whitelists are to be enabled by default, I believe their score should 
be moved considerably more toward zero.


/Jason


Re: Whitelists, not directly useful to spamassassin...

2009-12-17 Thread Charles Gregory


Thank you, Warren. That (finally) gives some real perspective to this 
mess, and gets some of the 'real' questions answered.


- C

On Wed, 16 Dec 2009, Warren Togami wrote:
I made a discovery today that surprised even myself.  Using the rescore 
masscheck and weekly masscheck logs while working on Bug #6247 I found some 
interesting details that throw a wrench into this lively debate.


https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c49
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c51
It turns out that the ReturnPath and DNSWL whitelists have a statistically 
insignificant impact on spamassassin's ability to determine ham vs. spam. 
Meanwhile, both whitelists have high levels of accuracy.


How can both of these statements be true?  I suspect this is because the 
scores are balanced by the rescoring algorithm to be safe in the majority 
case where no whitelist rule has triggered.  Thus whitelists are not needed 
or relied upon to prevent false positive classification.


While whitelists are not directly effective (statistically, when averaged 
across a large corpus), whitelists are powerful tools in indirect ways 
including:


* Pushing the score beyond the auto-learn threshold for things like Bayes to 
function without manual intervention.
* The albeit controversial method where some automated spam trap blacklists 
use whitelists to help determine if they really should list an IP address.


https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6251
spamassassin-3.3.0 has reduced the score impact of these whitelists to more 
modest levels, maxing out at -5 points.  -5 is PLENTY for spamassassin, as 5 
points is the level for which the scoreset is tuned. Mail from a whitelisted host 
would need greater than 10 points to be blocked, which is statistically very 
rare for ham.  I believe that we are striking the right balance with these 
modest whitelist scores in this release.


That being said, whitelists should be constantly policed to maintain their 
reputation and trust levels.  For example, while I currently am impressed by 
DNSWL's performance, I am not pleased that they seem to lack automated 
trap-based enforcement.  Relying only on manual reports and manual 
intervention requires too much effort in the long-term for any organization, 
be it company or volunteer run.


Warren Togami
wtog...@redhat.com




Re: Whitelists, not directly useful to spamassassin...

2009-12-17 Thread Warren Togami

On 12/17/2009 11:27 AM, Jason Bertoch wrote:


If whitelists are to be enabled by default, I believe their score should
be moved considerably more toward zero.

/Jason


I don't necessarily disagree with this desire, as now we know the 
whitelists actually are making almost zero difference to spamassassin's 
results.


We did at least reduce the scores from their default values that were in 
spamassassin-3.2.x as a reasonable compromise.


Warren


Re: Whitelists, not directly useful to spamassassin...

2009-12-17 Thread J.D. Falk
Very interesting data indeed -- and a testament to the accuracy of the 
SpamAssassin rules weighting process.

On Dec 16, 2009, at 4:10 PM, Warren Togami wrote:

 While whitelists are not directly effective (statistically, when averaged 
 across a large corpus), whitelists are powerful tools in indirect ways 
 including:
 
 * Pushing the score beyond the auto-learn threshold for things like Bayes to 
 function without manual intervention.
 * The albeit controversial method where some automated spam trap blacklists 
 use whitelists to help determine if they really should list an IP address.

Another indirect benefit (according to other users of our whitelists) is that 
when they implement a new spam-blocking method, the whitelists serve as kind of 
a safety valve to let legitimate mail through even when the new rule turns out 
to have false positives.

Site-specific whitelists are important for this, too.

 That being said, whitelists should be constantly policed to maintain their 
 reputation and trust levels.

Agreed.

--
J.D. Falk jdf...@returnpath.net
Return Path Inc






Whitelists, not directly useful to spamassassin...

2009-12-16 Thread Warren Togami
I made a discovery today that surprised even myself.  Using the rescore 
masscheck and weekly masscheck logs while working on Bug #6247 I found 
some interesting details that throw a wrench into this lively debate.


https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c49
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247#c51
It turns out that the ReturnPath and DNSWL whitelists have a 
statistically insignificant impact on spamassassin's ability to 
determine ham vs. spam.  Meanwhile, both whitelists have high levels of 
accuracy.


How can both of these statements be true?  I suspect this is because the 
scores are balanced by the rescoring algorithm to be safe in the 
majority case where no whitelist rule has triggered.  Thus whitelists 
are not needed or relied upon to prevent false positive classification.


While whitelists are not directly effective (statistically, when 
averaged across a large corpus), whitelists are powerful tools in 
indirect ways including:


* Pushing the score beyond the auto-learn threshold for things like 
Bayes to function without manual intervention.
* The albeit controversial method where some automated spam trap 
blacklists use whitelists to help determine if they really should list 
an IP address.


https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6247
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6251
spamassassin-3.3.0 has reduced the score impact of these whitelists to 
more modest levels, maxing out at -5 points.  -5 is PLENTY for 
spamassassin, as 5 points is the level for which the scoreset is tuned. 
Mail from a whitelisted host would need greater than 10 points to be 
blocked, which is statistically very rare for ham.  I believe that we 
are striking the right balance with these modest whitelist scores in 
this release.
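The arithmetic behind the "10 points" figure can be sketched as follows. This is a minimal illustration, not SpamAssassin code: the constant and function names are hypothetical, the 5-point threshold and the -5 maximum are the values stated above, and treating a message as blocked when its total reaches the threshold is an assumption of the sketch.

```python
# Values taken from the message above; names are hypothetical.
REQUIRED_SCORE = 5.0        # threshold the scoreset is said to be tuned for
MAX_WHITELIST_SCORE = -5.0  # largest whitelist adjustment cited for 3.3.0


def points_needed_to_block(whitelist_score, required=REQUIRED_SCORE):
    """Points the other rules must contribute before the total reaches the threshold."""
    return required - whitelist_score


def is_blocked(other_rules_score, whitelist_score=0.0, required=REQUIRED_SCORE):
    """Assumption: a message is blocked once its total score reaches the threshold."""
    return other_rules_score + whitelist_score >= required


print(points_needed_to_block(0.0))                  # no whitelist hit
print(points_needed_to_block(MAX_WHITELIST_SCORE))  # maximum whitelist hit
print(is_blocked(9.9, MAX_WHITELIST_SCORE))         # just under the break-even point
```

With the maximum whitelist adjustment, the break-even point doubles from 5 to 10 points from all other rules combined.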


That being said, whitelists should be constantly policed to maintain 
their reputation and trust levels.  For example, while I currently am 
impressed by DNSWL's performance, I am not pleased that they seem to 
lack automated trap-based enforcement.  Relying only on manual reports 
and manual intervention requires too much effort in the long-term for 
any organization, be it company or volunteer run.


Warren Togami
wtog...@redhat.com