Re: Annoying auto_whitelist

2009-07-12 Thread Matus UHLAR - fantomas
  RW wrote:
   The much more common scenario is that the first spam hits BAYES_50
   and subsequent BAYES_99 hits are countered by a negative  AWL score.

 On Fri, 10 Jul 2009 08:09:04 -0400
 Matt Kettler mkettler...@verizon.net wrote:
  Technically, this only counters half the score. It also gets paid
  back later. It raises the stored average that will apply to
  subsequent messages.

On 10.07.09 18:57, RW wrote:
 So what's the point of including  BAYES_99 in AWL?

The point is not to exclude very useful information, such as a BAYES_00 or
BAYES_99 score, when scoring later e-mail.

 but there's only a benefit if the BAYES_XX score falls, otherwise
 the distortion to the score just gets less bad - I don't see how you
 can describe that as paid back.   

  I'd also argue it's a rather rare case. Most of my spam hits BAYES_99
  the first shot around, and most has varying sender address and IP. The
  odds of one having increasing score and the same sender address/ip
  seems extraordinarily unlikely to me.

 If something scarcely ever makes a difference, and on the occasion it
 does, gets it wrong more often than it gets it right, I don't see the
 point in keeping it.

That paragraph was about AWL as a whole, not about including or excluding
BAYES scores in it.
-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Honk if you love peace and quiet. 


Re: Annoying auto_whitelist

2009-07-10 Thread Matus UHLAR - fantomas
  On Sat, 04 Jul 2009 08:56:35 -0400
  Matt Kettler mkettler...@verizon.net wrote:
   Please be aware the AWL is NOT whitelist, or a blacklist, and the
   scores don't really quite work the way they look. The AWL is
   essentially an averager, and as such, it's sometimes going to assign
   negative scores to spam sometimes.

  And it works from its own version of the score that ignores
  whitelisting and bayes scores. So if learning a spam leads to the next
  spam from the same address getting a higher bayes score, that benefit
  isn't washed-out by AWL. 

On 04.07.09 22:42, RW wrote:
 I take that back, I thought the the BAYES_XX rules were ignored by AWL,
 but they aren't.
 
 Personally I think BAYES should be ignored by AWL, emails from the same
 from address and ip address will have a lot of tokens in common.  They
 should train quickly, and there shouldn't be any need to damp-out
 that learning.

I don't think so. Training BAYES is a good way to hint to the AWL which way
it should push scores. By ignoring BAYES, you could push a lot of spam towards
ham, since much spam isn't caught by any rules other than BAYES, and vice versa.

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
WinError #9: Out of error messages.


Re: Annoying auto_whitelist

2009-07-10 Thread RW
On Fri, 10 Jul 2009 12:33:51 +0200
Matus UHLAR - fantomas uh...@fantomas.sk wrote:

   On Sat, 04 Jul 2009 08:56:35 -0400
   Matt Kettler mkettler...@verizon.net wrote:
Please be aware the AWL is NOT whitelist, or a blacklist, and
the scores don't really quite work the way they look. The AWL is
essentially an averager, and as such, it's sometimes going to
assign negative scores to spam sometimes.
 
   And it works from its own version of the score that ignores
   whitelisting and bayes scores. So if learning a spam leads to the
   next spam from the same address getting a higher bayes score,
   that benefit isn't washed-out by AWL. 
 
 On 04.07.09 22:42, RW wrote:
  I take that back, I thought the the BAYES_XX rules were ignored by
  AWL, but they aren't.
  
  Personally I think BAYES should be ignored by AWL, emails from the
  same from address and ip address will have a lot of tokens in
  common.  They should train quickly, and there shouldn't be any need
  to damp-out that learning.
 
 I don't think so. Teaching BAYES is a good way to hint AWL which way
 should it push scores. By ignoring bayes, you could move much spam
 the ham-way since much of spam isn't catched by other scores than
 BAYES, and vice versa.
 
Right, but that's only a benefit if the BAYES score drops - remember
it's an averaging system. Personally I only have a single spam in my
spam corpus that has an AWL hit and doesn't hit BAYES_99, and that hits
BAYES_95. Sending multiple spams from the same From address and IP
address is a gift to Bayesian filters.

The much more common scenario is that the first spam hits BAYES_50 and
subsequent BAYES_99 hits are countered by a negative AWL score.

 


Re: Annoying auto_whitelist

2009-07-10 Thread Matt Kettler
RW wrote:
 On Fri, 10 Jul 2009 12:33:51 +0200
 Matus UHLAR - fantomas uh...@fantomas.sk wrote:

   
 On Sat, 04 Jul 2009 08:56:35 -0400
 Matt Kettler mkettler...@verizon.net wrote:
 
 Please be aware the AWL is NOT whitelist, or a blacklist, and
 the scores don't really quite work the way they look. The AWL is
 essentially an averager, and as such, it's sometimes going to
 assign negative scores to spam sometimes.
   
 And it works from its own version of the score that ignores
 whitelisting and bayes scores. So if learning a spam leads to the
 next spam from the same address getting a higher bayes score,
 that benefit isn't washed-out by AWL. 
 
 On 04.07.09 22:42, RW wrote:
 
 I take that back, I thought the the BAYES_XX rules were ignored by
 AWL, but they aren't.

 Personally I think BAYES should be ignored by AWL, emails from the
 same from address and ip address will have a lot of tokens in
 common.  They should train quickly, and there shouldn't be any need
 to damp-out that learning.
   
 I don't think so. Teaching BAYES is a good way to hint AWL which way
 should it push scores. By ignoring bayes, you could move much spam
 the ham-way since much of spam isn't catched by other scores than
 BAYES, and vice versa.

 
 Right, but that's only a benefit if the BAYES score drops - remember
 it's an averaging system. Personally I only have a single spam in my
 spam corpus that has a AWL hit and doesn't hit BAYES_99, and that hits
 BAYES_95. Sending multiple spams from the same from address and IP
 address is a gift to Bayesian filters.

 The much more common scenario is that the first spam hits BAYES_50 and
 subsequent BAYES_99 hits are countered by a negative  AWL score.
   
Technically, this only counters half the score. It also gets paid back
later. It raises the stored average that will apply to subsequent messages.
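
A rough way to see both effects is to replay a sender's messages through a toy
averager. This is only a sketch under stated assumptions: the function name is
made up, the pre-AWL scores are invented, and it assumes the AWL keeps a
per-sender total and count, applies (mean - score) * 0.5, and then adds the
pre-AWL score to the total.

```python
def simulate_awl(scores, factor=0.5):
    """Return (pre_awl_score, awl_delta, final_score) for each message in order.

    Assumes the AWL keeps a running total and count per sender, applies
    (mean - score) * factor, and then adds the pre-AWL score to the total.
    """
    total, count = 0.0, 0
    results = []
    for score in scores:
        delta = 0.0
        if count > 0:
            mean = total / count
            delta = (mean - score) * factor
        results.append((score, delta, score + delta))
        total += score
        count += 1
    return results


# Hypothetical pre-AWL scores: the first spam is missed by Bayes (2.0),
# the following three hit BAYES_99 and score 6.0.
for pre, delta, final in simulate_awl([2.0, 6.0, 6.0, 6.0]):
    print(f"pre={pre:.2f} awl={delta:+.2f} final={final:.2f}")
```

With these invented numbers the second message loses 2.0 of its 4.0 extra
points (half), and the penalty shrinks to 1.0 and then about 0.67 as the
stored average rises. It gets smaller but does not disappear.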

I'd also argue it's a rather rare case. Most of my spam hits BAYES_99
the first time around, and most of it has varying sender addresses and
IPs. A run of spam with an increasing score and the same sender
address/IP seems extraordinarily unlikely to me.

Besides, the real problem there isn't the AWL, but the fact that the
first message scored low.

Are you really seeing cases where this is causing false negatives, or
are you just pontificating about what's possible?




Re: Annoying auto_whitelist

2009-07-10 Thread RW
On Fri, 10 Jul 2009 08:09:04 -0400
Matt Kettler mkettler...@verizon.net wrote:

 RW wrote:

  The much more common scenario is that the first spam hits BAYES_50
  and subsequent BAYES_99 hits are countered by a negative  AWL score.

 Technically, this only counters half the score. It also gets paid
 back later. It raises the stored average that will apply to
 subsequent messages.

But there's only a benefit if the BAYES_XX score falls; otherwise
the distortion to the score just gets less bad. I don't see how you
can describe that as "paid back".
 
 
 I'd also argue it's a rather rare case. Most of my spam hits BAYES_99
 the first shot around, and most has varying sender address and IP. The
 odds of one having increasing score and the same sender address/ip
 seems extraordinarily unlikely to me.

So what's the point of including BAYES_99 in AWL?

If something scarcely ever makes a difference, and on the occasion it
does, gets it wrong more often than it gets it right, I don't see the
point in keeping it.


Re: Annoying auto_whitelist

2009-07-05 Thread Benny Pedersen

On Sat, July 4, 2009 10:20, Michelle Konzack wrote:

 ...because the Spamer From: is in the auto_whitelist.

Argh :/

The From address and the SENDER IP are both in the awl table, so where is
the problem?

If the sender IP also matches (the match is fuzzy, on the /16), then I see
the problem.

And by the way, the AWL is NOT a whitelist!

-- 
xpoint



Re: Annoying auto_whitelist

2009-07-05 Thread Benny Pedersen

On Sat, July 4, 2009 20:50, Michelle Konzack wrote:
 Good evening Jari,

 On 2009-07-04 13:46:45, Jari Fredriksson wrote:
 http://wiki.apache.org/spamassassin/BetterDocumentation/SqlReadmeAwl

 Thankyou for the link, but if I understand  it  right,  spamassassin  is
 then using ONE Database/Table for ALL users...  This mean, the  Database
 will grow more then 10.000 ROW's a day...

 Is in spamassassin something like an autoexpire?

 Most spams I get are with UNIQUE From: header.  I allready collect  this
 infos using procmail recipes...  And since 2002 I have  collectedt  over
 27 million different E-Mails


CREATE TABLE `awl` (
  `username` varchar(100) NOT NULL default '',
  `email` varchar(200) NOT NULL default '',
  `ip` varchar(10) NOT NULL default '',
  `count` int(11) default '0',
  `totscore` float default '0',
  `lastupdate` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  PRIMARY KEY  (`username`,`email`,`ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


CREATE TABLE `bayes_seen` (
  `id` int(11) NOT NULL default '0',
  `msgid` varchar(200) character set utf8 collate utf8_bin NOT NULL default '',
  `flag` char(1) NOT NULL default '',
  `lastupdate` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  PRIMARY KEY  (`id`,`msgid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


Everything else expires natively in SA. With the lastupdate column, the two
tables above can now be expired from a cron job; how to do that is up to
others to decide :)
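
One possible shape for such a cron job, as a sketch only: the function name
and the 90-day retention are arbitrary choices, and an in-memory SQLite table
stands in for the real MySQL awl table so that the example is self-contained.

```python
import sqlite3
from datetime import datetime, timedelta


def expire_awl(conn, max_age_days, now=None):
    """Delete awl rows whose lastupdate is older than max_age_days; return the count."""
    now = now or datetime.now()
    cutoff = (now - timedelta(days=max_age_days)).strftime("%Y-%m-%d %H:%M:%S")
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("DELETE FROM awl WHERE lastupdate < ?", (cutoff,))
    return cur.rowcount


# Stand-in table with only the columns the example needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE awl (email TEXT, totscore REAL, lastupdate TEXT)")
conn.executemany(
    "INSERT INTO awl VALUES (?, ?, ?)",
    [
        ("old@example.com", 12.0, "2009-01-01 00:00:00"),
        ("recent@example.com", 3.0, "2009-07-01 00:00:00"),
    ],
)

# Pretend "now" is 2009-07-05 and keep 90 days of history.
removed = expire_awl(conn, 90, now=datetime(2009, 7, 5))
print(removed)
```

Against a real database the same DELETE would go through the matching driver,
whose placeholder style may differ from SQLite's "?".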

-- 
xpoint



Re: Annoying auto_whitelist

2009-07-05 Thread Benny Pedersen

On Sat, July 4, 2009 20:55, Michelle Konzack wrote:

 To prevent manualy learning of the MEDS spams I have set  my  MEDS-Score
 to 8.00 and do not get any spams except caNN and genNN.

perldoc Mail::SpamAssassin::Plugin::AWL

See the auto_whitelist_factor setting. The default is 0.5; if you don't like
that, change it to 0.25, and a spammer who uses your email / IP will then
benefit less.

got it ?
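
To illustrate what the factor changes, here is a small sketch. The helper name
and the scores are invented, and it assumes the AWL adjustment is
(stored mean - current score) * factor.

```python
def final_score(stored_mean, score, factor):
    """Score after the AWL adjustment, assuming delta = (mean - score) * factor."""
    return score + (stored_mean - score) * factor


# Hypothetical case: the sender's stored mean is 1.0 and a new spam scores 7.0.
for factor in (0.5, 0.25):
    print(factor, final_score(1.0, 7.0, factor))
```

With these made-up numbers the spam ends up at 4.0 with a factor of 0.5 and at
5.5 with 0.25, so the smaller factor lets less of the sender's history pull
the score down.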

-- 
xpoint



Annoying auto_whitelist

2009-07-04 Thread Michelle Konzack
Hello,

I currently get several thousand shop/meds/pill/gen spams a day, and some
get through my filters. I have to move them to my spam folder manually and
feed them to sa-learn --spam, but this does not work...

...because the spammer's From: is in the auto_whitelist.

To me this looks like a bug, because sa-learn should remove the From:
from the auto_whitelist and then RESCAN this crap.

Over the last two days I uncompressed the spam archives from the last 27
weeks (from this year), used formail to extract all From: addresses,
de-duplicated them and ran

for FROM in ${LIST} ; do
    spamassassin --remove-addr-from-whitelist="${FROM}"
done

which took over 52 hours for 487000 e-mails. Hell, I have a super fast
machine with 15000 rpm SCSI drives and 32 GByte of memory. That is 2.6
e-mails per second...

Why is this so slow?

On my intranet server, a NEC 4500MH (Quad-Xeon, 550MHz/4GByte), it takes
around 5-11 seconds to remove a single e-mail.

michelle.konz...@vserver1:~$ apt-cache policy spamassassin
spamassassin:
  Installed: 3.2.5-2
  Candidate: 3.2.5-2
  Version table:
 *** 3.2.5-2 0
        500 http://ftp.de.debian.org lenny/main Packages
        100 /var/lib/dpkg/status


Thanks, Greetings and nice Day/Evening
Michelle Konzack
Systemadministrator
25.9V Electronic Engineer
Tamay Dogan Network
Debian GNU/Linux Consultant

-- 
Linux-User #280138 with the Linux Counter, http://counter.li.org/
# Debian GNU/Linux Consultant #
http://www.tamay-dogan.net/ Michelle Konzack
http://www.can4linux.org/   c/o Vertriebsp. KabelBW
http://www.flexray4linux.org/   Blumenstrasse 2
Jabber linux4miche...@jabber.ccc.de   77694 Kehl/Germany
IRC #Debian (irc.icq.com) Tel. DE: +49 177 9351947
ICQ #328449886Tel. FR: +33  6  61925193




Re: Annoying auto_whitelist

2009-07-04 Thread Jari Fredriksson
 Hello,

 while I get currently several 1000 shop/meds/pill/gen spams  a  day  and
 some are going throug my filters, I have to move them to  my  spamfolder
 manualy and feed them to sa-learn --spam but this does not work...

 ...because the Spamer From: is in the auto_whitelist.

 For me, this seems to be a bug, becuase sa-learn has to remove the From:
 from the auto_whitelist and then RESCAN this crap.

 the two last days I have uncompressed the spamarchives from the last  27
 weeks (from this year), used formail  to  extract  all  From:  E-Mails
 unified them and used

 for FROM in ${LIST} ; do
 spamassassin --remove-addr-from-whitelist=${FROM}
 done

 which took over 52 hours for 487000 EMails.  Hell, I have a  super  fast
 machine with 15000 RpM SCSI drives and 32 GByte of memory.  This are 2.6
 E-Mails per second...

Do you have an SQL-based AWL? If not, it might be worth considering,
given your volume of email.

With SQL:

for FROM in ${LIST} ; do
mysql -u spamassassin -psecret spamassassin <<EOF
delete from awl where email='${FROM}' ;
EOF
done

Should be MUCH faster.



Re: Annoying auto_whitelist

2009-07-04 Thread Michelle Konzack
On 2009-07-04 11:53:27, Jari Fredriksson wrote:
 Do You have SQL based AWL? If not, it might  be worth a consideration,
 given your amounts of email.

AWL in SQL?

Yes, I have a PostgreSQL database available (I mean, each user has one),
but how can I set up spamassassin to use it?

 With SQL
 
  for FROM in ${LIST} ; do
  mysql -u spamassassin -psecret spamassassin <<EOF
  delete from awl where email='${FROM}' ;
  EOF
  done
 
 Should be MUCH faster.

I'd like to try it out, but how do I set it up?

Thanks, Greetings and nice Day/Evening
Michelle Konzack
Systemadministrator
Tamay Dogan Network
Debian GNU/Linux Consultant



Re: Annoying auto_whitelist

2009-07-04 Thread Henrik K
On Sat, Jul 04, 2009 at 11:53:27AM +0300, Jari Fredriksson wrote:
  Hello,
 
  while I get currently several 1000 shop/meds/pill/gen spams  a  day  and
  some are going throug my filters, I have to move them to  my  spamfolder
  manualy and feed them to sa-learn --spam but this does not work...
 
  ...because the Spamer From: is in the auto_whitelist.
 
  For me, this seems to be a bug, becuase sa-learn has to remove the From:
  from the auto_whitelist and then RESCAN this crap.
 
  the two last days I have uncompressed the spamarchives from the last  27
  weeks (from this year), used formail  to  extract  all  From:  E-Mails
  unified them and used
 
  for FROM in ${LIST} ; do
  spamassassin --remove-addr-from-whitelist=${FROM}
  done
 
  which took over 52 hours for 487000 EMails.  Hell, I have a  super  fast
  machine with 15000 RpM SCSI drives and 32 GByte of memory.  This are 2.6
  E-Mails per second...

You are loading a big Perl program for every single email; what do you
expect? ;)

You should edit the database directly. If you're not using SQL, it's a bit
trickier; you could modify trim_whitelist to do it, etc.

 Do You have SQL based AWL? If not, it might  be worth a consideration,
 given your amounts of email.
 
 With SQL
 
  for FROM in ${LIST} ; do
  mysql -u spamassassin -psecret spamassassin <<EOF
  delete from awl where email='${FROM}' ;
  EOF
  done
 
 Should be MUCH faster.

$FROM may contain quote characters, so they should be escaped. That's
always good practice, even though I doubt any emails contain SQL
injections.

You could also write all the SQL statements to a file first and then run
it, to avoid the same pitfall as above (starting a program for every
address), though on a smaller scale. ;)
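
Both points, handling quote characters and avoiding one process per address,
can be sketched with a single script that uses bound parameters. This is an
illustration under assumptions, not a drop-in tool: the function name and the
sample rows are invented, and an in-memory SQLite table stands in for the real
awl table so that the example runs on its own.

```python
import sqlite3


def remove_senders(conn, addresses):
    """Delete AWL rows for the given addresses in one transaction.

    The "?" placeholder lets the driver handle quote characters in addresses,
    so no manual escaping is needed. Returns the number of rows deleted.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.executemany(
            "DELETE FROM awl WHERE email = ?",
            [(addr,) for addr in addresses],
        )
    return cur.rowcount


# Stand-in database: an in-memory SQLite table with a subset of the columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE awl (username TEXT, email TEXT, ip TEXT, totscore REAL)")
conn.executemany(
    "INSERT INTO awl VALUES (?, ?, ?, ?)",
    [
        ("user1", "o'brien@example.com", "10.1", -4.5),
        ("user1", "pills@example.net", "10.2", 12.0),
        ("user1", "friend@example.org", "10.3", -9.0),
    ],
)

deleted = remove_senders(conn, ["o'brien@example.com", "pills@example.net"])
print(deleted)
```

With MySQL or PostgreSQL the same pattern would go through the matching
DB-API driver, whose placeholder style may differ from SQLite's "?".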



Re: Annoying auto_whitelist

2009-07-04 Thread Jari Fredriksson
 On 2009-07-04 11:53:27, Jari Fredriksson wrote:
 Do You have SQL based AWL? If not, it might  be worth a consideration,
 given your amounts of email.

 AWL in SQL?

 Yes, I have a PostgreSQL database available (mean, each user  has  one),
 but how can I setup spamassassin to use it?

http://wiki.apache.org/spamassassin/BetterDocumentation/SqlReadmeAwl



Re: Annoying auto_whitelist

2009-07-04 Thread RW
On Sat, 4 Jul 2009 10:20:06 +0200
Michelle Konzack linux4miche...@tamay-dogan.net wrote:

 Hello,
 
 while I get currently several 1000 shop/meds/pill/gen spams  a  day
 and some are going throug my filters, I have to move them to  my
 spamfolder manualy and feed them to sa-learn --spam but this does
 not work...
 
 ...because the Spamer From: is in the auto_whitelist.
 
 For me, this seems to be a bug, becuase sa-learn has to remove the
 From: from the auto_whitelist and then RESCAN this crap.

So what happens if you don't remove it, what error do you get when you
run sa-learn?


Re: Annoying auto_whitelist

2009-07-04 Thread Matt Kettler
Michelle Konzack wrote:
 Hello,

 while I get currently several 1000 shop/meds/pill/gen spams  a  day  and
 some are going throug my filters, I have to move them to  my  spamfolder
 manualy and feed them to sa-learn --spam but this does not work...

 ...because the Spamer From: is in the auto_whitelist.

Wait a second. The AWL has nothing to do with bayes or sa-learn.

The only reason SA won't learn a message as spam would be if it has
already been learned as spam, as noted in the bayes_seen database (or
corresponding SQL table).

 For me, this seems to be a bug, becuase sa-learn has to remove the From:
 from the auto_whitelist and then RESCAN this crap.

Um, the AWL has nothing to do with sa-learn --spam, and this action will
neither consult nor modify the AWL.

What makes you think the AWL is inhibiting learning?

The AWL is actually going to contain *EVERY* sender that ever sent you
email (because it is an averager, not a whitelist), so if it would
inhibit learning, you'd never be able to learn anything.



Re: Annoying auto_whitelist

2009-07-04 Thread RW
On Sat, 04 Jul 2009 08:56:35 -0400
Matt Kettler mkettler...@verizon.net wrote:

 Please be aware the AWL is NOT whitelist, or a blacklist, and the
 scores don't really quite work the way they look. The AWL is
 essentially an averager, and as such, it's sometimes going to assign
 negative scores to spam sometimes.

And it works from its own version of the score that ignores
whitelisting and bayes scores. So if learning a spam leads to the next
spam from the same address getting a higher bayes score, that benefit
isn't washed-out by AWL. 


Re: Annoying auto_whitelist

2009-07-04 Thread Matt Kettler
Michelle Konzack wrote:
 Hello,

 while I get currently several 1000 shop/meds/pill/gen spams  a  day  and
 some are going throug my filters, I have to move them to  my  spamfolder
 manualy and feed them to sa-learn --spam but this does not work...

 ...because the Spamer From: is in the auto_whitelist.

 For me, this seems to be a bug, becuase sa-learn has to remove the From:
 from the auto_whitelist and then RESCAN this crap.

Is the AWL actually causing false negatives?

Please be aware the AWL is NOT a whitelist or a blacklist, and the scores
don't really work the way they look. The AWL is essentially an
averager, and as such it's sometimes going to assign negative scores to
spam.

This does *NOT* necessarily mean the AWL has whitelisted the sender,
unless it pushes the message below the required_score. It just means that
this spam scored higher than the sender's stored average. For example, if a
spam scoring +20 gets a -5 AWL, the AWL still believes the sender is a
spammer with a +10 average. If that same sender had instead sent a message
scoring 0, the AWL would have given them a +5.
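
The arithmetic behind those two figures can be sketched as follows. This is a
minimal illustration rather than SpamAssassin's actual code: the function name
is made up, and it assumes the adjustment is the difference between the
sender's stored mean and the current score, scaled by a factor of 0.5.

```python
def awl_adjustment(stored_mean: float, current_score: float, factor: float = 0.5) -> float:
    """Points the AWL adds to a message, pulling it toward the sender's mean.

    Hypothetical helper for illustration; factor=0.5 is an assumed default.
    """
    return (stored_mean - current_score) * factor


# Sender with a +10 average sends a spam scoring +20: the AWL gives -5,
# so the message ends up at +15.
print(awl_adjustment(10.0, 20.0))   # -5.0

# Same sender, message scoring 0: the AWL gives +5.
print(awl_adjustment(10.0, 0.0))    # 5.0
```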

Please be sure to read:

http://wiki.apache.org/spamassassin/AwlWrongWay

Before you make too many judgments about what the AWL is doing. Looking
at the score it assigns alone does not tell you anything about what the
AWL is doing.






Re: Annoying auto_whitelist

2009-07-04 Thread Michelle Konzack
Good evening Jari,

On 2009-07-04 13:46:45, Jari Fredriksson wrote:
 http://wiki.apache.org/spamassassin/BetterDocumentation/SqlReadmeAwl

Thank you for the link, but if I understand it right, spamassassin then
uses ONE database/table for ALL users... This means the database will
grow by more than 10,000 rows a day...

Does spamassassin have something like an autoexpire?

Most spams I get have a UNIQUE From: header. I already collect this
information using procmail recipes... And since 2002 I have collected over
27 million different e-mails.

Thanks, Greetings and nice Day/Evening
Michelle Konzack
Systemadministrator
Tamay Dogan Network
Debian GNU/Linux Consultant


-- 
Linux-User #280138 with the Linux Counter, http://counter.li.org/
# Debian GNU/Linux Consultant #
Michelle Konzack   c/o Shared Office KabelBW  ICQ #328449886
+49/177/9351947Blumenstasse 2 MSN LinuxMichi
+33/6/61925193 77694 Kehl/Germany IRC #Debian (irc.icq.com)




Re: Annoying auto_whitelist

2009-07-04 Thread Michelle Konzack
On 2009-07-04 13:12:07, RW wrote:
 So what happens if you don't remove it, what error do you get when you
 run sa-learn?

If I do not remove it before sa-learn --spam, I get a negative AWL
score.

If I remove it and run sa-learn --spam again, AWL is not mentioned.

To avoid having to learn the MEDS spams manually, I have set my MEDS score
to 8.00 and do not get any spams except caNN and genNN.

Thanks, Greetings and nice Day/Evening
Michelle Konzack
Systemadministrator
Tamay Dogan Network
Debian GNU/Linux Consultant




Re: Annoying auto_whitelist

2009-07-04 Thread wolfgang
In an older episode (Saturday, 4. July 2009), Michelle Konzack wrote:

 If I do not remove it beforre sa-learn --spam, I get an  negative 
 AWL score.

 If I remove it, and run sa-learn --spam again, AWL is not 
 mentiioned.

In my understanding, the fact that the From: address is in the AWL 
with a negative score does *not* prevent sa-learn from learning the 
message as spam.

The effect that various tokens from the mail are learned as spammy in 
the Bayes DB is far more important in my view.

And since the sender addresses are unique, their negative AWL score 
won't hurt much IMHO - except for increasing the size of the 
auto_whitelist.

So, removing them may be a good idea, but I don't think it is necessary 
for sa-learn to be effective.

My 0.02 EUR.

Regards,

wolfgang


 To prevent manualy learning of the MEDS spams I have set  my 
 MEDS-Score to 8.00 and do not get any spams except caNN and
 genNN.

 Thanks, Greetings and nice Day/Evening
 Michelle Konzack
 Systemadministrator
 Tamay Dogan Network
 Debian GNU/Linux Consultant


Re: Annoying auto_whitelist

2009-07-04 Thread RW
On Sat, 4 Jul 2009 14:09:29 +0100
RW rwmailli...@googlemail.com wrote:

 On Sat, 04 Jul 2009 08:56:35 -0400
 Matt Kettler mkettler...@verizon.net wrote:
 
  Please be aware the AWL is NOT whitelist, or a blacklist, and the
  scores don't really quite work the way they look. The AWL is
  essentially an averager, and as such, it's sometimes going to assign
  negative scores to spam sometimes.
 
 And it works from its own version of the score that ignores
 whitelisting and bayes scores. So if learning a spam leads to the next
 spam from the same address getting a higher bayes score, that benefit
 isn't washed-out by AWL. 

I take that back, I thought the BAYES_XX rules were ignored by AWL,
but they aren't.

Personally I think BAYES should be ignored by AWL; emails from the same
From address and IP address will have a lot of tokens in common. They
should train quickly, and there shouldn't be any need to damp out
that learning.


Re: Annoying auto_whitelist

2009-07-04 Thread Jari Fredriksson
 Good evening Jari,

 On 2009-07-04 13:46:45, Jari Fredriksson wrote:
 http://wiki.apache.org/spamassassin/BetterDocumentation/SqlReadmeAwl

 Thankyou for the link, but if I understand  it  right,  spamassassin  is
 then using ONE Database/Table for ALL users...  This mean, the  Database
 will grow more then 10.000 ROW's a day...

 Is in spamassassin something like an autoexpire?


You can add a timeupdated field to the awl table with the properties
"default current_timestamp on update current_timestamp", at least in MySQL.

Then run the autoexpire from cron, based on that field.

 Most spams I get are with UNIQUE From: header.  I allready collect  this
 infos using procmail recipes...  And since 2002 I have  collectedt  over
 27 million different E-Mails


That is 100-200 megabytes of data, which your current AWL database must
already contain. Surely no big deal for an RDBMS?


Re: Annoying auto_whitelist

2009-07-04 Thread RW
On Sat, 4 Jul 2009 20:55:12 +0200
Michelle Konzack linux4miche...@tamay-dogan.net wrote:

 On 2009-07-04 13:12:07, RW wrote:
  So what happens if you don't remove it, what error do you get when
  you run sa-learn?
 
 If I do not remove it beforre sa-learn --spam, I get an  negative
 AWL score.
 
 If I remove it, and run sa-learn --spam again, AWL is not
 mentiioned.

If you're interested, what I've done is add the following to my
local.cf:

tflags BAYES_00 noautolearn nice learn
tflags BAYES_05 noautolearn nice learn
tflags BAYES_20 noautolearn nice learn
tflags BAYES_40 noautolearn nice learn
tflags BAYES_50 noautolearn learn
tflags BAYES_60 noautolearn learn
tflags BAYES_80 noautolearn learn
tflags BAYES_95 noautolearn learn
tflags BAYES_99 noautolearn learn

This should completely decouple BAYES and AWL, and so remove the lag
between learning and full-scoring (i.e. no more deleting AWL entries
before sa-learn).

*NOTE* that it does require a one-off reset of the AWL database to avoid
weird AWL scores.
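
To make the intent concrete, here is a sketch of why the extra flag decouples
the two. It assumes, as the settings above rely on, that the AWL leaves out
any rule flagged noautolearn when it computes the score it averages. The
function name, the rule SOME_RULE and all scores are invented for
illustration, and the "stock" flags shown are an assumption too.

```python
def awl_visible_score(hits, scores, tflags):
    """Sum the scores of hit rules, skipping any rule flagged noautolearn.

    Assumption: this mirrors how the AWL picks the score it averages.
    """
    return sum(
        scores[rule]
        for rule in hits
        if "noautolearn" not in tflags.get(rule, "").split()
    )


# Hypothetical rule scores; SOME_RULE is a made-up non-Bayes rule.
scores = {"BAYES_99": 3.5, "SOME_RULE": 2.0}
hits = ["BAYES_99", "SOME_RULE"]

stock = {"BAYES_99": "learn"}                    # assumed stock flags
decoupled = {"BAYES_99": "noautolearn learn"}    # flags from the post above

print(awl_visible_score(hits, scores, stock))      # 5.5
print(awl_visible_score(hits, scores, decoupled))  # 2.0
```

Under that assumption the averaged score no longer moves when a sender's Bayes
classification changes, which is also why stored averages that still include
Bayes points need the one-off reset.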