Re: Annoying auto_whitelist
> > RW wrote:
> > > The much more common scenario is that the first spam hits BAYES_50 and subsequent BAYES_99 hits are countered by a negative AWL score.

> On Fri, 10 Jul 2009 08:09:04 -0400 Matt Kettler mkettler...@verizon.net wrote:
> > Technically, this only counters half the score. It also gets paid back later. It raises the stored average that will apply to subsequent messages.

On 10.07.09 18:57, RW wrote:
> So what's the point of including BAYES_99 in AWL?

The point is not to exclude very useful information, such as the score of BAYES_00 or BAYES_99, from what is applied to later e-mail.

> But there's only a benefit if the BAYES_XX score falls, otherwise the distortion to the score just gets less bad - I don't see how you can describe that as paid back.

> > I'd also argue it's a rather rare case. Most of my spam hits BAYES_99 the first shot around, and most has varying sender address and IP. The odds of one having increasing score and the same sender address/IP seem extraordinarily unlikely to me.

> If something scarcely ever makes a difference, and on the occasion it does, gets it wrong more often than it gets it right, I don't see the point in keeping it.

That paragraph was about AWL as a whole, not about including or excluding BAYES scores in it.

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Honk if you love peace and quiet.
Re: Annoying auto_whitelist
> > On Sat, 04 Jul 2009 08:56:35 -0400 Matt Kettler mkettler...@verizon.net wrote:
> > > Please be aware the AWL is NOT a whitelist, or a blacklist, and the scores don't really quite work the way they look. The AWL is essentially an averager, and as such, it's sometimes going to assign negative scores to spam.
> >
> > And it works from its own version of the score that ignores whitelisting and bayes scores. So if learning a spam leads to the next spam from the same address getting a higher bayes score, that benefit isn't washed-out by AWL.

On 04.07.09 22:42, RW wrote:
> I take that back, I thought the BAYES_XX rules were ignored by AWL, but they aren't. Personally I think BAYES should be ignored by AWL; e-mails from the same From: address and IP address will have a lot of tokens in common. They should train quickly, and there shouldn't be any need to damp out that learning.

I don't think so. Teaching BAYES is a good way to hint to AWL which way it should push scores. By ignoring bayes, you could move much spam the ham way, since much spam isn't caught by any scores other than BAYES, and vice versa.
Re: Annoying auto_whitelist
On Fri, 10 Jul 2009 12:33:51 +0200 Matus UHLAR - fantomas uh...@fantomas.sk wrote:
> > > On Sat, 04 Jul 2009 08:56:35 -0400 Matt Kettler mkettler...@verizon.net wrote:
> > > > Please be aware the AWL is NOT a whitelist, or a blacklist, and the scores don't really quite work the way they look. The AWL is essentially an averager, and as such, it's sometimes going to assign negative scores to spam.
> > >
> > > And it works from its own version of the score that ignores whitelisting and bayes scores. So if learning a spam leads to the next spam from the same address getting a higher bayes score, that benefit isn't washed-out by AWL.
>
> On 04.07.09 22:42, RW wrote:
> > I take that back, I thought the BAYES_XX rules were ignored by AWL, but they aren't. Personally I think BAYES should be ignored by AWL; e-mails from the same From: address and IP address will have a lot of tokens in common. They should train quickly, and there shouldn't be any need to damp out that learning.
>
> I don't think so. Teaching BAYES is a good way to hint to AWL which way it should push scores. By ignoring bayes, you could move much spam the ham way, since much spam isn't caught by any scores other than BAYES, and vice versa.

Right, but that's only a benefit if the BAYES score drops - remember it's an averaging system.

Personally I only have a single spam in my spam corpus that has an AWL hit and doesn't hit BAYES_99, and that hits BAYES_95. Sending multiple spams from the same From: address and IP address is a gift to Bayesian filters.

The much more common scenario is that the first spam hits BAYES_50 and subsequent BAYES_99 hits are countered by a negative AWL score.
Re: Annoying auto_whitelist
RW wrote:
> On Fri, 10 Jul 2009 12:33:51 +0200 Matus UHLAR - fantomas uh...@fantomas.sk wrote:
> > > > On Sat, 04 Jul 2009 08:56:35 -0400 Matt Kettler mkettler...@verizon.net wrote:
> > > > > Please be aware the AWL is NOT a whitelist, or a blacklist, and the scores don't really quite work the way they look. The AWL is essentially an averager, and as such, it's sometimes going to assign negative scores to spam.
> > > >
> > > > And it works from its own version of the score that ignores whitelisting and bayes scores. So if learning a spam leads to the next spam from the same address getting a higher bayes score, that benefit isn't washed-out by AWL.
> >
> > On 04.07.09 22:42, RW wrote:
> > > I take that back, I thought the BAYES_XX rules were ignored by AWL, but they aren't. Personally I think BAYES should be ignored by AWL; e-mails from the same From: address and IP address will have a lot of tokens in common. They should train quickly, and there shouldn't be any need to damp out that learning.
> >
> > I don't think so. Teaching BAYES is a good way to hint to AWL which way it should push scores. By ignoring bayes, you could move much spam the ham way, since much spam isn't caught by any scores other than BAYES, and vice versa.
>
> Right, but that's only a benefit if the BAYES score drops - remember it's an averaging system.
>
> Personally I only have a single spam in my spam corpus that has an AWL hit and doesn't hit BAYES_99, and that hits BAYES_95. Sending multiple spams from the same From: address and IP address is a gift to Bayesian filters.
>
> The much more common scenario is that the first spam hits BAYES_50 and subsequent BAYES_99 hits are countered by a negative AWL score.

Technically, this only counters half the score. It also gets paid back later. It raises the stored average that will apply to subsequent messages.

I'd also argue it's a rather rare case. Most of my spam hits BAYES_99 the first shot around, and most has varying sender address and IP. The odds of one having increasing score and the same sender address/IP seem extraordinarily unlikely to me.

Besides, the real problem there isn't the AWL, but the fact that the first message scored low.

Are you really seeing cases where this is causing false negatives, or are you just pontificating about what's possible?
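The "counters half the score" and "raises the stored average" points can be followed with a small simulation. This is only a sketch: it assumes the averaging rule from the AWL plugin documentation (adjustment = (mean - score) * factor, with the default factor of 0.5), assumes the pre-AWL score is what gets added to the stored total, and uses made-up scores of 2.0 and 6.0 rather than real BAYES_50/BAYES_99 values. The name `simulate` is hypothetical.

```python
def simulate(raw_scores, factor=0.5):
    """Replay a sender's messages through a simplified AWL-style averager."""
    total, count = 0.0, 0          # what the AWL row would store (totscore, count)
    finals = []
    for raw in raw_scores:
        if count:
            mean = total / count
            delta = (mean - raw) * factor   # pull halfway toward the stored mean
        else:
            delta = 0.0                     # no history yet, no adjustment
        finals.append(round(raw + delta, 2))
        total += raw                        # assumption: the pre-AWL score is stored
        count += 1
    return finals

# First spam scores low, the following ones score high (illustrative numbers).
finals = simulate([2.0, 6.0, 6.0, 6.0])
print(finals)  # [2.0, 4.0, 5.0, 5.33]
```

With these numbers the second message loses half of the 4-point gap, and each later message loses less as the stored mean climbs, which is the behaviour both sides of the argument are describing.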
Re: Annoying auto_whitelist
On Fri, 10 Jul 2009 08:09:04 -0400 Matt Kettler mkettler...@verizon.net wrote:
> RW wrote:
> > The much more common scenario is that the first spam hits BAYES_50 and subsequent BAYES_99 hits are countered by a negative AWL score.
>
> Technically, this only counters half the score. It also gets paid back later. It raises the stored average that will apply to subsequent messages.

But there's only a benefit if the BAYES_XX score falls, otherwise the distortion to the score just gets less bad - I don't see how you can describe that as paid back.

> I'd also argue it's a rather rare case. Most of my spam hits BAYES_99 the first shot around, and most has varying sender address and IP. The odds of one having increasing score and the same sender address/IP seem extraordinarily unlikely to me.

So what's the point of including BAYES_99 in AWL? If something scarcely ever makes a difference, and on the occasion it does, gets it wrong more often than it gets it right, I don't see the point in keeping it.
Re: Annoying auto_whitelist
On Sat, July 4, 2009 10:20, Michelle Konzack wrote:
> ...because the spammer's From: is in the auto_whitelist.

Argh :/

The From: address and the SENDER IP are both in the AWL table, so where is the problem? If you match the sender IP very well (fuzzy /16 match), then I see the problem.

And by the way, the AWL is NOT a whitelist!

-- 
xpoint
Re: Annoying auto_whitelist
On Sat, July 4, 2009 20:50, Michelle Konzack wrote:
> Good evening Jari,
>
> On 2009-07-04 13:46:45, Jari Fredriksson wrote:
> > http://wiki.apache.org/spamassassin/BetterDocumentation/SqlReadmeAwl
>
> Thank you for the link, but if I understand it right, SpamAssassin then uses ONE database/table for ALL users... This means the database will grow by more than 10,000 rows a day... Does SpamAssassin have something like an autoexpire?
>
> Most spams I get have a UNIQUE From: header. I already collect this info using procmail recipes... And since 2002 I have collected over 27 million different e-mails.

CREATE TABLE `awl` (
  `username` varchar(100) NOT NULL default '',
  `email` varchar(200) NOT NULL default '',
  `ip` varchar(10) NOT NULL default '',
  `count` int(11) default '0',
  `totscore` float default '0',
  `lastupdate` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  PRIMARY KEY (`username`,`email`,`ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `bayes_seen` (
  `id` int(11) NOT NULL default '0',
  `msgid` varchar(200) character set utf8 collate utf8_bin NOT NULL default '',
  `flag` char(1) NOT NULL default '',
  `lastupdate` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`,`msgid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Everything else expires natively in SA. With the lastupdate column, the two tables above can now be expired from a cron job; how to do that is up to others to decide :)

-- 
xpoint
Re: Annoying auto_whitelist
On Sat, July 4, 2009 20:55, Michelle Konzack wrote:
> To prevent manually learning the MEDS spams I have set my MEDS score to 8.00 and do not get any spams except caNN and genNN.

perldoc Mail::SpamAssassin::Plugin::AWL

See the AWL factor setting (auto_whitelist_factor); its default is 0.5. If you don't like that, change it to 0.25, and then the spammer benefits less if he has used your email / IP. Got it?

-- 
xpoint
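To see what lowering the factor does, here is a rough sketch using the formula given in the AWL plugin documentation, final = score + (mean - score) * factor. The threshold of 5.0 is assumed (SpamAssassin's usual default required_score), the sender mean of 0.0 is invented for illustration, and `final_score` is a hypothetical helper, not part of SpamAssassin.

```python
REQUIRED_SCORE = 5.0  # assumed threshold; adjust to your own required_score

def final_score(score, sender_mean, factor):
    """Score after the AWL pulls it toward the sender's stored mean."""
    return score + (sender_mean - score) * factor

# An 8.0-point spam from an address whose stored mean is 0.0 (illustrative).
raw, mean = 8.0, 0.0
with_default = final_score(raw, mean, 0.5)
with_lower = final_score(raw, mean, 0.25)

print(with_default, with_default >= REQUIRED_SCORE)  # 4.0 False
print(with_lower, with_lower >= REQUIRED_SCORE)      # 6.0 True
```

In this made-up case the default factor drags the message under the threshold, while 0.25 leaves it above. A lower factor also weakens the AWL's correction for legitimate senders, so it is a trade-off rather than a pure win.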
Annoying auto_whitelist
Hello,

I currently get several thousand shop/meds/pill/gen spams a day, and some get through my filters. I have to move them to my spam folder manually and feed them to

sa-learn --spam

but this does not work...

...because the spammer's From: is in the auto_whitelist.

To me this seems to be a bug, because sa-learn has to remove the From: from the auto_whitelist and then RESCAN this crap.

Over the last two days I uncompressed the spam archives from the last 27 weeks (from this year), used formail to extract all From: addresses, de-duplicated them and ran

for FROM in ${LIST} ; do
    spamassassin --remove-addr-from-whitelist=${FROM}
done

which took over 52 hours for 487000 e-mails. Hell, I have a super fast machine with 15000 rpm SCSI drives and 32 GByte of memory. That is 2.6 e-mails per second... Why is this so slow?

On my intranet server, a NEC 4500MH (quad Xeon, 550 MHz / 4 GByte), it takes around 5-11 seconds to remove a single e-mail address.

michelle.konz...@vserver1:~$ apt-cache policy spamassassin
spamassassin:
  Installed: 3.2.5-2
  Candidate: 3.2.5-2
  Version table:
 *** 3.2.5-2 0
        500 http://ftp.de.debian.org lenny/main Packages
        100 /var/lib/dpkg/status

Thanks, Greetings and nice Day/Evening
Michelle Konzack
Systemadministrator
25.9V Electronic Engineer
Tamay Dogan Network
Debian GNU/Linux Consultant

-- 
Linux-User #280138 with the Linux Counter, http://counter.li.org/
# Debian GNU/Linux Consultant #
http://www.tamay-dogan.net/
http://www.can4linux.org/
http://www.flexray4linux.org/
Michelle Konzack
c/o Vertriebsp. KabelBW
Blumenstrasse 2
77694 Kehl/Germany
Jabber linux4miche...@jabber.ccc.de
IRC #Debian (irc.icq.com)
ICQ #328449886
Tel. DE: +49 177 9351947
Tel. FR: +33 6 61925193
Re: Annoying auto_whitelist
> Hello,
>
> I currently get several thousand shop/meds/pill/gen spams a day, and some get through my filters. I have to move them to my spam folder manually and feed them to
>
> sa-learn --spam
>
> but this does not work...
>
> ...because the spammer's From: is in the auto_whitelist.
>
> To me this seems to be a bug, because sa-learn has to remove the From: from the auto_whitelist and then RESCAN this crap.
>
> Over the last two days I uncompressed the spam archives from the last 27 weeks (from this year), used formail to extract all From: addresses, de-duplicated them and ran
>
> for FROM in ${LIST} ; do
>     spamassassin --remove-addr-from-whitelist=${FROM}
> done
>
> which took over 52 hours for 487000 e-mails. Hell, I have a super fast machine with 15000 rpm SCSI drives and 32 GByte of memory. That is 2.6 e-mails per second...

Do you have an SQL-based AWL? If not, it might be worth considering, given your volume of e-mail.

With SQL

for FROM in ${LIST} ; do
mysql -u spamassassin -psecret spamassassin <<EOF
delete from awl where email='${FROM}' ;
EOF
done

should be MUCH faster.
Re: Annoying auto_whitelist
On 2009-07-04 11:53:27, Jari Fredriksson wrote:
> Do you have an SQL-based AWL? If not, it might be worth considering, given your volume of e-mail.

AWL in SQL?

Yes, I have a PostgreSQL database available (that is, each user has one), but how can I set up SpamAssassin to use it?

> With SQL
>
> for FROM in ${LIST} ; do
> mysql -u spamassassin -psecret spamassassin <<EOF
> delete from awl where email='${FROM}' ;
> EOF
> done
>
> should be MUCH faster.

I would like to try it out, but how do I set it up?

Thanks, Greetings and nice Day/Evening
Michelle Konzack
Re: Annoying auto_whitelist
On Sat, Jul 04, 2009 at 11:53:27AM +0300, Jari Fredriksson wrote:
> > Hello,
> >
> > I currently get several thousand shop/meds/pill/gen spams a day, and some get through my filters. I have to move them to my spam folder manually and feed them to
> >
> > sa-learn --spam
> >
> > but this does not work...
> >
> > ...because the spammer's From: is in the auto_whitelist.
> >
> > To me this seems to be a bug, because sa-learn has to remove the From: from the auto_whitelist and then RESCAN this crap.
> >
> > Over the last two days I uncompressed the spam archives from the last 27 weeks (from this year), used formail to extract all From: addresses, de-duplicated them and ran
> >
> > for FROM in ${LIST} ; do
> >     spamassassin --remove-addr-from-whitelist=${FROM}
> > done
> >
> > which took over 52 hours for 487000 e-mails. Hell, I have a super fast machine with 15000 rpm SCSI drives and 32 GByte of memory. That is 2.6 e-mails per second...

You are loading a big Perl program for every single e-mail address, what do you expect? ;)

You should edit the database directly. If you are not using SQL, it's a bit trickier; you could modify trim_whitelist to do it, etc.

> Do you have an SQL-based AWL? If not, it might be worth considering, given your volume of e-mail.
>
> With SQL
>
> for FROM in ${LIST} ; do
> mysql -u spamassassin -psecret spamassassin <<EOF
> delete from awl where email='${FROM}' ;
> EOF
> done
>
> should be MUCH faster.

It's possible that $FROM contains quote characters, so that should be handled. It's always good practice, even though I doubt any e-mails contain SQL injections.

Also, you could write all the SQL statements to a file first and then run that, to avoid the same pitfall as above (one process start per address), though on a smaller scale. ;)
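Both suggestions above (handle quote characters, avoid starting a new process per address) can be sketched in one short script. This is an illustration only: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the real MySQL or PostgreSQL AWL, the table follows the `awl` layout posted elsewhere in this thread, and the sample rows and the `remove_senders` helper are made up. A real setup would swap in the matching database driver, whose placeholder syntax may differ.

```python
import sqlite3

# In-memory stand-in for the SQL AWL; columns follow the awl table from the thread.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE awl (username TEXT, email TEXT, ip TEXT, '
    '"count" INTEGER, totscore REAL, PRIMARY KEY (username, email, ip))'
)
conn.executemany(
    "INSERT INTO awl VALUES (?, ?, ?, ?, ?)",
    [
        ("michelle", "spammer@example.com", "192.0", 3, 12.5),
        ("michelle", "o'brien@example.net", "198.51", 1, 7.0),  # contains a quote
        ("michelle", "friend@example.org", "203.0", 9, -4.0),
    ],
)

def remove_senders(conn, addresses):
    """Delete AWL rows for all addresses in one transaction, using bound parameters."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "DELETE FROM awl WHERE email = ?",
            [(addr,) for addr in addresses],
        )

remove_senders(conn, ["spammer@example.com", "o'brien@example.net"])
remaining = [row[0] for row in conn.execute("SELECT email FROM awl ORDER BY email")]
print(remaining)  # ['friend@example.org']
```

Because the address is passed as a bound parameter rather than pasted into the statement, the quote in `o'brien@example.net` needs no escaping, and the whole list is handled by one process and one transaction.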
Re: Annoying auto_whitelist
> On 2009-07-04 11:53:27, Jari Fredriksson wrote:
> > Do you have an SQL-based AWL? If not, it might be worth considering, given your volume of e-mail.
>
> AWL in SQL?
>
> Yes, I have a PostgreSQL database available (that is, each user has one), but how can I set up SpamAssassin to use it?

http://wiki.apache.org/spamassassin/BetterDocumentation/SqlReadmeAwl
Re: Annoying auto_whitelist
On Sat, 4 Jul 2009 10:20:06 +0200 Michelle Konzack linux4miche...@tamay-dogan.net wrote:
> Hello,
>
> I currently get several thousand shop/meds/pill/gen spams a day, and some get through my filters. I have to move them to my spam folder manually and feed them to
>
> sa-learn --spam
>
> but this does not work...
>
> ...because the spammer's From: is in the auto_whitelist.
>
> To me this seems to be a bug, because sa-learn has to remove the From: from the auto_whitelist and then RESCAN this crap.

So what happens if you don't remove it? What error do you get when you run sa-learn?
Re: Annoying auto_whitelist
Michelle Konzack wrote:
> Hello,
>
> I currently get several thousand shop/meds/pill/gen spams a day, and some get through my filters. I have to move them to my spam folder manually and feed them to
>
> sa-learn --spam
>
> but this does not work...
>
> ...because the spammer's From: is in the auto_whitelist.

Wait a second. The AWL has nothing to do with bayes or sa-learn. The only reason SA won't learn a message as spam would be if it has already been learned as spam, as noted in the bayes_seen database (or corresponding SQL table).

> To me this seems to be a bug, because sa-learn has to remove the From: from the auto_whitelist and then RESCAN this crap.

Um, the AWL has nothing to do with sa-learn --spam, and this action will neither consult nor modify the AWL.

What makes you think the AWL is inhibiting learning? The AWL is actually going to contain *EVERY* sender that ever sent you e-mail (because it is an averager, not a whitelist), so if it inhibited learning, you'd never be able to learn anything.
Re: Annoying auto_whitelist
On Sat, 04 Jul 2009 08:56:35 -0400 Matt Kettler mkettler...@verizon.net wrote:
> Please be aware the AWL is NOT a whitelist, or a blacklist, and the scores don't really quite work the way they look. The AWL is essentially an averager, and as such, it's sometimes going to assign negative scores to spam.

And it works from its own version of the score that ignores whitelisting and bayes scores. So if learning a spam leads to the next spam from the same address getting a higher bayes score, that benefit isn't washed-out by AWL.
Re: Annoying auto_whitelist
Michelle Konzack wrote:
> Hello,
>
> I currently get several thousand shop/meds/pill/gen spams a day, and some get through my filters. I have to move them to my spam folder manually and feed them to
>
> sa-learn --spam
>
> but this does not work...
>
> ...because the spammer's From: is in the auto_whitelist.
>
> To me this seems to be a bug, because sa-learn has to remove the From: from the auto_whitelist and then RESCAN this crap.

Is the AWL actually causing false negatives?

Please be aware the AWL is NOT a whitelist, or a blacklist, and the scores don't really quite work the way they look. The AWL is essentially an averager, and as such, it's sometimes going to assign negative scores to spam.

This does *NOT* necessarily mean the AWL has whitelisted the sender, unless it pushes the message below the required_score. It just means that this spam scored higher than the last one.

i.e.: if a spam scoring +20 gets a -5 AWL, the AWL still believes the sender is a spammer with a +10 average. If that same sender had instead sent a message scoring 0, the AWL would have given them a +5.

Please be sure to read:

http://wiki.apache.org/spamassassin/AwlWrongWay

before you make too many judgments about what the AWL is doing. Looking at the score it assigns alone does not tell you anything about what the AWL is doing.
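The +20 / -5 / +10 arithmetic above can be checked with a few lines. This is a minimal sketch, assuming the rule described in the AWL plugin documentation (adjustment = (mean - score) * factor) and the default factor of 0.5; `awl_delta` is a hypothetical helper name, not a SpamAssassin function.

```python
def awl_delta(score, sender_mean, factor=0.5):
    """AWL adjustment that pulls a message's score toward the sender's stored mean."""
    return (sender_mean - score) * factor

# Sender whose stored average is +10, as in the example above.
spam_delta = awl_delta(20.0, 10.0)    # a +20 spam is pulled down
quiet_delta = awl_delta(0.0, 10.0)    # a 0-point message is pulled up

print(spam_delta, 20.0 + spam_delta)    # -5.0 15.0
print(quiet_delta, 0.0 + quiet_delta)   # 5.0 5.0
```

The negative AWL value on the +20 spam still leaves it at +15, well above a typical threshold, which is why a negative AWL score on its own says little about whether the sender is being treated as trusted.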
Re: Annoying auto_whitelist
Good evening Jari,

On 2009-07-04 13:46:45, Jari Fredriksson wrote:
> http://wiki.apache.org/spamassassin/BetterDocumentation/SqlReadmeAwl

Thank you for the link, but if I understand it right, SpamAssassin then uses ONE database/table for ALL users... This means the database will grow by more than 10,000 rows a day... Does SpamAssassin have something like an autoexpire?

Most spams I get have a UNIQUE From: header. I already collect this info using procmail recipes... And since 2002 I have collected over 27 million different e-mails.

Thanks, Greetings and nice Day/Evening
Michelle Konzack
Re: Annoying auto_whitelist
On 2009-07-04 13:12:07, RW wrote:
> So what happens if you don't remove it? What error do you get when you run sa-learn?

If I do not remove it before sa-learn --spam, I get a negative AWL score. If I remove it and run sa-learn --spam again, AWL is not mentioned.

To prevent manually learning the MEDS spams I have set my MEDS score to 8.00 and do not get any spams except caNN and genNN.

Thanks, Greetings and nice Day/Evening
Michelle Konzack
Re: Annoying auto_whitelist
In an older episode (Saturday, 4. July 2009), Michelle Konzack wrote:
> If I do not remove it before sa-learn --spam, I get a negative AWL score. If I remove it and run sa-learn --spam again, AWL is not mentioned.

In my understanding, the fact that the From: address is in the AWL with a negative score does *not* prevent sa-learn from learning the message as spam. The effect of the various tokens from the mail being learned as spammy in the Bayes DB is far more important in my view. And since the sender addresses are unique, their negative AWL score won't hurt much IMHO - except for increasing the size of the auto_whitelist.

So, removing them may be a good idea, but I don't think it is necessary for sa-learn to be effective.

My 0.02 EUR.

Regards,
wolfgang

> To prevent manually learning the MEDS spams I have set my MEDS score to 8.00 and do not get any spams except caNN and genNN.
Re: Annoying auto_whitelist
On Sat, 4 Jul 2009 14:09:29 +0100 RW rwmailli...@googlemail.com wrote:
> On Sat, 04 Jul 2009 08:56:35 -0400 Matt Kettler mkettler...@verizon.net wrote:
> > Please be aware the AWL is NOT a whitelist, or a blacklist, and the scores don't really quite work the way they look. The AWL is essentially an averager, and as such, it's sometimes going to assign negative scores to spam.
>
> And it works from its own version of the score that ignores whitelisting and bayes scores. So if learning a spam leads to the next spam from the same address getting a higher bayes score, that benefit isn't washed-out by AWL.

I take that back, I thought the BAYES_XX rules were ignored by AWL, but they aren't.

Personally I think BAYES should be ignored by AWL; e-mails from the same From: address and IP address will have a lot of tokens in common. They should train quickly, and there shouldn't be any need to damp out that learning.
Re: Annoying auto_whitelist
> Good evening Jari,
>
> On 2009-07-04 13:46:45, Jari Fredriksson wrote:
> > http://wiki.apache.org/spamassassin/BetterDocumentation/SqlReadmeAwl
>
> Thank you for the link, but if I understand it right, SpamAssassin then uses ONE database/table for ALL users... This means the database will grow by more than 10,000 rows a day... Does SpamAssassin have something like an autoexpire?

You can add a timeupdated field to the awl table with the properties

default current_timestamp on update current_timestamp

at least in MySQL, and run the autoexpire from cron using that field.

> Most spams I get have a UNIQUE From: header. I already collect this info using procmail recipes... And since 2002 I have collected over 27 million different e-mails.

That is 100-200 megabytes of data, which your current AWL database must contain already. No big deal for an RDBMS?
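A cron-driven expiry along those lines could look roughly like the sketch below. It is not SpamAssassin code: sqlite3 with an in-memory table stands in for the real database, the timestamp column is called `lastupdate` here (as in the table definition posted elsewhere in the thread; use `timeupdated` or whatever name you chose), and the 90-day cutoff, the sample rows and the `expire_awl` helper are all assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timedelta

def expire_awl(conn, now, max_age_days):
    """Delete AWL rows not updated within max_age_days; return how many were removed."""
    cutoff = (now - timedelta(days=max_age_days)).strftime("%Y-%m-%d %H:%M:%S")
    with conn:  # one transaction for the whole purge
        cur = conn.execute("DELETE FROM awl WHERE lastupdate < ?", (cutoff,))
    return cur.rowcount

# Stand-in table with only the columns this sketch needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE awl (email TEXT, lastupdate TEXT)")
conn.executemany(
    "INSERT INTO awl VALUES (?, ?)",
    [
        ("old@example.com", "2009-01-01 00:00:00"),
        ("recent@example.com", "2009-07-01 00:00:00"),
    ],
)

# "now" is passed in so the run is reproducible; a cron job would use datetime.now().
removed = expire_awl(conn, datetime(2009, 7, 10), 90)
left = [row[0] for row in conn.execute("SELECT email FROM awl")]
print(removed, left)  # 1 ['recent@example.com']
```

Since most of the spam addresses are seen only once, an age-based purge like this would remove exactly those one-off rows while keeping the regular correspondents whose rows keep being refreshed.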
Re: Annoying auto_whitelist
On Sat, 4 Jul 2009 20:55:12 +0200 Michelle Konzack linux4miche...@tamay-dogan.net wrote:
> On 2009-07-04 13:12:07, RW wrote:
> > So what happens if you don't remove it? What error do you get when you run sa-learn?
>
> If I do not remove it before sa-learn --spam, I get a negative AWL score. If I remove it and run sa-learn --spam again, AWL is not mentioned.

If you're interested, what I've done is add the following to my local.cf:

tflags BAYES_00 noautolearn nice learn
tflags BAYES_05 noautolearn nice learn
tflags BAYES_20 noautolearn nice learn
tflags BAYES_40 noautolearn nice learn
tflags BAYES_50 noautolearn learn
tflags BAYES_60 noautolearn learn
tflags BAYES_80 noautolearn learn
tflags BAYES_95 noautolearn learn
tflags BAYES_99 noautolearn learn

This should completely decouple BAYES and AWL, and so remove the lag between learning and full scoring (i.e. no more deleting AWL entries before sa-learn).

*NOTE* that it does require a one-off reset of the AWL database to avoid weird AWL scores.