How to automatically train each users Bayes?
Hi,

I would like to automatically train each user's Bayes database in the following way. Once a day, for each user, run:

1.) sa-learn -u username --ham ../maildir/cur
2.) sa-learn -u username --spam ../maildir/.Spam/cur

The idea is to train Bayes for each user without the users having to take care of learning spam/ham on their own. The reason for taking the cur folder instead of the new folder is that I assume the contents of cur have already been checked for false positives/negatives by the user.

A problem could occur when the user always deletes all mail in .Spam/cur: then Bayes is only ever trained with ham, never with spam. Or isn't that a problem?

What do you think about this strategy?

Thanks,
Michael
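The daily job described above could be sketched roughly as follows. This is only an illustration, not a recommended setup: the `sa-learn -u`, `--ham` and `--spam` options are the ones from the post, while the user name, the Maildir location and the dry-run printing are assumptions made up for the example.

```python
from pathlib import PurePosixPath


def build_sa_learn_commands(username, maildir):
    """Return the two sa-learn invocations from the post for one user."""
    maildir = PurePosixPath(maildir)
    return [
        # Everything in cur is treated as ham (the assumption questioned in the replies).
        ["sa-learn", "-u", username, "--ham", str(maildir / "cur")],
        # Everything in .Spam/cur is treated as spam.
        ["sa-learn", "-u", username, "--spam", str(maildir / ".Spam" / "cur")],
    ]


# Hypothetical user-to-Maildir mapping; a real system would read this from its user database.
users = {"alice": "/var/vmail/alice/maildir"}

for user, maildir in users.items():
    for cmd in build_sa_learn_commands(user, maildir):
        # Dry run: print the command instead of executing it.
        print(" ".join(cmd))
```

To actually run the commands from cron, the `print` call would be swapped for something like `subprocess.run(cmd, check=True)`.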
Re: How to automatically train each users Bayes?
On 27.03.2015 at 16:16, Michael wrote:
> I would like automatically learn each users Bayes database in the following way: Do the following once a day for each user:
> 1.) sa-learn -u username --ham ../maildir/cur
> 2.) sa-learn -u username --spam ../maildir/.Spam/cur
> The idea is to train the Bayes for each user without the need to take care of learning Spam/Ham on their own. The reason for taking the cur folder instead of the new folder is that I assume that the contents of these folders have already been verified for false-positives/negatives by the user. A problem that could occur is when the user always deletes all mails in .Spam/cur. Then the Bayes is only trained with Ham, but never Spam. Or isn't that a problem? What do you think about this strategy?

Nothing good, because in that case you can just stay with autolearning, which is on by default once Bayes has the at least 200 ham and 200 spam samples it needs to be enabled at all.
Re: Uptick in spam
On 03/27/2015 08:20 PM, Amir Caspi wrote:
> On Mar 27, 2015, at 12:56 PM, Matus UHLAR - fantomas uh...@fantomas.sk wrote:
> > I see no network checks here... do you use network checks?
>
> On Mar 27, 2015, at 1:11 PM, Kevin A. McGrail kmcgr...@pccc.com wrote:
> > Are you using network tests? These are scoring pretty high for me.
>
> I presume you're talking about things like Razor, Pyzor, DCC, and various RBLs? Yes, those are enabled. The reason you're not seeing them is because they didn't hit when the messages were first received. I'm getting the same hits NOW that you are seeing, but those did NOT hit when the messages first arrived. Remember that these messages were received a number of hours ago, so they have had plenty of time to be listed on RBLs and hash DBs in the intervening period. They were clearly not listed there when these messages were received, which is exactly why these messages are FNs. If they were received now, they wouldn't be... but they were back then. This is why I said in the prior message that it appears my user is one of the unlucky folks getting these in the very first distribution, before they've had a chance to be reported to RBLs and hash DBs. Some poor schmoe has to be in the first distribution, and it appears that he's one of them. This is why I'm looking for other, template-like rules that can be used to identify these things, because right now it seems my user is getting them on the first run before the network tests are useful. But, yes, network tests are absolutely enabled.

Are you using MailScanner? If yes, then it's you munging the URIs, so they are breaking lookups on any hash type, as in http://pastebin.com/LaKT5ZZK

And if you are indeed using MailScanner, are you sending it the full message or only some chunk? (I can't remember the settings' names.)
Re: Uptick in spam
On Mar 27, 2015, at 12:56 PM, Matus UHLAR - fantomas uh...@fantomas.sk wrote: I see no network checks here... do you use network checks? On Mar 27, 2015, at 1:11 PM, Kevin A. McGrail kmcgr...@pccc.com wrote: Are you using network tests? These are scoring pretty high for me. I presume you're talking about things like Razor, Pyzor, DCC, and various RBLs? Yes, those are enabled. The reason you're not seeing them is because they didn't hit when the messages were first received. I'm getting the same hits NOW that you are seeing, but those did NOT hit when the messages first arrived. Remember that these messages were received a number of hours ago, so they have had plenty of time to be listed on RBLs and hash DBs in the intervening period. They were clearly not listed there when these messages were received, which is exactly why these messages are FNs. If they were received now, they wouldn't be... but they were back then. This is why I said in the prior message that it appears my user is one of the unlucky folks getting these in the very first distribution, before they've had a chance to be reported to RBLs and hash DBs. Some poor schmoe has to be in the first distribution, and it appears that he's one of them. This is why I'm looking for other, template-like rules that can be used to identify these things, because right now it seems my user is getting them on the first run before the network tests are useful. But, yes, network tests are absolutely enabled. Cheers. --- Amir
Re: Uptick in spam
On Mar 27, 2015, at 1:20 PM, Axb axb.li...@gmail.com wrote:
> These three samples are very different in the sense that #1 is a hacked site, #2 #3 are the regular snowshoe.

Of course, I picked three different samples on purpose. But, I have hundreds that replicate these.

> What I miss in your sample's SA reports are any URIBL hits of some sort.

Because there were no hits. That's exactly the point.

> Are you doing URIBL lookups? and using RAZOR PYZOR?

Yes, using Razor, Pyzor, and DCC. Also using all default RBLs and URIBLs.

Per my last message, the whole issue is that my user appears to be getting the hot-off-the-presses run of these spams, before they have been reported to the RBLs, URIBLs, and hash DBs like Razor and Pyzor. Therefore, none of the network checks are getting hit... they are absolutely enabled, and a few hours later they would hit high scores, but upon initial receipt they simply do not hit because the spam is too new.

This is my whole issue -- since my user appears to be very high up on the recipient list for all these spammers, and is therefore getting spams before the network checks are effective, how can we combat these new spams _before_ the network checks become effective?

Thanks.
--- Amir
Re: Uptick in spam
On Mar 27, 2015, at 1:33 PM, Axb axb.li...@gmail.com wrote: Are you using Mailscanner? if yes then it's you munging URIS so they breaking lookups on any hash type as in Yes, I am using MailScanner. Some URIs are munged, others are not. For example, you can see in that very pastebin you noted that there are a number of perfectly good URIs. MailScanner will munge the embedded image web bugs and the embedded JavaScript, but will not munge regular href links or regular img links. In that sample, the only MailScanner munging is on JavaScript. But, you're saying MailScanner is changing the message and therefore changing the hash overall... yes? Would you recommend not running MailScanner? If so, what would you recommend for virus scanning? Or, would you recommend turning off munging for embedded JS and web bugs? (But, keeping the virus scanning?) Of course, removing munging opens other vulnerabilities... Note that my spam setup is as follows: sendmail - MailScanner (system-wide, root-owned) - spamc/spamd (per-user, via procmail) Unfortunately due to the nature of the virtual-host setup on this machine I _cannot_ have MailScanner be the SA glue, nor can I easily switch to SA milters like spamass-milter or amavisd or whatever. Right now, this setup is unfortunately not changeable. And if you're indeed using MailScanner are you sending it the full message or some chunk only? (can't remember the settings's names) I am passing in the entire message. Thanks. --- Amir
Re: Uptick in spam
On 03/27/2015 08:45 PM, Amir Caspi wrote:
> On Mar 27, 2015, at 1:33 PM, Axb axb.li...@gmail.com wrote:
> > Are you using Mailscanner? if yes then it's you munging URIS so they breaking lookups on any hash type as in
>
> Yes, I am using MailScanner. Some URIs are munged, others are not. For example, you can see in that very pastebin you noted that there are a number of perfectly good URIs. MailScanner will munge the embedded image web bugs and the embedded JavaScript, but will not munge regular href links or regular img links. In that sample, the only MailScanner munging is on JavaScript. But, you're saying MailScanner is changing the message and therefore changing the hash overall... yes? Would you recommend not running MailScanner? If so, what would you recommend for virus scanning? Or, would you recommend turning off munging for embedded JS and web bugs? (But, keeping the virus scanning?) Of course, removing munging opens other vulnerabilities...

I used MS for a few years. It did the job. As an AV product I'd recommend Sophos AND ESETS/Nod32. I'd also suggest you disable message munging if you want the hashers to work. URI lists may also list URIs to .js and web bugs, so you could be missing out on those hits.

> Note that my spam setup is as follows:
> sendmail -> MailScanner (system-wide, root-owned) -> spamc/spamd (per-user, via procmail)
> Unfortunately due to the nature of the virtual-host setup on this machine I _cannot_ have MailScanner be the SA glue, nor can I easily switch to SA milters like spamass-milter or amavisd or whatever. Right now, this setup is unfortunately not changeable.

Are you an ISP/ASP or is this a corporate box? What are you really using MailScanner for? I also wonder if you're doing any rejects at SMTP level.
Re: Uptick in spam
On 03/27/2015 07:51 PM, Amir Caspi wrote: Here are a few spamples: http://pastebin.com/3nSLurGv (this scored BAYES_99 but would still have been FN with BAYES_999) http://pastebin.com/LaKT5ZZK (I have a rule template for these URIs but recent spams have modified them to cause high risk of FPs for such rules) http://pastebin.com/qSgBxR5B (BAYES_999; could potentially be caught by an excessive HTML entity rule, but none seemed to hit... is there one?) For the first and last one, the URIs are way too similar to blog URIs that would be in use by legitimate agencies, so I suspect there is a high risk for FPs on those. The middle one uses a template that I have URI rules for, but the URIs are evolving to use randomized server names which are also basically impossible to template against without risk of FPs. I have hundreds more like these... These three samples are very different in the sense that #1 is a hacked site, #2 #3 are the regular snowshoe. What I miss in your sample's SA reports are any URIBL hits of some sort. Are you doing URIBL lookups? and using RAZOR PYZOR? Axb
Re: Uptick in spam
Apologies if this is an overly obvious answer, but are you using any greylisting? This would (potentially) move your user away from the wavefront of a spam's distribution, and give it a better chance of triggering the network-based tests.

On Fri, 27 Mar 2015, Amir Caspi wrote:
> This is my whole issue -- since my user appears to be very high up on the recipient list for all these spammers, and is therefore getting spams before the network checks are effective, how can we combat these new spams _before_ the network checks become effective?
>
> Thanks.
> --- Amir

--
Shane Williams, System Admin - UT CompSci
sha...@shanew.net
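For readers unfamiliar with greylisting, the core idea can be sketched in a few lines. This is a toy model and not how any particular milter is implemented: the triplet key, the 300-second delay and the in-memory dictionary are all assumptions chosen for illustration. The first delivery attempt for an unseen (client IP, sender, recipient) triplet is temporarily rejected, and a retry after the delay is accepted, by which time blocklists and hash databases have had more time to list the spam run.

```python
class Greylist:
    """Toy greylist: temp-fail a triplet until it retries after a delay."""

    def __init__(self, delay_seconds=300):
        # Hypothetical default delay; real deployments tune this.
        self.delay = delay_seconds
        # Maps (client_ip, sender, recipient) to the time of the first attempt.
        self.first_seen = {}

    def check(self, client_ip, sender, recipient, now):
        """Return "tempfail" or "accept" for a delivery attempt at time `now`."""
        key = (client_ip, sender.lower(), recipient.lower())
        # Record the first attempt; later attempts keep the original timestamp.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.delay:
            return "accept"
        return "tempfail"


greylist = Greylist(delay_seconds=300)
print(greylist.check("192.0.2.1", "a@example.org", "user@example.net", now=0))
print(greylist.check("192.0.2.1", "a@example.org", "user@example.net", now=600))
```

A legitimate mail server retries after a temporary failure, so the cost is a delivery delay, which is the drawback for time-critical mail raised elsewhere in the thread.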
Re: Uptick in spam
On Mar 27, 2015, at 1:38 PM, sha...@shanew.net wrote: Apologies if this is an overly obvious answer, but are you using any greylisting? This would (potentially) move your user away from the wavefront of a spam's distribution, and give it a better chance of triggering the network-based tests. No, unfortunately not. It's something I've been considering but with my current system setup I don't know of an easy way to implement it. Unfortunately the system setup is fixed due to the virtual hosting software being run on it. There is a possibility this can change in the future, depending on our client setup, but right now we're stuck with it, so I can't do things like use amavisd or dovecot or whatever. If I can easily implement greylisting from within sendmail without breaking the current setup, that's certainly something I'd consider doing... Of course, I am aware of the debate regarding greylisting. In particular, this can cause significant problems for one-time password emails, e.g. from banks, where a significant delay in delivery causes huge problems. I'm not sure how to work around that. Thanks. --- Amir
Re: Uptick in spam
On Feb 16, 2015, at 11:47 AM, Kevin A. McGrail kmcgr...@pccc.com wrote: I'm happy to look at a recent sample and throw it through my system to see what it hits but overall, I've been seeing the exact opposite. So, one of my users has been getting dozens (sometimes nearly 100) FNs per DAY over the last few weeks. Even though many of these emails are hitting BAYES_999, they are not hitting any other non-negligible scoring rules. I have set BAYES_99 + BAYES_999 to a combined score of 4.9 because I don't want it to be a complete poison pill, but this is contributing to something like 50% of the FNs (where only BAYES_999 is contributing to the score because no other rules are hitting). The other 50% are not getting high-enough Bayes scores, but even then, many still don't hit many (or any) other scoring rules so that they would still have this problem even if they scored BAYES_999. In many cases, it would appear that he is getting a fresh batch that hasn't yet hit the RBLs or hash DBs, which is why even with BAYES_999 they don't score over the 5.0 threshold... it's causing some severe inbox unpleasantness. I've been trying to come up with some good URI template rules to block many of these but spammers are getting sufficiently generic in their URIs that I worry strongly about FPs for these. I haven't been able to identify any other distinctive markers in the template against which I can reliably write rules, although I also don't have a program that does strong comparisons to look for patterns (I'm just doing this by eye). I have his spam corpus of a few thousand messages... simple Bayes training doesn't seem to help, so some sort of template matching would really be useful here, but as I said, I haven't really found anything that I feel comfortable writing rules against without significant risk of FPs. Might anyone have some ideas? This is getting to be a serious issue for this user and I'm getting complaints... Thanks. (For reference: running SA 3.4.0 on CentOS 5.11.) 
--- Amir
Re: Uptick in spam
On Mar 27, 2015, at 12:20 PM, Axb axb.li...@gmail.com wrote: - Please post missed spam samples in pastebin.com - do not post samples to mailing lists Of course, I would never post it to the list. I will put up a few in pastebin but there are so many of them, and there are a few different templates in use, so I don't know if I can really capture them all. I obviously can't post the entire corpus on pastebin. ;-) Here are a few spamples: http://pastebin.com/3nSLurGv (this scored BAYES_99 but would still have been FN with BAYES_999) http://pastebin.com/LaKT5ZZK (I have a rule template for these URIs but recent spams have modified them to cause high risk of FPs for such rules) http://pastebin.com/qSgBxR5B (BAYES_999; could potentially be caught by an excessive HTML entity rule, but none seemed to hit... is there one?) For the first and last one, the URIs are way too similar to blog URIs that would be in use by legitimate agencies, so I suspect there is a high risk for FPs on those. The middle one uses a template that I have URI rules for, but the URIs are evolving to use randomized server names which are also basically impossible to template against without risk of FPs. I have hundreds more like these... Cheers. --- Amir
Re: How to automatically train each users Bayes?
On 27.03.15 15:16, Michael wrote:
> I would like automatically learn each users Bayes database in the following way: Do the following once a day for each user:
> 1.) sa-learn -u username --ham ../maildir/cur
> 2.) sa-learn -u username --spam ../maildir/.Spam/cur
> What do you think about this strategy?

The easiest way is to train on false positives and false negatives. Dovecot's imapd has a plugin to train when mail is moved from/to spam. If you use something else, you should create a pair of special folders for users to train both ham and spam.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Re: How to automatically train each users Bayes?
On 27.03.2015 19:09, RW wrote:
> On Fri, 27 Mar 2015 15:16:13 + Michael wrote:
> > Hi, I would like automatically learn each users Bayes database in the following way: Do the following once a day for each user:
> > 1.) sa-learn -u username --ham ../maildir/cur
> > 2.) sa-learn -u username --spam ../maildir/.Spam/cur
> > The idea is to train the Bayes for each user without the need to take care of learning Spam/Ham on their own. The reason for taking the cur folder instead of the new folder is that I assume that the contents of these folders have already been verified for false-positives/negatives by the user.
>
> cur doesn't imply that the mail has been read; for that you need to check the seen flag in the filename, an S somewhere after the colon.

Yes, that's true. But if I'm right, new mails stay in new until the appropriate folder in the IMAP client has been opened, right? I just assume that if the user has some false negatives in the folder, he will either immediately delete them or just move them into the Spam folder.

> > A problem that could occur is when the user always deletes all mails in .Spam/cur. Then the Bayes is only trained with Ham, but never Spam. Or isn't that a problem?
>
> Not if you tell them - then it's their fault if it doesn't work. Alternately you could have a separate train-spam folder and empty it after training.

I think it's easier for the users if they just leave spam in the Spam folder for at least one day. Most of them will not move spam into a learn folder.

> You could also supplement spam training by autolearning only spam, e.g. I have:
>
>     bayes_auto_learn 1
>     bayes_auto_learn_on_error 1
>     bayes_auto_learn_threshold_nonspam -2000.0

But that learns spam only if its score is above 12.0, and learns no nonspam. And then maybe the default config, which autolearns both spam and ham, is already the best...

My setup is already configured to retrain when the user moves mail from Inbox to Spam or from Spam to another folder.

> Personally I've never seen a spam mistrained as ham with the default threshold and sensible rule scores. I think where some people go wrong is that they don't specify aggressive custom scores correctly. With autolearning it's better to keep conservative scores in the non-Bayes scoresets, e.g.
>
>     score SOME_RULE 2 2 8 8
>
> not
>
>     score SOME_RULE 8
>
> There's no difference in classification, but the latter is more likely to cause mistraining on FPs.
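RW's point about the four-value score line can be illustrated with a small sketch. It assumes the scoreset layout from the SpamAssassin configuration documentation (set 0: no Bayes, no network tests; set 1: network only; set 2: Bayes only; set 3: both) and that autolearning decides using the non-Bayes scoreset. The parser and function names are made up for this example and are not part of SpamAssassin.

```python
def parse_score_line(line):
    """Parse 'score NAME v' or 'score NAME v0 v1 v2 v3' into (name, [s0, s1, s2, s3])."""
    parts = line.split()
    if len(parts) not in (3, 6) or parts[0] != "score":
        raise ValueError(f"unsupported score line: {line!r}")
    values = [float(v) for v in parts[2:]]
    if len(values) == 1:
        # A single score applies to all four scoresets.
        values = values * 4
    return parts[1], values


def scoreset_index(use_bayes, use_network):
    """Map the two switches to a scoreset number 0..3."""
    return (2 if use_bayes else 0) + (1 if use_network else 0)


def classification_score(values, use_network=True):
    """Score that counts toward the spam verdict when Bayes is in use."""
    return values[scoreset_index(True, use_network)]


def autolearn_score(values, use_network=True):
    """Score that counts toward the autolearn decision (non-Bayes scoreset)."""
    return values[scoreset_index(False, use_network)]


for line in ("score SOME_RULE 2 2 8 8", "score SOME_RULE 8"):
    name, values = parse_score_line(line)
    print(line, "->", classification_score(values), autolearn_score(values))
```

Both lines give SOME_RULE 8 points toward classification on a Bayes-enabled system, but the single-value form also contributes 8 points toward the autolearn decision, which is why a false positive on that rule is more likely to be learned as spam.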
Re: How to automatically train each users Bayes?
On 27.03.2015 19:54, Matus UHLAR - fantomas wrote:
> On 27.03.15 15:16, Michael wrote:
> > I would like automatically learn each users Bayes database in the following way: Do the following once a day for each user:
> > 1.) sa-learn -u username --ham ../maildir/cur
> > 2.) sa-learn -u username --spam ../maildir/.Spam/cur
> > What do you think about this strategy?
>
> the easiest way is to train on false positives and false negatives. dovecot imapd has plugin to train when mail is moved from/to spam.

My concern is the following: sometimes a new kind of spam appears. This new kind often gets low scores, only 0.1 to 0.5 points above the limit, so the autolearner does not pick it up. If the same spam then comes from another sending server, the score is just a little bit below the limit, so I get a false negative. If the previous spam had already been learned, the second mail would have been scored as spam.

> you use something other, you should create pair of special folders for users to train both ham and spam.
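The gap Michael describes follows from how the autolearn thresholds relate to the spam threshold. The sketch below is a simplification under stated assumptions: it uses the default values as I understand them (12.0 for `bayes_auto_learn_threshold_spam`, 0.1 for `bayes_auto_learn_threshold_nonspam`), the exact boundary handling is assumed, and it ignores the further conditions SpamAssassin applies (for example, the score being computed from the non-Bayes scoreset and minimum header/body points for spam). The function name is hypothetical.

```python
def autolearn_decision(score, spam_threshold=12.0, nonspam_threshold=0.1):
    """Return "spam", "ham", or None (not learned) for a message score."""
    if score >= spam_threshold:
        return "spam"
    if score < nonspam_threshold:
        return "ham"
    # Anything in between, including mail just above a 5.0 spam limit, is not learned.
    return None


# A new spam scoring just above the 5.0 limit is tagged as spam but never autolearned.
print(autolearn_decision(5.3))
# With the nonspam threshold from RW's config, ham is effectively never autolearned.
print(autolearn_decision(-1.0, nonspam_threshold=-2000.0))
```

Under these assumptions, only explicit training (sa-learn runs or a move-to-folder trigger) teaches Bayes about spam that scores between the two thresholds.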
Re: Uptick in spam
On 27.03.2015 at 19:13, Amir Caspi wrote:
> On Feb 16, 2015, at 11:47 AM, Kevin A. McGrail kmcgr...@pccc.com wrote:
> > I'm happy to look at a recent sample and throw it through my system to see what it hits but overall, I've been seeing the exact opposite.
>
> So, one of my users has been getting dozens (sometimes nearly 100) FNs per DAY over the last few weeks. Even though many of these emails are hitting BAYES_999, they are not hitting any other non-negligible scoring rules

What helps a lot here are custom subject rules:

* contains
* starts with
* ends with
* equal

4 different score levels:

* very low: 0.5
* low: 1.5
* medium: 2.5
* high: 3.5
* very high: 4.5

We currently have 577 different subjects and subject parts scored. I don't want to publish them, because I'd like the spammers not to change to new ones :-)
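The scheme described above, four match types combined with fixed score levels, might look roughly like this as a standalone sketch. The score values are the ones listed in the message; everything else is an assumption: the example patterns are invented (the real list was deliberately not published), matching is assumed to be case-insensitive, and scores of all matching rules are assumed to add up, as they would for separate SpamAssassin rules.

```python
# Score levels as listed in the message.
SCORE_LEVELS = {
    "very low": 0.5,
    "low": 1.5,
    "medium": 2.5,
    "high": 3.5,
    "very high": 4.5,
}


def subject_matches(kind, pattern, subject):
    """Apply one of the four match types, ignoring case (an assumption)."""
    s, p = subject.casefold(), pattern.casefold()
    if kind == "contains":
        return p in s
    if kind == "starts with":
        return s.startswith(p)
    if kind == "ends with":
        return s.endswith(p)
    if kind == "equal":
        return s == p
    raise ValueError(f"unknown match kind: {kind}")


def subject_score(subject, rules):
    """Sum the level scores of every rule whose pattern matches the subject."""
    return sum(
        SCORE_LEVELS[level]
        for kind, pattern, level in rules
        if subject_matches(kind, pattern, subject)
    )


# Hypothetical rules, for illustration only.
example_rules = [
    ("starts with", "re: your invoice", "low"),
    ("contains", "free", "very low"),
    ("ends with", "!!!", "medium"),
    ("equal", "hello", "high"),
]
print(subject_score("Re: Your Invoice is FREE!!!", example_rules))
```

In SpamAssassin itself each entry would more likely be a `header` rule on the Subject with an anchored or unanchored regular expression, plus a `score` line.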
Re: Uptick in spam
On 27.03.15 12:51, Amir Caspi wrote:
> Here are a few spamples:
> http://pastebin.com/3nSLurGv (this scored BAYES_99 but would still have been FN with BAYES_999)
> http://pastebin.com/LaKT5ZZK (I have a rule template for these URIs but recent spams have modified them to cause high risk of FPs for such rules)
> http://pastebin.com/qSgBxR5B (BAYES_999; could potentially be caught by an excessive HTML entity rule, but none seemed to hit... is there one?)

I see no network checks here... do you use network checks?

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Re: How to automatically train each users Bayes?
On 27.03.2015 16:21, Reindl Harald wrote: Am 27.03.2015 um 16:16 schrieb Michael: I would like automatically learn each users Bayes database in the following way: Do the following once a day for each user: 1.) sa-learn -u username --ham ../maildir/cur 2.) sa-learn -u username --spam ../maildir/.Spam/cur The idea is to train the Bayes for each user without the need to take care of learning Spam/Ham on their own. The reason for taking the cur folder instead of the new folder is that I assume that the contents of these folders have already been verified for false-positives/negatives by the user. A problem that could occur is when the user always deletes all mails in .Spam/cur. Then the Bayes is only trained with Ham, but never Spam. Or isn't that a problem? What do you think about this strategy? nothing good because in that case you can just stay at autolearning which is on by default after a bayes has at least 200 ham and 200 spam samples to get enabled at all You are probably right. Auto learning is already working for all users because I'm always training new users with a preselected ham/spam folder
Re: How to automatically train each users Bayes?
On Fri, 27 Mar 2015 15:16:13 + Michael wrote:
> Hi, I would like automatically learn each users Bayes database in the following way: Do the following once a day for each user:
> 1.) sa-learn -u username --ham ../maildir/cur
> 2.) sa-learn -u username --spam ../maildir/.Spam/cur
> The idea is to train the Bayes for each user without the need to take care of learning Spam/Ham on their own. The reason for taking the cur folder instead of the new folder is that I assume that the contents of these folders have already been verified for false-positives/negatives by the user.

cur doesn't imply that the mail has been read; for that you need to check the seen flag in the filename, an S somewhere after the colon.

> A problem that could occur is when the user always deletes all mails in .Spam/cur. Then the Bayes is only trained with Ham, but never Spam. Or isn't that a problem?

Not if you tell them - then it's their fault if it doesn't work. Alternatively, you could have a separate train-spam folder and empty it after training.

You could also supplement spam training by autolearning only spam, e.g. I have:

    bayes_auto_learn 1
    bayes_auto_learn_on_error 1
    bayes_auto_learn_threshold_nonspam -2000.0

Personally I've never seen a spam mistrained as ham with the default threshold and sensible rule scores. I think where some people go wrong is that they don't specify aggressive custom scores correctly. With autolearning it's better to keep conservative scores in the non-Bayes scoresets, e.g.

    score SOME_RULE 2 2 8 8

not

    score SOME_RULE 8

There's no difference in classification, but the latter is more likely to cause mistraining on FPs.
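The seen-flag check mentioned above can be sketched as follows. It assumes the standard Maildir naming scheme, where the flags follow a `:2,` suffix and `S` marks a message as seen; the function names and the sample filenames are made up for the example.

```python
def maildir_flags(filename):
    """Return the flag letters after ':2,' in a Maildir filename, or '' if none."""
    _, sep, info = filename.rpartition(":")
    if not sep or not info.startswith("2,"):
        return ""
    return info[2:]


def is_seen(filename):
    """True if the message carries the S (seen) flag."""
    return "S" in maildir_flags(filename)


# Hypothetical filenames as they might appear in a cur directory.
for name in (
    "1427468173.M1P2.host:2,S",
    "1427468173.M1P3.host:2,",
    "1427468173.M1P4.host,S=1234:2,R",
):
    print(name, is_seen(name))
```

Only the part after the colon is inspected, so an `S=1234` size field in the base name (as some servers add) is not mistaken for the seen flag. As noted later in the thread, even a seen flag does not prove that a human looked at the message.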
Re: Uptick in spam
On 03/27/2015 07:13 PM, Amir Caspi wrote: On Feb 16, 2015, at 11:47 AM, Kevin A. McGrail kmcgr...@pccc.com wrote: I'm happy to look at a recent sample and throw it through my system to see what it hits but overall, I've been seeing the exact opposite. So, one of my users has been getting dozens (sometimes nearly 100) FNs per DAY over the last few weeks. Even though many of these emails are hitting BAYES_999, they are not hitting any other non-negligible scoring rules. I have set BAYES_99 + BAYES_999 to a combined score of 4.9 because I don't want it to be a complete poison pill, but this is contributing to something like 50% of the FNs (where only BAYES_999 is contributing to the score because no other rules are hitting). The other 50% are not getting high-enough Bayes scores, but even then, many still don't hit many (or any) other scoring rules so that they would still have this problem even if they scored BAYES_999. In many cases, it would appear that he is getting a fresh batch that hasn't yet hit the RBLs or hash DBs, which is why even with BAYES_999 they don't score over the 5.0 threshold... it's causing some severe inbox unpleasantness. I've been trying to come up with some good URI template rules to block many of these but spammers are getting sufficiently generic in their URIs that I worry strongly about FPs for these. I haven't been able to identify any other distinctive markers in the template against which I can reliably write rules, although I also don't have a program that does strong comparisons to look for patterns (I'm just doing this by eye). I have his spam corpus of a few thousand messages... simple Bayes training doesn't seem to help, so some sort of template matching would really be useful here, but as I said, I haven't really found anything that I feel comfortable writing rules against without significant risk of FPs. Might anyone have some ideas? This is getting to be a serious issue for this user and I'm getting complaints... 
- Please post missed spam samples in pastebin.com - do not post samples to mailing lists
Re: Uptick in spam
On Fri, 27 Mar 2015 12:13:30 -0600 Amir Caspi wrote:
> On Feb 16, 2015, at 11:47 AM, Kevin A. McGrail kmcgr...@pccc.com wrote:
> > I'm happy to look at a recent sample and throw it through my system to see what it hits but overall, I've been seeing the exact opposite.
>
> So, one of my users has been getting dozens (sometimes nearly 100) FNs per DAY over the last few weeks. Even though many of these emails are hitting BAYES_999, they are not hitting any other non-negligible scoring rules. I have set BAYES_99 + BAYES_999 to a combined score of 4.9 because I don't want it to be a complete poison pill,

Personally I've found that trying to work around BAYES_99 not being a poison pill causes more FPs than making it one. YMMV.
Re: Uptick in spam
On Mar 27, 2015, at 12:22 PM, Reindl Harald h.rei...@thelounge.net wrote: we have currently 577 different subjects and subject-parts scored , i don't want to publish them because i'd like the spammers don't change to new ones :-) Sadly, that doesn't help me. I don't have time to compile hundreds of subject rules, managing email is not my full-time job and I don't want it to become one. If you care to share, that would be much appreciated, but otherwise I can't spend time writing hundreds of custom rules. This is why I look for URI templates where regexps work well... looking for keywords or key phrases would be a huge quagmire, and that's what Bayes is supposed to be for. As to publishing, I personally feel holding rules to one's self is not productive. Spammers evolve regardless, and in the meantime those templates benefit nobody but one's own system. Distributing them publicly will help everyone and could help others publish better rules in the future. Obviously, others may disagree. Cheers. --- Amir
Re: Uptick in spam
On 3/27/2015 2:51 PM, Amir Caspi wrote: On Mar 27, 2015, at 12:20 PM, Axb axb.li...@gmail.com wrote: - Please post missed spam samples in pastebin.com - do not post samples to mailing lists Of course, I would never post it to the list. I will put up a few in pastebin but there are so many of them, and there are a few different templates in use, so I don't know if I can really capture them all. I obviously can't post the entire corpus on pastebin. ;-) Are you using network tests? These are scoring pretty high for me.
Re: How to automatically train each users Bayes?
On 27.03.2015 19:54, Matus UHLAR - fantomas wrote:
> > the easiest way is to train on false positives and false negatives. dovecot imapd has plugin to train when mail is moved from/to spam.

On 27.03.15 20:10, Michael wrote:
> My concerns are the following: Sometimes new kind of spam is appearing. This new kind often gets low scores so that they are just 0.1 to 0.5 points above the limit. And the auto learner gets no hit. If the same spam then comes from another sending server, the score is just a little bit below the border so that I'm getting a false-negative. If the previous spam would have already been learned, the second mail would have been scored as spam.

I don't get this. Or should I add that it's of course good to continue with autolearning, but _also_ allow manual learning?

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Re: Uptick in spam
On Fri, 27 Mar 2015, Amir Caspi wrote:
> On Mar 27, 2015, at 12:56 PM, Matus UHLAR - fantomas uh...@fantomas.sk wrote:
> > I see no network checks here... do you use network checks?
>
> On Mar 27, 2015, at 1:11 PM, Kevin A. McGrail kmcgr...@pccc.com wrote:
> > Are you using network tests? These are scoring pretty high for me.
>
> I presume you're talking about things like Razor, Pyzor, DCC, and various RBLs? Yes, those are enabled. The reason you're not seeing them is because they didn't hit when the messages were first received. I'm getting the same hits NOW that you are seeing, but those did NOT hit when the messages first arrived.

Have you considered greylisting?

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
Re: Uptick in spam
On Fri, 27 Mar 2015, Amir Caspi wrote:
> On Mar 27, 2015, at 1:38 PM, sha...@shanew.net wrote:
> > Apologies if this is an overly obvious answer, but are you using any greylisting? This would (potentially) move your user away from the wavefront of a spam's distribution, and give it a better chance of triggering the network-based tests.
>
> No, unfortunately not. It's something I've been considering but with my current system setup I don't know of an easy way to implement it. Unfortunately the system setup is fixed due to the virtual hosting software being run on it. There is a possibility this can change in the future, depending on our client setup, but right now we're stuck with it, so I can't do things like use amavisd or dovecot or whatever. If I can easily implement greylisting from within sendmail without breaking the current setup, that's certainly something I'd consider doing...

(All caught up now, sheesh.) Can you install milters? Take a look at milter-greylist.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
Re: How to automatically train each users Bayes?
On Fri, 27 Mar 2015 20:03:18 +0100 Michael wrote:

> On 27.03.2015 19:09, RW wrote:
>> On Fri, 27 Mar 2015 15:16:13 +
>> cur doesn't imply that the mail has been read; for that you need to check the seen flag in the filename, an S somewhere after the colon.
>
> Yes, that's true. But if I'm right, new mails stay in new until the appropriate folder in the IMAP client has been opened, right? I just assume, if the user has some false negatives in the folder, he will either immediately delete them or just move them into the Spam folder.

People can have mail clients running unattended in the background, often on multiple devices, so you can't assume it's been seen by a human.

>> You could also supplement spam training by autolearning only spam, e.g. I have:
>>   bayes_auto_learn 1
>>   bayes_auto_learn_on_error 1
>>   bayes_auto_learn_threshold_nonspam -2000.0
>
> But that learns spam only if its score is above 12.0. And learns no nonspam.

That's why I suggested using it to *supplement* spam training. When it works, autotraining does have the advantage of happening in real-time.

> And then maybe the default config which auto learns spam and ham is already the best...

The default doesn't learn ham well, I'd only do that as a last resort.

> My setup is already configured to retrain when the user moves mail from Inbox to Spam or from Spam to another folder.

This is a really poor way of training Bayes because it trains on SA misclassifications rather than Bayes misclassifications. It's a poor way of training spam and very much worse at training ham.

On Fri, 27 Mar 2015 20:14:03 +0100 Matus UHLAR - fantomas wrote:

> On 27.03.2015 19:54, Matus UHLAR - fantomas wrote:
>> the easiest way is to train on false positives and false negatives. dovecot imapd has plugin to train when mail is moved from/to spam.
>
> On 27.03.15 20:10, Michael wrote:
>> My concerns are the following: Sometimes a new kind of spam appears. This new kind often gets low scores, so that the messages are just 0.1 to 0.5 points above the limit. And the auto learner gets no hit. If the same spam then comes from another sending server, the score is just a little bit below the border so that I'm getting a false negative. If the previous spam had already been learned, the second mail would have been scored as spam.

I don't get this. By the sound of it the OP is already using the dovecot plugin or equivalent. The first spam wasn't autolearned and was correctly identified as spam. In this case the plugin doesn't provide a way of training it, even if it has BAYES_00, because it's already in the spam folder. People keep recommending the plugin, but IMO it's a poor choice for SpamAssassin.
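The seen-flag check RW mentions (an S after the colon in the Maildir filename) can be sketched in Python as follows. The helper names `maildir_flags`, `is_seen` and `seen_messages` are made up for this example; the only thing taken from the Maildir convention is that flags follow `:2,` at the end of the filename. Note that, as RW says, a set S flag still only means a client marked the mail as read, not that a human looked at it.

```python
import os
from typing import Iterator


def maildir_flags(filename: str) -> str:
    """Return the flag letters after ':2,' in a Maildir filename, or ''."""
    base = os.path.basename(filename)
    _, sep, info = base.rpartition(":")
    if not sep or not info.startswith("2,"):
        return ""  # no info section, e.g. a file still sitting in new/
    return info[2:]


def is_seen(filename: str) -> bool:
    """True if the message carries the Seen ('S') flag."""
    return "S" in maildir_flags(filename)


def seen_messages(cur_dir: str) -> Iterator[str]:
    """Yield paths of messages in a cur/ directory that are flagged Seen."""
    for name in sorted(os.listdir(cur_dir)):
        if is_seen(name):
            yield os.path.join(cur_dir, name)
```

A daily training job could collect the output of `seen_messages()` for the ham and spam folders and pass only those files to sa-learn, instead of handing it the whole cur directory.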
Re: How to automatically train each users Bayes?
Hi,

>> Yes, that's true. But if I'm right, new mails stay in new until the appropriate folder in the IMAP client has been opened, right? I just assume, if the user has some false negatives in the folder, he will either immediately delete them or just move them into the Spam folder.
>
> People can have mail clients running unattended in the background, often on multiple devices, so you can't assume it's been seen by a human.

Does anyone have any suggestions on how to enable Exchange users to submit samples they consider to be spam for analysis? With the latest Exchange, IMAP access to public folders has been disabled. We have one setup where we forward the mail to their internal Exchange system. We used to have spam and ham folders where users would place mail for us to review and then train Bayes, but we haven't been able to do that for a while because of the missing IMAP access.

Thanks,
Alex
Re: Uptick in spam
On 03/27/2015 11:51 AM, Amir Caspi wrote:

> On Mar 27, 2015, at 12:20 PM, Axb axb.li...@gmail.com wrote:
>> - Please post missed spam samples in pastebin.com
>> - do not post samples to mailing lists
>
> Of course, I would never post it to the list. I will put up a few in pastebin but there are so many of them, and there are a few different templates in use, so I don't know if I can really capture them all. I obviously can't post the entire corpus on pastebin. ;-)
>
> Here are a few spamples:
>
> http://pastebin.com/3nSLurGv (this scored BAYES_99 but would still have been FN with BAYES_999)
> http://pastebin.com/LaKT5ZZK (I have a rule template for these URIs but recent spams have modified them to cause high risk of FPs for such rules)
> http://pastebin.com/qSgBxR5B (BAYES_999; could potentially be caught by an excessive HTML entity rule, but none seemed to hit... is there one?)
>
> All of these were From: domains created today. For the first and last one, the URIs are way too similar to blog URIs that would be in use by legitimate agencies, so I suspect there is a high risk for FPs on those. The middle one uses a template that I have URI rules for, but the URIs are evolving to use randomized server names which are also basically impossible to template against without risk of FPs. I have hundreds more like these...
>
> Cheers.
> --- Amir
Re: Uptick in spam
On Mar 27, 2015, at 2:09 PM, Axb axb.li...@gmail.com wrote:

> As an AV product I'd recommend Sophos AND ESETS/Nod32.

I'll look into Sophos, I'm not entirely sure if I can deploy it on my system or not. We have to use RPMs that can be distributed to the virtual hosts, etc... I'll definitely look into it. Haven't heard about ESETS/Nod32, will check it out.

> I'd also suggest you disable msg munging if you want hashers to work.

I'll certainly consider that if this is a major issue. I see hashers working on many other messages, but I'm not sure how munged those messages are. I'll try to investigate to see if I've seen hash hits on munged messages... Turning off munging will unfortunately reduce security since it allows embedded JS and web bugs, but if it improves the chances of those things getting properly tagged as spam then they won't open them anyway, so I guess it may come out in the wash.

> URI lists may also list URIs to .js and web bugs - you could be missing on them.

Very good point.

> Are you an ISP/ASP or is this a corporate box?

A bit of both. We run a dedicated server that is owned by a major ISP, but they basically only handle the upstream end. We are root on the box and handle everything downstream. We run a virtual hosting panel and our corporate clients run domains (for email and web hosting) as virtual hosts on the box. Each virthost is operated in a chroot environment, and the control panel distributes the central RPMs to each virthost. So, everything we do has to work with the framework of the control panel and its virtual hosting environment.

> What are you really using MailScanner for?

Primarily as glue to clamav (via clamd) and for attachment policy enforcement (e.g., no .exe payloads), and secondarily for URI munging.

> I also wonder if you're doing any rejects at SMTP level.

Yes, I've implemented enhdnsbl in sendmail, querying SpamCop, Barracuda, and SpamHaus Zen (in that order). I know Barracuda is often overzealous but we haven't seen any FP rejections (that we know of) yet. Are there any other RBLs you suggest I add to sendmail's checks? (I used to use NJABL but that's dead, and last time I asked on this list, I was told SORBS wasn't a good idea due to too many FP rejections.) I also have greetpause enabled (at 1 sec) to reject trigger-happy spammers.

Cheers.
--- Amir
Re: Uptick in spam
On Mar 27, 2015, at 3:34 PM, Richard Doyle lists...@islandnetworks.com wrote: All of these were From: domains created today. Shouldn't they have been picked up by DOB? Or do I need to manually enable some DOB plugin in SA? (If so, please let me know how...) When I ran the third spample manually a few hours ago, I still didn't see any DOB hit. I see there is a URIBL_RHS_DOB... is there a SENDER_DOB rule as well? If not, it seems like it would be a good idea to implement one... do I need to file a bug for it? However, it would appear that all of the From: domains are the same as in the body URIs, which means URIBL_RHS_DOB should have popped... unless you mean that the subdomain (sub.domain.com) was DOB, but the main domain (www.domain.com and/or domain.com) were not DOB? Or am I missing something? Thanks. --- Amir
Re: Uptick in spam
On 03/27/2015 11:44 PM, Amir Caspi wrote:

> On Mar 27, 2015, at 3:34 PM, Richard Doyle lists...@islandnetworks.com wrote: All of these were From: domains created today. Shouldn't they have been picked up by DOB? Or do I need to manually enable some DOB plugin in SA? (If so, please let me know how...) When I ran the third spample manually a few hours ago, I still didn't see any DOB hit. I see there is a URIBL_RHS_DOB... is there a SENDER_DOB rule as well? If not, it seems like it would be a good idea to implement one... do I need to file a bug for it? However, it would appear that all of the From: domains are the same as in the body URIs, which means URIBL_RHS_DOB should have popped... unless you mean that the subdomain (sub.domain.com) was DOB, but the main domain (www.domain.com and/or domain.com) were not DOB? Or am I missing something?

DOB isn't realtime/zero hour.

I have zero Sendmail clue but if you can do it, also check sender/helo/rdns against dbl.spamhaus.org's reply 127.0.1.2 (I can only provide Postfix config for this).

If you want to check the sender in DOB you can use eval:check_rbl_envfrom for a rule. A few days ago I posted dbl_env_from.cf which should show how it's done (the rule is untested):
http://mail-archives.apache.org/mod_mbox/spamassassin-users/201503.mbox/%3C55128D61.2020308%40gmail.com%3E

You also may want to look at the Invaluement IP/URI lists. (Invaluement.com). Detection rate is real good and FP level is extraordinary. IIRC you can get a test drive. I wouldn't want to miss it.
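For readers who want to see what "check a domain against dbl.spamhaus.org's reply 127.0.1.2" means at the DNS level, here is a small Python sketch. It is not a Sendmail or Postfix configuration, and the function names are invented for this example. It builds the query name by prepending the domain to the zone and treats an A record of 127.0.1.2 as a listing; other return codes are ignored here. Queries sent through large public resolvers may not get useful answers from Spamhaus, so treat `dbl_lookup` as illustrative only.

```python
import socket
from typing import Iterable, List


def dbl_query_name(domain: str, zone: str = "dbl.spamhaus.org") -> str:
    """Build the DNS name to query for a sender, HELO or rDNS domain."""
    return f"{domain.strip().rstrip('.').lower()}.{zone}"


def is_dbl_spam_reply(addresses: Iterable[str]) -> bool:
    """True if the answer contains 127.0.1.2, the reply named in the thread."""
    return "127.0.1.2" in set(addresses)


def dbl_lookup(domain: str) -> List[str]:
    """Resolve the DBL name; an empty list means not listed or lookup failed."""
    try:
        _, _, addresses = socket.gethostbyname_ex(dbl_query_name(domain))
    except OSError:
        return []
    return addresses
```

Usage would be along the lines of `is_dbl_spam_reply(dbl_lookup("example.com"))` for the envelope sender domain, the HELO name, and the reverse DNS name of the connecting client.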
Re: Uptick in spam
On Mar 27, 2015, at 5:12 PM, Axb axb.li...@gmail.com wrote:

> DOB isn't realtime/zero hour.

That kind of defeats the point, isn't it? I mean, if you wait too long, it's no longer DOB, it's few-DOB... I would have imagined that a DOB server would operate in a caching mode where the first query on a domain would cause a whois lookup, which then generates a cache table entry with the reg date. Subsequent lookups then don't incur a whois hit, they just check the cache table. In this way it could be effectively realtime since only the first query causes a whois load, and it would always return the correct answer. I guess that's not the case?

> I have zero Sendmail clue but if you can do it, also check sender/helo/rdns against dbl.spamhaus.org's reply 127.0.1.2

I haven't found a way to do this, but if someone knows, please post...

> You also may want to look at the Invaluement IP/URI lists. (Invaluement.com). Detection rate is real good and FP level is extraordinary. IIRC you can get a test drive. I wouldn't want to miss it.

Unfortunately a paid service is not in the cards right now.

Does anyone recommend using the PSBL (Surriel) for sendmail dnsbl? I see that it's enabled by default in SA, but should I promote it to the sendmail level, or is it too prone to FP?

On a related note... since I implemented SpamCop, Barracuda, and SpamHaus at the sendmail level, should I disable those RBL lookups in SA, to prevent double-querying the RBLs for those mails that do get through? Or does SA check _all_ Received lines, in which case I should leave it enabled since sendmail only checks the connecting MTA? (I should note that I _HAVE_ seen RCVD_IN_XBL/PBL/SBL and RCVD_IN_BL_SPAMCOP_NET pop up not infrequently, despite implementing dnsbl for those RBLs in sendmail, which means either they're getting listed in the small interval between sendmail and SA, or SA is checking more than just the last hop...)

Thanks.
--- Amir
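The caching design Amir imagines above (first query triggers a whois lookup, later queries hit a cache table) could look roughly like the Python sketch below. This is only an illustration of that idea, not how the DOB list works. The whois lookup is injected as a plain function because no particular whois client is assumed, and the names `RegistrationCache` and `max_age_days` are hypothetical.

```python
from datetime import date, timedelta
from typing import Callable, Dict


class RegistrationCache:
    """Cache registration dates so each domain costs one whois lookup."""

    def __init__(self, whois_lookup: Callable[[str], date]) -> None:
        # whois_lookup is supplied by the caller; it must return the
        # registration date for a domain.
        self._whois_lookup = whois_lookup
        self._cache: Dict[str, date] = {}
        self.whois_calls = 0  # counts cache misses, for illustration

    def registered_on(self, domain: str) -> date:
        key = domain.lower()
        if key not in self._cache:
            # First query for this domain: do the expensive lookup once.
            self.whois_calls += 1
            self._cache[key] = self._whois_lookup(key)
        return self._cache[key]

    def is_new(self, domain: str, today: date, max_age_days: int = 1) -> bool:
        """True if the domain was registered within max_age_days of today."""
        return today - self.registered_on(domain) <= timedelta(days=max_age_days)
```

Even with the cache, every previously unseen domain still costs one whois query, which is where a design like this runs into rate limits at mail-server volumes.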
Re: Uptick in spam
On 03/27/2015 03:44 PM, Amir Caspi wrote: On Mar 27, 2015, at 3:34 PM, Richard Doyle lists...@islandnetworks.com wrote: All of these were From: domains created today. Shouldn't they have been picked up by DOB? Or do I need to manually enable some DOB plugin in SA? (If so, please let me know how...) When I ran the third spample manually a few hours ago, I still didn't see any DOB hit. I see there is a URIBL_RHS_DOB... is there a SENDER_DOB rule as well? If not, it seems like it would be a good idea to implement one... do I need to file a bug for it? However, it would appear that all of the From: domains are the same as in the body URIs, which means URIBL_RHS_DOB should have popped... unless you mean that the subdomain (sub.domain.com) was DOB, but the main domain (www.domain.com and/or domain.com) were not DOB? Or am I missing something? DOB misses many new domains. Whois often knows what's new, but using it to detect spam doesn't scale. Thanks. --- Amir
Re: Uptick in spam
On Fri, 27 Mar 2015 17:40:58 -0600 Amir Caspi wrote:

> On Mar 27, 2015, at 5:12 PM, Axb axb.li...@gmail.com wrote:
>> DOB isn't realtime/zero hour.
>
> That kind of defeats the point, isn't it? I mean, if you wait too long, it's no longer DOB, it's few-DOB...

I think it's 5 days, and the day-old bit is part of the bread metaphor, not the definition.

> On a related note... since I implemented SpamCop, Barracuda, and SpamHaus at the sendmail level, should I disable those RBL lookups in SA, to prevent double-querying the RBLs for those mails that do get through? Or does SA check _all_ Received lines, in which case I should leave it enabled since sendmail only checks the connecting MTA? (I should note that I _HAVE_ seen RCVD_IN_XBL/PBL/SBL and RCVD_IN_BL_SPAMCOP_NET pop up not infrequently, despite implementing dnsbl for those RBLs in sendmail, which means either they're getting listed in the small interval between sendmail and SA, or SA is checking more than just the last hop...)

There are deep checks for SBL (via zen) and SPAMCOP. XBL/PBL are last-external only.
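The difference between a deep check and a last-external check can be illustrated with a short Python sketch. It is a simplification: SpamAssassin's real selection depends on how trusted_networks and internal_networks are configured and is not reproduced here. The function names are made up, and the relay list is assumed to be ordered from the relay that connected to your server back to the oldest hop.

```python
import ipaddress
from typing import List


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Reverse the IPv4 octets and append the list zone, as DNSBLs expect."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    return ".".join(reversed(str(addr).split("."))) + "." + zone


def ips_to_check(external_relays: List[str], deep: bool) -> List[str]:
    """Pick relay IPs to look up.

    deep=True  -> every external relay from the Received headers
    deep=False -> only the last external relay (the one that connected to us)
    """
    if not external_relays:
        return []
    return list(external_relays) if deep else external_relays[:1]
```

With relays `["192.0.2.99", "198.51.100.7"]`, a last-external list would be asked only about `99.2.0.192.<zone>`, while a deep check would also ask about `7.100.51.198.<zone>`. That is one way a list already enforced at the MTA can still produce hits in SA: the hit comes from an earlier hop that sendmail never looked at.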
Re: Uptick in spam
On 03/28/2015 12:40 AM, Amir Caspi wrote:

> On Mar 27, 2015, at 5:12 PM, Axb axb.li...@gmail.com wrote:
>> DOB isn't realtime/zero hour.
>
> That kind of defeats the point, isn't it? I mean, if you wait too long, it's no longer DOB, it's few-DOB... I would have imagined that a DOB server would operate in a caching mode where the first query on a domain would cause a whois lookup, which then generates a cache table entry with the reg date. Subsequent lookups then don't incur a whois hit, they just check the cache table. In this way it could be effectively realtime since only the first query causes a whois load, and it would always return the correct answer. I guess that's not the case?

DOB is based on more or less publicly accessible daily TLD zone data (ICANN ZFA).

You're thinking passive DNS, as done by https://www.farsightsecurity.com/
I have access to their DNSDB service for a hobby project and it's amazing. Farsight's NOD service is way out of our means.

> Does anyone recommend using the PSBL (Surriel) for sendmail dnsbl? I see that it's enabled by default in SA, but should I promote it to the sendmail level, or is it too prone to FP?

It works fine for a family server, but I wouldn't use it for rejecting spam in a client's mailflow.

> On a related note... since I implemented SpamCop, Barracuda, and SpamHaus at the sendmail level, should I disable those RBL lookups in SA, to prevent double-querying the RBLs for those mails that do get through? Or does SA check _all_ Received lines, in which case I should leave it enabled since sendmail only checks the connecting MTA? (I should note that I _HAVE_ seen RCVD_IN_XBL/PBL/SBL and RCVD_IN_BL_SPAMCOP_NET pop up not infrequently, despite implementing dnsbl for those RBLs in sendmail, which means either they're getting listed in the small interval between sendmail and SA, or SA is checking more than just the last hop...)

Hard to say without tailing your maillogs. Though, if you have your trusted/internal SA settings right, extra SA checks shouldn't be an issue as you may already have most of the data in your resolver's cache anyway.
Re: Uptick in spam
From: Amir Caspi ceph...@3phase.com
Sent: Friday, March 27, 2015 7:30 PM
To: RW
Cc: users@spamassassin.apache.org
Subject: Re: Uptick in spam

> On Mar 27, 2015, at 6:19 PM, RW rwmailli...@googlemail.com wrote:
>> There are deep checks for SBL (via zen) and SPAMCOP. XBL/PBL are last-external only
>
> Interesting. I wonder why I see those XBL/PBL hits, then. Maybe Zen timed out on those queries from sendmail... or something. Either way I guess this means I should retain Zen and SC queries in SA.

You should be running a local DNS caching server like BIND or PowerDNS Recursor on a mail server to help prevent timeouts that can make RBL checks ineffective.

It's possible that your outbound mail could be hitting those RBLs in SA in the event of a compromised account, or on the last-external IP in the Received: headers, depending on what internal mail server you use and whether it records the sending mail client's address in X-Originating-IP or Received headers. I would recommend keeping those RBLs in SA to help with outbound scanning and in case something gets past the MTA-level RBL checking.

There shouldn't be duplicate hits on Zen/XBL/PBL if you have sendmail rejecting that message before it reaches SA. If you get any of those RBL hits in SA that sendmail is configured to reject on, then there must be some sendmail access list entry allowing the message to bypass the RBL checks.

ESET's NOD32 is very fast, very inexpensive, and works well with MailScanner.

The invaluement RBL is not expensive either and it is awesome. We pay thousands per year for a Spamhaus feed because of our volume and mailboxes. The invaluement RBL is only hundreds per year and it's almost as good as Spamhaus Zen. I have Spamhaus in front of invaluement in my postfix configuration but I may try flipping the order just to see if it will start blocking more than Spamhaus.

Dave

> Thanks.
> --- Amir
Re: Uptick in spam
You also may want to look at the Invaluement IP/URI lists. (Invaluement.com). Detection rate is real good and FP level is extraordinary. +1. Very happy with invaluement at $DAYJOB. -- Dave Pooser Cat-Herder-in-Chief, Pooserville.com
Re: Uptick in spam
On Mar 27, 2015, at 6:19 PM, RW rwmailli...@googlemail.com wrote: There are deep checks for SBL (via zen) and SPAMCOP. XBL/PBL are last-external only Interesting. I wonder why I see those XBL/PBL hits, then. Maybe Zen timed out on those queries from sendmail... or something. Either way I guess this means I should retain Zen and SC queries in SA. Thanks. --- Amir
Re: Uptick in spam
David Jones wrote on 2015-03-28 03:13:

> I have Spamhaus in front of invaluement in my postfix configuration but I may try flipping the order just to see if it will start blocking more than Spamhaus.

With Postfix postscreen one can test every IP against all RBLs in a single SMTP client check, so there is no "just Spamhaus" here :-)

Although the feature is called dnsbl in postscreen, it supports whitelists as well.

For my part, I have put all the RBL checks from SpamAssassin into postscreen. I know there are more RBL lists I could add, but I have no need to: too many queries make too much DNS traffic without more useful data. For stability it is also nice that postscreen caches positive hits a little longer than the DNS TTL.
how to download updated rules and transfer.
Dear list,

I have a system with SpamAssassin 3.4.0 installed. I have installed the rules provided at the Downloads link:

http://apache.bytenet.in//spamassassin/source/Mail-SpamAssassin-rules-3.4.0.r1565117.tgz

The system is not connected to the internet. I need to download the new rules on a system that is connected to the internet and transfer them to the system on which SpamAssassin is configured. Please guide.

--
anant athavale.
bangalore
Re: Uptick in spam
On 3/27/2015 10:13 PM, David Jones wrote:

> The invaluement RBL is not expensive either and it is awesome. We pay thousands per year for a Spamhaus feed because of our volume and mailboxes. The invaluement RBL is only hundreds per year and it's almost as good as Spamhaus Zen. I have Spamhaus in front of invaluement in my postfix configuration but I may try flipping the order just to see if it will start blocking more than Spamhaus.

Just to clarify, the two invaluement sender's IP blacklists, ivmSIP and ivmSIP/24, even --combined--, are not (and probably never will be) an adequate replacement for Spamhaus's Zen list. So please everyone, don't get the idea that you can turn off Zen, add invaluement, and everything will be ok. David Jones was NOT saying that... but I just want to make sure that nobody mistakenly goes too far with this, beyond what David intended.

Having said that... thanks, David (and others), for mentioning your success with ivmSIP and ivmSIP/24, where they are helping you block much of the spam that slips past Spamhaus, etc.

--
Rob McEwen