date:20150327


Hi,

I would like to automatically train each user's Bayes database in the
following way:


Do the following once a day for each user:
1.) sa-learn -u username --ham ../maildir/cur
2.) sa-learn -u username --spam ../maildir/.Spam/cur

The idea is to train the Bayes for each user without the need to take  
care of learning Spam/Ham on their own.
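
A minimal sketch of that daily job, assuming Python is available on the mail server. The helper name `build_learn_commands` and the example Maildir path are hypothetical; only the `sa-learn -u ... --ham/--spam` invocations and the `cur` and `.Spam/cur` folder names are taken from the steps above.

```python
from pathlib import PurePosixPath


def build_learn_commands(username, maildir):
    """Return the two sa-learn invocations from the post for one user.

    `maildir` is the user's Maildir root (a hypothetical path here); the
    .Spam subfolder name is taken from the post and may differ elsewhere.
    """
    root = PurePosixPath(maildir)
    return [
        # Step 1: everything in the inbox's cur folder is treated as ham.
        ["sa-learn", "-u", username, "--ham", str(root / "cur")],
        # Step 2: everything in the Spam folder's cur folder is treated as spam.
        ["sa-learn", "-u", username, "--spam", str(root / ".Spam" / "cur")],
    ]


commands = build_learn_commands("alice", "/home/alice/maildir")
```

Each returned list can be handed to `subprocess.run` from a daily cron job, once per user.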


The reason for taking the cur folder instead of the new folder is  
that I assume that the contents of these folders have already been  
verified for false-positives/negatives by the user.


A problem that could occur is when the user always deletes all mails  
in .Spam/cur. Then the Bayes is only trained with Ham, but never Spam.  
Or isn't that a problem?


What do you think about this strategy?

Thanks,
Michael

Re: How to automatically train each users Bayes?

2015-03-27 Thread Reindl Harald




Am 27.03.2015 um 16:16 schrieb Michael:

I would like to automatically train each user's Bayes database in the
following way:

Do the following once a day for each user:
1.) sa-learn -u username --ham ../maildir/cur
2.) sa-learn -u username --spam ../maildir/.Spam/cur

The idea is to train the Bayes for each user without the need to take
care of learning Spam/Ham on their own.

The reason for taking the cur folder instead of the new folder is
that I assume that the contents of these folders have already been
verified for false-positives/negatives by the user.

A problem that could occur is when the user always deletes all mails in
.Spam/cur. Then the Bayes is only trained with Ham, but never Spam. Or
isn't that a problem?

What do you think about this strategy?


Nothing good: in that case you may as well stay with autolearning,
which is on by default; Bayes only gets enabled at all once it has at
least 200 ham and 200 spam samples.





Re: Uptick in spam


On 03/27/2015 08:20 PM, Amir Caspi wrote:

On Mar 27, 2015, at 12:56 PM, Matus UHLAR - fantomas
uh...@fantomas.sk wrote:


I see no network checks here... do you use network checks?


On Mar 27, 2015, at 1:11 PM, Kevin A. McGrail kmcgr...@pccc.com
wrote:


Are you using network tests?  These are scoring pretty high for
me.


I presume you're talking about things like Razor, Pyzor, DCC, and
various RBLs?  Yes, those are enabled.  The reason you're not seeing
them is because they didn't hit when the messages were first
received.  I'm getting the same hits NOW that you are seeing, but
those did NOT hit when the messages first arrived.

Remember that these messages were received a number of hours ago, so
they have had plenty of time to be listed on RBLs and hash DBs in the
intervening period.  They were clearly not listed there when these
messages were received, which is exactly why these messages are FNs.
If they were received now, they wouldn't be... but they were back
then.

This is why I said in the prior message that it appears my user is
one of the unlucky folks getting these in the very first
distribution, before they've had a chance to be reported to RBLs and
hash DBs.  Some poor schmoe has to be in the first distribution, and
it appears that he's one of them.  This is why I'm looking for other,
template-like rules that can be used to identify these things,
because right now it seems my user is getting them on the first run
before the network tests are useful.

But, yes, network tests are absolutely enabled.


Are you using MailScanner? If so, it is munging the URIs, which breaks
lookups for any hash type, as in:


http://pastebin.com/LaKT5ZZK

And if you are indeed using MailScanner, are you sending it the full
message or only a chunk?

(I can't remember the settings' names.)

Re: Uptick in spam

On Mar 27, 2015, at 12:56 PM, Matus UHLAR - fantomas uh...@fantomas.sk wrote:

 I see no network checks here... do you use network checks?

On Mar 27, 2015, at 1:11 PM, Kevin A. McGrail kmcgr...@pccc.com wrote:

 Are you using network tests?  These are scoring pretty high for me.

I presume you're talking about things like Razor, Pyzor, DCC, and various RBLs? 
 Yes, those are enabled.  The reason you're not seeing them is because they 
didn't hit when the messages were first received.  I'm getting the same hits 
NOW that you are seeing, but those did NOT hit when the messages first arrived.

Remember that these messages were received a number of hours ago, so they have 
had plenty of time to be listed on RBLs and hash DBs in the intervening period. 
 They were clearly not listed there when these messages were received, which is 
exactly why these messages are FNs.  If they were received now, they wouldn't 
be... but they were back then.

This is why I said in the prior message that it appears my user is one of the 
unlucky folks getting these in the very first distribution, before they've had 
a chance to be reported to RBLs and hash DBs.  Some poor schmoe has to be in 
the first distribution, and it appears that he's one of them.  This is why I'm 
looking for other, template-like rules that can be used to identify these 
things, because right now it seems my user is getting them on the first run 
before the network tests are useful.

But, yes, network tests are absolutely enabled.

Cheers.

--- Amir

Re: Uptick in spam

On Mar 27, 2015, at 1:20 PM, Axb axb.li...@gmail.com wrote:

 These three samples are very different in the sense that #1 is a hacked
 site, #2 & #3 are the regular snowshoe.

Of course, I picked three different samples on purpose.  But, I have hundreds 
that replicate these.

 What I miss in your sample's SA reports are any URIBL hits of some sort.

Because there were no hits.  That's exactly the point.

 Are you doing URIBL lookups? And using RAZOR & PYZOR?

Yes, using Razor, Pyzor, and DCC.  Also using all default RBLs and URIBLs.  Per 
my last message, the whole issue is that my user appears to be getting the 
hot-off-the-presses run of these spams, before they have been reported to the RBLs, 
URIBLs, and hash DBs like Razor and Pyzor.  Therefore, none of the network 
checks are getting hit... they are absolutely enabled, and a few hours later 
they would hit high scores, but upon initial receipt they simply do not hit 
because the spam is too new.

This is my whole issue -- since my user appears to be very high up on the 
recipient list for all these spammers, and is therefore getting spams before 
the network checks are effective, how can we combat these new spams _before_ 
the network checks become effective?

Thanks.

--- Amir

Re: Uptick in spam

On Mar 27, 2015, at 1:33 PM, Axb axb.li...@gmail.com wrote:

 Are you using MailScanner? If so, it is munging the URIs, which breaks 
 lookups for any hash type, as in:

Yes, I am using MailScanner.  Some URIs are munged, others are not.  For 
example, you can see in that very pastebin you noted that there are a number of 
perfectly good URIs.  MailScanner will munge the embedded image web bugs and 
the embedded JavaScript, but will not munge regular href links or regular 
img links.  In that sample, the only MailScanner munging is on JavaScript.

But, you're saying MailScanner is changing the message and therefore changing 
the hash overall... yes?
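
As a rough illustration of why any rewriting of the body matters to hash-based checks: the sketch below uses SHA-256 purely as a stand-in. Razor, Pyzor and DCC compute their own signatures, some of which tolerate small changes, so this only shows the general principle, and the sample bodies and the `[script removed]` marker are made up.

```python
import hashlib


def body_digest(body: str) -> str:
    """Hex digest of a message body; SHA-256 is only a stand-in here."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


# Hypothetical bodies: what the sender sent vs. what a content filter passed on.
original = '<p>Hello</p><script src="http://example.com/x.js"></script>'
munged = "<p>Hello</p>[script removed]"

# Identical input always yields an identical digest.
same = body_digest(original) == body_digest(original)
# Any rewrite of the body yields a different digest.
changed = body_digest(original) != body_digest(munged)
```

If the scanner that computes the signature sees `munged` while the reporting community saw `original`, an exact-match lookup cannot succeed.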

Would you recommend not running MailScanner?  If so, what would you recommend 
for virus scanning?  Or, would you recommend turning off munging for embedded 
JS and web bugs?  (But, keeping the virus scanning?)  Of course, removing 
munging opens other vulnerabilities...

Note that my spam setup is as follows:

sendmail -> MailScanner (system-wide, root-owned) -> spamc/spamd (per-user, via 
procmail)

Unfortunately due to the nature of the virtual-host setup on this machine I 
_cannot_ have MailScanner be the SA glue, nor can I easily switch to SA milters 
like spamass-milter or amavisd or whatever.  Right now, this setup is 
unfortunately not changeable.

 And if you're indeed using MailScanner are you sending it the full message or 
 some chunk only?
 (can't remember the settings's names)

I am passing in the entire message.

Thanks.

--- Amir

Re: Uptick in spam


On 03/27/2015 08:45 PM, Amir Caspi wrote:

On Mar 27, 2015, at 1:33 PM, Axb axb.li...@gmail.com wrote:


Are you using MailScanner? If so, it is munging the URIs, which
breaks lookups for any hash type, as in:


Yes, I am using MailScanner.  Some URIs are munged, others are not.
For example, you can see in that very pastebin you noted that there
are a number of perfectly good URIs.  MailScanner will munge the
embedded image web bugs and the embedded JavaScript, but will not
munge regular href links or regular img links.  In that sample,
the only MailScanner munging is on JavaScript.

But, you're saying MailScanner is changing the message and therefore
changing the hash overall... yes?

Would you recommend not running MailScanner?  If so, what would you
recommend for virus scanning?  Or, would you recommend turning off
munging for embedded JS and web bugs?  (But, keeping the virus
scanning?)  Of course, removing munging opens other
vulnerabilities...


I used MS for a few years; it did the job.
As an AV product I'd recommend Sophos AND ESETS/Nod32.
I'd also suggest you disable message munging if you want the hashers to work.
URI lists may also list URIs to .js files and web bugs, so you could be
missing out on them.



Note that my spam setup is as follows:

sendmail -> MailScanner (system-wide, root-owned) -> spamc/spamd
(per-user, via procmail)




Unfortunately due to the nature of the virtual-host setup on this
machine I _cannot_ have MailScanner be the SA glue, nor can I easily
switch to SA milters like spamass-milter or amavisd or whatever.
Right now, this setup is unfortunately not changeable.


Are you an ISP/ASP or is this a corporate box?

What are you really using MailScanner for?

I also wonder if you're doing any rejects at SMTP level.

Re: Uptick in spam


On 03/27/2015 07:51 PM, Amir Caspi wrote:

Here are a few spamples:

http://pastebin.com/3nSLurGv   (this scored BAYES_99 but would still
have been FN with BAYES_999) http://pastebin.com/LaKT5ZZK  (I have a
rule template for these URIs but recent spams have modified them to
cause high risk of FPs for such rules) http://pastebin.com/qSgBxR5B
(BAYES_999; could potentially be caught by an excessive HTML entity
rule, but none seemed to hit... is there one?)

For the first and last one, the URIs are way too similar to blog URIs
that would be in use by legitimate agencies, so I suspect there is a
high risk for FPs on those.  The middle one uses a template that I
have URI rules for, but the URIs are evolving to use randomized
server names which are also basically impossible to template against
without risk of FPs.

I have hundreds more like these...


These three samples are very different in the sense that #1 is a hacked
site, #2 & #3 are the regular snowshoe.

What I miss in your sample's SA reports are any URIBL hits of some sort.

Are you doing URIBL lookups? And using RAZOR & PYZOR?


Axb

Re: Uptick in spam

2015-03-27 Thread shanew


Apologies if this is an overly obvious answer, but are you using any
greylisting?  This would (potentially) move your user away from the
wavefront of a spam's distribution, and give it a better chance of
triggering the network-based tests.
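
For readers unfamiliar with the idea, greylisting can be sketched in a few lines. This is a toy model, not how any particular milter implements it: the class name, the 300-second delay and the in-memory dictionary are assumptions for illustration.

```python
class Greylist:
    """Toy greylist keyed on the (client IP, sender, recipient) triplet."""

    def __init__(self, delay_seconds=300):
        self.delay = delay_seconds   # assumed minimum wait before acceptance
        self.first_seen = {}         # triplet -> time of first delivery attempt

    def check(self, client_ip, sender, recipient, now):
        """Return 'tempfail' until the triplet has aged past the delay."""
        key = (client_ip, sender, recipient)
        first = self.first_seen.setdefault(key, now)
        return "accept" if now - first >= self.delay else "tempfail"


greylist = Greylist(delay_seconds=300)
first_try = greylist.check("192.0.2.1", "a@example.com", "b@example.org", now=0)
retry = greylist.check("192.0.2.1", "a@example.com", "b@example.org", now=600)
```

The point for this thread is the delay: by the time a sender retries, the RBLs and hash databases have had some minutes to catch up.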

On Fri, 27 Mar 2015, Amir Caspi wrote:

This is my whole issue -- since my user appears to be very high up on the recipient list 
for all these spammers, and is therefore getting spams before the network checks are 
effective, how can we combat these new spams _before_ the network checks 
become effective?

Thanks.

--- Amir




--
Public key #7BBC68D9 at| Shane Williams
http://pgp.mit.edu/|  System Admin - UT CompSci
=--+---
All syllogisms contain three lines |  sha...@shanew.net
Therefore this is not a syllogism  | www.ischool.utexas.edu/~shanew

Re: Uptick in spam

On Mar 27, 2015, at 1:38 PM, sha...@shanew.net wrote:

 Apologies if this is an overly obvious answer, but are you using any
 greylisting?  This would (potentially) move your user away from the
 wavefront of a spam's distribution, and give it a better chance of
 triggering the network-based tests.

No, unfortunately not.  It's something I've been considering but with my 
current system setup I don't know of an easy way to implement it.  
Unfortunately the system setup is fixed due to the virtual hosting software 
being run on it.  There is a possibility this can change in the future, 
depending on our client setup, but right now we're stuck with it, so I can't do 
things like use amavisd or dovecot or whatever.

If I can easily implement greylisting from within sendmail without breaking the 
current setup, that's certainly something I'd consider doing...

Of course, I am aware of the debate regarding greylisting.  In particular, this 
can cause significant problems for one-time password emails, e.g. from banks, 
where a significant delay in delivery causes huge problems.  I'm not sure how 
to work around that.

Thanks.

--- Amir

Re: Uptick in spam

On Feb 16, 2015, at 11:47 AM, Kevin A. McGrail kmcgr...@pccc.com wrote:

 I'm happy to look at a recent sample and throw it through my system to see 
 what it hits but overall, I've been seeing the exact opposite.

So, one of my users has been getting dozens (sometimes nearly 100) FNs per DAY 
over the last few weeks.  Even though many of these emails are hitting 
BAYES_999, they are not hitting any other non-negligible scoring rules.  I have 
set BAYES_99 + BAYES_999 to a combined score of 4.9 because I don't want it to 
be a complete poison pill, but this is contributing to something like 50% of 
the FNs (where only BAYES_999 is contributing to the score because no other 
rules are hitting).  The other 50% are not getting high-enough Bayes scores, 
but even then, many still don't hit many (or any) other scoring rules so that 
they would still have this problem even if they scored BAYES_999.  In many 
cases, it would appear that he is getting a fresh batch that hasn't yet hit 
the RBLs or hash DBs, which is why even with BAYES_999 they don't score over 
the 5.0 threshold... it's causing some severe inbox unpleasantness.
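
The arithmetic behind these FNs can be sketched as follows. The split of the combined 4.9 between BAYES_99 and BAYES_999 and the URIBL_BLACK score are made-up values for illustration; only the 4.9 total and the 5.0 threshold come from the message above.

```python
REQUIRED_SCORE = 5.0  # threshold mentioned in the message


def is_spam(rule_scores, required=REQUIRED_SCORE):
    """A message is tagged as spam once its summed rule scores reach the threshold."""
    return sum(rule_scores.values()) >= required


# Fresh spam: only the Bayes rules hit (hypothetical split of the 4.9 total).
fresh = {"BAYES_99": 3.9, "BAYES_999": 1.0}

# The same message a few hours later, once a URIBL has listed its links
# (the URIBL_BLACK score here is an invented value).
later = dict(fresh, URIBL_BLACK=1.7)

fresh_tagged = is_spam(fresh)   # Bayes alone stays below the threshold
later_tagged = is_spam(later)   # one extra network hit pushes it over
```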

I've been trying to come up with some good URI template rules to block many of 
these but spammers are getting sufficiently generic in their URIs that I worry 
strongly about FPs for these.  I haven't been able to identify any other 
distinctive markers in the template against which I can reliably write rules, 
although I also don't have a program that does strong comparisons to look for 
patterns (I'm just doing this by eye).
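
One way to look for templates without doing it by eye is to reduce every URI in the corpus to a coarse "shape" and count the shapes. The following is only a sketch under assumptions: the function names, the URL regex and the choice of normalisation (digit runs become `N`, letter runs become `a`) are mine, not something from the thread, and a frequent shape would still need checking against ham before becoming a rule.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Deliberately loose URL matcher; good enough for a first look at a corpus.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)
RUN_RE = re.compile(r"[0-9]+|[A-Za-z]+")


def uri_shape(url):
    """Collapse the path of a URL: digit runs -> 'N', letter runs -> 'a'."""
    path = urlsplit(url).path
    return RUN_RE.sub(lambda m: "N" if m.group()[0].isdigit() else "a", path)


def common_shapes(messages, top=10):
    """Count path shapes over an iterable of raw message texts."""
    counts = Counter()
    for text in messages:
        # Count each distinct URL once per message.
        for url in set(URL_RE.findall(text)):
            counts[uri_shape(url)] += 1
    return counts.most_common(top)
```

Shapes that dominate the spam corpus but never appear in a ham corpus are candidates for a `uri` rule.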

I have his spam corpus of a few thousand messages... simple Bayes training 
doesn't seem to help, so some sort of template matching would really be useful 
here, but as I said, I haven't really found anything that I feel comfortable 
writing rules against without significant risk of FPs.

Might anyone have some ideas?

This is getting to be a serious issue for this user and I'm getting 
complaints...

Thanks.

(For reference: running SA 3.4.0 on CentOS 5.11.)

--- Amir

Re: Uptick in spam

On Mar 27, 2015, at 12:20 PM, Axb axb.li...@gmail.com wrote:

 - Please post missed spam samples in pastebin.com - do not post samples to 
 mailing lists

Of course, I would never post it to the list.  I will put up a few in pastebin 
but there are so many of them, and there are a few different templates in use, 
so I don't know if I can really capture them all.  I obviously can't post the 
entire corpus on pastebin. ;-)

Here are a few spamples:

http://pastebin.com/3nSLurGv  (this scored BAYES_99 but would still have been 
FN with BAYES_999)
http://pastebin.com/LaKT5ZZK (I have a rule template for these URIs but recent 
spams have modified them to cause high risk of FPs for such rules)
http://pastebin.com/qSgBxR5B (BAYES_999; could potentially be caught by an 
excessive HTML entity rule, but none seemed to hit... is there one?)

For the first and last one, the URIs are way too similar to blog URIs that 
would be in use by legitimate agencies, so I suspect there is a high risk for 
FPs on those.  The middle one uses a template that I have URI rules for, but 
the URIs are evolving to use randomized server names which are also basically 
impossible to template against without risk of FPs.

I have hundreds more like these...

Cheers.

--- Amir

Re: How to automatically train each users Bayes?

2015-03-27 Thread Matus UHLAR - fantomas


On 27.03.15 15:16, Michael wrote:
I would like to automatically train each user's Bayes database in the 
following way:


Do the following once a day for each user:
1.) sa-learn -u username --ham ../maildir/cur
2.) sa-learn -u username --spam ../maildir/.Spam/cur



What do you think about this strategy?


The easiest way is to train on false positives and false negatives.
Dovecot imapd has a plugin to train when mail is moved from/to spam.

If you use something else, you should create a pair of special folders for users
to train both ham and spam.


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
To Boot or not to Boot, that's the question. [WD1270 Caviar]

Re: How to automatically train each users Bayes?

On 27.03.2015 19:09, RW wrote:
 On Fri, 27 Mar 2015 15:16:13 +
 Michael wrote:
 
 Hi,

 I would like to automatically train each user's Bayes database in the  
 following way:

 Do the following once a day for each user:
 1.) sa-learn -u username --ham ../maildir/cur
 2.) sa-learn -u username --spam ../maildir/.Spam/cur

 The idea is to train the Bayes for each user without the need to
 take care of learning Spam/Ham on their own.

 The reason for taking the cur folder instead of the new folder
 is that I assume that the contents of these folders have already
 been verified for false-positives/negatives by the user.
 
 cur doesn't imply that the mail has been read; for that you
 need to check the seen flag in the filename, an S somewhere after the
 colon.

Yes, that's true. But if I'm right, new mails stay in new until the
appropriate folder in the IMAP client has been opened, right? I just
assume that if the user has some false negatives in the folder, they will
either delete them immediately or just move them into the Spam folder.

 
 
 A problem that could occur is when the user always deletes all mails  
 in .Spam/cur. Then the Bayes is only trained with Ham, but never
 Spam. Or isn't that a problem?
 
 Not if you tell them - then it's their fault if it doesn't work.
 Alternately you could have a separate train-spam folder and empty it
 after training.

I think it's easier for the users if they just leave spam in the Spam
folder for at least one day. Most of them will not move spam into a
learn folder.

 
 You could also supplement spam training by autolearning only spam, e.g.
 I have:
 
 bayes_auto_learn 1
 bayes_auto_learn_on_error 1
 bayes_auto_learn_threshold_nonspam -2000.0

But that learns spam only if its score is above 12.0, and learns no nonspam.
So maybe the default config, which autolearns both spam and ham, is
already the best...
My setup is already configured to retrain when the user moves mail from
Inbox to Spam or from Spam to another folder.
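
A simplified sketch of how the two thresholds interact. The defaults shown (0.1 and 12.0) are SpamAssassin's documented defaults for `bayes_auto_learn_threshold_nonspam` and `bayes_auto_learn_threshold_spam`; the function itself is a hypothetical model, it ignores the additional conditions the real autolearner applies, and the exact handling of scores that equal a threshold is an assumption.

```python
def autolearn_decision(score, nonspam_threshold=0.1, spam_threshold=12.0):
    """Return 'spam', 'ham' or None (not learned) for an autolearn score.

    Simplified model: only the two thresholds are considered.
    """
    if score >= spam_threshold:
        return "spam"
    if score < nonspam_threshold:
        return "ham"
    return None  # the wide middle band is never autolearned


# With the defaults, a spam that scores just above the 5.0 tagging
# threshold is tagged but not learned.
borderline = autolearn_decision(5.3)

# With the nonspam threshold at -2000.0, ham is effectively never autolearned.
ham_with_override = autolearn_decision(-1.0, nonspam_threshold=-2000.0)
```

This is the gap described in the thread: mail scoring between the two thresholds only enters the Bayes database through manual training.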

 
 Personally I've never seen a spam mis-trained as ham with the
 default threshold, and sensible rule scores.
 
 I think where some people go wrong is that they don't specify
 aggressive custom scores correctly. With autolearning it's better to
 keep conservative scores in the non-Bayes scoresets e.g.
 
 score SOME_RULE  2 2 8 8
 
 not
 
 score SOME_RULE  8
 
 There's no difference in classification, but the latter is more likely to
 cause mis-training on FPs.

Re: How to automatically train each users Bayes?



On 27.03.2015 19:54, Matus UHLAR - fantomas wrote:
 On 27.03.15 15:16, Michael wrote:
 I would like to automatically train each user's Bayes database in the
 following way:

 Do the following once a day for each user:
 1.) sa-learn -u username --ham ../maildir/cur
 2.) sa-learn -u username --spam ../maildir/.Spam/cur
 
 What do you think about this strategy?
 
 The easiest way is to train on false positives and false negatives.
 Dovecot imapd has a plugin to train when mail is moved from/to spam.

My concerns are the following:
Sometimes a new kind of spam appears. This new kind often gets low
scores, just 0.1 to 0.5 points above the limit, so the
autolearner never triggers.
If the same spam then comes from another sending server, the score is
just a little below the threshold, so I get a false negative.
If the previous spam had already been learned, the second mail
would have been scored as spam.

 
 If you use something else, you should create a pair of special folders for
 users to train both ham and spam.

Re: Uptick in spam

2015-03-27 Thread Reindl Harald



Am 27.03.2015 um 19:13 schrieb Amir Caspi:

On Feb 16, 2015, at 11:47 AM, Kevin A. McGrail kmcgr...@pccc.com wrote:


I'm happy to look at a recent sample and throw it through my system to see what 
it hits but overall, I've been seeing the exact opposite.


So, one of my users has been getting dozens (sometimes nearly 100) FNs per DAY 
over the last few weeks.  Even though many of these emails are hitting 
BAYES_999, they are not hitting any other non-negligible scoring rules


What helps a lot here are custom subject rules:

* contains
* starts with
* ends with
* equal

5 different score levels

* very low:  0.5
* low:       1.5
* medium:    2.5
* high:      3.5
* very high: 4.5
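
The scheme above can be modelled roughly as follows. This is a Python sketch of the idea, not the poster's actual rules (which are not published): the example subjects are invented, and matching case-insensitively is an assumption.

```python
# Score levels as listed in the message.
LEVELS = {"very low": 0.5, "low": 1.5, "medium": 2.5, "high": 3.5, "very high": 4.5}


def subject_matches(subject, kind, text):
    """Apply one of the four match types to a subject line."""
    s, t = subject.lower(), text.lower()  # case-insensitive: an assumption
    if kind == "contains":
        return t in s
    if kind == "starts with":
        return s.startswith(t)
    if kind == "ends with":
        return s.endswith(t)
    if kind == "equal":
        return s == t
    raise ValueError(f"unknown match type: {kind}")


def subject_score(subject, rules):
    """Sum the level scores of every (kind, text, level) rule that matches."""
    return sum(LEVELS[level] for kind, text, level in rules
               if subject_matches(subject, kind, text))


# Invented example rules, purely for illustration.
RULES = [
    ("starts with", "re: your invoice", "medium"),
    ("contains", "miracle", "low"),
    ("equal", "hi", "very low"),
]
```

In SpamAssassin itself each entry would be a `header` rule on `Subject` with a matching `score` line.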

We currently have 577 different subjects and subject parts scored. I
don't want to publish them because I'd rather the spammers didn't switch to
new ones :-)





Re: Uptick in spam

2015-03-27 Thread Matus UHLAR - fantomas


On 27.03.15 12:51, Amir Caspi wrote:

Here are a few spamples:

http://pastebin.com/3nSLurGv  (this scored BAYES_99 but would still have been 
FN with BAYES_999)
http://pastebin.com/LaKT5ZZK (I have a rule template for these URIs but recent 
spams have modified them to cause high risk of FPs for such rules)
http://pastebin.com/qSgBxR5B (BAYES_999; could potentially be caught by an 
excessive HTML entity rule, but none seemed to hit... is there one?)


I see no network checks here... do you use network checks?

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
10 GOTO 10 : REM (C) Bill Gates 1998, All Rights Reserved!

Re: How to automatically train each users Bayes?



On 27.03.2015 16:21, Reindl Harald wrote:
 
 
 Am 27.03.2015 um 16:16 schrieb Michael:
 I would like to automatically train each user's Bayes database in the
 following way:

 Do the following once a day for each user:
 1.) sa-learn -u username --ham ../maildir/cur
 2.) sa-learn -u username --spam ../maildir/.Spam/cur

 The idea is to train the Bayes for each user without the need to take
 care of learning Spam/Ham on their own.

 The reason for taking the cur folder instead of the new folder is
 that I assume that the contents of these folders have already been
 verified for false-positives/negatives by the user.

 A problem that could occur is when the user always deletes all mails in
 .Spam/cur. Then the Bayes is only trained with Ham, but never Spam. Or
 isn't that a problem?

 What do you think about this strategy?
 
 nothing good because in that case you can just stay at autolearning
 which is on by default after a bayes has at least 200 ham and 200 spam
 samples to get enabled at all
 

You are probably right. Auto learning is already working for all users
because I'm always training new users with a preselected ham/spam folder

Re: How to automatically train each users Bayes?

On Fri, 27 Mar 2015 15:16:13 +
Michael wrote:

 Hi,
 
 I would like to automatically train each user's Bayes database in the  
 following way:
 
 Do the following once a day for each user:
 1.) sa-learn -u username --ham ../maildir/cur
 2.) sa-learn -u username --spam ../maildir/.Spam/cur
 
 The idea is to train the Bayes for each user without the need to
 take care of learning Spam/Ham on their own.
 
 The reason for taking the cur folder instead of the new folder
 is that I assume that the contents of these folders have already
 been verified for false-positives/negatives by the user.

cur doesn't imply that the mail has been read; for that you
need to check the seen flag in the filename, an S somewhere after the
colon.
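
Checking the seen flag is straightforward. A sketch, assuming the standard Maildir naming convention in which flags follow `:2,` at the end of the filename; the helper name `is_seen` and the sample filenames are made up.

```python
def is_seen(filename):
    """True if a Maildir filename carries the S (seen) flag."""
    _, sep, info = filename.rpartition(":")
    if not sep or not info.startswith("2,"):
        return False  # no info section, or not the standard "2," form
    return "S" in info[2:]


# Made-up filenames as they might appear in a cur directory.
names = [
    "1427468173.M1P2.host:2,S",    # seen
    "1427468199.M3P4.host:2,RS",   # replied and seen
    "1427468200.M5P6.host:2,",     # moved to cur, but never opened
]
seen_only = [name for name in names if is_seen(name)]
```

A daily training job could filter the contents of `cur` with `is_seen` before handing messages to `sa-learn`. Python's standard `mailbox.Maildir` class exposes the same flags via `get_flags()` on its messages.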


 A problem that could occur is when the user always deletes all mails  
 in .Spam/cur. Then the Bayes is only trained with Ham, but never
 Spam. Or isn't that a problem?

Not if you tell them - then it's their fault if it doesn't work.
Alternately you could have a separate train-spam folder and empty it
after training.

You could also supplement spam training by autolearning only spam, e.g.
I have:

bayes_auto_learn 1
bayes_auto_learn_on_error 1
bayes_auto_learn_threshold_nonspam -2000.0

Personally I've never seen a spam mis-trained as ham with the
default threshold, and sensible rule scores.

I think where some people go wrong is that they don't specify
aggressive custom scores correctly. With autolearning it's better to
keep conservative scores in the non-Bayes scoresets e.g.

score SOME_RULE  2 2 8 8

not

score SOME_RULE  8

There's no difference in classification, but the latter is more likely to
cause mis-training on FPs.
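
To see why the two `score` lines classify identically but autolearn differently, here is a small model of scoreset selection. The ordering of the four values (no Bayes/no net, no Bayes/net, Bayes/no net, Bayes/net) follows the SpamAssassin configuration documentation; `pick_score` is a hypothetical helper, and the statement that autolearning uses the non-Bayes counterpart of the active scoreset is my reading of the autolearn documentation.

```python
def pick_score(scores, bayes_enabled, net_enabled):
    """Select the score for a rule from a 1- or 4-value `score` line."""
    if len(scores) == 1:
        return scores[0]  # a single value applies to all four scoresets
    index = (2 if bayes_enabled else 0) + (1 if net_enabled else 0)
    return scores[index]


conservative = [2, 2, 8, 8]   # score SOME_RULE  2 2 8 8
aggressive = [8]              # score SOME_RULE  8

# Classification on a system with Bayes and network tests (scoreset 3):
classify_conservative = pick_score(conservative, True, True)
classify_aggressive = pick_score(aggressive, True, True)

# Autolearning, modelled as the non-Bayes counterpart (scoreset 1):
learn_conservative = pick_score(conservative, False, True)
learn_aggressive = pick_score(aggressive, False, True)
```

Both lines contribute 8 points to classification, but only the single-value form also contributes 8 points toward the autolearn decision, which is how one false positive on `SOME_RULE` can get a ham learned as spam.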

Re: Uptick in spam


On 03/27/2015 07:13 PM, Amir Caspi wrote:

On Feb 16, 2015, at 11:47 AM, Kevin A. McGrail kmcgr...@pccc.com
wrote:


I'm happy to look at a recent sample and throw it through my system
to see what it hits but overall, I've been seeing the exact
opposite.


So, one of my users has been getting dozens (sometimes nearly 100)
FNs per DAY over the last few weeks.  Even though many of these
emails are hitting BAYES_999, they are not hitting any other
non-negligible scoring rules.  I have set BAYES_99 + BAYES_999 to a
combined score of 4.9 because I don't want it to be a complete poison
pill, but this is contributing to something like 50% of the FNs
(where only BAYES_999 is contributing to the score because no other
rules are hitting).  The other 50% are not getting high-enough Bayes
scores, but even then, many still don't hit many (or any) other
scoring rules so that they would still have this problem even if they
scored BAYES_999.  In many cases, it would appear that he is getting
a fresh batch that hasn't yet hit the RBLs or hash DBs, which is
why even with BAYES_999 they don't score over the 5.0 threshold...
it's causing some severe inbox unpleasantness.

I've been trying to come up with some good URI template rules to
block many of these but spammers are getting sufficiently generic in
their URIs that I worry strongly about FPs for these.  I haven't been
able to identify any other distinctive markers in the template
against which I can reliably write rules, although I also don't have
a program that does strong comparisons to look for patterns (I'm just
doing this by eye).

I have his spam corpus of a few thousand messages... simple Bayes
training doesn't seem to help, so some sort of template matching
would really be useful here, but as I said, I haven't really found
anything that I feel comfortable writing rules against without
significant risk of FPs.

Might anyone have some ideas?

This is getting to be a serious issue for this user and I'm getting
complaints...


- Please post missed spam samples in pastebin.com - do not post samples 
to mailing lists

Re: Uptick in spam

On Fri, 27 Mar 2015 12:13:30 -0600
Amir Caspi wrote:

 On Feb 16, 2015, at 11:47 AM, Kevin A. McGrail kmcgr...@pccc.com
 wrote:
 
  I'm happy to look at a recent sample and throw it through my system
  to see what it hits but overall, I've been seeing the exact
  opposite.
 
 So, one of my users has been getting dozens (sometimes nearly 100)
 FNs per DAY over the last few weeks.  Even though many of these
 emails are hitting BAYES_999, they are not hitting any other
 non-negligible scoring rules.  I have set BAYES_99 + BAYES_999 to a
 combined score of 4.9 because I don't want it to be a complete poison
 pill,

Personally I've found that trying to work around BAYES_99 not being a
poison pill causes more FPs than making it one. YMMV.

Re: Uptick in spam

On Mar 27, 2015, at 12:22 PM, Reindl Harald h.rei...@thelounge.net wrote:

 We currently have 577 different subjects and subject parts scored. I don't 
 want to publish them because I'd rather the spammers didn't switch to new ones 
 :-)

Sadly, that doesn't help me.  I don't have time to compile hundreds of subject 
rules, managing email is not my full-time job and I don't want it to become 
one.  If you care to share, that would be much appreciated, but otherwise I 
can't spend time writing hundreds of custom rules.  This is why I look for URI 
templates where regexps work well... looking for keywords or key phrases would 
be a huge quagmire, and that's what Bayes is supposed to be for.

As to publishing, I personally feel holding rules to one's self is not 
productive.  Spammers evolve regardless, and in the meantime those templates 
benefit nobody but one's own system.  Distributing them publicly will help 
everyone and could help others publish better rules in the future.  Obviously, 
others may disagree.

Cheers.

--- Amir

Re: Uptick in spam

2015-03-27 Thread Kevin A. McGrail


On 3/27/2015 2:51 PM, Amir Caspi wrote:

On Mar 27, 2015, at 12:20 PM, Axb axb.li...@gmail.com wrote:


- Please post missed spam samples in pastebin.com - do not post samples to 
mailing lists

Of course, I would never post it to the list.  I will put up a few in pastebin 
but there are so many of them, and there are a few different templates in use, 
so I don't know if I can really capture them all.  I obviously can't post the 
entire corpus on pastebin. ;-)

Are you using network tests?  These are scoring pretty high for me.

Re: How to automatically train each users Bayes?

2015-03-27 Thread Matus UHLAR - fantomas


On 27.03.2015 19:54, Matus UHLAR - fantomas wrote:

The easiest way is to train on false positives and false negatives.
Dovecot imapd has a plugin to train when mail is moved from/to spam.


On 27.03.15 20:10, Michael wrote:

My concerns are the following:
Sometimes a new kind of spam appears. This new kind often gets low
scores, just 0.1 to 0.5 points above the limit, so the
autolearner never triggers.
If the same spam then comes from another sending server, the score is
just a little below the threshold, so I get a false negative.
If the previous spam had already been learned, the second mail
would have been scored as spam.


I don't get this. Or should I add that it's of course good to continue with
autolearning, but _also_ allow manual learning ?

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Micro$oft random number generator: 0, 0, 0, 4.33e+67, 0, 0, 0...

Re: Uptick in spam

2015-03-27 Thread John Hardin


On Fri, 27 Mar 2015, Amir Caspi wrote:


On Mar 27, 2015, at 12:56 PM, Matus UHLAR - fantomas uh...@fantomas.sk wrote:


I see no network checks here... do you use network checks?


On Mar 27, 2015, at 1:11 PM, Kevin A. McGrail kmcgr...@pccc.com wrote:


Are you using network tests?  These are scoring pretty high for me.


I presume you're talking about things like Razor, Pyzor, DCC, and 
various RBLs?  Yes, those are enabled.  The reason you're not seeing 
them is because they didn't hit when the messages were first received. 
I'm getting the same hits NOW that you are seeing, but those did NOT hit 
when the messages first arrived.


Have you considered greylisting?

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  The one political issue that strips all politicians bare is
  individual gun rights.
---
 5 days until April Fools' day

Re: Uptick in spam

2015-03-27 Thread John Hardin


On Fri, 27 Mar 2015, Amir Caspi wrote:


On Mar 27, 2015, at 1:38 PM, sha...@shanew.net wrote:


Apologies if this is an overly obvious answer, but are you using any
greylisting?  This would (potentially) move your user away from the
wavefront of a spam's distribution, and give it a better chance of
triggering the network-based tests.


No, unfortunately not.  It's something I've been considering but with my 
current system setup I don't know of an easy way to implement it. 
Unfortunately the system setup is fixed due to the virtual hosting 
software being run on it.  There is a possibility this can change in the 
future, depending on our client setup, but right now we're stuck with 
it, so I can't do things like use amavisd or dovecot or whatever.


If I can easily implement greylisting from within sendmail without 
breaking the current setup, that's certainly something I'd consider 
doing...


(all caught up now, sheesh).

Can you install milters? Take a look at milter-greylist.
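
For readers unfamiliar with the idea, here is a minimal sketch of the greylisting decision itself, in Python. It is not how milter-greylist is implemented; the class name, the 300-second delay and the in-memory dict are assumptions made purely for illustration. A previously unseen (client IP, envelope sender, recipient) triplet is deferred, and a retry after the delay is accepted, which gives the network tests time to catch up.

```python
from dataclasses import dataclass, field


@dataclass
class Greylist:
    """Toy greylist keyed on (client IP, envelope sender, recipient)."""

    min_delay: float = 300.0  # seconds a new triplet must wait (assumed value)
    first_seen: dict = field(default_factory=dict)

    def check(self, ip: str, sender: str, rcpt: str, now: float) -> str:
        key = (ip, sender.lower(), rcpt.lower())
        seen = self.first_seen.get(key)
        if seen is None:
            # First attempt: remember it and ask the client to retry later.
            self.first_seen[key] = now
            return "TEMPFAIL"
        if now - seen < self.min_delay:
            # Retried too quickly: keep deferring.
            return "TEMPFAIL"
        # Retried after the delay, as a well-behaved MTA would.
        return "ACCEPT"
```

A real deployment would also expire old entries and whitelist known senders; the sketch leaves both out.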


Re: How to automatically train each users Bayes?

On Fri, 27 Mar 2015 20:03:18 +0100
Michael wrote:

 On 27.03.2015 19:09, RW wrote:
  On Fri, 27 Mar 2015 15:16:13 +

  cur doesn't imply that the mail has been read; for that you
  need to check the seen flag in the filename, an S somewhere after
  the colon.
 
 Yes, that's true. But if I'm right, new mails stay in new until the
 appropriate folder in the IMAP client has been opened, right? I just
 assume that if the user has some false negatives in the folder, he will
 either immediately delete them or just move them into the Spam folder.
 

People can have mail clients running unattended in the background,
often on multiple devices, so you can't assume it's been seen by a
human.
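
As an illustration of the seen-flag check mentioned above, here is a small Python sketch that reads the flag letters from a Maildir filename. It assumes the common `unique:2,FLAGS` naming scheme; the function names are made up for this example. Note that a set S flag only shows that some client marked the message as read, not that a person looked at it.

```python
def maildir_flags(filename: str) -> set:
    """Return the flag letters found after ':2,' in a Maildir filename."""
    _, sep, info = filename.rpartition(":")
    if not sep or not info.startswith("2,"):
        # No info section (e.g. a file still in new/), so no flags.
        return set()
    return set(info[2:])


def is_seen(filename: str) -> bool:
    """True if the Seen flag (S) is set in the filename."""
    return "S" in maildir_flags(filename)
```

A training script could then skip every file in cur for which `is_seen()` is false before handing the rest to sa-learn.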

  You could also supplement spam training by autolearning only spam,
  e.g. I have:
  
  bayes_auto_learn 1
  bayes_auto_learn_on_error 1
  bayes_auto_learn_threshold_nonspam -2000.0
 
 But that learns spam only if its score is above 12.0. And learns no
 nonspam.

That's why I suggested using it to *supplement* spam training. When it
works, autotraining does have the advantage of happening in real-time.

 And then maybe the default config which auto learns spam and
 ham is already the best...

The default doesn't learn ham well; I'd only do that as a last resort.

 My setup is already configured to retrain when the user moves mail from
 Inbox to Spam or from Spam to another folder.

This is a really poor way of training Bayes because it trains on SA
misclassifications rather than Bayes misclassifications. It's a poor
way of training spam and very much worse at training ham.  


On Fri, 27 Mar 2015 20:14:03 +0100
Matus UHLAR - fantomas wrote:

 On 27.03.2015 19:54, Matus UHLAR - fantomas wrote:
  the easiest way is to train on false positives and false negatives.
  dovecot imapd has plugin to train when mail is moved from/to spam.
 
 On 27.03.15 20:10, Michael wrote:
 My concerns are the following:
 Sometimes a new kind of spam appears. It often gets low scores, just
 0.1 to 0.5 points above the limit, so the auto learner does not pick
 it up.
 If the same spam then comes from another sending server, the score is
 just a little bit below the limit, so I get a false negative.
 If the previous spam had already been learned, the second mail
 would have been scored as spam.
 
 I don't get this. 

By the sound of it the OP is already using the dovecot plugin or
equivalent.

The first spam wasn't autolearned and was correctly identified as
spam. In this case the plugin doesn't provide a way of training it,
even if it has BAYES_00, because it's already in the spam folder.

People keep recommending the plugin, but IMO it's a poor choice for
SpamAssassin.

Re: How to automatically train each users Bayes?

2015-03-27 Thread Alex Regan


Hi,


Yes, that's true. But if I'm right, new mails stay in new until the
appropriate folder in the IMAP client has been opened, right? I just
assume that if the user has some false negatives in the folder, he will
either immediately delete them or just move them into the Spam folder.


People can have mail clients running unattended in the background,
often on multiple devices, so you can't assume it's been seen by a
human.


Does anyone have any suggestions on how to enable Exchange users to 
submit messages they consider to be spam for analysis? The latest 
Exchange has disabled IMAP access to public folders.


We have one setup where we forward the mail to their internal Exchange 
system. We used to have spam and ham folders where users would place 
mail for us to review and then train Bayes, but we haven't been able to do 
that for a while because of the missing IMAP access.


Thanks,
Alex

Re: Uptick in spam

2015-03-27 Thread Richard Doyle

On 03/27/2015 11:51 AM, Amir Caspi wrote:
 On Mar 27, 2015, at 12:20 PM, Axb axb.li...@gmail.com wrote:

 - Please post missed spam samples in pastebin.com - do not post samples to 
 mailing lists
 Of course, I would never post it to the list.  I will put up a few in 
 pastebin but there are so many of them, and there are a few different 
 templates in use, so I don't know if I can really capture them all.  I 
 obviously can't post the entire corpus on pastebin. ;-)

 Here are a few spamples:

 http://pastebin.com/3nSLurGv  (this scored BAYES_99 but would still have been 
 FN with BAYES_999)
 http://pastebin.com/LaKT5ZZK (I have a rule template for these URIs but 
 recent spams have modified them to cause high risk of FPs for such rules)
 http://pastebin.com/qSgBxR5B (BAYES_999; could potentially be caught by an 
 excessive HTML entity rule, but none seemed to hit... is there one?)
All of these were From: domains created today.



 For the first and last one, the URIs are way too similar to blog URIs that 
 would be in use by legitimate agencies, so I suspect there is a high risk for 
 FPs on those.  The middle one uses a template that I have URI rules for, but 
 the URIs are evolving to use randomized server names which are also basically 
 impossible to template against without risk of FPs.

 I have hundreds more like these...

 Cheers.

 --- Amir

Re: Uptick in spam

On Mar 27, 2015, at 2:09 PM, Axb axb.li...@gmail.com wrote:

 As an AV product I'd recommend Sophos AND ESETS/Nod32.

I'll look into Sophos, I'm not entirely sure if I can deploy it on my system or 
not.  We have to use RPMs that can be distributed to the virtual hosts, etc... 
I'll definitely look into it.  Haven't heard about ESETS/Nod32, will check it 
out.

 I'd also suggest you disable msg munging if you want hashers to work.

I'll certainly consider that if this is a major issue.  I see hashers working 
on many other messages, but I'm not sure how munged those messages are.  I'll 
try to investigate to see if I've seen hash hits on munged messages...  Turning 
off munging will unfortunately reduce security since it allows embedded JS and 
web bugs, but if it improves the chances of those things getting properly 
tagged as spam then they won't open them anyway, so I guess it may come out in 
the wash.

 URI lists may also list URIs to .js and web bugs - you could be missing on 
 them.

Very good point.

 Are you an ISP/ASP or is this a corporate box?

A bit of both.  We run a dedicated server that is owned by a major ISP, but 
they basically only handle the upstream end.  We are root on the box and handle 
everything downstream.  We run a virtual hosting panel and our corporate 
clients run domains (for email and web hosting) as virtual hosts on the box.  
Each virthost is operated in a chroot environment, and the control panel 
distributes the central RPMs to each virthost.  So, everything we do has to 
work with the framework of the control panel and its virtual hosting 
environment.

 What are you really using MailScanner for?

Primarily as glue to clamav (via clamd) and for attachment policy enforcement 
(e.g., no .exe payloads), and secondarily for URI munging.

 I also wonder if you're doing any rejects at SMTP level.

Yes, I've implemented enhdnsbl in sendmail, querying SpamCop, Barracuda, and 
SpamHaus Zen (in that order).  I know Barracuda is often overzealous but we 
haven't seen any FP rejections (that we know of) yet.  Are there any other RBLs 
you suggest I add to sendmail's checks?  (I used to use NJABL but that's dead, 
and last time I asked on this list, I was told SORBS wasn't a good idea due to 
too many FP rejections.)

I also have greetpause enabled (at 1 sec) to reject trigger-happy spammers.

Cheers.

--- Amir

Re: Uptick in spam

On Mar 27, 2015, at 3:34 PM, Richard Doyle lists...@islandnetworks.com wrote:

 All of these were From: domains created today.

Shouldn't they have been picked up by DOB?  Or do I need to manually enable 
some DOB plugin in SA? (If so, please let me know how...)  When I ran the third 
spample manually a few hours ago, I still didn't see any DOB hit.

I see there is a URIBL_RHS_DOB... is there a SENDER_DOB rule as well?  If not, 
it seems like it would be a good idea to implement one... do I need to file a 
bug for it?

However, it would appear that all of the From: domains are the same as in the 
body URIs, which means URIBL_RHS_DOB should have popped... unless you mean that 
the subdomain (sub.domain.com) was DOB, but the main domain (www.domain.com 
and/or domain.com) were not DOB?  Or am I missing something?

Thanks.

--- Amir

Re: Uptick in spam

On 03/27/2015 11:44 PM, Amir Caspi wrote:

On Mar 27, 2015, at 3:34 PM, Richard Doyle
lists...@islandnetworks.com wrote:

All of these were From: domains created today.

Shouldn't they have been picked up by DOB? Or do I need to manually
enable some DOB plugin in SA? (If so, please let me know how...)
When I ran the third spample manually a few hours ago, I still didn't
see any DOB hit.

I see there is a URIBL_RHS_DOB... is there a SENDER_DOB rule as well?
If not, it seems like it would be a good idea to implement one... do
I need to file a bug for it?

However, it would appear that all of the From: domains are the same
as in the body URIs, which means URIBL_RHS_DOB should have popped...
unless you mean that the subdomain (sub.domain.com) was DOB, but the
main domain (www.domain.com and/or domain.com) were not DOB? Or am I
missing something?

DOB isn't realtime/zero hour.

I have zero Sendmail clue but if you can do it, also check
sender/helo/rdns against dbl.spamhaus.org's reply 127.0.1.2

(I can only provide Postfix config for this)
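
Independent of the MTA, the lookup itself is simple. The following Python sketch shows one way to check a sender domain against the DBL and test for the 127.0.1.2 reply mentioned above. The helper names and the injectable `resolve` parameter are assumptions for illustration, and this is not a sendmail configuration.

```python
import socket

DBL_ZONE = "dbl.spamhaus.org"
DBL_SPAM_DOMAIN = "127.0.1.2"  # reply code named in the message above


def dbl_query_name(domain: str, zone: str = DBL_ZONE) -> str:
    """Build the DNS name to query: <domain>.<zone>."""
    return f"{domain.strip('.').lower()}.{zone}"


def domain_of(address: str) -> str:
    """Extract the domain part of an envelope address like <user@host>."""
    return address.rsplit("@", 1)[-1].strip("<>").lower()


def is_dbl_spam_domain(domain: str, resolve=socket.gethostbyname_ex) -> bool:
    """True if the zone answers with the spam-domain code for this domain."""
    try:
        _, _, addrs = resolve(dbl_query_name(domain))
    except OSError:
        # NXDOMAIN or any resolver failure: treat as not listed.
        return False
    return DBL_SPAM_DOMAIN in addrs
```

The same function can be applied to the envelope sender domain, the HELO name and the rDNS name in turn.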

if you want to check sender in DOB you can use eval:check_rbl_envfrom
for a rule.

A few days ago I posted dbl_env_from.cf which should show how it's done
(the rule is untested)

http://mail-archives.apache.org/mod_mbox/spamassassin-users/201503.mbox/%3C55128D61.2020308%40gmail.com%3E

You also may want to look at the Invaluement IP/URI lists
(Invaluement.com). The detection rate is really good and the FP level is
extraordinarily low. IIRC you can get a test drive.

I wouldn't want to miss it.

Re: Uptick in spam

On Mar 27, 2015, at 5:12 PM, Axb axb.li...@gmail.com wrote:

 DOB isn't realtime/zero hour.

That kind of defeats the point, doesn't it?  I mean, if you wait too long, it's 
no longer DOB, it's few-DOB...

I would have imagined that a DOB server would operate in a caching mode where 
the first query on a domain would cause a whois lookup, which then generates a 
cache table entry with the reg date.  Subsequent lookups then don't incur a 
whois hit, they just check the cache table.  In this way it could be 
effectively realtime since only the first query causes a whois load, and it 
would always return the correct answer.

I guess that's not the case?
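
To make the scheme described above concrete, here is a rough Python sketch of a cache that performs the expensive registration-date lookup only once per domain. It illustrates the idea in this message only, not how the DOB list actually works. The `lookup_registration_date` callable stands in for a whois-style query and is hypothetical, as is the 5-day default window.

```python
from datetime import date, timedelta


class DomainAgeCache:
    """Cache registration dates so each domain is looked up only once."""

    def __init__(self, lookup_registration_date, max_age_days: int = 5):
        # lookup_registration_date: hypothetical callable, domain -> date
        self._lookup = lookup_registration_date
        self._cache = {}
        self.max_age = timedelta(days=max_age_days)  # assumed window

    def registration_date(self, domain: str) -> date:
        domain = domain.lower()
        if domain not in self._cache:
            # Only the first query for a domain pays for the real lookup.
            self._cache[domain] = self._lookup(domain)
        return self._cache[domain]

    def is_new(self, domain: str, today: date) -> bool:
        """True if the domain was registered within the window."""
        return today - self.registration_date(domain) <= self.max_age
```

Even with such a cache, every previously unseen domain still costs one real lookup, so the load grows with the number of distinct domains queried.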

 I have zero Sendmail clue but if you can do it, also check sender/helo/rdns 
 against dbl.spamhaus.org's reply 127.0.1.2

I haven't found a way to do this, but if someone knows, please post...

 You also may want to look at the Invaluement IP/URI lists.
 (Invaluement.com). Detection rate is real good and FP level is extraordinary. 
 IIRC you can get a test drive.
 I wouldn't want to miss it.

Unfortunately a paid service is not in the cards right now.

Does anyone recommend using the PSBL (Surriel) for sendmail dnsbl?  I see that 
it's enabled by default in SA, but should I promote it to the sendmail level, 
or is it too prone to FP?

On a related note... since I implemented SpamCop, Barracuda, and SpamHaus at 
the sendmail level, should I disable those RBL lookups in SA, to prevent 
double-querying the RBLs for those mails that do get through?  Or does SA check 
_all_ Received lines, in which case I should leave it enabled since sendmail 
only checks the connecting MTA?  (I should note that I _HAVE_ seen 
RCVD_IN_XBL/PBL/SBL and RCVD_IN_BL_SPAMCOP_NET pop up not infrequently, despite 
implementing dnsbl for those RBLs in sendmail, which means either they're 
getting listed in the small interval between sendmail and SA, or SA is checking 
more than just the last hop...)

Thanks.

--- Amir

Re: Uptick in spam

2015-03-27 Thread Richard Doyle

On 03/27/2015 03:44 PM, Amir Caspi wrote:
 On Mar 27, 2015, at 3:34 PM, Richard Doyle lists...@islandnetworks.com 
 wrote:

 All of these were From: domains created today.
 Shouldn't they have been picked up by DOB?  Or do I need to manually enable 
 some DOB plugin in SA? (If so, please let me know how...)  When I ran the 
 third spample manually a few hours ago, I still didn't see any DOB hit.

 I see there is a URIBL_RHS_DOB... is there a SENDER_DOB rule as well?  If 
 not, it seems like it would be a good idea to implement one... do I need to 
 file a bug for it?

 However, it would appear that all of the From: domains are the same as in the 
 body URIs, which means URIBL_RHS_DOB should have popped... unless you mean 
 that the subdomain (sub.domain.com) was DOB, but the main domain 
 (www.domain.com and/or domain.com) were not DOB?  Or am I missing something?
DOB misses many new domains. Whois often knows what's new, but using it
to detect spam doesn't scale. 
 

 Thanks.

 --- Amir

Re: Uptick in spam

On Fri, 27 Mar 2015 17:40:58 -0600
Amir Caspi wrote:

 On Mar 27, 2015, at 5:12 PM, Axb axb.li...@gmail.com wrote:
 
  DOB isn't realtime/zero hour.
 
 That kind of defeats the point, doesn't it?  I mean, if you wait too
 long, it's no longer DOB, it's few-DOB...

I think it's 5 days, and the day-old bit is part of the bread
metaphor, not the definition. 


 On a related note... since I implemented SpamCop, Barracuda, and
 SpamHaus at the sendmail level, should I disable those RBL lookups in
 SA, to prevent double-querying the RBLs for those mails that do get
 through?  Or does SA check _all_ Received lines, in which case I
 should leave it enabled since sendmail only checks the connecting
 MTA?  (I should note that I _HAVE_ seen RCVD_IN_XBL/PBL/SBL and
 RCVD_IN_BL_SPAMCOP_NET pop up not infrequently, despite implementing
 dnsbl for those RBLs in sendmail, which means either they're getting
 listed in the small interval between sendmail and SA, or SA is
 checking more than just the last hop...)

There are deep checks for SBL (via zen) and SPAMCOP. XBL/PBL are
last-external only.
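
For context, each of these checks comes down to one DNS query per relay IP, whether the IP comes from the last external hop or from a deeper Received line. The Python sketch below shows how an IPv4 address is turned into a DNSBL query name and how a Zen reply could be mapped to a sub-list. The function names are invented for this example, and the reply-code mapping is my understanding of the Zen return codes; verify it against the Spamhaus documentation before relying on it.

```python
import ipaddress


def dnsbl_query_name(ip: str, zone: str = "zen.spamhaus.org") -> str:
    """Reverse the IPv4 octets and append the DNSBL zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def zen_sublist(reply: str) -> str:
    """Map a 127.0.0.x Zen reply to a sub-list name (assumed mapping)."""
    prefix, _, last = reply.rpartition(".")
    if prefix != "127.0.0" or not last.isdigit():
        return "unknown"
    code = int(last)
    if code == 2:
        return "SBL"
    if code == 3:
        return "SBL-CSS"
    if 4 <= code <= 7:
        return "XBL"
    if code in (10, 11):
        return "PBL"
    return "unknown"
```

With a caching resolver on the mail server, a query repeated by SA shortly after sendmail made it is normally answered from cache.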

Re: Uptick in spam


On 03/28/2015 12:40 AM, Amir Caspi wrote:

On Mar 27, 2015, at 5:12 PM, Axb axb.li...@gmail.com wrote:


DOB isn't realtime/zero hour.


That kind of defeats the point, doesn't it?  I mean, if you wait too
long, it's no longer DOB, it's few-DOB...

I would have imagined that a DOB server would operate in a caching
mode where the first query on a domain would cause a whois lookup,
which then generates a cache table entry with the reg date.



Subsequent lookups then don't incur a whois hit, they just check the
cache table.  In this way it could be effectively realtime since only
the first query causes a whois load, and it would always return the
correct answer.

I guess that's not the case?


DOB is based on more or less publicly accessible daily TLD zone data 
(ICANN ZFA)


You're thinking passive DNS, as done by
https://www.farsightsecurity.com/

I have access to their DNSDB service for a hobby project and it's amazing.

Farsight's NOD service is way out of our means.


Does anyone recommend using the PSBL (Surriel) for sendmail dnsbl?  I
see that it's enabled by default in SA, but should I promote it to
the sendmail level, or is it too prone to FP?


It works fine for a family server, but I wouldn't use it for rejecting 
spam in a client's mailflow.



On a related note... since I implemented SpamCop, Barracuda, and
SpamHaus at the sendmail level, should I disable those RBL lookups in
SA, to prevent double-querying the RBLs for those mails that do get
through?  Or does SA check _all_ Received lines, in which case I
should leave it enabled since sendmail only checks the connecting
MTA?  (I should note that I _HAVE_ seen RCVD_IN_XBL/PBL/SBL and
RCVD_IN_BL_SPAMCOP_NET pop up not infrequently, despite implementing
dnsbl for those RBLs in sendmail, which means either they're getting
listed in the small interval between sendmail and SA, or SA is
checking more than just the last hop...)


Hard to say without tailing your maillogs.
Though, if you have your trusted/internal SA settings right, extra SA 
checks shouldn't be an issue as you may already have most of the data in 
your resolver's cache anyway.

Re: Uptick in spam

2015-03-27 Thread David Jones

From: Amir Caspi ceph...@3phase.com
Sent: Friday, March 27, 2015 7:30 PM
To: RW
Cc: users@spamassassin.apache.org
Subject: Re: Uptick in spam

On Mar 27, 2015, at 6:19 PM, RW rwmailli...@googlemail.com wrote:

 There are deep checks for SBL (via zen) and SPAMCOP. XBL/PBL are
 last-external only.

Interesting.  I wonder why I see those XBL/PBL hits, then.  Maybe Zen timed 
out on those queries from sendmail... or something.  Either way I guess this 
means I should retain Zen and SC queries in SA.

You should be running a local DNS caching server like BIND or PowerDNS
Recursor on a mail server to help prevent timeouts that can make RBL
checks ineffective.

It's possible that your outbound mail could be hitting those RBLs in SA
in the event of a compromised account, or via the last-external IP in
the Received: headers, depending on what internal mail server you use
and whether it records the sending mail client's IP in X-Originating-IP
or Received headers.  I would recommend keeping those RBLs in SA to help
with outbound scanning and in case messages get past the MTA-level RBL
checking.

There shouldn't be duplicate hits on Zen/XBL/PBL if sendmail is
rejecting those messages before they reach SA.  If you get any RBL hits
in SA for lists that sendmail is configured to reject on, then there
must be some sendmail access list entry allowing the message to bypass
the RBL checks.

Esets NOD32 is very fast, very inexpensive, and works well with MailScanner.

The invaluement RBL is not expensive either and it is awesome.  We pay
thousands per year for a Spamhaus feed because of our volume and
mailboxes.  The invaluement RBL is only hundreds per year and it's
almost as good as Spamhaus Zen.  I have Spamhaus in front of invaluement
in my postfix configuration, but I may try flipping the order just to
see if it will start blocking more than Spamhaus.

Dave

Thanks.

--- Amir

Re: Uptick in spam

2015-03-27 Thread Dave Pooser

You also may want to look at the Invaluement IP/URI lists.
(Invaluement.com). Detection rate is real good and FP level is
extraordinary. 

+1. Very happy with invaluement at $DAYJOB.
-- 
Dave Pooser
Cat-Herder-in-Chief, Pooserville.com

Re: Uptick in spam