Re: Bayes not auto-learning?

2018-02-24 Thread David Jones

On 02/24/2018 01:05 AM, Amir Caspi wrote:

> On Feb 23, 2018, at 11:47 PM, David B Funk wrote:
>
> > It could have 20 points from a whole bunch of body rules but if it only hit 2
> > points via header rules it still will not auto-learn.
>
> Gotcha. The spam in question that triggered this hit a lot of rules, but hard
> for me to tell on cursory inspection whether it satisfies sufficient header and
> body points.  But it LOOKS like there should be at least 3 points from header
> (MISSING_HEADERS, FREEMAIL_FORGED_REPLYTO, among others) and certainly 3 body
> (MONEY_FRAUD_3 at the very least).  The actual spam report is this:
>
> *  0.0 FSL_CTYPE_WIN1251 Content-Type only seen in 419 spam
> *  0.0 NSL_RCVD_FROM_USER Received from User
> *  1.0 MISSING_HEADERS Missing To: header
> *  0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60%
> *  [score: 0.5004]
> *  1.1 DCC_CHECK Detected as bulk mail by DCC (dcc-servers.net)
> *  0.0 FROM_MISSP_MSFT From misspaced + supposed Microsoft tool
> *  0.0 FSL_NEW_HELO_USER Spam's using Helo and User
> *  2.6 MSOE_MID_WRONG_CASE No description available.
> *  0.0 FROM_MISSP_USER From misspaced, from "User"
> *  1.0 RDNS_DYNAMIC Delivered to internal network by host with
> *  dynamic-looking rDNS
> *  0.0 LOTS_OF_MONEY Huge... sums of money
> *  0.0 FROM_MISSP_XPRIO Misspaced FROM + X-Priority
> *  1.6 REPLYTO_WITHOUT_TO_CC No description available.
> *  0.0 AXB_XMAILER_MIMEOLE_OL_024C2 Yet another X header trait
> *  0.0 MSGID_FROM_MTA_HEADER Message-Id was added by a relay
> *  0.0 FSL_BULK_SIG Bulk signature with no Unsubscribe
> *  2.1 FREEMAIL_FORGED_REPLYTO Freemail in Reply-To, but not From
> *  1.0 FREEMAIL_REPLYTO Reply-To/From or Reply-To/body contain different
> *  freemails
> *  0.0 TO_NO_BRKTS_FROM_MSSP Multiple header formatting problems
> *  1.9 FORGED_MUA_OUTLOOK Forged mail pretending to be from MS Outlook
> *  1.6 TO_NO_BRKTS_DYNIP To: lacks brackets and dynamic rDNS
> *  0.0 FILL_THIS_FORM Fill in a form with personal information
> *  2.0 TO_NO_BRKTS_MSFT To: lacks brackets and supposed Microsoft tool
> *  2.0 FILL_THIS_FORM_LONG Fill in a form with personal information
> *  3.1 FROM_MISSP_FREEMAIL From misspaced + freemail provider
> *  3.0 MONEY_FRAUD_3 Lots of money and several fraud phrases
>
> But, it still didn't autolearn.
>
> (I can post the entire spample if the above seems like it should have
> autolearned.)
>
> > Another possible factor, if you have "bayes_auto_learn_on_error" enabled, then
> > autolearn will be skipped if Bayes already agrees with the condition of the
> > message. IE: if the message is already classified as BAYES_99 then it won't
> > bother auto-learning it as yet another high-ranking spam.
>
> I do not have that enabled.  Also, as you can see from above, this hit BAYES_50.
>
> Does the above provide an indication as to why it didn't autolearn?
>
> Thanks!
>
> --- Amir




I found the best thing to do is to set up a hidden mail server (iRedMail)
and split a copy of all mail to it, to be sorted and filtered into Ham and
Spam folders based on rule hits and scoring.  Then I run a nightly
sa-learn on the Ham and Spam folders (in that order).  The few
questionable emails that score in the middle stay in the Inbox, so I just
have to drag and drop them into the Ham or Spam folder, which takes a few
minutes a day.  Some that are new phishing campaigns or are from compromised
accounts are copied into a Spamcop folder that automatically submits them
to my Spamcop account.
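
A minimal sketch of the nightly step described above, in Python. It assumes the Ham and Spam folders are Maildir directories at hypothetical paths; `sa-learn --ham` and `sa-learn --spam` are the standard training switches, while the function names and paths are made up for illustration.

```python
import subprocess

# Hypothetical folder locations; adjust to wherever the sorted mail lives.
HAM_DIR = "/var/vmail/sorter/Maildir/.Ham/cur"
SPAM_DIR = "/var/vmail/sorter/Maildir/.Spam/cur"


def build_learn_commands(ham_dir, spam_dir):
    """Return the sa-learn invocations, ham first and spam second."""
    return [
        ["sa-learn", "--ham", ham_dir],
        ["sa-learn", "--spam", spam_dir],
    ]


def run_nightly(ham_dir=HAM_DIR, spam_dir=SPAM_DIR):
    """Run both training passes in order; raises if sa-learn exits non-zero."""
    for cmd in build_learn_commands(ham_dir, spam_dir):
        subprocess.run(cmd, check=True)
```

Calling `run_nightly()` from a cron job would cover the nightly run; messages sorted by hand during the day are simply picked up on the next pass.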


I also use the Ham and Spam folders for the nightly SA masscheck to help
get new rules validated and the new 72_scores.cf updated daily via sa-update.


--
David Jones


Re: Bayes not auto-learning?

2018-02-24 Thread Kevin A. McGrail

On 2/24/2018 2:05 AM, Amir Caspi wrote:

> Does the above provide an indication as to why it didn't autolearn?


No, the above does not help, as autolearning is complicated. I
believe a few years ago I added debug output or headers or something
that tried to make it clearer.  If it doesn't autolearn, I would not
stress.  It's not a simple, black-and-white decision based on a
single factor.


Off-hand, I can't find the work I did but 
$status->get_autolearn_points() might help you dig into the code.


Regards,

KAM



Re: Bayes not auto-learning?

2018-02-23 Thread Amir Caspi
On Feb 23, 2018, at 11:47 PM, David B Funk  wrote:
> It could have 20 points from a whole bunch of body rules but if it only hit 2
> points via header rules it still will not auto-learn.

Gotcha. The spam in question that triggered this hit a lot of rules, but hard 
for me to tell on cursory inspection whether it satisfies sufficient header and 
body points.  But it LOOKS like there should be at least 3 points from header 
(MISSING_HEADERS, FREEMAIL_FORGED_REPLYTO, among others) and certainly 3 body 
(MONEY_FRAUD_3 at the very least).  The actual spam report is this:

*  0.0 FSL_CTYPE_WIN1251 Content-Type only seen in 419 spam
*  0.0 NSL_RCVD_FROM_USER Received from User
*  1.0 MISSING_HEADERS Missing To: header
*  0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60%
*  [score: 0.5004]
*  1.1 DCC_CHECK Detected as bulk mail by DCC (dcc-servers.net)
*  0.0 FROM_MISSP_MSFT From misspaced + supposed Microsoft tool
*  0.0 FSL_NEW_HELO_USER Spam's using Helo and User
*  2.6 MSOE_MID_WRONG_CASE No description available.
*  0.0 FROM_MISSP_USER From misspaced, from "User"
*  1.0 RDNS_DYNAMIC Delivered to internal network by host with
*  dynamic-looking rDNS
*  0.0 LOTS_OF_MONEY Huge... sums of money
*  0.0 FROM_MISSP_XPRIO Misspaced FROM + X-Priority
*  1.6 REPLYTO_WITHOUT_TO_CC No description available.
*  0.0 AXB_XMAILER_MIMEOLE_OL_024C2 Yet another X header trait
*  0.0 MSGID_FROM_MTA_HEADER Message-Id was added by a relay
*  0.0 FSL_BULK_SIG Bulk signature with no Unsubscribe
*  2.1 FREEMAIL_FORGED_REPLYTO Freemail in Reply-To, but not From
*  1.0 FREEMAIL_REPLYTO Reply-To/From or Reply-To/body contain different
*  freemails
*  0.0 TO_NO_BRKTS_FROM_MSSP Multiple header formatting problems
*  1.9 FORGED_MUA_OUTLOOK Forged mail pretending to be from MS Outlook
*  1.6 TO_NO_BRKTS_DYNIP To: lacks brackets and dynamic rDNS
*  0.0 FILL_THIS_FORM Fill in a form with personal information
*  2.0 TO_NO_BRKTS_MSFT To: lacks brackets and supposed Microsoft tool
*  2.0 FILL_THIS_FORM_LONG Fill in a form with personal information
*  3.1 FROM_MISSP_FREEMAIL From misspaced + freemail provider
*  3.0 MONEY_FRAUD_3 Lots of money and several fraud phrases

But, it still didn't autolearn.

(I can post the entire spample if the above seems like it should have 
autolearned.)
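
To make the tallying less of a cursory inspection, here is a minimal Python sketch that pulls rule names and scores out of a pasted report and totals them without the BAYES_* hit. The function names are hypothetical, and the sketch does not attempt to split the total into header and body points, which is what the autolearn check needs.

```python
import re

# Matches lines such as "*  1.0 MISSING_HEADERS Missing To: header".
# Wrapped description lines ("*  [score: 0.5004]") do not match and are skipped.
RULE_LINE = re.compile(r"^\*\s+(-?\d+(?:\.\d+)?)\s+([A-Z0-9_]+)\b")


def parse_report(text):
    """Return a list of (rule_name, score) pairs found in a pasted report."""
    hits = []
    for line in text.splitlines():
        m = RULE_LINE.match(line.strip())
        if m:
            hits.append((m.group(2), float(m.group(1))))
    return hits


def non_bayes_total(hits):
    """Sum the scores, leaving out BAYES_* hits."""
    return sum(score for name, score in hits if not name.startswith("BAYES_"))


# Three rule lines and one wrapped line copied from the report above.
sample = """\
*  1.0 MISSING_HEADERS Missing To: header
*  0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60%
*  [score: 0.5004]
*  3.0 MONEY_FRAUD_3 Lots of money and several fraud phrases
"""
hits = parse_report(sample)
```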

> Another possible factor, if you have "bayes_auto_learn_on_error" enabled, 
> then autolearn will be skipped if Bayes already agrees with the condition of 
> the message. IE: if the message is already classified as BAYES_99 then it 
> won't bother auto-learning it as yet another high-ranking spam.

I do not have that enabled.  Also, as you can see from above, this hit BAYES_50.

Does the above provide an indication as to why it didn't autolearn?

Thanks!

--- Amir




Re: Bayes not auto-learning?

2018-02-23 Thread Ian Zimmerman
On 2018-02-23 22:32, Amir Caspi wrote:

> So, I've been trying to tweak my setup and noticed that VERY few of my
> emails are being autolearned as spam, even when their spam score
> is far above the autolearn threshold.  The threshold is set to 12; I
> just saw a spam with score >25 not being autolearned.

Sigh.  This really is a FAQ, and I did ask it myself (maybe more than
once).

Read the fine documentation.  Shortened: the score that is compared to
the threshold for autolearning is _not_ the normal score that determines
spam/ham.

Despite the fact that it is documented, I find the algorithm too
opaque to feel in control.

-- 
Please don't Cc: me privately on mailing lists and Usenet,
if you also post the followup to the list or newsgroup.
To reply privately _only_ on Usenet and on broken lists
which rewrite From, fetch the TXT record for no-use.mooo.com.


Re: Bayes not auto-learning?

2018-02-23 Thread David B Funk

On Fri, 23 Feb 2018, Amir Caspi wrote:


> Hi all,
>
> So, I've been trying to tweak my setup and noticed that VERY few of my
> emails are being autolearned as spam, even when their spam score is far above
> the autolearn threshold.  The threshold is set to 12; I just saw a spam with
> score >25 not being autolearned.
>
> Are there rules that prevent autolearning?  If so, why?  If a spam
> scores really high because it hits (let's say) 10 or more rules, but just one
> of those rules is enough to prevent autolearning, that seems overly
> restrictive, no?
>
> For example, for one of my users, out of about 650 spams received in
> the last month, only 10 have been autolearned.  For another user, only 12 of
> nearly 1400.  That seems like a very low percentage, and clearly some
> high-scoring spams are not being auto-learned.
>
> Any explanation is appreciated!
>
> Thanks!
>
> --- Amir


If you read the SpamAssassin documentation about Bayes auto-learning you will 
see that there are several conditions that must be satisfied.


For example, there are some types of rules which aren't considered at all when 
computing the auto-learning threshold score (such as white/black list scores or 
rules tagged with the noautolearn tflag or the actual Bayes score itself).


Of the types of rules which are allowed, at least 3 of those points must come 
from header type rules and at least 3 of those points must come from body type 
rules.


So a spam can have 100 points from a blacklist and not auto-learn.

It could have 20 points from a whole bunch of body rules but if it only hit 2
points via header rules it still will not auto-learn.

Another possible factor, if you have "bayes_auto_learn_on_error" enabled, then 
autolearn will be skipped if Bayes already agrees with the condition of the 
message. IE: if the message is already classified as BAYES_99 then it won't 
bother auto-learning it as yet another high-ranking spam.
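
The conditions described so far can be sketched in Python as below. This is only an illustration of the description in this message, not SpamAssassin's actual code: the `RuleHit` type, the function name, and the `bayes_already_agrees` flag are hypothetical, the 3-point minimums are the ones stated above, and the default threshold of 12 is the value from the original question.

```python
from dataclasses import dataclass


@dataclass
class RuleHit:
    name: str
    score: float
    kind: str             # "header" or "body" (hypothetical labels)
    counted: bool = True  # False for white/black list rules, noautolearn rules, BAYES_*


def would_autolearn_spam(hits, threshold=12.0, min_header=3.0, min_body=3.0,
                         learn_on_error=False, bayes_already_agrees=False):
    """Apply the checks described in this message to a list of RuleHit."""
    if learn_on_error and bayes_already_agrees:
        return False  # Bayes already rates it as spam, so learning is skipped
    counted = [h for h in hits if h.counted]
    header = sum(h.score for h in counted if h.kind == "header")
    body = sum(h.score for h in counted if h.kind == "body")
    total = sum(h.score for h in counted)
    return total >= threshold and header >= min_header and body >= min_body


# The two cases from the text: 100 blacklist points, and 20 body + 2 header points.
blacklist_only = [RuleHit("SOME_BLACKLIST", 100.0, "header", counted=False)]
body_heavy = [RuleHit("BODY_RULES", 20.0, "body"),
              RuleHit("HEADER_RULES", 2.0, "header")]
```

Both `would_autolearn_spam(blacklist_only)` and `would_autolearn_spam(body_heavy)` come out false under these assumptions, which matches the two examples given above.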


What I usually see in auto-learned spam is that they hit a number of network RBL 
rules (Spamhaus, SORBS, etc.) and a number of body rules such as RAZOR, URIBLS, 
etc.



--
Dave Funk  University of Iowa
College of Engineering
319/335-5751   FAX: 319/384-0549   1256 Seamans Center
Sys_admin/Postmaster/cell_admin    Iowa City, IA 52242-1527
#include 
Better is not better, 'standard' is better. B{


Bayes not auto-learning?

2018-02-23 Thread Amir Caspi
Hi all,

So, I've been trying to tweak my setup and noticed that VERY few of my 
emails are being autolearned as spam, even when their spam score is far 
above the autolearn threshold.  The threshold is set to 12; I just saw a spam 
with score >25 not being autolearned.

Are there rules that prevent autolearning?  If so, why?  If a spam 
scores really high because it hits (let's say) 10 or more rules, but just one 
of those rules is enough to prevent autolearning, that seems overly 
restrictive, no?

For example, for one of my users, out of about 650 spams received in 
the last month, only 10 have been autolearned.  For another user, only 12 of 
nearly 1400.  That seems like a very low percentage, and clearly some 
high-scoring spams are not being auto-learned.

Any explanation is appreciated!

Thanks!

--- Amir