Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-15 Thread RW
On Thu, 15 Feb 2018 14:32:36 -0600 (CST)
sha...@shanew.net wrote:


> I haven't checked the math in the Bayes plugin, but it explicitly
> mentions using the "chi-square probability combiner" which is
> described at http://www.linuxjournal.com/print.php?sid=6467
> 
> Maybe I'm misunderstanding what that article describes, but I'm pretty
> sure what it boils down to is that when the occurrence of a token is
> too small (he uses the phrase "rare words") it can lead to
> probabilities at the extremes (like a token that occurs only once and
> is in spam, so its probability is 1).  The way to address these
> extremely low or extremely high probabilities is to use the Fisher
> calculation (which is described in the second page of the article).

Tokens with low counts are detuned a bit, but not as much as you might
think. In a database with a 1:1 ratio you get hapax token probabilities
of 0.016 and 0.987; IIRC Robinson anticipated something much closer to
neutral.
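
For concreteness, here is a small Python sketch of Robinson's f(w),
the formula that turns raw token counts into the probability the
combiner consumes. It is illustrative only (SA's implementation is
Perl); the constants s=0.030 and x=0.538 are the ones that reproduce
the hapax figures above and, as far as I can tell, match SA's defaults:

  def fw(spam_count, ham_count, nspam, nham, s=0.030, x=0.538):
      # p(w): naive probability from per-corpus frequencies
      sf = spam_count / nspam        # token frequency in spam
      hf = ham_count / nham          # token frequency in ham
      p = sf / (sf + hf)
      n = spam_count + ham_count     # times the token has been seen
      # f(w) pulls p toward the prior x; the pull fades as n grows
      return (s * x + n * p) / (s + n)

  # hapaxes in a 1:1 database (10,000 spam, 10,000 ham):
  print(fw(1, 0, 10000, 10000))      # ~0.987 (seen once, in spam)
  print(fw(0, 1, 10000, 10000))      # ~0.016 (seen once, in ham)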

This is similar to the defaults in spambayes and bogofilter, and I
think at least one of the three projects would have derived them from
optimization. My guess is that enough tokens with low counts are
very strong, but short-lived, indicators that it's worth putting up
with the noise. 



Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-15 Thread shanew

On Thu, 15 Feb 2018, RW wrote:


> On Thu, 15 Feb 2018 11:56:55 -0600 (CST)
> sha...@shanew.net wrote:
> 
> > So, the sample size doesn't matter when calculating the probability
> > of a message being spam based on individual tokens, but it can
> > matter when we bring them all together to make a final calculation.
> 
> It's not a matter of how they combine, smaller counts just lead to
> less accurate token probabilities.
> 
> I'm not saying that it doesn't matter how much you train, I'm saying
> that if you have enough spam and enough ham Bayes is insensitive to
> the ratio.


I agree that past a certain minimum threshold, the ratio doesn't
matter much.  But as I understand it, larger sample size makes a
difference.

I haven't checked the math in the Bayes plugin, but it explicitly
mentions using the "chi-square probability combiner" which is
described at http://www.linuxjournal.com/print.php?sid=6467

Maybe I'm misunderstanding what that article describes, but I'm pretty
sure what it boils down to is that when the occurrence of a token is
too small (he uses the phrase "rare words") it can lead to
probabilities at the extremes (like a token that occurs only once and
is in spam, so its probability is 1).  The way to address these
extremely low or extremely high probabilities is to use the Fisher
calculation (which is described in the second page of the article).
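
To make the combining step concrete, here is a rough Python sketch of
the chi-square (Fisher) combiner the article describes - the shape of
the calculation, not SA's actual Perl. H comes out small when the
token probabilities look hammy, S small when they look spammy, and the
final indicator sits at 0.5 when the evidence is balanced:

  import math

  def chi2q(x2, df):
      # survival function of chi-square for even df (df = 2 * n tokens)
      m = x2 / 2.0
      term = math.exp(-m)
      total = term
      for i in range(1, df // 2):
          term *= m / i
          total += term
      return min(total, 1.0)

  def combine(probs):
      # probs: per-token spam probabilities, bounded away from 0 and 1
      n = len(probs)
      H = chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
      S = chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
      return (1.0 + H - S) / 2.0   # ~0 ham, ~1 spam, 0.5 undecided

  print(combine([0.987, 0.987, 0.5]))  # two spammy hapaxes push it near 1
  print(combine([0.5] * 10))           # all-neutral evidence -> exactly 0.5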

Maybe this is where I'm making a logical leap that I shouldn't, but I
think that "non-rare words" increasingly outnumber "rare words" as the
sample size of messages (and thus tokens) increases.


--
Public key #7BBC68D9 at            | Shane Williams
http://pgp.mit.edu/                |  System Admin - UT CompSci
=----------------------------------+-------------------------------
All syllogisms contain three lines |  sha...@shanew.net
Therefore this is not a syllogism  | www.ischool.utexas.edu/~shanew


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-15 Thread RW
On Thu, 15 Feb 2018 20:16:24 +0100
Reindl Harald wrote:

> Am 15.02.2018 um 20:10 schrieb RW:

> > I'm not saying that it doesn't matter how much you train, I'm saying
> > that if you have enough spam and enough ham Bayes is insensitive to
> > the ratio  
> 
> but not when the ratio differs by orders of magnitude like the
> values from the OP - no more, and no less

Based on the mathematics of "I reckon", and your database going off the
rails after (by your own admission) you mistrained it.

Actually the ratio was only 4:1, which isn't all that big.


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-15 Thread RW
On Thu, 15 Feb 2018 11:56:55 -0600 (CST)
sha...@shanew.net wrote:

> On Thu, 15 Feb 2018, RW wrote:
> 

> > As I said, Bayes is based on frequencies.
> >
> > If a token occurs in 10% of ham and 0.5% of spam based on 10,000
> > hams and 10,000 spams, what do you think is likely to happen to
> > those percentages with 10,000 hams and 1,000,000 spams?  
> 
> ...
> So, the sample size doesn't matter when calculating the probability of
> a message being spam based on individual tokens, but it can matter
> when we bring them all together to make a final calculation.

It's not a matter of how they combine, smaller counts just lead to
less accurate token probabilities.

I'm not saying that it doesn't matter how much you train, I'm saying
that if you have enough spam and enough ham Bayes is insensitive to
the ratio.

 


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-15 Thread RW
On Thu, 15 Feb 2018 19:24:14 +0100
Reindl Harald wrote:

> Am 15.02.2018 um 19:20 schrieb RW:
> > On Thu, 15 Feb 2018 17:15:47 +0100

> > You are talking about ultra-rare tokens here, the chances of these
> > dominating a classification are negligible  
> it is not - in 2015 I had to purge a few days of training "in doubt"
> because an unreasonable amount of ham was classified as BAYES_50, or
> even tagged, instead of BAYES_00 - and we're talking about a Bayes DB
> with around 100,000 samples in total, which by your logic you would
> not expect to get biased within a few days - yes, that was certainly
> a training mistake - but when you can bias a Bayes DB built from a
> few years of corpus within a few days, your examples are wrong

I have no idea what you are talking about, how it's relevant, or what
you did wrong, but it doesn't trump mathematics.
 


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-15 Thread RW
On Thu, 15 Feb 2018 17:15:47 +0100
Reindl Harald wrote:

> Am 15.02.2018 um 17:01 schrieb RW:
> > On Thu, 15 Feb 2018 00:01:18 +0100
> > Reindl Harald wrote:
> >   
> >> Am 14.02.2018 um 23:07 schrieb RW:  
> > 
> >>> My point is that an imbalance doesn't create a bias  
> > 
> >> wrong - what you tried to say was "doesn't necessarily create a
> >> bias" - but in fact when the imbalance is too big *it does*
> >>
> >> simply thinking about how Bayes works makes that clear: each word
> >> is a token with a ham/spam counter - when you have 1 million of
> >> one type and 1 of the other type, guess how those counters start
> >> to get biased  
> > 
> > As I said, Bayes is based on frequencies.
> > 
> > If a token occurs in 10% of ham and 0.5% of spam based on 10,000
> > hams and 10,000 spams, what do you think is likely to happen to
> > those percentages with 10,000 hams and 1,000,000 spams?  
> 
> the 10% and 0.5% are just an unbacked assumption

It's not an assumption, it's an example.

> what if every word of the spam mail except a few relevant ones - and
> so every token - exists in a relevant percentage of your 1.4 million
> ham samples, and so 90% of the tokens have a high ham counter

You are talking about ultra-rare tokens here; the chances of these
dominating a classification are negligible. 


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-15 Thread shanew

On Thu, 15 Feb 2018, RW wrote:


> On Thu, 15 Feb 2018 00:01:18 +0100
> Reindl Harald wrote:
> 
> > Am 14.02.2018 um 23:07 schrieb RW:
> > 
> > > My point is that an imbalance doesn't create a bias
> > 
> > wrong - what you tried to say was "doesn't necessarily create a
> > bias" - but in fact when the imbalance is too big *it does*
> > 
> > simply thinking about how Bayes works makes that clear: each word
> > is a token with a ham/spam counter - when you have 1 million of
> > one type and 1 of the other type, guess how those counters start
> > to get biased
> 
> As I said, Bayes is based on frequencies.
> 
> If a token occurs in 10% of ham and 0.5% of spam based on 10,000 hams
> and 10,000 spams, what do you think is likely to happen to those
> percentages with 10,000 hams and 1,000,000 spams?


Perhaps it would help to state Bayes' formula explicitly.

The probability that a message is spam given a specific token is equal
to:

(the probability of the token occurring in spam) times (the
probability that a message is spam) divided by (the probability of
that token occurring in all messages)

The important feature of this formula is that every value being
operated on is a probability, so if a given token occurs in .5% of
10,000 spams, we would expect it to occur in .5% of 100,000 or
1,000,000.  If that assumption holds, and the .5% probability
doesn't change, the resulting calculated probability also doesn't
change.
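
A quick worked check of that, with made-up counts (and assuming the
equal priors P(spam) = P(ham) that the formula above reduces to):

  def p_spam_given_token(spam_with, nspam, ham_with, nham):
      p_t_spam = spam_with / nspam   # P(token|spam)
      p_t_ham = ham_with / nham      # P(token|ham)
      # Bayes' rule with equal priors P(spam) = P(ham) = 0.5
      return p_t_spam / (p_t_spam + p_t_ham)

  # a token in 0.5% of spam and 10% of ham, at two corpus scales:
  print(p_spam_given_token(50, 10000, 1000, 10000))      # 0.0476...
  print(p_spam_given_token(5000, 1000000, 1000, 10000))  # 0.0476...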

For actual spam detection, this is complicated by the fact that we end
up with a whole stack of calculated probabilities, one per token
(including the probabilities that a message is non-spam given specific
tokens), and we have to take all of them into account to calculate a
final probability.  In this process, it's not unusual that some
individual calculated probabilities "matter" more than others, and one
basis for how much weight a particular probability gets is how much we
can trust that probability.  Here's where the 10,000 vs. 1,000,000
comes into play, because we can rely on the .5% probability out of
1,000,000 samples more than we can the .5% probability out of 10,000
samples, and both of those are better than a .5% probability out of
100 samples (that said, the difference in trust increases more between
100 samples and 10,000 samples than from 10,000 samples to 1,000,000
samples due to diminishing returns).
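
The "trust" point can be put in numbers with a simple (assumed)
binomial model of token occurrence: the standard error of an observed
0.5% frequency shrinks only with the square root of the sample size,
which is exactly the diminishing return:

  import math

  def std_err(p, n):
      # standard error of a frequency estimated from n samples
      return math.sqrt(p * (1 - p) / n)

  for n in (100, 10000, 1000000):
      print(n, std_err(0.005, n))
  # 100      0.00705   (the error dwarfs the 0.005 estimate itself)
  # 10000    0.000705
  # 1000000  0.0000705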

So, the sample size doesn't matter when calculating the probability of
a message being spam based on individual tokens, but it can matter
when we bring them all together to make a final calculation.

--
Public key #7BBC68D9 at            | Shane Williams
http://pgp.mit.edu/                |  System Admin - UT CompSci
=----------------------------------+-------------------------------
All syllogisms contain three lines |  sha...@shanew.net
Therefore this is not a syllogism  | www.ischool.utexas.edu/~shanew


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-15 Thread RW
On Thu, 15 Feb 2018 00:01:18 +0100
Reindl Harald wrote:

> Am 14.02.2018 um 23:07 schrieb RW:
 
> > My point is that an imbalance doesn't create a bias 
 
> wrong - what you tried to say was "doesn't necessarily create a bias"
> - but in fact when the imbalance is too big *it does*
> 
> simply thinking about how Bayes works makes that clear: each word is
> a token with a ham/spam counter - when you have 1 million of one type
> and 1 of the other type, guess how those counters start to get biased

As I said, Bayes is based on frequencies.

If a token occurs in 10% of ham and 0.5% of spam based on 10,000 hams
and 10,000 spams, what do you think is likely to happen to those
percentages with 10,000 hams and 1,000,000 spams?


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-14 Thread RW
On Wed, 14 Feb 2018 16:20:30 +0100
Matus UHLAR - fantomas wrote:

> >On Tue, 13 Feb 2018 21:02:46 +
> >Horváth Szabolcs wrote:  
> >> One more question: is there a recommended ham to spam ratio? 1:1?  
> 
> On 14.02.18 15:09, RW wrote:
> >No, this is a myth.  Bayes computes token probabilities from a
> >token's frequencies in spam and ham, so it all scales through. If
> >you have 2000 ham and 200 spam the problem is too few spams, not a
> >bad ratio.  
> 
> my experience says you will need more ham than spam, because you want
> to get rid of false positives (ham marked as spam) much more than
> false negatives.


My point is that an imbalance doesn't create a bias.



Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-14 Thread David Jones

On 02/14/2018 09:20 AM, Matus UHLAR - fantomas wrote:

> >On Tue, 13 Feb 2018 21:02:46 +
> >Horváth Szabolcs wrote:
> >> One more question: is there a recommended ham to spam ratio? 1:1?
> 
> On 14.02.18 15:09, RW wrote:
> >No, this is a myth.  Bayes computes token probabilities from a
> >token's frequencies in spam and ham, so it all scales through. If
> >you have 2000 ham and 200 spam the problem is too few spams, not a
> >bad ratio.
> 
> my experience says you will need more ham than spam, because you want
> to get rid of false positives (ham marked as spam) much more than
> false negatives.

This is also my experience.

> what really matters is how many FPs/FNs you have; you can decrease
> the probability by training anything too far from BAYES_00 for ham
> and BAYES_99 for spam


Correct.  You want to get ham hitting BAYES_00 and spam hitting 
BAYES_80, BAYES_95, BAYES_99, or BAYES_999, which mine does very well.


A problem I have found is that you shouldn't blindly train all spam as 
spam.  I have some spam hitting BAYES_00 because, based on the body 
contents, it truly could be ham; it's only spam because it was 
unsolicited email from someone "cold" emailing for a meeting or 
something.


In this case, I block the sender and report it to SpamCop and other 
abuse contacts so the account can hopefully be blocked/locked/disabled.


If I had trained my Bayes with this email as spam, then legit email 
could hit BAYES_99.  That is why my nightly process to train my Bayes 
DB in redis learns ham first and spam second.  In my experience this 
seems to be the best order.
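
(For what it's worth, the ordering is just two sa-learn passes; a
hypothetical sketch with placeholder paths, not my actual script:

  sa-learn --ham  --mbox /path/to/tonight-ham.mbox
  sa-learn --spam --mbox /path/to/tonight-spam.mbox
)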


--
David Jones


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-14 Thread Matus UHLAR - fantomas

>On Tue, 13 Feb 2018 21:02:46 +
>Horváth Szabolcs wrote:
>> One more question: is there a recommended ham to spam ratio? 1:1?

On 14.02.18 15:09, RW wrote:
>No, this is a myth.  Bayes computes token probabilities from a token's
>frequencies in spam and ham, so it all scales through. If you have
>2000 ham and 200 spam the problem is too few spams, not a bad ratio.

my experience says you will need more ham than spam, because you want to
get rid of false positives (ham marked as spam) much more than false
negatives.

what really matters is how many FPs/FNs you have; you can decrease the
probability by training anything too far from BAYES_00 for ham and
BAYES_99 for spam
--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
LSD will make your ECS screen display 16.7 million colors


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-14 Thread RW
On Tue, 13 Feb 2018 21:02:46 +
Horváth Szabolcs wrote:

> One more question: is there a recommended ham to spam ratio? 1:1? 

No, this is a myth.  Bayes computes token probabilities from a token's 
frequencies in spam and ham, so it all scales through. If you have
2000 ham and 200 spam the problem is too few spams, not a bad ratio.


Theoretically there is a case for new training to match the ratio that's
already in the database because then a new token will get a token
probability that reflects its frequencies in recent mail. But I wouldn't
worry about that, it's hard to stick to, and probably minor. 


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-14 Thread Rupert Gallagher
They cannot (do not want to, do not have the know-how to) study the 
e-mails, and therefore they cannot build a reliable corpus. All they can 
do is trust the ability of their users to study their own e-mails well 
enough to do the job, hence the mess with ham/spam when feeding the 
Bayesian filter. They need to consult with a lawyer, fix their paperwork, 
hire people who can teach them everything they need to know, and invest 
at least two years full-time in the process. They cannot just install 
CentOS and SA and hope the Bayesian filter will do their job by magic. It 
just does not work that way.

Sent from ProtonMail Mobile

On Wed, Feb 14, 2018 at 05:48, Bill Cole 
 wrote:

> [Bill Cole's message quoted in full; see the next message in this
> thread for the original text.]

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Bill Cole

On 13 Feb 2018, at 9:33, Horváth Szabolcs wrote:

> This is a production mail gateway serving since 2015. I saw that a few
> messages (both hams and spams) were automatically learned by
> amavisd/spamassassin. Today's statistics:
> 
>    3616 autolearn=ham
>   10076 autolearn=no
>    2817 autolearn=spam
>     134 autolearn=unavailable


That's quite high for spam, ham, AND "unavailable" (which indicates 
something wrong with the Bayes subsystem, usually transient.) This seems 
like a recipe for a mis-learning disaster. For comparison, my 2018 
autolearn counts:

spam: 418
ham: 15018
unavailable: 166
no: 129555

I also manually train any spam that gets through to me (the biggest spam 
target), a small number of spams reported by others, and 'trap' hits. A 
wide variety of ham is harder to get for training, but I have found it 
useful to give users a well-documented and simple way to help. One way 
is to look at what happens to mail AFTER delivery, which can indicate 
that a message is ham without needing an admin to try to make a 
determination based on content. The simplest one is to learn anything 
users mark as $NotJunk as ham. Another is to create an "Archive" mailbox 
for every user and learn as ham anything that has been moved there, a 
day after it is moved.

The most important factor (especially in jurisdictions where human 
examination of email is a problem) is to tell users how to protect their 
email and then do what you tell them, robotically. In the US, Canada, 
and *SOME* of the EU, this is not risky. However, I have been told by 
people in *SOME* EU countries that they can't even robotically scan ANY 
mail content, so you shouldn't take my advice as authoritative: I'm not 
even a lawyer in the US, much less Hungary...



> I think I have no control over what is learnt automatically.

Yes, you do. Run "perldoc 
Mail::SpamAssassin::Plugin::AutoLearnThreshold" for details.


You can set the learning thresholds, which control what gets learned. 
The defaults (0.1 and 12) mis-learn far too much spam as ham and not 
enough spam. I use -0.2 and 6, which means I don't autolearn a lot but 
everything I autolearn as ham has at least one hit on a substantial 
"nice" rule or 2 hits on weak ones.


There's a lot of vehemence against autolearn expressed here but not a 
lot of evidence that it operates poorly when configured wisely. The 
defaults are NOT wise.



> Let's just assume for a moment that 1.4M ham-samples are valid.


Bad assumption. Your Bayes checks are uncertain about mail you've told 
SA is definitely spam. That's broken. It's a sort of breakage that 
cannot exist if you do not have a large quantity of spam that has been 
learned as ham.



> Is there a ham:spam ratio I should stick to?


No.

> I presume if we have a 1:1 ratio then future messages won't be
> considered as spam as well.


The ham:spam ratio in the Bayes DB or its autolearning is not a 
generally useful metric. 1:1 is not magically good and neither is any 
other ratio, even with reference to a single site's mailstream. A very 
large ratio *on either side* indicates a likely problem in what is being 
learned, but you can't correlate the ratio to any particularly wrong 
bias in Bayes scoring. It is an inherently chaotic relationship. Factors 
that actually matter are correctness of learning, sample quality, and 
currency. You can control how current your Bayes DB is (USE AUTO-EXPIRE) 
but the other two factors are never going to be perfect.


RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread John Hardin

On Tue, 13 Feb 2018, Horváth Szabolcs wrote:


> 3. populate the ham database
> 
> That's the tricky part. As I mentioned earlier, I don't really want
> end-users involved in this.

You might be able to find a few that are somewhat technically competent 
and don't mind their ham samples being manually reviewed.

> One more question: is there a recommended ham to spam ratio? 1:1?

I suggest "try to match your ham:spam ratio at your MTA before filtering", 
but others may have different advice. Generally: the more *reliable* data 
you can feed Bayes, the better it does.

> If you see my "populating the ham database automatically with the
> outgoing emails" idea as complete nonsense, then I could find sysadmin
> resources to collect 2000 legit emails and train those mails as ham,
> but I cannot allocate 2 work-hours/day for months. (Also I'm not sure
> if 2000 legit emails are enough for training.)

2000 is enough to start, but it would have to be ongoing as the nature of 
mail changes over time.

Generally, training on misclassifications is what you do after the initial 
training. So if a ham drops into a user's quarantine folder, you'd want to 
train that as ham.


--
 John Hardin KA7OHZ                      http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174      pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Windows Genuine Advantage (WGA) means that now you use your
  computer at the sufferance of Microsoft Corporation. They can
  kill it remotely without your consent at any time for any reason;
  it also shuts down in sympathy when the servers at Microsoft crash.
---
 9 days until George Washington's 286th Birthday

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Benny Pedersen

John Hardin skrev den 2018-02-14 02:28:

> Properly training your Bayes and increasing the score for BAYES_80,
> BAYES_95, and BAYES_99

> and BAYES_999

score BAYES_999 5000

/me hides, could not resist :=)


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread John Hardin

On Tue, 13 Feb 2018, David Jones wrote:

> Properly training your Bayes and increasing the score for BAYES_80,
> BAYES_95, and BAYES_99

and BAYES_999

> is the best bet on this one.



--
 John Hardin KA7OHZ                      http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174      pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Windows Genuine Advantage (WGA) means that now you use your
  computer at the sufferance of Microsoft Corporation. They can
  kill it remotely without your consent at any time for any reason;
  it also shuts down in sympathy when the servers at Microsoft crash.
---
 9 days until George Washington's 286th Birthday


RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Horváth Szabolcs
Hello,

David Jones [mailto:djo...@ena.com] wrote:
> With non-English email flow, it's more challenging.  If no RBLs hit, then you 
> really must train your Bayes properly, which requires some way to accurately 
> determine the ham and spam.  You must keep a copy of the ham and spam 
> corpora and be allowed to review suspicious email.

I really appreciate you taking the time to help with this. 

Yes, I can confirm that we usually have issues with Hungarian spams. English 
spams are often caught by the default rules.

As far as I understood today, I need to rebuild the Bayes database from 
scratch:

1. turn off autolearning

2. populate the spam database
The guys behind the http://artinvoice.hu/spams/ site are doing excellent 
work; they publish caught spams in mbox format.
I checked: many spam e-mails that were sent to us for investigation are in 
their mbox.

3. populate the ham database
That's the tricky part. As I mentioned earlier, I don't really want 
end-users involved in this. And I don't have the necessary resources to do 
that manually.
I assume I can hack something into the mailflow to copy all outgoing 
e-mails to a separate mailbox and - assuming every outgoing e-mail is ham - 
have these mails learnt, roughly as sketched below. That should do it?
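
Something like this, I presume (the paths are made up; step 1 is just
bayes_auto_learn 0 in local.cf):

  sa-learn --clear                                    # start with an empty DB
  sa-learn --spam --mbox /path/to/artinvoice.mbox     # step 2
  sa-learn --ham  --mbox /path/to/outgoing-copy.mbox  # step 3
  sa-learn --dump magic                               # sanity-check nspam/nham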

End-users are working in a heavily controlled environment (both technically 
and legally); in the last ten years we haven't seen spam sent from inside. 
That's why I would blindly trust outgoing emails as ham.

One more question: is there a recommended ham to spam ratio? 1:1? 

If you see my "populating the ham database automatically with the outgoing 
emails" idea as complete nonsense, then I could find sysadmin resources to 
collect 2000 legit emails and train those mails as ham, but I cannot 
allocate 2 work-hours/day for months. (Also I'm not sure if 2000 legit 
emails are enough for training.)

Best regards,
  Szabolcs Horvath


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread David Jones

On 02/13/2018 11:45 AM, Horváth Szabolcs wrote:

> Reindl Harald [mailto:h.rei...@thelounge.net] wrote:
> >> I think I have no control over what is learnt automatically.
> > surely, don't do autolearning at all
> 
> This is a mail gateway for multiple companies. I'm not supposed to
> read e-mails on that, or to pick mails that can be used for learning
> ham.
> And I can't ask users to use a "ham" mailbox, because they are not IT
> experts; sometimes they have problems with a simple mail forwarding.

If you aren't allowed to check specific emails with a suspicious subject 
or that are reported as spam by your users, there's no way you can do 
your job of accurately filtering email.

> Without autolearning and without the help of the end-users, I can't
> build a proper ham Bayes database, can I?

SA's autolearning doesn't use the results from BAYES_* rules, since that 
could make incorrect training even worse, so you are going to have to 
build local rules or get help from RBLs and other SA plugins to get to 
the autolearning thresholds.

With non-English email flow, it's more challenging.  If no RBLs hit, 
then you really must train your Bayes properly, which requires some way 
to accurately determine the ham and spam.  You must keep a copy of the 
ham and spam corpora and be allowed to review suspicious email.

Can you set up a split copy of the email that can redact the recipient 
or anonymize it enough to allow for review?  If not, your filtering is 
not going to be accurate.

> Best regards
>    Szabolcs



--
David Jones


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread David Jones

On 02/13/2018 11:24 AM, Horváth Szabolcs wrote:

> Hello,
> 
> David Jones [mailto:djo...@ena.com] wrote:
> > There should be many more rule hits than just these 3.  It looks like
> > network tests aren't happening.
> > Can you post the original email to pastebin.com with minimal redacting
> > so the rest of us can run it through our SA to see how it scores to
> > help with suggestions?
> 
> Thanks for taking time to answer. Here it is: https://pastebin.com/5XZ5kbus



My SA instance would have blocked it but the 2 rules that did it won't 
apply to your mail flow based on language and non-US relays.


Properly training your Bayes and increasing the score for BAYES_80, 
BAYES_95, and BAYES_99 is the best bet on this one.  It might take some 
local content rules but I can't read the subject or body.  :)



Content analysis details:   (10.2 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 5.2 BAYES_99               BODY: Bayes spam probability is 99 to 100%
                            [score: 0.9926]
 0.0 HTML_IMAGE_RATIO_08    BODY: HTML has a low ratio of text to image area
 2.8 UNWANTED_LANGUAGE_BODY BODY: Message written in an undesired language
 0.0 HTML_MESSAGE           BODY: HTML included in message
 2.2 ENA_RELAY_NOT_US       Relayed from outside the US and not on whitelists
 0.0 ENA_BAD_SPAM           Spam hitting really bad rules.


This brings up a good point that we need help with non-English 
masscheckers and SA rules.


The sending mail server 79.96.0.147 is not listed on any major RBLs and 
it has proper FCrDNS.  I can't tell the envelope-from domain but it must 
not have an SPF record.  Definitely no DMARC record for fiok.com.


The "IdeaSmtpServer" might be something to investigate it's relationship 
to spam to see if it's an indicator worthy of a local rule.


The domain in the Message-ID might be worth checking with other spam to 
see if that is a pattern worth a local rule.


If there are unique body phrases or misspellings, then that is 
definitely something to put into a local rule to add a point or two in 
the future.



> > I suspect there needs to be some MTA tuning in front of SA along
> > with some SA tuning that is mentioned on this list every couple of
> > months -- add extra RBLs, add KAM.cf, enable some SA plugins, etc.
> 
> Oops. I'm a new member on this list. Could you please tell us which
> customizations you mean?
> I already looked at KAM.cf; it doesn't really help in this situation.
> We're using a lot of RBLs.




--
David Jones


RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Horváth Szabolcs
Reindl Harald [mailto:h.rei...@thelounge.net] wrote:
>> This is a mail gateway for multiple companies. I'm not supposed to read 
>> e-mails on that, or picking mails that can be used for learning ham
> 
> how did you then end up with 1.4 million ham samples in your biased corpus

Looks like in this amavisd-spamassassin combo, it automatically learnt a 
lot of ham (which wasn't actually ham).

Feb 11 03:37:31 amavis[20024]: (20024-06) spam-tag,  -> 
, No, score=-0.099 tagged_above=- required=4 
tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, 
HTML_MESSAGE=0.001] autolearn=ham

I never configured autolearning; I assume it came with this CentOS setup. 
The spamassassin man page says bayes_auto_learn has a default value of 1.

>> Without autolearning and without the help of the end-users, I can't build a 
>> proper ham Bayes database, can I?
> surely - or don't you, and the people around you who could help, send 
> and receive mails?

I don't want to get into this "fight", but end-users have limited IT 
knowledge. They are 100% Outlook users (forwarding inline vs. attached 
always confuses them).
If I really want this, I need a user-proof, one-click solution like 
Gmail's "spam" and "not spam" buttons, which magically saves e-mails to 
the proper technical mailbox (which is then reviewed by the admins, who 
train SA).
With Outlook users and Exchange internal MTAs, my options are limited. 

So, if I understood correctly, you all agree that the Bayesian database is 
f* up; let's start with a new one, autolearning turned off, and train 
SA from scratch with both ham and spam mails.

Best regards
  Szabolcs


RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Horváth Szabolcs
Reindl Harald [mailto:h.rei...@thelounge.net] wrote:
>> I think I have no control over what is learnt automatically.
> surely, don't do autolearning at all

This is a mail gateway for multiple companies. I'm not supposed to read 
e-mails on that, or to pick mails that can be used for learning ham.
And I can't ask users to use a "ham" mailbox, because they are not IT 
experts; sometimes they have problems with a simple mail forwarding.

Without autolearning and without the help of the end-users, I can't build a 
proper ham Bayes database, can I?

Best regards
  Szabolcs


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread John Hardin

On Tue, 13 Feb 2018, Horváth Szabolcs wrote:

> After:
> 
>  pts rule name           description
> ---- ------------------- --------------------------------------------
>  0.0 HTML_IMAGE_RATIO_08 BODY: HTML has a low ratio of text to image area
>  0.0 HTML_MESSAGE        BODY: HTML included in message
>  0.8 BAYES_50            BODY: Bayes spam probability is 40 to 60%
>                          [score: 0.5000]

BAYES_50 is "can't decide".

> Version: spamassassin-3.3.2-4.el6.rfx.x86_64
> 
> $ sa-learn --dump magic --dbpath /var/spool/amavisd/.spamassassin/
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0     338770          0  non-token data: nspam
> 0.000          0    1460807          0  non-token data: nham

That ratio is really suspicious. I'd expect something closer to 1:1 or 
even a bit heavier on spam.

It *seems* that you have spam trained as ham; that would explain BAYES_50 
with that much in the BAYES database.



> My questions are:
> 1) is there any chance to change spamassassin settings to mark similar
> messages as SPAM in the future?
> BAYES_50 with 0.8 points is really, really low.

No, it's not. "BAYES_50" is "I can't decide" and increasing the score for 
that implies "I can't decide" means "spam". That's not justified.

Don't adjust the score of BAYES_50.

It's recommended (if possible) to retain the training corpora so that they 
can be reviewed and retrained from scratch if necessary.


Your admin is manually vetting user-submitted training messages. Are they 
retained after being trained?


You might consider reviewing the training corpus and retraining Bayes from 
scratch.



Another note: the "before" result:

> Before: spamassassin -D -t 

...with *no* BAYES hits at all (not even BAYES_50) suggests your SA is 
*not* using the database whose statistics you reported above.


First: verify which Bayes database your SA install is using, and that it 
is the one you're training into and getting those stats from.
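
A quick way to cross-check (note that spamassassin run by hand reads
the *invoking* user's configuration and Bayes DB, while amavisd uses
its own):

  spamassassin -D bayes -t < sample.eml 2>&1 | grep 'dbg: bayes'
  sa-learn --dump magic --dbpath /var/spool/amavisd/.spamassassin/

The debug output names the database actually consulted at scan time; 
compare that against the dbpath you dumped the statistics from.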



--
 John Hardin KA7OHZ                      http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174      pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Maxim IX: Never turn your back on an enemy.
---
 9 days until George Washington's 286th Birthday

RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Horváth Szabolcs
Hello,

David Jones [mailto:djo...@ena.com]  wrote:
> There should be many more rule hits than just these 3.  It looks like 
> network tests aren't happening.
> Can you post the original email to pastebin.com with minimal redacting 
> so the rest of us can run it through our SA to see how it scores to help 
> with suggestions?

Thanks for taking time to answer. Here it is: https://pastebin.com/5XZ5kbus

> I suspect there needs to be some MTA tuning in front of SA along with 
> some SA tuning that is mentioned on this list every couple of months -- 
> add extra RBLs, add KAM.cf, enable some SA plugins, etc.

Oops. I'm a new member on this list. Could you please tell us which 
customizations you mean?
I already looked at KAM.cf; it doesn't really help in this situation. 
We're using a lot of RBLs.


> > It only assigns 0.8. (required_hits around 4.0)
> You are certainly free to set a local score higher if you want but that 
> is probably not the main resolution to this issue.

I agree.

> > Version: spamassassin-3.3.2-4.el6.rfx.x86_64
> This is very old and no longer supported.  Why not upgrade to 3.4.x?

Because CentOS 6 ships with this version. When the infrastructure was 
built, there was no CentOS 7 around. Migration between major versions is 
still not an easy thing to do.

> > My questions are:
> > 1) is there any chance to change spamassassin settings to mark similar 
> > messages as SPAM in the future?
> > BAYES_50 with 0.8 points is really, really low.
> > 
>
> You should be hitting BAYES_95, BAYES_99, and BAYES_999 on these really 
> bad emails with proper training which would give it a higher probability 
> and thus a higher score.

I agree. Can't wait to see what your results are on this e-mail.

Best regards
  Szabolcs Horvath


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread David Jones

On 02/13/2018 07:55 AM, Horváth Szabolcs wrote:

> Dear members,
> 
> A user repeatedly sends us spam messages to train SA.
> Training - at the moment - requires manual intervention: an
> administrator verifies that it's really spam, then issues sa-learn.
> 
> Then the user thinks the process is done, and that the next time the
> same email arrives it will automatically be marked as spam.
> 
> However, that doesn't happen.
> 
> Before: spamassassin -D -t 

There should be many more rule hits than just these 3.  It looks like 
network tests aren't happening.


Can you post the original email to pastebin.com with minimal redacting 
so the rest of us can run it through our SA to see how it scores to help 
with suggestions?


I suspect there needs to be some MTA tuning in front of SA along with 
some SA tuning that is mentioned on this list every couple of months -- 
add extra RBLs, add KAM.cf, enable some SA plugins, etc.




> It only assigns 0.8. (required_hits around 4.0)



You are certainly free to set a local score higher if you want but that 
is probably not the main resolution to this issue.




> Version: spamassassin-3.3.2-4.el6.rfx.x86_64



This is very old and no longer supported.  Why not upgrade to 3.4.x?



> $ sa-learn --dump magic --dbpath /var/spool/amavisd/.spamassassin/
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0     338770          0  non-token data: nspam
> 0.000          0    1460807          0  non-token data: nham
> 0.000          0     187804          0  non-token data: ntokens
> 0.000          0 1512318030          0  non-token data: oldest atime
> 0.000          0 1518524875          0  non-token data: newest atime
> 0.000          0 1518524876          0  non-token data: last journal sync atime
> 0.000          0 1518508126          0  non-token data: last expiry atime
> 0.000          0      43238          0  non-token data: last expire atime delta
> 0.000          0     136970          0  non-token data: last expire reduction count
> 
> I obviously see that nspam is increased after the sa-learn.
> 
> When I tried to understand what was happening, I found the following:
> # https://wiki.apache.org/spamassassin/BayesInSpamAssassin
> "The Bayesian classifier in Spamassassin tries to identify spam by
> looking at what are called tokens; words or short character sequences
> that are commonly found in spam or ham. If I've handed 100 messages
> to sa-learn that have the phrase penis enlargement and told it that
> those are all spam, when the 101st message comes in with the words
> penis and enlargement, the Bayesian classifier will be pretty sure
> that the new message is spam and will increase the spam score of that
> message."
> 
> My questions are:
> 1) is there any chance to change spamassassin settings to mark
> similar messages as SPAM in the future?
> BAYES_50 with 0.8 points is really, really low.



You should be hitting BAYES_95, BAYES_99, and BAYES_999 on these really 
bad emails with proper training which would give it a higher probability 
and thus a higher score.



> I know that I'm able to write custom rules based on e-mail body
> content but I flattered myself that sa-learn would do that by
> manipulating the Bayes database.



I suspect that after the MTA and SA are tuned, this would be blocked 
without requiring a local custom rule, but I would need to see the rule 
hits on my SA platform before I could say for sure.  Sometimes it does 
require a header or body rule combined with other hits in a local custom 
meta rule to block them.



> 2) or should I tell users that the learning process doesn't
> necessarily mean that future messages will be flagged as SPAM?
> Rather, it should be considered a "warning sign".
> 
> I appreciate any feedback on this.
> 
> I've already tried to find docs that answer those questions, but no
> luck so far.
> If you have good documentation, just send it to me. I love reading
> manuals.
> 
> Best regards,
>    Szabolcs Horvath



--
David Jones


RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Horváth Szabolcs
Reindl Harald [mailto:h.rei...@thelounge.net] wrote:

> > However, that doesn't happen.
> > 0.000          0     338770          0  non-token data: nspam
> > 0.000          0    1460807          0  non-token data: nham

> what do you expect when you train 4 times more ham than spam?
> frankly, you "flooded" your Bayes with 1.4 million ham samples, and I
> thought our 140k total corpus was large - don't forget that ham
> messages are typically larger than junk that is trying to point you
> to a URL with a few words
> 
> 108897   SPAM
>  31492   HAM

This is a production mail gateway serving since 2015. I saw that a few 
messages (both hams and spams) were automatically learned by 
amavisd/spamassassin. Today's statistics:

   3616 autolearn=ham
  10076 autolearn=no
   2817 autolearn=spam
    134 autolearn=unavailable

I think I have no control over what is learnt automatically.

Let's just assume for a moment that the 1.4M ham samples are valid.
Is there a ham:spam ratio I should stick to? I presume if we have a 1:1 
ratio then future messages won't be considered as spam as well.

Regards
  Szabolcs