Re: my spamassassin has serious config problems

2019-05-28 Thread Matus UHLAR - fantomas

On 28.05.19 15:34, hg user wrote:

> I did some more research and I think I have to report the new discovery so
> that the thread can be useful to other readers.
>
> First:
> 0.000  0   5232  0  non-token data: nspam
> 0.000  0  70408  0  non-token data: nham
> 0.000  0 388070  0  non-token data: ntokens
> nspam and nham values are definitely the number of messages learnt.
>
> Second:
> I saw that nham increased every few seconds. I discovered that
> bayes_auto_learn was enabled!
> My situation yesterday:
> 0.000  0 1042011  0  non-token data: nspam
> 0.000  0   66472  0  non-token data: nham
> 0.000  0  663479  0  non-token data: ntokens
> My situation now:
> 0.000  0 1042049  0  non-token data: nspam
> 0.000  0   71228  0  non-token data: nham
> 0.000  0 1040661  0  non-token data: ntokens
>
> So, at least, I now know that the system is feeding the bayes engine with
> some new data and that in this way the results can change.
>
> Third:
> in 72_active.cf there are a lot of bayes_ignore_header directives, but they
> don't include the ones added by my commercial antivirus. Should I create a
> patch?
>
> Fourth:
> I added a dbg statement to bayes.pm, sub tokenize, to print the tokens it
> extracts from the message.
> I agree with some, I don't with others. I'd like to know if there is some
> doc that lists why tokens are extracted this way (some notes are in the
> source code)
> I discovered that probably some words should be added to the stopwords list
> but there is no way to do it in a configuration file, I should modify
> spamassassin code directly...
>
> To end:
> I think that the only way to proceed now is to nuke the bayes db and start
> from scratch:
> - setup bayes configuration correctly
> - double check the corpus to be correctly classified
> - run sa-learn

Do you still use Zimbra? If so, have you configured Zimbra?
Did you consult your Zimbra man?



> For the "setup bayes configuration correctly" step I accept your
> contributions :-) I excluded all the headers of my antivirus and
> internal/external/trusted.


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Spam is for losers who can't get business any other way.


Re: my spamassassin has serious config problems

2019-05-28 Thread RW
On Tue, 28 May 2019 15:42:05 +
David Jones wrote:

> On 5/27/19 5:13 PM, hg user wrote:
> > The server was installed and configured by a "zimbra man", a person
> > I fully trust. Since I manage a commercial antivirus/antispam
> > solution that is not properly working for the italian language, I
> > was tasked to join the project in order to understand if we could
> > switch from the proprietary solution to spamassassin.
> > 
> > I'm now in the process of double-checking the configuration of 
> > spamassassin and feeding the bayes engine...
> > 
> > Testing the system I noticed that spamassassin logged the internal
> > MTAs (including the antivirus server) as external and I asked *the
> > zimbra man* to correct the configuration. He replied it was not
> > necessary. Sorry I didn't specify I asked the person in charge of
> > the system.
> > 
> > In the end, I need to think about the answer of RW: spamassassin is
> > run by amavis but with no internal servers defined, it uses my
> > internal one as the external. Received header needs some more care,
> > and probably also the list of stop words should be expanded.
> > Probably there is a rationale behind some decisions taken by the
> > developers, but I can't fully understand how the destination
> > address can help on whether a message is spam or not, at least
> > not 6 times.  
> 
> The internal_networks and trusted_networks are _very important_ to be 
> set correctly for a number of reasons, not just Bayes.  This gives SA 
> the proper "view" to the outside/Internet.  Keep in mind 
> internal_networks is not literally your RFC 1918 internal_networks
> and the trusted_networks are not only ones that you managed/control.
> 
> Internal_networks is any public or private IP space that you trust to 
> not forge the Received or synthetic received headers like
> X-Originating-IP.

Not really, internal_networks is there to establish which relay is the
last-external.

> 
> Trusted_networks can be external/public networks that you know won't 
> change or forge the Received or synthetic received headers.
> 
> I have recently added all Google and Office 365 IP blocks to my 
> trusted_networks to better detect last-external client IPs.

You need to add them to internal_networks to affect the last-external.

>  This
> allows for deep Received header inspection since I know that Google
> and Microsoft aren't going to forge those headers.  Very interesting 
> information comes out into the open as a result of this.

You always get deep checks because allowing spammers to forge their
way into blocklists and other spam tests is harmless. Adding external
addresses to trusted_networks prevents unnecessary blocklist look-ups
and allows whitelist tests to run on the first-trusted relay. 
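
To make the last-external idea concrete, here is a minimal Python sketch
(it is not SpamAssassin code). It assumes a much simplified model: relays
are listed newest first, and the last-external relay is the first one whose
IP is not covered by internal_networks. The function name last_external and
the sample addresses are hypothetical, and the real Received-header handling
in SA is more involved than this.

```python
import ipaddress


def last_external(relay_ips, internal_networks):
    """Return the first relay (newest first) that is outside internal_networks.

    relay_ips: relay IP strings ordered from the newest Received header
    to the oldest. Returns None when every relay is internal.
    """
    nets = [ipaddress.ip_network(n) for n in internal_networks]
    for ip in relay_ips:
        addr = ipaddress.ip_address(ip)
        if not any(addr in net for net in nets):
            return ip
    return None


# Hypothetical chain: antivirus host, internal MTA, then the outside sender.
relays = ["192.168.1.10", "192.168.1.20", "203.0.113.7"]

# Nothing declared internal: the antivirus host is taken as last-external.
print(last_external(relays, []))
# Internal range declared: the outside sender becomes last-external.
print(last_external(relays, ["192.168.1.0/24"]))
```

With nothing declared internal, the sketch picks the internal host, which is
the symptom described at the start of this thread.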


 



Re: my spamassassin has serious config problems

2019-05-28 Thread David Jones
On 5/27/19 5:13 PM, hg user wrote:
> The server was installed and configured by a "zimbra man", a person I 
> fully trust. Since I manage a commercial antivirus/antispam solution 
> that is not properly working for the italian language, I was tasked to 
> join the project in order to understand if we could switch from the 
> proprietary solution to spamassassin.
> 
> I'm now in the process of double-checking the configuration of 
> spamassassin and feeding the bayes engine...
> 
> Testing the system I noticed that spamassassin logged the internal MTAs 
> (including the antivirus server) as external and I asked *the zimbra 
> man* to correct the configuration. He replied it was not necessary. 
> Sorry I didn't specify I asked the person in charge of the system.
> 
> In the end, I need to think about the answer of RW: spamassassin is run 
> by amavis but with no internal servers defined, it uses my internal one 
> as the external. Received header needs some more care, and probably also 
> the list of stop words should be expanded. Probably there is a rationale 
> behind some decisions taken by the developers, but I can't fully 
> understand how the destination address can help on whether a message is 
> spam or not, at least not 6 times.

It is _very important_ that internal_networks and trusted_networks are 
set correctly, for a number of reasons and not just for Bayes.  This gives 
SA the proper "view" to the outside/Internet.  Keep in mind that 
internal_networks is not literally your RFC 1918 address space, and 
trusted_networks are not only the networks that you manage/control.

Internal_networks is any public or private IP space that you trust to 
not forge the Received or synthetic received headers like X-Originating-IP.

Trusted_networks can be external/public networks that you know won't 
change or forge the Received or synthetic received headers.

I have recently added all Google and Office 365 IP blocks to my 
trusted_networks to better detect last-external client IPs.  This allows 
for deep Received header inspection since I know that Google and 
Microsoft aren't going to forge those headers.  Very interesting 
information comes out into the open as a result of this.

P.S. To implement/try this extended trusted_networks, set the score for 
ALL_TRUSTED to -0.001 and stop it from short-circuiting.

score   ALL_TRUSTED -0.001
shortcircuit ALL_TRUSTED off

-- 
David Jones


Re: my spamassassin has serious config problems

2019-05-28 Thread RW
On Tue, 28 May 2019 15:34:06 +0200
hg user wrote:

> Fourth:
> I added a dbg statement to bayes.pm, sub tokenize, to print the
> tokens it extracts from the message.
> I agree with some, I don't with others. I'd like to know if there is
> some doc that lists why tokens are extracted this way (some notes are
> in the source code)
> I discovered that probably some words should be added to the
> stopwords list but there is no way to do it in a configuration file,
> I should modify spamassassin code directly...


The stoplist is just there to drop tokens that are deemed to be not
worth using because they are likely to be neutral. Neutral tokens
don't affect the result. 
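
A small Python sketch of why a neutral token changes nothing. Note the
assumption: this uses a plain naive Bayes combination, which is simpler than
the combiner SpamAssassin actually uses, so it only illustrates the
principle. The function name combine and the sample probabilities are
hypothetical.

```python
import math


def combine(probs):
    """Naive Bayes combination of per-token spam probabilities."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


informative = [0.9, 0.2]
with_neutral = informative + [0.5, 0.5]  # two extra neutral tokens

# A token at 0.5 scales the spam and ham products by the same factor,
# so the ratio, and therefore the combined result, stays the same.
print(combine(informative))
print(combine(with_neutral))
```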


For testing purposes I'd suggest stripping any purely internal headers,
except headers that contain envelope information as Zimbra may be
supplying this by other means.

If you can turn off auto-training and clear the database, I suggest
you do that. 


Re: my spamassassin has serious config problems

2019-05-28 Thread hg user
I did some more research and I think I have to report the new discovery so
that the thread can be useful to other Readers.

First:
0.000  0   5232  0  non-token data: nspam
0.000  0  70408  0  non-token data: nham
0.000  0 388070  0  non-token data: ntokens
nspam and nham values are definitely the number of messages learnt.

Second:
I saw that nham increased every few seconds. I discovered that
bayes_auto_learn was enabled!
My situation yesterday:
0.000  0 1042011  0  non-token data: nspam
0.000  0   66472  0  non-token data: nham
0.000  0  663479  0  non-token data: ntokens
My situation now:
0.000  0 1042049  0  non-token data: nspam
0.000  0   71228  0  non-token data: nham
0.000  0 1040661  0  non-token data: ntokens

So, at least, I now know that the system is feeding the bayes engine with
some new data and that in this way the results can change.
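
For anyone who wants to track this over time, here is a minimal Python
sketch that compares two "sa-learn --dump magic" outputs. It assumes the
counter is the third whitespace-separated column, as in the lines above; the
function names parse_magic and growth are hypothetical, and the sample text
is copied from the two dumps with the columns separated by spaces.

```python
def parse_magic(dump_text):
    """Map each 'non-token data' name to its counter value."""
    counters = {}
    for line in dump_text.splitlines():
        fields, marker, name = line.partition("non-token data:")
        columns = fields.split()
        # The counter sits in the third whitespace-separated column.
        if marker and len(columns) >= 3:
            counters[name.strip()] = int(columns[2])
    return counters


def growth(before, after):
    """Return after - before for every counter present in both dumps."""
    return {k: after[k] - before[k] for k in before if k in after}


YESTERDAY = """\
0.000  0 1042011  0  non-token data: nspam
0.000  0   66472  0  non-token data: nham
0.000  0  663479  0  non-token data: ntokens
"""

NOW = """\
0.000  0 1042049  0  non-token data: nspam
0.000  0   71228  0  non-token data: nham
0.000  0 1040661  0  non-token data: ntokens
"""

print(growth(parse_magic(YESTERDAY), parse_magic(NOW)))
```

A steadily growing nham with auto-learning enabled is what one would expect
to see from a run like this.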

Third:
in 72_active.cf there are a lot of bayes_ignore_header directives, but they
don't include the ones added by my commercial antivirus. Should I create a
patch?

Fourth:
I added a dbg statement to bayes.pm, sub tokenize, to print the tokens it
extracts from the message.
I agree with some of them but not with others. I'd like to know if there is
some doc that explains why tokens are extracted this way (some notes are in
the source code).
I discovered that some words should probably be added to the stopwords list,
but there is no way to do that in a configuration file; I would have to
modify the spamassassin code directly...



To end:
I think that the only way to proceed now is to nuke the bayes db and start
from scratch:
- set up the bayes configuration correctly
- double-check that the corpus is correctly classified
- run sa-learn
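
The steps above could be sketched as the following command sequence. This is
only a sketch in Python that builds the sa-learn command lines without
running them; the function name retrain_commands and the corpus directories
are hypothetical. On a Zimbra box the commands would probably have to run as
Zimbra's own user and against Zimbra's own database, which this sketch does
not cover.

```python
import shlex


def retrain_commands(ham_dir, spam_dir):
    """Build (but do not run) sa-learn commands for a from-scratch retrain."""
    return [
        ["sa-learn", "--clear"],                        # wipe the bayes db
        ["sa-learn", "--no-sync", "--ham", ham_dir],    # learn the ham corpus
        ["sa-learn", "--no-sync", "--spam", spam_dir],  # learn the spam corpus
        ["sa-learn", "--sync"],                         # flush the journal
        ["sa-learn", "--dump", "magic"],                # check nspam / nham
    ]


# Hypothetical corpus locations, for illustration only.
for cmd in retrain_commands("/path/to/corpus/ham", "/path/to/corpus/spam"):
    print(shlex.join(cmd))
```

Running the explicit sync before the final dump should avoid counters that
keep moving after sa-learn has returned.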

For the "set up the bayes configuration correctly" step I welcome your
suggestions :-) I have already excluded all the headers added by my antivirus
and the internal/external/trusted relay headers.

Thanks
Francesco


Re: my spamassassin has serious config problems

2019-05-28 Thread Matus UHLAR - fantomas

On 28.05.19 00:13, hg user wrote:

> The server was installed and configured by a "zimbra man", a person I fully
> trust. Since I manage a commercial antivirus/antispam solution that is not
> properly working for the italian language, I was tasked to join the project
> in order to understand if we could switch from the proprietary solution to
> spamassassin.
>
> I'm now in the process of double-checking the configuration of spamassassin
> and feeding the bayes engine...
>
> Testing the system I noticed that spamassassin logged the internal MTAs
> (including the antivirus server) as external and I asked *the zimbra man*
> to correct the configuration. He replied it was not necessary. Sorry I
> didn't specify I asked the person in charge of the system.


I believe that is not necessary, because Zimbra takes control itself and
uses a modified SA source.

If your "spamassassin" binary is not the one from Zimbra, that is apparently
the reason why you have problems with the trust path configuration and also
with the bayes database.

I don't recommend mixing Zimbra's internal SA with an SA installed from
elsewhere.


> Unfortunately, spamassassin documentation is not really clear and asking
> google can be even more confusing... I found posts stating that nham/nspam
> reported by --dump magic are either tokens or messages... according to a
> test I did this afternoon, feeding 2 messages to sa-learn ham, those
> numbers are tokens.
>
> 0.000  0   5232  0  non-token data: nspam
> 0.000  0  70408  0  non-token data: nham
> 0.000  0 388070  0  non-token data: ntokens

I believe the first two are message counts and the last one is the token
count, and also that this is self-explanatory.


> I noticed that the nham counter kept increasing for several minutes after
> sa-learn ended, probably due to the --no-sync parameter... this could also
> explain why immediately after the sa-learn of the spam message bayes
> reported BAYES_50 and a few minutes later BAYES_00: the engine was still
> learning and as new tokens were recorded they changed the result.
>
> In the end, I need to think about the answer of RW: spamassassin is run by
> amavis but with no internal servers defined, it uses my internal one as the
> external. Received header needs some more care, and probably also the list
> of stop words should be expanded. Probably there is a rationale behind some
> decisions taken by the developers, but I can't fully understand how the
> destination address can help on whether a message is spam or not, at least
> not 6 times.
>
> Tomorrow I will try some -D bayes on different messages to try to understand
> better what the plugin is doing, and I will try to read all the source
> code. Unfortunately I don't know perl...
>
> Probably the best solution is to change the configuration, zap the bayes db
> and sa-learn all the corpus I set aside


I recommend asking on the Zimbra forums when you are messing with Zimbra's
bayes database and Zimbra's SA settings.


> On Mon, May 27, 2019 at 8:06 PM Matus UHLAR - fantomas wrote:
>
> > On 27.05.19 18:04, hg user wrote:
> > >I was writing a message requesting advice on bayes_ignore_header since I
> > >was sure something was wrong when I decided to have a look at spamassassin
> > >-D bayes output... and I was shocked by what I saw !
> > >
> > >x-spam-relays-external lists all the hops of the message *including* internal
> > >servers and so x-spam-relays-internal is empty...  I specifically asked to
> > >add the antivirus and other internal MTAs to the internal list...
> >
> > how?


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
The only substitute for good manners is fast reflexes. 


Re: my spamassassin has serious config problems

2019-05-28 Thread Matus UHLAR - fantomas

> On Mon, 27 May 2019 18:04:35 +0200 hg user wrote:
> > I was writing a message requesting advice on bayes_ignore_header
> > since I was sure something was wrong when I decided to have a look at
> > spamassassin -D bayes output... and I was shocked by what I saw !
> >
> > x-spam-relays-external lists all the hops of the message *including*
> > internal servers and so x-spam-relays-internal is empty

On 27.05.19 20:18, RW wrote:
> You can fix this by setting internal_networks and trusted_networks.
> However if SA is running from amavis it probably doesn't matter,


I believe the Zimbra installation does have these settings configured
properly; he just has to use Zimbra's settings, which I don't know how to
manage.

The "spamassassin" binary uses $HOME/ while the Zimbra installation stores
its settings in a directory that is not in the $HOME of any user.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Quantum mechanics: The dreams stuff is made of. 


Re: my spamassassin has serious config problems

2019-05-27 Thread hg user
The server was installed and configured by a "zimbra man", a person I fully
trust. Since I manage a commercial antivirus/antispam solution that is not
properly working for the italian language, I was tasked to join the project
in order to understand if we could switch from the proprietary solution to
spamassassin.

I'm now in the process of double-checking the configuration of spamassassin
and feeding the bayes engine...

Testing the system I noticed that spamassassin logged the internal MTAs
(including the antivirus server) as external and I asked *the zimbra man*
to correct the configuration. He replied it was not necessary. Sorry I
didn't specify I asked the person in charge of the system.

Unfortunately, the spamassassin documentation is not really clear and asking
Google can be even more confusing... I found posts stating that nham/nspam
reported by --dump magic are either tokens or messages... according to a
test I did this afternoon, feeding 2 messages to sa-learn ham, those
numbers are tokens.
I noticed that the nham counter kept increasing for several minutes after
sa-learn ended, probably due to the --no-sync parameter... this could also
explain why immediately after the sa-learn of the spam message bayes
reported BAYES_50 and a few minutes later BAYES_00: the engine was still
learning and as new tokens were recorded they changed the result.

In the end, I need to think about RW's answer: spamassassin is run by
amavis but with no internal servers defined, so it treats my internal one as
the external. The Received header needs some more care, and the list of stop
words should probably also be expanded. There is probably a rationale behind
some decisions taken by the developers, but I can't fully understand how the
destination address can help decide whether a message is spam or not, at
least not 6 times.

Tomorrow I will try some -D bayes runs on different messages to try to
understand better what the plugin is doing, and I will try to read all the
source code. Unfortunately I don't know perl...

Probably the best solution is to change the configuration, zap the bayes db
and sa-learn the whole corpus I set aside



On Mon, May 27, 2019 at 8:06 PM Matus UHLAR - fantomas 
wrote:

> On 27.05.19 18:04, hg user wrote:
> >I was writing a message requesting advice on bayes_ignore_header since I
> >was sure something was wrong when I decided to have a look at spamassassin
> >-D bayes output... and I was shocked by what I saw !
> >
> >x-spam-relays-external lists all the hops of the message *including* internal
> >servers and so x-spam-relays-internal is empty...  I specifically asked to
> >add the antivirus and other internal MTAs to the internal list...
>
> how?


Re: my spamassassin has serious config problems

2019-05-27 Thread RW
On Mon, 27 May 2019 18:04:35 +0200
hg user wrote:

> I was writing a message requesting advice on bayes_ignore_header
> since I was sure something was wrong when I decided to have a look at
> spamassassin -D bayes output... and I was shocked by what I saw !
> 
> x-spam-relays-external lists all the hops of the message *including*
> internal servers and so x-spam-relays-internal is empty


You can fix this by setting internal_networks and trusted_networks.
However if SA is running from amavis it probably doesn't matter,


> specifically asked to add the antivirus and other internal MTAs to
> the internal list... and now I find the internal server names used to
> calculate the bayes point...
> 
> I really think this is skewing the result.


The handling of Received headers is a little odd, and could use some
work. The bottom two headers are tokenized directly and extra tokens
come from the x-spam-relays-* pseudo headers. 

IMO everyone should have these settings: 

bayes_ignore_header x-spam-relays-internal
bayes_ignore_header x-spam-relays-external
bayes_ignore_header x-spam-relays-trusted


> * the 2 words of the subject are listed but Subject: is not tokenized
> according to the sources


The subject is treated as part of the body.


Re: my spamassassin has serious config problems

2019-05-27 Thread Matus UHLAR - fantomas

On 27.05.19 18:04, hg user wrote:

> I was writing a message requesting advice on bayes_ignore_header since I
> was sure something was wrong when I decided to have a look at spamassassin
> -D bayes output... and I was shocked by what I saw !
>
> x-spam-relays-external lists all the hops of the message *including* internal
> servers and so x-spam-relays-internal is empty...  I specifically asked to
> add the antivirus and other internal MTAs to the internal list...


how?

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
My mind is like a steel trap - rusty and illegal in 37 states. 


my spamassassin has serious config problems

2019-05-27 Thread hg user
I was writing a message requesting advice on bayes_ignore_header since I
was sure something was wrong when I decided to have a look at spamassassin
-D bayes output... and I was shocked by what I saw !

x-spam-relays-external lists all the hops of the message *including* internal
servers and so x-spam-relays-internal is empty...  I specifically asked to
add the antivirus and other internal MTAs to the internal list... and now I
find the internal server names used to calculate the bayes point...

I really think this is skewing the result.

In the 40 tokens it uses to calculate the score, the internal MTA is
present a couple of times.

I also noticed to my surprise that in the 40 tokens used to calculate the
score:
* the address or domain of the sender is not used
* the address of the internal server is used 2 times
* tokens that are meaningless (to me) because they are too generic are used
several times (10026 is the port the sending server used, 192.168 is an
internal IP range...)
dbg: bayes: token 'H*r:amavisd-new' => 0.00933830395446512
dbg: bayes: token 'H*r:port' => 0.0100739915629308
dbg: bayes: token 'H*r:10026' => 0.00656298715300288
dbg: bayes: token 'H*r:ESMTPSA' => 0.0291881040543893
dbg: bayes: token 'H*RU:ESMTPSA' => 0.0299783424700051
dbg: bayes: token 'Hx-spam-relays-external:ESMTPSA' => 0.0299783424700051
dbg: bayes: token 'H*r:192.168.1' => 0.0332916024497639
dbg: bayes: token 'H*R:U*noreply' => 0.0884273751672186
dbg: bayes: token 'H*r:localhost' => 0.095748955695973
* the address/domain of the receiver is present in various combinations 6
times; why is the receiver address so important?
dbg: bayes: token 'H*r:sk:' => 0.00474205399064878
dbg: bayes: token 'HTo:U*' => 0.00573965631120421
dbg: bayes: token '@.it' => 0.0252948951857414
dbg: bayes: token 'U*' => 0.0252948951857414
dbg: bayes: token 'sk:' => 0.0252948951857414
dbg: bayes: token '' => 0.0252948951857414
* the 2 words of the subject are listed but Subject: is not tokenized
according to the sources
dbg: bayes: token 'INFORMAZIONI' => 0.0198930234212028
dbg: bayes: token 'importanti' => 0.0186572280369034
* the tokens with the highest score are (notice 0.97 to 0.12)
dbg: bayes: token 'assicurarti' => 0.97797086079613
dbg: bayes: token 'caro' => 0.125457833816543

Can you please tell me if my bayes engine is working as it should?
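
To read output like the above more easily, here is a minimal Python sketch
that parses the "dbg: bayes: token" lines and ranks tokens by how far their
probability is from the neutral value 0.5. The function names parse_tokens
and strongest are hypothetical, and ranking by distance from 0.5 is only an
illustration of which tokens carry the most weight, not a reimplementation
of how SpamAssassin selects them. The sample lines are copied from the
debug output above.

```python
import re

# Matches lines such as: dbg: bayes: token 'caro' => 0.125457833816543
TOKEN_RE = re.compile(r"dbg: bayes: token '(.*)' => ([0-9.eE+-]+)")


def parse_tokens(debug_text):
    """Return (token, probability) pairs found in -D bayes output."""
    tokens = []
    for line in debug_text.splitlines():
        match = TOKEN_RE.search(line)
        if match:
            tokens.append((match.group(1), float(match.group(2))))
    return tokens


def strongest(tokens, n=5):
    """Return the n tokens whose probability is farthest from 0.5."""
    return sorted(tokens, key=lambda t: abs(t[1] - 0.5), reverse=True)[:n]


SAMPLE = """\
dbg: bayes: token 'H*r:sk:' => 0.00474205399064878
dbg: bayes: token 'H*r:10026' => 0.00656298715300288
dbg: bayes: token 'assicurarti' => 0.97797086079613
dbg: bayes: token 'caro' => 0.125457833816543
"""

for token, prob in strongest(parse_tokens(SAMPLE), n=3):
    print(f"{prob:.4f}  {token}")
```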