Re: Large commented out body HTML causing SA to timeout/give up/allow spam

2014-09-06 Thread Henrik K
On Fri, Sep 05, 2014 at 06:38:23PM +0200, Matus UHLAR - fantomas wrote:
> >On 2014-09-05 17:55, Justin Edmands wrote:
> >>We are seeing a few emails that are about 1 MB and appear to have
> 
> On 05.09.14 18:09, Adi wrote:
> >You should consider limiting SA processing to mail size
> >up to 150-250 KB.
> >
> >Very few spam messages are bigger.
> 
> I'm happily filtering all mail with SA, and it does catch much of the spam.
> Unfortunately there IS spam >256 KB...
> 
> Yes, it _needs_ a lot of memory and CPU power...

Nah,

Amavisd-new has supported truncating large messages for 5 years already.

- large messages beyond $sa_mail_body_size_limit are now partially passed
  to SpamAssassin and other spam scanners for checking: a copy passed to
  a spam scanner is truncated near or slightly past the indicated limit.
  Large messages are no longer given an almost free passage through spam
  checks.
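
For reference, the relevant knob in amavisd.conf is a single Perl line; the
400 KB value here is only an example:

  $sa_mail_body_size_limit = 400*1024;  # bytes handed to SA; the rest is truncated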

Also there's my patch to make SA handle big blobs gracefully:
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6582

Skipping large messages outright is a bit of a '90s mentality; technically
it's pointless.



Reply versus new thread [Was: Dumping email with blank To: header ?]

2014-09-06 Thread Ian Zimmerman
Others have graciously answered as to the substance of your message.

I'll have to be a pest and ask that you please not use "Reply" or
"Followup" when you're starting a new topic.  For list readers whose user
agents thread the standard (RFC-compliant) way, that breaks threading.

The way to start a new topic is to copy the list address, do a "New
Message" or similar, and paste the address into the destination field.
You can also save the address in your contact list / address book to
avoid the copy and paste in the future.

Thanks for your cooperation.

-- 
Please *no* private copies of mailing list or newsgroup messages.


Re: large spam messages

2014-09-06 Thread Ian Zimmerman
On Thu, 4 Sep 2014 12:52:34 -0400 (EDT),
Jude DaShiell  wrote:

Jude> Since spamassassin cannot handle large spam over 2MB in size, what
Jude> can be used to handle that class of junk?

I use a script on the MX host to MIME-reshape all large messages, dropping
all non-text attachments and saving them to files there, before forwarding
to my IMAP server.  If such a message is ham (which is almost never the
case), it is easy enough to download the files after the fact.

I can share the script for the asking.
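
If you just want the idea, here is a rough sketch in Python (explicitly not
the script itself; the save directory is only an example and must already
exist):

  # Read a message on stdin, save non-text parts to files,
  # and write the reduced message to stdout (rough sketch, untested).
  import sys, os, email
  from email import policy

  SAVE_DIR = "/var/spool/stripped"   # example path, assumed to exist

  msg = email.message_from_binary_file(sys.stdin.buffer, policy=policy.default)

  if msg.is_multipart():
      kept = []
      for i, part in enumerate(msg.iter_parts()):
          if part.get_content_maintype() == "text":
              kept.append(part)          # keep text/plain, text/html, ...
          else:                          # save anything else aside, drop it
              name = part.get_filename() or "part-%d.bin" % i
              with open(os.path.join(SAVE_DIR, name), "wb") as f:
                  f.write(part.get_payload(decode=True) or b"")
      msg.set_payload(kept)

  sys.stdout.buffer.write(msg.as_bytes())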

-- 
Please *no* private copies of mailing list or newsgroup messages.


Re: Bayes autolearn questions

2014-09-06 Thread Karsten Bräckelmann
Please use plain text rather than HTML, in particular with that really
badly indented style of quoting.


On Sat, 2014-09-06 at 17:22 -0400, Alex wrote:
> On Thu, Sep 4, 2014 at 1:44 PM, Karsten Bräckelmann wrote:
> > On Wed, 2014-09-03 at 23:50 -0400, Alex wrote:
> >
> > > > > I looked in the quarantined message, and according to the _TOKEN_
> > > > > header I've added:
> > > > >
> > > > > X-Spam-MyReport: Tokens: new, 47; hammy, 7; neutral, 54; spammy, 16.
> > > > >
> > > > > Isn't that sufficient for auto-learning this message as spam?
> > 
> > That's clearly referring to the _TOKEN_ data in the custom header, is it
> > not?
> 
> Yes. Burning the candle at both ends. Really overworked.

Sorry to hear that. Nonetheless, did you take the time to really
understand my explanations? It seems you sometimes haven't in the past,
and I am not happy to spend my time on other people's problems if they
aren't following along thoroughly.


> > > > That has absolutely nothing to do with auto-learning. Where did you get
> > > > the impression it might?
> > >
> > > If the conditions for autolearning had been met, I understood that it
> > > would be those new tokens that would be learned.
> >
> > Learning is not limited to new tokens. All tokens are learned,
> > regardless of their current (h|sp)ammyness.
> >
> > Still, the number of (new) tokens is not a condition for auto-learning.
> > That header shows some more or less nice information, but in this
> > context absolutely irrelevant information.
> 
> I understood "new" to mean the tokens that have not been seen before, and
> would be learned if the other conditions were met.

Well, yes. So what?

Did you understand that the number of previously unseen tokens has
absolutely nothing to do with auto-learning? Did you understand that all
tokens are learned, regardless of whether they have been seen before?

This whole part is entirely unrelated to auto-learning and your original
question.


> > Auto-learning in a nutshell: Take all the tests hit. Drop some of them
> > with certain tflags, like the BAYES_xx rules. For the remaining rules,
> > look up their scores in the non-Bayes scoreset 0 or 1. Sum up those
> > scores to a total, and compare with the auto-learn threshold values. For
> > spam, also check that there are at least 3 points each from header and
> > body rules. Finally, if all that matches, learn.
> 
> Is it important to understand how those three points are achieved or
> calculated?

In most cases, no, I guess. Though that is really just a distinction
usually easy to make based on the rule's type: header vs body-ish rule
definitions.

If the re-calculated total score in scoreset 0 or 1 exceeds the
auto-learn threshold but the message still is not auto-learned, then it
is important. Unless you trust the auto-learn discriminator not to cheat
on you.
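
To make that concrete, here is a toy sketch of the decision (Python
pseudo-code of my description above, not SA's actual implementation; the
thresholds are the stock defaults, the data layout is made up):

  # Decide whether to auto-learn, given the rules a message hit.
  HAM_THRESHOLD  = 0.1    # bayes_auto_learn_threshold_nonspam (default)
  SPAM_THRESHOLD = 12.0   # bayes_auto_learn_threshold_spam (default)

  def autolearn(hits):
      # hits: list of dicts, e.g. {'tflags': ['noautolearn'],
      #                            'score0': 2.5, 'type': 'header'}
      total = head = body = 0.0
      for rule in hits:
          if {'noautolearn', 'learn', 'userconf'} & set(rule['tflags']):
              continue                     # e.g. BAYES_xx rules are dropped
          score = rule['score0']           # score from the non-Bayes scoreset
          total += score
          if rule['type'] == 'header':
              head += score
          elif rule['type'] == 'body':
              body += score
      if total <= HAM_THRESHOLD:
          return 'ham'
      if total >= SPAM_THRESHOLD and head >= 3 and body >= 3:
          return 'spam'
      return None                          # learn nothing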


> > > Okay, of course I understood the difference between points and tokens.
> > > Since the points were over the specified threshold, I thought those
> > > new tokens would have been added.
> >
> > As I have mentioned before in this thread: It is NOT the message's
> > reported total score that must exceed the threshold. The auto-learning
> > discriminator uses an internally calculated score using the respective
> > non-Bayes scoreset.
> 
> Very helpful, thanks. Is there a way to see more about how it makes that
> decision on a particular message?

  spamassassin -D learn

Unsurprisingly, the -D debug option shows information on that decision.
In this case limiting debug output to the 'learn' area comes in handy,
eliminating the noise.

The output includes the important details: the auto-learn decision with a
human-readable explanation, the score computed for auto-learning, and the
header and body points.
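
For example (the file name is just an example):

  spamassassin -D learn < message.eml > /dev/null

The debug lines go to stderr, so redirecting stdout away leaves only the
'learn' area output on the terminal.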





Re: shouldn't "spamc -L spam" always create BAYES_99?

2014-09-06 Thread Karsten Bräckelmann
On Sun, 2014-09-07 at 09:09 +1200, Jason Haar wrote:
> We've got a problem with a tonne of spam getting BAYES_50 or even
> BAYES_00. We're re-training SA using "spamc -L spam" but it doesn't seem
> to do as much as we'd like. Sometimes it doesn't change the BAYES_
> score, and other times it might go from BAYES_50 to BAYES_80
> 
> I think bayes is working (there's also a tonne of mail getting BAYES_99)
> but I'm guessing there's some "learning logic" I'm not aware of to
> explain why me telling SA "this is spam" doesn't seem to be entirely
> listened to?

The Bayesian classifier operates on tokens, not messages. So while
training a message as spam is like "this is spam" as you put it,
according to Bayes it's "these tokens appear in spam".

For each token (think of them as words), the number of ham and spam
messages it appeared in and has been learned from is counted. The more
skewed that ratio is, the higher the probability that a later message
containing that token belongs to the same classification.


> So my question is: shouldn't "-L spam"/"-L ham" always make SA re-train
> the bayes more explicitly? Or is that really not possible with a single
> email message? (ie it's a statistics thing). Just trying to understand
> the backend :-)

It's statistics. Learning (increasing the number of ham or spam messages
a token has been seen in) has less effect for tokens seen about equally
frequently in both ham and spam than for tokens where there already is a
bias. Similarly, tokens with high counts need more training to change
their overall probability than tokens less common in mail. IOW, words
like "and" will never be a strong spammyness indicator.
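
As a toy illustration only (this is not SA's actual Bayes code, which
combines the per-token probabilities with Robinson's chi-square method):

  # Per-token spammyness from learned counts, roughly.
  def token_probability(spam_count, ham_count, nspam, nham):
      spam_freq = spam_count / max(nspam, 1)   # fraction of learned spam containing it
      ham_freq  = ham_count  / max(nham, 1)    # fraction of learned ham containing it
      if spam_freq + ham_freq == 0:
          return 0.5                           # unseen token: neutral
      return spam_freq / (spam_freq + ham_freq)

  # "and", seen in most of both corpora, sits near 0.5 and barely moves with
  # additional training; a token seen almost only in spam approaches 1.0.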


For more details on that entire topic of Bayes and training, I suggest
the sa-learn man page / documentation. For a closer look at the tokens
used for classification see the hammy/spammytokens Template Tags in the
M::SA::Conf docs. Both available here:

  http://spamassassin.apache.org/doc/

For ad-hoc debugging after training, see the spamassassin --cf option,
which lets you add_header the token details without needing to actually
add them to every mail.
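
For example, something along these lines (the header name is arbitrary;
_TOKENSUMMARY_ is the template tag behind the "Tokens: new, ...; hammy, ..."
line discussed in the auto-learn thread):

  spamassassin -t --cf='add_header all MyReport _TOKENSUMMARY_' < message.eml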





Re: Bayes autolearn questions

2014-09-06 Thread Alex
Hi,

On Thu, Sep 4, 2014 at 1:44 PM, Karsten Bräckelmann wrote:

> On Wed, 2014-09-03 at 23:50 -0400, Alex wrote:
>
> > > > I looked in the quarantined message, and according to the _TOKEN_
> > > > header I've added:
> > > >
> > > > X-Spam-MyReport: Tokens: new, 47; hammy, 7; neutral, 54; spammy, 16.
> > > >
> > > > Isn't that sufficient for auto-learning this message as spam?
> 
> That's clearly referring to the _TOKEN_ data in the custom header, is it
> not?
>

Yes. Burning the candle at both ends. Really overworked.


> > > That has absolutely nothing to do with auto-learning. Where did you get
> > > the impression it might?
> >
> > If the conditions for autolearning had been met, I understood that it
> > would be those new tokens that would be learned.
>
> Learning is not limited to new tokens. All tokens are learned,
> regardless of their current (h|sp)ammyness.
>
> Still, the number of (new) tokens is not a condition for auto-learning.
> That header shows some more or less nice information, but in this
> context absolutely irrelevant information.
>

I understood "new" to mean the tokens that have not been seen before, and
would be learned if the other conditions were met.


> Auto-learning in a nutshell: Take all the tests hit. Drop some of them
> with certain tflags, like the BAYES_xx rules. For the remaining rules,
> look up their scores in the non-Bayes scoreset 0 or 1. Sum up those
> scores to a total, and compare with the auto-learn threshold values. For
> spam, also check that there are at least 3 points each from header and
> body rules. Finally, if all that matches, learn.
>

Is it important to understand how those three points are achieved or
calculated?


> > Okay, of course I understood the difference between points and tokens.
> > Since the points were over the specified threshold, I thought those
> > new tokens would have been added.
>
> As I have mentioned before in this thread: It is NOT the message's
> reported total score that must exceed the threshold. The auto-learning
> discriminator uses an internally calculated score using the respective
> non-Bayes scoreset.
>

Very helpful, thanks. Is there a way to see more about how it makes that
decision on a particular message?

Thanks,
Alex


shouldn't "spamc -L spam" always create BAYES_99?

2014-09-06 Thread Jason Haar
Hi there

We've got a problem with a tonne of spam getting BAYES_50 or even
BAYES_00. We're re-training SA using "spamc -L spam" but it doesn't seem
to do as much as we'd like. Sometimes it doesn't change the BAYES_
score, and other times it might go from BAYES_50 to BAYES_80

I think bayes is working (there's also a tonne of mail getting BAYES_99)
but I'm guessing there's some "learning logic" I'm not aware of to
explain why me telling SA "this is spam" doesn't seem to be entirely
listened to?

So my question is: shouldn't "-L spam"/"-L ham" always make SA re-train
the bayes more explicitly? Or is that really not possible with a single
email message? (ie it's a statistics thing). Just trying to understand
the backend :-)

-- 
Cheers

Jason Haar
Corporate Information Security Manager, Trimble Navigation Ltd.
Phone: +1 408 481 8171
PGP Fingerprint: 7A2E 0407 C9A6 CAF6 2B9F 8422 C063 5EBB FE1D 66D1




Re: Large commented out body HTML causing SA to timeout/give up/allow spam

2014-09-06 Thread Matus UHLAR - fantomas

>> I'm happily filtering all mail with SA, and it does catch much of the spam.
>> Unfortunately there IS spam >256 KB...

On 05.09.14 18:59, Adi wrote:
> Of course, but it is such a small percentage compared to the good
> messages (bigger than 250 KB) that IMHO it is not worth filtering.

If it does not overload the machine, it is worth it not to see spam in the
inbox.

> I only "prefilter" with a very few RBLs (in the MTA; the rest of the RBLs
> are in SA) and optionally clamav.
>
> But everyone has their own approach, based on their experience and the
> capabilities of their hardware :)
>
> Do you have statistics (they would be interesting)?
>
> 1. For messages bigger than 250 KB: SPAM / HAM %

No...

> 2. How much SPAM was caught (in %) at < 250 KB vs. > 250 KB?

12 of 40 this year, ~100 out of ~7000 since I started saving them...
(I don't count those rejected with a score >10.)
--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Eagles may soar, but weasels don't get sucked into jet engines.