Re: Large commented out body HTML causing SA to timeout/give up/allow spam
On Fri, Sep 05, 2014 at 06:38:23PM +0200, Matus UHLAR - fantomas wrote:
> >On 2014-09-05 17:55, Justin Edmands wrote:
> >>We are seeing a few emails that are about 1 MB and appear to have
>
> On 05.09.14 18:09, Adi wrote:
> >You should consider limiting SA processing to mail sizes of up to
> >150-250 KB. Very few spam messages are bigger.
>
> I'm happily filtering all mail with SA and it does catch much of the
> spam. Unfortunately there IS spam >256K...
>
> Yes, it _needs_ much memory and CPU power...

Nah, Amavisd-new has supported truncating large messages for 5 years
already:

  - large messages beyond $sa_mail_body_size_limit are now partially
    passed to SpamAssassin and other spam scanners for checking: a copy
    passed to a spam scanner is truncated near or slightly past the
    indicated limit. Large messages are no longer given an almost free
    passage through spam checks.

There is also my patch to make SA handle big blobs gracefully:
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6582

Skipping large messages is a bit of a 90's mentality; technically, it is
pointless.
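The truncation amavisd-new describes can be sketched in a few lines. This
is an illustrative Python sketch of the idea only, not amavisd-new's actual
(Perl) implementation; the function name and default limit are assumptions:

```python
def truncate_for_scanner(raw: bytes, limit: int = 262144) -> bytes:
    """Return a copy truncated near, or slightly past, `limit` bytes.

    The cut is extended to the next newline, so the scanner never sees a
    line chopped in half -- roughly what amavisd-new describes as
    "truncated near or slightly past the indicated limit".
    """
    if len(raw) <= limit:
        return raw
    cut = raw.find(b"\n", limit)
    if cut == -1:            # no later newline at all: hard cut
        return raw[:limit]
    return raw[:cut + 1]     # keep the final line whole

# A 1 MB message is cut down to roughly the limit before scanning:
big = b"Subject: test\n" + b"spam line\n" * 100000
copy = truncate_for_scanner(big, limit=1000)
```

The point is that the scanner still sees the headers and the start of the
body, which is where most rules hit, at a fraction of the memory and CPU
cost.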
Reply versus new thread [Was: Dumping email with blank To: header ?]
Others have gracefully answered as to the substance of your message. I'll
have to be a pest and ask that you please not use "Reply" or "Followup"
when you're starting a new topic. For list readers whose user agents
thread the standard (RFC-specified) way, that breaks threading.

The way to start a new topic is to copy the list address, do a "New
Message" or similar, and paste the address into the destination field.
You can also save the address in your contact list / address book to
avoid the copy and paste in the future.

Thanks for your cooperation.

--
Please *no* private copies of mailing list or newsgroup messages.
Re: large spam messages
On Thu, 4 Sep 2014 12:52:34 -0400 (EDT), Jude DaShiell wrote:

Jude> Since spamassassin cannot handle large spam over 2MB in size, what
Jude> can be used to handle that class of junk?

I use a script on the MX host to MIME-reshape all large messages,
dropping all non-text attachments and saving them to files there, before
forwarding to my IMAP server. If such a message is ham (which is almost
never), it is easy enough to download the files after the fact. I can
share the script for the asking.
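The kind of MIME reshaping described above can be sketched with Python's
standard email library. This is a hypothetical stand-in for the poster's
unshared script, assuming the goal is simply "keep the text parts, dump
everything else to files":

```python
import os
import tempfile

import email
import email.policy
from email.message import EmailMessage

def reshape(raw: bytes, dump_dir: str) -> bytes:
    """Drop non-text parts from a message, saving them as files instead."""
    msg = email.message_from_bytes(raw, policy=email.policy.default)
    if not msg.is_multipart():
        return raw
    texts = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        if part.get_content_maintype() == "text":
            texts.append(part.get_content())
        else:
            # save the attachment aside so it can be fetched later if ham
            name = os.path.basename(part.get_filename() or "part.bin")
            with open(os.path.join(dump_dir, name), "wb") as f:
                f.write(part.get_payload(decode=True) or b"")
    # rebuild a plain-text message with the original non-MIME headers
    slim = EmailMessage(policy=email.policy.default)
    for h, v in msg.items():
        if h.lower() != "mime-version" and not h.lower().startswith("content-"):
            slim[h] = v
    slim.set_content("\n".join(texts))
    return slim.as_bytes()

# demo: a message with one text part and one binary attachment
demo = EmailMessage()
demo["Subject"] = "invoice"
demo["From"] = "someone@example.org"
demo.set_content("hello body")
demo.add_attachment(b"\x00\x01\x02", maintype="application",
                    subtype="octet-stream", filename="blob.bin")
dump_dir = tempfile.mkdtemp()
slim_bytes = reshape(demo.as_bytes(), dump_dir)
```

A real deployment would also want size checks before bothering to reshape,
and unique filenames to avoid collisions between messages.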
Re: Bayes autolearn questions
Please use plain-text rather than HTML, in particular with that really
bad indentation format of quoting.

On Sat, 2014-09-06 at 17:22 -0400, Alex wrote:
> On Thu, Sep 4, 2014 at 1:44 PM, Karsten Bräckelmann wrote:
> > On Wed, 2014-09-03 at 23:50 -0400, Alex wrote:
> > >
> > > > > I looked in the quarantined message, and according to the
> > > > > _TOKEN_ header I've added:
> > > > >
> > > > > X-Spam-MyReport: Tokens: new, 47; hammy, 7; neutral, 54; spammy, 16.
> > > > >
> > > > > Isn't that sufficient for auto-learning this message as spam?
> >
> > That's clearly referring to the _TOKEN_ data in the custom header, is
> > it not?
>
> Yes. Burning the candle at both ends. Really overworked.

Sorry to hear. Nonetheless, did you take the time to really understand my
explanations? It seems you sometimes didn't in the past, and I am not
happy to waste my time on other people's problems if they aren't
following thoroughly.

> > > > That has absolutely nothing to do with auto-learning. Where did
> > > > you get the impression it might?
> > >
> > > If the conditions for autolearning had been met, I understood that
> > > it would be those new tokens that would be learned.
> >
> > Learning is not limited to new tokens. All tokens are learned,
> > regardless of their current (h|sp)ammyness.
> >
> > Still, the number of (new) tokens is not a condition for
> > auto-learning. That header shows some more or less nice information,
> > but in this context absolutely irrelevant information.
>
> I understood "new" to mean the tokens that have not been seen before,
> and would be learned if the other conditions were met.

Well, yes. So what? Did you understand that the number of previously
unseen tokens has absolutely nothing to do with auto-learning? Did you
understand that all tokens are learned, regardless of whether they have
been seen before?

This whole part is entirely unrelated to auto-learning and your original
question.

> > Auto-learning in a nutshell: Take all tests hit. Drop some of them
> > with certain tflags, like the BAYES_xx rules. For the remaining
> > rules, look up their scores in the non-Bayes scoreset 0 or 1. Sum up
> > those scores to a total, and compare with the auto-learn threshold
> > values. For spam, also check there are at least 3 points each by
> > header and body rules. Finally, if all that matches, learn.
>
> Is it important to understand how those three points are achieved or
> calculated?

In most cases, no, I guess. The distinction is usually easy to make
based on the rule's type: header vs. body-ish rule definitions.

If the re-calculated total score in scoreset 0 or 1 exceeds the
auto-learn threshold but the message still is not learned -- then it is
important. Unless you trust the auto-learn discriminator not to cheat on
you.

> > > Okay, of course I understood the difference between points and
> > > tokens. Since the points were over the specified threshold, I
> > > thought those new tokens would have been added.
> >
> > As I have mentioned before in this thread: It is NOT the message's
> > reported total score that must exceed the threshold. The auto-learning
> > discriminator uses an internally calculated score using the respective
> > non-Bayes scoreset.
>
> Very helpful, thanks. Is there a way to see more about how it makes
> that decision on a particular message?

  spamassassin -D learn

Unsurprisingly, the -D debug option shows information on that decision.
In this case, limiting debug output to the 'learn' area comes in handy,
eliminating the noise. The output includes the important details, like
the auto-learn decision with a human-readable explanation, and the score
computed for autolearn as well as the head and body points.

--
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){
putchar(t[s]);h=m;s=0; }}}
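The nutshell description quoted above can be restated as a toy Python
sketch. The rule names, scores, and rule-type labels below are made up for
illustration, and real SA has more special cases; the 12.0/0.1 thresholds
correspond to the default bayes_auto_learn_threshold_spam/nonspam
settings:

```python
def autolearn_decision(hits, scores, spam_threshold=12.0, ham_threshold=0.1):
    """hits: (rule_name, rule_type, tflags) triples for rules that fired.
    scores: rule scores from the *non-Bayes* scoreset (0 or 1)."""
    # Drop rules with certain tflags, and the BAYES_xx rules themselves.
    usable = [(name, rtype) for name, rtype, tflags in hits
              if "noautolearn" not in tflags and not name.startswith("BAYES_")]
    # Re-total using the non-Bayes scoreset, not the reported score.
    total = sum(scores[name] for name, _ in usable)
    head_points = sum(scores[n] for n, t in usable if t == "header")
    body_points = sum(scores[n] for n, t in usable if t == "body")
    # For spam, require at least 3 points each from header and body rules.
    if total >= spam_threshold and head_points >= 3.0 and body_points >= 3.0:
        return "spam"
    if total <= ham_threshold:
        return "ham"
    return None  # no auto-learning

scores = {"FORGED_HDR": 5.0, "SPAMMY_BODY": 8.0, "OTHER_HDR": -1.0}
spam_hits = [("FORGED_HDR", "header", ""), ("SPAMMY_BODY", "body", ""),
             ("BAYES_99", "body", "learn noautolearn")]
verdict = autolearn_decision(spam_hits, scores)
```

Note how BAYES_99 contributes nothing here: that is exactly why the
reported total score can exceed the threshold while the internally
re-computed one does not.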
Re: shouldn't "spamc -L spam" always create BAYES_99?
On Sun, 2014-09-07 at 09:09 +1200, Jason Haar wrote:
> We've got a problem with a tonne of spam getting BAYES_50 or even
> BAYES_00. We're re-training SA using "spamc -L spam" but it doesn't
> seem to do as much as we'd like. Sometimes it doesn't change the
> BAYES_ score, and other times it might go from BAYES_50 to BAYES_80.
>
> I think bayes is working (there's also a tonne of mail getting
> BAYES_99) but I'm guessing there's some "learning logic" I'm not aware
> of to explain why me telling SA "this is spam" doesn't seem to be
> entirely listened to?

The Bayesian classifier operates on tokens, not messages. So while
training a message as spam is like "this is spam", as you put it,
according to Bayes it is "these tokens appear in spam". For each token
(think of it as a word), the number of ham and spam messages it appeared
in and has been learned from is counted. The more biased that ratio is,
the higher the probability that a later message containing the token gets
the same classification.

> So my question is: shouldn't "-L spam"/"-L ham" always make SA
> re-train the bayes more explicitly? Or is that really not possible
> with a single email message? (ie it's a statistics thing). Just trying
> to understand the backend :-)

It's statistics. Learning (increasing the number of ham or spam messages
a token has been seen in) has less effect for tokens seen about equally
frequently in both ham and spam than for tokens where there already is a
bias. Similarly, tokens with high counts need more training to change
their overall probability than tokens less common in mail. IOW, words
like "and" will never be a strong spammyness indicator.

For more details on the entire topic of Bayes and training, I suggest the
sa-learn man page / documentation. For a closer look at the tokens used
for classification, see the hammy/spammytokens Template Tags in the
M::SA::Conf docs. Both are available here:
http://spamassassin.apache.org/doc/

For ad-hoc debugging after training, see the spamassassin --cf option to
add_header the token details without a need to actually add them to every
mail.
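Karsten's point about high-count tokens can be seen with the classic
smoothed token-probability estimate used by Bayesian mail filters. This is
the generic Robinson-style formula, shown for illustration only, not SA's
exact internals (SA additionally combines per-token probabilities with a
chi-squared test and expires old tokens):

```python
def token_spam_prob(spam_count, ham_count, strength=1.0, background=0.5):
    """Smoothed estimate that a token indicates spam, given how many
    spam and ham messages it has been learned from."""
    n = spam_count + ham_count
    p = spam_count / n if n else background
    # pull low-count tokens toward the neutral background probability
    return (strength * background + n * p) / (strength + n)

rare = token_spam_prob(3, 0)          # token learned from a little spam only
common = token_spam_prob(5000, 5000)  # a word like "and", seen everywhere
nudged = token_spam_prob(5001, 5000)  # ...after one more spam training
```

A token seen a handful of times, only in spam, is already strongly spammy,
while a token with thousands of sightings on both sides sits at ~0.5 and
barely moves after one more "spamc -L spam"; that is why a single training
run often fails to budge BAYES_50.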
Re: Bayes autolearn questions
Hi,

On Thu, Sep 4, 2014 at 1:44 PM, Karsten Bräckelmann wrote:
> On Wed, 2014-09-03 at 23:50 -0400, Alex wrote:
> >
> > > > I looked in the quarantined message, and according to the _TOKEN_
> > > > header I've added:
> > > >
> > > > X-Spam-MyReport: Tokens: new, 47; hammy, 7; neutral, 54; spammy, 16.
> > > >
> > > > Isn't that sufficient for auto-learning this message as spam?
>
> That's clearly referring to the _TOKEN_ data in the custom header, is
> it not?

Yes. Burning the candle at both ends. Really overworked.

> > > That has absolutely nothing to do with auto-learning. Where did you
> > > get the impression it might?
> >
> > If the conditions for autolearning had been met, I understood that it
> > would be those new tokens that would be learned.
>
> Learning is not limited to new tokens. All tokens are learned,
> regardless their current (h|sp)ammyness.
>
> Still, the number of (new) tokens is not a condition for auto-learning.
> That header shows some more or less nice information, but in this
> context absolutely irrelevant information.

I understood "new" to mean the tokens that have not been seen before, and
would be learned if the other conditions were met.

> Auto-learning in a nutshell: Take all tests hit. Drop some of them with
> certain tflags, like the BAYES_xx rules. For the remaining rules, look
> up their scores in the non-Bayes scoreset 0 or 1. Sum up those scores
> to a total, and compare with the auto-learn threshold values. For spam,
> also check there are at least 3 points each by header and body rules.
> Finally, if all that matches, learn.

Is it important to understand how those three points are achieved or
calculated?

> > Okay, of course I understood the difference between points and
> > tokens. Since the points were over the specified threshold, I thought
> > those new tokens would have been added.
>
> As I have mentioned before in this thread: It is NOT the message's
> reported total score that must exceed the threshold. The auto-learning
> discriminator uses an internally calculated score using the respective
> non-Bayes scoreset.

Very helpful, thanks. Is there a way to see more about how it makes that
decision on a particular message?

Thanks,
Alex
shouldn't "spamc -L spam" always create BAYES_99?
Hi there,

We've got a problem with a tonne of spam getting BAYES_50 or even
BAYES_00. We're re-training SA using "spamc -L spam" but it doesn't seem
to do as much as we'd like. Sometimes it doesn't change the BAYES_ score,
and other times it might go from BAYES_50 to BAYES_80.

I think bayes is working (there's also a tonne of mail getting BAYES_99),
but I'm guessing there's some "learning logic" I'm not aware of to
explain why me telling SA "this is spam" doesn't seem to be entirely
listened to?

So my question is: shouldn't "-L spam"/"-L ham" always make SA re-train
the bayes more explicitly? Or is that really not possible with a single
email message? (ie it's a statistics thing). Just trying to understand
the backend :-)

--
Cheers

Jason Haar
Corporate Information Security Manager, Trimble Navigation Ltd.
Phone: +1 408 481 8171
PGP Fingerprint: 7A2E 0407 C9A6 CAF6 2B9F 8422 C063 5EBB FE1D 66D1
Re: Large commented out body HTML causing SA to timeout/give up/allow spam
>> I'm happily filtering all mail with SA and it does catch much of the
>> spam. Unfortunately there IS spam >256K...

On 05.09.14 18:59, Adi wrote:
> Of course, but it is such a small percentage compared to the good
> messages (those bigger than 250 KB) that IMHO it is not worth
> filtering.

If it does not overload the machine, it is worth it not to see spam in
the inbox.

> I only "prefilter" with a very few RBLs (in the MTA; the remaining RBLs
> are in SA) and optionally clamav. But everyone has their own approach,
> based on their experience and the capabilities of the hardware :)
>
> Do you have statistics (they would be interesting)?
>
> 1. For messages bigger than 250 KB: SPAM / HAM %?

no...

> 2. How much SPAM (in %) was caught at < 250 KB vs. > 250 KB?

12 of 40 this year, ~100 out of ~7000 since I started saving them...
(I don't count those rejected with a score >10.)

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Eagles may soar, but weasels don't get sucked into jet engines.