Re: Spamassassin Bayes... "why give that spam that score???"

John Hardin Wed, 24 Feb 2016 17:15:06 -0800

On Thu, 25 Feb 2016, Steve wrote:

On 24/02/2016 22:59, John Hardin wrote:
 On Wed, 24 Feb 2016, Steve wrote:
> I've used spamassassin for many years - on Ubuntu, using amavisd - with
> great success. In recent months, I've been receiving several spam
> messages each day that evade the filters.
 Can you provide samples? (e.g. three or four on Pastebin)
One of each of the most common forms:

http://pastebin.com/Wk2KD1Q1
http://pastebin.com/QCQ9Ymw7
http://pastebin.com/wgkmiJLt

The second one has autolearn=yes, so I would say that autolearn is probably the cause of this behavior.
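For anyone checking their own mail: the autolearn outcome is normally recorded in the X-Spam-Status header. A minimal Python sketch for pulling it out follows. The function name is made up for illustration, and the sample header is an assumed example in stock SpamAssassin layout; other tools that write this header may format it differently.

```python
import re
from typing import Optional


def autolearn_result(status_header: str) -> Optional[str]:
    """Return the value of the autolearn= field, or None if it is absent."""
    # \b plus the literal "=" keeps this from matching autolearn_force=...
    match = re.search(r"\bautolearn=([A-Za-z_]+)", status_header)
    return match.group(1) if match else None


# Assumed sample header value (folded over two lines, as headers often are).
sample = ("No, score=-1.9 required=5.0 tests=BAYES_00,HTML_MESSAGE\n"
          "\tautolearn=ham version=3.4.0")
print(autolearn_result(sample))
```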

Note that the bayes score doesn't contribute to the autolearning decision, to avoid positive feedback, but if there are no non-Bayes spam signs and the message scores lightly negative like that one does, it can be learned as ham. That would make any subsequent similar messages score even lower, possibly offsetting actual spam hits.
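A rough Python sketch of that decision, to make the feedback loop concrete. This is a simplification under stated assumptions, not SpamAssassin's actual code: the function name is hypothetical, the rule scores are illustrative, the thresholds are the documented defaults (0.1 and 12.0), and the real logic applies further conditions (for example on header and body points) that are left out here.

```python
from typing import Dict, Optional


def autolearn_decision(rule_scores: Dict[str, float],
                       ham_threshold: float = 0.1,
                       spam_threshold: float = 12.0) -> Optional[str]:
    """Decide whether a message would be autolearned, ignoring BAYES_* hits."""
    # Bayes rules are excluded so the classifier cannot reinforce itself.
    non_bayes = sum(score for name, score in rule_scores.items()
                    if not name.startswith("BAYES_"))
    if non_bayes < ham_threshold:
        return "ham"
    if non_bayes >= spam_threshold:
        return "spam"
    return None  # between the thresholds: not learned at all


# A spam with no non-Bayes hits to speak of still gets learned as ham.
hits = {"BAYES_00": -1.9, "HTML_MESSAGE": 0.001}
print(autolearn_decision(hits))
# With a stricter (lower) learn-as-ham threshold it is left alone.
print(autolearn_decision(hits, ham_threshold=-1.0))
```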

Subsequently training those messages as spam will offset that effect, but you're to a degree playing whack-a-mole that way.

I misspoke a bit when I said there are no knobs to twiddle. I forgot about the autolearn thresholds, but they aren't strictly part of how bayes itself works; they are (again) training. If you want to use autolearn, you might want to reduce the learn-as-ham threshold even further. View autolearn as a not-quite-trustworthy user making submissions, and the thresholds as a way to limit the effects of poor judgement. :)
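For reference, the thresholds live in the SpamAssassin configuration (local.cf or similar). The option names below are the standard ones; the -1.0 value is only an illustrative guess at "reduced further", not a recommendation tuned for this site.

```
# Learn as ham only when the non-Bayes score is below this value
# (the default is 0.1)
bayes_auto_learn_threshold_nonspam -1.0

# Learn as spam only when the non-Bayes score reaches this value
# (12.0 is the default)
bayes_auto_learn_threshold_spam 12.0
```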

I note that they tend to come from different mail servers each time - the URLs in the body tend to be unique, too.

Have you considered greylisting to give domains a chance to be added to URIBLs before you see them?

> * The false positives all match BAYES_00 - attracting a default score of
> -1.9. BAYES_00 seems to be at the crux of the misclassification.
>
> Is there a way to delve into why these messages have been allocated such
> a low bayes score - while (to a human) appearing blatant, simple, spam
> on "vanilla" spam topics? Has my bayes data been "poisoned" somehow?
 Poisoning is less likely than mistraining.
 How large is your userbase and mail volume?
One user - me - several email addresses. 10,000 mails per month - several mailing lists where I read only a tiny fraction of the posts.


Heh. For once it's someone pretty much like me. :)

~ 1,500 spams (that survive mail server RBLs). Autolearn is on - I don't think about it, it is automatic. :)
 How do you train your Bayes? Autolearn? General user submissions? Trusted
 user submissions? Only you, from only your personal mail?
Only my personal mailbox *really* matters to me. I train from it using the dovecot antispam plugin... which feeds mail I shift to/from a spam folder through a pipe involving "spamc -C".


And I assume there's a similar ham folder? You need both.

 Do you keep base training corpora so you can wipe and retrain if it goes
 off the rails for some reason?
(In principle) I've got multi-gigabyte-scale spam/ham corpora. I'm yet to [ever] do anything with it. :)

I have base bayes corpora of a few thousand messages each, spam and ham, kept in aged corpora files. I add a handful to that every month, mostly on the spam side. SA is trained nightly from the current corpora files and I can retrain from scratch from all of them if needed, but I haven't needed to do that yet.

 If all the FNs are getting BAYES_00, make sure you're (re)training them as
 spam.
I believe I'm doing that - but it isn't easy to prove that the training 'worked'.

If you look at the output from the training you'll be able to see how many "new" messages it learned from.
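When training with sa-learn directly, the summary line looks like "Learned tokens from 3 message(s) (5 message(s) examined)"; the first number is the count of messages that were actually new to the database. A small Python sketch for checking that in a script follows. The function name is hypothetical, and training through a spamc pipe may report differently.

```python
import re
from typing import Optional, Tuple

# Matches the summary line that sa-learn prints after a training run.
SUMMARY = re.compile(
    r"Learned tokens from (\d+) message\(s\) \((\d+) message\(s\) examined\)"
)


def learned_counts(output: str) -> Optional[Tuple[int, int]]:
    """Return (newly learned, examined) from sa-learn output, or None."""
    match = SUMMARY.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


learned, examined = learned_counts(
    "Learned tokens from 3 message(s) (5 message(s) examined)"
)
# If learned is 0, every message examined was already in the database.
print(learned, examined)
```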

It will have an effect, in that it will remove a specific mistraining, but in the meantime autolearn may be making bad decisions about other messages.

 Review how you're training. If your users aren't really trustworthy you
 should be manually reviewing submissions.
When spam arrives in my primary inbox, I hand classify - I'm less obsessive about mailing lists. Dovecot initiates training automatically when I shift messages to a special spam folder.

OK, good. If you had a userbase, their judgement (or lack thereof) could be an issue.

 I feel autolearn can be problematic, particularly if things are already
 going off the rails.
I expect Autolearn (assisted by Razor, Pyzor and DCC) has done the vast majority of my training. This year, I've hand-trained 216 false-negatives and 0 false positives.

For the size of your install, I'd recommend turning off autolearn and going with purely hand-collected corpora. It serves me well.
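Turning it off is a one-line change in the SpamAssassin configuration (local.cf or similar). Manual training through sa-learn keeps working as before.

```
# Disable autolearning; train only from hand-collected corpora
bayes_auto_learn 0
```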

 If you have base training corpora, review it for misclassifications (FNs),
 wipe and retrain.
I guess I could do that... My expectation is that - if I train with the corpora I can pick easily (without changing configuration) - I'll get the same bayes database I currently have... which will give the same scores.

No, autolearning would no longer be affecting the results, and if you *do* get the same FNs, you can then go through your ham corpora and look for other possible causes (misclassified messages, or a ham that's something like part of a discussion about spam so it's confusing and shouldn't be in the corpora at all).

Really, I'd like to understand why my current bayes database makes the classifications it does.


Basically, because of what's been trained into it as ham.

If you autolearn, you can't really review that after the fact.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Markley's Law (variant of Godwin's Law): As an online discussion
  of gun owners' rights grows longer, the probability of an ad hominem
  attack involving penis size approaches 1.
-----------------------------------------------------------------------
 65 days since the first successful real return to launch site (SpaceX)
