On Thu, 25 Feb 2016, Steve wrote:

On 24/02/2016 22:59, John Hardin wrote:
 On Wed, 24 Feb 2016, Steve wrote:

> I've used spamassassin for many years - on Ubuntu, using amavisd - with
> great success. In recent months, I've been receiving several spam
> messages each day that evade the filters.

 Can you provide samples? (e.g. three or four on Pastebin)

One of each of the most common forms:

http://pastebin.com/Wk2KD1Q1
http://pastebin.com/QCQ9Ymw7
http://pastebin.com/wgkmiJLt

The second one has autolearn=yes, so I would say that autolearn is probably the cause of this behavior.

Note that the bayes score doesn't contribute to the autolearning decision to avoid positive feedback, but if there are no non-Bayes spam signs and the message scores lightly negative like that one does, it can be learned as ham. That would make any subsequent similar messages score even lower, possibly offsetting actual spam hits.

Subsequently training those messages as spam will offset that effect, but to a degree you're playing whack-a-mole that way.

I misspoke a bit when I said there are no knobs to twiddle. I forgot about the autolearn thresholds, but they aren't strictly part of how bayes itself works, they are (again) training. If you want to use autolearn, you might want to reduce the learn-as-ham threshold even further. View autolearn as a not-quite-trustworthy user making submissions, and the thresholds are a way to limit the effects of poor judgement. :)
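For reference, those knobs live in local.cf and look something like this (the threshold values below are illustrative, not recommendations - the stock defaults are 0.1 and 12.0):

```
# local.cf -- autolearn settings (illustrative values)
bayes_auto_learn 1

# Messages scoring at or below this are autolearned as ham.
# Lowering it from the 0.1 default makes ham autolearning more
# conservative, which is what's suggested above.
bayes_auto_learn_threshold_nonspam -0.5

# Messages scoring at or above this are autolearned as spam.
bayes_auto_learn_threshold_spam 12.0
```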

I note that they tend to come from different mail servers each time - the URLs in the body tend to be unique, too.

Have you considered greylisting to give domains a chance to be added to URIBLs before you see them?

> * The false positives all match BAYES_00 - attracting a default score of
>   -1.9. BAYES_00 seems to be at the crux of the misclassification.
>
> Is there a way to delve into why these messages have been allocated such
> a low bayes score - while (to a human) appearing blatant, simple, spam
> on "vanilla" spam topics? Has my bayes data been "poisoned" somehow?

 Poisoning is less likely than mistraining.
 How large is your userbase and mail volume?

One user - me - several email addresses. 10,000 mails per month - several mailing lists where I read only a tiny fraction of the posts.

Heh. For once it's someone pretty much like me. :)

~ 1,500 spams (that survive mail server RBLs). Autolearn is on - I don't think about it, it is automatic. :)

 How do you train your Bayes? Autolearn? General user submissions? Trusted
 user submissions? Only you, from only your personal mail?

Only my personal mailbox *really* matters to me. I train from it using the dovecot antispam plugin... which feeds mail I shift to/from a spam folder through a pipe involving "spamc -C".

And I assume there's a similar ham folder? You need both.

 Do you keep base training corpora so you can wipe and retrain if it goes
 off the rails for some reason?

(In principle) I've got multi-gigabyte-scale spam/ham corpora. I've never yet done anything with them. :)

I have base bayes corpora of a few thousand messages each spam and ham, kept in aged corpora files. I add a handful to that every month, mostly on the spam side. SA is trained nightly from the current corpora files and I can retrain from scratch from all of them if needed, but I haven't needed to do that yet.
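For anyone wanting to try the wipe-and-retrain, it's roughly this (the mbox paths are hypothetical - substitute wherever you keep your corpora):

```shell
# Wipe the existing Bayes database for this user (or site-wide,
# depending on how bayes_path is configured)
sa-learn --clear

# Retrain from the kept corpora files
sa-learn --spam --mbox ~/corpora/spam-*.mbox
sa-learn --ham  --mbox ~/corpora/ham-*.mbox
```

Run it as the same user spamd runs as, so the rebuilt database lands where scanning will actually find it.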

 If all the FNs are getting BAYES_00, make sure you're (re)training them as
 spam.

I believe I'm doing that - but it isn't easy to prove that the training 'worked'.

If you look at the output from the training you'll be able to see how many "new" messages it learned from.
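You can also compare the Bayes counters before and after a training pass; if nspam goes up after you feed it a missed spam, the training registered:

```shell
# nspam/nham are the counts of messages learned as spam/ham so far
sa-learn --dump magic | grep -E 'nspam|nham'
```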

It will have an effect, in that it will remove a specific mistraining, but in the meantime autolearn may be making bad decisions about other messages.

 Review how you're training. If your users aren't really trustworthy you
 should be manually reviewing submissions.

When spam arrives in my primary inbox, I hand classify - I'm less obsessive about mailing lists. Dovecot initiates training automatically when I shift messages to a special spam folder.

OK, good. If you had a userbase, their judgement (or lack thereof) could be an issue.

 I feel autolearn can be problematic, particularly if things are already
 going off the rails.

I expect autolearn (assisted by Razor, Pyzor and DCC) has done the vast majority of my training. This year, I've hand-trained 216 false negatives and 0 false positives.

For the size of your install, I'd recommend turning off autolearn and going with purely hand-collected corpora. That serves me well.

 If you have base training corpora, review it for misclassifications (FNs),
 wipe and retrain.

I guess I could do that... My expectation is that, if I train with the corpora I can pick easily (without changing configuration), I'll get the same bayes database I currently have... which will give the same scores.

No, autolearning would no longer be affecting the results, and if you *do* get the same FNs, you can then go through your ham corpora and look for other possible causes (misclassified messages, or a ham that's something like part of a discussion about spam so it's confusing and shouldn't be in the corpora at all).

Really, I'd like to understand why my current bayes database makes the classifications it does.

Basically, because of what's been trained into it as ham.

If you autolearn, you can't really review that after the fact.
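What you *can* do is ask SA to explain a single message: running it through spamassassin with the bayes debug facility enabled will show which tokens were looked up and the probabilities they contributed (debug output goes to stderr):

```shell
# Show the Bayes token lookups and probabilities for one message
spamassassin -D bayes < message.eml 2>&1 | grep -i bayes
```

That won't tell you *how* a token got its probability - that history is gone once it's been autolearned - but it does show which tokens are dragging a given message down to BAYES_00.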

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Markley's Law (variant of Godwin's Law): As an online discussion
  of gun owners' rights grows longer, the probability of an ad hominem
  attack involving penis size approaches 1.
-----------------------------------------------------------------------
 65 days since the first successful real return to launch site (SpaceX)
