On Thu, 25 Feb 2016, Steve wrote:
On 24/02/2016 22:59, John Hardin wrote:
On Wed, 24 Feb 2016, Steve wrote:
> I've used spamassassin for many years - on Ubuntu, using amavisd - with
> great success. In recent months, I've been receiving several spam
> messages each day that evade the filters.
Can you provide samples? (e.g. three or four on Pastebin)
One of each of the most common forms:
http://pastebin.com/Wk2KD1Q1
http://pastebin.com/QCQ9Ymw7
http://pastebin.com/wgkmiJLt
The second one has autolearn=yes, so I would say that autolearn is
probably the cause of this behavior.
Note that the bayes score doesn't contribute to the autolearning decision
to avoid positive feedback, but if there are no non-Bayes spam signs and
the message scores lightly negative like that one does, it can be learned
as ham. That would make any subsequent similar messages score even lower,
possibly offsetting actual spam hits.
Subsequently training those messages as spam will offset that effect, but
you're to a degree playing whack-a-mole that way.
I misspoke a bit when I said there are no knobs to twiddle. I forgot about
the autolearn thresholds, but they aren't strictly part of how bayes
itself works, they are (again) training. If you want to use autolearn, you
might want to reduce the learn-as-ham threshold even further. View
autolearn as a not-quite-trustworthy user making submissions, and the
thresholds are a way to limit the effects of poor judgement. :)
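
To make the threshold idea concrete, here is a rough Python model of the
autolearn decision. It is a simplified sketch, not SpamAssassin's actual
code: the real plugin applies extra conditions that are not modeled here.
The settings it mirrors are bayes_auto_learn_threshold_nonspam and
bayes_auto_learn_threshold_spam (defaults assumed to be 0.1 and 12.0;
check your version's docs). The function name and the example rule are
made up for illustration.

```python
# Simplified model of the autolearn decision; not SpamAssassin's real code.
# Assumption: defaults are 0.1 (learn as ham) and 12.0 (learn as spam).

def autolearn_decision(rule_scores, ham_threshold=0.1, spam_threshold=12.0):
    """Return 'ham', 'spam' or 'no' for a dict of {rule_name: score}."""
    # Bayes rules are left out so the classifier does not feed on its own output.
    non_bayes = sum(score for name, score in rule_scores.items()
                    if not name.startswith("BAYES_"))
    if non_bayes < ham_threshold:
        return "ham"
    if non_bayes >= spam_threshold:
        return "spam"
    return "no"

# Hypothetical spam that hit only one weak negative rule besides BAYES_00.
hits = {"BAYES_00": -1.9, "HYPOTHETICAL_WEAK_RULE": -0.01}
print(autolearn_decision(hits))                      # default ham threshold
print(autolearn_decision(hits, ham_threshold=-1.0))  # stricter ham threshold
```

With the default threshold the sample is learned as ham; with the stricter
one it is not learned at all, which is the effect of reducing the
learn-as-ham threshold.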
I note that they tend to come from different mail servers each time - the
URLs in the body tend to be unique, too.
Have you considered greylisting to give domains a chance to be added to
URIBLs before you see them?
> * The false positives all match BAYES_00 - attracting a default score of
> -1.9. BAYES_00 seems to be at the crux of the misclassification.
>
> Is there a way to delve into why these messages have been allocated such
> a low bayes score - while (to a human) appearing blatant, simple, spam
> on "vanilla" spam topics? Has my bayes data been "poisoned" somehow?
Poisoning is less likely than mistraining.
How large is your userbase and mail volume?
One user - me - several email addresses. 10,000 mails per month - several
mailing lists where I read only a tiny fraction of the posts.
Heh. For once it's someone pretty much like me. :)
~ 1,500 spams (that survive mail server RBLs). Autolearn is on - I don't
think about it, it is automatic. :)
How do you train your Bayes? Autolearn? General user submissions? Trusted
user submissions? Only you, from only your personal mail?
Only my personal mailbox *really* matters to me. I train from it using the
dovecot antispam plugin... which feeds mail I shift to/from a spam folder
through a pipe involving "spamc -C".
And I assume there's a similar ham folder? You need both.
Do you keep base training corpora so you can wipe and retrain if it goes
off the rails for some reason?
(In principle) I've got multi-gigabyte-scale spam/ham corpora. I'm yet to
[ever] do anything with it. :)
I have base bayes corpora of a few thousand messages each spam and ham,
kept in aged corpora files. I add a handful to that every month, mostly on
the spam side. SA is trained nightly from the current corpora files and I
can retrain from scratch from all of them if needed, but I haven't
needed to do that yet.
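
A minimal sketch of that kind of nightly job, in Python, might look like
the following. The sa-learn options used (--spam, --ham, --mbox, --clear)
are real, but the corpus file names and the helper functions are
hypothetical, and this is not my actual script.

```python
import subprocess

def build_training_commands(spam_mboxes, ham_mboxes, wipe=False):
    """Return the sa-learn invocations as argument lists, without running them."""
    commands = []
    if wipe:
        # --clear wipes the existing Bayes database before retraining.
        commands.append(["sa-learn", "--clear"])
    for path in spam_mboxes:
        commands.append(["sa-learn", "--spam", "--mbox", path])
    for path in ham_mboxes:
        commands.append(["sa-learn", "--ham", "--mbox", path])
    return commands

def run_training(commands):
    """Run each sa-learn call in order, stopping on the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)

# Hypothetical corpus files; print the plan rather than executing it.
plan = build_training_commands(["corpus/spam-2016.mbox"],
                               ["corpus/ham-2016.mbox"], wipe=True)
for argv in plan:
    print(" ".join(argv))
```

Passing wipe=True corresponds to the "wipe and retrain from scratch" case;
the nightly run would leave it off.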
If all the FNs are getting BAYES_00, make sure you're (re)training them as
spam.
I believe I'm doing that - but it isn't easy to prove that the training
'worked'.
If you look at the output from the training you'll be able to see how many
"new" messages it learned from.
It will have an effect, in that it will remove a specific mistraining, but
in the meantime autolearn may be making bad decisions about other
messages.
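
If you want to check the "new messages learned" count from a script, a
small Python sketch like this would do. It assumes the sa-learn summary
line has the form "Learned tokens from N message(s) (M message(s)
examined)", which is what I have seen it print; verify against your
version. The function name is made up.

```python
import re

# Assumed format of the sa-learn summary line; adjust if your version differs.
SUMMARY = re.compile(
    r"Learned tokens from (\d+) message\(s\) \((\d+) message\(s\) examined\)"
)

def parse_sa_learn_summary(output):
    """Return (learned, examined) from sa-learn output, or None if not found."""
    match = SUMMARY.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

sample = "Learned tokens from 3 message(s) (5 message(s) examined)"
learned, examined = parse_sa_learn_summary(sample)
print(f"{learned} newly learned, {examined - learned} already known or skipped")
```

A count of zero learned for a message you just moved to the spam folder
usually means it had already been learned with that classification.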
Review how you're training. If your users aren't really trustworthy you
should be manually reviewing submissions.
When spam arrives in my primary inbox, I hand classify - I'm less obsessive
about mailing lists. Dovecot initiates training automatically when I shift
messages to a special spam folder.
OK, good. If you had a userbase, their judgement (or lack thereof) could
be an issue.
I feel autolearn can be problematic, particularly if things are already
going off the rails.
I expect Autolearn (assisted by Razor, Pyzor and DCC) has done the vast
majority of my training. This year, I've hand-trained 216 false-negatives
and 0 false positives.
For the size of your install, I'd recommend turning off autolearn and going
with purely hand-collected corpora. It serves me well.
If you have base training corpora, review them for misclassifications
(FNs), then wipe and retrain.
I guess I could do that... My expectation is that - if I train with the
corpora I can pick easily (without changing configuration) I'll get the same
bayes database I currently have... which will give the same scores.
No, autolearning would no longer be affecting the results, and if you *do*
get the same FNs, you can then go through your ham corpora and look for
other possible causes (misclassified messages, or a ham that's something
like part of a discussion about spam so it's confusing and shouldn't be
in the corpora at all).
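
As a starting point for that kind of review, a Python sketch along these
lines could list ham messages worth a manual look. The keyword list, the
corpus path, and the function name are all hypothetical, and it only checks
the Subject header, so treat it as a rough filter for a human to follow up
on, not a classifier.

```python
import mailbox
from email import message_from_string

def flag_suspect_ham(messages, keywords):
    """Yield (subject, matched_keywords) for ham that deserves a manual look."""
    for msg in messages:
        subject = str(msg.get("Subject", ""))
        lowered = subject.lower()
        matched = [word for word in keywords if word in lowered]
        if matched:
            yield subject, matched

# Hypothetical keywords that suggest spam, or a discussion about spam.
KEYWORDS = ["viagra", "spam", "lottery"]

# Small in-memory demo; for a real corpus pass mailbox.mbox("corpus/ham.mbox").
demo = [
    message_from_string("Subject: Re: why is this spam scoring BAYES_00?\n\nbody"),
    message_from_string("Subject: Meeting notes\n\nbody"),
]
for subject, matched in flag_suspect_ham(demo, KEYWORDS):
    print(subject, matched)
```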
Really, I'd like to understand why my current bayes database makes the
classifications it does.
Basically, because of what's been trained into it as ham.
If you autolearn, you can't really review that after the fact.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79