On Wed, 15 Aug 2012, Ben Johnson wrote:
> Some 99% of the spam that I receive, which is grossly spammy (we're
> talking auto loans, cash advances, dink pills, the whole lot) contains
> "BAYES_00=-1.9" in the tests portion of the X-Spam-Status header.
> Might anyone know why?

Poor training.

Apart from the Bayes score, what kind of scores are those spams
getting?

> While I have not trained the Bayesian filter manually to date,

Is there any provision for any manual training in your environment? Have
you set up training folders where your users can submit messages for
training? Do you run sa-learn at all?

> how is it that the spammiest of the spam is being classified with
> BAYES_00 (thereby receiving the score -1.9)? Doesn't BAYES_00 imply that
> the message is almost certainly not spam?

BAYES_00 implies that the message in question looks very similar to
messages the Bayes system has been told are not spam. It depends solely on
how it has been trained.
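
To see how much BAYES_00 is pulling the total down relative to the other
rules that fired, the tests list in the header can be split out per rule.
This is only a rough sketch: it assumes the "NAME=score" form quoted above,
and the function name and the sample header are made up for illustration.

```python
import re

def parse_spam_status(header_value):
    """Return {rule_name: score} from the tests= part of an X-Spam-Status value."""
    # Unfold the header first: continuation lines can split the tests list.
    unfolded = re.sub(r"\s*\n\s+", "", header_value)
    match = re.search(r"tests=(\S*)", unfolded)
    if not match:
        return {}
    tests = {}
    for item in match.group(1).split(","):
        if not item:
            continue
        name, _, score = item.partition("=")
        # Some setups list rule names only, with no per-rule score.
        tests[name] = float(score) if score else None
    return tests

# Illustrative header value only, not taken from a real message.
sample = ("No, score=-0.2 required=5.0 tests=BAYES_00=-1.9,\n"
          "\tHTML_MESSAGE=0.001,URIBL_BLACK=1.7 autolearn=no")
for rule, score in sorted(parse_spam_status(sample).items()):
    print(rule, score)
```

If the other rules add up to a few points and BAYES_00 is what keeps the
total under the threshold, that points at Bayes training rather than at the
rest of the ruleset.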
I wasn't aware that autolearning could do a cold start of Bayes. Can
anyone confirm whether this is the case?
If it can't, then someone somewhere trained Bayes up to the default minimum
of 200 hams and 200 spams needed for it to start classifying.
Before we offer suggestions, some more data from you please:
What version of SA is this?
What does "sa-learn --dump magic" report about your current Bayes
database?
What are all of the bayes_* configuration options in your local config?
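
For the "sa-learn --dump magic" question, the lines of interest are the
nspam and nham counts. A small sketch for pulling them out and comparing
them against the 200/200 minimum mentioned above follows; the sample text
is illustrative only (not from a real database), and the helper names are
hypothetical.

```python
def bayes_counts(dump_text):
    """Extract nspam/nham/ntokens from 'sa-learn --dump magic' output."""
    counts = {}
    for line in dump_text.splitlines():
        if "non-token data:" not in line:
            continue
        fields, _, label = line.partition("non-token data:")
        label = label.strip()
        if label in ("nspam", "nham", "ntokens"):
            # The value sits in the third whitespace-separated column.
            counts[label] = int(fields.split()[2])
    return counts

def bayes_is_active(counts, minimum=200):
    """True once both counts reach the minimum (200 is the default cited above)."""
    return (counts.get("nspam", 0) >= minimum
            and counts.get("nham", 0) >= minimum)

# Illustrative output only; real numbers will differ.
sample_dump = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0        187          0  non-token data: nspam
0.000          0       5123          0  non-token data: nham
0.000          0     140212          0  non-token data: ntokens
"""
counts = bayes_counts(sample_dump)
print(counts, "active:", bayes_is_active(counts))
```

A lopsided pair of counts (lots of ham, little spam) is itself a hint
that most of what has been learned so far went in as ham.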
What will probably end up happening is this:
(1) wipe your Bayes database
(2) turn off autolearn
(3) collect several hundred hams and spams for an initial training corpus
(4) train using that corpus
(5) evaluate results
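
Steps (1), (4) and (5) above map onto sa-learn invocations roughly as
sketched here. The corpus directory names are placeholders, the wrapper
functions are hypothetical, and step (2) is a config change
(bayes_auto_learn 0 in your local config) rather than a command. The
sketch defaults to a dry run, so nothing is wiped unless you ask for it.

```python
import subprocess

def training_plan(ham_dir, spam_dir):
    """Return the sa-learn command lines for a wipe-and-retrain cycle."""
    return [
        ["sa-learn", "--clear"],           # (1) wipe the Bayes database
        # (2) set "bayes_auto_learn 0" in the local config before training
        ["sa-learn", "--ham", ham_dir],    # (4) train on the ham corpus
        ["sa-learn", "--spam", spam_dir],  # (4) train on the spam corpus
        ["sa-learn", "--dump", "magic"],   # (5) check the resulting counts
    ]

def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in plan:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)

# Placeholder paths; point these at wherever the corpus from (3) lives.
plan = training_plan("corpus/ham", "corpus/spam")
run_plan(plan)  # dry run: prints the commands without running them
```

Run it as the same user whose Bayes database SA actually uses at scan
time, otherwise you end up training a different database.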
Depending on your mail volume, once Bayes is working well after manual
training, you may then want to reenable autolearn; I personally suggest it
only where the volume is high enough and/or the character of mail is
varied enough to prohibit manual training. You might also want to adjust
the autolearn thresholds.
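
As a rough picture of what the autolearn thresholds do: a message is
learned as ham below one score and as spam at or above another, and
anything in between is left alone. The sketch below is deliberately
simplified. It assumes the stock defaults of 0.1 and 12.0 for
bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam,
and it ignores the extra conditions SA applies (for example, the score
used for this decision is not the same as the final message score).

```python
def autolearn_decision(score, nonspam_threshold=0.1, spam_threshold=12.0):
    """Simplified view of the autolearn cutoffs; not SA's full logic."""
    if score < nonspam_threshold:
        return "ham"
    if score >= spam_threshold:
        return "spam"
    return "no"

# A stricter ham cutoff means fewer borderline messages get learned as ham.
for score in (-1.0, 0.05, 5.0, 15.0):
    print(score,
          autolearn_decision(score),
          autolearn_decision(score, nonspam_threshold=-0.5))
```

Lowering the nonspam threshold is the usual direction to go if spam with
a near-zero score has been getting autolearned as ham.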
You may also want to set up some mechanism for users to submit
misclassified messages for training. Depending on how much you trust their
judgement, the learning from these can be automatic or can go through you
as a reviewer.
Recommendation: keep your manual training corpus around in case you need
to do the above again for some reason.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79