On Wed, 15 Aug 2012, Ben Johnson wrote:

Some 99% of the spam that I receive, which is grossly spammy (we're
talking auto loans, cash advances, dink pills, the whole lot) contains
"BAYES_00=-1.9" in the tests portion of the X-Spam-Status header.

Might anyone know why?

Poor training.

Apart from the Bayes score, what kind of scores are those spams getting?

While I have not trained the Bayesian filter manually to date,

Is there any provision for manual training in your environment? Have you set up training folders where your users can submit messages for training? Do you run sa-learn at all?

how is it that the spammiest of the spam is being classified with BAYES_00 (thereby receiving the score -1.9)? Doesn't BAYES_00 imply that the message is almost certainly not spam?

BAYES_00 implies that the message in question looks very similar to messages the Bayes system has been told are not spam. It depends solely on how it has been trained.

I wasn't aware that autolearning could do a cold start of Bayes. Can anyone confirm whether this is the case?

If it can't, then someone somewhere trained Bayes up to the default minimum of 200 hams and 200 spams needed for it to start classifying.

Before we offer suggestions, some more data from you please:

What version of SA is this?

What does "sa-learn --dump magic" report about your current Bayes database?
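
For reference, here is a rough sketch of how that output could be checked against the 200-message minimum mentioned above. The sample text and its counts are invented for illustration, and parse_magic and bayes_is_active are hypothetical helper names, not part of SA; the column layout is assumed to match what "sa-learn --dump magic" normally prints.

```python
# Parse the text printed by `sa-learn --dump magic` and report whether Bayes
# has learned enough messages of each class to start scoring.

# Invented sample; paste your own output here instead.
SAMPLE_OUTPUT = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0        412          0  non-token data: nspam
0.000          0      18734          0  non-token data: nham
0.000          0     152310          0  non-token data: ntokens
"""

MIN_MESSAGES = 200  # default minimum per class, as discussed above


def parse_magic(text):
    """Return a dict mapping each 'non-token data' label to its integer value."""
    values = {}
    for line in text.splitlines():
        head, sep, label = line.partition("non-token data:")
        fields = head.split()
        if not sep or len(fields) < 3:
            continue  # skip blank or unrecognized lines
        values[label.strip()] = int(fields[2])  # the value is the third column
    return values


def bayes_is_active(values, minimum=MIN_MESSAGES):
    """True when both the ham and spam counts reach the minimum."""
    return values.get("nham", 0) >= minimum and values.get("nspam", 0) >= minimum


if __name__ == "__main__":
    stats = parse_magic(SAMPLE_OUTPUT)
    print("spam learned:", stats["nspam"])
    print("ham learned:", stats["nham"])
    print("Bayes active:", bayes_is_active(stats))
```

A ham count that dwarfs the spam count, on a system where obvious spam hits BAYES_00, might suggest that spam has been learned as ham.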

What are all of the bayes_* configuration options in your local config?


What will probably end up happening is this:
(1) wipe your Bayes database
(2) turn off autolearn
(3) collect several hundred hams and spams for an initial training corpus
(4) train using that corpus
(5) evaluate results
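
The command-line side of the steps above can be sketched roughly as follows. This only builds and prints the sa-learn invocations; it does not run them. The corpus paths are placeholders and retrain_plan is a hypothetical helper name. Step (2) is a config change rather than a command (bayes_auto_learn 0 in your local config), and step (3) is manual collection.

```python
import shlex


def retrain_plan(ham_dir, spam_dir):
    """Return the sa-learn command lines for a wipe-and-retrain, in order."""
    return [
        ["sa-learn", "--clear"],          # step 1: wipe the Bayes database
        ["sa-learn", "--ham", ham_dir],   # step 4: learn the ham corpus
        ["sa-learn", "--spam", spam_dir], # step 4: learn the spam corpus
        ["sa-learn", "--dump", "magic"],  # step 5: check the resulting counts
    ]


if __name__ == "__main__":
    # Placeholder paths; substitute wherever you keep your corpus.
    for argv in retrain_plan("/path/to/corpus/ham", "/path/to/corpus/spam"):
        print(shlex.join(argv))
```

If you run the printed commands, run them as whichever user owns the Bayes database that your scanner actually reads, otherwise you may end up training a different database.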

Depending on your mail volume, once Bayes is working well after manual training, you may then want to reenable autolearn; I personally suggest it only where the volume is high enough and/or the character of mail is varied enough to prohibit manual training. You might also want to adjust the autolearn thresholds.
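
As a rough illustration of where those thresholds live, a local config fragment might look like the one below. The values shown are what I believe the stock defaults to be; treat them as a starting point to check against your version's documentation, not as a recommendation.

```
# local.cf fragment (illustrative; verify against your SA version)
bayes_auto_learn 1

# Messages scoring below this are autolearned as ham.
bayes_auto_learn_threshold_nonspam 0.1

# Messages scoring above this are autolearned as spam.
bayes_auto_learn_threshold_spam 12.0
```

Lowering the nonspam threshold makes it harder for a low-scoring spam to be autolearned as ham.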

You may also want to set up some mechanism for users to submit misclassified messages for training. Depending on how much you trust their judgement, the learning from these can be automatic or can go through you as a reviewer.

Recommendation: keep your manual training corpus around in case you need to do the above again for some reason.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Judicial Activism (n): interpreting the Constitution to grant the
  government powers that are popularly felt to be "needed" but that
  are not explicitly provided for therein (common definition);
  interpreting the Constitution as it is written (Brady definition)
-----------------------------------------------------------------------
 Today: the 67th anniversary of the end of World War II
