Re: Very spammy messages yield BAYES_00 (-1.9)

John Hardin Wed, 15 Aug 2012 11:25:14 -0700

On Wed, 15 Aug 2012, Ben Johnson wrote:

Some 99% of the spam that I receive, which is grossly spammy (we're
talking auto loans, cash advances, dink pills, the whole lot) contains
"BAYES_00=-1.9" in the tests portion of the X-Spam-Status header.


Might anyone know why?


Poor training.

Apart from the Bayes score, what kind of scores are those spams getting?

While I have not trained the Bayesian filter manually to date,

Is there any provision for manual training in your environment? Have you set up training folders where your users can submit messages for training? Do you run sa-learn at all?

how is it that the spammiest of the spam is being classified with BAYES_00 (thereby receiving the score -1.9)? Doesn't BAYES_00 imply that the message is almost certainly not spam?

BAYES_00 implies that the message in question looks very similar to messages the Bayes system has been told are not spam. It depends solely on how it has been trained.

I wasn't aware that autolearning could do a cold start of Bayes. Can anyone confirm whether this is the case?

If it can't, then someone somewhere trained Bayes up to the default minimum of 200 hams and 200 spams needed for it to start classifying.


Before we offer suggestions, some more data from you please:

What version of SA is this?

What does "sa-learn --dump magic" report about your current Bayes database?
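
For anyone reading along, here is a rough Python sketch of what to look for in that report: the nspam and nham counters, compared against the 200/200 default minimum mentioned above. The sample text only imitates the shape of the dump (the numbers are made up), and the helper names parse_magic and bayes_is_active are hypothetical, not part of SpamAssassin.

```python
# Illustrative text only: shaped like `sa-learn --dump magic` output,
# but the numbers are invented for this example.
SAMPLE_DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0         57          0  non-token data: nspam
0.000          0       1432          0  non-token data: nham
0.000          0      98211          0  non-token data: ntokens
"""

PREFIX = "non-token data: "
MIN_LEARNED = 200  # default minimum of each class, per the message above


def parse_magic(dump_text):
    """Return {counter name: int value} for each 'non-token data' line."""
    counts = {}
    for line in dump_text.splitlines():
        # Four whitespace-separated columns, then the free-text label.
        fields = line.split(None, 4)
        if len(fields) == 5 and fields[4].startswith(PREFIX):
            name = fields[4][len(PREFIX):].strip()
            # Assumption: the value of interest sits in the third column.
            if fields[2].isdigit():
                counts[name] = int(fields[2])
    return counts


def bayes_is_active(counts, minimum=MIN_LEARNED):
    """True once both spam and ham counts have reached the minimum."""
    return (counts.get("nspam", 0) >= minimum
            and counts.get("nham", 0) >= minimum)


counts = parse_magic(SAMPLE_DUMP)
print(counts["nspam"], counts["nham"], bayes_is_active(counts))
```

A lopsided pair like the one in the sample (very few spams, many hams) would be one hint that the database has learned mostly ham.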


What are all of the bayes_* configuration options in your local config?


What will probably end up happening is this:
(1) wipe your Bayes database
(2) turn off autolearn
(3) collect several hundred hams and spams for an initial training corpus
(4) train using that corpus
(5) evaluate results
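
The sa-learn side of those steps could be sketched roughly as follows. This is a dry run that only prints the commands; the corpus directory names are placeholders and retrain_plan is a hypothetical helper. Step (2) is a configuration change (bayes_auto_learn 0 in your local config) and step (3) is manual collection, so neither appears as a command.

```python
import shlex


def retrain_plan(ham_dir, spam_dir):
    """Return the sa-learn invocations, in order, as argv lists."""
    return [
        ["sa-learn", "--clear"],           # (1) wipe the Bayes database
        # (2) set "bayes_auto_learn 0" in the local config (not a command)
        # (3) collect the corpus into ham_dir and spam_dir by hand
        ["sa-learn", "--ham", ham_dir],    # (4) train on known ham
        ["sa-learn", "--spam", spam_dir],  # (4) train on known spam
        ["sa-learn", "--dump", "magic"],   # (5) check the learned counts
    ]


if __name__ == "__main__":
    # Placeholder paths; substitute wherever your corpus actually lives.
    for argv in retrain_plan("corpus/ham", "corpus/spam"):
        print(shlex.join(argv))
```

Run the real commands as whichever user owns the Bayes database your scanner reads, otherwise you may end up training a different database.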

Depending on your mail volume, once Bayes is working well after manual training, you may then want to reenable autolearn; I personally suggest it only where the volume is high enough and/or the character of mail is varied enough to make manual training impractical. You might also want to adjust the autolearn thresholds.

You may also want to set up some mechanism for users to submit misclassified messages for training. Depending on how much you trust their judgement, the learning from these can be automatic or can go through you as a reviewer.

Recommendation: keep your manual training corpus around in case you need to do the above again for some reason.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Judicial Activism (n): interpreting the Constitution to grant the
  government powers that are popularly felt to be "needed" but that
  are not explicitly provided for therein (common definition);
  interpreting the Constitution as it is written (Brady definition)
-----------------------------------------------------------------------
 Today: the 67th anniversary of the end of World War II
