On Wed, 15 Aug 2012, Ben Johnson wrote:

On 8/15/2012 2:24 PM, John Hardin wrote:
On Wed, 15 Aug 2012, Ben Johnson wrote:

Some 99% of the spam that I receive, which is grossly spammy (we're
talking auto loans, cash advances, dink pills, the whole lot), contains
"BAYES_00=-1.9" in the tests portion of the X-Spam-Status header.

Might anyone know why?

Poor training.

John, I can't thank you enough for the thoroughness of your response.

I like to show off. :)

Apart from the Bayes score, what kind of scores are those spams getting?

Here are a few examples (the first two of which are two of VERY few in
which the BAYES_* value is over 00):

-----------------
No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no

No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793,
SPF_PASS=-0.001] autolearn=no

No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no

No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7,
URIBL_RHS_DOB=1.514] autolearn=no
-----------------
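
To see which rules pull a score down the most, the tests list in a status line like the ones above can be broken apart and sorted. The following is only an illustrative sketch: the function name parse_spam_status is made up for this example, and the header text is the fourth sample quoted above.

```python
import re

def parse_spam_status(header):
    """Return (reported_score, {rule: score}) from an X-Spam-Status-style string.

    Hypothetical helper for illustration; not part of SpamAssassin or Amavis.
    """
    text = " ".join(header.split())  # undo header line folding
    score = float(re.search(r"score=(-?\d+(?:\.\d+)?)", text).group(1))
    tests_part = re.search(r"tests=\[(.*?)\]", text).group(1)
    tests = {}
    for item in tests_part.split(","):
        name, _, value = item.strip().partition("=")
        tests[name] = float(value)
    return score, tests

# Fourth sample from the thread, copied as posted.
header = """No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7,
URIBL_RHS_DOB=1.514] autolearn=no"""

score, tests = parse_spam_status(header)

# List rules from most negative to most positive contribution.
for name, value in sorted(tests.items(), key=lambda kv: kv[1]):
    print(f"{value:+7.3f}  {name}")
print(f"sum of rule scores: {sum(tests.values()):.3f} (reported: {score})")
```

For this sample the rule scores add up to the reported 1.256, and RCVD_IN_DNSWL_MED and BAYES_00 together remove 4.2 points from a message that the URIBL and relay rules would otherwise push well above 5.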

It might be interesting to see some log entries where autolearn=yes...

It bears mention that the RCVD_IN_DNSWL_MED test is having even more of a negative impact (pardon the pun) than BAYES_*. I am already working with the dnswl.org folks (off-list, for privacy reasons) to get to the bottom of that issue.

This might be a major contributing factor. If your system was taught from scratch by autolearn, and DNSWL (which is fairly well trusted) has been pushing a lot of spams to low scores...

You might want to set:
        bayes_auto_learn_threshold_nonspam -3

That won't _fix_ the problem (at least not quickly) or avoid the need to wipe and retrain, but it might keep things from getting worse.

See perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold for more info.
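
Why a lower nonspam threshold helps can be sketched roughly as follows. This is a deliberately simplified model, not the plugin's actual code. It assumes the autolearn score is just the sum of the non-BAYES rule scores, and it ignores the extra conditions the real plugin applies (a different score set, excluded rule types, minimum header and body points before learning as spam). The function name and the sample message are hypothetical; the defaults of 0.1 and 12.0 are the documented ones.

```python
def autolearn_decision(rule_scores, nonspam_threshold=0.1, spam_threshold=12.0):
    """Very rough model of the autolearn decision (illustrative only).

    Assumption: the learning score is the sum of all rule scores except
    BAYES_*; the real plugin applies further rules that are omitted here.
    """
    learn_score = sum(
        score for name, score in rule_scores.items()
        if not name.startswith("BAYES_")
    )
    if learn_score < nonspam_threshold:
        return "ham"
    if learn_score >= spam_threshold:
        return "spam"
    return "no"

# Hypothetical spam that hit a DNSWL entry and little else
# (scores borrowed from the samples in this thread).
message = {
    "BAYES_00": -1.9,
    "HTML_MESSAGE": 0.001,
    "RCVD_IN_DNSWL_MED": -2.3,
    "RDNS_NONE": 0.793,
    "SPF_PASS": -0.001,
}

print(autolearn_decision(message))                        # default threshold 0.1
print(autolearn_decision(message, nonspam_threshold=-3))  # suggested setting
```

Under these assumptions the message's learning score is about -1.5, so the default threshold would feed it to Bayes as ham, while a threshold of -3 would leave it unlearned.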

Most of the list is probably laughing, but given the complexity of
SpamAssassin, this crucial requirement was lost on me, amidst the sea of
information and instructions. For example,
http://wiki.apache.org/spamassassin/StartUsing makes no mention of the
fact that SA is essentially useless without Bayesian training.

That's because that shouldn't be the case. The base ruleset + URIBL should be very effective pretty much out-of-the-box.

What version of SA is this?

# spamassassin --version
SpamAssassin version 3.3.1
 running on Perl version 5.10.1

A little stale, but not bad.

You may also want to set up some mechanism for users to submit
misclassified messages for training. Depending on how much you trust
their judgement the learning from these can be automatic or can go
through you as a reviewer.

That sounds like a good idea. Is there a particular HOW TO or tutorial
that you recommend? If it depends on the environment/configuration, this
server runs Ubuntu 10.04 with Dovecot, Amavis, Sieve, and SpamAssassin.

I'm not sure; I don't lurk on the Wiki much. About the best I can suggest is to search the SA users mailing list archives for "training dovecot".
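
One common shape for such a mechanism is a pair of per-user IMAP folders that a periodic job feeds to sa-learn. The sketch below only builds and prints the commands. The folder paths and the "TrainSpam"/"TrainHam" names are hypothetical, and it assumes the job runs as whichever account owns the Bayes database (with Amavis that is usually the amavis user, but check the local setup). The only sa-learn options used are --spam and --ham.

```python
import subprocess

def sa_learn_command(kind, folder):
    """Build an sa-learn command line for a folder of messages.

    kind must be "spam" or "ham"; folder is a path to a directory of
    messages (for example a Maildir "cur" directory).
    """
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", f"--{kind}", str(folder)]

def train(kind, folder, dry_run=True):
    """Print the command (dry run) or execute it via subprocess."""
    cmd = sa_learn_command(kind, folder)
    if dry_run:
        print(" ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd

# Hypothetical Maildir layout; adjust to the real mail store.
train("spam", "/var/vmail/example.com/user/.TrainSpam/cur")
train("ham", "/var/vmail/example.com/user/.TrainHam/cur")
```

Leaving dry_run at True lets a reviewer inspect what would be learned first, which fits the option of routing user submissions through an administrator before they reach Bayes.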

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  The ["assault weapons"] ban is the moral equivalent of banning red
  cars because they look too fast.  -- Steve Chapman, Chicago Tribune
-----------------------------------------------------------------------
 Today: the 67th anniversary of the end of World War II
