Mike Woods wrote:
> Hi guys, bit of a query regarding sa-learn and messages that have
> already been tagged as spam.
>
> We have spamassassin scanning mail via amavisd and sending any caught
> spams to a spam folder in the users accounts (using plus addressing),
> we've also been getting users to drop any missed spams into this spam
> folder so we can train spamassassin on them, at present I have a
> script that moves *only* the missed spams to a master folder for
> sa-learn, my question is simple, would there be any benefit in
> including the mails identified as spam in this process, I know
> sa-learn looks for common patterns in spams to identify them as spam
> but im unsure if adding known spams in would be beneficial in this ?

YES. There is DEFINITELY a benefit to learning messages tagged as spam.
Even if they got BAYES_99.

Why? Because spam mutates over time. Even a spam that scored BAYES_99
may contain new variants of "hot" words, and learning those helps bayes
keep hitting the same kind of spam as it changes. If you wait until this
kind of message has mutated enough to no longer hit BAYES_99, you've put
yourself behind the curve, and now you have to catch up to the new variant.

In general: DO NOT intentionally try to bias the training of your bayes
database for any reason. That's just self-inflicted bayes poison. If
it's spam, train it as spam. Do not hold back because of "ham-like"
content. Do not hold back because it was already tagged. If it's spam,
train it as such. The same goes for nonspam training: don't hold back
from training on any email that you don't want tagged, even if it
contains "spam words".
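
As a rough illustration of the advice above, here is a minimal Python
sketch that feeds both the already-tagged and the missed spam to
sa-learn, plus a ham folder. The folder paths and the helper names are
hypothetical and not from the original setup; only the sa-learn --spam
and --ham options are real. It defaults to a dry run so nothing is
trained by accident.

```python
import subprocess

# Hypothetical folder layout; adjust to wherever your script collects mail.
TRAINING_FOLDERS = [
    ("spam", "/var/spool/sa-train/tagged-spam"),   # caught by SpamAssassin
    ("spam", "/var/spool/sa-train/missed-spam"),   # dropped in by users
    ("ham", "/var/spool/sa-train/ham"),            # known-good mail
]


def sa_learn_command(kind, folder):
    """Build the sa-learn argument list for one folder of messages."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind, str(folder)]


def train(folders, dry_run=True):
    """Return the commands for every folder; run them unless dry_run."""
    commands = [sa_learn_command(kind, folder) for kind, folder in folders]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    for cmd in train(TRAINING_FOLDERS, dry_run=True):
        print(" ".join(cmd))
```

The point of the sketch is only that tagged spam and missed spam go
through the same --spam training, with no filtering in between.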

SpamAssassin's bayes system will handle the gray cases just fine. It
does particularly well here because it combines the per-token
probabilities with chi-squared combining rather than simple averaging.
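
To make the difference concrete, here is a small Python sketch of
Fisher-style chi-squared combining in the form Gary Robinson described,
next to a plain average. It is an illustration under assumptions, not
SpamAssassin's actual code: the function names and the token
probabilities are made up, and the real implementation may differ in
details such as token selection and clamping.

```python
import math


def chi2q(x, degrees):
    """Survival function of chi-squared for an even number of degrees."""
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, degrees // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_squared_combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    # Clamp so that log() never sees exactly 0 or 1.
    probs = [min(max(p, 1e-9), 1.0 - 1e-9) for p in probs]
    n = len(probs)
    # Evidence for spam: how unlikely the (1 - p) values are by chance.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Evidence for ham: how unlikely the p values are by chance.
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0


def simple_average(probs):
    """Naive alternative: the arithmetic mean of the token probabilities."""
    return sum(probs) / len(probs)


if __name__ == "__main__":
    # Hypothetical spam with a couple of "ham-like" tokens mixed in.
    tokens = [0.99, 0.99, 0.95, 0.40, 0.30]
    print("average    :", simple_average(tokens))
    print("chi-squared:", chi_squared_combine(tokens))
```

With a few strong spam tokens and a couple of ham-like ones, the
average is dragged toward the middle while the chi-squared score stays
close to 1. A message whose tokens are all near 0.5 still lands near
0.5, which is why honest training of gray messages does not hurt.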
