On Wed, September 27, 2006 10:43 am, Matt Kettler said:
> Mike Woods wrote:
>> Hi guys, bit of a query regarding sa-learn and messages that have
>> already been tagged as spam.
>>
>> We have spamassassin scanning mail via amavisd and sending any caught
>> spams to a spam folder in the users accounts (using plus addressing),
>> we've also been getting users to drop any missed spams into this spam
>> folder so we can train spamassassin on them, at present I have a
>> script that moves *only* the missed spams to a master folder for
>> sa-learn, my question is simple, would there be any benefit in
>> including the mails identified as spam in this process, I know
>> sa-learn looks for common patterns in spams to identify them as spam
>> but im unsure if adding known spams in would be beneficial in this ?
>
> YES. There is DEFINITELY a benefit to learning messages tagged as spam.
> Even if they got BAYES_99.
>
> Why? because spam mutates over time, and even if a spam got bayes_99, it
> may still have new variants of "hot" words in it that will help it keep
> hitting the same kind of spam as it changes. If you wait till this kind
> of message mutates enough to no longer be bayes_99, you've put yourself
> behind the curve, and now you have to catch up to the new variant.

While I agree with this in general, I was under the impression that
SpamAssassin auto-learns from the messages it marks (at least, those
past a certain score threshold). In that case, feeding the same spam
messages to sa-learn again would bias the database towards spam, since
each message would be learned twice.

So the question would have to be: does SpamAssassin automatically
update the Bayes database from (some/any) messages it flags as spam or
ham?

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author. Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes. This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------
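
P.S. To make the double-counting worry concrete, here is a minimal toy sketch in Python. It is not SpamAssassin's Bayes code: the ToyBayesStore class, the track_seen flag, and the message ID are all made up for illustration. It only shows that learning the same message twice doubles its token counts, unless the store remembers which message IDs it has already learned. Whether SpamAssassin itself does that is exactly the open question above.

```python
from collections import Counter


class ToyBayesStore:
    """Toy token store for illustration; NOT SpamAssassin's implementation."""

    def __init__(self, track_seen):
        self.track_seen = track_seen   # hypothetical switch: remember learned IDs?
        self.seen_ids = set()
        self.spam_tokens = Counter()
        self.spam_messages = 0

    def learn_spam(self, message_id, text):
        """Count the tokens of one spam message; return True if it was learned."""
        if self.track_seen and message_id in self.seen_ids:
            return False  # already learned once, skip to avoid double counting
        self.seen_ids.add(message_id)
        self.spam_tokens.update(text.lower().split())
        self.spam_messages += 1
        return True


# Hypothetical message: an ID plus a body with "cheap" appearing twice.
message = ("<hypothetical-id@example.com>", "cheap pills cheap offer")

naive = ToyBayesStore(track_seen=False)
careful = ToyBayesStore(track_seen=True)
for store in (naive, careful):
    store.learn_spam(*message)  # first pass, e.g. auto-learning at scan time
    store.learn_spam(*message)  # second pass, e.g. manual training later

print(naive.spam_tokens["cheap"], naive.spam_messages)      # 4 2
print(careful.spam_tokens["cheap"], careful.spam_messages)  # 2 1
```

If the store tracks what it has seen, re-feeding already-learned spam is harmless; if it does not, every re-fed message counts twice.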