On Wed, September 27, 2006 11:10 am, Jim Maul said:
> Daniel T. Staal wrote:
>> On Wed, September 27, 2006 10:43 am, Matt Kettler said:
>>> Mike Woods wrote:
>>>> Hi guys, bit of a query regarding sa-learn and messages that have
>>>> already been tagged as spam.
>>>>
>>>> We have spamassassin scanning mail via amavisd and sending any caught
>>>> spams to a spam folder in the users accounts (using plus addressing),
>>>> we've also been getting users to drop any missed spams into this spam
>>>> folder so we can train spamassassin on them, at present I have a
>>>> script that moves *only* the missed spams to a master folder for
>>>> sa-learn, my question is simple, would there be any benefit in
>>>> including the mails identified as spam in this process, I know
>>>> sa-learn looks for common patterns in spams to identify them as spam
>>>> but im unsure if adding known spams in would be beneficial in this ?
>>> YES. There is DEFINITELY a benefit to learning messages tagged as spam.
>>> Even if they got BAYES_99.
>>>
>>> Why? because spam mutates over time, and even if a spam got bayes_99,
>>> it may still have new variants of "hot" words in it that will help it
>>> keep hitting the same kind of spam as it changes. If you wait till this
>>> kind of message mutates enough to no longer be bayes_99, you've put
>>> yourself behind the curve, and now you have to catch up to the new
>>> variant.
>>
>> While I in general agree with this, I was under the impression that
>> spamassassin will auto-learn from messages it marks. (At least, past a
>> certain threshold.) In which case, feeding the spam messages to it
>> again would bias the database towards spam, as the messages are being
>> learned twice.
>
> I believe that SA will not learn a message it has seen before so
> multiple sa-learn's will not have any affect.
Actually, that was my impression too. Which means, for the original
question, that re-learning the already caught spams will have very little
effect other than wasting some processor cycles. Doing what he is doing
right now is probably best.

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author. Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use the
contents for non-commercial purposes. This copyright will expire
5 years after the author's death, or in 30 years, whichever is
longer, unless such a period is in excess of local copyright law.
---------------------------------------------------------------
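
To make the "already seen" point above concrete, here is a toy sketch of the idea that a learner which remembers what it has been fed will ignore a repeat. This is not SpamAssassin's code (SA is written in Perl and keeps its own record of learned messages); the class name ToyBayesStore, the SHA-1 fingerprint, and the whitespace tokenizer are all assumptions made up for illustration.

```python
import hashlib
from collections import Counter


class ToyBayesStore:
    """Hypothetical stand-in for a Bayes token store with a 'seen' record."""

    def __init__(self):
        self.spam_tokens = Counter()  # token -> number of spam messages containing it
        self.seen = set()             # fingerprints of messages already learned

    def learn_spam(self, message: str) -> bool:
        """Learn a message as spam; return False if it was already learned."""
        # Assumption: a message is identified by a hash of its full text.
        digest = hashlib.sha1(message.encode("utf-8")).hexdigest()
        if digest in self.seen:
            # Second submission: skip it, so token counts are not inflated.
            return False
        self.seen.add(digest)
        # Count each distinct token once per message.
        self.spam_tokens.update(set(message.lower().split()))
        return True


store = ToyBayesStore()
first = store.learn_spam("Cheap pills cheap")   # learned
second = store.learn_spam("Cheap pills cheap")  # skipped as a repeat
```

Under these assumptions the second call changes nothing in the token counts, which is the behaviour described above: the repeat costs a little work and does not bias the database.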