On Wed, September 27, 2006 11:10 am, Jim Maul said:
> Daniel T. Staal wrote:
>> On Wed, September 27, 2006 10:43 am, Matt Kettler said:
>>> Mike Woods wrote:
>>>> Hi guys, bit of a query regarding sa-learn and messages that have
>>>> already been tagged as spam.
>>>>
>>>> We have spamassassin scanning mail via amavisd and sending any caught
>>>> spams to a spam folder in the users' accounts (using plus addressing).
>>>> We've also been getting users to drop any missed spams into this spam
>>>> folder so we can train spamassassin on them. At present I have a
>>>> script that moves *only* the missed spams to a master folder for
>>>> sa-learn. My question is simple: would there be any benefit in
>>>> including the mails already identified as spam in this process? I know
>>>> sa-learn looks for common patterns in spams to identify them as spam,
>>>> but I'm unsure whether adding known spams would be beneficial.
>>> YES. There is DEFINITELY a benefit to learning messages tagged as spam.
>>> Even if they got BAYES_99.
>>>
>>> Why? Because spam mutates over time, and even if a spam got BAYES_99,
>>> it may still have new variants of "hot" words in it that will help it
>>> keep hitting the same kind of spam as it changes. If you wait till this
>>> kind of message mutates enough to no longer be BAYES_99, you've put
>>> yourself behind the curve, and now you have to catch up to the new
>>> variant.
>>
>> While I in general agree with this, I was under the impression that
>> spamassassin will auto-learn from messages it marks.  (At least, past a
>> certain threshold.)  In which case, feeding the spam messages to it
>> again would bias the database towards spam, as the messages are being
>> learned twice.
>
> I believe that SA will not learn a message it has seen before, so
> multiple sa-learn runs will not have any effect.

Actually, that was my impression too.

Which means, for the original question, that re-learning the already caught
spams will have very little effect other than wasting some processor
cycles.  Doing what he is doing right now is probably best.
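
To illustrate why a repeat learn would be a no-op, here is a toy sketch in
Python.  It assumes, as discussed above, that the learner keeps a record of
messages it has already learned and skips them.  The class, its method
names, and the SHA-1 fingerprint are all made up for the example; this is
not SpamAssassin's actual code or storage format.

```python
import hashlib
from collections import Counter


class ToyBayesStore:
    """Toy token store with a 'seen' index (hypothetical, not SA's code)."""

    def __init__(self):
        self.spam_tokens = Counter()  # token -> number of spams containing it
        self.seen = {}                # message fingerprint -> learned class
        self.nspam = 0                # number of spam messages learned

    def learn_spam(self, message: str) -> bool:
        """Learn a message as spam; return False if it was already learned."""
        msg_id = hashlib.sha1(message.encode("utf-8")).hexdigest()
        if self.seen.get(msg_id) == "spam":
            return False              # already learned as spam: skip it
        self.seen[msg_id] = "spam"
        self.nspam += 1
        # Count each distinct token once per message.
        self.spam_tokens.update(set(message.lower().split()))
        return True


store = ToyBayesStore()
msg = "cheap pills buy now"
first = store.learn_spam(msg)    # learned
second = store.learn_spam(msg)   # skipped, counts unchanged
print(first, second, store.nspam, store.spam_tokens["pills"])
```

Under that assumption the second pass changes no counts, so the database is
not biased; the only cost is the time spent re-reading the message.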

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------
