Re: Site-wide training with same message, different recipients and different classification

Robert Menschel 28 Apr 2005 02:53:29 -0000

Hello ks,

Wednesday, April 27, 2005, 3:50:04 AM, you wrote:


> we're evaluating SpamAssassin 3.0.2 on a mail gateway on Linux. The
> mailboxes are not on this gateway but on a Lotus Notes server, to
> which the mail is forwarded. Training is done by copying mails into
> a separate mail folder, which is emptied via POP3 using fetchmail
> from the SA gateway. Headers are then modified via procmail.

> Overall we're happy with SA, but still some questions arise from
> time to time. 

> So, the current question is:
> What happens if two users receive the same mail, but classify it
> differently (one as ham and the other as spam), and both feed it
> back into SA to learn it?

> Remember that learning is only done as the user under whose
> privileges SA runs, so it's not user-specific.

So if I understand you correctly, there's a generic userid which is
used for both scoring and for learning, which has nothing to do with
the users who receive that email.

> In what direction will the score for the next mail from that sender
> be pushed? Up or down? Spam or ham?

Is the mail identical down to the message id?

If so, then to the best of my knowledge Bayes tracks learned messages
by message id, so the last learning "wins".

If both users put the message into their learning queues at about the
same time, one as spam and one as ham, and the system happens to
process the spam queue first, the message will be learned as spam.
When the system then processes the ham queue, the message will be
unlearned as spam and relearned as ham.
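
To make the "last learning wins" sequence concrete, here is a toy
sketch in Python. It is not SpamAssassin's actual code: the class
ToyBayesStore, its learn() method and the example tokens are all made
up for illustration, and it simply assumes that learned messages are
tracked by message id as described above.

```python
from collections import Counter


class ToyBayesStore:
    """Toy stand-in for a shared Bayes database (hypothetical, not SA code)."""

    def __init__(self):
        self.seen = {}          # message id -> "spam" or "ham"
        self.tokens_by_id = {}  # message id -> tokens learned for that message
        self.counts = {"spam": Counter(), "ham": Counter()}

    def learn(self, msg_id, tokens, label):
        previous = self.seen.get(msg_id)
        if previous == label:
            return "already learned"
        if previous is not None:
            # Same message id, opposite class: undo the earlier learning first.
            self.counts[previous].subtract(self.tokens_by_id[msg_id])
            # Unary plus drops tokens whose count fell to zero.
            self.counts[previous] = +self.counts[previous]
        self.counts[label].update(tokens)
        self.seen[msg_id] = label
        self.tokens_by_id[msg_id] = list(tokens)
        return "learned" if previous is None else "relearned"


store = ToyBayesStore()
mid = "<12345@example.com>"                          # made-up message id
tokens = ["from:news@example.com", "offer", "free"]  # made-up tokens

print(store.learn(mid, tokens, "spam"))  # learned
print(store.learn(mid, tokens, "ham"))   # relearned
print(store.seen[mid])                   # ham
```

Whichever queue is processed last decides how the message ends up
being counted; the earlier learning leaves no trace in the counts.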

However, the impact will probably be small -- Bayes is statistical,
and while the From header has some weight, it's only one token among
the many that are used to determine whether an email is ham or spam.
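
As a rough illustration of why one token rarely decides the outcome,
the sketch below combines per-token spam probabilities with a plain
naive-Bayes (log-odds) rule. This is a simplification chosen for the
example and is not necessarily how SpamAssassin combines tokens; the
function name combine() and all the probabilities are invented.

```python
import math


def combine(probabilities):
    """Combine per-token spam probabilities by summing their log-odds."""
    log_odds = sum(math.log(p) - math.log(1.0 - p) for p in probabilities)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Invented probabilities for the tokens other than the From header.
other_tokens = [0.85, 0.9, 0.8, 0.75, 0.9, 0.7]

# The same message, with the From token leaning spammy or hammy.
from_spammy = combine(other_tokens + [0.8])
from_hammy = combine(other_tokens + [0.2])

print(from_spammy, from_hammy)
```

With these made-up numbers both results stay above 0.99, so flipping
the From token alone does not change the verdict. If the other tokens
were close to neutral, a single token would matter much more.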

An almost identical message (same From, same path, same/similar
subject, same/similar To, same/similar body content) would tend to be
pushed in that direction (ham in my example). A reasonably different
message (same From, same path, same/similar To, but a mostly different
subject and a significantly different body) might go in either
direction.

I track spam/ham on a system with hundreds of domains and all of their
users, with one central Bayes database, and I've not seen any problems
caused by this type of sequence learning.

Bob Menschel
