Hello ks,

Wednesday, April 27, 2005, 3:50:04 AM, you wrote:
> we're evaluating SpamAssassin 3.02 on a mail gateway on Linux. The
> mailboxes are not on this gateway but on a Lotus Notes Server to
> where the mail is forwarded. Training is done via copying mails into
> a different Mailfolder, which is emptied via POP3 using fetchmail
> from the SA gateway. Headers are modified via procmail, then.
> Overall we're happy with SA, but still some questions arise from
> time to time.
> So, the current question is:
> What happens, if two users receive the same mail, but both are
> classifying the mail different (one as ham and the other one as
> spam) and feed it back into SA to learn it ?
> Remember that learning is only done as the user under which's
> privileges SA runs, so it's not user specific.

So if I understand you correctly, there's a generic userid which is
used for both scoring and for learning, which has nothing to do with
the users who receive that email.

> In what direction will the score for the next mail from that sender
> be pushed ? up or down ? Spam or Ham ?

Is the mail identical down to the message id? If so, since to the best
of my knowledge Bayes tracks messages by message id, then the last
learning "wins". If both users put their ham or spam into the learning
queue at about the same time, and the system just happens to learn the
spam queue first, the message will be learned as spam, and when the
system then learns the ham queue, the message will be unlearned as
spam and learned as ham.

However, the impact will probably be small -- Bayes is statistical,
and while the From header has some weight, it's only one token, of
which several/many are used to determine whether an email is ham or
spam.

An almost identical message (same From, same path, same/similar
subject, same/similar To, same/similar body content) would tend to be
pushed in that direction (ham in my example), but a reasonably
different message (same From, same path, same/similar To, mostly
different subject, significantly different body) might go in either
direction.

I track spam/ham on a system with hundreds of domains and all of their
users, with one central Bayes database, and I've not seen any problems
caused by this type of sequence learning.

Bob Menschel
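
P.S. To make the "last learning wins" sequence above concrete, here is
a toy Python sketch. It is not SpamAssassin's code or its database
layout: the class ToyBayesStore, its methods, and the token strings
are all made up for illustration, and it simply assumes (as described
above, to the best of my knowledge) that learning is keyed on the
message id.

```python
from collections import Counter


class ToyBayesStore:
    """Hypothetical token-count store keyed on message id (illustration only)."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.seen = {}  # message id -> (label, set of tokens) as last learned

    def _counts(self, label):
        return self.spam_tokens if label == "spam" else self.ham_tokens

    def learn(self, msgid, tokens, label):
        """Learn a message; if it was learned the other way, unlearn that first."""
        tokens = set(tokens)
        previous = self.seen.get(msgid)
        if previous is not None:
            old_label, old_tokens = previous
            if old_label == label:
                return False  # already learned this way, nothing changes
            # Same message id, opposite label: remove the earlier counts.
            self._counts(old_label).subtract(old_tokens)
        self._counts(label).update(tokens)
        self.seen[msgid] = (label, tokens)
        return True


# Made-up tokens for one message that two users classify differently.
tokens = ["from:sender@example.com", "subject:offer", "body:meeting"]
store = ToyBayesStore()
store.learn("<abc@example.com>", tokens, "spam")  # spam queue learned first
store.learn("<abc@example.com>", tokens, "ham")   # ham queue learned second

print(store.seen["<abc@example.com>"][0])
print(store.spam_tokens["from:sender@example.com"],
      store.ham_tokens["from:sender@example.com"])
```

After both calls the message is recorded as ham, and its From token
counts once toward ham and not at all toward spam. That one token is
still only a small part of what decides the score of the next mail
from that sender.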
