I see. That's a very good point about sharing the Bayes database with a
different community.
Does anyone see a problem with a single user collecting spam (and ham) from
various personal mailboxes, served by different internet service
providers, and running sa-learn on it?
David Roth
rothmail (at) comcast (dot) net
On Nov 18, 2005, at 8:45 AM, Anthony Peacock wrote:
Hi,
I actually think it is more to do with the fact that one person's
spam could be another person's ham. If the mail streams and servers
are carrying messages for a community of users who receive (and want
to receive) similar types of email messages, I can't see any major
problem with using those emails to train Bayes. However, if the
servers are processing email for two completely different user
communities their ideas of what is and isn't spam could be so
different that the Bayes stats become diluted.
For instance, I work for a Medical School, but in a heavily IT-based
department. Some terms that may be considered pornographic for
someone working in banking could be perfectly acceptable in my
environment.
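
To make the dilution concrete, here is a rough Python sketch. It is not SpamAssassin's actual Bayes calculation, only a simplified per-token estimate, and the two tiny "communities" and their messages are invented purely for illustration. It shows how a token that is a clear ham signal in one mail stream and a clear spam signal in another ends up close to neutral once the two streams are trained into one database.

```python
# Simplified illustration only: NOT SpamAssassin's real Bayes formula.
# All messages below are made-up token sets for two hypothetical communities.

def token_spam_prob(token, spam_msgs, ham_msgs):
    """Naive estimate of how spammy a token looks, between 0.0 and 1.0."""
    spam_rate = sum(token in m for m in spam_msgs) / len(spam_msgs)
    ham_rate = sum(token in m for m in ham_msgs) / len(ham_msgs)
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)

# Hypothetical medical-school mail stream
medical_spam = [{"viagra", "cheap"}, {"loan", "cheap"}]
medical_ham = [{"breast", "cancer", "study"}, {"breast", "screening"},
               {"meeting", "agenda"}]

# Hypothetical banking mail stream
banking_spam = [{"breast", "pics"}, {"breast", "hot"}]
banking_ham = [{"loan", "rates"}, {"meeting", "agenda"}]

token = "breast"
medical_p = token_spam_prob(token, medical_spam, medical_ham)
banking_p = token_spam_prob(token, banking_spam, banking_ham)

# Training one database on both streams mixes the evidence together
merged_p = token_spam_prob(token,
                           medical_spam + banking_spam,
                           medical_ham + banking_ham)

print("medical only:", medical_p)
print("banking only:", banking_p)
print("merged      :", round(merged_p, 3))
```

With separate databases the token is decisive for each community; in the merged database it drifts towards 0.5 and tells the filter very little, which is the dilution I mean.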
If I am understanding this correctly... the concern is that the Bayes
database should be trained only on ham and spam received on the mail
server it is used with?
David Roth
rothmail (at) comcast (dot) net
On Nov 18, 2005, at 5:10 AM, qMax wrote:
in wiki://BayesInSpamAssassin it is said:
Do not train Bayes on different mail streams or public spam corpora.
These methods will mislead Bayes into believing certain tokens are
spammy or hammy when they are not.
Could you explain why this is so, and what could happen if Bayes is
taught from several mail servers?
--
qMax
--
Anthony Peacock
CHIME, Royal Free & University College Medical School
WWW: http://www.chime.ucl.ac.uk/~rmhiajp/
"Computer software consists of only two components:
ones and zeros, in roughly equal proportions. All that is
required is to sort them into the correct order."