Hi,

I actually think it has more to do with the fact that one person's 
spam could be another person's ham.  If the mail streams and servers 
carry messages for a community of users who receive (and want to 
receive) similar types of email, I can't see any major problem with 
using those emails to train Bayes.  However, if the servers process 
email for two completely different user communities, their ideas of 
what is and isn't spam could be so different that the Bayes stats 
become diluted.

For instance, I work for a Medical School, but in a heavily IT-based 
department.  Some terms that might be considered pornographic by 
someone working in banking could be perfectly acceptable in my 
environment.
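
To put rough numbers on the dilution, here is a minimal sketch.  It 
uses a simplified per-token estimate (the token's rate in spam divided 
by the sum of its rates in spam and ham), which is an assumption for 
illustration and not SpamAssassin's actual Bayes formula.  The function 
name and all of the counts are made up.

```python
def token_spam_prob(spam_hits, n_spam, ham_hits, n_ham):
    """Simplified per-token spam probability (illustrative only,
    not the formula SpamAssassin itself uses)."""
    spam_rate = spam_hits / n_spam  # fraction of spam containing the token
    ham_rate = ham_hits / n_ham     # fraction of ham containing the token
    return spam_rate / (spam_rate + ham_rate)


# Hypothetical counts for one anatomical term, invented for this example.
# Medical site: the term is common in wanted mail, rare in spam.
medical = dict(spam_hits=5, n_spam=1000, ham_hits=200, n_ham=1000)
# Banking site: the term almost only ever turns up in spam.
banking = dict(spam_hits=300, n_spam=1000, ham_hits=1, n_ham=1000)

# Training one database on both streams simply adds the counts together.
pooled = {key: medical[key] + banking[key] for key in medical}

p_medical = token_spam_prob(**medical)
p_banking = token_spam_prob(**banking)
p_pooled = token_spam_prob(**pooled)

print(f"medical only: {p_medical:.2f}")
print(f"banking only: {p_banking:.2f}")
print(f"pooled:       {p_pooled:.2f}")
```

With these invented counts the term is a strong ham indicator for the 
medical stream and a strong spam indicator for the banking stream, but 
the pooled database ends up not far from 0.5, so the token tells 
neither community very much.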



> If I am understanding this correctly...the concern is that the Bayes
> should match the mail server in which the ham and spam was received on
> only?
> 
> David Roth
> rothmail (at) comcast.net (dot) net
> 
> On Nov 18, 2005, at 5:10 AM, qMax wrote:
> 
> > in wiki://BayesInSpamAssassin it is said:
> > Do not train Bayes on different mail streams or public spam corpora.
> > These methods will mislead Bayes into believing certain tokens are
> > spammy or hammy when they are not.
> >
> > Could you explain why it is so, and what could happen if we teach
> > Bayes from several mail servers?
> >
> > -- 
> >  qMax
> >
> 


-- 
Anthony Peacock       
CHIME, Royal Free & University College Medical School
WWW:    http://www.chime.ucl.ac.uk/~rmhiajp/
"Computer  software  consists of  only  two  components: 
ones and zeros, in roughly equal proportions.   All that is
required is to sort them into the correct order."

