> -----Original Message-----
> From: Tony Earnshaw [mailto:[EMAIL PROTECTED]
> Sent: Monday, June 16, 2003 7:58 AM
> Subject: Re: [SAtalk] Removing headers etc.. to feed Bayes correctly
> ...people on the list were saying about murdering the Bayes database before it had
> even reached maturity made me feel like Gerhard Schröder.


To my mind, it's not murdering, or anything remotely approaching it.  The suggestion 
to let sa-learn do the initial ham and spam seeding is simply not optimal.  
Autolearning above (or below) a threshold established by SpamAssassin is an 
ill-conceived method of establishing an initial Bayes token base.  Pre-selecting a 
corpus through SpamAssassin directly contradicts the premise on which a Bayesian token 
database rests:  the assumption that there are "interesting tokens" that the normal 
heuristics are missing.  If Bayes only ever learns from mail the rules already classify 
with confidence, it mostly learns what the rules already know.  A Bayes database doesn't 
reach maturity by having a certain number of SA-filtered spams scoring above 15 and 
SA-filtered hams scoring below -2; it reaches maturity by having a certain number of 
confirmed hams and spams, period.  
Therefore, if one organization obtains its initial Bayes seeding strictly through 
auto-learning for three weeks and gets 2000 hams and 2000 spams in it, and another does 
theirs in 15 minutes by manually teaching it 2000 hams from this week and 2000 spams 
from this week (that SpamAssassin has never touched), the LATTER would be the much, 
much more accurate Bayesian seeding procedure.
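
To make the selection problem concrete, here is a toy sketch in Python.  The scores and 
labels are made up for illustration, and the two thresholds are simply the >15 and <-2 
figures mentioned above, not necessarily what any given install uses.  It does not call 
SpamAssassin at all; it only shows which messages each seeding approach would ever hand 
to Bayes:

```python
from typing import List, NamedTuple


class Message(NamedTuple):
    sa_score: float  # heuristic score (hypothetical values)
    is_spam: bool    # label confirmed by a human


# Thresholds taken from the figures in the text above (assumed, not read from a config).
SPAM_THRESHOLD = 15.0
HAM_THRESHOLD = -2.0


def autolearn_selection(messages: List[Message]) -> List[Message]:
    """Keep only mail the heuristics are already very sure about."""
    return [
        m for m in messages
        if m.sa_score > SPAM_THRESHOLD or m.sa_score < HAM_THRESHOLD
    ]


def manual_selection(messages: List[Message]) -> List[Message]:
    """Keep every hand-confirmed message, whatever the heuristics scored it."""
    return list(messages)


# A made-up corpus: two spams and two hams fall in the heuristics' uncertain middle.
corpus = [
    Message(22.0, True),
    Message(3.1, True),    # spam the rules miss
    Message(4.8, True),    # spam the rules miss
    Message(-3.5, False),
    Message(0.4, False),
    Message(6.0, False),   # ham the rules dislike
]

auto = autolearn_selection(corpus)
manual = manual_selection(corpus)
missed_spam = [m for m in manual if m.is_spam and m not in auto]

print(len(auto), "of", len(manual), "messages reach Bayes via auto-learning")
print(len(missed_spam), "confirmed spams are never learned that way")
```

With this made-up corpus, auto-learning trains on 2 of the 6 messages, and both spams 
that the rules scored low never reach the token database.  Those are exactly the 
messages whose tokens Bayes most needs to see.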

This is discussed in depth in Paul Graham's writing on the topic, specifically the 
part where he mentions that tokens like "per" and "FL" and "ff0000" are actually very 
reliable indicators of spammishness.
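
For anyone who wants to see why such tokens stand out, below is a rough Python sketch of 
the per-token probability along the lines of the formula in Graham's "A Plan for Spam".  
The function name and the token counts are hypothetical, the 2000/2000 corpus sizes just 
reuse the numbers from the example above, and none of this is meant to describe how 
SpamAssassin's own Bayes code computes things:

```python
from typing import Optional


def token_spam_probability(
    good_count: int, bad_count: int, ngood: int, nbad: int
) -> Optional[float]:
    """Per-token spam probability, sketched after Graham's published formula.

    good_count / bad_count: occurrences of the token in the ham / spam corpus.
    ngood / nbad: number of ham / spam messages in the corpus (must be > 0).
    Returns None for tokens seen too rarely to be trusted.
    """
    g = 2 * good_count  # ham occurrences are doubled to bias against false positives
    b = bad_count
    if g + b < 5:
        return None
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    p = bad_ratio / (good_ratio + bad_ratio)
    # Clamp so that no single token is ever treated as absolute proof.
    return max(0.01, min(0.99, p))


# Hypothetical counts for a corpus of 2000 hams and 2000 spams.
p_color = token_spam_probability(good_count=3, bad_count=400, ngood=2000, nbad=2000)
p_rare = token_spam_probability(good_count=1, bad_count=1, ngood=2000, nbad=2000)
p_spam_only = token_spam_probability(good_count=0, bad_count=10, ngood=2000, nbad=2000)

print(p_color, p_rare, p_spam_only)
```

A token that no hand-written rule would ever bother with can still end up near the top 
of the scale, which is the whole point of feeding Bayes a corpus the rules have not 
pre-filtered.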

