On Mon, 8 Sep 2014, Alex Regan wrote:

Did you understand that the number of previously not seen tokens has
absolutely nothing to do with auto-learning?

Yes, that was a mistake.

Did you understand that all
tokens are learned, regardless whether they have been seen before?

That doesn't really matter from a user perspective, though, right? I mean, if tokens that have already been learned are learned again, the net result is zero.

Very much not zero. Each token has several values associated with it:
 # ham
 # spam
 time-stamp

So each time it's learned its respective ham/spam counter is incremented
which indicates how spammy or hammy a given token is and its time-stamp is
updated indicating how "fresh" a token is. The bayes expiry process removes
"stale" tokens when it does its job to prune the database down to size.

Thus learning a token multiple times increases its weight and keeps it
"fresh" so it is kept as an active/relevant piece of info.
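The bookkeeping described above can be sketched in a few lines. This is a
minimal, illustrative Python model (SpamAssassin's real Bayes store is Perl
with a database backend; the class and method names here are made up for
the example):

```python
import time

class TokenStore:
    """Illustrative per-token store: ham count, spam count, time-stamp."""

    def __init__(self):
        # token -> (ham_count, spam_count, last_seen_timestamp)
        self.db = {}

    def learn(self, tokens, is_spam, now=None):
        """Every token is learned on every pass: its counter grows and
        its time-stamp is refreshed, whether or not it was seen before."""
        now = time.time() if now is None else now
        for tok in tokens:
            ham, spam, _ = self.db.get(tok, (0, 0, 0))
            if is_spam:
                spam += 1
            else:
                ham += 1
            self.db[tok] = (ham, spam, now)

    def expire(self, max_age, now=None):
        """Prune 'stale' tokens whose time-stamp is older than max_age,
        mimicking the bayes expiry pass that keeps the database small."""
        now = time.time() if now is None else now
        self.db = {t: v for t, v in self.db.items()
                   if now - v[2] <= max_age}
```

So re-learning "viagra" below bumps its spam counter to 2 and refreshes its
time-stamp, while "hello", last seen longer ago, gets expired:

```python
store = TokenStore()
store.learn(["viagra", "hello"], is_spam=True, now=100)
store.learn(["viagra"], is_spam=True, now=200)
store.expire(max_age=50, now=230)   # "hello" is stale and pruned
```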

--
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{