On Mon, 8 Sep 2014, Alex Regan wrote:
> > Did you understand that the number of previously not seen tokens has
> > absolutely nothing to do with auto-learning?
> Yes, that was a mistake.
> > Did you understand that all
> > tokens are learned, regardless whether they have been seen before?
> That doesn't really matter from a user perspective, though, right? I mean,
> if tokens that have already been learned are learned again, the net
> result is zero.
Very much not zero. Each token has several values associated with it:
# ham
# spam
time-stamp
So each time a token is learned, its respective ham/spam counter is
incremented (which indicates how spammy or hammy that token is) and its
time-stamp is updated (which indicates how "fresh" it is). The bayes expiry
process removes "stale" tokens when it runs, pruning the database down to size.
Thus learning a token multiple times increases its weight and keeps it
"fresh" so it is kept as an active/relevant piece of info.
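The mechanics above can be sketched in a few lines of Python. This is a
hypothetical in-memory store for illustration only, not SpamAssassin's
actual Bayes database (which is implemented in Perl with its own storage
backends); the `learn` and `expire` functions and the `max_age` parameter
are names invented here:

```python
import time

# token -> {"ham": count, "spam": count, "atime": last-learned timestamp}
tokens = {}

def learn(token, as_spam, now=None):
    """Learning a token always increments its ham or spam counter and
    refreshes its time-stamp, even if the token was seen before."""
    now = time.time() if now is None else now
    entry = tokens.setdefault(token, {"ham": 0, "spam": 0, "atime": now})
    entry["spam" if as_spam else "ham"] += 1
    entry["atime"] = now  # keeps the token "fresh"

def expire(max_age, now=None):
    """Expiry prunes tokens whose last-seen time is older than max_age."""
    now = time.time() if now is None else now
    stale = [t for t, e in tokens.items() if now - e["atime"] > max_age]
    for t in stale:
        del tokens[t]

# Learning the same token twice is not a no-op: its weight grows
# and its time-stamp advances.
learn("viagra", as_spam=True, now=100.0)
learn("viagra", as_spam=True, now=200.0)
# tokens["viagra"] -> {"ham": 0, "spam": 2, "atime": 200.0}

# A token not re-learned recently eventually gets expired.
expire(max_age=50, now=300.0)
# "viagra" is now pruned (300 - 200 > 50)
```

So repeated learning both strengthens the spam/ham signal and defers
expiry, which is why re-learning already-seen tokens is far from a
zero-sum operation.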
--
Dave Funk University of Iowa
<dbfunk (at) engineering.uiowa.edu> College of Engineering
319/335-5751 FAX: 319/384-0549 1256 Seamans Center
Sys_admin/Postmaster/cell_admin Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{