On Mon, 10 Jan 2005 15:10:08 +1300, "Tony Meyer" <[EMAIL PROTECTED]> wrote:
>> I was experimenting with training last night (Outlook plugin
>> v1.01) and noticed something odd: sometimes, when I trained a
>> single message as spam and then ran a filtering pass over the
>> whole spam collection, a few other spams ended up with a
>> lower score than before.
>>
>> Maybe I'm completely misunderstanding how the classifier
>> works, but shouldn't train-as-spam increase spam token counts
>> and thus make all messages containing those tokens appear
>> more spammy, and have no effect on those that don't?
>
> Part of the calculation of a token's probability is:
>
> (spamcount and hamcount are for the token. nham and nspam are totals)
>
>   hamratio = hamcount / nham
>   spamratio = spamcount / nspam
>   prob = spamratio / (hamratio + spamratio)
>   [the bayesian adjustment follows, but isn't important here]

Ah, of course. I was looking at the token counts and not thinking about
the total message counts. Stupid weekend brain. :)

I was doing a kind of manual "train to exhaustion", and the other thing
I noticed was that the spam took a lot more training to make
classification accurate (currently 82 ham : 409 spam, out of a total
training set of 644 : 1414). I guess this simply means that my spam is a
lot less consistent than my ham.

BTW, I also found a trick in Outlook to be able to train on a given spam
more than once, to force correct classification. Normally this doesn't
work because the plugin sees the two messages as identical, but creating
the copy in an IMAP folder seems to fool it.

--
Mat.

_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
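(A quick numerical sketch of the effect Tony describes, using made-up
counts and only the ratio step quoted above, not the real SpamBayes
classifier: training one extra spam raises nspam, which slightly lowers
spamratio, and therefore prob, for every token whose spamcount did not
change.)

    # Sketch of the ratio step only; the real classifier applies a
    # Bayesian adjustment afterwards, which is omitted here.
    def token_prob(spamcount, hamcount, nspam, nham):
        hamratio = hamcount / nham
        spamratio = spamcount / nspam
        return spamratio / (hamratio + spamratio)

    # A token seen in 40 of 100 spams and 5 of 100 hams:
    before = token_prob(spamcount=40, hamcount=5, nspam=100, nham=100)

    # Train one more message as spam.  If that message does NOT contain
    # this token, spamcount stays at 40 but nspam becomes 101, so the
    # token (and any spam scored mostly on tokens like it) looks very
    # slightly less spammy than before.
    after = token_prob(spamcount=40, hamcount=5, nspam=101, nham=100)

    print(round(before, 4), round(after, 4))  # 0.8889 vs 0.8879

The per-token shift is tiny, but summed over a whole message it can be
enough to nudge a few spam scores downward, which matches what Mat saw.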
