To partially answer my own question: I see that most of the token probabilities are very near 0 and 100%, so an evenly balanced mix of spammy and hammy tokens would yield a combined score very close to 50%. Perhaps it is a case of "over-training" Bayes? It's on a site-wide gateway, and spam and ham tokens probably occur multiple times in the mail stream.
If that's the case, how does the Bayes algorithm decide what is a useful token and what is a noise word?

Pierre

-----Original Message-----
From: Pierre Thomson
Sent: Wednesday, May 19, 2004 2:39 PM
To: [EMAIL PROTECTED]
Subject: auto-trained Bayes tends towards 50% ?

Hi all,

Quite a lot of my mail is scoring so near 50% (0.49999 - 0.50001) that it doesn't show any Bayes score in the summary. This is mostly just an inconvenience, but I wonder if there is a "pull" towards the middle that makes this happen. I am seeing it in about 2% of my email, much more than I would expect statistically.

I run auto-learn with thresholds of -0.1 and 12 for ham and spam, respectively, and hand-train only what falls on the wrong side of the threshold, mostly "sham" and mailing-list stuff. I should add that Bayes is performing beautifully on the whole, giving BAYES_99 scores to much of the spam and BAYES_00 to much of the ham.

Here's the debug of today's Congressional Quarterly update:

debug: bayes token 'norton' => 0.999707779886148
...
debug: bayes token 'Cabinet' => 0.0489090909090909
debug: bayes: score = 0.5

I'm running 2.63 with a Bayes DB like this:

0.000          0          2          0  non-token data: bayes db version
0.000          0     114942          0  non-token data: nspam
0.000          0      39789          0  non-token data: nham
0.000          0     138554          0  non-token data: ntokens
0.000          0 1084813996          0  non-token data: oldest atime
0.000          0 1084991383          0  non-token data: newest atime
0.000          0 1084990603          0  non-token data: last journal sync atime
0.000          0 1084986837          0  non-token data: last expiry atime
0.000          0     172800          0  non-token data: last expire atime delta
0.000          0      13664          0  non-token data: last expire reduction count

Is anyone else noticing this?

Pierre Thomson
BIC
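For anyone curious why balanced extreme tokens land on exactly 0.5: SpamAssassin's Bayes combines token probabilities with Robinson's chi-squared method rather than multiplying them naively. Here's a minimal Python sketch of that combining step (an illustration of the technique, not SpamAssassin's actual code); note how one very spammy token and its mirror-image hammy token cancel to exactly 0.5:

```python
import math

def chi2q(x2, v):
    # Survival function of the chi-square distribution for even
    # degrees of freedom v: P(chi-square with v dof >= x2).
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    # Robinson's chi-squared combining of per-token spam
    # probabilities into one score in [0, 1].
    n = len(probs)
    # S: evidence the message is spam; H: evidence it is ham.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # When the spam and ham evidence are equally strong, S == H
    # and the score is exactly 0.5.
    return (s - h + 1.0) / 2.0

print(combine([0.99, 0.01]))  # mirror-image tokens -> exactly 0.5
print(combine([0.99] * 5))    # uniformly spammy tokens -> near 1
```

So a message whose few significant tokens split evenly between strong ham and strong spam indicators really does get pulled to the middle; it isn't a bug in your training, just how the symmetric combining behaves when the evidence balances out.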
