To partially answer my own question: I see that most of the token scores are 
very near 0 or 100%, so an equal number of spam and ham hits would yield a 
score very close to 50%.  Perhaps it is a case of "over-training" Bayes?  This 
is a site-wide gateway, and spam and ham tokens probably occur multiple times 
in the mail stream.

If that's the case, how does the Bayes algorithm decide what is a useful token 
and what is a noise word?

Pierre


-----Original Message-----
From: Pierre Thomson 
Sent: Wednesday, May 19, 2004 2:39 PM
To: [EMAIL PROTECTED]
Subject: auto-trained Bayes tends towards 50% ?


Hi all,

Quite a lot of my mail is scoring so near 50% (0.49999 - 0.50001) that it 
doesn't show any Bayes score in the summary.  This is mostly just an 
inconvenience, but I wonder if there is a "pull" towards the middle that makes 
this happen.  I am seeing it in about 2% of my email, much more than I would 
expect statistically.

I run auto-learn with thresholds of -0.1 and 12 for ham and spam, respectively, 
and hand train only what falls on the wrong side of the threshold, mostly 
"sham" and mailing list stuff.  I should add that Bayes is performing 
beautifully on the whole, giving BAYES_99 scores to much of the spam and 
BAYES_00 to much of the ham.

Here's the debug of today's Congressional Quarterly update:

debug: bayes token 'norton' => 0.999707779886148
...
debug: bayes token 'Cabinet' => 0.0489090909090909
debug: bayes: score = 0.5
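
A rough sketch of how such a "pull" towards 0.5 could arise.  It assumes the 
token probabilities are merged with Robinson/Fisher-style chi-squared 
combining; whether 2.63 does exactly this is an assumption.  The function 
names and the token values are made up for illustration and are not taken from 
SpamAssassin or from the debug output above.

```python
import math

def chi2_survival(x2, dof):
    """P(X >= x2) for a chi-squared variable with an even number of
    degrees of freedom (closed-form series, no SciPy needed)."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Hypothetical combiner: merge per-token spam probabilities
    (each strictly between 0 and 1) into one score in [0, 1]."""
    n = len(probs)
    # Evidence for spam: small when tokens are not spammy.
    spam_ev = 1.0 - chi2_survival(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Evidence for ham: small when tokens are not hammy.
    ham_ev = 1.0 - chi2_survival(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # Equally strong evidence on both sides cancels out to 0.5.
    return (spam_ev - ham_ev + 1.0) / 2.0

# Illustrative values only.
balanced = [0.999, 0.001]          # one extreme spam token, one extreme ham token
spammy = [0.999, 0.998, 0.997]     # only extreme spam tokens

print(combine(balanced))
print(combine(spammy))
```

Under this assumption, the more tokens that sit at the extremes, the more 
likely it is that a mixed message (a news digest that mentions product names, 
say) carries equally strong evidence on both sides and lands in the middle.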

I'm running 2.63 with a Bayes DB like this:

0.000          0          2          0  non-token data: bayes db version
0.000          0     114942          0  non-token data: nspam
0.000          0      39789          0  non-token data: nham
0.000          0     138554          0  non-token data: ntokens
0.000          0 1084813996          0  non-token data: oldest atime
0.000          0 1084991383          0  non-token data: newest atime
0.000          0 1084990603          0  non-token data: last journal sync atime
0.000          0 1084986837          0  non-token data: last expiry atime
0.000          0     172800          0  non-token data: last expire atime delta
0.000          0      13664          0  non-token data: last expire reduction count

Is anyone else noticing this?

Pierre Thomson
BIC
