[Nate Tanner]
> i had been using version 0.3 of spambayes for a long time (XP/outlook
> express) and it was working fairly well. i recently upgraded to 1.0.1, and
> now i get a ton of false positives (including the confirmation and welcome
> messages from this mailing list !!) probably close to 20% of my valid
> emails are being marked as spam.
>
> does anyone have any ideas about how to fix this problem? it's worse now
> than if i had no filter, because i have to comb through every spam looking
> for non-spams! please help!

As Tony suggested, retrain from scratch. Some of the stuff in your data
really doesn't make sense. For example,

> ...
> Total emails trained: Spam: 1299 Ham: 3644
...
> header:Subject:1 0.673037 1135 833
> header:From:1 0.675923 1139 847
> header:To:1 0.67644 1139 849
> header:Date:1 0.676889 1138 850

That says, for example, that 3644-1135=2509 of the ham messages you trained
on didn't have a Subject line. That's unbelievable -- or you have very weird
ham <wink>. Similarly, about 2,500 of your ham messages didn't have a To
line, From line, or Date line in the headers. Those are equally incredible.
These kinds of header lines should appear in virtually all email, whether
ham or spam, and then they're judged as neutral. Instead the presence of a
Subject line "looks spammy" to your database, and that's nuts.

This is also incredible:

> sender:no real name:2**0 0.004644 48 0

That says you've trained on no spam at all where the From line didn't
contain a real name -- yet that's very common in spam, and moderately
unusual in ham.

You even have ubiquitous words like "the" and "and" scoring as spammy!
Something is seriously messed up with the training here -- start over.
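
For anyone who wants to check the arithmetic above, here is a rough Python sketch. It rests on several assumptions that are not stated in the thread: that each dump row reads "token, spam probability, ham count, spam count" (inferred from the 3644-1135 subtraction), and that the probability is the usual Robinson-style adjusted estimate with an unknown-word strength of 0.45 and prior of 0.5, which are believed to be the SpamBayes defaults. The function names are made up for illustration and are not part of the SpamBayes API.

```python
# Totals from the quoted dump: "Spam: 1299 Ham: 3644".
N_SPAM = 1299
N_HAM = 3644


def ham_missing(ham_count, n_ham=N_HAM):
    """Number of trained ham messages that did NOT contain the token."""
    return n_ham - ham_count


def token_spamprob(ham_count, spam_count, n_ham=N_HAM, n_spam=N_SPAM,
                   strength=0.45, prior=0.5):
    """Adjusted spam probability for one token.

    strength and prior are assumed defaults, not values taken from the thread.
    """
    n = ham_count + spam_count
    if n == 0:
        return prior  # never-seen token: fall back to the prior
    ham_ratio = ham_count / n_ham
    spam_ratio = spam_count / n_spam
    raw = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the raw estimate toward the prior; the pull weakens as n grows.
    return (strength * prior + n * raw) / (strength + n)


# (token, reported probability, ham count, spam count) as read from the dump.
rows = [
    ("header:Subject:1", 0.673037, 1135, 833),
    ("header:From:1", 0.675923, 1139, 847),
    ("sender:no real name:2**0", 0.004644, 48, 0),
]

for token, reported, ham, spam in rows:
    computed = token_spamprob(ham, spam)
    print(f"{token}: reported {reported}, computed {computed:.6f}, "
          f"ham without token {ham_missing(ham)}")
```

If the column reading is right, the computed values should land on the reported ones. That would show why Subject scores near 0.67: the token appears in about 64% of the trained spam but only about 31% of the trained ham, and in a sanely trained database both fractions would be close to 100%, giving a score near 0.5.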
