OK, I went to re-train from scratch. I removed hammie.db, message_info_database.db, and statistics_database.db from Documents and Settings/Owner/Application Data/SpamBayes/Proxy. I was going along fine all day, training on new messages. Then I went to review some additional messages (by right-clicking on the tray icon). It pulled up a ton more messages than I was expecting, so I discarded all except 8 of them. Then I went to the home page, and it says:

Database only has 7 good and 1 spam - you should consider performing additional training.

Apparently the reason it pulled up a ton of messages was that it suddenly decided it hadn't trained on them already, even though it had. So the question is: what did I do before the error occurred that might have caused SpamBayes to suddenly forget all of its previous training? The only thing I did was modify the configuration so that it would put the string "spam," in the "To:" and "Subject:" headers. So is modifying the configuration supposed to undo all the prior training? If not, any guesses on why this happened?

Thanks for your assistance.

----- Original Message -----
From: "Tim Peters" <[EMAIL PROTECTED]>
To: <spam>; "Nate Tanner" <[EMAIL PROTECTED]>
Cc: <[email protected]>
Sent: Sunday, January 09, 2005 10:35 PM
Subject: spam,Re: [Spambayes] tons of false positives after upgrading

[Nate Tanner]
> i had been using version 0.3 of spambayes for a long time (XP/outlook
> express) and it was working fairly well. i recently upgraded to 1.0.1, and
> now i get a ton of false positives (including the confirmation and welcome
> messages from this mailing list !!) probably close to 20% of my valid
> emails are being marked as spam.
>
> does anyone have any ideas about how to fix this problem? it's worse now
> than if i had no filter, because i have to comb through every spam looking
> for non-spams! please help!

As Tony suggested, retrain from scratch. Some of the stuff in your data really doesn't make sense. For example,

> ...
> Total emails trained: Spam: 1299 Ham: 3644
...
> header:Subject:1 0.673037 1135 833
> header:From:1 0.675923 1139 847
> header:To:1 0.67644 1139 849
> header:Date:1 0.676889 1138 850

That says, for example, that 3644-1135=2509 of the ham messages you trained on didn't have a Subject line. That's unbelievable -- or you have very weird ham <wink>. Similarly, about 2,500 of your ham messages didn't have a To line, From line, or Date line in the headers. Those are equally incredible.
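
The arithmetic in the reply above can be checked with a short Python sketch. It rests on two assumptions that are not stated in the thread: that the three numbers after each token are the spam probability, the ham count, and the spam count (inferred from the 3644-1135 subtraction), and that the probability comes from a Robinson-style smoothed estimate with strength s = 0.45 and prior x = 0.5. The names `ham_missing` and `spamprob` are illustrative helpers, not SpamBayes functions.

```python
# Totals and token rows copied from the quoted training dump.
NHAM, NSPAM = 3644, 1299

# token -> (reported spamprob, ham count, spam count).
# The column meaning is an assumption inferred from the reply's arithmetic.
ROWS = {
    "header:Subject:1": (0.673037, 1135, 833),
    "header:From:1": (0.675923, 1139, 847),
    "header:To:1": (0.67644, 1139, 849),
    "header:Date:1": (0.676889, 1138, 850),
}


def ham_missing(ham_count, nham=NHAM):
    """Number of trained ham messages that did NOT contain the token."""
    return nham - ham_count


def spamprob(ham_count, spam_count, nham=NHAM, nspam=NSPAM, s=0.45, x=0.5):
    """Smoothed spam probability for one token.

    s and x are assumed values for the prior strength and prior
    probability; they are not given anywhere in the thread.
    """
    ham_ratio = ham_count / nham      # fraction of ham containing the token
    spam_ratio = spam_count / nspam   # fraction of spam containing the token
    p = spam_ratio / (ham_ratio + spam_ratio)
    n = ham_count + spam_count
    return (s * x + n * p) / (s + n)


for token, (reported, ham, spam) in ROWS.items():
    print(f"{token:18s} ham without it: {ham_missing(ham):4d}  "
          f"recomputed: {spamprob(ham, spam):.6f}  reported: {reported}")
```

Under those assumptions the recomputed probabilities agree with the reported ones to about five decimal places, which supports reading the columns as ham count followed by spam count. A token present in nearly every ham and nearly every spam would give both ratios close to 1 and a probability close to 0.5, which is the "neutral" outcome one would expect for Subject, From, To, and Date.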

These kinds of header lines should appear in virtually all email, whether ham or spam, and then they're judged as neutral. Instead the presence of a Subject line "looks spammy" to your database, and that's nuts.

This is also incredible:

> sender:no real name:2**0 0.004644 48 0

That says you've trained on no spam at all where the From line didn't contain a real name -- yet that's very common in spam, and moderately unusual in ham. You even have ubiquitous words like "the" and "and" scoring as spammy! Something is seriously messed up with the training here -- start over.

_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
