On 10/10/05, mgleich <[EMAIL PROTECTED]> wrote:
> I've just realized that although my database is 536kb and that is not so
> large, it is composed of 702 spam and 110 ham. I gather this is extremely
> unbalanced and may explain why I'm getting false negatives.

Actually, 7 to 1 is really not an unusually high imbalance. We've seen
reports from people who have 100 to 1 or higher imbalances. If you are
getting false positives, then imbalance is the most common cause. A few
false negatives are not uncommon, though, because spam is constantly
changing. If a relatively high percentage of your spam is coming in as
false negatives, then you might have an imbalance problem. The best way
to tell for sure is to look at the spam clues for one of the false
negatives, which you can generate from the SpamBayes menu.

> Do I need to begin from scratch? If so, do I just delete the db file and
> will Spambayes just create a new one?

For a 7 to 1 imbalance, I would usually say there is no need to begin
from scratch. However, SpamBayes learns quickly, so it shouldn't hurt to
start over and see what happens. Since you know the size of your DB,
you've obviously located the file. You will probably see two files with
the *.db extension: one is the training data, and the other contains
information about the messages that have been processed. Just close
Outlook, delete these two files, then restart Outlook, and SpamBayes
should recreate the databases.

-- 
Kenny Pitt
_______________________________________________
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
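
As a quick check on the counts quoted in this thread, here is a minimal Python sketch of the ratio arithmetic. It is plain arithmetic only and does not use anything from SpamBayes itself; the function name `imbalance_ratio` is made up for illustration.

```python
def imbalance_ratio(spam_count: int, ham_count: int) -> float:
    """Return how many times larger the bigger class is than the smaller one.

    Hypothetical helper for illustration; not part of SpamBayes.
    """
    if spam_count <= 0 or ham_count <= 0:
        raise ValueError("both counts must be positive")
    larger = max(spam_count, ham_count)
    smaller = min(spam_count, ham_count)
    return larger / smaller


# Counts taken from the quoted message: 702 spam, 110 ham.
ratio = imbalance_ratio(702, 110)
print(f"{ratio:.1f} to 1")
```

With 702 spam and 110 ham this comes to a little over 6 to 1, which is where the rounded "7 to 1" figure comes from, and it is far below the 100 to 1 cases mentioned in the reply.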
