> Even training only on mistakes and unsures, I have had a steadily
> increasing ratio for months. I almost never see a misclassified
> ham and only very rarely a ham about which the system is unsure.
> It's unsure about spam every day.

I run several servers and hundreds of domains, so I get quite a bit of email, even if much of it is just for archival purposes (logs, mostly). I keep MOST of my ham. I no longer keep spam beyond about 2 months, but in that period I can easily collect some 100,000 spam, which would otherwise totally dwarf the amount of ham I receive. I get at least 2k messages per day, sometimes as many as 5k.

I never exactly plan to rebuild the database, but I always do when I make a big mistake. While I never really had a problem with the effectiveness of SpamBayes before, a couple of times I've clicked the wrong button in the 'unsure' folder when I had fifteen or more spam or ham selected, which can quite effectively destroy the database. So I purge it and retrain on my current archive of spam and a couple of known-good folders under the inbox in which I have stored a few thousand messages. Having that archive of known-good messages makes all the difference in the world.

I now have a database of about 90k/80k (the db is about 330 MB) and receive only about 25 unclassified messages per day on average. About 20% of those are either gobbledygook or legit messages with no content except their attachments, or with a blank subject; the other 80% are 'trainable' spam. I train on all ham and only on those spam messages that look like they'll make a difference to the validity of future checks. If it's a gobbledygook spam message, I usually just delete it directly from unsure.

I still use 75%/15% as the spam/ham cutoffs. While I could probably avoid looking at subject lines for approximately 50-60% of the spam that goes to unsure by lowering the spam cutoff to 60%, it takes only a few extra seconds once per day to look through those other subjects or senders and correct their status. I'd rather not risk having an important message from a client, one who is forwarding a spam message they received, go directly to the spam folder.
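For anyone unfamiliar with how the two cutoffs split mail into ham, unsure, and spam, here is a minimal Python sketch of the idea. The function name `classify`, the sample scores, and the boundary handling (strictly below the ham cutoff counts as ham, at or above the spam cutoff counts as spam) are assumptions for illustration; this is not SpamBayes's own code.

```python
def classify(score, ham_cutoff=0.15, spam_cutoff=0.75):
    """Map a spam probability in [0, 1] to 'ham', 'unsure', or 'spam'.

    The defaults mirror the 75%/15% cutoffs mentioned in this post.
    The boundary handling is an assumption, not SpamBayes's actual rule.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


# Hypothetical scores, chosen only to show the effect of the cutoffs.
scores = [0.02, 0.18, 0.65, 0.80]

# With the 75%/15% cutoffs, the 0.18 and 0.65 messages land in unsure.
print([classify(s) for s in scores])

# Lowering the spam cutoff to 60% moves the 0.65 message out of unsure
# and straight into spam, which is the trade-off discussed above.
print([classify(s, spam_cutoff=0.60) for s in scores])
```

With these made-up scores, the only message that changes category when the spam cutoff drops from 75% to 60% is the one scoring 0.65, which goes from unsure to spam without ever being reviewed.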
Once a message is in the spam folder I don't bother looking at it, except once per month when I use the library of spam I've collected to fine-tune my server-side filters.

Legitimate forum and group messages can often score higher than 10%, so I don't want to lower my ham threshold. If anything, it could stand to go up to 20% or so. The no-subject or attachment-only ham almost always score in the high teens or low twenties, but if I raise the ham setting I'll get a bit more of the gobbledygook in my inbox, too.

Anyway... Just thought a bit more anecdotal evidence might be interesting to some. ;)

Regards,

Shawn K. Hall
http://12PointDesign.com/

'// ========================================================
"You have to change the map, not the world."
-- Marcus Kaarto