David Abrahams wrote on Saturday, February 03, 2007 9:01 PM -0600:

> "Seth Goodman" <[EMAIL PROTECTED]> writes:
>
> > If your training set has much more spam than ham, you can train on
> > ham that already scores properly.
>
> That'll help? Great; it's easy enough.

There is anecdotal evidence that this helps, as well as a few systems where it doesn't seem to matter. If Spambayes is not classifying well enough, this is a good thing to try.

> > Whether you choose ham that scores very low already (typical ham) or
> > the highest scoring ham (unusual ham) is your preference.
>
> Are you suggesting that it makes no difference?

Not at all ... only that no one can tell you for sure which is better for your own mail flow. My preference for adding ham to a training set is to pick the highest-scoring ham and train on a few at a time, rescoring the ham folder after training each new group. There are a lot of different approaches, and no clear winner has emerged that works better on everyone's mail flow.
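For concreteness, here is a minimal sketch of that routine in Python. The classifier interface (score() and learn_ham()), the group size, and the cutoff are illustrative assumptions, not the actual Spambayes API; adapt it to whatever front end you use.

    GROUP_SIZE = 5     # train a few messages at a time
    HAM_CUTOFF = 0.20  # scores below this already classify as ham

    def train_highest_scoring_ham(classifier, ham_folder):
        """Train the highest-scoring ham in small groups, rescoring
        the whole ham folder after each group (hypothetical API)."""
        while True:
            # Rescore every message and sort, highest score first.
            scored = sorted(ham_folder, key=classifier.score,
                            reverse=True)
            group = [m for m in scored
                     if classifier.score(m) >= HAM_CUTOFF][:GROUP_SIZE]
            if not group:
                break  # every ham message now scores as ham
            for msg in group:
                # Assumes training a message lowers its future score,
                # so the loop makes progress and terminates.
                classifier.learn_ham(msg)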
> > If you use the Outlook plugin,
>
> No offense to all the Outlook users out there, but I avoid it like the
> plague. I'm using sb_imapfilter and doing the filtering server-side.

No offense taken. This is a public mailing list for a spam filtering program with a specific version for Outlook, so how to use it with Outlook is of interest to a lot of readers.

> > just move the ham you want to train on to the unsure folder and tell
> > Spambayes it's not spam. How much trained ham/spam imbalance is too
> > much is also up for debate. Some people have reported good results
> > with 5:1 and even 10:1 imbalance, while others do poorly under those
> > conditions.
>
> Sounds pretty indefinite. What's poorly mean?

It's deliberately indefinite, as results vary between setups. I can tell you that my setup has been operating at around 5% unsures, 0.5% false negatives (spam in the inbox), and perhaps one false positive (ham in the spam folder) per year for a long time. This seems to be typical, though 0.1% false positives might be more common. My current training set has around 250 ham and 500 spam. What kind of performance do you see?

> > I try to avoid mine going further than 2:1 and train on
> > my highest scoring ham to fix it. This seems to work better for me
> > than training only on unsures.
>
> I don't get nearly enough unsures that are ham to correct the
> imbalance that way.

The strategy you imply is train-on-all-unsures, which happens to be the method the Outlook plugin is built around, because it is easy to understand and generally works well. One problem is that, over time, training only on unsures tends to produce a training set with a lot more spam than ham, and this sometimes causes the classifier to perform poorly (more weasel words). If that is your problem, the fix is to train on additional ham that already classifies correctly, and the only way to find out is to train on more ham and see whether it helps.
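In code, that top-up step might look something like the sketch below. It assumes the classifier keeps nham/nspam training counts and that learn_ham() updates them; again, the names are illustrative, not the real Spambayes API.

    MAX_RATIO = 2.0    # allow at most 2x as much trained spam as ham
    HAM_CUTOFF = 0.20  # "already classifies correctly" = below this

    def rebalance_with_ham(classifier, ham_folder):
        """Top up trained ham until nspam/nham is within the 2:1
        guideline (hypothetical API)."""
        # Prefer the highest-scoring ham that still classifies
        # correctly, per the preference described above.
        candidates = sorted(
            (m for m in ham_folder
             if classifier.score(m) < HAM_CUTOFF),
            key=classifier.score, reverse=True)
        for msg in candidates:
            if classifier.nspam <= MAX_RATIO * classifier.nham:
                break  # ratio is back within the guideline
            classifier.learn_ham(msg)  # assumed to increment nham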
> Please let us know what you try, what helps and what doesn't.
>
> I will, but aren't you afraid there are just too many levers to pull,
> what with all the configuration options and legit approaches to
> training? Seems like it would be hard to learn much from user
> feedback.

There are quite a few variables, and I appreciate your willingness to report back. The developers do read this list, and your results will be noted. As for what is learned from whom: there has been a lot of careful testing by a lot of people using a purpose-built testing system, but it's good to keep doing reality checks. If what you report reinforces the current view, that's good news; if there are persistent reports that disagree, then there is something to look at. So yes, end-user feedback is very helpful.

--
Seth Goodman
