> I think the problem is more that Spambayes doesn't do anything to
> encourage sensible training schemes.

I don't agree here. The Outlook plug-in encourages train-on-error, because the simplest way to train is to click the 'Spam' or 'Not Spam' buttons for mistakes (or to drag the messages to their proper place). Train-on-error (fpfnunsure) seems to be one of the best regimes based on the testing done so far. (The plug-in wizard probably encourages people too strongly to do initial training, which should be changed, I think.)

sb_server was recently changed to encourage train-on-error (fpfnunsure) as well (this will make it into 1.1a2 if I ever find time to do a release, or if someone else does one). The default action for ham and spam is 'discard', and for unsure it is 'defer', which encourages people to train only on unsures (and presumably on fp and fn as corrections).

> It wouldn't be responsible for the
> developers to force one scheme or another on the users, since there is
> no proof that any one particular scheme would work for the majority of
> users.

I think that the testing that has been done certainly indicates that fpfnunsure, nonedge, and tte are all superior to train-on-everything in almost any situation. (My TREC tests are the main counterexample I can think of, but they are clouded by the lack of the unsure range.)

I think that the developers should set things up so that the simplest regime for users is the one most likely to give good results, while allowing users to use something else if they like. I think sb_server does this fairly well, since it's easy to change the default actions so that you get train-on-everything with the least amount of work, or nonedge with the least amount of work.

> For example, a lot of spam has "word salad" added as hidden text to
> confuse Bayesian filters like Spambayes. [...]

Random 'word salad' has most often been shown to help statistical filters like SpamBayes, not harm them.
People tend to use a fairly small vocabulary (compared to the entire vocabulary of the language) in their email (this is especially true if work and personal email are segregated). As such, a randomly selected word is more likely to fall outside the user's typical email vocabulary than inside it. This means it'll either not have been seen before (and be ignored), or have been seen in spam (particularly other 'word salad' spam) and actually increase the message score. Cleverer spam that includes less random noise (e.g. newspaper clippings) is more of an issue.

> That's only if you define training on every unsure as using Spambayes
> correctly. I disagree on that particular point, though the operating
> instructions don't say this. Once Spambayes is operating well, you
> should probably not train on all the spam in the Unsure folder.

It is hard to explain this art to the average Outlook user, however. (Suggestions are welcome ;)

> Finally, unless
> Spambayes implements some form of pruning old messages from the
> database, [...]

Note that if pruning is done, it's not clear that age should be the deciding factor. Then what happens to that once-a-year ham?

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies (reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.
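
For anyone following the thread, the difference between train-on-everything and train-on-error (fpfnunsure) discussed above can be sketched roughly as follows. This is only an illustration, not SpamBayes code: the function names, the regime labels used as strings, and the cutoff values are all assumptions made for the example.

```python
# Illustrative sketch only; none of these names come from the SpamBayes source.
HAM_CUTOFF = 0.20   # assumed: scores below this are treated as ham
SPAM_CUTOFF = 0.90  # assumed: scores above this are treated as spam


def classify(score):
    """Map a score in [0, 1] to 'ham', 'spam', or 'unsure'."""
    if score < HAM_CUTOFF:
        return "ham"
    if score > SPAM_CUTOFF:
        return "spam"
    return "unsure"


def should_train(regime, score, is_spam):
    """Decide whether a message is trained on under the given regime.

    'everything' trains on every message.
    'fpfnunsure' trains only on false positives, false negatives,
    and unsures, i.e. whenever the verdict is not the true class.
    """
    verdict = classify(score)
    truth = "spam" if is_spam else "ham"
    if regime == "everything":
        return True
    if regime == "fpfnunsure":
        return verdict != truth
    raise ValueError("unknown regime: %r" % (regime,))
```

Under these assumptions, a spam scored 0.99 is left untrained by fpfnunsure, while a spam scored 0.50 (unsure) or a ham scored 0.95 (a false positive) is trained on.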
