> My ISP provides a spam filtering service (server side) that labels the
> things that they think are spam by putting an extra string in the
> subject like (e.g. "--Spam--" at the front). Their filters don't
> catch everything so I want to also use SpamBayes to eliminate the spam
> that my ISP doesn't label. My question is whether or not I should
> train SpamBayes with the spams that get labeled by my ISP. I could
> easily see SpamBayes picking up on the "--Spam--" string in the
> subject line and filtering just based on that.

It's possible (even likely, assuming that their filter is any good) that "subject:--Spam--" will become a strong spam clue, but hopefully there would be enough ham clues in one of the ISP's false positives that SpamBayes would still be able to make a correct (or perhaps unsure) classification. The only way to know for sure is to give it a go. You can see how strong the "subject:--Spam--" clue is by looking at the clues for a message with such a modified subject.

The situation with my work email is similar - I can opt out of their spam filtering, but that means that they will prepend "[SPAM]" to the subject. I ignore their classification and SpamBayes still works fine. However, they have some really terrible false positives, which means that "subject:[SPAM]" isn't as strong a clue as it would be otherwise.

> On the other hand maybe that would introduce some selection bias or a
> bad spam vs ham ratio for training (e.g. maybe I'll get 50 ham, 40 spam
> caught by my ISP, and 10 spam not caught by my ISP (I don't know what
> the ratio is yet, I only just started using my ISP's filter)).

It's all guesswork at the moment, but you might find that it helps with keeping a ~1:1 ratio. Ham tends to be reasonably homogeneous, so you generally need to train on less of it than spam (assuming you're doing some sort of train-on-errors-and-unsures training), so this might help balance that out.

> Does anyone have any advice on whether these might interfere or how to
> avoid that interference? Should I even be using my ISP's filter along
> with SpamBayes or just SpamBayes by itself?

If the ISP's filter is reasonably good, then you might as well use it as well; plenty of people like this sort of tiered filter system. I expect that you'll find that it doesn't interfere at all; the only way to know for sure is to try it out, though.

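In case it helps to see why one strong spam clue need not sink an otherwise hammy message, here is a rough sketch of chi-squared combining, which is roughly how SpamBayes turns per-token clue probabilities into a final score. The clue values below are invented for illustration (they are not from a real training database), and the function names are mine, not part of the SpamBayes API.

```python
import math

def chi2q(x2, dof):
    """Probability that a chi-squared variable with `dof` degrees of
    freedom exceeds x2. Only valid for even `dof`."""
    assert dof % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Combine per-clue spam probabilities into one score in [0, 1].
    Near 1 means spam, near 0 means ham, the middle means unsure."""
    n = len(probs)
    if n == 0:
        return 0.5
    # Large when the clues are spammy (p close to 1).
    spam_stat = -2.0 * sum(math.log(1.0 - p) for p in probs)
    # Large when the clues are hammy (p close to 0).
    ham_stat = -2.0 * sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(spam_stat, 2 * n)
    h = 1.0 - chi2q(ham_stat, 2 * n)
    return (s - h + 1.0) / 2.0

# Hypothetical clue strengths, made up for this example.
tag_clue = 0.99          # "subject:--Spam--" after training on labeled spam
ham_clues = [0.05] * 10  # ten moderately hammy tokens in the body

print("tag alone:      %.3f" % combine([tag_clue]))
print("tag + ham body: %.3f" % combine([tag_clue] + ham_clues))
```

With these made-up numbers the tag on its own scores close to 1, while the same tag alongside ten hammy clues scores close to 0. Real messages will have a messier mix of clues, so treat this as an illustration of the idea rather than a prediction for your mail.
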
(Maybe after training for a while, you could get someone to send you a ham message with "--Spam--" in the subject, and see if the hammy clues are enough to get it through.)

Let us know if you find out anything interesting! :)

=Tony.Meyer

--
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.
