Re: phish/bayes

Matt Kettler Fri, 26 Aug 2005 11:02:09 -0700

At 06:49 PM 8/25/2005, satalk (sent by Nabble.com) wrote:

MY question is as follows:
Given that so many valid tokens from ebay/paypal sites
exist in phish emails, am I correct in saying that it is
imperative to avoid phish emails entering the bayes database?

I would say it's imperative NOT to avoid training phish mails. To avoid training them is to intentionally poison your database.

Don't ever avoid training a spam because it's got "ham like" content. This includes phish mails, "bayes poison" etc. Train them all. If it is spam, train it as spam. Period.

Remember, your bayes DB can only be as accurate as your training is. If your training isn't realistic, your bayes DB won't work well on realistic email.

It's a common misconception that training ham-like spam will poison your bayes DB. This problem might exist in very crude bayes implementations, but most bayes implementations, including SA, are largely immune to this.

SA's use of chi-squared combining makes it very resistant to being "poisoned" into creating FPs by training nonspam text inside spam. Most tokens that are seen in both spam and ham are given very little weight by the chi-squared combining.

On the other hand, failing to train those same messages leaves SA very prone to FNs on them in the future. If a token is only ever seen in ham, it's given a very strong ham weight in the chi-squared combining, so the ebay/paypal tokens in an untrained phish pull its score toward ham.
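
To put rough numbers on that, here is a minimal sketch of Robinson-style chi-squared combining in Python. It is not SA's actual code: the function names, the 0.01/0.99 clamp, the simple frequency-ratio token probability, and the token counts are all assumptions made for illustration. It scores the same phish twice, once with the brand tokens seen only in ham and once after an equal number of phish containing them have been trained as spam.

```python
import math

def chi2q(x2, v):
    """Upper-tail probability of a chi-squared variable, for even v only."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def token_prob(spam_hits, ham_hits, nspam, nham):
    """Simplified spam probability for one token (assumed formula, not SA's)."""
    if spam_hits == 0 and ham_hits == 0:
        return 0.5  # never seen: no evidence either way
    s = spam_hits / nspam
    h = ham_hits / nham
    # Clamp so a single token can never be absolute proof.
    return min(0.99, max(0.01, s / (s + h)))

def combine(probs):
    """Chi-squared combining: near 1.0 is spam, near 0.0 is ham, 0.5 is unsure."""
    n = len(probs)
    if n == 0:
        return 0.5
    spamminess = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    hamminess = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spamminess - hamminess + 1.0) / 2.0

# Hypothetical corpus: 1000 spam and 1000 ham messages trained.
NSPAM, NHAM = 1000, 1000

# A brand token such as "paypal", seen in 200 hams.
brand_untrained = token_prob(0, 200, NSPAM, NHAM)   # phish never trained
brand_trained = token_prob(200, 200, NSPAM, NHAM)   # phish trained as spam

# A token that only ever shows up in the phish themselves.
phishy = token_prob(200, 0, NSPAM, NHAM)

# One phish message: 4 phish-specific tokens plus 8 brand tokens.
score_untrained = combine([phishy] * 4 + [brand_untrained] * 8)
score_trained = combine([phishy] * 4 + [brand_trained] * 8)

print("phish never trained:   ", round(score_untrained, 3))
print("phish trained as spam: ", round(score_trained, 3))
```

In this toy setup the untrained case ends up on the ham side of 0.5 because the ham-only brand tokens outvote the phish-specific ones, while in the trained case the brand tokens sit at 0.5 and the phish-specific tokens decide the score.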
