At 06:49 PM 8/25/2005, satalk (sent by Nabble.com) wrote:
My question is as follows: Given that so many valid tokens from ebay/paypal sites exist in phish emails, am I correct in saying that it is imperative to avoid phish emails entering the bayes database?
I would say it's imperative NOT to avoid training phish mails. To avoid training them is to intentionally poison your database.
Never skip training a spam just because it has "ham-like" content. This includes phish mails, "bayes poison", etc. Train them all. If it is spam, train it as spam. Period.
Remember, your bayes db can only be as accurate as your training. If your training isn't realistic, your bayes db won't work well on real-world email.
It's a common misconception that training ham-like spam will poison your bayes db. This problem might exist in very crude bayes implementations, but most bayes implementations, including SpamAssassin (SA), are largely immune to it.
SA's use of chi-squared combining makes it very resistant to being "poisoned" into creating FPs by training nonspam text inside spam. Most tokens that are seen in both spam and ham are given very little weight by the chi-squared combining.
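To illustrate why shared tokens carry little weight, here is a minimal Python sketch of a Robinson-style smoothed token probability, the kind of per-token value that chi-squared combining takes as input. This is not SA's actual code: the function name `token_prob`, the smoothing constants `s` and `x`, and the message counts are all assumptions picked for illustration.

```python
def token_prob(spam_hits, ham_hits, nspam, nham, s=1.0, x=0.5):
    """Smoothed spam probability for one token (illustrative, not SA's code).

    spam_hits / ham_hits: trained spam / ham messages containing the token.
    nspam / nham: total trained spam / ham messages.
    s, x: assumed smoothing strength and prior for rarely seen tokens.
    """
    spam_rate = spam_hits / nspam
    ham_rate = ham_hits / nham
    total = spam_rate + ham_rate
    raw = spam_rate / total if total else x
    n = spam_hits + ham_hits
    # Blend the raw estimate with the prior; more sightings means less prior.
    return (s * x + n * raw) / (s + n)


# Hypothetical corpus: 1000 trained spam and 1000 trained ham messages.
shared = token_prob(200, 200, 1000, 1000)    # seen equally in both
ham_only = token_prob(0, 200, 1000, 1000)    # never trained as spam
spam_only = token_prob(200, 0, 1000, 1000)   # never seen in ham

print(f"shared={shared:.4f} ham_only={ham_only:.4f} spam_only={spam_only:.4f}")
```

With these made-up counts the shared token lands at the neutral midpoint, while the ham-only and spam-only tokens land close to 0 and close to 1. A token sitting near 0.5 adds almost nothing to either side when the probabilities are combined.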
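On the other hand, failing to train those same messages leaves SA prone to FNs on similar messages in the future. A token that has only ever been seen in ham is given a very strong nonspam weight in the chi-squared combining, so the legitimate ebay/paypal tokens in an untrained phish pull its score toward ham.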
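The effect of skipping that training can be sketched with a simplified chi-squared (Fisher) combiner of the kind used by Robinson-style Bayes filters. This is an illustration under stated assumptions, not SA's real implementation: the helper names `chi2q` and `combine` and every token probability below are hypothetical values chosen to show the trend.

```python
import math


def chi2q(x2, v):
    """Upper-tail probability of chi-squared value x2 with v degrees of freedom.

    Closed-form series that is only valid for even v.
    """
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    spam_ev = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    ham_ev = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spam_ev - ham_ev + 1.0) / 2.0


# Hypothetical phish: five clearly spammy tokens plus five brand tokens.
scam_tokens = [0.99] * 5
brand_if_phish_trained = [0.5] * 5     # seen in both spam and ham: neutral
brand_if_phish_skipped = [0.01] * 5    # only ever seen in ham: strongly hammy

score_trained = combine(scam_tokens + brand_if_phish_trained)
score_skipped = combine(scam_tokens + brand_if_phish_skipped)

print(f"phish trained as spam: {score_trained:.3f}")
print(f"phish never trained:   {score_skipped:.3f}")
```

With these assumed inputs the first score stays close to 1, because the neutral brand tokens barely count. The second falls to roughly 0.5, because the ham-only brand tokens cancel out the spammy ones. That drop is the FN risk described above.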