-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Mark A. DeMichele writes:
> This is exactly what I was talking about in a previous post.
>
> If spammers start doing this and they make sure this bogus section is
> larger than the actual spam section, the bayes filter will probably mark
> it as ham. What's worse is if you then force the bayes filter to learn
> this as spam, now you just increased the spam score for each of these
> good words. If this happens over and over again, I would imagine that
> the bayes filter would malfunction. At least that's my opinion, but
> feel free to disagree.

Good, will do.

The idea of bayes is that you train it on

  1. *YOUR* ham
  2. *YOUR* spam

Unless spammers figure out what *YOU* call ham, they can add random
words, bits of Russian literature, snippets of Tom Sawyer until the cows
come home.

For spammers to effectively "poison" bayes, they need to figure out what
kind of text *YOU* have trained on. If I don't receive copies of Tom
Sawyer by email normally, then sure, 19th-century US lit will become a
spam sign. But I don't care because *I DON'T* receive copies of Tom
Sawyer by email, normally. (I reserve that honour for snailmail, or
occasionally by FTP.) So it's not going to wind up misclassifying
anything as a result.

In the worst case, they'll find one or two strong ham-sign words -- like
'Kits', or 'entries' (for my corpus). Worst case? I retrain on their
mail, and those tokens become about even ham and spam counts, 0.5
probability, and are *ignored* by the Bayes calculation in future.

*PLEASE* read up on how Bayes works. READ John Graham-Cumming's
presentation from the last Spam Conf, and NOTICE how it took him
thousands of iterations of bayes-poisoning, sending a mail each time
with a direct feedback loop, to get a single spam through.

The sky is NOT falling, guys!

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFALWYuQTcbUG5Y7woRAuzOAKCRRNOx7r2SD/PpyKRAIcthNsC9JgCg7drd
468mo+BQ7BGH/Ix5OfEXg/E=
=w6ao
-----END PGP SIGNATURE-----
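To make the retraining argument concrete, here is a minimal sketch of how a poisoned ham-sign token drifts to 0.5 and drops out of the calculation. It assumes a simplified Graham-style per-token probability (spam frequency over spam-plus-ham frequency) and a cutoff that skips tokens close to 0.5. The class `TinyBayes`, the `min_strength` value, and the toy messages are all hypothetical; this is not the code any real filter uses.

```python
from collections import Counter


class TinyBayes:
    """Toy per-token Bayes model (illustrative only, not a real filter)."""

    def __init__(self, min_strength=0.1):
        self.ham = Counter()    # token -> number of ham messages containing it
        self.spam = Counter()   # token -> number of spam messages containing it
        self.n_ham = 0
        self.n_spam = 0
        # Assumed cutoff: tokens within this distance of 0.5 are ignored.
        self.min_strength = min_strength

    def learn(self, text, is_spam):
        tokens = set(text.lower().split())  # count each token once per message
        if is_spam:
            self.spam.update(tokens)
            self.n_spam += 1
        else:
            self.ham.update(tokens)
            self.n_ham += 1

    def token_prob(self, token):
        """Spam probability of a token, or None if it was never seen."""
        h = self.ham[token] / self.n_ham if self.n_ham else 0.0
        s = self.spam[token] / self.n_spam if self.n_spam else 0.0
        if h + s == 0:
            return None
        return s / (s + h)

    def informative_tokens(self, text):
        """Tokens that would actually take part in the final score."""
        result = {}
        for token in set(text.lower().split()):
            p = self.token_prob(token)
            if p is not None and abs(p - 0.5) >= self.min_strength:
                result[token] = p
        return result


f = TinyBayes()
for _ in range(4):
    f.learn("new kits arrived today", is_spam=False)
    f.learn("meeting notes attached", is_spam=False)
    f.learn("buy cheap pills", is_spam=True)

before = f.token_prob("kits")   # 0.0: a strong ham sign in this toy corpus

# The spammer guesses the ham word; we retrain on those mails as spam.
for _ in range(4):
    f.learn("buy cheap pills kits", is_spam=True)

after = f.token_prob("kits")    # 0.5: even ham and spam frequency
used = f.informative_tokens("kits pills")   # {'pills': 1.0}; 'kits' is skipped
```

In this toy corpus the guessed word stops counting for either side once it has been retrained, while the words that actually sell the product still score as spam.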
