Hi folks -- in case you haven't yet seen this:
"This paper shows how an adversary can exploit statistical machine learning, as used in the SpamBayes spam filter, to render it useless -- even if the adversary's access is limited to only 1% of the training messages. We further demonstrate a new class of focused attacks that successfully prevent victims from receiving specific email messages. Finally, we introduce two new types of defenses against these attacks."

http://www.usenix.org/event/leet08/tech/full_papers/nelson/nelson_html/

Basically, it measures the effect of loading spam with huge dictionaries in order to drive up false-positive rates once the filter has trained on that mail.

I'd be interested to hear what people think. Personally:

1. This is very similar to http://www.cs.dal.ca/research/techreports/2004/CS-2004-06.shtml , and I haven't seen spammers use those attacks in the intervening 4 years.

2. I wonder how big the messages have to be in order to affect training with a relatively small number of messages. Limiting the number of tokens trained on per message might help.

It might be worthwhile implementing the described "RONI" scheme anyway, to mitigate the less targeted form of the attack.

--j.

_______________________________________________
SpamBayes@python.org
http://mail.python.org/mailman/listinfo/spambayes
Info/Unsubscribe: http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
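For what it's worth, here is a minimal sketch of the per-message token cap suggested in point 2 above. It assumes a simple naive-Bayes-style token database; the `db` layout, function name, and the cap of 150 tokens are all hypothetical illustrations, not the actual SpamBayes API or tuned values:

```python
from collections import Counter

def train(db, tokens, is_spam, max_tokens_per_msg=150):
    """Update token counts for one message, but count at most
    max_tokens_per_msg distinct tokens, so a dictionary-stuffed spam
    cannot skew thousands of token statistics in a single training pass.

    db is a hypothetical store:
        {"spam": Counter(), "ham": Counter(), "nspam": 0, "nham": 0}
    """
    # Keep only the first N *distinct* tokens, preserving order.
    unique = list(dict.fromkeys(tokens))[:max_tokens_per_msg]
    bucket = "spam" if is_spam else "ham"
    db[bucket].update(unique)
    db["nspam" if is_spam else "nham"] += 1

# Example: a dictionary-attack spam carrying 100,000 distinct tokens
# only contributes 150 of them to the spam counts.
db = {"spam": Counter(), "ham": Counter(), "nspam": 0, "nham": 0}
train(db, ["w%d" % i for i in range(100000)], is_spam=True)
```

The cap does not stop the attack outright, but it bounds the damage any single poisoned message can do, which is the same intuition behind RONI's per-message impact test.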