I'm sure many of you are aware that spamming of the submission forms on
blogs and other websites is a large and growing problem.  The Mojam and
Musi-Cal concert websites suffered from the same malady.  I originally
considered implementing some sort of CAPTCHA scheme:

    http://en.wikipedia.org/wiki/Captcha

but that has limitations and would have required changes to all the
submission forms on the websites.  I decided instead to implement a
SpamBayes-based solution in our XML-RPC server.  It has a few distinct
advantages:

* It has none of the CAPTCHA gotchas.

* It is implemented at a single point in the system.

* No changes to the Web interface were required, so users don't have to
  learn anything new.

I'll give you a quick sketch of what I did to solve this problem.  If
you'd like more details, drop me a note.

When someone submits concert dates to our sites, the submission is
represented as a simple dictionary.  A valid submission will have
information about who's performing, a date in the future, valid location
information, etc.  In contrast, when someone spams the submission forms
the dictionary often contains bogus information or is missing some fields
altogether.  For example, if the spammer puts something in the date fields
it's likely to be garbage which won't parse properly, resulting in a
default date of 1900-01-01.  Similarly, the city/state/country is likely
to be invalid, so we won't be able to find lat/long info.

The dictionary is preprocessed into a string of tokens which includes the
obvious text that was part of the submission, but which also contains
synthetic tokens.  Here's a spammer's entry represented as text:

    Bradyn Maximus Ty [EMAIL PROTECTED] 1900-01-01 Jerald kwds:False
    kwds-private:False Malcom 1900-01-01 Jarod date:ancient perflen:1
    infolen:1 hasphone:False hasprice:False city:unknown venue:present

Here's a valid entry represented as text:

    Anchorage [EMAIL PROTECTED] 2006-10-07 kwds:True kwds-private:True
    .bl.1348 .ra LaVette,Bettye 2006-10-07 AK Discovery Theatre
    date:current perflen:1 infolen:0 hasphone:False hasprice:False
    city:known venue:present
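To make the preprocessing concrete, here's a minimal sketch of the idea.
The field names, helpers, and thresholds are hypothetical (our real code
handles more fields and edge cases), but it shows how the synthetic tokens
get mixed in with the submitted text:

    import datetime

    def submission_to_text(sub):
        """Flatten a submission dict into a token string for SpamBayes.

        Sketch only -- the field names ('performers', 'city', etc.) are
        illustrative, not our production schema.
        """
        parts = []

        # Pass the raw text fields through untouched.
        for field in ('performers', 'email', 'venue', 'city', 'info'):
            parts.append(str(sub.get(field, '')))

        # Synthesize tokens describing how plausible the entry looks.
        # Unparseable dates default to 1900-01-01, tagged as "ancient".
        date = sub.get('date', datetime.date(1900, 1, 1))
        parts.append(date.year < 2000 and 'date:ancient' or 'date:current')

        # If we couldn't geocode the city/state/country, flag it.
        parts.append(sub.get('latlong') and 'city:known' or 'city:unknown')

        parts.append(sub.get('venue') and 'venue:present' or 'venue:absent')
        parts.append('hasphone:%s' % bool(sub.get('phone')))
        parts.append('hasprice:%s' % bool(sub.get('price')))
        parts.append('perflen:%d' % len(sub.get('performers', '').split()))

        return ' '.join(parts)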
The synthetic tokens that suggest problems are such huge red flags for the
classifier that after training on just a couple of these bad boys the
rejection rate of spam submissions seems to be 100%.

Of course, this sort of spamming is probably still in its infancy, so I
expect we might eventually see some sort of arms race develop, as has been
true for email spam.  I'm not too worried about that though, because for
the most part I think the spammers' primary target is the blogosphere with
its ubiquitous comment feature, not specialized websites like ours.

The tokenizer class is quite simple.  I post it here in its entirety.
Note that major bits of it were just pasted from the default tokenizer.

    from spambayes.tokenizer import log2, Tokenizer, numeric_entity_re, \
         numeric_entity_replacer, crack_urls, breaking_entity_re, html_re, \
         tokenize_word

    class Tokenizer(Tokenizer):
        def tokenize(self, text):
            maxword = 20

            # Replace numeric character entities (like &#97; for the
            # letter 'a').
            text = numeric_entity_re.sub(numeric_entity_replacer, text)

            # Normalize case.
            text = text.lower()

            # Get rid of uuencoded sections, embedded URLs, <style gimmicks,
            # and HTML comments.
            for cracker in (crack_urls,):
                text, tokens = cracker(text)
                for t in tokens:
                    yield t

            # Remove HTML/XML tags.  Also &nbsp;.  <br> and <p> tags
            # should create a space too.
            text = breaking_entity_re.sub(' ', text)

            # It's important to eliminate HTML tags rather than, e.g.,
            # replace them with a blank (as this code used to do), else
            # simple tricks like
            #    Wr<!$FS|i|R3$s80sA >inkle Reduc<!$FS|i|R3$s80sA >tion
            # can be used to disguise words.  <br> and <p> were special-
            # cased just above (because browsers break text on those,
            # they can't be used to hide words effectively).
            text = html_re.sub('', text)

            # Tokenize everything in the body.
            for w in text.split():
                n = len(w)
                # Make sure this range matches in tokenize_word().
                if 3 <= n <= maxword:
                    yield w
                elif n >= 3:
                    for t in tokenize_word(w):
                        yield t

The only thing I found a bit frustrating was that I had to override the
Hammie class to provide an alternate tokenizer:

    from spambayes import hammie

    class Hammie(hammie.Hammie):
        def __init__(self, bayes):
            hammie.Hammie.__init__(self, bayes)
            self.tokenizer = Tokenizer()

        def _scoremsg(self, msg, evidence=False):
            return self.bayes.spamprob(self.tokenizer.tokenize(msg),
                                       evidence)

        def train(self, msg, is_spam, add_header=False):
            self.bayes.learn(self.tokenizer.tokenize(msg), is_spam)

I think it would be a bit more general if the Hammie class accepted an
optional tokenizer to avoid this.

So far I've trained on 338 hams and 26 spams.  (I have a new guy I'm
breaking in who is not experienced with SpamBayes.  I see a number of
entries in the ham data I would not have added there.  I expect I can
probably reduce the ham data size to 200 or less.)  The BerkDB file
containing the token database is a whopping 86 Kbytes.  I somewhat
arbitrarily set the ham cutoff at 0.15 and the spam cutoff at 0.60.
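For completeness, here's roughly how the pieces could be wired together in
the XML-RPC server.  This is a simplified sketch, not our actual handler:
it assumes the standard spambayes storage module for the Berkeley DB
database and the hypothetical submission_to_text() helper sketched
earlier; the database path and return values are placeholders:

    from spambayes import storage

    # Open (or create) the Berkeley DB token database.  The path is a
    # placeholder.
    bayes = storage.DBDictClassifier("concerts.db")
    scorer = Hammie(bayes)          # the Hammie subclass shown above

    HAM_CUTOFF = 0.15
    SPAM_CUTOFF = 0.60

    def handle_submission(sub):
        # Flatten the submission dict, then score the resulting text.
        text = submission_to_text(sub)
        score = scorer._scoremsg(text)
        if score >= SPAM_CUTOFF:
            return "rejected"       # almost certainly spam
        elif score >= HAM_CUTOFF:
            return "held"           # possible spam; review by hand
        return "accepted"

    def train_example(sub, is_spam):
        # Training happens separately, from hand-sorted submissions.
        scorer.train(submission_to_text(sub), is_spam)
        bayes.store()               # flush the token database to disk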
<br> and <p> were special- # cased just above (because browsers break text on those, # they can't be used to hide words effectively). text = html_re.sub('', text) # Tokenize everything in the body. for w in text.split(): n = len(w) # Make sure this range matches in tokenize_word(). if 3 <= n <= maxword: yield w elif n >= 3: for t in tokenize_word(w): yield t The only thing I found a bit frustrating was that I had to override the Hammie class to provide an alternate tokenizer class: class Hammie(hammie.Hammie): def __init__(self, bayes): hammie.Hammie.__init__(self, bayes) self.tokenizer = Tokenizer() def _scoremsg(self, msg, evidence=False): return self.bayes.spamprob(self.tokenizer.tokenize(msg), evidence) def train(self, msg, is_spam, add_header=False): self.bayes.learn(self.tokenizer.tokenize(msg), is_spam) I think it would be a bit more general if the Hammie class accepted an optional tokenizer to avoid this. So far I've trained on 338 hams and 26 spams. (I have a new guy I'm breaking in who is not experienced with SpamBayes. I see a number of entries in the ham data I would not have added there. I expect I can probably reduce the ham data size to 200 or less.) The BerkDB file containing the token database is a whopping 86KBytes. I somewhat arbitrarily set the ham cutoff at 0.15 and the spam cutoff at 0.60. The true spam seems to so far all land in the 0.95-1.0 range. I see some "possible spam" in the 0.16-0.2 range. Most of the time that's because the submitter forgot to enter a date (or entered it incorrectly) or misspelled the city. Skip _______________________________________________ spambayes-dev mailing list spambayes-dev@python.org http://mail.python.org/mailman/listinfo/spambayes-dev