I'm sure many of you are aware that spamming of the submission forms on
blogs and other websites is a large and increasing problem.  The Mojam and
Musi-Cal concert websites suffered from the same malady.  I originally
considered implementing some sort of CAPTCHA scheme:

    http://en.wikipedia.org/wiki/Captcha

but that has limitations and would have required changes to all submission
forms on the websites.  I decided instead to implement a SpamBayes-based
solution in our XML-RPC server.  That approach has a few distinct advantages:

    * It has none of the CAPTCHA gotchas.
    * It is implemented at a single point in the system.
    * No changes to the Web interface were required, so users don't have to
      learn anything new.

I'll give you a quick sketch of what I did to solve this problem.  If you'd
like more details, drop me a note.

When someone submits concert dates to our sites, the submission is
represented as a simple dictionary.  A valid submission will have
information about who's performing, a date in the future, valid location
information, etc.  In contrast, when someone spams the submission forms,
the dictionary often contains bogus information or is missing some fields
altogether.  For example, if the spammer puts something in the date
fields, it's likely to be garbage that won't parse properly, resulting in
a default date of 1900-01-01.  Similarly, the city/state/country is
likely to be invalid, so we won't be able to find lat/long info.
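
To make that concrete, here's roughly what the two cases might look like
(the field names are illustrative, not our actual schema):

    # Illustrative field names -- not our actual schema.
    valid = {
        'performers': ['LaVette,Bettye'],
        'date': '2006-10-07',               # parses, and is in the future
        'city': 'Anchorage',
        'state': 'AK',
        'venue': 'Discovery Theatre',
    }
    spam = {
        'performers': ['Bradyn Maximus Ty'],
        'date': 'blah blah',                # garbage -- defaults to 1900-01-01
        'city': 'Jerald',                   # no lat/long match
    }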

The dictionary is preprocessed into a string of tokens that includes the
obvious text from the submission plus a number of synthetic tokens.
Here's a spammer's entry represented as text:

    Bradyn Maximus Ty [EMAIL PROTECTED] 1900-01-01 Jerald kwds:False
    kwds-private:False Malcom 1900-01-01 Jarod date:ancient perflen:1
    infolen:1 hasphone:False hasprice:False city:unknown venue:present

Here's a valid entry represented as text:

    Anchorage [EMAIL PROTECTED] 2006-10-07 kwds:True kwds-private:True .bl.1348
    .ra LaVette,Bettye 2006-10-07 AK Discovery Theatre date:current
    perflen:1 infolen:0 hasphone:False hasprice:False city:known
    venue:present
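
The synthetic tokens come from simple validity checks on those fields.
Here's a minimal sketch of that step (the function name, key names, and
the find_latlong helper are mine, not the production code):

    import time

    def synthesize_tokens(sub, find_latlong):
        # Sketch only: `sub` is the submission dictionary and
        # `find_latlong` stands in for our real lat/long lookup.
        parsed = None
        try:
            parsed = time.strptime(sub.get('date', ''), '%Y-%m-%d')
        except ValueError:
            pass
        if parsed is None or parsed.tm_year == 1900:
            tokens = ['date:ancient']   # garbage dates default to 1900-01-01
        else:
            tokens = ['date:current']
        tokens.append('perflen:%d' % len(sub.get('performers', [])))
        tokens.append('infolen:%d' % len(sub.get('info', [])))
        tokens.append('hasphone:%s' % bool(sub.get('phone')))
        tokens.append('hasprice:%s' % bool(sub.get('price')))
        if find_latlong(sub.get('city'), sub.get('state')) is None:
            tokens.append('city:unknown')
        else:
            tokens.append('city:known')
        return ' '.join(tokens)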

The synthetic tokens that suggest problems are such huge red flags for the
classifier that, after training on just a couple of these bad boys, the
rejection rate of spam submissions seems to be 100%.  Of course, this sort
of spamming is probably still in its infancy, so I expect we might
eventually see some sort of arms race develop, as has been true for email
spam.  I'm not too worried about that, though, because for the most part I
think the spammers' primary target is the blogosphere with its ubiquitous
comment feature, not specialized websites like ours.

The tokenizer class is quite simple.  I post it here in its entirety.  Note
that major bits of it were just pasted from the default tokenizer.

    from spambayes.tokenizer import log2, Tokenizer, numeric_entity_re, \
         numeric_entity_replacer, crack_urls, breaking_entity_re, html_re, \
         tokenize_word

    class Tokenizer(Tokenizer):
        def tokenize(self, text):
            maxword = 20
            # Replace numeric character entities (like &#97; for the
            # letter 'a').
            text = numeric_entity_re.sub(numeric_entity_replacer, text)

            # Normalize case.
            text = text.lower()

            # Crack out embedded URLs.  (The stock tokenizer also strips
            # uuencoded sections, <style gimmicks, and HTML comments at
            # this point; only the URL cracker is needed here.)
            for cracker in (crack_urls,):
                text, tokens = cracker(text)
                for t in tokens:
                    yield t

            # Remove HTML/XML tags.  Also &nbsp;.  <br> and <p> tags should
            # create a space too.
            text = breaking_entity_re.sub(' ', text)
            # It's important to eliminate HTML tags rather than, e.g.,
            # replace them with a blank (as this code used to do), else
            # simple tricks like
            #    Wr<!$FS|i|R3$s80sA >inkle Reduc<!$FS|i|R3$s80sA >tion
            # can be used to disguise words.  <br> and <p> were special-
            # cased just above (because browsers break text on those,
            # they can't be used to hide words effectively).
            text = html_re.sub('', text)

            # Tokenize everything in the body.
            for w in text.split():
                n = len(w)
                # Make sure this range matches in tokenize_word().
                if 3 <= n <= maxword:
                    yield w

                elif n >= 3:
                    for t in tokenize_word(w):
                        yield t
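
Used standalone it does about what you'd expect: each whitespace-separated
chunk between 3 and 20 characters comes through lowercased, and anything
longer is cracked apart by tokenize_word().  So this should give:

    >>> tok = Tokenizer()
    >>> list(tok.tokenize('Anchorage 2006-10-07 city:known venue:present'))
    ['anchorage', '2006-10-07', 'city:known', 'venue:present']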

The only thing I found a bit frustrating was that I had to override the
Hammie class to provide an alternate tokenizer class:

    from spambayes import hammie

    class Hammie(hammie.Hammie):
        def __init__(self, bayes):
            hammie.Hammie.__init__(self, bayes)
            self.tokenizer = Tokenizer()

        def _scoremsg(self, msg, evidence=False):
            return self.bayes.spamprob(self.tokenizer.tokenize(msg), evidence)

        def train(self, msg, is_spam, add_header=False):
            self.bayes.learn(self.tokenizer.tokenize(msg), is_spam)

I think it would be a bit more general if the Hammie class accepted an
optional tokenizer to avoid this.
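
Something along these lines would do it (a sketch of the suggested
interface, not the current spambayes code):

    class Hammie:
        # Sketch of the suggested interface -- not the current code.
        def __init__(self, bayes, tokenizer=None):
            self.bayes = bayes
            if tokenizer is None:
                # Fall back to the stock tokenizer.
                from spambayes.tokenizer import Tokenizer
                tokenizer = Tokenizer()
            self.tokenizer = tokenizer

        def _scoremsg(self, msg, evidence=False):
            # Tokenize with whatever tokenizer the instance holds.
            return self.bayes.spamprob(self.tokenizer.tokenize(msg),
                                       evidence)

With that, my subclass above collapses to Hammie(bayes, Tokenizer()).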

So far I've trained on 338 hams and 26 spams.  (I have a new guy I'm
breaking in who is not experienced with SpamBayes, and I see a number of
entries in the ham data I would not have added there.  I expect I can
probably reduce the ham data size to 200 or fewer.)  The BerkDB file
containing the token database is a whopping 86 KB.  I somewhat
arbitrarily set the ham cutoff at 0.15 and the spam cutoff at 0.60.  So
far all the true spam lands in the 0.95-1.0 range.  I see some "possible
spam" in the 0.16-0.2 range; most of the time that's because the
submitter forgot to enter a date (or entered it incorrectly) or misspelled
the city.
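
For completeness, here's how those cutoffs get applied when scoring a
submission (the function is a sketch; the names are mine):

    HAM_CUTOFF = 0.15
    SPAM_CUTOFF = 0.60

    def judge(score):
        # Bucket a spamprob score using the cutoffs above.
        if score < HAM_CUTOFF:
            return 'ham'            # accept the submission
        elif score >= SPAM_CUTOFF:
            return 'spam'           # reject it
        else:
            return 'unsure'         # possible spam -- worth a second look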

Skip