As long as you aren't trying to capture and deliver *all* tweets,
there are a couple of good ways to cut out spammers. One thing I do is
save all mentions for all users in a database of tweets. When a tweet
comes in from the streaming API, I collect its @mentions and store each
one with the screen name of the tweet's author and the screen name
mentioned. Then I can rank users based on the number of different
accounts that mention them. If you only use the tweets from the top N%
of users, the quality improves a lot. I find that the top 80% is
usually enough of a screen to get good quality.
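
A minimal sketch of the mention ranking described above, in Python. It keeps the (author, mentioned) pairs in memory instead of a database, and the names record_mentions, top_users and MENTION_RE are hypothetical, chosen only for this illustration. The regex is a simplification of how @mentions are parsed, and users who are never mentioned do not appear in the ranking at all.

```python
import re
from collections import defaultdict

# Simplified @mention pattern (screen names are up to 15 word characters).
# The lookbehind skips things like email addresses. This is an assumption,
# not Twitter's own parsing rule.
MENTION_RE = re.compile(r"(?<!\w)@(\w{1,15})")

# mentioned screen name -> set of distinct authors who mentioned it
mentioners = defaultdict(set)

def record_mentions(author, text):
    """Store each (author, mentioned) pair found in one tweet."""
    author = author.lower()  # screen names are case-insensitive
    for name in MENTION_RE.findall(text):
        name = name.lower()
        if name != author:  # ignore self-mentions
            mentioners[name].add(author)

def top_users(percent):
    """Return the top `percent`% of mentioned users, ranked by the
    number of different accounts that mention them."""
    ranked = sorted(mentioners, key=lambda u: (-len(mentioners[u]), u))
    keep = int(len(ranked) * percent / 100)
    return set(ranked[:keep])
```

In the parsing loop you would call record_mentions(author, text) for every incoming tweet, then only keep tweets whose author is in top_users(80).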

Another trick is blocking duplicates from each user. The API only
blocks duplicates that repeat immediately, but if a spammer has a list
of tweets and cycles through them, all the tweets get through. I
compare each new tweet with the other tweets from that user. This is
very expensive if you have a big database, but it can be made less
intensive by limiting the comparison to just that user's tweets from
the last few days. You can also run this in a separate process so that
it doesn't slow down your main tweet parsing loop. Most spammers are
so simplistic that they just repeat the same tweet over and over. In a
really spammy set of keywords, if I find more than a few duplicates from
a user, I just stop saving their tweets.
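
The duplicate check could look roughly like this in Python. It is only a sketch: it holds recent tweets in memory instead of a database, and the names accept_tweet, normalize, WINDOW and MAX_DUPLICATES are hypothetical. The 3-day window and the limit of 3 duplicates are assumed values standing in for "the last few days" and "more than a few".

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(days=3)  # assumed value for "the last few days"
MAX_DUPLICATES = 3          # assumed value for "more than a few"

recent = defaultdict(dict)     # user -> {normalized text: time last seen}
dup_counts = defaultdict(int)  # user -> duplicates seen so far
blocked = set()                # users whose tweets are no longer saved

def normalize(text):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())

def accept_tweet(user, text, now):
    """Return True if the tweet should be saved, False if it is dropped."""
    if user in blocked:
        return False
    seen = recent[user]
    # Forget this user's tweets that have fallen outside the window.
    for old in [k for k, t in seen.items() if now - t > WINDOW]:
        del seen[old]
    key = normalize(text)
    if key in seen:
        seen[key] = now
        dup_counts[user] += 1
        if dup_counts[user] > MAX_DUPLICATES:
            blocked.add(user)  # stop saving this user's tweets
        return False
    seen[key] = now
    return True
```

Calling accept_tweet(user, text, datetime.now()) for each incoming tweet drops repeats within the window. Exact matching after normalization will miss spammers who vary their text slightly, so treat it as a first filter only.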

On Fri, Nov 26, 2010 at 11:26 AM, Furkan Kuru <> wrote:
> Word "lol" is the most common in these spam tweets. We receive 400 spam
> tweets per hour now tracking 100K people.
> We plan to delete all of the tweets containing "lol" word. It is also used
> by our users (Turkish people) writing in English though.
> Any better suggestions?

Adam Green
Twitter API Consultant and Trainer
