As long as you aren't trying to capture and deliver *all* tweets,
there are a couple of good ways to cut out spammers. One thing I do is
save all mentions for all users in a database of tweets. When a tweet
comes in from the streaming API, I collect @mentions, and store them
with the screen name of the tweet's author and the screen name
mentioned. Then I can rank users based on the number of different
accounts that mention them. If you only use the tweets from the top N%
of users, the quality improves a lot. I find that the top 80% is
usually enough of a screen to get good quality.
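
Here is a rough Python sketch of that ranking idea, not my production code. The names (record_mentions, top_users, MENTION_RE) are made up for illustration, an in-memory dict stands in for the database table, and the regex is a simplification that would also match things like email addresses. Skipping self-mentions is an extra assumption on top of what I described.

```python
import re
from collections import defaultdict

# Simplified @mention pattern (assumption; not Twitter's exact rules).
MENTION_RE = re.compile(r"@(\w+)")

def record_mentions(mentioners, author, text):
    """Store each mention as mentioned -> set of distinct authors."""
    for name in MENTION_RE.findall(text):
        mentioned = name.lower()
        if mentioned != author.lower():  # assumption: ignore self-mentions
            mentioners[mentioned].add(author.lower())

def top_users(mentioners, percent):
    """Return the top `percent` of mentioned users, ranked by the
    number of different accounts that mention them."""
    ranked = sorted(mentioners, key=lambda u: (-len(mentioners[u]), u))
    keep = int(len(ranked) * percent / 100)
    return set(ranked[:keep])

# Stand-in for the database: mentioned screen name -> authors who mentioned it.
mentioners = defaultdict(set)
for author, text in [
    ("alice", "nice talk by @carol"),
    ("bob", "agree with @carol and @dave"),
    ("spammer", "@spammer2 lol buy now"),
    ("spammer", "@spammer2 lol buy now again"),
    ("erin", "@dave thanks"),
]:
    record_mentions(mentioners, author, text)

allowed = top_users(mentioners, 80)
```

Because the authors are kept in a set, one account mentioning the same user many times still counts once, which is what keeps a spammer from boosting a friend's rank.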

Another trick is blocking duplicates from each user. The API only
blocks duplicates that repeat immediately, but if a spammer has a list
of tweets, and cycles through them, all the tweets get through. I
compare each new tweet with the other tweets from that user. This is
very expensive if you have a big database, but you can make it cheaper
by limiting the comparison to that user's tweets from the last few
days. You can also run it in a separate process so it doesn't slow
down your main tweet parsing loop. Most spammers are
so simplistic that they just repeat the same tweet over and over. In a
real spammy set of keywords, if I find more than a few duplicates from
a user, I just stop saving their tweets.
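
A minimal Python sketch of the duplicate check, under some assumptions: the class name DuplicateFilter, the 3-day window, and the threshold of 3 duplicates are all illustrative values I picked for the example, and an in-memory dict with exact matching on normalized text stands in for the database comparison.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(days=3)   # "last few days" (assumed value)
MAX_DUPLICATES = 3           # "more than a few" (assumed value)

class DuplicateFilter:
    def __init__(self):
        self.recent = defaultdict(dict)   # user -> {normalized text: last seen}
        self.dup_count = defaultdict(int)
        self.blocked = set()

    def accept(self, user, text, now):
        """Return True if the tweet should be saved."""
        if user in self.blocked:
            return False
        seen = self.recent[user]
        # Forget tweets older than the window so the comparison stays cheap.
        for key in [k for k, t in seen.items() if now - t > WINDOW]:
            del seen[key]
        # Normalize case and whitespace before comparing.
        key = " ".join(text.lower().split())
        is_dup = key in seen
        seen[key] = now
        if is_dup:
            self.dup_count[user] += 1
            if self.dup_count[user] > MAX_DUPLICATES:
                self.blocked.add(user)  # stop saving this user's tweets
            return False
        return True

f = DuplicateFilter()
t0 = datetime(2010, 11, 26, 12, 0)
cycle = ["buy A", "buy B", "buy A", "buy B", "buy A", "buy A", "new text"]
results = [f.accept("spammer", msg, t0 + timedelta(minutes=i))
           for i, msg in enumerate(cycle)]
```

The spammer cycling through "buy A" and "buy B" gets the first copy of each saved, the repeats dropped, and is blocked entirely after the fourth duplicate.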


On Fri, Nov 26, 2010 at 11:26 AM, Furkan Kuru <furkank...@gmail.com> wrote:
>
> Word "lol" is the most common in these spam tweets. We receive 400 spam
> tweets per hour now tracking 100K people.
>
> We plan to delete all of the tweets containing "lol" word. It is also used
> by our users (Turkish people) writing in English though.
>
> Any better suggestions?
>

-- 
Adam Green
Twitter API Consultant and Trainer
http://140dev.com
@140dev
