On Sun, 3 Apr 2011 18:19:38 -0700 (PDT), Jeff Tucker wrote:
I'm conducting a research project involving proactively identifying
twitter spam accounts before they actually start spamming.  I've
observed that some spammers attempt to create tweets that look like
they're a legitimate account prior to actually sending spam and my
project is to be able to identify those accounts as soon as they pop
up.

Unfortunately (I can't believe that I'm writing this) I am having a
hard time getting spammers to actually spam me. Is there any way that
I can somehow get access to the tweets of several dozen spam accounts
(prior to when they're shut down) so that I can see what they're
posting?  Is this possible somehow?

Also, if anyone gets spammed regularly, are you interested in helping
me out with my research?  No guarantee that I'll actually publish
this, but anyone interested will be credited in my paper in the
acknowledgements.  Thanks
-Jeff Tucker
Lecturer, DigiPen Institute of Technology
www.digipen.edu

I don't know how rapidly Twitter detects and shuts spam accounts down these days. I imagine there's a priority scheme, with accounts linking to malware and pr0n shut down more aggressively than those that are just "selling stuff" and being annoying about it. Here's a bit of pseudo-code that will get you one class of spammers:

1. Poll the Trending Topics periodically. IIRC if you do it every ten minutes for all the localities you won't use up all your API calls per hour.

2. Do a search for each trending topic and take the first 100 tweets for each. These searches don't count against your REST API calls, since the Search API is rate-limited separately.

3. Now use a relational database to find tweets that match more than one trending topic. Quite a few of the single-topic tweets will be spam too, but those that match multiple trends are much more likely to be spam.

4. Now you have a list of accounts. Pull their most recent 3200 tweets and test your algorithm. You'll probably have to go through them manually to find the boundary where each account started spamming, but then you should have a nice dataset for training a classifier.
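
In case it helps, step 3 could be sketched roughly like this in Python with an in-memory SQLite database. This is only a sketch under assumptions: the function name multi_trend_tweets, the hits table, and the (trend, tweet_id, screen_name) row layout are all made up for illustration, and it assumes you have already collected the search results from step 2 yourself. It makes no Twitter API calls.

```python
import sqlite3


def multi_trend_tweets(rows, min_trends=2):
    """Return (tweet_id, screen_name, trend_count) for tweets that showed up
    in the search results of at least `min_trends` distinct trending topics.

    `rows` is an iterable of (trend, tweet_id, screen_name) tuples, one per
    search hit. This layout is an assumption made for the sketch.
    """
    conn = sqlite3.connect(":memory:")
    # The UNIQUE constraint drops duplicate hits collected by repeated polls.
    conn.execute(
        "CREATE TABLE hits ("
        "trend TEXT, tweet_id INTEGER, screen_name TEXT, "
        "UNIQUE (trend, tweet_id))"
    )
    conn.executemany("INSERT OR IGNORE INTO hits VALUES (?, ?, ?)", rows)
    cur = conn.execute(
        "SELECT tweet_id, screen_name, COUNT(DISTINCT trend) AS n "
        "FROM hits "
        "GROUP BY tweet_id, screen_name "
        "HAVING COUNT(DISTINCT trend) >= ? "
        "ORDER BY n DESC, tweet_id",
        (min_trends,),
    )
    result = cur.fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    # Made-up sample hits: tweet 1 matches three trends, tweet 3 matches two.
    sample = [
        ("#a", 1, "spam"), ("#b", 1, "spam"), ("#c", 1, "spam"),
        ("#a", 2, "ok"),
        ("#b", 3, "x"), ("#c", 3, "x"),
    ]
    for tweet_id, screen_name, n in multi_trend_tweets(sample):
        print(tweet_id, screen_name, n)
```

The distinct screen names in the result are the candidate accounts for step 4. Raising min_trends should trade recall for precision.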


--
http://twitter.com/znmeb http://borasky-research.net

"A mathematician is a device for turning coffee into theorems." -- Paul Erdős

--
Twitter developer documentation and resources: http://dev.twitter.com/doc
API updates via Twitter: http://twitter.com/twitterapi
Issues/Enhancements Tracker: http://code.google.com/p/twitter-api/issues/list
Change your membership to this group: 
http://groups.google.com/group/twitter-development-talk
