On 02/10/2010 10:03 PM, mzap wrote:
> I am a linguist at the University of Sydney currently studying the
> language of microblogging. I would like to build a 100 million word
> corpus of tweets. I am trying to determine the best way of collecting
> such a corpus. Does Twitter make data available directly or is the
> only method scraping tweets using the API (I am not a programmer
> myself although I do have access to a programmer who is able to use
> the API)?
> If I were to use the API, would rate limiting mean that it is going to
> take ages to reach 100 million tweets?
If you're just collecting tweets to build a corpus, it's pretty easy to
do with the Streaming API. I've got Perl scripts that can do that,
either with Streaming or Search. With Streaming there's no "rate limit"
- just connect to the "Sample" stream and collect tweets until you have
a big enough corpus.
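
A rough sketch of the "collect until the corpus is big enough" loop, in
Python. This is not the Perl script mentioned above. It assumes the stream
arrives as one JSON object per line, with the tweet body in a "text" field,
and that blank lines and non-tweet messages can be skipped. The function
name and the whitespace word count are illustrative choices only.

```python
import json


def collect_corpus(lines, target_words):
    """Gather tweet texts from an iterable of JSON lines until the word target is met.

    Assumption: each non-blank line is a JSON object, and tweets carry their
    body in a "text" field. Anything else is skipped.
    """
    texts = []
    word_count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines carry no data
        try:
            message = json.loads(line)
        except ValueError:
            continue  # skip a truncated or malformed line
        if not isinstance(message, dict):
            continue
        text = message.get("text")
        if not isinstance(text, str):
            continue  # not a tweet (for example a delete notice)
        texts.append(text)
        word_count += len(text.split())  # crude whitespace tokenisation
        if word_count >= target_words:
            break
    return texts, word_count
```

Any iterable of lines works as input, for example an open file of saved
stream output or a line iterator over a live connection. A linguist would
probably want a better tokeniser than `str.split()`.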
I don't have a good idea of how long it will take you to get 100 million
words, but it should be easy to figure out how long it will take to get
100 million tweets - just see how many tweets per hour "sample" is sending.
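
The arithmetic can be sketched as below. The rate and the words-per-tweet
figure in the usage note are made-up placeholders, not measurements of the
"sample" stream; plug in whatever you actually observe.

```python
import math


def hours_to_collect(target_tweets, tweets_per_hour):
    """Hours needed to reach target_tweets at a steady observed rate."""
    if tweets_per_hour <= 0:
        raise ValueError("tweets_per_hour must be positive")
    return target_tweets / tweets_per_hour


def tweets_needed(target_words, words_per_tweet):
    """Tweets needed to reach target_words, given an average tweet length in words."""
    if words_per_tweet <= 0:
        raise ValueError("words_per_tweet must be positive")
    return math.ceil(target_words / words_per_tweet)
```

For instance, at a hypothetical 50,000 tweets per hour,
`hours_to_collect(100_000_000, 50_000)` gives 2000 hours. With a
hypothetical average of 12.5 words per tweet,
`tweets_needed(100_000_000, 12.5)` gives 8,000,000 tweets for a 100 million
word corpus.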
M. Edward (Ed) Borasky
"A mathematician is a device for turning coffee into theorems." ~ Paul Erdős