Thanks for these replies :) We've built the corpus. Funny thing for me
(wearing a corpus linguist hat) is that the corpus is bigger than most
of the reference corpora I might use!


On Feb 12, 3:45 pm, "M. Edward (Ed) Borasky" <> wrote:
> On 02/10/2010 10:03 PM, mzap wrote:
> > I am a linguist at the University of Sydney currently studying the
> > language of microblogging. I would like to build a 100 million word
> > corpus of tweets. I am trying to determine the best way of collecting
> > such a corpus. Does Twitter make data available directly, or is the
> > only method scraping tweets via the API? (I am not a programmer
> > myself, although I do have access to a programmer who is able to use
> > the API.)
> > If I were to use the API, would rate limiting mean that it is going to
> > take ages to reach 100 million tweets?
> > cheers,
> > Michele
> If you're just collecting tweets to build a corpus, it's pretty easy to
> do with the Streaming API. I've got Perl scripts that can do that,
> either with Streaming or Search. With Streaming there's no "rate limit"
> - just connect to the "Sample" stream and collect tweets until you have
> a big enough corpus.
> I don't have a good idea how long it will take you to get 100 million
> words, but it should be easy to figure out how long it will take to get
> 100 million tweets - just see how many tweets per hour "sample" is sending.
> --
> M. Edward (Ed) Borasky
> "A mathematician is a device for turning coffee into theorems." ~ Paul Erdős
