Re: [twitter-dev] Re: Building a 100 million word Twitter corpus

2010-02-14 Thread M. Edward (Ed) Borasky
On 02/13/2010 09:41 PM, mzap wrote:
> Thanks for these replies :) We've built the corpus. Funny thing for me
> (wearing a corpus linguist hat) is that the corpus is bigger than most
> of the reference corpora I might use!
> 
> cheers,
> Michele

Yeah, it's pretty easy to collect tweets - I just tested some of my code
on a small sample from the Streaming "sample" pipe. It's huge!

Speaking of Twitter "natural language processing", you might be
interested in my tweet-text translation efforts. I'm going to be posting
some more details in a day or so, but this routine might be of some
interest to you:

lexical_regex_utilities.pl at master from znmeb's
Twitter-API-Perl-Utilities - GitHub http://meb.tw/b4AHK9

And a test driver (requires JSON input, which is sort of the "native"
language of the Twitter APIs):

test_pg_text.pl at master from znmeb's Twitter-API-Perl-Utilities -
GitHub http://meb.tw/bAmt8q

License is same as Perl - Artistic. I need to put that in the
repository. ;-)


-- 
M. Edward (Ed) Borasky
borasky-research.net/m-edward-ed-borasky

"A mathematician is a device for turning coffee into theorems." ~ Paul Erdős


[twitter-dev] Re: Building a 100 million word Twitter corpus

2010-02-13 Thread mzap
Thanks for these replies :) We've built the corpus. Funny thing for me
(wearing a corpus linguist hat) is that the corpus is bigger than most
of the reference corpora I might use!

cheers,
Michele

On Feb 12, 3:45 pm, "M. Edward (Ed) Borasky" wrote:
> On 02/10/2010 10:03 PM, mzap wrote:
>
> > I am a linguist at the University of Sydney currently studying the
> > language of microblogging. I would like to build a 100 million word
> > corpus of tweets. I am trying to determine the best way of collecting
> > such a corpus. Does Twitter make data available directly or is the
> > only method scraping tweets using the API (I am not a programmer
> > myself although I do have access to a programmer who is able to use
> > the API)?
>
> > If I was to use the API would rate limiting mean that it is going to
> > take ages to reach 100 million tweets?
>
> > cheers,
> > Michele
>
> If you're just collecting tweets to build a corpus, it's pretty easy to
> do with the Streaming API. I've got Perl scripts that can do that,
> either with Streaming or Search. With Streaming there's no "rate limit"
> - just connect to the "Sample" stream and collect tweets until you have
> a big enough corpus.
>
> I don't have a good idea how long it will take you to get 100 million
> words, but it should be easy to figure out how long it will take to get
> 100 million tweets - just see how many tweets per hour "sample" is sending.
>
> --
> M. Edward (Ed) Borasky
> borasky-research.net/m-edward-ed-borasky
>
> "A mathematician is a device for turning coffee into theorems." ~ Paul Erdős
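
The estimate suggested in the quoted reply (measure how fast "sample"
delivers tweets, then extrapolate to the target) can be sketched roughly
as below. This is a minimal illustration in Python, not the Perl scripts
mentioned in the thread. It assumes the collected data is
newline-delimited JSON with the tweet body in a "text" field, and it
counts words by splitting on whitespace, which is only a crude
approximation. The function names and all numbers in the usage section
are hypothetical placeholders, not measurements.

```python
import json
from typing import Iterable, Tuple


def count_tweets_and_words(json_lines: Iterable[str]) -> Tuple[int, int]:
    """Count tweets and whitespace-separated words in newline-delimited JSON.

    Assumption: each non-blank line is one JSON object and the tweet body
    is stored under the "text" key.
    """
    tweets = 0
    words = 0
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        text = record.get("text") if isinstance(record, dict) else None
        if not isinstance(text, str):
            continue  # skip records that carry no tweet text
        tweets += 1
        words += len(text.split())
    return tweets, words


def hours_to_target(target: float, observed: float, observed_hours: float) -> float:
    """Extrapolate how many hours are needed to reach `target` items,
    given `observed` items collected over `observed_hours` hours."""
    if observed <= 0 or observed_hours <= 0:
        raise ValueError("observed count and duration must be positive")
    rate_per_hour = observed / observed_hours
    return target / rate_per_hour


if __name__ == "__main__":
    # Placeholder sample standing in for one hour of collected stream data.
    sample = [
        '{"text": "just setting up my corpus"}',
        "",
        '{"text": "coffee into theorems"}',
    ]
    tweets, words = count_tweets_and_words(sample)
    print("tweets:", tweets, "words:", words)
    print("hours to 100 million words:", hours_to_target(100_000_000, words, 1.0))
```

Counting words as well as tweets matters here because the stated goal is
100 million words, and the two targets are reached at different times.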