On 02/13/2010 09:41 PM, mzap wrote:
> Thanks for these replies :) We've built the corpus. Funny thing for me
> (wearing a corpus linguist hat) is that the corpus is bigger than most
> of the reference corpora I might use!
>
> cheers,
> Michele

Yeah, it's pretty easy to collect tweets - I just tested some of my code on a small sample from the Streaming "sample" pipe. It's huge!

Speaking of Twitter "natural language processing", you might be interested in my tweet-text translation efforts. I'm going to be posting some more details in a day or so, but this routine might be of some interest to you:

lexical_regex_utilities.pl at master from znmeb's Twitter-API-Perl-Utilities - GitHub
http://meb.tw/b4AHK9

And a test driver (requires JSON input, which is sort of the "native" language of the Twitter APIs):

test_pg_text.pl at master from znmeb's Twitter-API-Perl-Utilities - GitHub
http://meb.tw/bAmt8q

The license is the same as Perl's - Artistic. I need to put that in the repository. ;-)

--
M. Edward (Ed) Borasky
borasky-research.net/m-edward-ed-borasky

"A mathematician is a device for turning coffee into theorems." ~ Paul Erdős