> Thanks for these replies :) We've built the corpus. Funny thing for me
> (wearing a corpus linguist hat) is that the corpus is bigger than most
> of the reference corpora I might use!
Yeah, it's pretty easy to collect tweets - I just tested some of my code
on a small sample from the Streaming API's "sample" feed. It's huge!

Speaking of Twitter "natural language processing", you might be
interested in my tweet-text translation efforts. I'm going to be posting
some more details in a day or so, but this routine might be of some
interest to you:

lexical_regex_utilities.pl at master from znmeb's
Twitter-API-Perl-Utilities - GitHub http://meb.tw/b4AHK9

And a test driver (requires JSON input, which is sort of the "native"
language of the Twitter APIs):

test_pg_text.pl at master from znmeb's Twitter-API-Perl-Utilities -
GitHub http://meb.tw/bAmt8q
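
If you haven't handled the JSON yet: the Streaming API sends one JSON
object per line, and the tweet itself is in the "text" field. Here's a
rough sketch of pulling the text out, in Python rather than Perl. It is
not what test_pg_text.pl does; the function name and sample records are
made up for illustration.

```python
import json


def tweet_texts(lines):
    """Yield the "text" field of every JSON line that has one.

    Hypothetical helper for illustration; not part of the Perl utilities.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines carry no record
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip truncated or malformed records
        # Records without "text" (e.g. delete notices) are not tweets.
        if isinstance(record, dict) and "text" in record:
            yield record["text"]


# Invented sample records, shaped like line-delimited stream output.
sample = [
    '{"id": 1, "text": "hello corpus linguists"}',
    '',
    '{"delete": {"status": {"id": 2}}}',
    '{"id": 3, "text": "caf\\u00e9 tweets"}',
]

for text in tweet_texts(sample):
    print(text)
```

For a file of collected tweets, passing the open file handle to
tweet_texts() should work the same way, since it only iterates over
lines.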

The license is the same as Perl's - Artistic. I still need to put that
in the repository. ;-)

