[twitter-dev] Re: Tweet Corpus creation for NLP research

2009-04-18 Thread djMax
Hi Jay, very interesting project. I run a hyperlocal wiki in Boston: http://boston.povo.com. How are you pulling these? Are you going after specific users who set their location to Boston (or whichever city)? On Apr 17, 4:20 pm, jayb wrote: > I've been collecting tweets for about a week for a project ...

[twitter-dev] Re: Tweet Corpus creation for NLP research

2009-04-17 Thread Nicole Simon
Anything you can do to help people better determine the language of tweets would make search more usable for international users. ;)) I am a bit curious about the mentioned 'costs of publishing in journals and conferences' - I don't know about the journals, but none of the conferences I know of in Tech ...
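
A minimal sketch of the kind of language heuristic Nicole is asking for, assuming a simple stopword-overlap approach; the word lists, languages, and tie-breaking here are illustrative assumptions, not any real Twitter feature:

    # Rough language guess for a tweet via stopword overlap.
    # The stopword sets below are tiny illustrative samples.
    STOPWORDS = {
        "en": {"the", "and", "is", "to", "of", "in", "you", "that"},
        "de": {"der", "die", "und", "ist", "das", "nicht", "ich"},
        "es": {"el", "la", "que", "de", "y", "en", "los", "no"},
    }

    def guess_language(text):
        tokens = set(text.lower().split())
        scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else "unknown"

    print(guess_language("ich glaube das ist nicht der richtige weg"))  # -> de

Short tweets defeat character n-gram models less often than you'd think, but with 140 characters even a crude overlap score beats nothing.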

[twitter-dev] Re: Tweet Corpus creation for NLP research

2009-04-17 Thread jayb
I've been collecting tweets for about a week for a project (http://www.happn.in). Some characteristics of my current dataset:
* Begins around April 10th, 2009
* Collected from users located near 26 US cities
* ~5,000,000 tweets
* Growing at ~800,000 per day
* ~900MB in MySQL
* ~375,000 users ...
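
For anyone wondering how a collection like jayb's might work: a minimal sketch polling the 2009-era Search API with its geocode parameter. The city list, radius, and polling cadence are assumptions for illustration, not jayb's actual setup, and the endpoint has long since been retired:

    import time
    import requests  # third-party HTTP client, assumed installed

    # Illustrative city centers (lat, lng); the actual list of 26 cities isn't given.
    CITIES = {
        "boston": (42.3584, -71.0598),
        "austin": (30.2672, -97.7431),
    }

    def tweets_near(lat, lng, radius_mi=25):
        # geocode=lat,lng,radius and rpp (results per page) were parameters
        # of the Search API at search.twitter.com as it worked in 2009.
        resp = requests.get(
            "http://search.twitter.com/search.json",
            params={"geocode": "%.4f,%.4f,%dmi" % (lat, lng, radius_mi), "rpp": 100},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()["results"]

    for city, (lat, lng) in CITIES.items():
        for tweet in tweets_near(lat, lng):
            print(city, tweet["from_user"], tweet["text"])
        time.sleep(2)  # be gentle with the rate limit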

[twitter-dev] Re: Tweet Corpus creation for NLP research

2009-04-17 Thread Nick Arnett
Part 1: http://drop.io/gmx85rd (tweetsgzaa) Part 2: http://drop.io/f5itrsx (tweetsgzab) Password (for the download): "twitter". The two parts need to be concatenated and then un-gzipped (naming the concatenated file tweets.gz would be appropriate). The format is a tab-delimited text file. Nick
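
A small sketch of the reassembly Nick describes, plus a peek at the result; the column layout is an assumption, since the message only says the file is tab-delimited:

    import gzip
    import shutil

    # Stitch the two downloaded parts back into a single gzip file.
    with open("tweets.gz", "wb") as out:
        for part in ("tweetsgzaa", "tweetsgzab"):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)

    # Stream the tab-delimited rows without unpacking to disk first.
    with gzip.open("tweets.gz", "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")  # exact columns aren't stated
            print(fields)
            break  # peek at the first row only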

[twitter-dev] Re: Tweet Corpus creation for NLP research

2009-04-17 Thread Nick Arnett
I'm splitting it and putting it on drop.io. Will take a little while to upload... I'll post when it's available. Nick On Fri, Apr 17, 2009 at 9:17 AM, djMax wrote: > http://drop.io > On Apr 17, 12:07 pm, Nick Arnett wrote: > > Michele, djMax and anybody else interested... It is a 128MB file ...

[twitter-dev] Re: Tweet Corpus creation for NLP research

2009-04-17 Thread Nick Arnett
On Fri, Apr 17, 2009 at 9:17 AM, djMax wrote: > http://drop.io The free version is limited to 100MB... I could split it, I guess. Any others with a higher limit? Nick

[twitter-dev] Re: Tweet Corpus creation for NLP research

2009-04-17 Thread djMax
http://drop.io On Apr 17, 12:07 pm, Nick Arnett wrote: > Michele, djMax and anybody else interested... It is a 128MB file after > gzipping (291MB uncompressed). Any thoughts on a place to put it for > download? I'm reluctant to sacrifice a lot of my own bandwidth for this and > off the top of ...

[twitter-dev] Re: Tweet Corpus creation for NLP research

2009-04-17 Thread Nick Arnett
Michele, djMax and anybody else interested... It is a 128MB file after gzipping (291MB uncompressed). Any thoughts on a place to put it for download? I'm reluctant to sacrifice a lot of my own bandwidth for this and off the top of my head, I can't think of a good place to share it. Nick

[twitter-dev] Re: Tweet Corpus creation for NLP research

2009-04-16 Thread djMax
I've wondered about a distributed version of this... If those of us who want to sift through the "entire" stream were to pool our API usage, in theory we could do it without knocking over Twitter, right? My particular use is mining for geo content, either lat/lng or NLP-based feature extraction.
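
A sketch of the pooling idea djMax floats, assuming each participant contributes credentials and requests rotate round-robin so no single account exhausts its hourly limit. The account names, endpoint choice, and rotation scheme are illustrative; in 2009 the REST API used HTTP basic auth:

    from itertools import cycle
    import requests  # third-party HTTP client, assumed installed

    # Hypothetical pooled credentials contributed by participants.
    ACCOUNTS = cycle([("alice", "pw1"), ("bob", "pw2"), ("carol", "pw3")])

    def pooled_get(url, **params):
        # Rotate through contributed accounts to spread the rate limit.
        user, password = next(ACCOUNTS)
        resp = requests.get(url, params=params, auth=(user, password), timeout=10)
        resp.raise_for_status()
        return resp.json()

    batch = pooled_get("http://twitter.com/statuses/public_timeline.json")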

[twitter-dev] Re: Tweet Corpus creation for NLP research

2009-04-16 Thread Michele Zappavigna
Hi Nick, I am a linguist currently working on Twitter. I would be very interested in using the corpus that you mention you have created. I work in the area of Systemic Functional Linguistics and am looking at how people use language to affiliate on Twitter. At the moment I am working with a corpus ...

[twitter-dev] Re: Tweet Corpus creation for NLP research

2009-04-09 Thread Nick Arnett
On Thu, Apr 9, 2009 at 2:04 PM, kanny wrote: > ... It could change the twitter > client game completely as we dive deeper into the meanings of the > tweets instead of the keyword based or author based groupings. That's what TwURLed News is about, but using a much simpler clue - cited URLs - a ...
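
A minimal sketch of the "cited URLs as a clue" idea, assuming simple regex extraction and grouping tweets by the URLs they share; the pattern is illustrative and will miss some URL forms, and this is not TwURLed News's actual code:

    import re
    from collections import defaultdict

    URL_RE = re.compile(r"https?://\S+")  # crude; good enough for a sketch

    def group_by_url(tweets):
        # Map each cited URL to the tweets mentioning it.
        groups = defaultdict(list)
        for tweet in tweets:
            for url in URL_RE.findall(tweet):
                groups[url.rstrip(".,;)")].append(tweet)
        return groups

    sample = [
        "Great read: http://example.com/story",
        "Agreed, http://example.com/story is spot on",
    ]
    for url, members in group_by_url(sample).items():
        print(url, len(members))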

[twitter-dev] Re: Tweet Corpus creation for NLP research

2009-04-09 Thread kanny
Thanks Nick for your gesture. I will certainly be interested in trying out your cached tweets, but their usefulness will be limited to those who follow the cached tweets' authors. About sharing, I don't intend to publish in journals or conferences as I can't afford the costs, but will definitely share ...

[twitter-dev] Re: Tweet Corpus creation for NLP research

2009-04-09 Thread Nick Arnett
On Thu, Apr 9, 2009 at 7:13 AM, kanny wrote: > Caching is something i will definitely be doing, but as i said, to do > something complex like semantic model generation, i need access to a > user's last, at least 100,000 friends_timeline tweets. For a typical > user following 100 reasonably active ...

[twitter-dev] Re: Tweet Corpus creation for NLP research

2009-04-09 Thread kanny
Thank you Doug for the reply. Currently, I am able to get only about 1000 tweets from a user's timeline, though the limit says about 3000. I also requested whitelisting and am glad that it was accepted, but I don't know where to request the datamining feed. Caching is something I will ...
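
For context on the ceiling kanny is hitting, a sketch of paging the 2009-era REST API until it stops returning results; the endpoint and parameters are as they worked then, and the server-side cap (roughly 3,200 statuses) meant deeper pages simply came back empty:

    import requests  # third-party HTTP client, assumed installed

    def fetch_friends_timeline(user, password, max_pages=20):
        # Page through friends_timeline until Twitter returns an empty page.
        # count=200 was the per-page maximum on the REST API of the time.
        tweets = []
        for page in range(1, max_pages + 1):
            resp = requests.get(
                "http://twitter.com/statuses/friends_timeline.json",
                params={"count": 200, "page": page},
                auth=(user, password),  # basic auth, as the API worked then
                timeout=10,
            )
            resp.raise_for_status()
            batch = resp.json()
            if not batch:
                break  # server-side cap reached; no more history available
            tweets.extend(batch)
        return tweets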

[twitter-dev] Re: Tweet Corpus creation for NLP research

2009-04-08 Thread Doug Williams
We don't have a method to download the entire friends_timeline for a user. If you search the boards or documentation you will find there is an artificial limit on the number of tweets you can download [1]. People doing datamining often request access to the datamining feed and cache tweets as they ...
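
A minimal sketch of the cache-as-they-arrive pattern Doug recommends, assuming a local SQLite store keyed by status id so repeated polls never duplicate rows; the table name and schema are illustrative, while the JSON fields match the status objects the API returned at the time:

    import sqlite3

    conn = sqlite3.connect("tweets.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets "
        "(id INTEGER PRIMARY KEY, user TEXT, created_at TEXT, text TEXT)"
    )

    def cache(batch):
        # INSERT OR IGNORE dedupes on the status id, so polling the same
        # window repeatedly never stores a tweet twice.
        conn.executemany(
            "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)",
            [(t["id"], t["user"]["screen_name"], t["created_at"], t["text"])
             for t in batch],
        )
        conn.commit()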