100 GB is a lot. If the average JSON representation of a tweet takes about 5 KB (and I think it might), you'd need 20 million tweets. Let's say 100 million tweets are sent per day (I think it's more, though) and you get 1% of them from the sample stream, which would be 1 million per day. You'd have to capture that stream for about 20 days to get enough tweets.
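
To make the estimate above easy to redo with other guesses, here is a minimal Python sketch of the same arithmetic. The function name `days_to_collect` and its parameters are made up for illustration, and it assumes decimal units (1 GB = 10**9 bytes, 1 KB = 10**3 bytes). The 5 KB tweet size and 100 million tweets per day are the rough guesses from this message, not measured values.

```python
def ceil_div(a, b):
    """Integer ceiling division, so partial tweets or days round up."""
    return -(-a // b)

def days_to_collect(target_bytes, bytes_per_tweet, tweets_per_day, sample_percent):
    """Return (tweets_needed, sampled_per_day, days) for a capture target."""
    tweets_needed = ceil_div(target_bytes, bytes_per_tweet)
    # Number of tweets the sample stream is assumed to deliver each day.
    sampled_per_day = tweets_per_day * sample_percent // 100
    days = ceil_div(tweets_needed, sampled_per_day)
    return tweets_needed, sampled_per_day, days

# Guesses from the estimate above: 100 GB target, 5 KB per tweet,
# 100 million tweets per day, 1% sample.
needed, per_day, days = days_to_collect(
    target_bytes=100 * 10**9,
    bytes_per_tweet=5 * 10**3,
    tweets_per_day=100_000_000,
    sample_percent=1,
)
print(needed, per_day, days)  # prints: 20000000 1000000 20
```

If tweets turn out smaller or the daily volume higher, changing the arguments shows how the capture time shifts.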

Sample stream is at https://stream.twitter.com/1/statuses/sample.json (OAuth/Basic Auth required). Have fun capturing!
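
As a starting point for a recorder, here is a rough Python sketch using only the standard library. It is not an existing tool, just an illustration. The helper names `capture` and `open_stream` are made up, and the sketch assumes the stream delivers one JSON status per line with occasional blank keep-alive lines, and that Basic Auth is accepted on this endpoint. Check the streaming API documentation before relying on either assumption.

```python
import base64
import urllib.request

SAMPLE_URL = "https://stream.twitter.com/1/statuses/sample.json"

def capture(lines, out, target_bytes):
    """Copy non-empty lines to `out` until at least target_bytes are written."""
    written = 0
    for line in lines:
        if not line.strip():
            continue  # assumed keep-alive blank line, nothing to store
        if not line.endswith(b"\n"):
            line += b"\n"  # keep one record per line in the output file
        out.write(line)
        written += len(line)
        if written >= target_bytes:
            break
    return written

def open_stream(username, password, url=SAMPLE_URL):
    """Open the stream with HTTP Basic Auth; the response iterates by line."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(
        url, headers={"Authorization": f"Basic {token}"}
    )
    return urllib.request.urlopen(request)

# Hypothetical usage (credentials and file name are placeholders):
# with open("tweets.json", "wb") as out:
#     capture(open_stream("user", "password"), out, 100 * 10**9)
```

Keeping `capture` separate from the network code means it can be tried on any iterable of byte lines before pointing it at the live stream. A real recorder would also need reconnect handling, which this sketch leaves out.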


PS: I really hope I got the math right. :-)

On 3/6/11 8:16 PM, Ted Pedersen wrote:
I'd like to get somewhere around 100GB of tweets. It doesn't matter
where they are from, when they were sent, etc. I'd just like to have a
relatively large collection of data to use as assignment data for a
class I'm teaching that uses Hadoop.

Is such a collection available for download anywhere, or is there an
existing program I could use to simply record twitter data for some
period of time? (I've heard about both the firehose and the streaming
API, but can't seem to find anything that is ready to run with that
for this particular task....but I might not know where to look).


Ted Pedersen

Twitter developer documentation and resources: http://dev.twitter.com/doc
API updates via Twitter: http://twitter.com/twitterapi
Issues/Enhancements Tracker: http://code.google.com/p/twitter-api/issues/list