Re: [twitter-dev] Infochimps Datasets available for Hack Day: drawn from 1.6B tweets, 40M+ users+reputation, ~0.5B reply links, more!

2010-04-15 Thread znmeb

- Philip (flip) Kromer f...@infochimps.org wrote:

 Hi all,
 
 I'm pleased to announce that Infochimps is making datasets from our
 massive scrape of the Twitter corpus available for Chirp Hack day
 devs.
 
 There's a big opportunity for apps that draw on the historical record
 and *structure* of twitter -- apps that require a global perspective
 and intense computation. The following are available to mash up
 against other datasets from infochimps.org or even just to
 bootstrap-seed the database for your Hack Day application. We also
 have a 30-machine cluster up to do further extractions, so if you have
 something really interesting you'd like to pull please let me know.
 
 Reputation Metrics from Reply and Follow graphs: Uses an algorithm
 similar to PageRank to derive reputation, one using the a_follows_b
 graph and one using the a_replies_b graph
 Reply/retweet/mention graph: Every observed reply, retweet, or mention
 seen in a 1.6B-tweet sample (about 15% of historical record):
 a_[rel]_b, user_a_id, user_b_id, tweet_id
 Twitter Users by Background Color: The number of users with each
 background color: color code, user count
 Twitter Users by Friends Count: The number of users with a given number
 of friends: number of friends, user count
 Twitter Users by Followers Count: The number of users with a given
 number of followers: number of followers, user count
 Twitter Users by Created At: The number of users whose accounts were
 created in a given month/day/hour along with the earliest seen ID in
 that hour: timestamp to month/day/hour, user count
 Smileys: Smiley faces with user, date, tweet_id
 Hashtags: Hashtags with user, date, tweet_id
 TweetUrl: URLs with user, date, tweet_id
 Twitter Users by Location: The number of users in a location string (as
 provided by the user in their profile): location, user count
 Stock Tweets: Tweets that include the stock symbol tag convention of
 $STOCKNAME or $$. The tweet is listed once for each time a tag is used
 in the tweet: stock_tweet (resource name), symbol captured, tweet
 object (all things in a tweet)
 Stock Prices: Daily stock prices for the NASDAQ, NYSE, AMEX exchanges,
 1970-now: symbol, open, low, close, high, volume
 
 Parameters for what's available:
 
 raw object        size  number of objs
 a_follows_b    45.8 GB   1,587,838,568
 a_mentions_b   29.5 GB     493,682,309
 a_retweets_b    1.6 GB      36,022,061
 twitter_user    3.1 GB      43,261,388
 tweets        376.0 GB   1,641,624,381
 hashtag         7.1 GB     139,916,844
 smiley          4.4 GB      99,272,082
 tweet_url      29.5 GB     433,278,116
 
 If you'd like access to any of these, or have an idea that needs
 something /not/ here, please let me know ( f...@infochimps.org ).
 We're only opening access to Hack Day devs for now -- but please let
 us know your ideas so we can show twitter how much demand there is for
 aggregated access to data.
 
 best,
 flip
 @mrflip
 512-659-6846
 
 
 http://infochimps.org
 Find any dataset in the world

This is too short notice for me to be able to come up with a use for these 
data. But for the future, do you by any chance have access to *intraday futures 
and options* time series? Daily stock data are more or less useless.


-- 
To unsubscribe, reply using remove me as the subject.


[twitter-dev] Infochimps Datasets available for Hack Day: drawn from 1.6B tweets, 40M+ users+reputation, ~0.5B reply links, more!

2010-04-14 Thread Philip (flip) Kromer
Hi all,

I'm pleased to announce that Infochimps is making datasets from our massive
scrape of the Twitter corpus available for Chirp Hack day devs.

There's a big opportunity for apps that draw on the historical record and
*structure* of twitter -- apps that require a global perspective and intense
computation. The following are available to mash up against other datasets
from infochimps.org or even just to bootstrap-seed the database for your
Hack Day application.  We also have a 30-machine cluster up to do further
extractions, so if you have something really interesting you'd like to pull
please let me know.

*Reputation Metrics from Reply and Follow graphs* Uses an algorithm similar to
PageRank to derive reputation, one using the a_follows_b graph and one using
the a_replies_b graph
*Reply/retweet/mention graph* Every observed reply, retweet, or mention seen
in a 1.6B-tweet sample (about 15% of historical record): a_[rel]_b,
user_a_id, user_b_id, tweet_id
*Twitter Users by Background Color* The number of users with each background
color: color code, user count
*Twitter Users by Friends Count* The number of users with a given number of
friends: number of friends, user count
*Twitter Users by Followers Count* The number of users with a given number
of followers: number of followers, user count
*Twitter Users by Created At* The number of users whose accounts were
created in a given month/day/hour along with the earliest seen ID in that
hour: timestamp to month/day/hour, user count
*Smileys* Smiley faces with user, date, tweet_id
*Hashtags* Hashtags with user, date, tweet_id
*TweetUrl* URLs with user, date, tweet_id
*Twitter Users by Location* The number of users in a location string (as
provided by the user in their profile): location, user count
*Stock Tweets* Tweets that include the stock symbol tag convention of
$STOCKNAME or $$. The tweet is listed once for each time a tag is used in
the tweet: stock_tweet (resource name), symbol captured, tweet object (all
things in a tweet)
*Stock Prices* Daily stock prices for the NASDAQ, NYSE, AMEX exchanges,
1970-now: symbol, open, low, close, high, volume
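
For readers who want a feel for the reputation datasets in the list above, here is a minimal sketch of plain PageRank by power iteration over (user_a_id, user_b_id) pairs read as "a follows b". It is only an illustration: the post says the metric uses an algorithm "similar to pagerank" and does not give details, so the damping factor, the iteration count, the handling of users who follow nobody, and the function name `pagerank` are all assumptions.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; edges are (user_a_id, user_b_id) = 'a follows b'."""
    out_links = defaultdict(set)
    nodes = set()
    for a, b in edges:
        out_links[a].add(b)
        nodes.update((a, b))
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Assumption: users who follow nobody spread their rank evenly over everyone.
        dangling = sum(rank[u] for u in nodes if u not in out_links)
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for a, targets in out_links.items():
            share = damping * rank[a] / len(targets)
            for b in targets:
                new_rank[b] += share
        rank = new_rank
    return rank


# Toy a_follows_b sample (made-up ids): users 1, 2 and 4 follow 3; 3 follows 1.
follows = [(1, 3), (2, 3), (3, 1), (4, 3)]
scores = pagerank(follows)
for user_id in sorted(scores, key=scores.get, reverse=True):
    print(user_id, round(scores[user_id], 4))
```

At the scale quoted in this message the graph would not fit in a Python dict on one machine, so treat this as a way to prototype against a small extract rather than a way to recompute the published scores.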

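The Stock Tweets entry above emits one row per symbol occurrence. A rough sketch of that extraction follows, assuming a symbol is a "$" followed by one to six letters. The pattern, the six-letter limit, and the function name `stock_tweet_rows` are assumptions for illustration, and the "$$" form is not handled because the post does not define it.

```python
import re

# Assumption: a symbol tag is "$" followed by 1-6 ASCII letters.
# Dollar amounts such as "$5" do not match because a letter must follow the "$".
CASHTAG = re.compile(r"\$([A-Za-z]{1,6})\b")


def stock_tweet_rows(tweet_id, text):
    """Return one (resource name, symbol, tweet_id) row per tag occurrence."""
    return [("stock_tweet", m.group(1).upper(), tweet_id)
            for m in CASHTAG.finditer(text)]


rows = stock_tweet_rows(42, "Long $AAPL, short $msft, adding more $AAPL after a $5 lunch")
for row in rows:
    print(row)
```

A symbol that appears twice in one tweet yields two rows, which matches the "listed for each time a tag is used" wording.
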
Parameters for what's available:

raw object        size  number of objs
a_follows_b    45.8 GB   1,587,838,568
a_mentions_b   29.5 GB     493,682,309
a_retweets_b    1.6 GB      36,022,061
twitter_user    3.1 GB      43,261,388
tweets        376.0 GB   1,641,624,381
hashtag         7.1 GB     139,916,844
smiley          4.4 GB      99,272,082
tweet_url      29.5 GB     433,278,116
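
When deciding which of these to load into a Hack Day database, the average size per object is a handy number. The sketch below derives it from the figures in the table, assuming 1 GB means 10^9 bytes (the post does not say whether sizes are decimal or binary, or whether they are compressed).

```python
# (size in GB, number of objects), copied from the table above.
SIZES = {
    "a_follows_b": (45.8, 1_587_838_568),
    "a_mentions_b": (29.5, 493_682_309),
    "a_retweets_b": (1.6, 36_022_061),
    "twitter_user": (3.1, 43_261_388),
    "tweets": (376.0, 1_641_624_381),
    "hashtag": (7.1, 139_916_844),
    "smiley": (4.4, 99_272_082),
    "tweet_url": (29.5, 433_278_116),
}


def bytes_per_object(size_gb, count):
    """Average bytes per object, assuming 1 GB = 10**9 bytes."""
    return size_gb * 1e9 / count


for name, (size_gb, count) in SIZES.items():
    print(f"{name:<12} {bytes_per_object(size_gb, count):7.1f} bytes/object")
```

Under that assumption an a_follows_b edge averages a little under 30 bytes and a tweet object a little under 230 bytes.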

If you'd like access to any of these, or have an idea that needs something
/not/ here, please let me know (f...@infochimps.org).  We're only opening
access to Hack Day devs for now -- but please let us know your ideas so we
can show twitter how much demand there is for aggregated access to data.

best,
flip
@mrflip
512-659-6846


http://infochimps.org
Find any dataset in the world




Re: [twitter-dev] Infochimps Datasets available for Hack Day: drawn from 1.6B tweets, 40M+ users+reputation, ~0.5B reply links, more!

2010-04-14 Thread Kovas Boguts

Hi flip,

I'm hacking at Chirp and am interested in a_follows_b, hashtags, and
URLs. What would be extra sweet is timestamps on the follow
relations: if you crawl the same person over time, we could see how
that network evolves. Thanks for making the data available.
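
One way such timestamps could be approximated, sketched under the assumption that the crawl yields dated snapshots of the same users' follow edges (the announcement does not say it does): diff consecutive snapshots and stamp each change with the date of the crawl that first observed it. The snapshot layout and the name `follow_events` are hypothetical.

```python
def follow_events(snapshots):
    """snapshots maps an ISO crawl date to an iterable of (user_a_id, user_b_id) edges.

    Returns (crawl_date, "follow" | "unfollow", user_a_id, user_b_id) tuples, where
    crawl_date is the first crawl in which the change was observed.
    """
    events = []
    dates = sorted(snapshots)  # ISO dates sort chronologically as strings
    for prev, curr in zip(dates, dates[1:]):
        before, after = set(snapshots[prev]), set(snapshots[curr])
        for a, b in sorted(after - before):
            events.append((curr, "follow", a, b))
        for a, b in sorted(before - after):
            events.append((curr, "unfollow", a, b))
    return events


# Made-up example: user 1 drops user 2 and picks up user 4 between crawls.
snapshots = {
    "2010-03-01": [(1, 2), (1, 3)],
    "2010-04-01": [(1, 3), (1, 4)],
}
events = follow_events(snapshots)
for event in events:
    print(event)
```

The dates are only as fine-grained as the crawl interval: a follow that is made and undone between two crawls never shows up.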


Kovas
Infoharmoni

On Apr 14, 2010, at 1:51 PM, Philip (flip) Kromer  
f...@infochimps.org wrote:



Hi all,

I'm pleased to announce that Infochimps is making datasets from our  
massive scrape of the Twitter corpus available for Chirp Hack day  
devs.


There's a big opportunity for apps that draw on the historical  
record and *structure* of twitter -- apps that require a global  
perspective and intense computation. The following are available to  
mash up against other datasets from infochimps.org or even just to  
bootstrap-seed the database for your Hack Day application.  We also  
have a 30-machine cluster up to do further extractions, so if you  
have something really interesting you'd like to pull please let me  
know.


Reputation Metrics from Reply and Follow graphs: Uses an algorithm
similar to PageRank to derive reputation, one using the a_follows_b
graph and one using the a_replies_b graph
Reply/retweet/mention graph: Every observed reply, retweet, or
mention seen in a 1.6B-tweet sample (about 15% of historical
record): a_[rel]_b, user_a_id, user_b_id, tweet_id
Twitter Users by Background Color: The number of users with each
background color: color code, user count
Twitter Users by Friends Count: The number of users with a given
number of friends: number of friends, user count
Twitter Users by Followers Count: The number of users with a given
number of followers: number of followers, user count
Twitter Users by Created At: The number of users whose accounts were
created in a given month/day/hour along with the earliest seen ID in
that hour: timestamp to month/day/hour, user count
Smileys: Smiley faces with user, date, tweet_id
Hashtags: Hashtags with user, date, tweet_id
TweetUrl: URLs with user, date, tweet_id
Twitter Users by Location: The number of users in a location string
(as provided by the user in their profile): location, user count
Stock Tweets: Tweets that include the stock symbol tag convention of
$STOCKNAME or $$. The tweet is listed once for each time a tag is used
in the tweet: stock_tweet (resource name), symbol captured, tweet
object (all things in a tweet)
Stock Prices: Daily stock prices for the NASDAQ, NYSE, AMEX exchanges,
1970-now: symbol, open, low, close, high, volume


Parameters for what's available:

raw object        size  number of objs
a_follows_b    45.8 GB   1,587,838,568
a_mentions_b   29.5 GB     493,682,309
a_retweets_b    1.6 GB      36,022,061
twitter_user    3.1 GB      43,261,388
tweets        376.0 GB   1,641,624,381
hashtag         7.1 GB     139,916,844
smiley          4.4 GB      99,272,082
tweet_url      29.5 GB     433,278,116

If you'd like access to any of these, or have an idea that needs  
something /not/ here, please let me know (f...@infochimps.org).   
We're only opening access to Hack Day devs for now -- but please let  
us know your ideas so we can show twitter how much demand there is  
for aggregated access to data.


best,
flip
@mrflip
512-659-6846


http://infochimps.org
Find any dataset in the world