[twitter-dev] Academic data release

2010-02-24 Thread Pete Warden
I'm looking into releasing a data set based on information pulled from the
Twitter API. It would be a free release limited to academic researchers, an
anonymized version of the network connections of several million users with
public profiles.

What I'm hoping to release is something like this:
user id, city-level location, follower ids, friend ids

In all cases, the ids are arbitrary identifiers that are not convertible to
actual Twitter ids, and any detailed locations are converted to the nearest
large city.
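
As a rough illustration of the scheme described above, here is a minimal sketch, not the actual release pipeline. The record layout, the `anonymize` and `build_id_map` names, and the `nearest_city` callback are all assumptions made up for this example; the only ideas taken from the message are that ids become arbitrary identifiers and that locations are coarsened to a large city.

```python
import random


def build_id_map(records):
    """Map every real id seen in the data to an arbitrary integer.

    Hypothetical helper: the mapping is a random permutation, so the
    anonymized id carries no information about the real id once the
    mapping itself is discarded.
    """
    real_ids = set()
    for r in records:
        real_ids.add(r["user_id"])
        real_ids.update(r["follower_ids"])
        real_ids.update(r["friend_ids"])
    shuffled = sorted(real_ids)
    random.SystemRandom().shuffle(shuffled)
    return {real: anon for anon, real in enumerate(shuffled)}


def anonymize(records, nearest_city):
    """Return records with arbitrary ids and city-level locations.

    `nearest_city` is an assumed callback that maps a detailed location
    string to the name of the nearest large city.
    """
    id_map = build_id_map(records)
    released = []
    for r in records:
        released.append({
            "user_id": id_map[r["user_id"]],
            "city": nearest_city(r["location"]),
            # Sorting by anonymized id hides the original list order.
            "follower_ids": sorted(id_map[i] for i in r["follower_ids"]),
            "friend_ids": sorted(id_map[i] for i in r["friend_ids"]),
        })
    return released
```

The network structure survives the mapping, which is the point of the release and also the reason topology-based de-anonymization remains a possibility.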

I'm aware that it may be possible to de-anonymize some of these users based
on topology, but since much richer information is available through the API
on these users anyway, that seems unlikely to be an issue? However I'm
obviously keen to hear any concerns that Twitter (or other developers here)
may have before I go forward with this.

cheers,
Pete


Re: [twitter-dev] Academic data release

2010-02-24 Thread M. Edward (Ed) Borasky

Quoting Pete Warden p...@petewarden.com:


> I'm looking into releasing a data set based on information pulled from the
> Twitter API. It would be a free release limited to academic researchers, an
> anonymized version of the network connections of several million users with
> public profiles.
>
> What I'm hoping to release is something like this:
> user id, city-level location, follower ids, friend ids
>
> In all cases, the ids are arbitrary identifiers that are not convertible to
> actual Twitter ids, and any detailed locations are converted to the nearest
> large city.
>
> I'm aware that it may be possible to de-anonymize some of these users based
> on topology, but since much richer information is available through the API
> on these users anyway, that seems unlikely to be an issue? However I'm
> obviously keen to hear any concerns that Twitter (or other developers here)
> may have before I go forward with this.
>
> cheers,
> Pete



What is the value of such a dataset to an academic researcher? I
consider myself an academic researcher, though I don't have a formal
position as one. What can you do with a real Twitter social graph
that you can't do with one generated by random techniques based on
statistical sampling of Twitter data?

A million-user real social graph, even assuming fewer than 5,000
friend_ids and follower_ids per user, costs two million API calls. At
350 calls per hour, that works out to 238 days by my calculation. And
during that 238 days, the social graph is changing many times a
second.
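
The estimate above can be checked with a short sketch. The function name and parameters are made up for illustration; the figures (one million users, two calls per user, 350 calls per hour) are the ones quoted in this message, and the two-calls-per-user assumption holds only if each id list fits in a single response.

```python
def crawl_days(users, calls_per_user=2, calls_per_hour=350):
    """Days needed to crawl `users` accounts at a fixed hourly rate limit.

    calls_per_user=2 assumes one call for friend ids and one for
    follower ids, i.e. fewer than 5,000 ids in each list.
    """
    total_calls = users * calls_per_user
    hours = total_calls / calls_per_hour
    return hours / 24


if __name__ == "__main__":
    days = crawl_days(1_000_000)
    print(f"{days:.0f} days")  # prints "238 days"
```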

I don't see the value of a *static* sample of the Twitter social graph.
A randomly-generated graph of a much larger size could be
constructed in a day, *including* coding time, *and* you could
incorporate the changing nature of Twitter social graphs in a
simulation. What would be interesting to me would be the *model*, not  
the data. To quote Dr. Neil Gunther  
(http://www.perfdynamics.com/Manifesto/gcaprules.html):


Data comes from the Devil, only models come from God.

(And smiling at a subtle irony in my standard email signature) ;-)
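
As a rough illustration of how cheaply a synthetic follower graph can be produced, here is a minimal sketch using preferential attachment (in the style of the Barabasi-Albert model). This particular technique is an assumption chosen for the example, not something specified in this thread, and it is not fitted to any sampled Twitter statistics; the function name and parameters are hypothetical.

```python
import random


def preferential_attachment_graph(n, m, seed=None):
    """Generate directed (follower, followed) edges for n nodes.

    Each new node follows m distinct existing nodes, chosen with
    probability roughly proportional to how often they already appear
    in the pool, so well-connected nodes attract more followers.
    """
    rng = random.Random(seed)
    edges = []
    # Pool of candidate targets; a node appears once per connection.
    pool = list(range(m))
    for new in range(m, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(pool))
        for target in chosen:
            edges.append((new, target))
        pool.extend(chosen)
        pool.extend([new] * m)
    return edges
```

Calling `preferential_attachment_graph(1_000_000, 10)` builds a ten-million-edge graph without a single API call, though whether such a graph reproduces the social properties of the real one is exactly what is in question here.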

--
M. Edward (Ed) Borasky
borasky-research.net/m-edward-ed-borasky/

A mathematician is a device for turning coffee into theorems. ~ Paul Erdos


Re: [twitter-dev] Academic data release

2010-02-24 Thread John Kalucki
It's possible, if not likely, that releasing this data would be against one
or more Service Terms.

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.

On Wed, Feb 24, 2010 at 7:54 AM, Pete Warden p...@petewarden.com wrote:

> I'm looking into releasing a data set based on information pulled from the
> Twitter API. It would be a free release limited to academic researchers, an
> anonymized version of the network connections of several million users with
> public profiles.
>
> What I'm hoping to release is something like this:
> user id, city-level location, follower ids, friend ids
>
> In all cases, the ids are arbitrary identifiers that are not convertible to
> actual Twitter ids, and any detailed locations are converted to the nearest
> large city.
>
> I'm aware that it may be possible to de-anonymize some of these users based
> on topology, but since much richer information is available through the API
> on these users anyway, that seems unlikely to be an issue? However I'm
> obviously keen to hear any concerns that Twitter (or other developers here)
> may have before I go forward with this.
>
> cheers,
> Pete



Re: [twitter-dev] Academic data release

2010-02-24 Thread Pete Warden
The value lies in the particular properties of a real social graph, as
opposed to an artificially generated one. The sort of questions it's useful
for are primarily social rather than mathematical. For a summary of some
existing research on similar data sets, see:

http://petewarden.typepad.com/searchbrowser/2010/02/social-network-data-and-research.html

On Wed, Feb 24, 2010 at 11:18 AM, M. Edward (Ed) Borasky
zn...@cesmail.net wrote:

> Quoting Pete Warden p...@petewarden.com:
>
>> I'm looking into releasing a data set based on information pulled from the
>> Twitter API. It would be a free release limited to academic researchers, an
>> anonymized version of the network connections of several million users with
>> public profiles.
>>
>> What I'm hoping to release is something like this:
>> user id, city-level location, follower ids, friend ids
>>
>> In all cases, the ids are arbitrary identifiers that are not convertible to
>> actual Twitter ids, and any detailed locations are converted to the nearest
>> large city.
>>
>> I'm aware that it may be possible to de-anonymize some of these users based
>> on topology, but since much richer information is available through the API
>> on these users anyway, that seems unlikely to be an issue? However I'm
>> obviously keen to hear any concerns that Twitter (or other developers here)
>> may have before I go forward with this.
>
> What is the value of such a dataset to an academic researcher? I consider
> myself an academic researcher, though I don't have a formal position as one.
> What can you do with a real Twitter social graph that you can't do with
> one generated by random techniques based on statistical sampling of Twitter
> data?
>
> A million-user real social graph, even assuming fewer than 5,000
> friend_ids and follower_ids per user, costs two million API calls. At 350
> calls per hour, that works out to 238 days by my calculation. And during
> that 238 days, the social graph is changing many times a second. A
> randomly-generated graph of a much larger size could be constructed in a
> day, *including* coding time, *and* you could incorporate the changing
> nature of Twitter social graphs in a simulation.
>
> (Smiling at the subtle irony in my standard email signature) ;-)
>
> --
> M. Edward (Ed) Borasky
> borasky-research.net/m-edward-ed-borasky/
>
> A mathematician is a device for turning coffee into theorems. ~ Paul Erdos