Thank you, Doug, for the reply.

Currently, I am able to get only about 1,000 tweets from a user's
timeline, though the limit is supposed to be about 3,000. I also
requested whitelisting and am glad it was accepted, but I don't know
where to request access to the datamining feed.

Caching is something I will definitely be doing, but as I said, for
something complex like semantic model generation I need access to at
least the last 100,000 tweets of a user's friends_timeline. For a
typical user following 100 reasonably active people, accumulating that
history through caching alone would take 2-3 months, which is too long
to wait before the application becomes usable.

If I may suggest, here is a feasible alternative: currently, a tweet
is enclosed in a large header structure, many of whose elements are
often unused. For the specific needs of this project (or any NLP
machine-learning project using the longer tail of tweet history), the
following header-less format would be fine:

.
.

Tweet time in milliseconds since 1-1-1970
Tweet id [in-reply-to tweet id, if any]
screen_name
text
<blank line>
.
.

This would take about 160 bytes per tweet, assuming the average tweet
text is 100 characters. So 100,000 tweets would be about 16 MB, or
roughly 6 MB compressed. That would be a one-time download; afterwards
the standard API access every 5 minutes would suffice to keep the
application's state current, while caching takes care of model
updates. The current header structure took around 1 MB to download
1,000 tweets, so this header-less format would barely affect the
Twitter service at all.
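The arithmetic behind those figures (the per-field byte counts are my
rough assumptions, consistent with the averages stated above):

```python
# Back-of-the-envelope size estimate for the header-less format.
PER_TWEET_BYTES = 160   # ~13 for the timestamp, ~20 for the ids,
                        # ~15 for screen_name, ~100 for text,
                        # plus a few separator newlines
N_TWEETS = 100_000

raw_bytes = PER_TWEET_BYTES * N_TWEETS
raw_mb = raw_bytes / 1_000_000
print(raw_mb)  # 16.0 MB uncompressed
```

The ~6 MB compressed estimate then corresponds to a roughly 2.7:1
compression ratio, which is plausible for short English text.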

I can provide a Perl script to generate this format if that would be
helpful. It would certainly help NLP researchers and other developers
interested in more intelligent processing, organization, and
presentation of a user's friends_timeline.

Let me know if such a thing would be supported by Twitter.

Thanks


On Apr 8, 9:28 pm, Doug Williams <[email protected]> wrote:
> We don't have a method to download the entire friends_timeline for a user.
> If you search the boards or documentation you will find there is an
> artificial limit on the number of tweets you can download [1].
>
> People doing datamining often request access to the datamining feed and
> cache tweets as they come in.
>
> 1. http://apiwiki.twitter.com/REST+API+Documentation#PaginationLimiting
>
> Doug Williams
> Twitter API Support
> http://twitter.com/dougw
>
>
>
> On Wed, Apr 8, 2009 at 8:26 AM, kanny <[email protected]> wrote:
>
> > I am interested to do something deeper than the surface-level
> > processing of a user's incoming tweets. For this, I will need to
> > create a corpus of the user's friends_timeline over, say, past one
> > month or any computationally feasible period. Basically, a large
> > enough set of, say, 1-100 Million tweets for someone following
> > 100-1000 people. It would be only a one-time download, as afterwards,
> > incremental downloads should suffice.
>
> > This would translate into 100MB-10 GB of download for a user. It could
> > be less for people following less or less-active people. Does Twitter
> > API provide support for such corpus creation ? It could be very
> > helpful for Natural Language Processing research if Twitter creates
> > some sample corpus of public_timeline or some selected user's
> > timelines.
>
> > Looking forward to some help in this regard.
> > Thanks
