[twitter-dev] Streaming API and older tweets?

2010-01-22 Thread Jorge Vargas
Hello,

I'm building an archival tool for a specific hashtag. After looking at
the API and testing the Streaming API, I got it working to pull any
new tweets. However, it seems the count parameter is restricted.
From http://apiwiki.twitter.com/Streaming-API-Documentation#QueryParameters:

Firehose, Retweet, Link, Birddog and Shadow clients interested in
capturing all statuses should maintain a current estimate of the
number of statuses received per second and note the time that the last
status was received. Upon a reconnect, the client can then estimate
the appropriate backlog to request. Note that the count parameter is
not allowed elsewhere, including track, sample and on the default
access role.

I'm a little confused by that statement. Does that mean I can't use
count on any track requests, or just on track with the default access
role? In that case, how can I get Shadow access? Google isn't
cooperating much due to the simplicity of the name.

Otherwise, pulling tweets from the past through the Streaming API does
not seem possible. I'm able to do that with Search using since_id, but
that seems a little weird.

So if my streaming client dies, since it is impossible for it to look
back at the tweets it missed, I'm going to have to build a full catch-up
procedure using the Search API to fill in the gap between the moment
the client went down and the moment it came back up. Is that really
what is expected of third-party developers?

I'm just puzzled that the Twitter team is asking everyone who pulls
data to move to the Streaming API, yet it seems we are going to need a
very big workaround to get a reliable source of data, and there is no
simple way to get old data out of the system.


Re: [twitter-dev] Streaming API and older tweets?

2010-01-22 Thread John Kalucki
You can use the Search API to do historical queries when you first
start following a hashtag, then switch to Streaming. Search results
may be incomplete, but should be sufficient in this case.
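
A minimal sketch of that "Search first, then Streaming" hand-off, under
stated assumptions: each tweet is a dict with an integer "id" key, and
the two inputs are whatever iterables your Search and Streaming clients
produce. The function name backfill_then_stream is hypothetical and not
part of any Twitter library. It only shows the ordering and the
de-duplication, not the HTTP calls.

```python
from typing import Dict, Iterable, Iterator

Tweet = Dict[str, object]  # assumed shape: at least an integer "id" key


def backfill_then_stream(historical: Iterable[Tweet],
                         live: Iterable[Tweet]) -> Iterator[Tweet]:
    """Yield Search results oldest-first, then streamed tweets, skipping repeats."""
    seen = set()  # ids already archived; grows without bound in this sketch
    # Search results may arrive newest-first, so sort them by id before archiving.
    for tweet in sorted(historical, key=lambda t: t["id"]):
        if tweet["id"] not in seen:
            seen.add(tweet["id"])
            yield tweet
    # The stream may overlap the tail of the Search results; drop the overlap.
    for tweet in live:
        if tweet["id"] not in seen:
            seen.add(tweet["id"])
            yield tweet
```

A long-running archiver would probably keep the seen ids in its data
store rather than in memory.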

Currently the count parameter is not supported in conjunction with
track on any role, due to both cost and scraping potential. Track
users must be willing to tolerate some minor data loss during
reconnection. Note that all users of Search have always tolerated
occasional minor data loss -- Search has always been slightly lossy.

A well-coded Streaming client can usually reconnect within a second.
You can try to paper over data loss during this period with the Search
API, but it may not be worth the effort. If a keyword has such
velocity that there's much data loss during that second of
reconnection, search results are likely to be heavily filtered anyway.
In general, this Streaming approach is likely to be closer to a total
covering of the keyword than querying Search.
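
If you do decide to paper over a reconnect gap with Search, one way to
pick out the missing tweets is sketched below. It assumes tweet ids
increase over time (the same assumption since_id relies on), that each
result is a dict with an integer "id" key, and that you recorded the
last id seen before the drop and the first id seen after reconnecting.
The function name gap_fill and its parameters are hypothetical.

```python
from typing import Dict, Iterable, List

Tweet = Dict[str, object]  # assumed shape: at least an integer "id" key


def gap_fill(search_results: Iterable[Tweet],
             last_id_before_drop: int,
             first_id_after_reconnect: int) -> List[Tweet]:
    """Return Search results that fall strictly inside the reconnect gap."""
    missing = {}
    for tweet in search_results:
        tweet_id = tweet["id"]
        # Keep only ids the stream cannot have delivered on either side of the gap.
        if last_id_before_drop < tweet_id < first_id_after_reconnect:
            missing[tweet_id] = tweet  # keying by id also drops duplicate results
    return [missing[tweet_id] for tweet_id in sorted(missing)]
```

Since Search itself may be incomplete, an empty result does not prove
that nothing was lost during the gap.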

In the end, if a keyword is prevalent enough to be noticed as a gap
during a reconnect, it's a pretty high volume keyword. Is data loss in
this case an actual practical issue for your app when you are
receiving tens or hundreds of tweets per minute?

If there is sufficient demand, we'll investigate historical queries
for track at higher access levels, but currently this is a low
priority issue that will soon have a workable alternative. The
workaround would be to take the Firehose, once we announce terms, and
use the count parameter there. It's possible to consume the Firehose
without common-case data loss.
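
For streams where count is allowed, the bookkeeping the documentation
describes (a running statuses-per-second estimate plus the time of the
last status) could be sketched roughly as follows. The class name, the
smoothing constant, and the safety factor are all assumptions made for
illustration, and the sketch only computes a number; how that number is
passed as count is left to your client.

```python
import math
from typing import Optional


class BacklogEstimator:
    """Track a smoothed statuses-per-second rate and the last arrival time."""

    def __init__(self, smoothing: float = 0.1) -> None:
        self.smoothing = smoothing          # weight given to each new observation
        self.rate = 0.0                     # estimated statuses per second
        self.last_status_time: Optional[float] = None

    def record(self, now: float) -> None:
        """Call once per status received; `now` is a timestamp in seconds."""
        if self.last_status_time is not None:
            interval = now - self.last_status_time
            if interval > 0:  # statuses with identical timestamps are skipped
                instantaneous = 1.0 / interval
                if self.rate == 0.0:
                    self.rate = instantaneous
                else:
                    # Exponentially weighted moving average of the rate.
                    self.rate += self.smoothing * (instantaneous - self.rate)
        self.last_status_time = now

    def estimate_backlog(self, now: float, safety_factor: float = 1.5) -> int:
        """Estimate how many statuses were missed since the last one arrived."""
        if self.last_status_time is None:
            return 0
        downtime = max(0.0, now - self.last_status_time)
        return math.ceil(self.rate * downtime * safety_factor)
```

Over-requesting slightly and discarding duplicates by id is usually
cheaper than under-requesting, which is why the sketch pads the
estimate with a safety factor.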

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.













Re: [twitter-dev] Streaming API and older tweets?

2010-01-22 Thread Jorge Vargas
Thank you for your very complete answer, John.

That is exactly what I was thinking. From what I can tell, my tag isn't
very high traffic, so I'm going to check the Search API to pull the
older results and leave my current Streaming client running to pull
anything new.

Thank you!
