[twitter-dev] Streaming API and older tweets?
Hello,

I'm building an archival tool for a specific hashtag. After looking at the API and testing the Streaming API, I got it working to pull any new tweets. However, it seems the count parameter is restricted.

From http://apiwiki.twitter.com/Streaming-API-Documentation#QueryParameters:

"Firehose, Retweet, Link, Birddog and Shadow clients interested in capturing all statuses should maintain a current estimate of the number of statuses received per second and note the time that the last status was received. Upon a reconnect, the client can then estimate the appropriate backlog to request. Note that the count parameter is not allowed elsewhere, including track, sample and on the default access role."

I'm a little confused by that statement. Does that mean I can't use count on any track requests, or just on track with the default access role? If it is the latter, how can I get Shadow access? Google isn't cooperating much because the name is such a common word.

Otherwise, pulling past tweets from the Streaming API does not seem possible. I'm able to do that with Search using since_id, but that seems a little odd. So if my streaming client dies, since it cannot look back at the tweets it missed, I'm going to have to build a full catch-up procedure using the Search API to fill in the gap left while the streaming client was down.

Is that really what is expected of third-party developers? I'm puzzled that the Twitter team is asking everyone who pulls data to move to the Streaming API, yet it seems we are going to need a very big workaround to get a reliable source of data, and there is no simple way to get old data out of the system.
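The documentation excerpt quoted above asks clients to keep a running estimate of statuses per second and the time the last status arrived, then use both to size the backlog on reconnect. A minimal sketch of that bookkeeping follows. The class name `BacklogEstimator`, the 60-second window, and the `safety_factor` parameter are assumptions made for illustration, not part of the Twitter API, and the result would only be useful on roles where the count parameter is allowed.

```python
import math
from collections import deque


class BacklogEstimator:
    """Tracks recent arrival times to estimate how many statuses a reconnect missed."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds  # assumed window size, tune as needed
        self.arrivals = deque()               # arrival timestamps inside the window
        self.last_received = None

    def record(self, now):
        """Call once per status received; `now` is a timestamp in seconds."""
        self.arrivals.append(now)
        self.last_received = now
        # Drop arrivals that have fallen out of the sliding window.
        while self.arrivals and now - self.arrivals[0] > self.window_seconds:
            self.arrivals.popleft()

    def rate(self):
        """Estimated statuses per second over the observed part of the window."""
        if len(self.arrivals) < 2:
            return 0.0
        span = self.arrivals[-1] - self.arrivals[0]
        return (len(self.arrivals) - 1) / span if span > 0 else 0.0

    def backlog(self, reconnect_time, safety_factor=1.0):
        """Estimated number of statuses missed between the last receipt and reconnect."""
        if self.last_received is None:
            return 0
        gap = max(0.0, reconnect_time - self.last_received)
        return math.ceil(self.rate() * gap * safety_factor)
```

A client would call `record(time.time())` for each status and pass the value of `backlog(time.time())` as the backlog to request when it reconnects. Raising `safety_factor` above 1.0 over-requests slightly, which trades duplicate statuses for a smaller chance of a gap.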
Re: [twitter-dev] Streaming API and older tweets?
You can use the Search API to do historical queries when you first start following a hash tag, then switch to Streaming. Search results may be incomplete, but they should be sufficient in this case.

Currently the count parameter is not supported in conjunction with track on any role, due to both cost and scraping potential. Track users must be willing to tolerate some minor data loss during reconnection. Note that all users of Search have always tolerated occasional minor data loss; Search has always been slightly lossy.

A well-coded Streaming client can usually reconnect within a second. You can try to paper over data loss during this period with the Search API, but it may not be worth the effort. If a keyword has such velocity that there's much data loss during that second of reconnection, search results are likely to be heavily filtered anyway. In general, this Streaming approach is likely to be closer to a total covering of the keyword than querying Search.

In the end, if a keyword is prevalent enough for a gap to be noticeable during a reconnect, it's a pretty high-volume keyword. Is data loss in this case an actual practical issue for your app when you are receiving tens or hundreds of tweets per minute?

If there is sufficient demand, we'll investigate historical queries for track at higher access levels, but currently this is a low-priority issue that will soon have a workable alternative. The workaround would be to take the Firehose, once we announce terms, and use the count parameter there. It's possible to consume the Firehose without common-case data loss.

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.

On Fri, Jan 22, 2010 at 5:41 AM, Jorge Vargas jorge.var...@gmail.com wrote: "Hello, I'm building an archival tool for a specific hashtag [...]"
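For anyone who does decide to paper over a reconnect gap with Search, the catch-up step might look roughly like the sketch below. It pages backwards from the first tweet seen after reconnecting down to the last tweet seen before the disconnect. The `search` callable and its `(query, since_id, max_id, count)` signature are hypothetical stand-ins for whatever Search client is in use, and each tweet is assumed to be a dict with an integer `"id"`. Since Search is described above as slightly lossy, the result should be treated as best effort.

```python
def fill_gap(search, query, last_seen_id, first_new_id, page_size=100):
    """Collect tweets with last_seen_id < id < first_new_id, oldest first.

    `search` is a hypothetical callable (query, since_id, max_id, count) that
    returns a list of dicts with an integer "id", newest first, restricted to
    since_id < id <= max_id.
    """
    found = {}  # keyed by id so overlapping pages never produce duplicates
    max_id = first_new_id - 1
    while max_id > last_seen_id:
        page = search(query, since_id=last_seen_id, max_id=max_id, count=page_size)
        if not page:
            break  # nothing older is left inside the gap
        for tweet in page:
            found[tweet["id"]] = tweet
        # Continue just below the oldest id returned on this page.
        max_id = min(tweet["id"] for tweet in page) - 1
    return [found[i] for i in sorted(found)]
```

The streaming client would need to remember the id of the last status it received before the disconnect and the id of the first status after it, then call `fill_gap` once with those two values.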
Re: [twitter-dev] Streaming API and older tweets?
Thank you for your very complete answer, John. That is exactly what I was thinking. From what I can tell, my tag isn't very high traffic, so I'm going to use the Search API to pull the older results and leave my current Streaming client running to pull anything new.

Thank you!

On Fri, Jan 22, 2010 at 2:06 PM, John Kalucki j...@twitter.com wrote: "You can use the Search API to do historical queries when you first start following a hash tag, then switch to Streaming. [...]"
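With older results coming from Search and new ones from Streaming, the same tweet can arrive from both sources. One way to make that harmless is to key the archive on the tweet id so that repeated inserts are ignored. The sketch below uses SQLite from the Python standard library. The table layout, the function names `open_archive` and `store`, and the assumption that each tweet is a dict with `"id"` and `"text"` keys are illustrative choices, not something taken from this thread.

```python
import sqlite3


def open_archive(path=":memory:"):
    """Open (or create) the archive; the tweet id is the primary key."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tweets (id INTEGER PRIMARY KEY, text TEXT NOT NULL)"
    )
    return db


def store(db, tweets):
    """Insert tweets, skipping ids already archived; returns how many were new."""
    before = db.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
    db.executemany(
        "INSERT OR IGNORE INTO tweets (id, text) VALUES (?, ?)",
        [(tweet["id"], tweet["text"]) for tweet in tweets],
    )
    db.commit()
    after = db.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
    return after - before
```

Both the Search backfill and the Streaming client can then call `store` on whatever they receive, in any order, without a separate de-duplication pass.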