If you are writing a general purpose display app, I think (but I am not at all certain) that you can ignore this issue. Reasonable polling frequency on modest velocity timelines will sometimes, but very rarely, miss a tweet. Also, over time, we're doing things to make this better for everyone. Many of our projects have the side effect of reducing K, decreasing the already low since_id failure odds even further. Some tweet pipeline changes went live in the last few weeks that dramatically reduce the K distribution for various user types.
Since I don't know how the Last-Modified time exactly works, I'm going to restate your response slightly:

Assuming synchronized clocks (or solely the Twitter Clock, if exposed properly via Last-Modified), given a poll at time t where the newest status is at least n seconds old (timestamped at or before t - n), and given sufficient n, even a naive since_id algorithm will be effectively Exactly Once, assuming that Twitter is running normally. For a given poll, when the delta between the poll time and the last update time drops below this n second period, there's a non-zero loss risk.

Just what is n? It is K expressed as time rather than as a discrete count. For some timeline types, with some classes of users, K is as much as perhaps 180 seconds. For others, K is less than 1 second. There's some variability here that we should characterize more carefully internally and then discuss publicly. I suspect there's a lot to be learned from this exercise.

Since_id really runs into trouble when any of the following are too great: the polling frequency, the updating frequency, the roughly-sorted K value.

If you are polling often to reduce display latency, use the Streaming API. If the timeline moves too fast to capture it all exactly, you should reconsider your requirements or get a Commercial Data License for the Streaming API. Does the user really need to see every Bieber at 3 Biebers Per Second? How would they ever know if they missed 10^-5 of them in a blur? If you need them all for analysis, consider calculating the confidence interval given a sample proportion of 1 - 10^-6 (6 9s) or so vs. a total enumeration. Indistinguishable. If you need them for some other purpose, say CRM, the Streaming API may be the answer.

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.

On Fri, Apr 9, 2010 at 2:28 PM, Brian Smith <[email protected]> wrote:

> John,
>
> I am not polling. I am simply trying to implement a basic “refresh” feature
> like every desktop/mobile Twitter app has.
> Basically, I just want to let
> users scroll through their timelines, and be reasonably sure that I am
> presenting them with an accurate & complete view of the timeline, while
> using as little bandwidth as possible.
>
> When I said “10 seconds old”/“30 seconds old”/etc. I was referring to
> the age at the time the page of tweets was generated. So, basically, if
> the difference between the tweet’s timestamp and the response’s
> Last-Modified time is more than 10,000 ms (from what you said below),
> you are almost definitely getting At Least Once behavior if Twitter is
> operating normally, and you can use that information to get At Least
> Once behavior that emulates Exactly Once behavior with little (usually
> no) overhead. Is that a correct interpretation of what you were saying?
>
> Thanks,
>
> Brian
>
> *From:* [email protected] [mailto:
> [email protected]] *On Behalf Of *John Kalucki
> *Sent:* Friday, April 09, 2010 3:31 PM
> *To:* [email protected]
> *Subject:* Re: [twitter-dev] Re: Upcoming changes to the way status IDs
> are sequenced
>
> Your second paragraph doesn't quite make sense. The period between your
> next poll and the timestamp of the last status is irrelevant. The issue is
> solely the magnitude of K on the roughly sorted stream of events that are
> applied to the materialized timeline vector. As K varies, so do the odds,
> however infinitesimally small, that you will miss a tweet using the last
> status id returned. The period between your polls of the API does not affect
> this K.
>
> My recommendation is to ignore this issue in nearly every use case. If you
> are, however, polling high velocity timelines (including search queries) and
> attempting to approximate an Exactly Once QoS, you should, basically, stop
> doing that. You are probably wasting resources and you'll probably never get
> Exactly Once behavior anyway. Use the Streaming API instead.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure, Twitter Inc.
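The age-threshold idea raised in this exchange (only trust a status id as since_id once the status is at least n seconds old relative to the response's Last-Modified time) could be sketched roughly as follows. This is an illustration only, not anything Twitter provides: `safe_since_id`, the `created_at` field holding a timezone-aware `datetime`, and the 10 second default are assumptions made for the example, and the thread itself notes that n (K expressed as time) varies from under 1 second to perhaps 180 seconds.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def safe_since_id(statuses, last_modified, min_age_s=10.0):
    """Return the newest id among statuses that were at least min_age_s old
    when the page was generated, or None if no status is old enough.

    statuses      -- iterable of dicts with hypothetical keys "id" (int) and
                     "created_at" (timezone-aware datetime)
    last_modified -- the response's Last-Modified header value (HTTP-date)
    """
    generated_at = parsedate_to_datetime(last_modified)
    cutoff = generated_at - timedelta(seconds=min_age_s)
    old_enough = [s["id"] for s in statuses if s["created_at"] <= cutoff]
    return max(old_enough, default=None)
```

Because the returned id is deliberately older than the newest status on the page, the next request will return some statuses again, so the caller would still need to deduplicate by id.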
>
> On Fri, Apr 9, 2010 at 12:20 PM, Brian Smith <[email protected]> wrote:
>
> John,
>
> Thank you. That was one of the most informative emails on the Twitter API I
> have seen on the list.
>
> Basically, even now, an application should not use an ID of a tweet for
> since_id if the tweet is less than 10 seconds old, ignoring service
> abnormalities. Probably a larger threshold (30 seconds or even a minute)
> would be better, especially when you take into consideration the likelihood
> of clock skew between the servers that generate the timestamps.
>
> I think this is information that would be useful to have added to the API
> documentation, as I know many applications are taking a much more naive
> approach to pagination.
>
> Thanks again,
>
> Brian
>
> *From:* [email protected] *On Behalf Of *John
> Kalucki
> *Sent:* Friday, April 09, 2010 1:20 PM
> *To:* [email protected]
> *Subject:* Re: [twitter-dev] Re: Upcoming changes to the way status IDs
> are sequenced
>
> Folks are making a lot of incorrect assumptions about the Twitter
> architecture, especially around how we materialize and present timeline
> vectors and just what QoS we're really offering. This new scheme does not
> significantly, or perhaps even observably, make the existing issues around
> since_id any better or any worse. And I'm being very precise here. The
> since_id situation is such that the few milliseconds skew possible in
> Snowflake are practically irrelevant and lost in the noise of a 4 to 6
> orders-of-magnitude misconception. (That's a very big misconception.)
>
> If you do not know the rough ordering of our event stream as it is applied
> to the materialized timeline vectors and also the expected rate of change on
> the timeline in question, you cannot make good choices about making since_id
> perfect. But neither should you try to make it perfect, nor should you
> have to worry about this.
>
> If you insist upon worrying about this, here's my slight salting of Mark's
> advice: In the existing continuously increasing id generation scheme on the
> Twitter.com API, I'd subtract about 5000 ids from since_id to ensure
> sufficient overlap in nearly all cases, but even this could be lossy in the
> face of severe operational issues -- issues of a type that we haven't seen
> in many many months. The search API has a different K in its rough ordering,
> so you might need more like 10,000 ids. In the new Snowflake scheme, I'd
> overlap by about 5000 milliseconds for twitter.com APIs and 10,000 ms for
> search APIs.
>
> Despite all this, things still could go wrong. An engineer here is known
> for pointing out that even things that almost never ever happen, happen all
> the time on the Twitter system. Now, just because they are happening, to
> someone, all the time, doesn't mean that they'll ever ever happen to you or
> your users in a thousand years -- but someone's getting hit with it,
> somewhere, a few times a day.
>
> The above schemes no longer treat the id as an opaque unique ordered
> identifier. And woe lies in wait for you as changes are made to these ids.
> Woe. You also need to deduplicate. Be very careful and understand fully what
> you summon by breaking this semantic contract.
>
> In the end, since_id issues go away on the Streaming API, and other than
> around various start-up discontinuities, you can ignore this issue. I'll be
> talking about Rough Ordering, among other things Streaming, at the Chirp
> conference. Come geek out.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure, Twitter Inc.
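For readers who do insist on worrying, the overlap-and-deduplicate approach described in the message above might look something like the sketch below. It is only an illustration under stated assumptions: the function names are made up for this example, statuses are assumed to be dicts with an integer `id`, and the Snowflake variant assumes the millisecond timestamp sits above the low 22 bits of the id, which the thread does not specify. As the message warns, any scheme like this stops treating the id as opaque and may break when the id format changes.

```python
from typing import Iterable, List, Set

# ASSUMPTION: a millisecond timestamp occupies the bits above the low 22 bits
# of a Snowflake id. The thread does not describe the bit layout.
TIMESTAMP_SHIFT = 22


def overlapped_since_id_sequential(last_id: int, overlap_ids: int = 5_000) -> int:
    """Existing sequential scheme: back off by a number of ids
    (about 5000 suggested for twitter.com APIs, about 10,000 for search)."""
    return max(last_id - overlap_ids, 0)


def overlapped_since_id_snowflake(last_id: int, overlap_ms: int = 5_000) -> int:
    """Snowflake scheme: back off by a number of milliseconds
    (about 5000 suggested for twitter.com APIs, 10,000 for search)."""
    return max(last_id - (overlap_ms << TIMESTAMP_SHIFT), 0)


def merge_new(seen_ids: Set[int], fetched: Iterable[dict]) -> List[dict]:
    """Deduplicate: keep only statuses whose id has not been seen before,
    recording each newly seen id in seen_ids."""
    fresh = []
    for status in fetched:
        if status["id"] not in seen_ids:
            seen_ids.add(status["id"])
            fresh.append(status)
    return fresh
```

A client would pass the overlapped value as since_id on the next request and run the response through `merge_new`, so the statuses fetched twice because of the overlap are shown only once.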
>
> On Fri, Apr 9, 2010 at 1:58 AM, Dave Sherohman <[email protected]> wrote:
>
> On Thu, Apr 08, 2010 at 05:03:29PM -0700, Naveen wrote:
> > However, I wanted to be clear and feel it should be made obvious that
> > with this change, there is a possibility that a tweet may not be
> > delivered to client if the implementation of how since_id is currently
> > used is not updated to cover the case. I still envision the situation
> > as more likely than you seem to believe and figure as tweet velocity
> > increases, the likelihood will also increase; But I am assuming you have
> > better data to support your viewpoint than I and shall defer.
>
> Maybe I'm just missing something here, but it seems trivial to fix on
> Twitter's side (enough so that I assume it's what they've been planning
> from the start to do): Only return tweets from closed buckets.
>
> We are guaranteed that the buckets will be properly ordered. The order
> will only be randomized within a bucket. Therefore, by only returning
> tweets from buckets which are no longer receiving new tweets, since_id
> works and will never miss a tweet.
>
> And, yes, this does mean a slight delay in getting the tweets out
> because they have to wait a few milliseconds for their bucket to close
> before being exposed to calls which can use since_id, plus maybe a
> little longer for the contents of that bucket to be distributed to
> multiple servers. That's still going to only take time comparable to
> round-trip times for an HTTP request to fetch the data for display to a
> user and be far, far less than the average refresh delay required by
> those clients which fall under the API rate limit. I submit, therefore,
> that any such delay caused by waiting for buckets to close will be
> inconsequential.
>
> --
> Dave Sherohman
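The "closed buckets" proposal in the last quoted message can be illustrated with a small sketch. This models the suggestion only and is not a description of how Twitter serves timelines: the `created_ms` field, the bucket width, and the grace period for replication are all hypothetical values chosen for the example.

```python
def visible_statuses(statuses, now_ms, bucket_ms=10, grace_ms=0):
    """Return only statuses whose time bucket has already closed.

    statuses  -- iterable of dicts with hypothetical keys "id" and
                 "created_ms" (creation time in milliseconds)
    now_ms    -- current server time in milliseconds
    bucket_ms -- hypothetical bucket width; ids are assumed to be ordered
                 across buckets and arbitrary only within a bucket
    grace_ms  -- optional extra wait, e.g. for a bucket's contents to reach
                 all servers
    """
    # The bucket containing (now_ms - grace_ms) may still receive statuses,
    # so only strictly earlier buckets are treated as closed.
    open_bucket = (now_ms - grace_ms) // bucket_ms
    closed = [s for s in statuses if s["created_ms"] // bucket_ms < open_bucket]
    return sorted(closed, key=lambda s: s["id"])
```

Under these assumptions, no status with a smaller id can appear later than anything already returned, which is the property since_id relies on; the cost is a delay of up to one bucket width plus the grace period.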
