Hi John (et al.),

These emails from you are great -- they are exactly the sort of thoughtful, detailed, specific, technical emails that I would personally love to see accompany future announcements. I think they would prevent a fair amount of FUD. Thank you.
I have one stupid question, if you don't mind, though. You refer in every email to "K". What, precisely, does K refer to? What are its units? (I think I know what you mean by it, but I'd be interested to hear precisely.)

Thanks,
Josh

On Sun, Apr 11, 2010 at 2:23 PM, John Kalucki <j...@twitter.com> wrote:
> If you are writing a general purpose display app, I think (but I am not at all certain) that you can ignore this issue. Reasonable polling frequency on modest velocity timelines will sometimes, but very rarely, miss a tweet. Also, over time, we're doing things to make this better for everyone. Many of our projects have the side-effect of reducing K, decreasing the already low since_id failure odds even further. Some tweet pipeline changes went live in the last few weeks that dramatically reduce the K distribution for various user types.
>
> Since I don't know how the Last-Modified time exactly works, I'm going to restate your response slightly:
>
> Assuming synchronized clocks (or solely the Twitter Clock, if exposed properly via Last-Modified), given a poll at time t, the newest status at least n seconds old (created at or before t - n), and sufficient n, then even a naive since_id algorithm will be effectively Exactly Once, assuming that Twitter is running normally. For a given poll, when the delta between the poll time and the last update time drops below this n-second period, there's a non-zero loss risk.
>
> Just what is n? It is K expressed as time rather than as a discrete count. For some timeline types, with some classes of users, K is as much as perhaps 180 seconds. For others, K is less than 1 second. There's some variability here that we should characterize more carefully internally and then discuss publicly. I suspect there's a lot to be learned from this exercise.
>
> Since_id really runs into trouble when any of the following are too great: the polling frequency, the updating frequency, the roughly-sorted K value.
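[Editor's note: a minimal sketch of the poll-with-margin approach John describes -- only advance since_id past statuses at least n seconds old. The `fetch_timeline` client and its field names are hypothetical, not actual Twitter API bindings.]

```python
import time

# Worst-case n cited in the thread: K expressed as time, up to ~180 s
# for some timeline types. Illustrative constant, not an official value.
SAFETY_WINDOW_S = 180

def poll_once(fetch_timeline, since_id):
    """fetch_timeline(since_id=...) is a hypothetical client returning
    dicts with "id" and "created_at" (epoch seconds), newest first."""
    tweets = fetch_timeline(since_id=since_id)
    now = time.time()
    new_since_id = since_id
    for t in tweets:
        # Only advance since_id past tweets old enough that a
        # roughly-sorted straggler can no longer land behind them.
        if now - t["created_at"] >= SAFETY_WINDOW_S:
            new_since_id = max(new_since_id, t["id"])
    return tweets, new_since_id
```

Tweets younger than the window will be re-fetched on the next poll, so the caller must deduplicate by id -- this is the At Least Once behavior that emulates Exactly Once discussed below.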
> If you are polling often to reduce display latency, use the Streaming API. If the timeline moves too fast to capture it all exactly, you should reconsider your requirements or get a Commercial Data License for the Streaming API. Does the user really need to see every Bieber at 3 Biebers Per Second? How would they ever know if they missed 10^-5 of them in a blur? If you need them all for analysis, consider calculating the confidence interval given a sample proportion of 1 - 10^-6 (six 9s) or so vs. a total enumeration. Indistinguishable. If you need them for some other purpose, say CRM, the Streaming API may be the answer.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure, Twitter Inc.
>
> On Fri, Apr 9, 2010 at 2:28 PM, Brian Smith <br...@briansmith.org> wrote:
>> John,
>>
>> I am not polling. I am simply trying to implement a basic “refresh” feature like every desktop/mobile Twitter app has. Basically, I just want to let users scroll through their timelines, and be reasonably sure that I am presenting them with an accurate & complete view of the timeline, while using as little bandwidth as possible.
>>
>> When I said “10 seconds old”/“30 seconds old”/etc., I was referring to the age at the time the page of tweets was generated. So, basically, if the tweet’s timestamp is more than 10,000 ms (from what you said below) behind the response’s Last-Modified time, you are almost definitely getting At Least Once behavior if Twitter is operating normally, and you can use that information to get At Least Once behavior that emulates Exactly Once behavior with little (usually no) overhead. Is that a correct interpretation of what you were saying?
>> Thanks,
>> Brian
>>
>> From: email@example.com [mailto:twitter-development-t...@googlegroups.com] On Behalf Of John Kalucki
>> Sent: Friday, April 09, 2010 3:31 PM
>> To: firstname.lastname@example.org
>> Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced
>>
>> Your second paragraph doesn't quite make sense. The period between your next poll and the timestamp of the last status is irrelevant. The issue is solely the magnitude of K on the roughly sorted stream of events that are applied to the materialized timeline vector. As K varies, so do the odds, however infinitesimally small, that you will miss a tweet using the last status id returned. The period between your polls of the API does not affect this K.
>>
>> My recommendation is to ignore this issue in nearly every use case. If you are, however, polling high velocity timelines (including search queries) and attempting to approximate an Exactly Once QoS, you should, basically, stop doing that. You are probably wasting resources and you'll probably never get Exactly Once behavior anyway. Use the Streaming API instead.
>>
>> -John Kalucki
>> http://twitter.com/jkalucki
>> Infrastructure, Twitter Inc.
>>
>> On Fri, Apr 9, 2010 at 12:20 PM, Brian Smith <br...@briansmith.org> wrote:
>> John,
>>
>> Thank you. That was one of the most informative emails on the Twitter API I have seen on the list.
>>
>> Basically, even now, an application should not use an ID of a tweet for since_id if the tweet is less than 10 seconds old, ignoring service abnormalities. Probably a larger threshold (30 seconds or even a minute) would be better, especially when you take into consideration the likelihood of clock skew between the servers that generate the timestamps.
>> I think this is information that would be useful to have added to the API documentation, as I know many applications are taking a much more naive approach to pagination.
>>
>> Thanks again,
>> Brian
>>
>> From: email@example.com On Behalf Of John Kalucki
>> Sent: Friday, April 09, 2010 1:20 PM
>> To: firstname.lastname@example.org
>> Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced
>>
>> Folks are making a lot of incorrect assumptions about the Twitter architecture, especially around how we materialize and present timeline vectors and just what QoS we're really offering. This new scheme does not significantly, or perhaps even observably, make the existing issues around since_id any better or any worse. And I'm being very precise here. The since_id situation is such that the few milliseconds skew possible in Snowflake are practically irrelevant and lost in the noise of a 4 to 6 orders-of-magnitude misconception. (That's a very big misconception.)
>>
>> If you do not know the rough ordering of our event stream as it is applied to the materialized timeline vectors, and also the expected rate of change on the timeline in question, you cannot make good choices about making since_id perfect. But neither should you try to make it perfect, nor should you have to worry about this.
>>
>> If you insist upon worrying about this, here's my slight salting of Mark's advice: In the existing continuously increasing id generation scheme on the Twitter.com API, I'd subtract about 5000 ids from since_id to ensure sufficient overlap in nearly all cases, but even this could be lossy in the face of severe operational issues -- issues of a type that we haven't seen in many many months. The search API has a different K in its rough ordering, so you might need more like 10,000 ids. In the new Snowflake scheme, I'd overlap by about 5000 milliseconds for twitter.com APIs and 10,000 ms for search APIs.
>>
>> Despite all this, things still could go wrong. An engineer here is known for pointing out that even things that almost never ever happen, happen all the time on the Twitter system. Now, just because they are happening, to someone, all the time, doesn't mean that they'll ever ever happen to you or your users in a thousand years -- but someone's getting hit with it, somewhere, a few times a day.
>>
>> The above schemes no longer treat the id as an opaque unique ordered identifier. And woe lies in wait for you as changes are made to these ids. Woe. You also need to deduplicate. Be very careful and understand fully what you summon by breaking this semantic contract.
>>
>> In the end, since_id issues go away on the Streaming API, and other than around various start-up discontinuities, you can ignore this issue. I'll be talking about Rough Ordering, among other things Streaming, at the Chirp conference. Come geek out.
>>
>> -John Kalucki
>> http://twitter.com/jkalucki
>> Infrastructure, Twitter Inc.
>>
>> On Fri, Apr 9, 2010 at 1:58 AM, Dave Sherohman <d...@fishtwits.com> wrote:
>> On Thu, Apr 08, 2010 at 05:03:29PM -0700, Naveen wrote:
>> > However, I wanted to be clear and feel it should be made obvious that with this change, there is a possibility that a tweet may not be delivered to a client if the implementation of how since_id is currently used is not updated to cover the case. I still envision the situation as more likely than you seem to believe, and figure as tweet velocity increases, the likelihood will also increase; but I assume you have better data to support your viewpoint than I, and shall defer.
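[Editor's note: the Snowflake-era overlap John recommends -- rewind since_id by ~5000 ms and deduplicate -- could look roughly like this. It assumes the announced Snowflake layout, with the millisecond timestamp in the bits above the low 22 (worker + sequence); the shift and constants here are illustrative assumptions, not confirmed in this thread.]

```python
TIMESTAMP_SHIFT = 22  # assumed: 10 worker bits + 12 sequence bits below the timestamp
OVERLAP_MS = 5000     # ~5000 ms for twitter.com APIs; ~10,000 ms for search, per the advice above

def overlapped_since_id(last_seen_id):
    """Rewind the id's timestamp portion by OVERLAP_MS to build a since_id
    with enough overlap to cover rough ordering."""
    ms = last_seen_id >> TIMESTAMP_SHIFT
    rewound = max(ms - OVERLAP_MS, 0)
    return rewound << TIMESTAMP_SHIFT

def merge_new(seen_ids, tweets):
    """Overlap guarantees re-delivery, so deduplicate by id before display."""
    fresh = [t for t in tweets if t["id"] not in seen_ids]
    seen_ids.update(t["id"] for t in fresh)
    return fresh
```

As John warns, this stops treating the id as an opaque identifier, so it breaks if the id layout ever changes -- the dedup step is the part worth keeping regardless.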
>> Maybe I'm just missing something here, but it seems trivial to fix on Twitter's side (enough so that I assume it's what they've been planning from the start to do): only return tweets from closed buckets.
>>
>> We are guaranteed that the buckets will be properly ordered. The order will only be randomized within a bucket. Therefore, by only returning tweets from buckets which are no longer receiving new tweets, since_id works and will never miss a tweet.
>>
>> And, yes, this does mean a slight delay in getting the tweets out, because they have to wait a few milliseconds for their bucket to close before being exposed to calls which can use since_id, plus maybe a little longer for the contents of that bucket to be distributed to multiple servers. That's still going to take only time comparable to the round-trip time of an HTTP request to fetch the data for display to a user, and be far, far less than the average refresh delay required by those clients which fall under the API rate limit. I submit, therefore, that any such delay caused by waiting for buckets to close will be inconsequential.
>>
>> --
>> Dave Sherohman
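[Editor's note: Dave's closed-bucket idea could be sketched server-side roughly as follows. The bucket width and close delay are illustrative assumptions; nothing in the thread confirms Twitter's actual values.]

```python
BUCKET_MS = 1000     # assumed id-bucket granularity, in milliseconds
CLOSE_DELAY_MS = 50  # assumed distribution slack before a bucket counts as closed

def visible_tweets(tweets, now_ms):
    """Serve only tweets from buckets that can no longer receive new ids,
    so a since_id cursor over the results never misses a tweet.
    Each tweet dict carries its bucket index: created_ms // BUCKET_MS."""
    # Bucket b spans [b * BUCKET_MS, (b + 1) * BUCKET_MS); it is closed
    # once its upper edge is at least CLOSE_DELAY_MS in the past.
    newest_closed = (now_ms - CLOSE_DELAY_MS) // BUCKET_MS - 1
    return [t for t in tweets if t["bucket"] <= newest_closed]
```

The small CLOSE_DELAY_MS is the "slight delay" Dave argues is inconsequential next to HTTP round-trip times and polling intervals.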