A sequence can be on a continuum from unsorted to partially sorted to roughly sorted to totally sorted. Totally sorted is what we mean when we say "sorted". Partially sorted could mean anything, I suppose, but roughly sorted is a stricter definition than partially sorted. Informally it means that each item is no more than K items out of position. So, to totally sort the sequence, you need only consider K items.
This is useful stuff for dealing with infinite sequences of events -- like, picking a random example, the insertion of new tweets into a materialized timeline (a cache of the timeline vector). The events get slightly jumbled as they go through the Twitter system and this causes confusion for developers who don't understand how we apply the CAP theorem. It's Brewer's world, we just live in it. And we haven't done a good job at explaining our QoS as we've made the CAP trade-offs, or how we've evolved them, etc. etc. To make things one step more complicated, at Twitter, K is a function of a number of factors, including the timeline, the user tweeting, the phase of the moon, and the general state of the Twitter system. So, we have to think of the distribution of K over time as well. Crazy. We should just move this all into a single instance of Oracle and go home. http://twitter.com/jkalucki/statuses/10503736367 A sequence α is k-sorted IFF ∀ i, r, 1 ≤ i ≤ r ≤ n, i ≤ r - k implies aᵢ ≤ aᵣ. -John Kalucki http://twitter.com/jkalucki Infrastructure, Twitter Inc. On Sun, Apr 11, 2010 at 4:35 PM, Josh Bleecher Snyder <joshar...@gmail.com>wrote: > Hi John (et al.), > > These emails from you are great -- they are exactly the sort of > thoughtful, detailed, specific, technical emails that I would > personally love to see accompany future announcements. I think they > would prevent a fair amount of FUD. Thank you. > > I have one stupid question, if you don't mind, though. You refer in > every email to "K". What, precisely, does K refer to? What are its > units? (I think I know what it you mean by it, but I'd be interested > to hear precisely.) > > Thanks, > Josh > > > > On Sun, Apr 11, 2010 at 2:23 PM, John Kalucki <j...@twitter.com> wrote: > > If you are writing a general purpose display app, I think, (but I am not > at > > all certain), that you can ignore this issue. Reasonable polling > frequency > > on modest velocity timelines will sometimes, but very rarely, miss a > tweet. > > Also, over time, we're doing things to make this better for everyone. > Many > > of our projects have the side-effect of reducing K, decreasing the > already > > low since_id failure odds even further. Some tweet pipeline changes when > > live in the last few weeks that dramatically reduce the K distribution > for > > various user types. > > > > Since I don't know how the Last-Modified time exactly works, I'm going to > > restate your response slightly: > > > > Assuming synchronized clocks (or solely the Twitter Clock, if exposed > > properly via Last-Modified), given a poll at time t, the newest status is > at > > least t - n seconds old, and sufficient n, then even a naive since_id > > algorithm will be effectively Exactly Once. Assuming that Twitter is > running > > normally. For a given poll, when the poll time and last update time delta > > drops below this n second period, there's a non-zero loss risk. > > > > Just what is n? It is K expressed as time rather than as a discrete > count. > > For some timelines types, with some classes of users, K is as much as > > perhaps 180 seconds. For others, K is less than 1 second. There's some > > variability here that we should characterize more carefully internally > and > > then discuss publicly. I suspect there's a lot to be learned from this > > exercise. > > > > Since_id really runs into trouble when any of the following are too > great: > > the polling frequency, the updating frequency, the roughly-sorted K > value. > > If you are polling often to reduce display latency, use the Streaming > API. > > If the timeline moves too fast to capture it all exactly, you should > > reconsider your requirements or get a Commercial Data License for the > > Streaming API. Does the user really need to see every Bieber at 3 Biebers > > Per Second? How would they ever know if they missed 10^-5 of them in a > blur? > > If you need them all for analysis, consider calculating the confidence > > interval given a sample proportion of 1 - 10^6 (6 9s) or so vs. a total > > enumeration. Indistinguishable. If you need them for some other purpose, > say > > CRM, the Streaming API may be the answer. > > > > -John Kalucki > > http://twitter.com/jkalucki > > Infrastructure, Twitter Inc. > > > > > > On Fri, Apr 9, 2010 at 2:28 PM, Brian Smith <br...@briansmith.org> > wrote: > >> > >> John, > >> > >> > >> > >> I am not polling. I am simply trying to implement a basic “refresh” > >> feature like every desktop/mobile Twitter app has. Basically, I just > want to > >> let users scroll through their timelines, and be reasonably sure that I > am > >> presenting them with an accurate & complete view of the timeline, while > >> using as little bandwidth as possible. > >> > >> > >> > >> When I said “10 seconds old”/“30 seconds old”/etc. I was referring to I > >> was referring to the age at the time the page of tweets was generated. > So, > >> basically, if the tweet’s timestamp – the response’s Last-Modified time > more > >> than 10,000 ms (from what you said below), you are almost definitely > getting > >> At Least Once behavior if Twitter is operating normally, and you can use > >> that information to get At Least Once behavior that emulates Exactly > Once > >> behavior with little (usually no) overhead. Is that a correct > interpretation > >> of what you were saying? > >> > >> > >> > >> Thanks, > >> > >> Brian > >> > >> > >> > >> > >> > >> From: email@example.com > >> [mailto:twitter-development-t...@googlegroups.com] On Behalf Of John > Kalucki > >> Sent: Friday, April 09, 2010 3:31 PM > >> To: firstname.lastname@example.org > >> Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs > are > >> sequenced > >> > >> > >> > >> Your second paragraph doesn't quite make sense. The period between your > >> next poll and the timestamp of the last status is irrelevant. The issue > is > >> solely the magnitude of K on the roughly sorted stream of events that > are > >> applied to the materialized timeline vector. As K varies, so do the > odds, > >> however infinitesimally small, that you will miss a tweet using the last > >> status id returned. The period between your polls of the API does not > affect > >> this K. > >> > >> My recommendation is to ignore this issue in nearly every use case. If > you > >> are, however, polling high velocity timelines (including search queries) > and > >> attempting to approximate an Exactly Once QoS, you should, basically, > stop > >> doing that. You are probably wasting resources and you'll probably never > get > >> Exactly Once behavior anyway. Use the Streaming API instead. > >> > >> -John Kalucki > >> http://twitter.com/jkalucki > >> Infrastructure, Twitter Inc. > >> > >> On Fri, Apr 9, 2010 at 12:20 PM, Brian Smith <br...@briansmith.org> > wrote: > >> > >> John, > >> > >> > >> > >> Thank you. That was one of the most informative emails on the Twitter > API > >> I have seen on the list. > >> > >> > >> > >> Basically, even now, an application should not use an ID of a tweet for > >> since_id if the tweet is less than 10 seconds old, ignoring service > >> abnormalities. Probably a larger threshold (30 seconds or even a minute) > >> would be better, especially when you take into consideration the > likelihood > >> of clock skew between the servers that generate the timestamps. > >> > >> > >> > >> I think this is information that would be useful to have added to the > API > >> documentation, as I know many applications are taking a much more naive > >> approach to pagination. > >> > >> > >> > >> Thanks again, > >> > >> Brian > >> > >> > >> > >> From: email@example.com On Behalf Of John > Kalucki > >> Sent: Friday, April 09, 2010 1:20 PM > >> > >> To: firstname.lastname@example.org > >> Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs > are > >> sequenced > >> > >> > >> > >> Folks are making a lot of incorrect assumptions about the Twitter > >> architecture, especially around how we materialize and present timeline > >> vectors and just what QoS we're really offering. This new scheme does > not > >> significantly, or perhaps even observably, make the existing issues > around > >> since_id any better or any worse. And I'm being very precise here. The > >> since_id situation is such that the few milliseconds skew possible in > >> Snowflake are practically irrelevant and lost in the noise of a 4 to 6 > >> orders-of-magnitude misconception. (That's a very big misconception.) > >> > >> If you do not know the rough ordering of our event stream as it applied > to > >> the materialized timeline vectors and also the expected rate of change > on > >> the timeline in question, you cannot make good choices about making > since_id > >> perfect. But, neither you should you try to make it perfect, nor should > you > >> have to worry about this. > >> > >> If you insist upon worrying about this, here's my slight salting of > Mark's > >> advice: In the existing continuously increasing id generation scheme on > the > >> Twitter.com API, I'd subtract about 5000 ids from since_id to ensure > >> sufficient overlap in nearly all cases, but even this could be lossy in > the > >> face of severe operational issues -- issues of a type that we haven't > seen > >> in many many months. The search API has a different K in its rough > ordering, > >> so you might need more like 10,000 ids. In the new Snowflake scheme, I'd > >> overlap by about 5000 milliseconds for twitter.com APIs and 10,000 ms > for > >> search APIs. > >> > >> Despite all this, things still could go wrong. An engineer here is known > >> for pointing out that even things that almost never ever happen, happen > all > >> the time on the Twitter system. Now, just because they are happening, to > >> someone, all the time, doesn't mean that they'll ever ever happen to you > or > >> your users in a thousand years -- but some's getting hit with it, > somewhere, > >> a few times a day. > >> > >> The above schemes no longer treat the id as an opaque unique ordered > >> identifier. And woe lies in wait for you as changes are made to these > ids. > >> Woe. You also need to deduplicate. Be very careful and understand fully > what > >> you summon by breaking this semantic contract. > >> > >> In the end, since_id issues go away on the Streaming API, and other than > >> around various start-up discontinuities, you can ignore this issue. I'll > be > >> talking about Rough Ordering, among other things Streaming, at the Chirp > >> conference. Come geek out. > >> > >> -John Kalucki > >> http://twitter.com/jkalucki > >> Infrastructure, Twitter Inc. > >> > >> On Fri, Apr 9, 2010 at 1:58 AM, Dave Sherohman <d...@fishtwits.com> > wrote: > >> > >> On Thu, Apr 08, 2010 at 05:03:29PM -0700, Naveen wrote: > >> > However, I wanted to be clear and feel it should be made obvious that > >> > with this change, there is a possibility that a tweet may not be > >> > delivered to client if the implementation of how since_id is currently > >> > used is not updated to cover the case. I still envision the situation > >> > as more likely than you seem to believe and figure as tweet velocity > >> > increases, the likelihood will also increase; But I am assuming have > >> > better data to support your viewpoint than I and shall defer. > >> > >> Maybe I'm just missing something here, but it seems trivial to fix on > >> Twitter's side (enough so that I assume it's what they've been planning > >> from the start to do): Only return tweets from closed buckets. > >> > >> We are guaranteed that the buckets will be properly ordered. The order > >> will only be randomized within a bucket. Therefore, by only returning > >> tweets from buckets which are no longer receiving new tweets, since_id > >> works and will never miss a tweet. > >> > >> And, yes, this does mean a slight delay in getting the tweets out > >> because they have to wait a few milliseconds for their bucket to close > >> before being exposed to calls which can use since_id, plus maybe a > >> little longer for the contents of that bucket to be distributed to > >> multiple servers. That's still going to only take time comparable to > >> round-trip times for an HTTP request to fetch the data for display to a > >> user and be far, far less than the average refresh delay required by > >> those clients which fall under the API rate limit. I submit, therefore, > >> that any such delay caused by waiting for buckets to close will be > >> inconsequential. > >> > >> -- > >> Dave Sherohman > >> > >> > >> > >> > > > > > -- > To unsubscribe, reply using "remove me" as the subject. >