Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

Chad Etzel Sun, 11 Apr 2010 20:05:53 -0700


I'd like to see more epsilon-delta proofs on this list personally :)


Chad



On Apr 11, 2010, at 17:14, John Kalucki <[email protected]> wrote:

A sequence can be on a continuum from unsorted to partially sorted to roughly sorted to totally sorted. Totally sorted is what we mean when we say "sorted". Partially sorted could mean anything, I suppose, but roughly sorted is a stricter definition than partially sorted. Informally it means that each item is no more than K items out of position. So, to totally sort the sequence, you need only consider K items.
This is useful stuff for dealing with infinite sequences of events -- like, picking a random example, the insertion of new tweets into a materialized timeline (a cache of the timeline vector). The events get slightly jumbled as they go through the Twitter system and this causes confusion for developers who don't understand how we apply the CAP theorem. It's Brewer's world, we just live in it. And we haven't done a good job at explaining our QoS as we've made the CAP trade-offs, or how we've evolved them, etc. etc.
To make things one step more complicated, at Twitter, K is a function of a number of factors, including the timeline, the user tweeting, the phase of the moon, and the general state of the Twitter system. So, we have to think of the distribution of K over time as well.
Crazy. We should just move this all into a single instance of Oracle and go home.
http://twitter.com/jkalucki/statuses/10503736367
A sequence α is k-sorted IFF ∀ i, r, 1 ≤ i ≤ r ≤ n, i ≤ r - k implies aᵢ ≤ aᵣ.
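
To make the definition concrete, here is a minimal Python sketch. It is an illustration only, not Twitter's code: the function names `is_k_sorted` and `sort_k_sorted` are made up for this example, and the sample ids are invented. The first function checks the quoted condition directly; the second shows why "you need only consider K items", by restoring total order with a buffer that never holds more than k + 1 elements.

```python
import heapq
from typing import Iterable, Iterator, List


def is_k_sorted(a: List[int], k: int) -> bool:
    """True if a[i] <= a[r] whenever i <= r - k (the quoted definition)."""
    running_max = None
    for r in range(k, len(a)):
        # running_max tracks max(a[0 .. r-k]); every one of those must be <= a[r].
        x = a[r - k]
        running_max = x if running_max is None else max(running_max, x)
        if running_max > a[r]:
            return False
    return True


def sort_k_sorted(stream: Iterable[int], k: int) -> Iterator[int]:
    """Yield a k-sorted stream in total order using a min-heap of at most k + 1 items."""
    heap: List[int] = []
    for x in stream:
        heapq.heappush(heap, x)
        if len(heap) > k:
            # The smallest outstanding item is guaranteed to be in the buffer.
            yield heapq.heappop(heap)
    while heap:
        yield heapq.heappop(heap)


ids = [2, 1, 3, 5, 4, 6]  # invented example: slightly jumbled arrivals
print(is_k_sorted(ids, 2), list(sort_k_sorted(ids, 2)))
```

The buffer size is the point: a consumer that knows (or can bound) K never has to look further back than that to put events in order, which is also why the distribution of K matters so much.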
-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.
On Sun, Apr 11, 2010 at 4:35 PM, Josh Bleecher Snyder <[email protected]> wrote:
Hi John (et al.),

These emails from you are great -- they are exactly the sort of
thoughtful, detailed, specific, technical emails that I would
personally love to see accompany future announcements. I think they
would prevent a fair amount of FUD. Thank you.

I have one stupid question, if you don't mind, though. You refer in
every email to "K". What, precisely, does K refer to? What are its
units? (I think I know what you mean by it, but I'd be interested
to hear precisely.)

Thanks,
Josh
On Sun, Apr 11, 2010 at 2:23 PM, John Kalucki <[email protected]> wrote:
> If you are writing a general purpose display app, I think (but I am not at
> all certain) that you can ignore this issue. Reasonable polling frequency
> on modest velocity timelines will sometimes, but very rarely, miss a tweet.
> Also, over time, we're doing things to make this better for everyone. Many
> of our projects have the side-effect of reducing K, decreasing the already
> low since_id failure odds even further. Some tweet pipeline changes went
> live in the last few weeks that dramatically reduce the K distribution for
> various user types.
>
> Since I don't know how the Last-Modified time exactly works, I'm going to
> restate your response slightly:
>
> Assuming synchronized clocks (or solely the Twitter Clock, if exposed
> properly via Last-Modified), given a poll at time t, the newest status is at
> least t - n seconds old, and sufficient n, then even a naive since_id
> algorithm will be effectively Exactly Once. Assuming that Twitter is running
> normally. For a given poll, when the poll time and last update time delta
> drops below this n second period, there's a non-zero loss risk.
>
> Just what is n? It is K expressed as time rather than as a discrete count.
> For some timeline types, with some classes of users, K is as much as
> perhaps 180 seconds. For others, K is less than 1 second. There's some
> variability here that we should characterize more carefully internally and
> then discuss publicly. I suspect there's a lot to be learned from this
> exercise.
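
One way a client might act on the "n seconds" rule is sketched below. This is a hedged illustration, not an official algorithm: the `Status` type, the `settled_since_id` name, and the sample values are all hypothetical, and it assumes the client can compare each status timestamp with the response time on the same clock (for example via Last-Modified, if that is exposed as described). It only advances the cursor to statuses that are already at least n seconds old.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Status:
    id: int
    created_at: float  # seconds since the epoch, on the server's clock (assumed)


def settled_since_id(statuses: List[Status], response_time: float,
                     n_seconds: float) -> Optional[int]:
    """Highest id among statuses at least n_seconds old at response time, else None."""
    settled = [s.id for s in statuses
               if response_time - s.created_at >= n_seconds]
    return max(settled) if settled else None


page = [Status(103, 995.0), Status(102, 980.0), Status(101, 900.0)]
print(settled_since_id(page, response_time=1000.0, n_seconds=10.0))
```

Statuses newer than the returned id will come back again on the next request, so the client has to drop duplicates by id; when the function returns None, the previous cursor should simply be kept.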
>
> Since_id really runs into trouble when any of the following are too great:
> the polling frequency, the updating frequency, the roughly-sorted K value.
> If you are polling often to reduce display latency, use the Streaming API.
> If the timeline moves too fast to capture it all exactly, you should
> reconsider your requirements or get a Commercial Data License for the
> Streaming API. Does the user really need to see every Bieber at 3 Biebers
> Per Second? How would they ever know if they missed 10^-5 of them in a blur?
> If you need them all for analysis, consider calculating the confidence
> interval given a sample proportion of 1 - 10^-6 (6 9s) or so vs. a total
> enumeration. Indistinguishable. If you need them for some other purpose, say
> CRM, the Streaming API may be the answer.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure, Twitter Inc.
>
>
> On Fri, Apr 9, 2010 at 2:28 PM, Brian Smith <[email protected]> wrote:
>>
>> John,
>>
>>
>>
>> I am not polling. I am simply trying to implement a basic “refresh”
>> feature like every desktop/mobile Twitter app has. Basically, I just want to
>> let users scroll through their timelines, and be reasonably sure that I am
>> presenting them with an accurate & complete view of the timeline, while
>> using as little bandwidth as possible.
>>
>>
>>
>> When I said “10 seconds old”/“30 seconds old”/etc., I was referring to
>> the age at the time the page of tweets was generated. So, basically, if
>> the tweet’s timestamp and the response’s Last-Modified time differ by more
>> than 10,000 ms (from what you said below), you are almost definitely getting
>> At Least Once behavior if Twitter is operating normally, and you can use
>> that information to get At Least Once behavior that emulates Exactly Once
>> behavior with little (usually no) overhead. Is that a correct interpretation
>> of what you were saying?
>>
>>
>>
>> Thanks,
>>
>> Brian
>>
>>
>>
>>
>>
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of John Kalucki
>> Sent: Friday, April 09, 2010 3:31 PM
>> To: [email protected]
>> Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are
>> sequenced
>>
>>
>>
>> Your second paragraph doesn't quite make sense. The period between your
>> next poll and the timestamp of the last status is irrelevant. The issue is
>> solely the magnitude of K on the roughly sorted stream of events that are
>> applied to the materialized timeline vector. As K varies, so do the odds,
>> however infinitesimally small, that you will miss a tweet using the last
>> status id returned. The period between your polls of the API does not affect
>> this K.
>>
>> My recommendation is to ignore this issue in nearly every use case. If you
>> are, however, polling high velocity timelines (including search queries) and
>> attempting to approximate an Exactly Once QoS, you should, basically, stop
>> doing that. You are probably wasting resources and you'll probably never get
>> Exactly Once behavior anyway. Use the Streaming API instead.
>>
>> -John Kalucki
>> http://twitter.com/jkalucki
>> Infrastructure, Twitter Inc.
>>
>> On Fri, Apr 9, 2010 at 12:20 PM, Brian Smith <[email protected]> wrote:
>>
>> John,
>>
>>
>>
>> Thank you. That was one of the most informative emails on the Twitter API
>> I have seen on the list.
>>
>>
>>
>> Basically, even now, an application should not use an ID of a tweet for
>> since_id if the tweet is less than 10 seconds old, ignoring service
>> abnormalities. Probably a larger threshold (30 seconds or even a minute)
>> would be better, especially when you take into consideration the likelihood
>> of clock skew between the servers that generate the timestamps.
>>
>>
>>
>> I think this is information that would be useful to have added to the API
>> documentation, as I know many applications are taking a much more naive
>> approach to pagination.
>>
>>
>>
>> Thanks again,
>>
>> Brian
>>
>>
>>
>> From: [email protected] On Behalf Of John Kalucki
>> Sent: Friday, April 09, 2010 1:20 PM
>>
>> To: [email protected]
>> Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are
>> sequenced
>>
>>
>>
>> Folks are making a lot of incorrect assumptions about the Twitter
>> architecture, especially around how we materialize and present timeline
>> vectors and just what QoS we're really offering. This new scheme does not
>> significantly, or perhaps even observably, make the existing issues around
>> since_id any better or any worse. And I'm being very precise here. The
>> since_id situation is such that the few milliseconds skew possible in
>> Snowflake are practically irrelevant and lost in the noise of a 4 to 6
>> orders-of-magnitude misconception. (That's a very big misconception.)
>>
>> If you do not know the rough ordering of our event stream as it is applied to
>> the materialized timeline vectors and also the expected rate of change on
>> the timeline in question, you cannot make good choices about making since_id
>> perfect. But neither should you try to make it perfect, nor should you
>> have to worry about this.
>>
>> If you insist upon worrying about this, here's my slight salting of Mark's
>> advice: In the existing continuously increasing id generation scheme on the
>> Twitter.com API, I'd subtract about 5000 ids from since_id to ensure
>> sufficient overlap in nearly all cases, but even this could be lossy in the
>> face of severe operational issues -- issues of a type that we haven't seen
>> in many many months. The search API has a different K in its rough ordering,
>> so you might need more like 10,000 ids. In the new Snowflake scheme, I'd
>> overlap by about 5000 milliseconds for twitter.com APIs and 10,000 ms for
>> search APIs.
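
A rough Python sketch of the overlap-and-deduplicate approach for the Snowflake case follows. It rests on one loudly stated assumption: that the millisecond timestamp sits above the low 22 bits of the id, which is the layout commonly described for Snowflake and is not something stated in this thread. The names `TIMESTAMP_SHIFT`, `overlapped_since_id`, and `drop_duplicates` are made up for the example. Note that this deliberately treats the id as structured rather than opaque, so it would need revisiting if the id format changes.

```python
from typing import Iterable, List, Set

# ASSUMPTION: the low 22 bits hold worker/sequence data and the millisecond
# timestamp sits above them. Adjust if the real layout differs.
TIMESTAMP_SHIFT = 22


def overlapped_since_id(last_seen_id: int, overlap_ms: int = 5000) -> int:
    """Move the cursor back by overlap_ms worth of id space, never below zero."""
    return max(0, last_seen_id - (overlap_ms << TIMESTAMP_SHIFT))


def drop_duplicates(fetched_ids: Iterable[int], seen: Set[int]) -> List[int]:
    """Return ids not seen before, in arrival order, recording them in `seen`."""
    fresh: List[int] = []
    for status_id in fetched_ids:
        if status_id not in seen:
            seen.add(status_id)
            fresh.append(status_id)
    return fresh


seen_ids: Set[int] = {1, 2}
print(drop_duplicates([2, 3, 3, 4], seen_ids))
```

For the search API the same sketch would use `overlap_ms=10000`, per the figures above. In a long-running client the `seen` set should be trimmed to roughly the overlap window so it does not grow without bound.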
>>
>> Despite all this, things still could go wrong. An engineer here is known
>> for pointing out that even things that almost never ever happen, happen all
>> the time on the Twitter system. Now, just because they are happening, to
>> someone, all the time, doesn't mean that they'll ever ever happen to you or
>> your users in a thousand years -- but someone's getting hit with it, somewhere,
>> a few times a day.
>>
>> The above schemes no longer treat the id as an opaque unique ordered
>> identifier. And woe lies in wait for you as changes are made to these ids.
>> Woe. You also need to deduplicate. Be very careful and understand fully what
>> you summon by breaking this semantic contract.
>>
>> In the end, since_id issues go away on the Streaming API, and other than
>> around various start-up discontinuities, you can ignore this issue. I'll be
>> talking about Rough Ordering, among other things Streaming, at the Chirp
>> conference. Come geek out.
>>
>> -John Kalucki
>> http://twitter.com/jkalucki
>> Infrastructure, Twitter Inc.
>>
>> On Fri, Apr 9, 2010 at 1:58 AM, Dave Sherohman <[email protected]> wrote:
>>
>> On Thu, Apr 08, 2010 at 05:03:29PM -0700, Naveen wrote:
>> > However, I wanted to be clear and feel it should be made obvious that
>> > with this change, there is a possibility that a tweet may not be
>> > delivered to client if the implementation of how since_id is currently
>> > used is not updated to cover the case. I still envision the situation
>> > as more likely than you seem to believe and figure as tweet velocity
>> > increases, the likelihood will also increase; But I am assuming you have
>> > better data to support your viewpoint than I and shall defer.
>>
>> Maybe I'm just missing something here, but it seems trivial to fix on
>> Twitter's side (enough so that I assume it's what they've been planning
>> from the start to do):  Only return tweets from closed buckets.
>>
>> We are guaranteed that the buckets will be properly ordered. The order
>> will only be randomized within a bucket. Therefore, by only returning
>> tweets from buckets which are no longer receiving new tweets, since_id
>> works and will never miss a tweet.
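
The closed-bucket idea can be sketched in a few lines of Python. This illustrates the proposal only and makes no claim about how Twitter actually serves timelines: the `Tweet` type, the `visible_tweets` function, and the `close_after` parameter are hypothetical, and a "bucket" is assumed to be the millisecond slot in which an id was issued.

```python
from typing import List, NamedTuple


class Tweet(NamedTuple):
    bucket: int  # hypothetical: millisecond slot in which the id was issued
    id: int


def visible_tweets(tweets: List[Tweet], now_bucket: int,
                   close_after: int = 1) -> List[Tweet]:
    """Expose only tweets whose bucket closed at least close_after buckets ago."""
    closed = [t for t in tweets if t.bucket <= now_bucket - close_after]
    # Buckets are ordered; order inside a bucket is arbitrary, so sort on read.
    return sorted(closed)


arrivals = [Tweet(5, 52), Tweet(4, 41), Tweet(5, 50), Tweet(4, 40)]
print(visible_tweets(arrivals, now_bucket=5))
```

Because nothing is exposed from a bucket that can still receive tweets, any id a client has already seen is greater than or equal to every id that could still appear below it, which is the property since_id relies on. The cost is the added delay of `close_after` buckets plus distribution time.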
>>
>> And, yes, this does mean a slight delay in getting the tweets out
>> because they have to wait a few milliseconds for their bucket to close
>> before being exposed to calls which can use since_id, plus maybe a
>> little longer for the contents of that bucket to be distributed to
>> multiple servers. That's still going to only take time comparable to
>> round-trip times for an HTTP request to fetch the data for display to a
>> user and be far, far less than the average refresh delay required by
>> those clients which fall under the API rate limit. I submit, therefore,
>> that any such delay caused by waiting for buckets to close will be
>> inconsequential.
>>
>> --
>> Dave Sherohman
>>
>>
>>
>>
>


