Hi John (et al.),

These emails from you are great -- they are exactly the sort of thoughtful, detailed, specific, technical emails that I would personally love to see accompany future announcements. I think they would prevent a fair amount of FUD. Thank you.
I have one stupid question, if you don't mind, though. You refer in every email to "K". What, precisely, does K refer to? What are its units? (I think I know what you mean by it, but I'd be interested to hear precisely.)

Thanks,
Josh

On Sun, Apr 11, 2010 at 2:23 PM, John Kalucki <j...@twitter.com> wrote:
> If you are writing a general purpose display app, I think (but I am not at all certain) that you can ignore this issue. Reasonable polling frequency on modest velocity timelines will sometimes, but very rarely, miss a tweet. Also, over time, we're doing things to make this better for everyone. Many of our projects have the side-effect of reducing K, decreasing the already low since_id failure odds even further. Some tweet pipeline changes went live in the last few weeks that dramatically reduce the K distribution for various user types.
>
> Since I don't know how the Last-Modified time exactly works, I'm going to restate your response slightly:
>
> Assuming synchronized clocks (or solely the Twitter Clock, if exposed properly via Last-Modified), given a poll at time t, the newest status at least n seconds old (created at or before t - n), and sufficient n, then even a naive since_id algorithm will be effectively Exactly Once, assuming that Twitter is running normally. For a given poll, when the delta between the poll time and the last update time drops below this n-second period, there's a non-zero loss risk.
>
> Just what is n? It is K expressed as time rather than as a discrete count. For some timeline types, with some classes of users, K is as much as perhaps 180 seconds. For others, K is less than 1 second. There's some variability here that we should characterize more carefully internally and then discuss publicly. I suspect there's a lot to be learned from this exercise.
>
> Since_id really runs into trouble when any of the following are too great: the polling frequency, the updating frequency, the roughly-sorted K value.
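[Editor's note: a minimal sketch of the poll-with-margin approach John describes -- only advance since_id past statuses at least n seconds old. The `fetch_timeline` client and its field names are hypothetical, not actual Twitter API bindings.]

```python
import time

# Worst-case n cited in the thread: K expressed as time, up to ~180 s
# for some timeline types. Illustrative constant, not an official value.
SAFETY_WINDOW_S = 180

def poll_once(fetch_timeline, since_id):
    """fetch_timeline(since_id=...) is a hypothetical client returning
    dicts with "id" and "created_at" (epoch seconds), newest first."""
    tweets = fetch_timeline(since_id=since_id)
    now = time.time()
    new_since_id = since_id
    for t in tweets:
        # Only advance since_id past tweets old enough that a
        # roughly-sorted straggler can no longer land behind them.
        if now - t["created_at"] >= SAFETY_WINDOW_S:
            new_since_id = max(new_since_id, t["id"])
    return tweets, new_since_id
```

Tweets younger than the window will be re-fetched on the next poll, so the caller must deduplicate by id -- this is the At Least Once behavior that emulates Exactly Once discussed below.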
> If you are polling often to reduce display latency, use the Streaming API. If the timeline moves too fast to capture it all exactly, you should reconsider your requirements or get a Commercial Data License for the Streaming API. Does the user really need to see every Bieber at 3 Biebers Per Second? How would they ever know if they missed 10^-5 of them in a blur? If you need them all for analysis, consider calculating the confidence interval given a sample proportion of 1 - 10^-6 (six 9s) or so vs. a total enumeration. Indistinguishable. If you need them for some other purpose, say CRM, the Streaming API may be the answer.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure, Twitter Inc.
>
> On Fri, Apr 9, 2010 at 2:28 PM, Brian Smith <br...@briansmith.org> wrote:
>> John,
>>
>> I am not polling. I am simply trying to implement a basic “refresh” feature like every desktop/mobile Twitter app has. Basically, I just want to let users scroll through their timelines, and be reasonably sure that I am presenting them with an accurate & complete view of the timeline, while using as little bandwidth as possible.
>>
>> When I said “10 seconds old”/“30 seconds old”/etc., I was referring to the age at the time the page of tweets was generated. So, basically, if the tweet’s timestamp is more than 10,000 ms (from what you said below) behind the response’s Last-Modified time, you are almost definitely getting At Least Once behavior if Twitter is operating normally, and you can use that information to get At Least Once behavior that emulates Exactly Once behavior with little (usually no) overhead. Is that a correct interpretation of what you were saying?
>> Thanks,
>> Brian
>>
>> From: email@example.com [mailto:twitter-development-t...@googlegroups.com] On Behalf Of John Kalucki
>> Sent: Friday, April 09, 2010 3:31 PM
>> To: firstname.lastname@example.org
>> Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced
>>
>> Your second paragraph doesn't quite make sense. The period between your next poll and the timestamp of the last status is irrelevant. The issue is solely the magnitude of K on the roughly sorted stream of events that are applied to the materialized timeline vector. As K varies, so do the odds, however infinitesimally small, that you will miss a tweet using the last status id returned. The period between your polls of the API does not affect this K.
>>
>> My recommendation is to ignore this issue in nearly every use case. If you are, however, polling high velocity timelines (including search queries) and attempting to approximate an Exactly Once QoS, you should, basically, stop doing that. You are probably wasting resources and you'll probably never get Exactly Once behavior anyway. Use the Streaming API instead.
>>
>> -John Kalucki
>> http://twitter.com/jkalucki
>> Infrastructure, Twitter Inc.
>>
>> On Fri, Apr 9, 2010 at 12:20 PM, Brian Smith <br...@briansmith.org> wrote:
>> John,
>>
>> Thank you. That was one of the most informative emails on the Twitter API I have seen on the list.
>>
>> Basically, even now, an application should not use an ID of a tweet for since_id if the tweet is less than 10 seconds old, ignoring service abnormalities. Probably a larger threshold (30 seconds or even a minute) would be better, especially when you take into consideration the likelihood of clock skew between the servers that generate the timestamps.
>> I think this is information that would be useful to have added to the API documentation, as I know many applications are taking a much more naive approach to pagination.
>>
>> Thanks again,
>> Brian
>>
>> From: email@example.com On Behalf Of John Kalucki
>> Sent: Friday, April 09, 2010 1:20 PM
>> To: firstname.lastname@example.org
>> Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced
>>
>> Folks are making a lot of incorrect assumptions about the Twitter architecture, especially around how we materialize and present timeline vectors and just what QoS we're really offering. This new scheme does not significantly, or perhaps even observably, make the existing issues around since_id any better or any worse. And I'm being very precise here. The since_id situation is such that the few milliseconds skew possible in Snowflake are practically irrelevant and lost in the noise of a 4 to 6 orders-of-magnitude misconception. (That's a very big misconception.)
>>
>> If you do not know the rough ordering of our event stream as it is applied to the materialized timeline vectors, and also the expected rate of change on the timeline in question, you cannot make good choices about making since_id perfect. But neither should you try to make it perfect, nor should you have to worry about this.
>>
>> If you insist upon worrying about this, here's my slight salting of Mark's advice: In the existing continuously increasing id generation scheme on the Twitter.com API, I'd subtract about 5000 ids from since_id to ensure sufficient overlap in nearly all cases, but even this could be lossy in the face of severe operational issues -- issues of a type that we haven't seen in many many months. The search API has a different K in its rough ordering, so you might need more like 10,000 ids. In the new Snowflake scheme, I'd overlap by about 5000 milliseconds for twitter.com APIs and 10,000 ms for search APIs.
>>
>> Despite all this, things still could go wrong. An engineer here is known for pointing out that even things that almost never ever happen, happen all the time on the Twitter system. Now, just because they are happening, to someone, all the time, doesn't mean that they'll ever ever happen to you or your users in a thousand years -- but someone's getting hit with it, somewhere, a few times a day.
>>
>> The above schemes no longer treat the id as an opaque unique ordered identifier. And woe lies in wait for you as changes are made to these ids. Woe. You also need to deduplicate. Be very careful and understand fully what you summon by breaking this semantic contract.
>>
>> In the end, since_id issues go away on the Streaming API, and other than around various start-up discontinuities, you can ignore this issue. I'll be talking about Rough Ordering, among other things Streaming, at the Chirp conference. Come geek out.
>>
>> -John Kalucki
>> http://twitter.com/jkalucki
>> Infrastructure, Twitter Inc.
>>
>> On Fri, Apr 9, 2010 at 1:58 AM, Dave Sherohman <d...@fishtwits.com> wrote:
>> On Thu, Apr 08, 2010 at 05:03:29PM -0700, Naveen wrote:
>> > However, I wanted to be clear and feel it should be made obvious that with this change, there is a possibility that a tweet may not be delivered to a client if the implementation of how since_id is currently used is not updated to cover the case. I still envision the situation as more likely than you seem to believe, and figure as tweet velocity increases, the likelihood will also increase; but I assume you have better data to support your viewpoint than I, and shall defer.
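[Editor's note: the Snowflake-era overlap John recommends -- rewind since_id by ~5000 ms and deduplicate -- could look roughly like this. It assumes the announced Snowflake layout, with the millisecond timestamp in the bits above the low 22 (worker + sequence); the shift and constants here are illustrative assumptions, not confirmed in this thread.]

```python
TIMESTAMP_SHIFT = 22  # assumed: 10 worker bits + 12 sequence bits below the timestamp
OVERLAP_MS = 5000     # ~5000 ms for twitter.com APIs; ~10,000 ms for search, per the advice above

def overlapped_since_id(last_seen_id):
    """Rewind the id's timestamp portion by OVERLAP_MS to build a since_id
    with enough overlap to cover rough ordering."""
    ms = last_seen_id >> TIMESTAMP_SHIFT
    rewound = max(ms - OVERLAP_MS, 0)
    return rewound << TIMESTAMP_SHIFT

def merge_new(seen_ids, tweets):
    """Overlap guarantees re-delivery, so deduplicate by id before display."""
    fresh = [t for t in tweets if t["id"] not in seen_ids]
    seen_ids.update(t["id"] for t in fresh)
    return fresh
```

As John warns, this stops treating the id as an opaque identifier, so it breaks if the id layout ever changes -- the dedup step is the part worth keeping regardless.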
>> Maybe I'm just missing something here, but it seems trivial to fix on Twitter's side (enough so that I assume it's what they've been planning from the start to do): only return tweets from closed buckets.
>>
>> We are guaranteed that the buckets will be properly ordered. The order will only be randomized within a bucket. Therefore, by only returning tweets from buckets which are no longer receiving new tweets, since_id works and will never miss a tweet.
>>
>> And, yes, this does mean a slight delay in getting the tweets out, because they have to wait a few milliseconds for their bucket to close before being exposed to calls which can use since_id, plus maybe a little longer for the contents of that bucket to be distributed to multiple servers. That's still going to take only time comparable to the round-trip time of an HTTP request to fetch the data for display to a user, and be far, far less than the average refresh delay required by those clients which fall under the API rate limit. I submit, therefore, that any such delay caused by waiting for buckets to close will be inconsequential.
>>
>> --
>> Dave Sherohman
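[Editor's note: Dave's closed-bucket idea could be sketched server-side roughly as follows. The bucket width and close delay are illustrative assumptions; nothing in the thread confirms Twitter's actual values.]

```python
BUCKET_MS = 1000     # assumed id-bucket granularity, in milliseconds
CLOSE_DELAY_MS = 50  # assumed distribution slack before a bucket counts as closed

def visible_tweets(tweets, now_ms):
    """Serve only tweets from buckets that can no longer receive new ids,
    so a since_id cursor over the results never misses a tweet.
    Each tweet dict carries its bucket index: created_ms // BUCKET_MS."""
    # Bucket b spans [b * BUCKET_MS, (b + 1) * BUCKET_MS); it is closed
    # once its upper edge is at least CLOSE_DELAY_MS in the past.
    newest_closed = (now_ms - CLOSE_DELAY_MS) // BUCKET_MS - 1
    return [t for t in tweets if t["bucket"] <= newest_closed]
```

The small CLOSE_DELAY_MS is the "slight delay" Dave argues is inconsequential next to HTTP round-trip times and polling intervals.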