RE: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

Brian Smith Fri, 09 Apr 2010 14:29:05 -0700

John,

I am not polling. I am simply trying to implement a basic "refresh" feature
like every desktop/mobile Twitter app has. Basically, I just want to let
users scroll through their timelines, and be reasonably sure that I am
presenting them with an accurate & complete view of the timeline, while
using as little bandwidth as possible.

When I said "10 seconds old"/"30 seconds old"/etc., I was referring to the
age at the time the page of tweets was generated. So, basically, if the
response's Last-Modified time minus the tweet's timestamp is more than
10,000 ms (from what you said below), you are almost definitely getting
At Least Once behavior if Twitter is operating normally, and you can use
that information to get At Least Once behavior that emulates Exactly Once
behavior with little (usually no) overhead. Is that a correct interpretation
of what you were saying?
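
A minimal sketch of the age check described above, assuming the
interpretation holds (the thread does not confirm it). The function name,
the (id, created_at) pair layout, and the 10-second default are
illustrative assumptions, not part of the Twitter API. The page generation
time would come from the response's Last-Modified header, which Python's
email.utils.parsedate_to_datetime can parse.

```python
from datetime import datetime, timedelta, timezone


def safe_since_id(tweets, page_generated_at, min_age=timedelta(seconds=10)):
    """Pick the largest id among tweets that were at least `min_age` old
    when the page was generated.

    tweets: iterable of (tweet_id, created_at) pairs (hypothetical layout)
    page_generated_at: timezone-aware datetime, e.g. from Last-Modified
    Returns None when no tweet on the page is old enough yet.
    """
    settled = [
        tweet_id
        for tweet_id, created_at in tweets
        if page_generated_at - created_at >= min_age
    ]
    return max(settled, default=None)
```

Tweets newer than the threshold are still shown to the user; they are just
not trusted as the since_id for the next request, so they may come back
again and need deduplicating.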

Thanks,

Brian

From: [email protected]
[mailto:[email protected]] On Behalf Of John Kalucki
Sent: Friday, April 09, 2010 3:31 PM
To: [email protected]
Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are
sequenced

Your second paragraph doesn't quite make sense. The period between your next
poll and the timestamp of the last status is irrelevant. The issue is solely
the magnitude of K on the roughly sorted stream of events that are applied
to the materialized timeline vector. As K varies, so do the odds, however
infinitesimally small, that you will miss a tweet using the last status id
returned. The period between your polls of the API does not affect this K.

My recommendation is to ignore this issue in nearly every use case. If you
are, however, polling high velocity timelines (including search queries) and
attempting to approximate an Exactly Once QoS, you should, basically, stop
doing that. You are probably wasting resources and you'll probably never get
Exactly Once behavior anyway. Use the Streaming API instead.

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.

On Fri, Apr 9, 2010 at 12:20 PM, Brian Smith <[email protected]> wrote:

John,

Thank you. That was one of the most informative emails on the Twitter API I
have seen on the list.

Basically, even now, an application should not use an ID of a tweet for
since_id if the tweet is less than 10 seconds old, ignoring service
abnormalities. Probably a larger threshold (30 seconds or even a minute)
would be better, especially when you take into consideration the likelihood
of clock skew between the servers that generate the timestamps.

I think this is information that would be useful to have added to the API
documentation, as I know many applications are taking a much more naive
approach to pagination.

Thanks again,

Brian

From: [email protected] On Behalf Of John Kalucki
Sent: Friday, April 09, 2010 1:20 PM
To: [email protected]
Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are
sequenced

Folks are making a lot of incorrect assumptions about the Twitter
architecture, especially around how we materialize and present timeline
vectors and just what QoS we're really offering. This new scheme does not
significantly, or perhaps even observably, make the existing issues around
since_id any better or any worse. And I'm being very precise here. The
since_id situation is such that the few milliseconds skew possible in
Snowflake are practically irrelevant and lost in the noise of a 4 to 6
orders-of-magnitude misconception. (That's a very big misconception.)

If you do not know the rough ordering of our event stream as it is applied
to the materialized timeline vectors and also the expected rate of change on
the timeline in question, you cannot make good choices about making since_id
perfect. But neither should you try to make it perfect, nor should you have
to worry about this.

If you insist upon worrying about this, here's my slight salting of Mark's
advice: In the existing continuously increasing id generation scheme on the
Twitter.com API, I'd subtract about 5000 ids from since_id to ensure
sufficient overlap in nearly all cases, but even this could be lossy in the
face of severe operational issues -- issues of a type that we haven't seen
in many many months. The search API has a different K in its rough ordering,
so you might need more like 10,000 ids. In the new Snowflake scheme, I'd
overlap by about 5000 milliseconds for twitter.com APIs and 10,000 ms for
search APIs.
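
As a rough sketch of the overlap arithmetic above (function names are
hypothetical): for sequential ids the overlap is a plain subtraction. For
Snowflake ids, this assumes the millisecond timestamp sits above the low 22
bits of the id, which matches the layout Twitter later published but is not
stated in this thread. Either way it treats the id as non-opaque, so the
results will overlap and must be deduplicated.

```python
# Assumption: the low 22 bits of a Snowflake id are not part of the
# timestamp, so one millisecond corresponds to 1 << 22 in id space.
SNOWFLAKE_TIMESTAMP_SHIFT = 22


def overlapped_since_id_sequential(last_id, overlap_ids=5000):
    """Back off by a fixed number of ids (about 5000 for twitter.com
    APIs, about 10,000 for search, per the figures in this thread)."""
    return max(last_id - overlap_ids, 0)


def overlapped_since_id_snowflake(last_id, overlap_ms=5000):
    """Back off by a number of milliseconds in the timestamp part of
    the id, leaving the low bits unchanged."""
    return max(last_id - (overlap_ms << SNOWFLAKE_TIMESTAMP_SHIFT), 0)
```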

Despite all this, things still could go wrong. An engineer here is known for
pointing out that even things that almost never ever happen, happen all the
time on the Twitter system. Now, just because they are happening, to
someone, all the time, doesn't mean that they'll ever ever happen to you or
your users in a thousand years -- but someone's getting hit with it, somewhere,
a few times a day.

The above schemes no longer treat the id as an opaque unique ordered
identifier. And woe lies in wait for you as changes are made to these ids.
Woe. You also need to deduplicate. Be very careful and understand fully what
you summon by breaking this semantic contract.
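
Deduplication on the client side can be sketched as below. This is only an
illustration: the function name is hypothetical, and it assumes each tweet
is a dict with an "id" key and that the client keeps a set of ids it has
already shown.

```python
def merge_new_tweets(seen_ids, page):
    """Return the tweets in `page` that have not been seen before.

    seen_ids: set of ids already shown; updated in place
    page: list of dicts, each with an "id" key (assumed layout)
    The order of `page` is preserved in the result.
    """
    fresh = []
    for tweet in page:
        if tweet["id"] not in seen_ids:
            seen_ids.add(tweet["id"])
            fresh.append(tweet)
    return fresh
```

Only membership in the set is tested here; nothing in the sketch relies on
the ordering or internal structure of the ids.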

In the end, since_id issues go away on the Streaming API, and other than
around various start-up discontinuities, you can ignore this issue. I'll be
talking about Rough Ordering, among other things Streaming, at the Chirp
conference. Come geek out. 

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.

On Fri, Apr 9, 2010 at 1:58 AM, Dave Sherohman <[email protected]> wrote:

On Thu, Apr 08, 2010 at 05:03:29PM -0700, Naveen wrote:
> However, I wanted to be clear and feel it should be made obvious that
> with this change, there is a possibility that a tweet may not be
> delivered to client if the implementation of how since_id is currently
> used is not updated to cover the case.  I still envision the situation
> as more likely than you seem to believe and figure as tweet velocity
> increases, the likelihood will also increase; but I am assuming you have
> better data to support your viewpoint than I and shall defer.

Maybe I'm just missing something here, but it seems trivial to fix on
Twitter's side (enough so that I assume it's what they've been planning
from the start to do):  Only return tweets from closed buckets.

We are guaranteed that the buckets will be properly ordered.  The order
will only be randomized within a bucket.  Therefore, by only returning
tweets from buckets which are no longer receiving new tweets, since_id
works and will never miss a tweet.

And, yes, this does mean a slight delay in getting the tweets out
because they have to wait a few milliseconds for their bucket to close
before being exposed to calls which can use since_id, plus maybe a
little longer for the contents of that bucket to be distributed to
multiple servers.  That's still going to only take time comparable to
round-trip times for an HTTP request to fetch the data for display to a
user and be far, far less than the average refresh delay required by
those clients which fall under the API rate limit.  I submit, therefore,
that any such delay caused by waiting for buckets to close will be
inconsequential.
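
The closed-bucket filter proposed above might look something like this. It
is a sketch of the idea only, not how Twitter serves timelines; the bucket
width, the function name, and the (id, timestamp) pair layout are all
assumptions made for illustration.

```python
def closed_bucket_tweets(tweets, now_ms, bucket_ms=10):
    """Keep only tweets whose time bucket can no longer receive tweets.

    tweets: iterable of (tweet_id, timestamp_ms) pairs (assumed layout)
    now_ms: current time in milliseconds on the same clock
    bucket_ms: hypothetical bucket width in milliseconds
    A bucket is treated as closed once the clock has moved past it.
    """
    current_bucket = now_ms // bucket_ms
    return [
        (tweet_id, ts)
        for tweet_id, ts in tweets
        if ts // bucket_ms < current_bucket
    ]
```

Tweets in the still-open bucket are held back until a later request, which
is the small delay described above.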

--
Dave Sherohman

