Folks are making a lot of incorrect assumptions about the Twitter
architecture, especially around how we materialize and present timeline
vectors and just what QoS we're really offering. This new scheme does not
significantly, or perhaps even observably, make the existing issues around
since_id any better or any worse. And I'm being very precise here. The
since_id situation is such that the few milliseconds skew possible in
Snowflake are practically irrelevant and lost in the noise of a 4 to 6
orders-of-magnitude misconception. (That's a very big misconception.)

If you do not know the rough ordering of our event stream as it applies to
the materialized timeline vectors and also the expected rate of change on
the timeline in question, you cannot make good choices about making since_id
perfect. But you should neither try to make it perfect nor have to worry
about this.

If you insist upon worrying about this, here's my slight salting of Mark's
advice: In the existing continuously increasing id generation scheme on the
Twitter.com API, I'd subtract about 5000 ids from since_id to ensure
sufficient overlap in nearly all cases, but even this could be lossy in the
face of severe operational issues -- issues of a type that we haven't seen
in many many months. The search API has a different K in its rough ordering,
so you might need more like 10,000 ids. In the new Snowflake scheme, I'd
overlap by about 5000 milliseconds for twitter.com APIs and 10,000 ms for
search APIs.
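
For anyone who insists, the overlap arithmetic might look roughly like the
sketch below. It is only an illustration, not anything Twitter ships. The
function name and the snowflake flag are made up for this example, and the
22-bit shift is an assumption: it presumes the low 22 bits of a Snowflake id
carry worker and sequence numbers, with a millisecond timestamp in the bits
above them. If the layout differs or changes, the arithmetic breaks.

```python
# Assumption: the low 22 bits of a Snowflake id hold worker and sequence
# numbers, and the bits above them hold a millisecond timestamp.
SNOWFLAKE_TIMESTAMP_SHIFT = 22


def overlapped_since_id(since_id, overlap, snowflake=True):
    """Return since_id moved back by `overlap` to re-fetch a safety window.

    For the sequential scheme, `overlap` is a count of ids (say 5000).
    For Snowflake ids, `overlap` is a count of milliseconds (say 5000).
    Both the name and the parameters are hypothetical.
    """
    if snowflake:
        # Rewind only the timestamp portion of the id.
        rewound = since_id - (overlap << SNOWFLAKE_TIMESTAMP_SHIFT)
    else:
        # Sequential ids: plain subtraction.
        rewound = since_id - overlap
    # Never hand the API a negative id.
    return max(rewound, 0)
```

Using 5000 for twitter.com APIs and 10,000 for search follows the numbers
above. Every request made this way will return tweets you already have.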

Despite all this, things still could go wrong. An engineer here is known for
pointing out that even things that almost never ever happen, happen all the
time on the Twitter system. Now, just because they are happening, to
someone, all the time, doesn't mean that they'll ever ever happen to you or
your users in a thousand years -- but someone's getting hit with it, somewhere,
a few times a day.

The above schemes no longer treat the id as an opaque unique ordered
identifier. And woe lies in wait for you as changes are made to these ids.
Woe. You also need to deduplicate. Be very careful and understand fully what
you summon by breaking this semantic contract.
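
A minimal sketch of the deduplication step, assuming each fetched tweet is a
dict with an "id" key (that shape and the function name are assumptions for
illustration, not the API's contract):

```python
def merge_new_tweets(seen_ids, fetched):
    """Return the tweets in `fetched` whose ids are not yet in `seen_ids`.

    `seen_ids` is a set that is updated in place, so repeated polls with
    overlapping windows only surface each tweet once.
    """
    fresh = []
    for tweet in fetched:
        tweet_id = tweet["id"]
        if tweet_id not in seen_ids:
            seen_ids.add(tweet_id)
            fresh.append(tweet)
    return fresh
```

The set grows without bound in this form. A real client would likely prune
ids older than its overlap window.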

In the end, since_id issues go away on the Streaming API, and other than
around various start-up discontinuities, you can ignore this issue. I'll be
talking about Rough Ordering, among other things Streaming, at the Chirp
conference. Come geek out.

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.


On Fri, Apr 9, 2010 at 1:58 AM, Dave Sherohman <[email protected]> wrote:

> On Thu, Apr 08, 2010 at 05:03:29PM -0700, Naveen wrote:
> > However, I wanted to be clear and feel it should be made obvious that
> > with this change, there is a possibility that a tweet may not be
> > delivered to client if the implementation of how since_id is currently
> > used is not updated to cover the case.  I still envision the situation
> > as more likely than you seem to believe and figure as tweet velocity
> > increases, the likelihood will also increase; but I am assuming you have
> > better data to support your viewpoint than I and shall defer.
>
> Maybe I'm just missing something here, but it seems trivial to fix on
> Twitter's side (enough so that I assume it's what they've been planning
> from the start to do):  Only return tweets from closed buckets.
>
> We are guaranteed that the buckets will be properly ordered.  The order
> will only be randomized within a bucket.  Therefore, by only returning
> tweets from buckets which are no longer receiving new tweets, since_id
> works and will never miss a tweet.
>
> And, yes, this does mean a slight delay in getting the tweets out
> because they have to wait a few milliseconds for their bucket to close
> before being exposed to calls which can use since_id, plus maybe a
> little longer for the contents of that bucket to be distributed to
> multiple servers.  That's still going to only take time comparable to
> round-trip times for an HTTP request to fetch the data for display to a
> user and be far, far less than the average refresh delay required by
> those clients which fall under the API rate limit.  I submit, therefore,
> that any such delay caused by waiting for buckets to close will be
> inconsequential.
>
> --
> Dave Sherohman
>

