If you are writing a general-purpose display app, I think (but I am not at
all certain) that you can ignore this issue. Polling at a reasonable
frequency on modest-velocity timelines will only very rarely miss a tweet.
Also, over time, we're doing things to make this better for everyone. Many
of our projects have the side-effect of reducing K, decreasing the already
low since_id failure odds even further. Some tweet pipeline changes went
live in the last few weeks that dramatically reduce the K distribution for
various user types.

Since I don't know how the Last-Modified time exactly works, I'm going to
restate your response slightly:

Assuming synchronized clocks (or solely the Twitter clock, if exposed
properly via Last-Modified), and assuming Twitter is running normally: if,
for a poll at time t, the newest status is at least n seconds old, for
sufficient n, then even a naive since_id algorithm will be effectively
Exactly Once. For a given poll, when the delta between the poll time and the
last update time drops below this n-second period, there's a non-zero loss
risk.
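
A minimal sketch of that restatement in Python (the 30-second window, the
function name, and the data shapes are illustrative assumptions, not
anything Twitter ships):

```python
from datetime import datetime, timedelta, timezone

# Safety window n: only trust the id of a status at least this old as a
# since_id watermark. The right value depends on K for the timeline.
SAFETY_WINDOW = timedelta(seconds=30)

def pick_since_id(statuses, server_now):
    """Given statuses sorted newest-first (each a (status_id, created_at)
    tuple) and the server clock (e.g. from Last-Modified), return the id of
    the newest status old enough to be a safe since_id watermark."""
    for status_id, created_at in statuses:
        if server_now - created_at >= SAFETY_WINDOW:
            return status_id
    return None  # everything is too fresh; reuse the previous watermark

statuses = [
    (105, datetime(2010, 4, 9, 20, 0, 55, tzinfo=timezone.utc)),
    (104, datetime(2010, 4, 9, 20, 0, 50, tzinfo=timezone.utc)),
    (101, datetime(2010, 4, 9, 20, 0, 10, tzinfo=timezone.utc)),
]
server_now = datetime(2010, 4, 9, 20, 1, 0, tzinfo=timezone.utc)
print(pick_since_id(statuses, server_now))  # 101: 50s old, outside the window
```

The newest two statuses are only 5 and 10 seconds old, so they are skipped
as watermarks; the risk window is exactly the part of the timeline this
sketch refuses to trust.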

Just what is n? It is K expressed as time rather than as a discrete count.
For some timeline types, with some classes of users, K is as much as
perhaps 180 seconds. For others, K is less than 1 second. There's some
variability here that we should characterize more carefully internally and
then discuss publicly. I suspect there's a lot to be learned from this
exercise.

Since_id really runs into trouble when any of the following are too great:
the polling frequency, the updating frequency, the roughly-sorted K value.
If you are polling often to reduce display latency, use the Streaming API.
If the timeline moves too fast to capture it all exactly, you should
reconsider your requirements or get a Commercial Data License for the
Streaming API. Does the user really need to see every Bieber at 3 Biebers
Per Second? How would they ever know if they missed 10^-5 of them in a blur?
If you need them all for analysis, consider calculating the confidence
interval given a sample proportion of 1 - 10^-6 (six 9s) or so vs. a total
enumeration. Indistinguishable. If you need them for some other purpose, say
CRM, the Streaming API may be the answer.
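
To put that six-9s figure in perspective, a back-of-the-envelope calculation
(the per-tweet miss probability and volume here are illustrative, not
measured Twitter numbers):

```python
# If a since_id poll misses each tweet independently with probability
# p = 1e-6, how many misses do you expect, and how likely is a perfect run?
p_miss = 1e-6
n_tweets = 100_000  # e.g. a day of a busy timeline

expected_misses = n_tweets * p_miss            # linearity of expectation
p_perfect = (1 - p_miss) ** n_tweets           # probability of zero misses

print(f"expected misses: {expected_misses:.1f}")   # 0.1
print(f"P(no misses at all): {p_perfect:.3f}")     # ~0.905
```

At those rates you expect roughly one miss per ten days of a 100,000-tweet
timeline, which is the sense in which sampling at 1 - 10^-6 is
indistinguishable from total enumeration for display purposes.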

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.


On Fri, Apr 9, 2010 at 2:28 PM, Brian Smith <[email protected]> wrote:

> John,
>
>
>
> I am not polling. I am simply trying to implement a basic “refresh” feature
> like every desktop/mobile Twitter app has. Basically, I just want to let
> users scroll through their timelines, and be reasonably sure that I am
> presenting them with an accurate & complete view of the timeline, while
> using as little bandwidth as possible.
>
>
>
> When I said “10 seconds old”/“30 seconds old”/etc., I was referring to the
> age at the time the page of tweets was generated. So, basically, if the
> difference between the response’s Last-Modified time and the tweet’s
> timestamp is more than 10,000 ms (from what you said below), you are almost
> definitely getting
> At Least Once behavior if Twitter is operating normally, and you can use
> that information to get At Least Once behavior that emulates Exactly Once
> behavior with little (usually no) overhead. Is that a correct interpretation
> of what you were saying?
>
>
>
> Thanks,
>
> Brian
>
>
>
>
>
> *From:* [email protected] [mailto:
> [email protected]] *On Behalf Of *John Kalucki
> *Sent:* Friday, April 09, 2010 3:31 PM
>
> *To:* [email protected]
> *Subject:* Re: [twitter-dev] Re: Upcoming changes to the way status IDs
> are sequenced
>
>
>
> Your second paragraph doesn't quite make sense. The period between your
> next poll and the timestamp of the last status is irrelevant. The issue is
> solely the magnitude of K on the roughly sorted stream of events that are
> applied to the materialized timeline vector. As K varies, so do the odds,
> however infinitesimally small, that you will miss a tweet using the last
> status id returned. The period between your polls of the API does not affect
> this K.
>
> My recommendation is to ignore this issue in nearly every use case. If you
> are, however, polling high velocity timelines (including search queries) and
> attempting to approximate an Exactly Once QoS, you should, basically, stop
> doing that. You are probably wasting resources and you'll probably never get
> Exactly Once behavior anyway. Use the Streaming API instead.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure, Twitter Inc.
>
> On Fri, Apr 9, 2010 at 12:20 PM, Brian Smith <[email protected]> wrote:
>
> John,
>
>
>
> Thank you. That was one of the most informative emails on the Twitter API I
> have seen on the list.
>
>
>
> Basically, even now, an application should not use an ID of a tweet for
> since_id if the tweet is less than 10 seconds old, ignoring service
> abnormalities. Probably a larger threshold (30 seconds or even a minute)
> would be better, especially when you take into consideration the likelihood
> of clock skew between the servers that generate the timestamps.
>
>
>
> I think this is information that would be useful to have added to the API
> documentation, as I know many applications are taking a much more naive
> approach to pagination.
>
>
>
> Thanks again,
>
> Brian
>
>
>
> *From:* [email protected] *On Behalf Of *John
> Kalucki
> *Sent:* Friday, April 09, 2010 1:20 PM
>
>
> *To:* [email protected]
> *Subject:* Re: [twitter-dev] Re: Upcoming changes to the way status IDs
> are sequenced
>
>
>
> Folks are making a lot of incorrect assumptions about the Twitter
> architecture, especially around how we materialize and present timeline
> vectors and just what QoS we're really offering. This new scheme does not
> significantly, or perhaps even observably, make the existing issues around
> since_id any better or any worse. And I'm being very precise here. The
> since_id situation is such that the few milliseconds skew possible in
> Snowflake are practically irrelevant and lost in the noise of a 4 to 6
> orders-of-magnitude misconception. (That's a very big misconception.)
>
>
>
> If you do not know the rough ordering of our event stream as it is applied to
> the materialized timeline vectors and also the expected rate of change on
> the timeline in question, you cannot make good choices about making since_id
> perfect. But neither should you try to make it perfect, nor should you
> have to worry about this.
>
> If you insist upon worrying about this, here's my slight salting of Mark's
> advice: In the existing continuously increasing id generation scheme on the
> Twitter.com API, I'd subtract about 5000 ids from since_id to ensure
> sufficient overlap in nearly all cases, but even this could be lossy in the
> face of severe operational issues -- issues of a type that we haven't seen
> in many many months. The search API has a different K in its rough ordering,
> so you might need more like 10,000 ids. In the new Snowflake scheme, I'd
> overlap by about 5000 milliseconds for twitter.com APIs and 10,000 ms for
> search APIs.
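
That overlap-and-deduplicate advice can be sketched as follows (the
Snowflake bit layout, the overlap constants, and all names here are
assumptions for illustration, not a published Twitter contract):

```python
# Back off the since_id watermark so consecutive polls overlap the
# rough-sort window, then deduplicate what the overlap re-delivers.
SEQUENTIAL_OVERLAP_IDS = 5000   # twitter.com API, old sequential ids
SNOWFLAKE_OVERLAP_MS = 5000     # twitter.com API, time-based ids

def safe_since_id_sequential(last_id):
    """Subtract ~5000 ids to ensure sufficient overlap."""
    return max(0, last_id - SEQUENTIAL_OVERLAP_IDS)

def safe_since_id_snowflake(last_id):
    """Subtract ~5000 ms. Assumes the millisecond timestamp occupies the
    high bits of the id, above a 22-bit worker/sequence field."""
    return max(0, last_id - (SNOWFLAKE_OVERLAP_MS << 22))

def merge_new(seen_ids, statuses):
    """Overlapping polls repeat tweets, so filter out ids already seen."""
    fresh = [s for s in statuses if s["id"] not in seen_ids]
    seen_ids.update(s["id"] for s in fresh)
    return fresh
```

Note that both back-off helpers stop treating the id as an opaque ordered
identifier, which is exactly the semantic contract the next paragraph warns
about breaking.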
>
> Despite all this, things still could go wrong. An engineer here is known
> for pointing out that even things that almost never ever happen, happen all
> the time on the Twitter system. Now, just because they are happening, to
> someone, all the time, doesn't mean that they'll ever ever happen to you or
> your users in a thousand years -- but someone's getting hit with it, somewhere,
> a few times a day.
>
> The above schemes no longer treat the id as an opaque unique ordered
> identifier. And woe lies in wait for you as changes are made to these ids.
> Woe. You also need to deduplicate. Be very careful and understand fully what
> you summon by breaking this semantic contract.
>
> In the end, since_id issues go away on the Streaming API, and other than
> around various start-up discontinuities, you can ignore this issue. I'll be
> talking about Rough Ordering, among other things Streaming, at the Chirp
> conference. Come geek out.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure, Twitter Inc.
>
> On Fri, Apr 9, 2010 at 1:58 AM, Dave Sherohman <[email protected]> wrote:
>
> On Thu, Apr 08, 2010 at 05:03:29PM -0700, Naveen wrote:
> > However, I wanted to be clear and feel it should be made obvious that
> > with this change, there is a possibility that a tweet may not be
> > delivered to client if the implementation of how since_id is currently
> > used is not updated to cover the case.  I still envision the situation
> > as more likely than you seem to believe and figure as tweet velocity
> > increases, the likelihood will also increase; but I assume you have
> > better data to support your viewpoint than I do, and shall defer.
>
> Maybe I'm just missing something here, but it seems trivial to fix on
> Twitter's side (enough so that I assume it's what they've been planning
> from the start to do):  Only return tweets from closed buckets.
>
> We are guaranteed that the buckets will be properly ordered.  The order
> will only be randomized within a bucket.  Therefore, by only returning
> tweets from buckets which are no longer receiving new tweets, since_id
> works and will never miss a tweet.
>
> And, yes, this does mean a slight delay in getting the tweets out
> because they have to wait a few milliseconds for their bucket to close
> before being exposed to calls which can use since_id, plus maybe a
> little longer for the contents of that bucket to be distributed to
> multiple servers.  That's still going to only take time comparable to
> round-trip times for an HTTP request to fetch the data for display to a
> user and be far, far less than the average refresh delay required by
> those clients which fall under the API rate limit.  I submit, therefore,
> that any such delay caused by waiting for buckets to close will be
> inconsequential.
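
The closed-bucket scheme described above can be sketched as follows (the
bucket width and class names are illustrative, not how Twitter actually
materializes timelines):

```python
import itertools

BUCKET_MS = 1000  # illustrative bucket width

class BucketedTimeline:
    """Tweets land in time buckets; reads only see buckets that have
    closed, so since_id can never skip a tweet shuffled within a bucket."""

    def __init__(self):
        self.buckets = {}  # bucket start index -> list of tweet ids

    def add(self, tweet_id, arrival_ms):
        self.buckets.setdefault(arrival_ms // BUCKET_MS, []).append(tweet_id)

    def read(self, now_ms, since_id=0):
        current = now_ms // BUCKET_MS  # the current bucket is still open
        visible = itertools.chain.from_iterable(
            ids for start, ids in sorted(self.buckets.items())
            if start < current)
        return [t for t in visible if t > since_id]

tl = BucketedTimeline()
tl.add(10, 100)
tl.add(11, 900)
tl.add(20, 1500)
print(tl.read(now_ms=1600))  # [10, 11] -- the 1000-1999ms bucket is open
print(tl.read(now_ms=2100))  # [10, 11, 20] -- now it has closed
```

The serving delay is at most one bucket width plus distribution time, which
is the "slight delay" the paragraph above argues is inconsequential.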
>
> --
> Dave Sherohman

