A sequence can be on a continuum from unsorted to partially sorted
to roughly sorted to totally sorted. Totally sorted is what we mean
when we say "sorted". Partially sorted could mean anything, I
suppose, but roughly sorted is a stricter definition than partially
sorted. Informally it means that each item is no more than K items
out of position. So, to totally sort the sequence, you need only
consider K items.
This is useful stuff for dealing with infinite sequences of events
-- like, picking a random example, the insertion of new tweets into
a materialized timeline (a cache of the timeline vector). The events
get slightly jumbled as they go through the Twitter system and this
causes confusion for developers who don't understand how we apply
the CAP theorem. It's Brewer's world, we just live in it. And we
haven't done a good job at explaining our QoS as we've made the CAP
trade-offs, or how we've evolved them, etc. etc.
To make things one step more complicated, at Twitter, K is a
function of a number of factors, including the timeline, the user
tweeting, the phase of the moon, and the general state of the
Twitter system. So, we have to think of the distribution of K over
time as well.
Crazy. We should just move this all into a single instance of Oracle
and go home.
http://twitter.com/jkalucki/statuses/10503736367
A sequence α is k-sorted IFF ∀ i, r, 1 ≤ i ≤ r ≤ n, i ≤ r
- k implies aᵢ ≤ aᵣ.
-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.
On Sun, Apr 11, 2010 at 4:35 PM, Josh Bleecher Snyder <[email protected]
> wrote:
Hi John (et al.),
These emails from you are great -- they are exactly the sort of
thoughtful, detailed, specific, technical emails that I would
personally love to see accompany future announcements. I think they
would prevent a fair amount of FUD. Thank you.
I have one stupid question, if you don't mind, though. You refer in
every email to "K". What, precisely, does K refer to? What are its
units? (I think I know what it you mean by it, but I'd be interested
to hear precisely.)
Thanks,
Josh
On Sun, Apr 11, 2010 at 2:23 PM, John Kalucki <[email protected]>
wrote:
> If you are writing a general purpose display app, I think, (but I
am not at
> all certain), that you can ignore this issue. Reasonable polling
frequency
> on modest velocity timelines will sometimes, but very rarely, miss
a tweet.
> Also, over time, we're doing things to make this better for
everyone. Many
> of our projects have the side-effect of reducing K, decreasing the
already
> low since_id failure odds even further. Some tweet pipeline
changes when
> live in the last few weeks that dramatically reduce the K
distribution for
> various user types.
>
> Since I don't know how the Last-Modified time exactly works, I'm
going to
> restate your response slightly:
>
> Assuming synchronized clocks (or solely the Twitter Clock, if
exposed
> properly via Last-Modified), given a poll at time t, the newest
status is at
> least t - n seconds old, and sufficient n, then even a naive
since_id
> algorithm will be effectively Exactly Once. Assuming that Twitter
is running
> normally. For a given poll, when the poll time and last update
time delta
> drops below this n second period, there's a non-zero loss risk.
>
> Just what is n? It is K expressed as time rather than as a
discrete count.
> For some timelines types, with some classes of users, K is as much
as
> perhaps 180 seconds. For others, K is less than 1 second. There's
some
> variability here that we should characterize more carefully
internally and
> then discuss publicly. I suspect there's a lot to be learned from
this
> exercise.
>
> Since_id really runs into trouble when any of the following are
too great:
> the polling frequency, the updating frequency, the roughly-sorted
K value.
> If you are polling often to reduce display latency, use the
Streaming API.
> If the timeline moves too fast to capture it all exactly, you should
> reconsider your requirements or get a Commercial Data License for
the
> Streaming API. Does the user really need to see every Bieber at 3
Biebers
> Per Second? How would they ever know if they missed 10^-5 of them
in a blur?
> If you need them all for analysis, consider calculating the
confidence
> interval given a sample proportion of 1 - 10^6 (6 9s) or so vs. a
total
> enumeration. Indistinguishable. If you need them for some other
purpose, say
> CRM, the Streaming API may be the answer.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure, Twitter Inc.
>
>
> On Fri, Apr 9, 2010 at 2:28 PM, Brian Smith <[email protected]>
wrote:
>>
>> John,
>>
>>
>>
>> I am not polling. I am simply trying to implement a basic “refres
h”
>> feature like every desktop/mobile Twitter app has. Basically, I
just want to
>> let users scroll through their timelines, and be reasonably sure
that I am
>> presenting them with an accurate & complete view of the timeline,
while
>> using as little bandwidth as possible.
>>
>>
>>
>> When I said “10 seconds old”/“30 seconds old”/etc. I was
referring to I
>> was referring to the age at the time the page of tweets was
generated. So,
>> basically, if the tweet’s timestamp – the response’s Last-
Modified time more
>> than 10,000 ms (from what you said below), you are almost
definitely getting
>> At Least Once behavior if Twitter is operating normally, and you
can use
>> that information to get At Least Once behavior that emulates
Exactly Once
>> behavior with little (usually no) overhead. Is that a correct
interpretation
>> of what you were saying?
>>
>>
>>
>> Thanks,
>>
>> Brian
>>
>>
>>
>>
>>
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of
John Kalucki
>> Sent: Friday, April 09, 2010 3:31 PM
>> To: [email protected]
>> Subject: Re: [twitter-dev] Re: Upcoming changes to the way status
IDs are
>> sequenced
>>
>>
>>
>> Your second paragraph doesn't quite make sense. The period
between your
>> next poll and the timestamp of the last status is irrelevant. The
issue is
>> solely the magnitude of K on the roughly sorted stream of events
that are
>> applied to the materialized timeline vector. As K varies, so do
the odds,
>> however infinitesimally small, that you will miss a tweet using
the last
>> status id returned. The period between your polls of the API does
not affect
>> this K.
>>
>> My recommendation is to ignore this issue in nearly every use
case. If you
>> are, however, polling high velocity timelines (including search
queries) and
>> attempting to approximate an Exactly Once QoS, you should,
basically, stop
>> doing that. You are probably wasting resources and you'll
probably never get
>> Exactly Once behavior anyway. Use the Streaming API instead.
>>
>> -John Kalucki
>> http://twitter.com/jkalucki
>> Infrastructure, Twitter Inc.
>>
>> On Fri, Apr 9, 2010 at 12:20 PM, Brian Smith
<[email protected]> wrote:
>>
>> John,
>>
>>
>>
>> Thank you. That was one of the most informative emails on the
Twitter API
>> I have seen on the list.
>>
>>
>>
>> Basically, even now, an application should not use an ID of a
tweet for
>> since_id if the tweet is less than 10 seconds old, ignoring service
>> abnormalities. Probably a larger threshold (30 seconds or even a
minute)
>> would be better, especially when you take into consideration the
likelihood
>> of clock skew between the servers that generate the timestamps.
>>
>>
>>
>> I think this is information that would be useful to have added to
the API
>> documentation, as I know many applications are taking a much more
naive
>> approach to pagination.
>>
>>
>>
>> Thanks again,
>>
>> Brian
>>
>>
>>
>> From: [email protected] On Behalf Of John
Kalucki
>> Sent: Friday, April 09, 2010 1:20 PM
>>
>> To: [email protected]
>> Subject: Re: [twitter-dev] Re: Upcoming changes to the way status
IDs are
>> sequenced
>>
>>
>>
>> Folks are making a lot of incorrect assumptions about the Twitter
>> architecture, especially around how we materialize and present
timeline
>> vectors and just what QoS we're really offering. This new scheme
does not
>> significantly, or perhaps even observably, make the existing
issues around
>> since_id any better or any worse. And I'm being very precise
here. The
>> since_id situation is such that the few milliseconds skew
possible in
>> Snowflake are practically irrelevant and lost in the noise of a 4
to 6
>> orders-of-magnitude misconception. (That's a very big
misconception.)
>>
>> If you do not know the rough ordering of our event stream as it
applied to
>> the materialized timeline vectors and also the expected rate of
change on
>> the timeline in question, you cannot make good choices about
making since_id
>> perfect. But, neither you should you try to make it perfect, nor
should you
>> have to worry about this.
>>
>> If you insist upon worrying about this, here's my slight salting
of Mark's
>> advice: In the existing continuously increasing id generation
scheme on the
>> Twitter.com API, I'd subtract about 5000 ids from since_id to
ensure
>> sufficient overlap in nearly all cases, but even this could be
lossy in the
>> face of severe operational issues -- issues of a type that we
haven't seen
>> in many many months. The search API has a different K in its
rough ordering,
>> so you might need more like 10,000 ids. In the new Snowflake
scheme, I'd
>> overlap by about 5000 milliseconds for twitter.com APIs and
10,000 ms for
>> search APIs.
>>
>> Despite all this, things still could go wrong. An engineer here
is known
>> for pointing out that even things that almost never ever happen,
happen all
>> the time on the Twitter system. Now, just because they are
happening, to
>> someone, all the time, doesn't mean that they'll ever ever happen
to you or
>> your users in a thousand years -- but some's getting hit with it,
somewhere,
>> a few times a day.
>>
>> The above schemes no longer treat the id as an opaque unique
ordered
>> identifier. And woe lies in wait for you as changes are made to
these ids.
>> Woe. You also need to deduplicate. Be very careful and understand
fully what
>> you summon by breaking this semantic contract.
>>
>> In the end, since_id issues go away on the Streaming API, and
other than
>> around various start-up discontinuities, you can ignore this
issue. I'll be
>> talking about Rough Ordering, among other things Streaming, at
the Chirp
>> conference. Come geek out.
>>
>> -John Kalucki
>> http://twitter.com/jkalucki
>> Infrastructure, Twitter Inc.
>>
>> On Fri, Apr 9, 2010 at 1:58 AM, Dave Sherohman
<[email protected]> wrote:
>>
>> On Thu, Apr 08, 2010 at 05:03:29PM -0700, Naveen wrote:
>> > However, I wanted to be clear and feel it should be made
obvious that
>> > with this change, there is a possibility that a tweet may not be
>> > delivered to client if the implementation of how since_id is
currently
>> > used is not updated to cover the case. I still envision the
situation
>> > as more likely than you seem to believe and figure as tweet
velocity
>> > increases, the likelihood will also increase; But I am assuming
have
>> > better data to support your viewpoint than I and shall defer.
>>
>> Maybe I'm just missing something here, but it seems trivial to
fix on
>> Twitter's side (enough so that I assume it's what they've been
planning
>> from the start to do): Only return tweets from closed buckets.
>>
>> We are guaranteed that the buckets will be properly ordered. The
order
>> will only be randomized within a bucket. Therefore, by only
returning
>> tweets from buckets which are no longer receiving new tweets,
since_id
>> works and will never miss a tweet.
>>
>> And, yes, this does mean a slight delay in getting the tweets out
>> because they have to wait a few milliseconds for their bucket to
close
>> before being exposed to calls which can use since_id, plus maybe a
>> little longer for the contents of that bucket to be distributed to
>> multiple servers. That's still going to only take time
comparable to
>> round-trip times for an HTTP request to fetch the data for
display to a
>> user and be far, far less than the average refresh delay required
by
>> those clients which fall under the API rate limit. I submit,
therefore,
>> that any such delay caused by waiting for buckets to close will be
>> inconsequential.
>>
>> --
>> Dave Sherohman
>>
>>
>>
>>
>
--
To unsubscribe, reply using "remove me" as the subject.