Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-11 Thread M. Edward (Ed) Borasky
On 04/11/2010 09:33 PM, Nick Arnett wrote:
> On Sun, Apr 11, 2010 at 5:14 PM, John Kalucki  wrote:
> 
>>
>> This is useful stuff for dealing with infinite sequences of events -- like,
>> picking a random example, the insertion of new tweets into a materialized
>> timeline (a cache of the timeline vector).
> 
> 
> The Twitter stream is an infinite sequence of events... now that's serious
> optimism about how long Twitter will exist!
> 
> Sorry, just had to say it.
> 
> Of course, some infinities are bigger than others.
> 
> Nick
> 
> 

Ah yes ... and the "tweet rate" is growing "exponentially" ... except
that such growth is economically implausible. Thanks for reminding me -
another Chirp question for Google Moderator. ;-)

--
M. Edward (Ed) Borasky
http://borasky-research.net/m-edward-ed-borasky/ @znmeb

"I've always regarded nature as the clothing of God." ~Alan Hovhaness


-- 
To unsubscribe, reply using "remove me" as the subject.


Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-11 Thread Nick Arnett
On Sun, Apr 11, 2010 at 5:14 PM, John Kalucki  wrote:

>
> This is useful stuff for dealing with infinite sequences of events -- like,
> picking a random example, the insertion of new tweets into a materialized
> timeline (a cache of the timeline vector).


The Twitter stream is an infinite sequence of events... now that's serious
optimism about how long Twitter will exist!

Sorry, just had to say it.

Of course, some infinities are bigger than others.

Nick




Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-11 Thread Chad Etzel

Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-11 Thread John Kalucki

Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-11 Thread Josh Bleecher Snyder
Hi John (et al.),

These emails from you are great -- they are exactly the sort of
thoughtful, detailed, specific, technical emails that I would
personally love to see accompany future announcements. I think they
would prevent a fair amount of FUD. Thank you.

I have one stupid question, if you don't mind, though. You refer in
every email to "K". What, precisely, does K refer to? What are its
units? (I think I know what you mean by it, but I'd be interested
to hear precisely.)

Thanks,
Josh



On Sun, Apr 11, 2010 at 2:23 PM, John Kalucki  wrote:
> If you are writing a general purpose display app, I think, (but I am not at
> all certain), that you can ignore this issue. Reasonable polling frequency
> on modest velocity timelines will sometimes, but very rarely, miss a tweet.
> Also, over time, we're doing things to make this better for everyone. Many
> of our projects have the side-effect of reducing K, decreasing the already
> low since_id failure odds even further. Some tweet pipeline changes went
> live in the last few weeks that dramatically reduce the K distribution for
> various user types.
>
> Since I don't know how the Last-Modified time exactly works, I'm going to
> restate your response slightly:
>
> Assuming synchronized clocks (or solely the Twitter Clock, if exposed
> properly via Last-Modified) and that Twitter is running normally: given a
> poll at time t, if the newest status is at least n seconds old (that is,
> created at or before t - n) for a sufficient n, then even a naive since_id
> algorithm will be effectively Exactly Once. For a given poll, when the delta
> between the poll time and the last update time drops below this n-second
> period, there's a non-zero loss risk.
>
> Just what is n? It is K expressed as time rather than as a discrete count.
> For some timeline types, with some classes of users, K is as much as
> perhaps 180 seconds. For others, K is less than 1 second. There's some
> variability here that we should characterize more carefully internally and
> then discuss publicly. I suspect there's a lot to be learned from this
> exercise.
>
> Since_id really runs into trouble when any of the following are too great:
> the polling frequency, the updating frequency, the roughly-sorted K value.
> If you are polling often to reduce display latency, use the Streaming API.
> If the timeline moves too fast to capture it all exactly, you should
> reconsider your requirements or get a Commercial Data License for the
> Streaming API. Does the user really need to see every Bieber at 3 Biebers
> Per Second? How would they ever know if they missed 10^-5 of them in a blur?
> If you need them all for analysis, consider calculating the confidence
> interval given a sample proportion of 1 - 10^-6 (six 9s) or so vs. a total
> enumeration. Indistinguishable. If you need them for some other purpose, say
> CRM, the Streaming API may be the answer.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure, Twitter Inc.
>
>
> On Fri, Apr 9, 2010 at 2:28 PM, Brian Smith  wrote:
>>
>> John,
>>
>>
>>
>> I am not polling. I am simply trying to implement a basic “refresh”
>> feature like every desktop/mobile Twitter app has. Basically, I just want to
>> let users scroll through their timelines, and be reasonably sure that I am
>> presenting them with an accurate & complete view of the timeline, while
>> using as little bandwidth as possible.
>>
>>
>>
>> When I said “10 seconds old”/“30 seconds old”/etc., I was referring to the
>> age at the time the page of tweets was generated. So, basically, if the
>> difference between the tweet’s timestamp and the response’s Last-Modified
>> time is more than 10,000 ms (from what you said below), you are almost
>> definitely getting At Least Once behavior if Twitter is operating normally,
>> and you can use that information to get At Least Once behavior that
>> emulates Exactly Once behavior with little (usually no) overhead. Is that a
>> correct interpretation of what you were saying?
>>
>>
>>
>> Thanks,
>>
>> Brian
>>
>>
>>
>>
>>
>> From: twitter-development-talk@googlegroups.com
>> [mailto:twitter-development-t...@googlegroups.com] On Behalf Of John Kalucki
>> Sent: Friday, April 09, 2010 3:31 PM
>> To: twitter-development-talk@googlegroups.com
>> Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are
>> sequenced
>>
>>
>>
>> Your second paragraph doesn't quite make sense. The period between your
>> next poll and the timestamp of the last status is irrelevant. The issue is
>> solely the magnitude of K on the roughly sorted stream of events that ar

Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-11 Thread John Kalucki
If you are writing a general purpose display app, I think, (but I am not at
all certain), that you can ignore this issue. Reasonable polling frequency
on modest velocity timelines will sometimes, but very rarely, miss a tweet.
Also, over time, we're doing things to make this better for everyone. Many
of our projects have the side-effect of reducing K, decreasing the already
low since_id failure odds even further. Some tweet pipeline changes went
live in the last few weeks that dramatically reduce the K distribution for
various user types.

Since I don't know how the Last-Modified time exactly works, I'm going to
restate your response slightly:

Assuming synchronized clocks (or solely the Twitter Clock, if exposed
properly via Last-Modified) and that Twitter is running normally: given a
poll at time t, if the newest status is at least n seconds old (that is,
created at or before t - n) for a sufficient n, then even a naive since_id
algorithm will be effectively Exactly Once. For a given poll, when the delta
between the poll time and the last update time drops below this n-second
period, there's a non-zero loss risk.

Just what is n? It is K expressed as time rather than as a discrete count.
For some timeline types, with some classes of users, K is as much as
perhaps 180 seconds. For others, K is less than 1 second. There's some
variability here that we should characterize more carefully internally and
then discuss publicly. I suspect there's a lot to be learned from this
exercise.

Since_id really runs into trouble when any of the following are too great:
the polling frequency, the updating frequency, the roughly-sorted K value.
If you are polling often to reduce display latency, use the Streaming API.
If the timeline moves too fast to capture it all exactly, you should
reconsider your requirements or get a Commercial Data License for the
Streaming API. Does the user really need to see every Bieber at 3 Biebers
Per Second? How would they ever know if they missed 10^-5 of them in a blur?
If you need them all for analysis, consider calculating the confidence
interval given a sample proportion of 1 - 10^-6 (six 9s) or so vs. a total
enumeration. Indistinguishable. If you need them for some other purpose, say
CRM, the Streaming API may be the answer.
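
The confidence-interval point can be checked with a few lines of Python. This
is only a sketch: the measured proportion `p_hat`, the 30-day window, and the
use of the normal-approximation (Wald) interval are assumptions made for
illustration, not figures from this thread. Only the rate of 3 statuses per
second and the 1 - 10^-6 capture fraction come from the message above.

```python
import math

def wald_half_width(p_hat, n, z=1.96):
    """Half-width of the normal-approximation 95% interval for a proportion."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

# 30 days at 3 statuses per second (the rate is from the message above).
total = 3 * 86_400 * 30
capture_fraction = 1 - 1e-6          # "six 9s" of the stream captured
captured = round(total * capture_fraction)

p_hat = 0.20                         # ASSUMPTION: a made-up measured proportion

full = wald_half_width(p_hat, total)        # interval from a total enumeration
sampled = wald_half_width(p_hat, captured)  # interval from the lossy capture

print("statuses lost:", total - captured)
print("half-width, full enumeration:", full)
print("half-width, lossy capture:   ", sampled)
print("difference:", sampled - full)
```

Under these assumptions a handful of statuses go missing over the month, and
the two interval widths differ far below anything an analysis could detect.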

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.


On Fri, Apr 9, 2010 at 2:28 PM, Brian Smith  wrote:

> John,
>
>
>
> I am not polling. I am simply trying to implement a basic “refresh” feature
> like every desktop/mobile Twitter app has. Basically, I just want to let
> users scroll through their timelines, and be reasonably sure that I am
> presenting them with an accurate & complete view of the timeline, while
> using as little bandwidth as possible.
>
>
>
> When I said “10 seconds old”/“30 seconds old”/etc., I was referring to the
> age at the time the page of tweets was generated. So, basically, if the
> difference between the tweet’s timestamp and the response’s Last-Modified
> time is more than 10,000 ms (from what you said below), you are almost
> definitely getting At Least Once behavior if Twitter is operating normally,
> and you can use that information to get At Least Once behavior that
> emulates Exactly Once behavior with little (usually no) overhead. Is that a
> correct interpretation of what you were saying?
>
>
>
> Thanks,
>
> Brian
>
>
>
>
>
> *From:* twitter-development-talk@googlegroups.com [mailto:
> twitter-development-t...@googlegroups.com] *On Behalf Of *John Kalucki
> *Sent:* Friday, April 09, 2010 3:31 PM
>
> *To:* twitter-development-talk@googlegroups.com
> *Subject:* Re: [twitter-dev] Re: Upcoming changes to the way status IDs
> are sequenced
>
>
>
> Your second paragraph doesn't quite make sense. The period between your
> next poll and the timestamp of the last status is irrelevant. The issue is
> solely the magnitude of K on the roughly sorted stream of events that are
> applied to the materialized timeline vector. As K varies, so do the odds,
> however infinitesimally small, that you will miss a tweet using the last
> status id returned. The period between your polls of the API does not affect
> this K.
>
> My recommendation is to ignore this issue in nearly every use case. If you
> are, however, polling high velocity timelines (including search queries) and
> attempting to approximate an Exactly Once QoS, you should, basically, stop
> doing that. You are probably wasting resources and you'll probably never get
> Exactly Once behavior anyway. Use the Streaming API instead.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure, Twitter Inc.
>
> On Fri, Apr 9, 2010 at 12:20 PM, Brian Smith  wrote:
>
> John,
>
>
>
> Thank you. That was one of the most informative emails on the Twitter API I
> have seen on the lis

RE: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-09 Thread Brian Smith
John,

 

I am not polling. I am simply trying to implement a basic "refresh" feature
like every desktop/mobile Twitter app has. Basically, I just want to let
users scroll through their timelines, and be reasonably sure that I am
presenting them with an accurate & complete view of the timeline, while
using as little bandwidth as possible.

 

When I said "10 seconds old"/"30 seconds old"/etc., I was referring to the
age at the time the page of tweets was generated. So, basically, if the
difference between the tweet's timestamp and the response's Last-Modified
time is more than 10,000 ms (from what you said below), you are almost
definitely getting At Least Once behavior if Twitter is operating normally,
and you can use that information to get At Least Once behavior that
emulates Exactly Once behavior with little (usually no) overhead. Is that a
correct interpretation of what you were saying?

 

Thanks,

Brian

 

 

From: twitter-development-talk@googlegroups.com
[mailto:twitter-development-t...@googlegroups.com] On Behalf Of John Kalucki
Sent: Friday, April 09, 2010 3:31 PM
To: twitter-development-talk@googlegroups.com
Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are
sequenced

 

Your second paragraph doesn't quite make sense. The period between your next
poll and the timestamp of the last status is irrelevant. The issue is solely
the magnitude of K on the roughly sorted stream of events that are applied
to the materialized timeline vector. As K varies, so do the odds, however
infinitesimally small, that you will miss a tweet using the last status id
returned. The period between your polls of the API does not affect this K.

My recommendation is to ignore this issue in nearly every use case. If you
are, however, polling high velocity timelines (including search queries) and
attempting to approximate an Exactly Once QoS, you should, basically, stop
doing that. You are probably wasting resources and you'll probably never get
Exactly Once behavior anyway. Use the Streaming API instead.

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.

On Fri, Apr 9, 2010 at 12:20 PM, Brian Smith  wrote:

John,

 

Thank you. That was one of the most informative emails on the Twitter API I
have seen on the list.

 

Basically, even now, an application should not use an ID of a tweet for
since_id if the tweet is less than 10 seconds old, ignoring service
abnormalities. Probably a larger threshold (30 seconds or even a minute)
would be better, especially when you take into consideration the likelihood
of clock skew between the servers that generate the timestamps.

 

I think this is information that would be useful to have added to the API
documentation, as I know many applications are taking a much more naive
approach to pagination.

 

Thanks again,

Brian

 

From: twitter-development-talk@googlegroups.com On Behalf Of John Kalucki
Sent: Friday, April 09, 2010 1:20 PM


To: twitter-development-talk@googlegroups.com
Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are
sequenced

 

Folks are making a lot of incorrect assumptions about the Twitter
architecture, especially around how we materialize and present timeline
vectors and just what QoS we're really offering. This new scheme does not
significantly, or perhaps even observably, make the existing issues around
since_id any better or any worse. And I'm being very precise here. The
since_id situation is such that the few milliseconds skew possible in
Snowflake are practically irrelevant and lost in the noise of a 4 to 6
orders-of-magnitude misconception. (That's a very big misconception.)



If you do not know the rough ordering of our event stream as it is applied
to the materialized timeline vectors, and also the expected rate of change on
the timeline in question, you cannot make good choices about making since_id
perfect. But neither should you try to make it perfect, nor should you
have to worry about this.

If you insist upon worrying about this, here's my slight salting of Mark's
advice: In the existing continuously increasing id generation scheme on the
Twitter.com API, I'd subtract about 5000 ids from since_id to ensure
sufficient overlap in nearly all cases, but even this could be lossy in the
face of severe operational issues -- issues of a type that we haven't seen
in many many months. The search API has a different K in its rough ordering,
so you might need more like 10,000 ids. In the new Snowflake scheme, I'd
overlap by about 5000 milliseconds for twitter.com APIs and 10,000 ms for
search APIs.

Despite all this, things still could go wrong. An engineer here is known for
pointing out that even things that almost never ever happen, happen all the
time on the Twitter system. Now, just because they are happening, to
someone, all the time, doesn't mean that they'll ever ever happen to you or
yo

Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-09 Thread John Kalucki
Your second paragraph doesn't quite make sense. The period between your next
poll and the timestamp of the last status is irrelevant. The issue is solely
the magnitude of K on the roughly sorted stream of events that are applied
to the materialized timeline vector. As K varies, so do the odds, however
infinitesimally small, that you will miss a tweet using the last status id
returned. The period between your polls of the API does not affect this K.
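
To make the role of K concrete, here is a small Python sketch. It uses a toy
model that is an assumption, not Twitter's implementation: the materialized
timeline is a list in arrival order, a poll returns every id greater than
since_id, and one id is applied one position late. The names `poll` and
`arrival_order` are invented for the example.

```python
def poll(timeline, since_id):
    """Return the ids in the materialized timeline greater than since_id."""
    return [status_id for status_id in timeline if status_id > since_id]

# Ids in the order they are applied to the timeline. Id 3 lands after id 4,
# so the stream is only roughly sorted.
arrival_order = [1, 2, 4, 3, 5]

# First poll: only the first three events have been applied so far.
first_page = poll(arrival_order[:3], since_id=0)
since_id = max(first_page)

# Second poll: every event has been applied, however long we waited.
second_page = poll(arrival_order, since_id)

seen = set(first_page) | set(second_page)
missed = sorted(set(arrival_order) - seen)
print("first page:", first_page)
print("second page:", second_page)
print("missed:", missed)
```

In this model the late id stays below since_id no matter when the second
poll is made, which is why the poll period does not enter into it.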

My recommendation is to ignore this issue in nearly every use case. If you
are, however, polling high velocity timelines (including search queries) and
attempting to approximate an Exactly Once QoS, you should, basically, stop
doing that. You are probably wasting resources and you'll probably never get
Exactly Once behavior anyway. Use the Streaming API instead.

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.

On Fri, Apr 9, 2010 at 12:20 PM, Brian Smith  wrote:

> John,
>
>
>
> Thank you. That was one of the most informative emails on the Twitter API I
> have seen on the list.
>
>
>
> Basically, even now, an application should not use an ID of a tweet for
> since_id if the tweet is less than 10 seconds old, ignoring service
> abnormalities. Probably a larger threshold (30 seconds or even a minute)
> would be better, especially when you take into consideration the likelihood
> of clock skew between the servers that generate the timestamps.
>
>
>
> I think this is information that would be useful to have added to the API
> documentation, as I know many applications are taking a much more naive
> approach to pagination.
>
>
>
> Thanks again,
>
> Brian
>
>
>
> *From:* twitter-development-talk@googlegroups.com *On Behalf Of *John
> Kalucki
> *Sent:* Friday, April 09, 2010 1:20 PM
>
> *To:* twitter-development-talk@googlegroups.com
> *Subject:* Re: [twitter-dev] Re: Upcoming changes to the way status IDs
> are sequenced
>
>
>
> Folks are making a lot of incorrect assumptions about the Twitter
> architecture, especially around how we materialize and present timeline
> vectors and just what QoS we're really offering. This new scheme does not
> significantly, or perhaps even observably, make the existing issues around
> since_id any better or any worse. And I'm being very precise here. The
> since_id situation is such that the few milliseconds skew possible in
> Snowflake are practically irrelevant and lost in the noise of a 4 to 6
> orders-of-magnitude misconception. (That's a very big misconception.)
>
>
> If you do not know the rough ordering of our event stream as it is applied
> to the materialized timeline vectors, and also the expected rate of change on
> the timeline in question, you cannot make good choices about making since_id
> perfect. But neither should you try to make it perfect, nor should you
> have to worry about this.
>
> If you insist upon worrying about this, here's my slight salting of Mark's
> advice: In the existing continuously increasing id generation scheme on the
> Twitter.com API, I'd subtract about 5000 ids from since_id to ensure
> sufficient overlap in nearly all cases, but even this could be lossy in the
> face of severe operational issues -- issues of a type that we haven't seen
> in many many months. The search API has a different K in its rough ordering,
> so you might need more like 10,000 ids. In the new Snowflake scheme, I'd
> overlap by about 5000 milliseconds for twitter.com APIs and 10,000 ms for
> search APIs.
>
> Despite all this, things still could go wrong. An engineer here is known
> for pointing out that even things that almost never ever happen, happen all
> the time on the Twitter system. Now, just because they are happening, to
> someone, all the time, doesn't mean that they'll ever ever happen to you or
> your users in a thousand years -- but someone's getting hit with it, somewhere,
> a few times a day.
>
> The above schemes no longer treat the id as an opaque unique ordered
> identifier. And woe lies in wait for you as changes are made to these ids.
> Woe. You also need to deduplicate. Be very careful and understand fully what
> you summon by breaking this semantic contract.
>
> In the end, since_id issues go away on the Streaming API, and other than
> around various start-up discontinuities, you can ignore this issue. I'll be
> talking about Rough Ordering, among other things Streaming, at the Chirp
> conference. Come geek out.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure, Twitter Inc.
>
> On Fri, Apr 9, 2010 at 1:58 AM, Dave Sherohman  wrote:
>
> On Thu, Apr 08, 2010 at 05:03:29PM -0700, Naveen wrote:
> > However, I wanted 

RE: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-09 Thread Brian Smith
John,

 

Thank you. That was one of the most informative emails on the Twitter API I
have seen on the list.

 

Basically, even now, an application should not use an ID of a tweet for
since_id if the tweet is less than 10 seconds old, ignoring service
abnormalities. Probably a larger threshold (30 seconds or even a minute)
would be better, especially when you take into consideration the likelihood
of clock skew between the servers that generate the timestamps.
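
A minimal sketch of that threshold rule in Python, assuming each status is a
dict with an `id` and a timezone-aware `created_at`, and that the response
time comes from the server (for example a parsed Last-Modified header). The
function name, the dict shape, and the 30-second default are illustrative
assumptions, not part of the Twitter API.

```python
from datetime import datetime, timedelta, timezone

def pick_since_id(statuses, response_time, min_age=timedelta(seconds=30)):
    """Return the largest id among statuses at least min_age old, or None."""
    cutoff = response_time - min_age
    settled = [s["id"] for s in statuses if s["created_at"] <= cutoff]
    return max(settled) if settled else None

response_time = datetime(2010, 4, 9, 19, 0, 0, tzinfo=timezone.utc)
page = [
    {"id": 300, "created_at": response_time - timedelta(seconds=5)},
    {"id": 290, "created_at": response_time - timedelta(seconds=45)},
    {"id": 280, "created_at": response_time - timedelta(seconds=90)},
]

# Id 300 is too fresh to trust, so the next request would start after 290.
print(pick_since_id(page, response_time))
```

The next page will then repeat any statuses newer than the chosen id, so the
client has to drop ids it has already shown.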

 

I think this is information that would be useful to have added to the API
documentation, as I know many applications are taking a much more naive
approach to pagination.

 

Thanks again,

Brian

 

From: twitter-development-talk@googlegroups.com On Behalf Of John Kalucki
Sent: Friday, April 09, 2010 1:20 PM
To: twitter-development-talk@googlegroups.com
Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are
sequenced

 

Folks are making a lot of incorrect assumptions about the Twitter
architecture, especially around how we materialize and present timeline
vectors and just what QoS we're really offering. This new scheme does not
significantly, or perhaps even observably, make the existing issues around
since_id any better or any worse. And I'm being very precise here. The
since_id situation is such that the few milliseconds skew possible in
Snowflake are practically irrelevant and lost in the noise of a 4 to 6
orders-of-magnitude misconception. (That's a very big misconception.)

If you do not know the rough ordering of our event stream as it is applied
to the materialized timeline vectors, and also the expected rate of change on
the timeline in question, you cannot make good choices about making since_id
perfect. But neither should you try to make it perfect, nor should you
have to worry about this.

If you insist upon worrying about this, here's my slight salting of Mark's
advice: In the existing continuously increasing id generation scheme on the
Twitter.com API, I'd subtract about 5000 ids from since_id to ensure
sufficient overlap in nearly all cases, but even this could be lossy in the
face of severe operational issues -- issues of a type that we haven't seen
in many many months. The search API has a different K in its rough ordering,
so you might need more like 10,000 ids. In the new Snowflake scheme, I'd
overlap by about 5000 milliseconds for twitter.com APIs and 10,000 ms for
search APIs.
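
The overlap arithmetic might be sketched in Python as follows. The 5000 and
10,000 figures are the ones given above. The Snowflake bit layout is not
described anywhere in this thread, so the 22-bit shift below is an assumption
(a millisecond timestamp stored above 22 low-order bits), and both function
names are invented for the example.

```python
TWITTER_OVERLAP_IDS = 5000      # twitter.com API figure from the message above
SEARCH_OVERLAP_IDS = 10_000     # search API figure from the message above
TWITTER_OVERLAP_MS = 5000       # Snowflake-era figure from the message above
TIMESTAMP_SHIFT_BITS = 22       # ASSUMPTION: ms timestamp sits above 22 low bits

def overlapped_since_id_sequential(last_id, overlap_ids=TWITTER_OVERLAP_IDS):
    """Back off a fixed number of ids in the continuously increasing scheme."""
    return max(last_id - overlap_ids, 0)

def overlapped_since_id_snowflake(last_id, overlap_ms=TWITTER_OVERLAP_MS,
                                  shift=TIMESTAMP_SHIFT_BITS):
    """Back off a fixed number of milliseconds in a time-prefixed id."""
    return max(last_id - (overlap_ms << shift), 0)

print(overlapped_since_id_sequential(12_000_000_000))
print(overlapped_since_id_sequential(12_000_000_000, SEARCH_OVERLAP_IDS))

time_prefixed_id = (1_000_000 << TIMESTAMP_SHIFT_BITS) + 7
print(overlapped_since_id_snowflake(time_prefixed_id) >> TIMESTAMP_SHIFT_BITS)
```

Both helpers do arithmetic on the id itself, so they depend on the current id
layout staying as it is.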

Despite all this, things still could go wrong. An engineer here is known for
pointing out that even things that almost never ever happen, happen all the
time on the Twitter system. Now, just because they are happening, to
someone, all the time, doesn't mean that they'll ever ever happen to you or
your users in a thousand years -- but someone's getting hit with it, somewhere,
a few times a day.

The above schemes no longer treat the id as an opaque unique ordered
identifier. And woe lies in wait for you as changes are made to these ids.
Woe. You also need to deduplicate. Be very careful and understand fully what
you summon by breaking this semantic contract.

In the end, since_id issues go away on the Streaming API, and other than
around various start-up discontinuities, you can ignore this issue. I'll be
talking about Rough Ordering, among other things Streaming, at the Chirp
conference. Come geek out. 

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.



On Fri, Apr 9, 2010 at 1:58 AM, Dave Sherohman  wrote:

On Thu, Apr 08, 2010 at 05:03:29PM -0700, Naveen wrote:
> However, I wanted to be clear and feel it should be made obvious that
> with this change, there is a possibility that a tweet may not be
> delivered to client if the implementation of how since_id is currently
> used is not updated to cover the case.  I still envision the situation
> as more likely than you seem to believe and figure as tweet velocity
> increases, the likelihood will also increase; but I am assuming you have
> better data to support your viewpoint than I and shall defer.

Maybe I'm just missing something here, but it seems trivial to fix on
Twitter's side (enough so that I assume it's what they've been planning
from the start to do):  Only return tweets from closed buckets.

We are guaranteed that the buckets will be properly ordered.  The order
will only be randomized within a bucket.  Therefore, by only returning
tweets from buckets which are no longer receiving new tweets, since_id
works and will never miss a tweet.

And, yes, this does mean a slight delay in getting the tweets out
because they have to wait a few milliseconds for their bucket to close
before being exposed to calls which can use since_id, plus maybe a
little longer for the contents of that bucket to be distributed to
multiple servers.  That's still going to only take time comparable to
round-trip times for an HTTP request to fetch the data for display to a
user a

Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-09 Thread M. Edward (Ed) Borasky
On 04/09/2010 11:20 AM, John Kalucki wrote:

[snip]


> 
> In the end, since_id issues go away on the Streaming API, and other than
> around various start-up discontinuities, you can ignore this issue. I'll be
> talking about Rough Ordering, among other things Streaming, at the Chirp
> conference. Come geek out.

Thanks, John - that's the plan. ;-)

-- 
M. Edward (Ed) Borasky
borasky-research.net/m-edward-ed-borasky

"A mathematician is a device for turning coffee into theorems." ~ Paul Erdős


Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-09 Thread John Kalucki
Folks are making a lot of incorrect assumptions about the Twitter
architecture, especially around how we materialize and present timeline
vectors and just what QoS we're really offering. This new scheme does not
significantly, or perhaps even observably, make the existing issues around
since_id any better or any worse. And I'm being very precise here. The
since_id situation is such that the few milliseconds skew possible in
Snowflake are practically irrelevant and lost in the noise of a 4 to 6
orders-of-magnitude misconception. (That's a very big misconception.)

If you do not know the rough ordering of our event stream as it applied to
the materialized timeline vectors and also the expected rate of change on
the timeline in question, you cannot make good choices about making since_id
perfect. But neither should you try to make it perfect, nor should you
have to worry about this.

If you insist upon worrying about this, here's my slight salting of Mark's
advice: In the existing continuously increasing id generation scheme on the
Twitter.com API, I'd subtract about 5000 ids from since_id to ensure
sufficient overlap in nearly all cases, but even this could be lossy in the
face of severe operational issues -- issues of a type that we haven't seen
in many many months. The search API has a different K in its rough ordering,
so you might need more like 10,000 ids. In the new Snowflake scheme, I'd
overlap by about 5000 milliseconds for twitter.com APIs and 10,000 ms for
search APIs.

Despite all this, things still could go wrong. An engineer here is known for
pointing out that even things that almost never ever happen, happen all the
time on the Twitter system. Now, just because they are happening, to
someone, all the time, doesn't mean that they'll ever ever happen to you or
your users in a thousand years -- but someone's getting hit with it, somewhere,
a few times a day.

The above schemes no longer treat the id as an opaque unique ordered
identifier. And woe lies in wait for you as changes are made to these ids.
Woe. You also need to deduplicate. Be very careful and understand fully what
you summon by breaking this semantic contract.
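
A minimal sketch of the overlap-and-deduplicate idea described above, not
Twitter's own code. The overlap sizes (5000 ids, 5000 ms) are the ones
suggested for the twitter.com API in this message. The 22-bit timestamp
shift is an assumption about the Snowflake layout that this thread does not
state, and the function names and the dict shape of a tweet are
hypothetical. Relying on any of this breaks the opaque-id contract, with all
the woe that implies.

```python
# Sketch only: constants are taken from the message above; the bit layout
# is an assumption, not something confirmed in this thread.
SEQUENTIAL_OVERLAP_IDS = 5_000   # "subtract about 5000 ids from since_id"
SNOWFLAKE_OVERLAP_MS = 5_000     # "overlap by about 5000 milliseconds"
ASSUMED_TIMESTAMP_SHIFT = 22     # assumption: timestamp sits above the low 22 bits


def overlapped_since_id(last_id, snowflake=False):
    """Move since_id back so the next request re-covers recent tweets."""
    if snowflake:
        # Shift the millisecond overlap into the assumed timestamp bits.
        back = SNOWFLAKE_OVERLAP_MS << ASSUMED_TIMESTAMP_SHIFT
    else:
        back = SEQUENTIAL_OVERLAP_IDS
    return max(0, last_id - back)


def merge_new(seen_ids, fetched):
    """Drop tweets already seen; record and return the rest in fetched order."""
    fresh = []
    for tweet in fetched:
        if tweet["id"] not in seen_ids:
            seen_ids.add(tweet["id"])
            fresh.append(tweet)
    return fresh
```

A client would call overlapped_since_id() on the largest id it holds, make
the request, and pass the response through merge_new() so that the tweets
re-fetched because of the overlap are not shown twice.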

In the end, since_id issues go away on the Streaming API, and other than
around various start-up discontinuities, you can ignore this issue. I'll be
talking about Rough Ordering, among other things Streaming, at the Chirp
conference. Come geek out.

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.


On Fri, Apr 9, 2010 at 1:58 AM, Dave Sherohman  wrote:

> On Thu, Apr 08, 2010 at 05:03:29PM -0700, Naveen wrote:
> > However, I wanted to be clear and feel it should be made obvious that
> > with this change, there is a possibility that a tweet may not be
> > delivered to client if the implementation of how since_id is currently
> > used is not updated to cover the case.  I still envision the situation
> > as more likely than you seem to believe and figure as tweet velocity
> > increases, the likelihood will also increase; but I am assuming you have
> > better data to support your viewpoint than I and shall defer.
>
> Maybe I'm just missing something here, but it seems trivial to fix on
> Twitter's side (enough so that I assume it's what they've been planning
> from the start to do):  Only return tweets from closed buckets.
>
> We are guaranteed that the buckets will be properly ordered.  The order
> will only be randomized within a bucket.  Therefore, by only returning
> tweets from buckets which are no longer receiving new tweets, since_id
> works and will never miss a tweet.
>
> And, yes, this does mean a slight delay in getting the tweets out
> because they have to wait a few milliseconds for their bucket to close
> before being exposed to calls which can use since_id, plus maybe a
> little longer for the contents of that bucket to be distributed to
> multiple servers.  That's still going to only take time comparable to
> round-trip times for an HTTP request to fetch the data for display to a
> user and be far, far less than the average refresh delay required by
> those clients which fall under the API rate limit.  I submit, therefore,
> that any such delay caused by waiting for buckets to close will be
> inconsequential.
>
> --
> Dave Sherohman
>



Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-09 Thread Dave Sherohman
On Thu, Apr 08, 2010 at 05:03:29PM -0700, Naveen wrote:
> However, I wanted to be clear and feel it should be made obvious that
> with this change, there is a possibility that a tweet may not be
> delivered to client if the implementation of how since_id is currently
> used is not updated to cover the case.  I still envision the situation
> as more likely than you seem to believe and figure as tweet velocity
> increases, the likelihood will also increase; but I am assuming you have
> better data to support your viewpoint than I and shall defer.

Maybe I'm just missing something here, but it seems trivial to fix on
Twitter's side (enough so that I assume it's what they've been planning
from the start to do):  Only return tweets from closed buckets.

We are guaranteed that the buckets will be properly ordered.  The order
will only be randomized within a bucket.  Therefore, by only returning
tweets from buckets which are no longer receiving new tweets, since_id
works and will never miss a tweet.

And, yes, this does mean a slight delay in getting the tweets out
because they have to wait a few milliseconds for their bucket to close
before being exposed to calls which can use since_id, plus maybe a
little longer for the contents of that bucket to be distributed to
multiple servers.  That's still going to only take time comparable to
round-trip times for an HTTP request to fetch the data for display to a
user and be far, far less than the average refresh delay required by
those clients which fall under the API rate limit.  I submit, therefore,
that any such delay caused by waiting for buckets to close will be
inconsequential.
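
To make the closed-bucket idea concrete, here is a rough sketch of the
server-side filter it would imply. Everything in it is hypothetical: the
bucket width, the settle time, the function name, and the 'id' and
'created_ms' fields are illustrative assumptions, not details of how Twitter
actually serves timelines.

```python
def closed_bucket_tweets(tweets, now_ms, bucket_ms=10, settle_ms=0):
    """Return only tweets whose time bucket can no longer receive new tweets.

    tweets: iterable of dicts with hypothetical 'id' and 'created_ms' keys.
    bucket_ms: assumed bucket width; settle_ms: extra wait for replication.
    """
    # Start of the bucket that is still open at (now - settle time).
    open_bucket_start = ((now_ms - settle_ms) // bucket_ms) * bucket_ms
    # Anything created before that boundary belongs to a closed bucket.
    visible = [t for t in tweets if t["created_ms"] < open_bucket_start]
    return sorted(visible, key=lambda t: t["id"])
```

With a filter like this, a tweet in the still-open bucket is simply held back
until a later request, so a client's since_id never moves past an id that
could still be assigned.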

-- 
Dave Sherohman


RE: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-08 Thread Brian Smith
Mark, thank you for taking the time to respond. 

 

What is the smallest “comfort threshold” that will guarantee that we will see 
all the tweets, with none skipped over and the fewest tweets returned multiple 
times?

 

Let’s say the comfort threshold was 2 seconds. It seems to me like there could 
realistically be dozens or hundreds of tweets within those two seconds in a 
single timeline, and a request that used the logic you mentioned would return 
an entire page (200 tweets) consisting of tweets that the application already 
has; the application would be making a relatively large download, receiving 
nothing useful for it, and not be able to make any progress because its 
since_id would get “stuck”. This is at odds with many (most?) applications’ goal 
in using since_id, which is to transfer as little data as possible.

 

It seems like a better alternative would be a new parameter that says “don’t give 
me any tweets that are less than N seconds old,” where N seconds is the 
comfort threshold. That way, the application may lag behind by a few 
seconds, but at least it would be able to confidently page through the timeline 
without excessive data transfer. Without such a mechanism, it looks like this 
change will be a significant degradation of service that results in 
applications’ “refresh” features becoming either unreliable or very wasteful.

 

But, is it realistic for applications to expect the Twitter cluster to be in 
sync within 2 seconds? 10 seconds? 30 seconds? That is the part that is unclear 
to me. 

 

Thanks again,

Brian

 

 

From: twitter-development-talk@googlegroups.com 
[mailto:twitter-development-t...@googlegroups.com] On Behalf Of Mark McBride
Sent: Thursday, April 08, 2010 6:38 PM
To: twitter-development-talk@googlegroups.com
Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are 
sequenced

 

It's a possibility, but by no means a probability.  Note that you can mitigate 
this by using the newest tweet that is outside your "danger zone".  For example 
in a sequence of tweets t1, t2 ... ti ... tn with creation times c1, c2 ... ci 
... cn and a comfort threshold e you could use since_id from the latest ti such 
that c1 - ci > e.


  ---Mark

http://twitter.com/mccv



On Thu, Apr 8, 2010 at 4:27 PM, Naveen  wrote:

This was my initial concern with the randomly generated ids that I
brought up, though I think Brian described it better than I.

It simply seems very likely that when using since_id to populate newer
tweets for the user, that some tweets will never be seen, because the
since_id of the last message received will be larger than one
generated 1ms later.

With the random generation of ids, I can see two ways to guarantee
delivery of all tweets in a user's timeline:
1. Page forwards and backwards to ensure that no tweet generated at or
near the same time as the newest one received a lower id. This
will be very expensive for a mobile client, not to mention that it
complicates any refresh algorithm significantly.
2. Given that we know how IDs are generated (i.e. which bits represent
the time), we can simply over-request by decrementing the since_id time
bits by a second or two and filtering out duplicates. (Again, not really
ideal for mobile clients where battery life is an issue, plus it then
makes the implementation very dependent on Twitter's id format
remaining stable.)

Please anyone explain if Brian and I are misinterpreting this as a
very real possibility of never displaying some tweets in a time line,
without changing how we request data from twitter (i.e. since_id
doesn't break)

--Naveen Ayyagari
@knight9
@SocialScope



On Apr 8, 7:01 pm, "Brian Smith"  wrote:
> What does “within the caveats given above” mean? Either since_id will work or 
> it won’t. It seems to me that if IDs are only in a “rough” order, since_id 
> won’t work—in particular, there is a possibility that paging through tweets 
> using since_id will completely skip over some tweets.
>
> My concern is that, since tweets will not be serialized at the time they are 
> written, there will be a race condition between me making a request and users 
> posting new statuses. That is, I could get a response with the largest id in 
> the response being X that gets evaluated just before a tweet (X-1) has been 
> saved in the database; If so, when I issue a request with since_id=X, my 
> program will never see the newer tweet (X-1).
>
> Are you going to change the implementation of the timeline methods so that 
> they never return a tweet with ID X until all nodes in the cluster guarantee 
> that they won’t create a new tweet with an ID less than X?
>
> I implement the following logic:
>
> 1.  Let LATEST start out as the earliest tweet available in the user’s 
> timeline.
>
> 2.  Make a request with since_id={LATEST}, which returns a set of tweets 
> T.
>
> 3.  If T is empty then stop.
>

Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-08 Thread Mark McBride
It's a possibility, but by no means a probability.  Note that you can
mitigate this by using the newest tweet that is outside your "danger zone".
 For example in a sequence of tweets t1, t2 ... ti ... tn with creation
times c1, c2 ... ci ... cn and a comfort threshold e you could use since_id
from the latest ti such that c1 - ci > e.
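
As a rough sketch of the rule above, assuming t1 is the newest tweet in the
response: the function name and the (id, created_at) tuple shape are
hypothetical, and created_at is taken to be a number of seconds.

```python
def safe_since_id(tweets, comfort_s):
    """Pick since_id from the latest tweet older than the comfort threshold.

    tweets: list of (tweet_id, created_at_seconds) tuples, in any order.
    Returns None when no tweet lies outside the "danger zone".
    """
    if not tweets:
        return None
    newest_time = max(created for _, created in tweets)
    # Keep tweets at least comfort_s older than the newest one (c1 - ci > e).
    outside = [(created, tid) for tid, created in tweets
               if newest_time - created > comfort_s]
    if not outside:
        return None
    # Latest creation time wins; ties fall back to the larger id.
    return max(outside)[1]
```

Tweets newer than the returned id come back again on the next request, so
the client still has to drop duplicates.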

  ---Mark

http://twitter.com/mccv


On Thu, Apr 8, 2010 at 4:27 PM, Naveen  wrote:

> This was my initial concern with the randomly generated ids that I
> brought up, though I think Brian described it better than I.
>
> It simply seems very likely that when using since_id to populate newer
> tweets for the user, that some tweets will never be seen, because the
> since_id of the last message received will be larger than one
> generated 1ms later.
>
> With the random generation of ids, I can see two ways to guarantee
> delivery of all tweets in a user's timeline:
> 1. Page forwards and backwards to ensure that no tweet generated at or
> near the same time as the newest one received a lower id. This
> will be very expensive for a mobile client, not to mention that it
> complicates any refresh algorithm significantly.
> 2. Given that we know how IDs are generated (i.e. which bits represent
> the time), we can simply over-request by decrementing the since_id time
> bits by a second or two and filtering out duplicates. (Again, not really
> ideal for mobile clients where battery life is an issue, plus it then
> makes the implementation very dependent on Twitter's id format
> remaining stable.)
>
> Please anyone explain if Brian and I are misinterpreting this as a
> very real possibility of never displaying some tweets in a time line,
> without changing how we request data from twitter (i.e. since_id
> doesn't break)
>
> --Naveen Ayyagari
> @knight9
> @SocialScope
>
>
> On Apr 8, 7:01 pm, "Brian Smith"  wrote:
> > What does “within the caveats given above” mean? Either since_id will
> work or it won’t. It seems to me that if IDs are only in a “rough” order,
> since_id won’t work—in particular, there is a possibility that paging
> through tweets using since_id will completely skip over some tweets.
> >
> > My concern is that, since tweets will not be serialized at the time they
> are written, there will be a race condition between me making a request and
> users posting new statuses. That is, I could get a response with the largest
> id in the response being X that gets evaluated just before a tweet (X-1) has
> been saved in the database; If so, when I issue a request with since_id=X,
> my program will never see the newer tweet (X-1).
> >
> > Are you going to change the implementation of the timeline methods so
> that they never return a tweet with ID X until all nodes in the cluster
> guarantee that they won’t create a new tweet with an ID less than X?
> >
> > I implement the following logic:
> >
> > 1.  Let LATEST start out as the earliest tweet available in the
> user’s timeline.
> >
> > 2.  Make a request with since_id={LATEST}, which returns a set of
> tweets T.
> >
> > 3.  If T is empty then stop.
> >
> > 4.  Let LATEST= max({ id(t), for all t in T}).
> >
> > 5.  Goto 2.
> >
> > Will I be guaranteed not to skip over any tweets in the timeline using
> this logic? If not, what do I need to do to ensure I get them all?
> >
> > Thanks,
> >
> > Brian
> >
> > From: twitter-development-talk@googlegroups.com [mailto:
> twitter-development-t...@googlegroups.com] On Behalf Of Mark McBride
> > Sent: Thursday, April 08, 2010 5:10 PM
> > To: twitter-development-talk@googlegroups.com
> > Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are
> sequenced
> >
> > Thank you for the feedback.  It's great to hear about the variety of use
> cases people have for the API, and in particular all the different ways
> people are using IDs. To alleviate some of the concerns raised in this
> thread we thought it would be useful to give more details about how we plan
> to generate IDs
> >
> > 1) IDs are still 64-bit integers.  This should minimize any migration
> pains.
> >
> > 2) You can still sort on ID.  Within a few milliseconds you may get out of
> order results, but for most use cases this shouldn't be an issue.
> >
> > 3) since_id will still work (within the caveats given above).
> >
> > 4) We will provide a way to backfill from the streaming API.
> >
> > 5) You cannot use the generated ID to reverse engineer tweet velocity.
>  Note that you can still use the streaming API to determine the rate of
> public statuses.
> >
> >

RE: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-08 Thread Brian Smith
What does “within the caveats given above” mean? Either since_id will work or 
it won’t. It seems to me that if IDs are only in a “rough” order, since_id 
won’t work—in particular, there is a possibility that paging through tweets 
using since_id will completely skip over some tweets. 

 

My concern is that, since tweets will not be serialized at the time they are 
written, there will be a race condition between me making a request and users 
posting new statuses. That is, I could get a response with the largest id in 
the response being X that gets evaluated just before a tweet (X-1) has been 
saved in the database; If so, when I issue a request with since_id=X, my 
program will never see the newer tweet (X-1).

 

Are you going to change the implementation of the timeline methods so that they 
never return a tweet with ID X until all nodes in the cluster guarantee that 
they won’t create a new tweet with an ID less than X?

 

I implement the following logic:

 

1.  Let LATEST start out as the earliest tweet available in the user’s 
timeline.

2.  Make a request with since_id={LATEST}, which returns a set of tweets T.

3.  If T is empty then stop.

4.  Let LATEST= max({ id(t), for all t in T}).

5.  Goto 2.

 

Will I be guaranteed not to skip over any tweets in the timeline using this 
logic? If not, what do I need to do to ensure I get them all?
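
For reference, steps 1 to 5 above can be sketched as follows. The
fetch_since callable is a hypothetical stand-in for the timeline request; it
is assumed to return tweet dicts whose 'id' is greater than since_id.

```python
def page_forward(fetch_since, start_id):
    """Naive since_id paging, following steps 1-5 literally."""
    latest = start_id                 # step 1: LATEST
    collected = []
    while True:
        batch = fetch_since(latest)   # step 2: request with since_id=LATEST
        if not batch:                 # step 3: stop when T is empty
            break
        collected.extend(batch)
        latest = max(t["id"] for t in batch)  # step 4: LATEST = max id in T
    return collected                  # step 5 is the loop itself
```

The race is easy to reproduce with a fake fetch function: if a tweet with id
4 is saved just after a response whose largest id is 5, the next request
uses since_id=5 and the loop ends without ever returning tweet 4.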

 

Thanks,

Brian

 

 

From: twitter-development-talk@googlegroups.com 
[mailto:twitter-development-t...@googlegroups.com] On Behalf Of Mark McBride
Sent: Thursday, April 08, 2010 5:10 PM
To: twitter-development-talk@googlegroups.com
Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are 
sequenced

 

Thank you for the feedback.  It's great to hear about the variety of use cases 
people have for the API, and in particular all the different ways people are 
using IDs. To alleviate some of the concerns raised in this thread we thought 
it would be useful to give more details about how we plan to generate IDs

 

1) IDs are still 64-bit integers.  This should minimize any migration pains.

2) You can still sort on ID.  Within a few milliseconds you may get out of order 
results, but for most use cases this shouldn't be an issue.  

3) since_id will still work (within the caveats given above).  

4) We will provide a way to backfill from the streaming API.

5) You cannot use the generated ID to reverse engineer tweet velocity.  Note 
that you can still use the streaming API to determine the rate of public 
statuses.

 

Additional items of interest

1) At some point we will likely start using this as an ID for direct messages 
too

2) We will almost certainly open source the ID generation code, probably before 
we actually cut over to using it.

3) We STRONGLY suggest that you treat IDs as roughly sorted (roughly being 
within a few ms buckets), opaque 64-bit integers.  We may need to change the 
scheme again at some point in the future, and want to minimize migration pains 
should we need to do this.

 

Hopefully this puts you more at ease with the changes we're making.  If it 
raises new concerns, please let us know!

 

  ---Mark

http://twitter.com/mccv

 

On Mon, Apr 5, 2010 at 4:18 PM, M. Edward (Ed) Borasky  
wrote:

On 04/05/2010 12:55 AM, Tim Haines wrote:
> This made me laugh.  Hard.
>
> On Fri, Apr 2, 2010 at 6:47 AM, Dewald Pretorius  wrote:
>
>> Mark,
>>
>> It's extremely important where you have two bots that reply to each
>> others' tweets. With incorrectly sorted tweets, you get conversations
>> that look completely unnatural.
>>
>> On Apr 1, 1:39 pm, Mark McBride  wrote:
>>> Just out of curiosity, what applications are you building that require
>>> sub-second sorting resolution for tweets?

Yeah - my bot laughed too ;-)

--
M. Edward (Ed) Borasky
borasky-research.net/m-edward-ed-borasky

"A mathematician is a device for turning coffee into theorems." ~ Paul Erdős




 



Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-08 Thread Lil Peck
On Thu, Apr 8, 2010 at 5:39 PM, Nick Arnett  wrote:
>
> I'd love to see an example of two bots replying to each other and looking
> entirely natural!
>
> We all knew this sort of thing was going on, removing the pesky humans from
> the loop, but I always thought it was unintentional.
>
> There's a science fiction story in there somewhere.
>
>

Do Twitterbots dream of electric sheep?


-- 
Subscription settings: 
http://groups.google.com/group/twitter-development-talk/subscribe?hl=en


Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-08 Thread Nick Arnett
On Thu, Apr 1, 2010 at 10:47 AM, Dewald Pretorius  wrote:

> Mark,
>
> It's extremely important where you have two bots that reply to each
> others' tweets. With incorrectly sorted tweets, you get conversations
> that look completely unnatural.


I'd love to see an example of two bots replying to each other and looking
entirely natural!

We all knew this sort of thing was going on, removing the pesky humans from
the loop, but I always thought it was unintentional.

There's a science fiction story in there somewhere.

Nick




Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-08 Thread Mark McBride
Thank you for the feedback.  It's great to hear about the variety of use
cases people have for the API, and in particular all the different ways
people are using IDs. To alleviate some of the concerns raised in this
thread we thought it would be useful to give more details about how we plan
to generate IDs

1) IDs are still 64-bit integers.  This should minimize any migration pains.
2) You can still sort on ID.  Within a few milliseconds you may get out of
order results, but for most use cases this shouldn't be an issue.
3) since_id will still work (within the caveats given above).
4) We will provide a way to backfill from the streaming API.
5) You cannot use the generated ID to reverse engineer tweet velocity.  Note
that you can still use the streaming API to determine the rate of public
statuses.

Additional items of interest
1) At some point we will likely start using this as an ID for direct
messages too
2) We will almost certainly open source the ID generation code, probably
before we actually cut over to using it.
3) We STRONGLY suggest that you treat IDs as roughly sorted (roughly being
within a few ms buckets), opaque 64-bit integers.  We may need to change the
scheme again at some point in the future, and want to minimize migration
pains should we need to do this.

Hopefully this puts you more at ease with the changes we're making.  If it
raises new concerns, please let us know!

  ---Mark

http://twitter.com/mccv

On Mon, Apr 5, 2010 at 4:18 PM, M. Edward (Ed) Borasky wrote:

> On 04/05/2010 12:55 AM, Tim Haines wrote:
> > This made me laugh.  Hard.
> >
> > On Fri, Apr 2, 2010 at 6:47 AM, Dewald Pretorius 
> wrote:
> >
> >> Mark,
> >>
> >> It's extremely important where you have two bots that reply to each
> >> others' tweets. With incorrectly sorted tweets, you get conversations
> >> that look completely unnatural.
> >>
> >> On Apr 1, 1:39 pm, Mark McBride  wrote:
> >>> Just out of curiosity, what applications are you building that require
> >>> sub-second sorting resolution for tweets?
>
> Yeah - my bot laughed too ;-)
> --
> M. Edward (Ed) Borasky
> borasky-research.net/m-edward-ed-borasky
>
> "A mathematician is a device for turning coffee into theorems." ~ Paul
> Erdős
>
>


Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-05 Thread M. Edward (Ed) Borasky
On 04/05/2010 12:55 AM, Tim Haines wrote:
> This made me laugh.  Hard.
> 
> On Fri, Apr 2, 2010 at 6:47 AM, Dewald Pretorius  wrote:
> 
>> Mark,
>>
>> It's extremely important where you have two bots that reply to each
>> others' tweets. With incorrectly sorted tweets, you get conversations
>> that look completely unnatural.
>>
>> On Apr 1, 1:39 pm, Mark McBride  wrote:
>>> Just out of curiosity, what applications are you building that require
>>> sub-second sorting resolution for tweets?

Yeah - my bot laughed too ;-)
-- 
M. Edward (Ed) Borasky
borasky-research.net/m-edward-ed-borasky

"A mathematician is a device for turning coffee into theorems." ~ Paul Erdős




Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-05 Thread Tim Haines
This made me laugh.  Hard.

On Fri, Apr 2, 2010 at 6:47 AM, Dewald Pretorius  wrote:

> Mark,
>
> It's extremely important where you have two bots that reply to each
> others' tweets. With incorrectly sorted tweets, you get conversations
> that look completely unnatural.
>
> On Apr 1, 1:39 pm, Mark McBride  wrote:
> > Just out of curiosity, what applications are you building that require
> > sub-second sorting resolution for tweets?
> >
> >   ---Mark
> >
> > http://twitter.com/mccv
> >
> >
> >
> > On Wed, Mar 31, 2010 at 11:01 PM, Aki  wrote:
> > > It actually makes sense to use tweet ID to sort tweets, because
> > > timestamp is not a valid source of information for accurate sorting.
> > > It is a very common case to have multiple tweets posted at the exact
> > > same second, and it is not possible to reproduce the correct ordering
> > > of tweets on the client side. This can be improved by having better
> > > precision for timestamp (maybe milliseconds), but it is still possible
> > > to get tweets posted at the exact same milliseconds (although it is
> > > very rare).
> >
> > > If Twitter really needs to change the tweet ID scheme, I think better
> > > solution for sorting is required to be provided through API.
> >
> > > On Mar 27, 7:41 am, Taylor Singletary 
> > > wrote:
> > > > Hi Developers,
> >
> > > > It's no secret that Twitter is growing exponentially. The tweets
> > > > keep coming with ever increasing velocity, thanks in large part
> > > > to your great applications.
> >
> > > > Twitter has adapted to the increasing number of tweets in ways
> > > > that have affected you in the past: We moved from 32-bit unsigned
> > > > integers to 64-bit unsigned integers for status IDs some time
> > > > ago. You all weathered that storm with ease. The tweetapoclypse
> > > > was averted, and the tweets kept flowing.
> >
> > > > Now we're reaching the scalability limit of our current tweet ID
> > > > generation scheme. Unlike the previous tweet ID migrations, the
> > > > solution to the current issue is significantly different.
> > > > However, in most cases the new approach we will take will not
> > > > result in any noticeable differences to you the developer or
> > > > your users.
> >
> > > > We are planning to replace our current sequential tweet ID
> > > > generation routine with a simple, more scalable solution. IDs
> > > > will still be 64-bit unsigned integers. However, this new
> > > > solution is no longer guaranteed to generate sequential IDs.
> > > > Instead IDs will be derived based on time: the most significant
> > > > bits being sourced from a timestamp and the least significant
> > > > bits will be effectively random.
> >
> > > > Please don't depend on the exact format of the ID. As our
> > > > infrastructure needs evolve, we might need to tweak the
> > > > generation algorithm again.
> >
> > > > If you've been trying to divine meaning from status IDs aside
> > > > from their role as a primary key, you won't be able to anymore.
> > > > Likewise for usage of IDs in mathematical operations -- for
> > > > instance, subtracting two status IDs to determine the number of
> > > > tweets in between will no longer be possible.
> >
> > > > For the majority of applications we think this scheme switch
> > > > will be a non-event. Before implementing these changes, we'd
> > > > like to know if your applications currently depend on the
> > > > sequential nature of IDs. Do you depend on the density of the
> > > > tweet sequence being constant?  Are you trying to analyze the
> > > > IDs as anything other than opaque, ordered identifiers? Aside
> > > > from guaranteed sequential tweet ID ordering, what APIs can we
> > > > provide you to accomplish your goals?
> >
> > > > Taylor Singletary
> > > > Developer Advocate, Twitter http://twitter.com/episod
> >


Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-04-01 Thread Mark McBride
Just out of curiosity, what applications are you building that require
sub-second sorting resolution for tweets?

  ---Mark

http://twitter.com/mccv


On Wed, Mar 31, 2010 at 11:01 PM, Aki  wrote:

> It actually makes sense to use the tweet ID to sort tweets, because the
> timestamp alone is not precise enough for accurate sorting. Multiple
> tweets are very commonly posted within the same second, and in that case
> the client cannot reproduce their correct order from the timestamp. A
> more precise timestamp (milliseconds, say) would help, but two tweets
> could still be posted in the same millisecond, although that would be
> very rare.
>
> If Twitter really needs to change the tweet ID scheme, I think a better
> way to sort tweets needs to be provided through the API.
>
> On Mar 27, 7:41 am, Taylor Singletary wrote:
> > Hi Developers,
> >
> > It's no secret that Twitter is growing exponentially. The tweets keep coming
> > with ever increasing velocity, thanks in large part to your great
> > applications.
> >
> > Twitter has adapted to the increasing number of tweets in ways that have
> > affected you in the past: We moved from 32 bit unsigned integers to 64-bit
> > unsigned integers for status IDs some time ago. You all weathered that storm
> > with ease. The tweetapoclypse was averted, and the tweets kept flowing.
> >
> > Now we're reaching the scalability limit of our current tweet ID generation
> > scheme. Unlike the previous tweet ID migrations, the solution to the current
> > issue is significantly different. However, in most cases the new approach we
> > will take will not result in any noticeable differences to you the developer
> > or your users.
> >
> > We are planning to replace our current sequential tweet ID generation
> > routine with a simple, more scalable solution. IDs will still be 64-bit
> > unsigned integers. However, this new solution is no longer guaranteed to
> > generate sequential IDs.  Instead IDs will be derived based on time: the
> > most significant bits being sourced from a timestamp and the least
> > significant bits will be effectively random.
> >
> > Please don't depend on the exact format of the ID. As our infrastructure
> > needs evolve, we might need to tweak the generation algorithm again.
> >
> > If you've been trying to divine meaning from status IDs aside from their
> > role as a primary key, you won't be able to anymore. Likewise for usage of
> > IDs in mathematical operations -- for instance, subtracting two status IDs
> > to determine the number of tweets in between will no longer be possible.
> >
> > For the majority of applications we think this scheme switch will be a
> > non-event. Before implementing these changes, we'd like to know if your
> > applications currently depend on the sequential nature of IDs. Do you depend
> > on the density of the tweet sequence being constant?  Are you trying to
> > analyze the IDs as anything other than opaque, ordered identifiers? Aside
> > for guaranteed sequential tweet ID ordering, what APIs can we provide you to
> > accomplish your goals?
> >
> > Taylor Singletary
> > Developer Advocate, Twitter
> > http://twitter.com/episod
>
>
>


Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-03-31 Thread Adam Fields
On Wed, Mar 31, 2010 at 07:30:00AM -0700, eugene.man...@gmail.com wrote:
> Second that. Our app continuously retrieves feeds of individual users
> and lists. Monotonically increasing IDs are required to be able to do
> that (using since_id).
[...]

Since the most significant bits are generated from a timestamp, later
tweets will always have a higher number than earlier ones (except in
the case of the black hole explorer probe tweeting its progress from
within the event horizon).

To illustrate this with decimal digits from 0-9:

If two users post three tweets each in the space of three seconds,
the IDs may come out like this (the first digit is the timestamp, the
second is the random digit):

User 1: 05
User 2: 06
User 1: 17
User 2: 12
User 1: 27
User 2: 29

Tweets 12 and 17 are "out of order", but there is no real order
between them to preserve, since they were posted at the same time (to
within the precision of the timestamp) by different users. User 1's
tweets (05, 17, 27) and User 2's tweets (06, 12, 29) will always be
ordered properly by time within each user even though the second digit
is random, provided a user does not post twice within the same
timestamp tick.
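
The same idea can be sketched with real bit operations. This is a
minimal Python sketch, not Twitter's actual algorithm: the 22-bit split
(`RANDOM_BITS`), the function name `make_id`, and the integer ticks are
all assumptions made for illustration, since the announcement
deliberately leaves the exact format unspecified.

```python
import random

RANDOM_BITS = 22  # assumed split; the real layout is not given in this thread


def make_id(timestamp, rng):
    """Pack a timestamp into the high bits and random noise into the low bits."""
    return (timestamp << RANDOM_BITS) | rng.getrandbits(RANDOM_BITS)


rng = random.Random(0)
# Two hypothetical users each post once in ticks 0, 1 and 2.
user1 = [make_id(t, rng) for t in (0, 1, 2)]
user2 = [make_id(t, rng) for t in (0, 1, 2)]

# Within each user the IDs increase, because the timestamps increase.
assert user1 == sorted(user1)
assert user2 == sorted(user2)
# Any ID from a later tick exceeds every ID from an earlier tick.
assert min(user1[1], user2[1]) > max(user1[0], user2[0])
```

Only IDs created within the same tick can compare in either direction,
which matches the 12 versus 17 case above.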

-- 
- Adam
--
If you liked this email, you might also like:
"Good article on technical aspects of lens variation" 
-- http://workstuff.tumblr.com/post/479306926
"Cooking at home is different" 
-- http://www.aquick.org/blog/2009/10/15/cooking-at-home-is-different/
"Bloom" 
-- http://www.flickr.com/photos/fields/4449638140/
"fields: RT @smokingapples: Warning: Clicking this link might result in 
uncontr..." 
-- http://twitter.com/fields/statuses/11338927699
--
** I design intricate-yet-elegant processes for user and machine problems.
** Custom development project broken? Contact me, I can help.
** Some of what I do: http://workstuff.tumblr.com/post/70505118/aboutworkstuff

[ http://www.adamfields.com/resume.html ].. Experience
[ http://www.morningside-analytics.com ] .. Latest Venture
[ http://www.confabb.com ]  Founder




Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-03-27 Thread Chad Etzel
So, I guess for the since_id issue, it boils down to this question:

Regarding the since_id parameter, when you (Twitter) flip the switch
on the new ID format, will I (as a developer) have to change any of my
code in order for it to function the way it does now? This question
applies equally for both the Twitter API and the Search API.

Check One:

[ ] YES
[ ] NO

Taylor's previous response alluded to "no" (a good thing), but I
wasn't 100% assured.

-Chad

To unsubscribe from this group, send email to 
twitter-development-talk+unsubscribegooglegroups.com or reply to this email 
with the words "REMOVE ME" as the subject.


Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-03-27 Thread Josh Bleecher Snyder
> So I think we need to allow Twitter some leeway here.

I apologize if my tone came off badly; it was not intended. I've just
had bumpy rides using timestamps for coordination in distributed
systems (less cool ones than space flight), so this worried me a
little. In the end, whatever Twitter decides to do, I'll work with.


> As far as occasional glitches are concerned, we have those now. Every
> so often, we still get Fail Whales, 5xx errors, DDoS attacks, etc.

The difference is that those errors are straightforwardly detectable
on the client side and can be handled more or less gracefully. Minor,
intermittent data issues (like the odd missing tweet) are less
straightforward to detect, but still trigger support emails. :)

-josh



Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-03-27 Thread TjL
Will you still be able to look at two relative IDs and tell which one
came first and which one came second?



Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-03-26 Thread Josh Bleecher Snyder
Hi Taylor (et al.),

There are two reasons to think that, with the scheme you propose,
tweet ids will not necessarily be monotonically increasing.

Naveen hit the first:

> It seems if two messages are posted at very close to the same time, they
> may not be sequential since the bottom bits will be randomly generated

There is another: Time synchronization is hard to always get right
(Einstein jokes aside). Clock skew happens for any number of reasons
-- sometimes ntpd sends time backwards when network i/o gets really
ugly, machine clocks wander, colos get out of sync, humans err, etc.
These are rare events, but they do happen, and they can cause
misalignment of clocks big enough for the odd tweet or two to fall
through.

Does missing the odd tweet or two matter? As for the tweets themselves:
Probably not. But if it gets noticed, it causes users / developers to
lose some amount of trust in their app / platform...and that matters a
lot and can also generate a lot of annoying support emails.


You wrote:

> since_id will work as well as it does today as a result of this change.

Is that assuming monotonically increasing tweet ids? If not, would you
mind elaborating?


Having a universal counter is untenable, but having occasional,
undiagnosable, unreproducible glitches also sucks. :) Thinking out
loud, perhaps there is some middle ground -- a way to have generally
monotonically increasing ids globally, and guaranteed monotonically
increasing ids along some useful dimension, such as per user (this
doesn't play nicely e.g. w/ Cassandra, but it is still reasonably
scalable by other means). Not sure whether that would help folks or
not...
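
One way to picture the "guaranteed along some useful dimension" idea is
a generator that refuses to go backwards even when its clock does. The
following Python sketch is purely hypothetical: the class
`GuardedIdGenerator`, the injected `clock` callable, and the 22-bit
split are assumptions, not anything Twitter has described. It only
guarantees increasing IDs within one generator, for example one
generator per user shard.

```python
import random


class GuardedIdGenerator:
    """Hypothetical generator whose IDs never decrease, even under clock skew."""

    def __init__(self, clock, random_bits=22, rng=None):
        self.clock = clock              # callable returning an integer tick
        self.random_bits = random_bits  # assumed split, not a documented format
        self.rng = rng if rng is not None else random.Random()
        self.last_ts = 0
        self.last_id = -1

    def next_id(self):
        # If the clock was set backwards, keep using the last timestamp seen.
        ts = max(self.clock(), self.last_ts)
        noise = self.rng.getrandbits(self.random_bits)
        candidate = (ts << self.random_bits) | noise
        # Within one tick the random bits may not increase, so step past
        # the previous ID when necessary.
        if candidate <= self.last_id:
            candidate = self.last_id + 1
        self.last_ts = candidate >> self.random_bits
        self.last_id = candidate
        return candidate


# Usage: the third clock reading jumps backwards, as ntpd sometimes does.
ticks = iter([1000, 1001, 999, 1001, 1002])
gen = GuardedIdGenerator(clock=lambda: next(ticks), rng=random.Random(0))
ids = [gen.next_id() for _ in range(5)]
assert ids == sorted(ids) and len(set(ids)) == 5
```

This does nothing for ordering across generators, so IDs would still be
only generally increasing at the global level.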

-josh



Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-03-26 Thread Naveen Ayyagari
I am still a little unclear on whether we will be able to determine the correct 
since_id to pass to the API by always looking for the largest tweet ID we have 
seen. 

It seems that if two messages are posted at very close to the same time, they 
may not be sequential, since the bottom bits will be randomly generated, and I 
will not be able to safely just always use the largest ID I have seen as the 
since_id?

Correct me if I am confusing myself please. 
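
The concern can be made concrete with a toy simulation. This Python
sketch uses two-digit IDs (tick digit, then random digit) and a
stand-in function `fetch_since` in place of a real API call; both are
hypothetical, and nothing in this thread confirms that the real API can
expose tweets in this order. It also shows one possible client-side
mitigation: re-request from an older ID already seen and drop
duplicates.

```python
def fetch_since(visible_ids, since_id):
    """Stand-in for an API call: return visible IDs greater than since_id."""
    return sorted(i for i in visible_ids if i > since_id)


def poll_with_overlap(visible_ids, seen, overlap=1):
    """Re-request from an older ID we have already seen, then deduplicate."""
    ordered = sorted(seen)
    # Use the ID `overlap` positions before the newest one (0 if too few).
    since_id = ordered[-1 - overlap] if len(ordered) > overlap else 0
    fresh = [i for i in fetch_since(visible_ids, since_id) if i not in seen]
    seen.update(fresh)
    return fresh


# First poll: ID 12 was created in tick 1 but is not visible yet.
visible = {5, 17}
seen = set(fetch_since(visible, 0))
since_id = max(seen)  # 17

# ID 12 becomes visible afterwards, together with a new tweet 29.
visible.update({12, 29})

# Polling from the largest ID seen skips 12, because 12 < 17.
assert fetch_since(visible, since_id) == [29]

# Polling with one ID of overlap picks it up.
assert poll_with_overlap(visible, seen) == [12, 29]
```

The overlap approach avoids doing arithmetic on the ID itself, so it
does not depend on the ID format.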



On Mar 26, 2010, at 5:33 PM, Taylor Singletary wrote:

> A quick clarification for you all since there seems to be the most concern 
> around using since_id as a parameter:
> 
> since_id will work as well as it does today as a result of this change. 
> 
> Also, a reminder that the actual integer format of the tweet IDs will not be 
> changing. They'll still be unsigned 64bit integers as they are today.
> 
> Taylor Singletary
> Developer Advocate, Twitter
> http://twitter.com/episod
> 
> 
> On Fri, Mar 26, 2010 at 1:41 PM, Taylor Singletary 
>  wrote:
> Hi Developers,
> 
> It's no secret that Twitter is growing exponentially. The tweets keep coming 
> with ever increasing velocity, thanks in large part to your great 
> applications.
> 
> Twitter has adapted to the increasing number of tweets in ways that have 
> affected you in the past: We moved from 32 bit unsigned integers to 64-bit 
> unsigned integers for status IDs some time ago. You all weathered that storm 
> with ease. The tweetapoclypse was averted, and the tweets kept flowing.
> 
> Now we're reaching the scalability limit of our current tweet ID generation 
> scheme. Unlike the previous tweet ID migrations, the solution to the current 
> issue is significantly different. However, in most cases the new approach we 
> will take will not result in any noticeable differences to you the developer 
> or your users.
> 
> We are planning to replace our current sequential tweet ID generation routine 
> with a simple, more scalable solution. IDs will still be 64-bit unsigned 
> integers. However, this new solution is no longer guaranteed to generate 
> sequential IDs.  Instead IDs will be derived based on time: the most 
> significant bits being sourced from a timestamp and the least significant 
> bits will be effectively random. 
> 
> Please don't depend on the exact format of the ID. As our infrastructure 
> needs evolve, we might need to tweak the generation algorithm again.
> 
> If you've been trying to divine meaning from status IDs aside from their role 
> as a primary key, you won't be able to anymore. Likewise for usage of IDs in 
> mathematical operations -- for instance, subtracting two status IDs to 
> determine the number of tweets in between will no longer be possible.
> 
> For the majority of applications we think this scheme switch will be a 
> non-event. Before implementing these changes, we'd like to know if your 
> applications currently depend on the sequential nature of IDs. Do you depend 
> on the density of the tweet sequence being constant?  Are you trying to 
> analyze the IDs as anything other than opaque, ordered identifiers? Aside for 
> guaranteed sequential tweet ID ordering, what APIs can we provide you to 
> accomplish your goals?
> 
> Taylor Singletary
> Developer Advocate, Twitter
> http://twitter.com/episod
> 
> 



Re: [twitter-dev] Re: Upcoming changes to the way status IDs are sequenced

2010-03-26 Thread Nigel Legg
I hope you're right, but my app design depends on since_id, and before I
proceed further I want to be sure that I will not have to rebuild when this
new format comes in.

On 26 March 2010 21:09, Ray Krueger  wrote:

> I would think that this would make no difference for since_id. The
> purpose of since_id is for us to tell the API "give me the data I need
> that's happened since this id". Don't assume it's implemented as
> "select * from tweets where id > since_id". :)
>
>
> On Mar 26, 4:01 pm, Michael Bleigh  wrote:
> > To those voicing concerns about since_id I believe the key word is
> > that they will no longer be *sequential*, something entirely different
> > from them no longer being *increasing*. Since ID is a core part of the
> > Twitter API that I very much doubt will be in jeopardy from this
> > change. Twitter devs feel free to back me up or refute me. :)
> >
> > On Mar 26, 4:41 pm, Taylor Singletary wrote:
> >
> > > Hi Developers,
> >
> > > It's no secret that Twitter is growing exponentially. The tweets keep coming
> > > with ever increasing velocity, thanks in large part to your great
> > > applications.
> >
> > > Twitter has adapted to the increasing number of tweets in ways that have
> > > affected you in the past: We moved from 32 bit unsigned integers to 64-bit
> > > unsigned integers for status IDs some time ago. You all weathered that storm
> > > with ease. The tweetapoclypse was averted, and the tweets kept flowing.
> >
> > > Now we're reaching the scalability limit of our current tweet ID generation
> > > scheme. Unlike the previous tweet ID migrations, the solution to the current
> > > issue is significantly different. However, in most cases the new approach we
> > > will take will not result in any noticeable differences to you the developer
> > > or your users.
> >
> > > We are planning to replace our current sequential tweet ID generation
> > > routine with a simple, more scalable solution. IDs will still be 64-bit
> > > unsigned integers. However, this new solution is no longer guaranteed to
> > > generate sequential IDs.  Instead IDs will be derived based on time: the
> > > most significant bits being sourced from a timestamp and the least
> > > significant bits will be effectively random.
> >
> > > Please don't depend on the exact format of the ID. As our infrastructure
> > > needs evolve, we might need to tweak the generation algorithm again.
> >
> > > If you've been trying to divine meaning from status IDs aside from their
> > > role as a primary key, you won't be able to anymore. Likewise for usage of
> > > IDs in mathematical operations -- for instance, subtracting two status IDs
> > > to determine the number of tweets in between will no longer be possible.
> >
> > > For the majority of applications we think this scheme switch will be a
> > > non-event. Before implementing these changes, we'd like to know if your
> > > applications currently depend on the sequential nature of IDs. Do you depend
> > > on the density of the tweet sequence being constant?  Are you trying to
> > > analyze the IDs as anything other than opaque, ordered identifiers? Aside
> > > for guaranteed sequential tweet ID ordering, what APIs can we provide you to
> > > accomplish your goals?
> >
> > > Taylor Singletary
> > > Developer Advocate, Twitter
> > > http://twitter.com/episod
>
>
