Mark, thank you for taking the time to respond. 

 

What is the smallest “comfort threshold” that will guarantee we will see all 
the tweets, with none skipped over and as few tweets as possible returned 
multiple times?

 

Let’s say the comfort threshold were 2 seconds. It seems to me that there 
could realistically be dozens or hundreds of tweets within those two seconds 
in a single timeline, and a request that used the logic you mentioned could 
return an entire page (200 tweets) consisting of tweets the application 
already has. The application would be making a relatively large download, 
receiving nothing useful from it, and making no progress because its since_id 
would get “stuck”. That is at odds with many (most?) applications’ goal in 
using since_id, which is to transfer as little data as possible.
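
To make the failure mode concrete, here is a rough sketch in Python 
(fetch_home_timeline is just a placeholder for whatever REST call the 
application makes; it is assumed to return a list of tweet dicts with an 
"id" field):

    def refresh(fetch_home_timeline, seen_ids, since_id):
        # Ask for everything newer than the last ID we can safely use.
        page = fetch_home_timeline(since_id=since_id, count=200)
        new_tweets = [t for t in page if t["id"] not in seen_ids]
        if not new_tweets:
            # The entire 200-tweet download was tweets we already had, and
            # since_id cannot safely advance: no progress, wasted bandwidth.
            return since_id, []
        return max(t["id"] for t in new_tweets), new_tweets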

 

It seems like a better alternative would be a new parameter that says “don’t 
give me any tweets that are less than <X> seconds old,” where <X> is the 
comfort threshold. That way the application might lag behind by a few 
seconds, but at least it could confidently page through the timeline without 
excessive data transfer. Without such a mechanism, this change looks like a 
significant degradation of service that will leave applications’ “refresh” 
features either unreliable or very wasteful.
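
To illustrate what I mean in Python (the min_age_secs parameter is invented 
for this example; nothing like it exists in the API today):

    COMFORT_THRESHOLD_SECS = 2
    last_stored_id = 12345  # placeholder: the newest ID the app already has

    # Hypothetical query parameters for a timeline request. The server would
    # interpret min_age_secs as "return nothing created less than this many
    # seconds ago."
    params = {
        "since_id": last_stored_id,
        "min_age_secs": COMFORT_THRESHOLD_SECS,  # made-up parameter
        "count": 200,
    }

    # Every tweet returned under these rules would be safe to store, and
    # since_id could always advance to the largest ID returned; the
    # application merely runs about 2 seconds behind real time.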

 

But, is it realistic for applications to expect the Twitter cluster to be in 
sync within 2 seconds? 10 seconds? 30 seconds? That is the part that is unclear 
to me. 

 

Thanks again,

Brian

 

 

From: twitter-development-talk@googlegroups.com 
[mailto:twitter-development-t...@googlegroups.com] On Behalf Of Mark McBride
Sent: Thursday, April 08, 2010 6:38 PM
To: twitter-development-talk@googlegroups.com
Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are 
sequenced

 

It's a possibility, but by no means a probability.  Note that you can mitigate 
this by using the newest tweet that is outside your "danger zone".  For example 
in a sequence of tweets t1, t2 ... ti ... tn with creation times c1, c2 ... ci 
... cn and a comfort threshold e you could use since_id from the latest ti such 
that c1 - ci > e.
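
A rough sketch of that selection in Python, assuming the page of tweets has 
been parsed into a newest-first list of (id, created_at) pairs:

    from datetime import timedelta

    def safe_since_id(tweets, comfort_threshold=timedelta(seconds=2)):
        # tweets: [(tweet_id, created_at), ...] sorted newest-first.
        # Returns the ID of the newest tweet whose age, measured against the
        # newest tweet on the page, exceeds the comfort threshold (c1 - ci > e
        # in Mark's notation), or None if the whole page is inside the window.
        if not tweets:
            return None
        newest_time = tweets[0][1]                  # c1
        for tweet_id, created_at in tweets:         # t1, t2, ..., tn
            if newest_time - created_at > comfort_threshold:  # c1 - ci > e
                return tweet_id
        return None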


  ---Mark

http://twitter.com/mccv



On Thu, Apr 8, 2010 at 4:27 PM, Naveen <knig...@gmail.com> wrote:

This was my initial concern with the randomly generated IDs that I
brought up, though I think Brian described it better than I did.

It simply seems very likely that, when using since_id to fetch newer
tweets for the user, some tweets will never be seen, because the
since_id of the last message received may be larger than an ID
generated 1 ms later.

With the random generation of IDs, I can see two ways to guarantee
delivery of all tweets in a user's timeline:
1. Page forwards and backwards to make sure that no tweet generated at or
near the same time as the newest one received a lower ID. This will be
very expensive for a mobile client, not to mention that it significantly
complicates any refresh algorithm.
2. Given that we know how IDs are generated (i.e., which bits represent
the time), simply over-request by decrementing the since_id time bits by
a second or two and filter out duplicates; see the rough sketch below.
(Again, this is not really ideal for mobile clients where battery life is
an issue, and it also makes the implementation very dependent on Twitter's
ID format remaining stable.)
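
A rough sketch of option 2 in Python. The bit layout here is purely
hypothetical (Twitter has not published the actual format in this thread),
so TIMESTAMP_SHIFT is a placeholder rather than the real scheme:

    TIMESTAMP_SHIFT = 22   # hypothetical: assume the low 22 bits hold
                           # sequence/machine data and the remaining high
                           # bits hold a millisecond timestamp

    def over_requested_since_id(last_seen_id, slack_ms=2000):
        # Rewind the (assumed) time portion of the newest ID we have by a
        # second or two, so the next since_id request deliberately overlaps
        # the uncertainty window; the client then filters out duplicates.
        timestamp_ms = last_seen_id >> TIMESTAMP_SHIFT
        rewound_ms = max(timestamp_ms - slack_ms, 0)
        return rewound_ms << TIMESTAMP_SHIFT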

Could anyone please explain whether Brian and I are misinterpreting this?
It seems like a very real possibility that some tweets in a timeline will
never be displayed, short of changing how we request data from Twitter
(i.e., assuming since_id doesn't break).

--Naveen Ayyagari
@knight9
@SocialScope



On Apr 8, 7:01 pm, "Brian Smith" <br...@briansmith.org> wrote:
> What does “within the caveats given above” mean? Either since_id will work or 
> it won’t. It seems to me that if IDs are only in a “rough” order, since_id 
> won’t work—in particular, there is a possibility that paging through tweets 
> using since_id will completely skip over some tweets.
>
> My concern is that, since tweets will not be serialized at the time they are 
> written, there will be a race condition between my making a request and users 
> posting new statuses. That is, I could get a response whose largest ID is X, 
> evaluated just before a tweet with ID X-1 has been saved in the database. If 
> so, when I issue a request with since_id=X, my program will never see the 
> newer tweet X-1.
>
> Are you going to change the implementation of the timeline methods so that 
> they never return a tweet with ID X until all nodes in the cluster guarantee 
> that they won’t create a new tweet with an ID less than X?
>
> Suppose I implement the following logic (sketched in code below the list):
>
> 1. Let LATEST start out as the earliest tweet available in the user’s 
> timeline.
>
> 2. Make a request with since_id={LATEST}, which returns a set of tweets T.
>
> 3. If T is empty, then stop.
>
> 4. Let LATEST = max({ id(t) for all t in T }).
>
> 5. Go to 2.
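>
> In code, that loop amounts to roughly the following (a sketch only; 
> get_timeline stands in for whichever timeline call the application makes and 
> is assumed to return a list of tweet dicts with an "id" field):
>
>     def poll_forward(get_timeline, earliest_id):
>         latest = earliest_id                        # step 1
>         while True:
>             tweets = get_timeline(since_id=latest)  # step 2
>             if not tweets:                          # step 3
>                 return
>             latest = max(t["id"] for t in tweets)   # step 4
>             # step 5: loop back to step 2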
>
> Will I be guaranteed not to skip over any tweets in the timeline using this 
> logic? If not, what do I need to do to ensure I get them all?
>
> Thanks,
>
> Brian
>
> From: twitter-development-talk@googlegroups.com 
> [mailto:twitter-development-t...@googlegroups.com] On Behalf Of Mark McBride
> Sent: Thursday, April 08, 2010 5:10 PM
> To: twitter-development-talk@googlegroups.com
> Subject: Re: [twitter-dev] Re: Upcoming changes to the way status IDs are 
> sequenced
>
> Thank you for the feedback.  It's great to hear about the variety of use 
> cases people have for the API, and in particular all the different ways 
> people are using IDs. To alleviate some of the concerns raised in this thread 
> we thought it would be useful to give more details about how we plan to 
> generate IDs:
>
> 1) IDs are still 64-bit integers.  This should minimize any migration pains.
>
> 2) You can still sort on ID.  Within a few milliseconds you may get out of 
> order results, but for most use cases this shouldn't be an issue.  
>
> 3) since_id will still work (within the caveats given above).  
>
> 4) We will provide a way to backfill from the streaming API.
>
> 5) You cannot use the generated ID to reverse engineer tweet velocity.  Note 
> that you can still use the streaming API to determine the rate of public 
> statuses.
>
> Additional items of interest
>
> 1) At some point we will likely start using this as an ID for direct messages 
> too
>
> 2) We will almost certainly open source the ID generation code, probably 
> before we actually cut over to using it.
>
> 3) We STRONGLY suggest that you treat IDs as roughly sorted (roughly being 
> within a few ms buckets), opaque 64-bit integers.  We may need to change the 
> scheme again at some point in the future, and want to minimize migration 
> pains should we need to do this.
>
> Hopefully this puts you more at ease with the changes we're making.  If it 
> raises new concerns, please let us know!
>
>   ---Mark
>

> http://twitter.com/mccv

>
> On Mon, Apr 5, 2010 at 4:18 PM, M. Edward (Ed) Borasky <zn...@comcast.net> 
> wrote:
>
> On 04/05/2010 12:55 AM, Tim Haines wrote:
>
> > This made me laugh.  Hard.
>

> > On Fri, Apr 2, 2010 at 6:47 AM, Dewald Pretorius <dpr...@gmail.com> wrote:
>
> >> Mark,
>
> >> It's extremely important where you have two bots that reply to each
> >> others' tweets. With incorrectly sorted tweets, you get conversations
> >> that look completely unnatural.
>
> >> On Apr 1, 1:39 pm, Mark McBride <mmcbr...@twitter.com> wrote:
> >>> Just out of curiosity, what applications are you building that require
> >>> sub-second sorting resolution for tweets?
>
> Yeah - my bot laughed too ;-)
>
> --
> M. Edward (Ed) Borasky
> borasky-research.net/m-edward-ed-borasky
>
> "A mathematician is a device for turning coffee into theorems." ~ Paul Erdős
>

 
