[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results
Hey developers, any hints/tips on how I can get the Twitter API team to focus on this issue? It's hard to build a business on the Twitter API when a crucial feature like this just stops working and we get radio silence for days. Any tips on how I can help the team focus on this?

On Sep 9, 10:10 am, alexc chy101...@gmail.com wrote:
> this issue still pops up: http://twitter.com/friends/ids/downingstreet.xml?page=3
[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results
this issue still pops up: http://twitter.com/friends/ids/downingstreet.xml?page=3
[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results
Flat file generation and maintenance would be foolish at this stage. Separating out the individual data sets purely for the API, served by different clusters with server-side caching, may fit the bill - but tbh, if this isn't happening already I'll be shocked.

On Sep 7, 5:40 am, Jesse Stay jesses...@gmail.com wrote:
> As far as retrieving the large graphs from a DB goes, flat files are one way - another is to just store the full graph (of ids) in a single column in the database and parse it on retrieval. This is what FriendFeed is doing currently, or so they've said. Dewald and I are both talking about this because we're also having to duplicate this on our own servers, so we too have to deal with the pains of the social graph. (And oh, the pain it is!)
>
> On Sun, Sep 6, 2009 at 8:44 PM, Dewald Pretorius dpr...@gmail.com wrote:
> > If I worked for Twitter, here's what I would have done. I would have grabbed the follower id lists of the large accounts (those that usually kicked back 502s) and written them to flat files once every 5 or so minutes. When an API request comes in for that list, I'd just grab it from the flat file, instead of asking the DB to select 2+ million ids from amongst a few billion records while it's trying to do a few thousand other selects at the same time. That's one way of getting rid of 502s on large social graph lists. Okay, the data is going to be 5 minutes outdated. To that I say: so bloody what?
> >
> > Dewald
[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results
This describes what I'd call row-based pagination. Cursor-based pagination does not suffer from the same jitter issues. A cursor-based approach returns an opaque value that is unique within the total set, ordered, and indexed for constant-time access. Removals in pages before or after do not affect the stability of next-page retrieval. For a user with a large following, you'll never have a point-in-time snapshot of their followings with any approach, but you can retrieve a complete, unique set of the users that were followers throughout the duration of the query. Additions made while the query is running may or may not be returned, as chance allows.

A row-based approach with OFFSET and LIMIT is doomed for reasons beyond correctness. The latency and CPU consumption, in MySQL at least, tend toward O(N^2). The first few blocks aren't bad. The last few blocks for a 10M, or even 1M, set are miserable.

The jitter demonstrated by the current API is due to a minor and correctable design flaw in the allocation of the opaque cursor values. A fix is scheduled.

-John Kalucki
http://twitter.com/jkalucki
Services, Twitter Inc.

On Sep 6, 7:27 pm, Dewald Pretorius dpr...@gmail.com wrote:
> There is no way that paging through a large and volatile data set can ever return results that are 100% accurate.
>
> Let's say one wants to page through @aplusk's followers list. That's going to take between 3 and 5 minutes just to collect the follower ids with page (or the new cursors). It is likely that some of the follower ids that you have gone past and have already collected have unfollowed @aplusk while you are still collecting the rest.
>
> I assume that the Twitter system does paging by doing a standard SQL LIMIT clause. If you do LIMIT 100, 20 and some of the ids that you have already paged past have been deleted, the result set is going to shift to the left and you are going to miss the ones that were above 100 but have subsequently moved left to below 100.
>
> There really are only two solutions to this problem: a) we need to have the capability to reliably retrieve the entire result set in one API call, or b) everyone has to accept that the result set cannot be guaranteed to be 100% accurate.
>
> Dewald
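To make the cursoring John describes concrete, here is a minimal client-side sketch written against the documented followers/ids cursor convention (cursor=-1 for the first page, next_cursor=0 when there are no more pages); the exact URL and the absence of authentication are simplifications for illustration, not an official example:

    import json
    import urllib.request

    def fetch_all_follower_ids(screen_name):
        """Walk the follower id list page by page using the opaque cursor."""
        ids, cursor = [], -1                      # -1 asks for the first page
        while cursor != 0:                        # next_cursor of 0 means no more pages
            url = ("http://twitter.com/followers/ids.json"
                   "?screen_name=%s&cursor=%d" % (screen_name, cursor))
            with urllib.request.urlopen(url) as resp:
                page = json.load(resp)
            ids.extend(page["ids"])
            cursor = page["next_cursor"]          # opaque; unaffected by deletions in earlier pages
        return ids

Because each request names the cursor it wants rather than a row offset, a follower unfollowing mid-crawl cannot shift the remaining pages underneath you, which is exactly the jitter the OFFSET approach suffers from.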
[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results
I could really go for jittery right now... instead I'm getting totally broken! I'm getting two pages of results using ?page=x, then empty. To me, it looks like all my accounts have a max of 10K followers. I'd love some kind of official response from Twitter on the status of paging (John?). Example: user @starbucks has nearly 300K followers, however http://twitter.com/followers/ids.xml?id=30973&page=3 returns an empty result.

- Waldron

On Sep 7, 10:24 pm, John Kalucki jkalu...@gmail.com wrote:
> This describes what I'd call row-based pagination. Cursor-based pagination does not suffer from the same jitter issues. A cursor-based approach returns an opaque value that is unique within the total set, ordered, and indexed for constant-time access. Removals in pages before or after do not affect the stability of next-page retrieval. For a user with a large following, you'll never have a point-in-time snapshot of their followings with any approach, but you can retrieve a complete, unique set of the users that were followers throughout the duration of the query. Additions made while the query is running may or may not be returned, as chance allows.
>
> A row-based approach with OFFSET and LIMIT is doomed for reasons beyond correctness. The latency and CPU consumption, in MySQL at least, tend toward O(N^2). The first few blocks aren't bad. The last few blocks for a 10M, or even 1M, set are miserable.
>
> The jitter demonstrated by the current API is due to a minor and correctable design flaw in the allocation of the opaque cursor values. A fix is scheduled.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Services, Twitter Inc.
>
> On Sep 6, 7:27 pm, Dewald Pretorius dpr...@gmail.com wrote:
> > There is no way that paging through a large and volatile data set can ever return results that are 100% accurate.
> >
> > Let's say one wants to page through @aplusk's followers list. That's going to take between 3 and 5 minutes just to collect the follower ids with page (or the new cursors). It is likely that some of the follower ids that you have gone past and have already collected have unfollowed @aplusk while you are still collecting the rest.
> >
> > I assume that the Twitter system does paging by doing a standard SQL LIMIT clause. If you do LIMIT 100, 20 and some of the ids that you have already paged past have been deleted, the result set is going to shift to the left and you are going to miss the ones that were above 100 but have subsequently moved left to below 100.
> >
> > There really are only two solutions to this problem: a) we need to have the capability to reliably retrieve the entire result set in one API call, or b) everyone has to accept that the result set cannot be guaranteed to be 100% accurate.
> >
> > Dewald
[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results
I meant to type, LIMIT 100, 5000.
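For anyone who wants to see the left-shift Dewald is describing, here is a small self-contained demonstration using SQLite from Python; the table and the numbers are invented purely for illustration and say nothing about Twitter's actual schema:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE followers (follower_id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO followers VALUES (?)", [(i,) for i in range(1, 201)])

    # Page 1: the first 100 ids.
    page1 = [r[0] for r in conn.execute(
        "SELECT follower_id FROM followers ORDER BY follower_id LIMIT 100 OFFSET 0")]

    # Five users unfollow while we are between pages.
    conn.execute("DELETE FROM followers WHERE follower_id <= 5")

    # Page 2: the next 100 ids by OFFSET, computed against the shrunken set.
    page2 = [r[0] for r in conn.execute(
        "SELECT follower_id FROM followers ORDER BY follower_id LIMIT 100 OFFSET 100")]

    missing = set(range(1, 201)) - set(page1) - set(page2)
    print(sorted(missing))   # -> [101, 102, 103, 104, 105]

Ids 101 through 105 were already past the first page's window when rows 1 through 5 disappeared, so neither page returns them - the same followers a crawler silently misses mid-crawl.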
[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results
Agreed. Is there a chance Twitter can return the full results in compressed (gzip or similar) format to reduce load, leaving the burden of decompressing on our end and reducing bandwidth? I'm sure there are other areas this could apply to as well. I think you'll find compressing the full social graph of a user significantly reduces the size of the data you have to push through the pipe - my tests have shown it makes a huge difference, and you'd have to get way past the tens of millions of ids before things slow down at all after that.

Jesse

On Sun, Sep 6, 2009 at 8:27 PM, Dewald Pretorius dpr...@gmail.com wrote:
> There is no way that paging through a large and volatile data set can ever return results that are 100% accurate.
>
> Let's say one wants to page through @aplusk's followers list. That's going to take between 3 and 5 minutes just to collect the follower ids with page (or the new cursors). It is likely that some of the follower ids that you have gone past and have already collected have unfollowed @aplusk while you are still collecting the rest.
>
> I assume that the Twitter system does paging by doing a standard SQL LIMIT clause. If you do LIMIT 100, 20 and some of the ids that you have already paged past have been deleted, the result set is going to shift to the left and you are going to miss the ones that were above 100 but have subsequently moved left to below 100.
>
> There really are only two solutions to this problem: a) we need to have the capability to reliably retrieve the entire result set in one API call, or b) everyone has to accept that the result set cannot be guaranteed to be 100% accurate.
>
> Dewald
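A rough sketch of what requesting a compressed response looks like from the client side, assuming the endpoint honours a standard Accept-Encoding: gzip request header; the URL here is only a stand-in for the social-graph methods discussed in this thread, not an official example:

    import gzip
    import io
    import json
    import urllib.request

    url = "http://twitter.com/followers/ids.json?screen_name=example&cursor=-1"
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})

    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        # Only decompress if the server actually compressed the response.
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.GzipFile(fileobj=io.BytesIO(body)).read()

    payload = json.loads(body)
    print(len(payload["ids"]), "follower ids in one request")

Follower id lists are long runs of digits and separators, which is exactly the kind of payload gzip shrinks dramatically, so the savings Jesse reports are plausible even for multi-million-id graphs.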
[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results
If I worked for Twitter, here's what I would have done. I would have grabbed the follower id lists of the large accounts (those that usually kicked back 502s) and written them to flat files once every 5 or so minutes. When an API request comes in for that list, I'd just grab it from the flat file, instead of asking the DB to select 2+ million ids from amongst a few billion records while it's trying to do a few thousand other selects at the same time. That's one way of getting rid of 502s on large social graph lists.

Okay, the data is going to be 5 minutes outdated. To that I say: so bloody what?

Dewald
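A back-of-the-envelope sketch of that flat-file idea; the cache directory, refresh interval, and the database helper are all placeholders for illustration, not how Twitter actually serves these lists:

    import os
    import time

    CACHE_DIR = "/var/cache/follower_ids"     # hypothetical location
    MAX_AGE_SECONDS = 5 * 60                  # "once every 5 or so minutes"

    def follower_ids_from_db(user_id):
        """Placeholder for the expensive SELECT against the real follower table."""
        raise NotImplementedError

    def get_follower_ids(user_id):
        os.makedirs(CACHE_DIR, exist_ok=True)
        path = os.path.join(CACHE_DIR, "%d.txt" % user_id)
        fresh = (os.path.exists(path)
                 and time.time() - os.path.getmtime(path) < MAX_AGE_SECONDS)
        if not fresh:
            ids = follower_ids_from_db(user_id)   # hit the DB at most every 5 minutes
            with open(path, "w") as f:
                f.write("\n".join(str(i) for i in ids))
        with open(path) as f:
            return [int(line) for line in f if line.strip()]

The point of the design is that the expensive multi-million-row select runs at most once per refresh window, while every API request in between is a cheap sequential file read.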
[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results
The other solution would be to send it to us in batch results, attaching a timestamp to the results telling us this is what the user's social graph looked like at time x. I personally would start with the compressed format though, as that makes it possible to retrieve everything in a single request.

On Sun, Sep 6, 2009 at 10:33 PM, Jesse Stay jesses...@gmail.com wrote:
> Agreed. Is there a chance Twitter can return the full results in compressed (gzip or similar) format to reduce load, leaving the burden of decompressing on our end and reducing bandwidth? I'm sure there are other areas this could apply to as well. I think you'll find compressing the full social graph of a user significantly reduces the size of the data you have to push through the pipe - my tests have shown it makes a huge difference, and you'd have to get way past the tens of millions of ids before things slow down at all after that.
>
> Jesse
>
> On Sun, Sep 6, 2009 at 8:27 PM, Dewald Pretorius dpr...@gmail.com wrote:
> > There is no way that paging through a large and volatile data set can ever return results that are 100% accurate.
> >
> > Let's say one wants to page through @aplusk's followers list. That's going to take between 3 and 5 minutes just to collect the follower ids with page (or the new cursors). It is likely that some of the follower ids that you have gone past and have already collected have unfollowed @aplusk while you are still collecting the rest.
> >
> > I assume that the Twitter system does paging by doing a standard SQL LIMIT clause. If you do LIMIT 100, 20 and some of the ids that you have already paged past have been deleted, the result set is going to shift to the left and you are going to miss the ones that were above 100 but have subsequently moved left to below 100.
> >
> > There really are only two solutions to this problem: a) we need to have the capability to reliably retrieve the entire result set in one API call, or b) everyone has to accept that the result set cannot be guaranteed to be 100% accurate.
> >
> > Dewald
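A minimal sketch of the batch-plus-timestamp idea; every field name here is invented purely to illustrate the shape, it is not a real or proposed Twitter response format:

    # A hypothetical batch response: the full id list plus the time it was snapshotted.
    snapshot = {
        "user_id": 30973,
        "as_of": "2009-09-07T05:33:00Z",   # "this is what the graph looked like at time x"
        "follower_ids": [12, 13, 5402, 918273],
    }

With an explicit as_of field, a client knows the list is internally consistent as of that moment, which sidesteps the whole question of what changed while pages were being fetched.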
[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results
As far as retrieving the large graphs from a DB goes, flat files are one way - another is to just store the full graph (of ids) in a single column in the database and parse it on retrieval. This is what FriendFeed is doing currently, or so they've said. Dewald and I are both talking about this because we're also having to duplicate this on our own servers, so we too have to deal with the pains of the social graph. (And oh, the pain it is!)

On Sun, Sep 6, 2009 at 8:44 PM, Dewald Pretorius dpr...@gmail.com wrote:
> If I worked for Twitter, here's what I would have done. I would have grabbed the follower id lists of the large accounts (those that usually kicked back 502s) and written them to flat files once every 5 or so minutes. When an API request comes in for that list, I'd just grab it from the flat file, instead of asking the DB to select 2+ million ids from amongst a few billion records while it's trying to do a few thousand other selects at the same time. That's one way of getting rid of 502s on large social graph lists. Okay, the data is going to be 5 minutes outdated. To that I say: so bloody what?
>
> Dewald
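A rough sketch of the single-column approach, assuming a comma-separated id string in one TEXT column; FriendFeed hasn't published their exact schema in this thread, so the details below are illustrative only (SQLite keeps the example self-contained):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE social_graph (user_id INTEGER PRIMARY KEY, follower_ids TEXT)")

    def save_graph(user_id, follower_ids):
        # Serialize the whole id list into one column: one row per user.
        packed = ",".join(str(i) for i in follower_ids)
        conn.execute("REPLACE INTO social_graph VALUES (?, ?)", (user_id, packed))

    def load_graph(user_id):
        row = conn.execute("SELECT follower_ids FROM social_graph WHERE user_id = ?",
                           (user_id,)).fetchone()
        # Parse on retrieval: one indexed lookup instead of selecting millions of rows.
        return [int(i) for i in row[0].split(",")] if row and row[0] else []

    save_graph(42, [101, 102, 103])
    print(load_graph(42))   # -> [101, 102, 103]

The trade-off is the same as with flat files: reads become a single cheap lookup, but every follow/unfollow means rewriting the whole blob, so it suits periodic refreshes better than per-event updates.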