This describes what I'd call row-based pagination. Cursor-based pagination does not suffer from the same jitter issues. A cursor-based approach returns an opaque value that is unique within the total set, ordered, and indexed for constant-time access. Removals in pages before or after do not affect the stability of next-page retrieval.
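To make that concrete, here is a minimal sketch of keyset-style cursor pagination against a relational store. The followers table, column names, and SQLite usage are illustrative assumptions, not Twitter's actual schema or implementation:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE followers (user_id INTEGER, follower_id INTEGER,"
                 " PRIMARY KEY (user_id, follower_id))")
    conn.executemany("INSERT INTO followers VALUES (1, ?)",
                     [(i,) for i in range(1, 101)])

    def fetch_page(conn, user_id, cursor=0, limit=20):
        # The "cursor" is just the last follower_id returned: unique within
        # the set, ordered, and reachable by an index seek, so each page
        # costs roughly the same no matter how deep into the set it is.
        rows = conn.execute(
            "SELECT follower_id FROM followers"
            " WHERE user_id = ? AND follower_id > ?"
            " ORDER BY follower_id LIMIT ?",
            (user_id, cursor, limit)).fetchall()
        ids = [r[0] for r in rows]
        return ids, (ids[-1] if ids else None)  # None: no further pages

    page1, cur = fetch_page(conn, 1)               # ids 1..20, cur == 20
    conn.execute("DELETE FROM followers WHERE follower_id <= 10")
    page2, cur = fetch_page(conn, 1, cursor=cur)   # still ids 21..40

Because the next page is defined by a value rather than a position, deleting rows at or before the cursor changes nothing about what the next call returns.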
For a user with a large following, you'll never have a point-in-time snapshot of their followers with any approach, but you can retrieve a complete unique set of users that were followers throughout the duration of the query. Additions made while the query is running may or may not be returned, as chance allows.

A row-based approach with OFFSET and LIMIT is doomed for reasons beyond correctness. The latency and CPU consumption, in MySQL at least, tend toward O(N^2). The first few blocks aren't bad. The last few blocks for a 10M, or even 1M, set are miserable (a rough sketch of that degradation follows below the quoted message).

The jitter demonstrated by the current API is due to a minor and correctable design flaw in the allocation of the opaque cursor values. A fix is scheduled.

-John Kalucki
http://twitter.com/jkalucki
Services, Twitter Inc.

On Sep 6, 7:27 pm, Dewald Pretorius <[email protected]> wrote:
> There is no way that paging through a large and volatile data set can
> ever return results that are 100% accurate.
>
> Let's say one wants to page through @aplusk's followers list. That's
> going to take between 3 and 5 minutes just to collect the follower ids
> with &page (or the new cursors).
>
> It is likely that some of the follower ids that you have gone past and
> have already collected, have unfollowed @aplusk while you are still
> collecting the rest. I assume that the Twitter system does paging by
> doing a standard SQL LIMIT clause. If you do LIMIT 1000000, 20 and
> some of the ids that you have already paged past have been deleted,
> the result set is going to "shift to the left" and you are going to
> miss the ones that were above 1000000 but have subsequently moved left
> to below 1000000.
>
> There really are only two solutions to this problem:
>
> a) we need to have the capability to reliably retrieve the entire
> result set in one API call, or
>
> b) everyone has to accept that the result set cannot be guaranteed to
> be 100% accurate.
>
> Dewald
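For contrast, a sketch of the row-based LIMIT/OFFSET approach discussed above, using the same hypothetical followers table as the earlier example; this illustrates the failure mode, not Twitter's actual queries:

    def fetch_page_by_offset(conn, user_id, page=0, limit=20):
        # The engine has to walk and discard page * limit rows before it can
        # return anything, so each successive page is slower than the last;
        # summed over all N / limit pages, that is on the order of N^2 rows
        # touched. If rows before the offset are deleted between calls,
        # later rows slide into earlier positions and get skipped entirely
        # (the "shift to the left" in the quoted message).
        return [r[0] for r in conn.execute(
            "SELECT follower_id FROM followers"
            " WHERE user_id = ? ORDER BY follower_id"
            " LIMIT ? OFFSET ?",
            (user_id, limit, page * limit)).fetchall()]

After the deletion in the earlier sketch, fetching page 1 this way returns ids 31..50 and silently skips 21..30, which is exactly the inaccuracy Dewald describes.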
