This describes what I'd call row-based pagination. Cursor-based
pagination does not suffer from the same jitter issues. A cursor-based
approach returns an opaque value that is unique within the total set,
ordered, and indexed for constant-time access. Removals in pages
before or after the cursor do not affect the stability of retrieving
the next page.
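
As a rough sketch (hypothetical table and column names, not our
actual schema), keyset pagination in SQL looks like this:

    -- A minimal sketch of cursor (keyset) pagination. The cursor is
    -- the last follower_id returned on the previous page; an index
    -- on (user_id, follower_id) lets the database seek directly to
    -- that point instead of scanning everything before it.
    SELECT follower_id
      FROM followers
     WHERE user_id = 12345
       AND follower_id > :cursor  -- opaque value from the prior page
     ORDER BY follower_id
     LIMIT 20;

Because the WHERE clause resumes at an absolute position rather than
a row count, deletions before the cursor cannot shift the next page.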

For a user with a large following, you'll never have a point-in-time
snapshot of their followers with any approach, but you can retrieve
the complete, duplicate-free set of users who were followers for the
entire duration of the query. Additions made while the query is
running may or may not be returned, as chance allows.

A row-based approach with OFFSET and LIMIT is doomed for reasons
beyond correctness. The latency and CPU consumption, in MySQL at
least, tend toward O(N^2) for a full traversal, because each page
must scan and discard every row before the requested offset. The
first few blocks aren't bad. The last few blocks of a 10M, or even a
1M, set are miserable.
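
For comparison (again with hypothetical names), the row-based form
makes the database walk and throw away all the skipped rows:

    -- OFFSET/LIMIT: MySQL reads and discards the first 1,000,000
    -- rows before returning 20, so page cost grows with the offset.
    -- Walking all N rows in pages of k touches roughly N^2 / 2k
    -- rows in total.
    SELECT follower_id
      FROM followers
     WHERE user_id = 12345
     ORDER BY follower_id
     LIMIT 20 OFFSET 1000000;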

The jitter demonstrated by the current API is due to a minor and
correctable design flaw in the allocation of the opaque cursor values.
A fix is scheduled.

-John Kalucki
http://twitter.com/jkalucki
Services, Twitter Inc.



On Sep 6, 7:27 pm, Dewald Pretorius <[email protected]> wrote:
> There is no way that paging through a large and volatile data set can
> ever return results that are 100% accurate.
>
> Let's say one wants to page through @aplusk's followers list. That's
> going to take between 3 and 5 minutes just to collect the follower ids
> with &page (or the new cursors).
>
> It is likely that some of the follower ids that you have gone past and
> have already collected have unfollowed @aplusk while you are still
> collecting the rest. I assume that the Twitter system does paging by
> doing a standard SQL LIMIT clause. If you do LIMIT 1000000, 20 and
> some of the ids that you have already paged past have been deleted,
> the result set is going to "shift to the left" and you are going to
> miss the ones that were above 1000000 but have subsequently moved left
> to below 1000000.
>
> There really are only two solutions to this problem:
>
> a) we need to have the capability to reliably retrieve the entire
> result set in one API call, or
>
> b) everyone has to accept that the result set cannot be guaranteed to
> be 100% accurate.
>
> Dewald
