[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results

2009-09-10 Thread Waldron Faulkner

Hey developers, any hints/tips on how I can get the Twitter API team
to focus on this issue? It's hard to build a business on the Twitter
API when a crucial feature like this just stops working and we get
radio silence for days. Any tips on how I can help the team focus on
this??

On Sep 9, 10:10 am, alexc chy101...@gmail.com wrote:
 this issue still pops up:
 http://twitter.com/friends/ids/downingstreet.xml?page=3


[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results

2009-09-09 Thread alexc

this issue still pops up:
http://twitter.com/friends/ids/downingstreet.xml?page=3


[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results

2009-09-07 Thread freefall

Flat file generation and maintenance would be foolish at this stage.
Separating out the individual data sets purely for the API, to be served
by different clusters with server-side caching, may fit the bill - but
tbh if this isn't happening already I'll be shocked.

On Sep 7, 5:40 am, Jesse Stay jesses...@gmail.com wrote:
 As far as retrieving the large graphs from a DB, flat files are one way -
 another is to just store the full graph (of ids) in a single column in the
 database and parse on retrieval.  This is what FriendFeed is doing
 currently, so they've said.  Dewald and I are both talking about this
 because we're also having to duplicate this on our own servers, so we too
 have to deal with the pains of the social graph.  (and oh the pain it is!)



 On Sun, Sep 6, 2009 at 8:44 PM, Dewald Pretorius dpr...@gmail.com wrote:

  If I worked for Twitter, here's what I would have done.

  I would have grabbed the follower id list of the large accounts (those
  that usually kicked back 502s) and written them to flat files once
  every 5 or so minutes.

  When an API request comes in for that list, I'd just grab it from the
  flat file, instead of asking the DB to select 2+ million ids from
  amongst a few billion records, while it's trying to do a few thousand
  other selects at the same time.

  That's one way of getting rid of 502s on large social graph lists.

  Okay, the data is going to be 5 minutes out-dated. To that I say, so
  bloody what?

  Dewald


[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results

2009-09-07 Thread John Kalucki

This describes what I'd call row-based pagination. Cursor-based
pagination does not suffer from the same jitter issues. A cursor-based
approach returns an opaque value that is unique within the total set,
ordered, and indexed for constant-time access. Removals in pages before
or after do not affect the stability of next-page retrieval.

For a user with a large following, you'll never have a point-in-time
snapshot of their followings with any approach, but you can retrieve a
complete unique set of users that were followers throughout the
duration of the query. Additions made while the query is running may
or may not be returned, as chance allows.

A row-based approach with OFFSET and LIMIT is doomed for reasons
beyond correctness. The latency and CPU consumption, in MySQL at
least, tends to O(N^2). The first few blocks aren't bad. The last few
blocks for a 10M, or even 1M set are miserable.

The jitter demonstrated by the current API is due to a minor and
correctable design flaw in the allocation of the opaque cursor values.
A fix is scheduled.

-John Kalucki
http://twitter.com/jkalucki
Services, Twitter Inc.
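
To make the contrast concrete, here is a minimal sketch of the two
approaches described above. It is not Twitter's schema; the followings
table, its columns, and the use of an integer primary key as the opaque
cursor are all assumptions for illustration.

  # A minimal sketch, not Twitter's implementation. The followings table,
  # its columns, and the cursor encoding are assumptions for illustration.
  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE followings (cursor_id INTEGER PRIMARY KEY, "
               "user_id INTEGER, follower_id INTEGER)")

  def page_by_offset(conn, user_id, offset, limit=20):
      # Row-based pagination: a deletion before `offset` shifts later rows
      # toward the front, so follower ids get skipped; cost also grows with
      # the offset, which is the O(N^2) behavior noted above.
      return [r[0] for r in conn.execute(
          "SELECT follower_id FROM followings WHERE user_id = ? "
          "ORDER BY cursor_id LIMIT ? OFFSET ?",
          (user_id, limit, offset))]

  def page_by_cursor(conn, user_id, cursor=0, limit=20):
      # Cursor-based pagination: the opaque cursor is an indexed value unique
      # within the set, so removals elsewhere do not move the next page.
      rows = conn.execute(
          "SELECT cursor_id, follower_id FROM followings "
          "WHERE user_id = ? AND cursor_id > ? ORDER BY cursor_id LIMIT ?",
          (user_id, cursor, limit)).fetchall()
      next_cursor = rows[-1][0] if rows else None
      return [r[1] for r in rows], next_cursor

With the OFFSET form, a row removed before the offset shifts everything
after it forward, which is exactly the skipped-id jitter discussed in this
thread; the cursor form anchors on a value that does not move when other
rows disappear.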



On Sep 6, 7:27 pm, Dewald Pretorius dpr...@gmail.com wrote:
 There is no way that paging through a large and volatile data set can
 ever return results that are 100% accurate.

 Let's say one wants to page through @aplusk's followers list. That's
 going to take between 3 and 5 minutes just to collect the follower ids
 with page (or the new cursors).

 It is likely that some of the follower ids that you have gone past and
 have already collected, have unfollowed @aplusk while you are still
 collecting the rest. I assume that the Twitter system does paging by
 doing a standard SQL LIMIT clause. If you do LIMIT 100, 20 and
 some of the ids that you have already paged past have been deleted,
 the result set is going to shift to the left and you are going to
 miss the ones that were above 100 but have subsequently moved left
 to below 100.

 There really are only two solutions to this problem:

 a) we need to have the capability to reliably retrieve the entire
 result set in one API call, or

 b) everyone has to accept that the result set cannot be guaranteed to
 be 100% accurate.

 Dewald


[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results

2009-09-07 Thread Waldron Faulkner

I could really go for jittery right now... instead I'm getting
totally broken!

I'm getting two pages of results, using ?page=x, then empty. To me, it
looks like all my accounts have max 10K followers. I'd love some kind
of official response from Twitter on the status of paging (John?).

Example: user @starbucks has nearly 300K followers, however:
http://twitter.com/followers/ids.xml?id=30973&page=3
returns empty result.

- Waldron

On Sep 7, 10:24 pm, John Kalucki jkalu...@gmail.com wrote:
 This describes what I'd call row-based pagination. Cursor-based
 pagination does not suffer from the same jitter issues. A cursor-based
 approach returns an opaque value that is unique within the total set,
 ordered, and indexed for constant-time access. Removals in pages before
 or after do not affect the stability of next-page retrieval.

 For a user with a large following, you'll never have a point-in-time
 snapshot of their followings with any approach, but you can retrieve a
 complete unique set of users that were followers throughout the
 duration of the query. Additions made while the query is running may
 or may not be returned, as chance allows.

 A row-based approach with OFFSET and LIMIT is doomed for reasons
 beyond correctness. The latency and CPU consumption, in MySQL at
 least, tends to O(N^2). The first few blocks aren't bad. The last few
 blocks for a 10M, or even 1M set are miserable.

 The jitter demonstrated by the current API is due to a minor and
 correctable design flaw in the allocation of the opaque cursor values.
 A fix is scheduled.

 -John Kalucki
 http://twitter.com/jkalucki
 Services, Twitter Inc.

 On Sep 6, 7:27 pm, Dewald Pretorius dpr...@gmail.com wrote:

  There is no way that paging through a large and volatile data set can
  ever return results that are 100% accurate.

  Let's say one wants to page through @aplusk's followers list. That's
  going to take between 3 and 5 minutes just to collect the follower ids
  with page (or the new cursors).

  It is likely that some of the follower ids that you have gone past and
  have already collected, have unfollowed @aplusk while you are still
  collecting the rest. I assume that the Twitter system does paging by
  doing a standard SQL LIMIT clause. If you do LIMIT 100, 20 and
  some of the ids that you have already paged past have been deleted,
  the result set is going to shift to the left and you are going to
  miss the ones that were above 100 but have subsequently moved left
  to below 100.

  There really are only two solutions to this problem:

  a) we need to have the capability to reliably retrieve the entire
  result set in one API call, or

  b) everyone has to accept that the result set cannot be guaranteed to
  be 100% accurate.

  Dewald


[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results

2009-09-06 Thread Dewald Pretorius

I meant to type, LIMIT 100, 5000.


[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results

2009-09-06 Thread Jesse Stay
Agreed. Is there a chance Twitter can return the full results in compressed
(gzip or similar) format to reduce load, leaving the burden of decompressing
on our end and reducing bandwidth?  I'm sure there are other areas this
could apply as well.  I think you'll find compressing the full social graph
of a user significantly reduces the size of the data you have to pass
through the pipe - my tests have proved it to be a huge difference, and
you'll have to get way past the 10s of millions of ids before things slow
down at all after that.
Jesse
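
For what it's worth, gzip negotiation is ordinary HTTP: the client asks
via an Accept-Encoding header and decompresses whatever comes back. A
minimal sketch follows; whether the social-graph endpoints actually honor
the header is an assumption, and the URL is just the followers/ids call
mentioned elsewhere in this thread.

  # Sketch of requesting a gzip-compressed response and decompressing it
  # client-side. Whether this endpoint honors Accept-Encoding: gzip is an
  # assumption; the URL is the followers/ids call discussed in this thread.
  import gzip
  import urllib.request

  req = urllib.request.Request(
      "http://twitter.com/followers/ids.xml?id=30973",
      headers={"Accept-Encoding": "gzip"})
  with urllib.request.urlopen(req) as resp:
      body = resp.read()
      if resp.headers.get("Content-Encoding") == "gzip":
          body = gzip.decompress(body)   # the burden of decompressing is ours
  print(len(body), "bytes after decompression")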

On Sun, Sep 6, 2009 at 8:27 PM, Dewald Pretorius dpr...@gmail.com wrote:


 There is no way that paging through a large and volatile data set can
 ever return results that are 100% accurate.

 Let's say one wants to page through @aplusk's followers list. That's
 going to take between 3 and 5 minutes just to collect the follower ids
 with page (or the new cursors).

 It is likely that some of the follower ids that you have gone past and
 have already collected, have unfollowed @aplusk while you are still
 collecting the rest. I assume that the Twitter system does paging by
 doing a standard SQL LIMIT clause. If you do LIMIT 100, 20 and
 some of the ids that you have already paged past have been deleted,
 the result set is going to shift to the left and you are going to
 miss the ones that were above 100 but have subsequently moved left
 to below 100.

 There really are only two solutions to this problem:

 a) we need to have the capability to reliably retrieve the entire
 result set in one API call, or

 b) everyone has to accept that the result set cannot be guaranteed to
 be 100% accurate.

 Dewald



[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results

2009-09-06 Thread Dewald Pretorius

If I worked for Twitter, here's what I would have done.

I would have grabbed the follower id list of the large accounts (those
that usually kicked back 502s) and written them to flat files once
every 5 or so minutes.

When an API request comes in for that list, I'd just grab it from the
flat file, instead of asking the DB to select 2+ million ids from
amongst a few billion records, while it's trying to do a few thousand
other selects at the same time.

That's one way of getting rid of 502s on large social graph lists.

Okay, the data is going to be 5 minutes out-dated. To that I say, so
bloody what?

Dewald
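
As an illustration only, the flat-file scheme described above might look
something like the following; fetch_follower_ids() and the cache path are
hypothetical placeholders, not anything Twitter has described.

  # Rough sketch of the flat-file idea: dump the follower id list for a
  # heavy account every few minutes and answer API reads from the file
  # instead of the database. The helper and path are hypothetical.
  import os

  CACHE = "/var/cache/follower_ids/{user_id}.txt"

  def refresh_snapshot(user_id, fetch_follower_ids):
      # Run from a cron-style job every ~5 minutes: one heavy DB read,
      # then an atomic swap so API readers never see a half-written file.
      ids = fetch_follower_ids(user_id)
      path = CACHE.format(user_id=user_id)
      tmp = path + ".tmp"
      with open(tmp, "w") as f:
          f.write("\n".join(str(i) for i in ids))
      os.replace(tmp, path)

  def read_snapshot(user_id):
      # Serve the API request straight from the file; the data is at most
      # a few minutes stale, which is the trade-off being accepted here.
      with open(CACHE.format(user_id=user_id)) as f:
          return [int(line) for line in f if line.strip()]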


[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results

2009-09-06 Thread Jesse Stay
The other solution would be to send it to us in batch results, attaching a
timestamp to the request telling us this is what the user's social graph
looked like at x time.  I personally would start with the compressed format
though, as that makes it all possible to retrieve in a single request.
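
If Twitter ever did ship such a batch, the format would presumably just
label the snapshot time. A purely hypothetical shape, with invented field
names that are not part of any Twitter API:

  # Purely hypothetical envelope for a batched social-graph snapshot;
  # every field name here is invented for illustration.
  snapshot = {
      "as_of": "2009-09-06T22:39:00Z",   # what the graph looked like at x time
      "user_id": 30973,
      "follower_ids": [12, 34, 56],      # full list delivered in one batch
  }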

On Sun, Sep 6, 2009 at 10:33 PM, Jesse Stay jesses...@gmail.com wrote:

 Agreed. Is there a chance Twitter can return the full results in compressed
 (gzip or similar) format to reduce load, leaving the burden of decompressing
 on our end and reducing bandwidth?  I'm sure there are other areas this
 could apply as well.  I think you'll find compressing the full social graph
 of a user significantly reduces the size of the data you have to pass
 through the pipe - my tests have proved it to be a huge difference, and
 you'll have to get way past the 10s of millions of ids before things slow
 down at all after that.
 Jesse


 On Sun, Sep 6, 2009 at 8:27 PM, Dewald Pretorius dpr...@gmail.com wrote:


 There is no way that paging through a large and volatile data set can
 ever return results that are 100% accurate.

 Let's say one wants to page through @aplusk's followers list. That's
 going to take between 3 and 5 minutes just to collect the follower ids
 with page (or the new cursors).

 It is likely that some of the follower ids that you have gone past and
 have already collected, have unfollowed @aplusk while you are still
 collecting the rest. I assume that the Twitter system does paging by
 doing a standard SQL LIMIT clause. If you do LIMIT 100, 20 and
 some of the ids that you have already paged past have been deleted,
 the result set is going to shift to the left and you are going to
 miss the ones that were above 100 but have subsequently moved left
 to below 100.

 There really are only two solutions to this problem:

 a) we need to have the capability to reliably retrieve the entire
 result set in one API call, or

 b) everyone has to accept that the result set cannot be guaranteed to
 be 100% accurate.

 Dewald





[twitter-dev] Re: Paging (or cursoring) will always return unreliable (or jittery) results

2009-09-06 Thread Jesse Stay
As far as retrieving the large graphs from a DB, flat files are one way -
another is to just store the full graph (of ids) in a single column in the
database and parse on retrieval.  This is what FriendFeed is doing
currently, so they've said.  Dewald and I are both talking about this
because we're also having to duplicate this on our own servers, so we too
have to deal with the pains of the social graph.  (and oh the pain it is!)
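
A minimal sketch of that single-column layout, assuming a simple
comma-separated serialization (the schema here is invented; it is not
FriendFeed's or Twitter's actual design):

  # Store the whole id list in one column and parse it on retrieval.
  # The table name, column names, and serialization are assumptions.
  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE social_graph (user_id INTEGER PRIMARY KEY, "
               "follower_ids TEXT)")

  def save_graph(conn, user_id, follower_ids):
      conn.execute("REPLACE INTO social_graph VALUES (?, ?)",
                   (user_id, ",".join(str(i) for i in follower_ids)))

  def load_graph(conn, user_id):
      # One indexed row read, then parse the blob back into ids.
      row = conn.execute("SELECT follower_ids FROM social_graph "
                         "WHERE user_id = ?", (user_id,)).fetchone()
      return [int(i) for i in row[0].split(",")] if row and row[0] else []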

On Sun, Sep 6, 2009 at 8:44 PM, Dewald Pretorius dpr...@gmail.com wrote:


 If I worked for Twitter, here's what I would have done.

 I would have grabbed the follower id list of the large accounts (those
 that usually kicked back 502s) and written them to flat files once
 every 5 or so minutes.

 When an API request comes in for that list, I'd just grab it from the
 flat file, instead of asking the DB to select 2+ million ids from
 amongst a few billion records, while it's trying to do a few thousand
 other selects at the same time.

 That's one way of getting rid of 502s on large social graph lists.

 Okay, the data is going to be 5 minutes out-dated. To that I say, so
 bloody what?

 Dewald