2010

Dewald Pretorius Tue, 05 Jan 2010 10:00:32 -0800

The Platform team can create a static parameter that governs the
maximum size of the initial "cursor". The API software will then use
this parameter value to build the list of ids returned in the first
API call.


The team can either modify this static parameter manually, or you can
create a process that monitors the HTTP response metrics and varies
the parameter downwards when HTTP errors go up, and upwards when HTTP
errors go down.
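
A minimal sketch of such a monitoring process, in Python, could look like the following. Everything here is an assumption for illustration: the function name, the error-rate thresholds, the step size, and the 5,000 to 200,000 bounds are hypothetical values, not anything the Platform team has published.

```python
def adjust_cap(current_cap, error_rate, *, low=0.01, high=0.05,
               step=5000, min_cap=5000, max_cap=200000):
    """Return a new cap on the number of ids returned in the first call.

    error_rate is the fraction of recent HTTP responses that were errors.
    All thresholds and bounds are hypothetical defaults.
    """
    if error_rate > high:
        # Errors are going up: shrink the initial response.
        return max(min_cap, current_cap - step)
    if error_rate < low:
        # Errors are low: allow a larger initial response.
        return min(max_cap, current_cap + step)
    # Inside the dead band: leave the parameter alone to avoid flapping.
    return current_cap
```

The dead band between the two thresholds is a design choice to keep the parameter from oscillating on every measurement.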

Your core processes don't need to change. The API layer can,
transparently to the external developer, make the necessary number of
5,000-id cursor calls to build the initial response set for the
first API call from the external system. This assumes that those
5,000-id calls are much faster when kept internal to your own
infrastructure. Otherwise, the core processes will need modification
to allow your API layer to request larger cursors.
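
As a rough sketch of that aggregation, again in Python: the names build_first_response, fetch_block and make_fake_store are hypothetical, and the fake store uses a plain list offset as its cursor purely for demonstration (real cursors should be treated as opaque values). It assumes only the convention discussed in this thread, that cursor=-1 starts a walk and next_cursor=0 ends it.

```python
BLOCK_SIZE = 5000  # size of one internal cursor call, per this thread


def build_first_response(fetch_block, max_ids):
    """Aggregate internal cursor calls into one external response.

    fetch_block(cursor) must return (ids, next_cursor), where a
    next_cursor of 0 means the set is exhausted. max_ids is the static
    parameter; at least one block is always fetched.
    """
    ids = []
    cursor = -1
    while True:
        block, cursor = fetch_block(cursor)
        ids.extend(block)
        # Stop at the end of the set, or when one more block
        # would push the response past the configured maximum.
        if cursor == 0 or len(ids) + BLOCK_SIZE > max_ids:
            break
    return {"ids": ids, "next_cursor": cursor}


def make_fake_store(all_ids, block_size=BLOCK_SIZE):
    """Stand-in for the core process (demo only; cursor = list offset)."""
    def fetch_block(cursor):
        start = 0 if cursor == -1 else cursor
        block = all_ids[start:start + block_size]
        end = start + len(block)
        next_cursor = end if end < len(all_ids) else 0
        return block, next_cursor
    return fetch_block
```

An external client would then keep calling with the returned next_cursor until it comes back as 0, exactly as with today's cursored calls.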

On Jan 5, 1:38 pm, John Kalucki <[email protected]> wrote:
> That sounds like a good overall technique. It's very best-effort. I'm
> concerned about implementation details though. The webserver may
> defensively time out the connection a lot, and tight coordination
> between container and process is difficult to manage in our stack. And
> by difficult, I mean intractably difficult. So, you might see a
> premature 503 if the processing is too aggressive and, on average,
> receive even less data. It might be best to artificially limit the
> processing before the safety kicks in.
>
> In the end, this effort would be more usefully applied towards
> streaming social graph deltas.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Services, Twitter Inc.
>
> On Tue, Jan 5, 2010 at 7:31 AM, Dewald Pretorius <[email protected]> wrote:
> > John,
>
> > To try and make it as transparent and seamless for external developers
> > as possible, I propose the following solution.
>
> > Change the API layer so that it returns as many ids as possible in the
> > first API call, regardless of whether cursor=-1 is present or omitted.
> > If your system is able to return the entire social graph, then simply
> > set next_cursor to 0. If it could return only a subset of ids, then
> > set next_cursor to its next value.
>
> > The benefits are:
>
> > a) You can dynamically manage the number of ids returned in the first
> > call, in accordance with system load at the time of the call.
>
> > b) We, as developers, only need to check for next_cursor > 0 to know
> > whether we got the full set or whether we need to make subsequent
> > cursored calls.
>
> > On Jan 5, 10:34 am, John Kalucki <[email protected]> wrote:
> >> Jesse,
>
> >> My surprise shouldn't be a surprise. I'm sure the platform team is
> >> well aware of the issues.
>
> >> The fact that it works at 200k users could very well be inherently
> >> unstable. Minor changes to the system elsewhere could cause this
> >> number to drop without anyone knowing. We don't monitor this "breaks
> >> at" threshold in production, and we certainly don't manage the cluster
> >> to preserve such a threshold. I'd doubt that this is testable in
> >> development. In practice, should we support this, it could be
> >> difficult to guarantee such a high threshold as various systems
> >> approach their capacity limits. The most reliable approach is to make
> >> all calls approximately the same "cost" and manage the system to
> >> provide smooth delivery at that cost per request.
>
> >> -John Kalucki
> >> http://twitter.com/jkalucki
> >> Services, Twitter Inc.
>
> >> On Tue, Jan 5, 2010 at 1:09 AM, Jesse Stay <[email protected]> wrote:
> >> > If I can suggest you keep it backwards-compatible that would make much
> >> > more sense.  I think we're all aware that over 200,000 or so followers
> >> > it breaks.  So what if you kept the cursor-less nature, treat it like a
> >> > cursor, but set the returned cursor cap to be 200,000 per cursor?  Or if
> >> > it needs to be smaller (again, I think it would be much less bandwidth
> >> > and process-time to just keep it a high, sustainable number rather than
> >> > having to traverse multiple times to get that), maybe just return only
> >> > the last 200,000 if no cursor is specified?  This way those that aren't
> >> > aware of the change aren't affected, new methods can be put into place,
> >> > documentation can be updated to reflect the deprecated methods, and
> >> > everyone's happy.
>
> >> > I'm a little surprised at the surprise by the Twitter team here. If you
> >> > guys need an account on one of my servers to test this stuff I'm happy
> >> > to provide. :-)  Hopefully you guys can trust us as much as we trust
> >> > you.  I'm always happy to provide examples and help though.  I recognize
> >> > you guys are all working your tails off there. (I say this as I wear my
> >> > "wearing my Twitter shirt" proudly)
> >> > Jesse
>
> >> > On Tue, Jan 5, 2010 at 1:35 AM, John Kalucki <[email protected]> wrote:
>
> >> >> And so it is. Given the system implementation, I'm quite surprised
> >> >> that the cursorless call returns results with acceptable reliability,
> >> >> especially during peak system load. The documentation attempts to
> >> >> convey that the cursorless approach is risky. "all IDs are attempted
> >> >> to be returned, but large sets of IDs will likely fail with timeout
> >> >> errors."   When documentation says "attempted" and "fail with timeout
> >> >> errors", it doesn't take too much reading between the lines to infer
> >> >> that this is a best effort call. Building upon a risky dependency has,
> >> >> well, risks. (The passive voice, on the other hand, is a lowly crime.)
>
> >> >> I also agree that the cursored approach as currently implemented is
> >> >> quite problematic. To increase throughput, I'd support increasing the
> >> >> block size somewhat, but the boundless behavior of the cursorless
> >> >> unauthenticated call just has to go. The combination of these changes
> >> >> should reduce both query and memory pressure on the front end, which,
> >> >> in theory, if not in practice, should lead to a better overall
> >> >> experience. I'd imagine that there are complications, and numbers to
> >> >> be run, and trade-offs to be made.
>
> >> >> Trust that the platform people are trading-off many competing
> >> >> interests and that there isn't a single capricious bone in their
> >> >> collective body.
>
> >> >> -John Kalucki
> >> >> http://twitter.com/jkalucki
> >> >> Services, Twitter Inc.
>
> >> >> On Mon, Jan 4, 2010 at 10:40 PM, PJB <[email protected]> wrote:
>
> >> >> > As noted in this thread, the fact that cursor-less methods for
> >> >> > friends/followers ids will be deprecated was newly announced on
> >> >> > December 22.
>
> >> >> > In fact, the API documentation still clearly indicates that cursors
> >> >> > are optional, and that their absence will return a complete social
> >> >> > graph.  E.g.:
>
> >> >> > http://apiwiki.twitter.com/Twitter-REST-API-Method%3A-followers%C2%A0ids
>
> >> >> > ("If the cursor parameter is not provided, all IDs are attempted to be
> >> >> > returned")
>
> >> >> > The example at the bottom of that page gives a good example of
> >> >> > retrieving 300,000+ ids in several seconds:
>
> >> >> > http://twitter.com/followers/ids.xml?screen_name=dougw
>
> >> >> > Of course, retrieving 20-40k users is significantly faster.
>
> >> >> > Again, many of us have built apps around cursor-less API calls.  To
> >> >> > now deprecate them, with just a few days' warning over the holidays, is
> >> >> > clearly inappropriate and uncalled for.  Similarly, to announce that
> >> >> > we must now expect 5x slowness when doing the same calls, when these
> >> >> > existing methods work well, is shocking.
>
> >> >> > Many developers live and die by the API documentation.  It's a really
> >> >> > fouled-up situation when the API documentation is so totally wrong,
> >> >> > right?
>
> >> >> > I urge those folks addressing this issue to preserve the cursor-less
> >> >> > methods.  Barring that, I urge them to return at least 25,000 ids per
> >> >> > cursor (as you note, time progression has made 5000 per call
> >> >> > antiquated and ineffective for today's Twitter user) and grant at
> >> >> > least 3 months before deprecation.
>
> >> >> > On Jan 4, 10:23 pm, John Kalucki <[email protected]> wrote:
> >> >> >> The "existing" APIs stopped providing accurate data about a year ago
> >> >> >> and degraded substantially over a period of just a few months. Now
> >> >> >> the only data store for social graph data requires cursors to access
> >> >> >> complete sets. Pagination is just not possible with the same latency
> >> >> >> at this scale without an order of magnitude or two increase in cost.
> >> >> >> So, instead of hardware "units" in the tens and hundreds, think about
> >> >> >> the same in the thousands and tens of thousands.
>
> >> >> >> These APIs and their now decommissioned backing stores were developed
> >> >> >> when having 20,000 followers was a lot. We're an order of magnitude
> >> >> >> or two beyond that point along nearly every dimension. Accounts.
> >> >> >> Followers per account. Tweets per second. Etc. As systems evolve,
> >> >> >> some evolutionary paths become extinct.
>
> >> >> >> Given boundless resources, the best we could do for a REST API, as
> >> >> >> Marcel has alluded, is to do the cursoring for you and aggregate many
> >> >> >> blocks into much larger responses. This wouldn't work very well for
> >> >> >> at least two immediate reasons: 1) Running a system with multimodal
> >> >> >> service times is a nightmare -- we'd have to provision a specific
> >> >> >> endpoint for such a resource. 2) Ruby GC chokes on lots of objects.
> >> >> >> We'd have to consider implementing this resource in another stack, or
> >> >> >> do a lot of tuning. All this to build the opposite of what most
> >> >> >> applications want: a real-time stream of graph deltas for a set of
> >> >> >> accounts, or the list of recent set operations since the last poll --
> >> >> >> and rarely, if ever, the entire following set.
>
> >> >> >> Also, I'm a little rusty on the details on the social graph api, but
> >> >> >> please detail which public resources allow retrieval of 40,000
> >> >> >> followers in two seconds. I'd be very interested in looking at the
> >> >> >> implementing code on our end. A curl timing would be nice (time curl
> >> >> >> URL > /dev/null) too.
>
> >> >> >> -John Kalucki
> >> >> >> http://twitter.com/jkalucki
> >> >> >> Services, Twitter Inc.
>
> >> >> >> On Mon, Jan 4, 2010 at 9:18 PM, PJB <[email protected]> wrote:
>
> >> >> >> > On Jan 4, 8:58 pm, John Kalucki <[email protected]> wrote:
> >> >> >> >> at the moment). So, it seems that we're returning the data over
> >> >> >> >> home DSL at between 2,500 and 4,000 ids per second, which seems
> >> >> >> >> like a perfectly reasonable rate and variance.
>
> >> >> >> > It's certainly not reasonable to expect it to take 10+ seconds to
> >> >> >> > get 25,000 to 40,000 ids, PARTICULARLY when existing methods, for
> >> >> >> > whatever reason, return the same data in less than 2 seconds.
> >> >> >> > Twitter is being incredibly short-sighted if they think this is
> >> >> >> > indeed reasonable.
>

[twitter-dev] Re: Social Graph API: Legacy data format will be eliminated 1/11/2010
