Not necessarily. See this document (which I've posted earlier on this list) for details:

http://code.google.com/p/pubsubhubbub/wiki/PublisherEfficiency

In essence, with PSHB (PubSubHubbub), Twitter would only have to retrieve the latest data and add it, in a static RSS format, to flat files on the server or to a single column in a database somewhere. Then, using a combination of persistent connections, HTTP pipelining, and multiple cached, linked Atom feeds, it could return those feeds to either a hub or the user. Atom feeds can be linked, so Twitter doesn't need to return the entire dataset in each feed, just the latest data, linked to older data on the server (if I understand Atom correctly - someone correct me if I'm wrong).
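
To make that a bit more concrete, here is a rough Python sketch of what I have in mind. It is only an illustration, not anything Twitter or the PSHB project ships: the function names, the example.com URLs and the shape of the `entries` dicts are all made up. It builds a small Atom feed holding just the newest entries, with a rel="hub" link so subscribers can discover the hub and a rel="next" link pointing at an older, separately cached page, and it builds the form body a publisher would POST to the hub to say the feed has changed (hub.mode=publish plus hub.url, as I read the PSHB spec).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_latest_feed(feed_url, hub_url, older_page_url, entries, updated):
    """Return an Atom document (as a string) holding only the newest entries.

    All arguments are hypothetical: `entries` is a list of dicts with
    'id', 'title' and 'updated' keys; `updated` is an RFC 3339 timestamp.
    """
    ET.register_namespace("", ATOM_NS)
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    ET.SubElement(feed, f"{{{ATOM_NS}}}id").text = feed_url
    ET.SubElement(feed, f"{{{ATOM_NS}}}title").text = "Latest updates"
    ET.SubElement(feed, f"{{{ATOM_NS}}}updated").text = updated

    # 'self' identifies this feed, 'hub' advertises the hub, and 'next'
    # points at the older (static, cacheable) page of entries.
    for rel, href in (("self", feed_url), ("hub", hub_url), ("next", older_page_url)):
        ET.SubElement(feed, f"{{{ATOM_NS}}}link", {"rel": rel, "href": href})

    for item in entries:
        entry = ET.SubElement(feed, f"{{{ATOM_NS}}}entry")
        ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = item["id"]
        ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = item["title"]
        ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = item["updated"]

    return ET.tostring(feed, encoding="unicode")


def build_publish_ping(feed_url):
    """Return the form-encoded body for a 'new content' ping to the hub."""
    return urlencode({"hub.mode": "publish", "hub.url": feed_url})
```

The publisher would POST the ping body to the hub with Content-Type application/x-www-form-urlencoded; the hub then fetches the (small) feed once and fans it out, so subscribers never hit the publisher directly.
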
So in essence Twitter only needs to retrieve the latest (cached) data and return it to the user or hub, and it can do so over a persistent connection, multiple HTTP requests at a time.

And of course this doesn't take into account the biggest advantage of PSHB - the hub. PSHB is built to be distributed. I know Twitter doesn't want to go there, but if they wanted to they could allow other authorized hubs to distribute the load of such data. Only the hubs would fetch data from Twitter, significantly reducing the load on Twitter regardless of the size of the request, and ensuring that a) users own their data in a publicly owned format, and b) if Twitter ever goes down the content is still available via the API. IMO this is the only way Twitter will become a "utility" as Jack Dorsey wants it to be.

I would love to see Twitter adopt a more publicly accepted standard like this. Or, if it's not meeting their needs, they could either create their own public standard and take the lead in open real-time stream standards, or join an existing one so the standard can be refined to the point where a company like Twitter can handle it. I know it would make my coding much easier; as more companies begin to adopt these protocols, I'm stuck having to write the code for each one. Leaving the data retrieval in a closed, proprietary format benefits nobody.

Jesse

On Mon, Sep 7, 2009 at 7:52 AM, Dewald Pretorius <[email protected]> wrote:
>
> SUP will not work for Twitter or any other service that deals with
> very large data sets.
>
> In essence, a Twitter SUP feed would be one JSON array of all the
> Twitter users who have posted a status update in the past 60 seconds.
>
> a) The SUP feed will consistently contain a few million array entries.
>
> b) To build that feed you must do a select against the tweets table,
> which contains a few billion records, and extract all the user ids
> with a tweet that has a published time greater than now() - 60. Good
> luck asking any DB to do that kind of select once every 60 seconds.
>
> Dewald
>
