Not necessarily.  See this document (which I've posted earlier on this list)
for details: http://code.google.com/p/pubsubhubbub/wiki/PublisherEfficiency
In essence, with PSHB (PubSubHubbub), Twitter would only have to retrieve
the latest data and write it, in a static feed format such as RSS, to flat
files on the server or to a single column in a database somewhere.  Then,
using a combination of persistent connections, HTTP pipelining, and multiple
cached, linked Atom feeds, it could return those feeds to either a hub or the
user.  Atom feeds can be linked, so Twitter doesn't need to return the entire
dataset in each feed, just the latest data, linked to older data on the
server (if I understand Atom correctly; someone correct me if I'm wrong).
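
To make the linked-feed idea concrete, here is a rough Python sketch of a
feed that carries only the newest entries and points at older data with a
link.  Everything named here is my own assumption for illustration: the
function name, the example URLs, and the choice of rel="next" (the paging
relation from RFC 5005, as I understand it) are not anything Twitter
actually exposes.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_latest_feed(feed_url, hub_url, older_url, entries):
    """Build an Atom document that holds only the newest entries.

    `entries` is a non-empty list of dicts with "id", "title" and
    "updated" keys (hypothetical shape, chosen for this sketch).
    """
    ET.register_namespace("", ATOM_NS)  # serialize Atom as the default namespace
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    ET.SubElement(feed, f"{{{ATOM_NS}}}id").text = feed_url
    ET.SubElement(feed, f"{{{ATOM_NS}}}title").text = "Latest updates"
    # UTC ISO-8601 timestamps in the same format sort correctly as strings.
    newest = max(e["updated"] for e in entries)
    ET.SubElement(feed, f"{{{ATOM_NS}}}updated").text = newest

    # "self" names this document, "hub" advertises the PSHB hub, and
    # "next" links to older data instead of embedding it here.
    for rel, href in (("self", feed_url), ("hub", hub_url), ("next", older_url)):
        ET.SubElement(feed, f"{{{ATOM_NS}}}link", {"rel": rel, "href": href})

    for e in entries:
        entry = ET.SubElement(feed, f"{{{ATOM_NS}}}entry")
        ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = e["id"]
        ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = e["title"]
        ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = e["updated"]

    return ET.tostring(feed, encoding="unicode")
```

The point is that the document stays small no matter how much history
exists; a client that wants older entries follows the link.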

So in essence Twitter only needs to retrieve the latest (cached) data and
return it to the user or hub, and it can do so over a persistent connection,
multiple HTTP requests at a time.  And of course this doesn't take into
account the biggest advantage of PSHB: the hub.  PSHB is built to be
distributed.  I know Twitter doesn't want to go there, but if they wanted to
they could allow other authorized hubs to distribute the load of such data.
Only the hubs would fetch data from Twitter, significantly reducing the load
on Twitter regardless of the size of the request and ensuring that a) users
own their data in a publicly owned format, and b) if Twitter ever goes down
the content is still available via the API.  IMO this is the only way Twitter
will become a "utility" as Jack Dorsey wants it to be.
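
For what it's worth, the publisher's side of the hub relationship is tiny.
As I read the PubSubHubbub spec, the publisher tells the hub a topic has
changed with a form-encoded POST carrying hub.mode=publish and hub.url, and
the hub then fetches the feed and fans it out to subscribers.  Here is a
rough Python sketch; the function name and the URLs are placeholders I made
up, and the request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_publish_ping(hub_url, topic_url):
    """Build (but do not send) a 'new content' notification for a hub."""
    # Form-encoded body: tell the hub which topic (feed URL) has changed.
    body = urlencode({"hub.mode": "publish", "hub.url": topic_url}).encode("ascii")
    return Request(
        hub_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Hypothetical hub and feed URLs, for illustration only.
ping = build_publish_ping("http://hub.example.com/", "http://example.com/feed")
```

Sending it would be a single urllib.request.urlopen(ping) call.  After
that, the hub does the distribution work, so the publisher's cost per update
is one small request to each hub rather than one per subscriber.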

I would love to see Twitter adopt a more publicly accepted standard like
this.  Or, if it's not meeting their needs, they could either create their
own public standard and take the lead in open real-time stream standards, or
join an existing one so the standard can be refined to the point where a
company like Twitter can handle it.  I know it would make my coding much
easier as more companies adopt these protocols, instead of my being stuck
writing separate code for each one.

Leaving the data retrieval in a closed, proprietary format benefits nobody.

Jesse

On Mon, Sep 7, 2009 at 7:52 AM, Dewald Pretorius <dpr...@gmail.com> wrote:

>
> SUP will not work for Twitter or any other service that deals with
> very large data sets.
>
> In essence, a Twitter SUP feed would be one JSON array of all the
> Twitter users who have posted a status update in the past 60 seconds.
>
> a) The SUP feed will consistently contain a few million array entries.
>
> b) To build that feed you must do a select against the tweets table,
> which contains a few billion records, and extract all the user ids
> with a tweet that has a published time greater than now() - 60. Good
> luck asking any DB to do that kind of select once every 60 seconds.
>
> Dewald
>
