I don't usually respond to non-Streaming API questions, but we just
spent a few months working on a large MySQL datastore at Twitter. I've
been over this ground extensively recently, so I'm unusually compelled
to respond.

MySQL performance does measurably decrease as the OFFSET in a LIMIT
clause grows, even if all rows are cached. Create a table that
comfortably fits in memory, say a few hundred million narrow rows, and
benchmark a result set of a few hundred thousand items chopped into
1k-row blocks. The query latency on the last block is many times
larger than on the first block. Painful. To support deep pagination,
you need a generally unique, indexed cursor column, so you can seek
directly to the first row of the next block.
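
For illustration, a minimal sketch of the two approaches. The table
and column names (statuses, user_id, id) are invented for the example,
not Twitter's actual schema:

  -- Offset pagination: the server walks and discards the first
  -- 500,000 matching rows before returning anything, so latency
  -- grows with the depth of the page.
  SELECT id, text
  FROM statuses
  WHERE user_id = 12345
  ORDER BY id
  LIMIT 500000, 1000;

  -- Cursor (keyset) pagination: the index seeks directly to the
  -- first row of the next block, so latency stays flat at any depth.
  -- :last_seen_id is a placeholder for the id of the last row of the
  -- previous block.
  SELECT id, text
  FROM statuses
  WHERE user_id = 12345
    AND id > :last_seen_id
  ORDER BY id
  LIMIT 1000;

For the second query to be a pure index seek, you'd also want a
composite index on (user_id, id).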

Now, create a table with 2 billion fairly wide rows, where only the
last few tens of millions of rows fit into memory. Queries deep into
the past will have neither index nor data cached, and performance will
be miserable. If you ran a select * from statuses where userid = xxxxx
on a warm, idle status database, that query would be mkilled
(automatically killed for running too long) before it got started, or
it would take tens of minutes to hours to complete.
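
The same cursor column is what makes deep history tolerable on a
mostly-cold table: it bounds each request to one short index range
scan instead of a full sweep. A sketch, with the same invented names
as above:

  -- Unbounded: touches every row the user has ever written, most of
  -- it uncached. This is the query that gets killed or runs for hours.
  SELECT * FROM statuses WHERE user_id = 12345;

  -- Bounded: each request reads only one block's worth of cold index
  -- and data pages, so the cost is spread over many small queries.
  SELECT id, text
  FROM statuses
  WHERE user_id = 12345
    AND id > :last_seen_id
  ORDER BY id
  LIMIT 1000;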

And I'm sure there are other reasons why this feature isn't offered.

(Apologies to Doug and the API team.)

-John Kalucki
Services, Twitter Inc.



On Jun 30, 4:15 pm, Scott Haneda <talkli...@newgeo.com> wrote:
> Been pondering this today. There seem to be 7-day limits, or limits
> of around 3,000 tweets, on the API. At first, my gut told me that
> was for load reasons, and it made sense.
>
> I started thinking about paging results in development projects I have  
> worked on.
>
> Looking at this from a database perspective:
>
> SELECT foo, bar FROM something WHERE name = 'test' ORDER BY id
> LIMIT 0, 200;
> Start at the beginning, return 200 rows; may take x seconds.
>
> Next page:
> SELECT foo, bar FROM something WHERE name = 'test' ORDER BY id
> LIMIT 200, 200;
> Skip 200 rows, return the next 200; may also take x seconds.
>
> Arbitrary page:
> SELECT foo, bar FROM something WHERE name = 'test' ORDER BY id
> LIMIT 5000, 200;
> Skip 5000 rows, return the next 200; will also take x seconds.
>
> In each case, x, the time taken, does not change. Now, this assumes
> all the data is in a single database, or is normalized in a way that
> facilitates this.
>
> This question is just one of curiosity. I am betting the tweets
> table has been distributed across many tables, and there is no
> simple way to get at the paged results shown above?
>
> If it is not, I am not seeing any performance hit in getting the
> first 100 records versus a subset that starts 20,000 tweets into the
> record set.
> --
> Scott * If you contact me off list replace talklists@ with scott@ *
