Interesting. I have never had more than around 1 million rows in my
data sets, and since I see no difference in latency between offsets of
(1, 100) and (400, 100) at that size, I just assumed it would scale out
the same.
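
(For context, the two cases I compared are just MySQL's LIMIT offset,
count form on my own tables; the table and column names below are made
up for illustration.)

    -- roughly what I benchmarked on a table of about 1 million rows
    SELECT * FROM items ORDER BY id LIMIT 1, 100;    -- near the start
    SELECT * FROM items ORDER BY id LIMIT 400, 100;  -- a few pages in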

Good to know if I ever get the chance to build something that hits
billions of records. Twitter probably has to think about things that I
never will have to. I am not sure whether that is a good thing or a bad
thing, but it is interesting nonetheless. Thanks for the insight.

On Jun 30, 2009, at 4:32 PM, John Kalucki wrote:

I don't usually respond to non-Streaming API questions, but we just
spent a few months working on a large MySQL datastore at Twitter. I've
been over this ground extensively recently, so I'm unusually compelled
to respond.

MySQL performance does measurably decrease as the offset in an
OFFSET/LIMIT query grows, even if all rows are cached. Create a table
that comfortably fits in memory, say a few hundred million narrow rows,
and benchmark a result set of a few hundred thousand items chopped into
1k-row blocks. The query for the last block has many times the latency
of the first. Painful. To support deep pagination, you need a generally
unique, indexed cursor column so you can seek directly to the first row
of the next block, as in the sketch below.
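
Roughly the difference (the cursor column and values here are made up
for illustration, not our actual schema):

    -- offset pagination: MySQL still walks and discards the first
    -- 500,000 rows before it can return this block
    SELECT * FROM statuses ORDER BY id LIMIT 500000, 1000;

    -- cursor (seek) pagination: jump straight to the next block using
    -- the last id returned by the previous block
    SELECT * FROM statuses WHERE id > 500000123
    ORDER BY id LIMIT 1000;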

Now, create a table with 2 billion fairly wide rows, where only the
last few tens of millions of rows fit into memory. Queries deep into
the past will have neither index nor data cached, and performance will
be miserable. If you do a select * from statuses where userid = xxxxx
on a warm, idle status database, that query would be killed before it
got started, or it would take tens of minutes to hours to complete.
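
If you had to run something like that at all, it would at least need to
be bounded by an index and a limit. A sketch, assuming a composite
index on (userid, id) and made-up values:

    -- pulls one bounded block per round trip instead of scanning the
    -- user's entire history; xxxxx is a placeholder user id, and the
    -- id bound is the cursor carried over from the previous block
    SELECT * FROM statuses
    WHERE userid = xxxxx AND id < 500000123
    ORDER BY id DESC
    LIMIT 1000;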

And I'm sure there are other reasons why this feature isn't offered.

(Apologies to Doug and the API team.)

--
Scott * If you contact me off list replace talklists@ with scott@ *
