Interesting. I have never hit more than around 1 million rows in my data sets. I just assumed that since, at 1 million, I am not seeing a difference between (1, 100) and (400, 100) for the offsets, it would scale out the same.
Good to know if I ever get a chance to build something that will hit billions of records. Twitter probably has to think about things that I never will have to. I am not sure whether that is a good thing or a bad thing, but it is interesting nonetheless. Thanks for the insight.
On Jun 30, 2009, at 4:32 PM, John Kalucki wrote:
I don't usually respond to non-Streaming API questions, but we just spent a few months working on a large MySQL datastore at Twitter. I've been over this ground extensively recently, so I'm unusually compelled to respond.

MySQL performance does measurably decrease as you increase offset and limit, even if all rows are cached. Create a table that comfortably fits in memory, say a few hundred million narrow rows, and benchmark a result set of a few hundred thousand items chopped into 1k-row blocks. The last block's query latency is many times larger than the first block's. Painful. To support deep pagination, you need a generally unique, indexed cursor column so you can index directly to the first row of the next block.

Now create a table with 2 billion fairly wide rows, where only the last few tens of millions of rows fit into memory. Queries deep into the past will have neither index nor data cached, and performance will be miserable. If you ran a select * from statuses where userid = xxxxx on a warm, idle status database, that query would be killed before it got started, or it would take tens of minutes to hours to complete.

And I'm sure there are other reasons why this feature isn't offered. (Apologies to Doug and the API team.)
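To make the cursor-column idea concrete, here is a minimal sketch of the two pagination styles. It uses Python's sqlite3 as a stand-in for MySQL, and the table and column names (statuses, id, user_id) are illustrative, not Twitter's actual schema; the point is only the query shape, not the engine.

```python
import sqlite3

# Toy dataset: 10,000 narrow rows with an indexed integer primary key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statuses (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)")
conn.executemany(
    "INSERT INTO statuses (id, user_id, body) VALUES (?, ?, ?)",
    [(i, i % 10, "status %d" % i) for i in range(1, 10001)],
)

def page_by_offset(conn, page, size=1000):
    # OFFSET pagination: the server must walk past page * size rows before
    # returning anything, so latency grows with depth (the effect John
    # describes benchmarking at a few hundred million rows).
    return conn.execute(
        "SELECT id FROM statuses ORDER BY id LIMIT ? OFFSET ?",
        (size, page * size),
    ).fetchall()

def page_by_cursor(conn, after_id, size=1000):
    # Keyset (cursor) pagination: seek directly to the first row of the next
    # block via the indexed cursor column; cost is independent of depth.
    return conn.execute(
        "SELECT id FROM statuses WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, size),
    ).fetchall()

first = page_by_cursor(conn, 0)
second = page_by_cursor(conn, first[-1][0])
print(first[0][0], first[-1][0], second[0][0])  # 1 1000 1001
```

The client carries the last id it saw and passes it back as the cursor, so each block is a single index seek instead of a scan-and-discard, which is why the last block costs the same as the first.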
-- Scott * If you contact me off list replace talklists@ with scott@ *