A flat distribution of queries is a poor test. Real query loads follow a Zipf distribution. A flat distribution gets almost no benefit from caching, so it understates performance and stresses disk IO far more than production traffic would. The 99th percentile is probably about the same for both distributions, because that is dominated by rare queries.
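As a minimal sketch of what Zipf-weighted query sampling can look like in a load-test harness (the query list, exponent, and seed here are assumptions, not anyone's production setup):

#!/usr/bin/env python
"""Sketch: draw benchmark queries with Zipf weights instead of a flat
distribution, so head queries hit caches the way they do in production."""

import random

def zipf_sample(queries, n, s=1.0, seed=42):
    """Return n queries drawn with probability proportional to 1/rank^s.

    `queries` should be ordered from most to least popular (e.g. taken
    from a real query log); s=1.0 is classic Zipf, larger s skews harder
    toward the head.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range(1, len(queries) + 1)]
    return rng.choices(queries, weights=weights, k=n)

if __name__ == "__main__":
    # Hypothetical query list, head-first; in practice use a real log.
    queries = ["cheap flights", "weather", "solr sharding", "rare term xyzzy"]
    print(zipf_sample(queries, n=20))  # head queries dominate, tail shows up rarely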
Real query loads will get a much smaller boost from SSD in the median and up to about the 75th percentile.

wunder
Search guy for Netflix and now Chegg

On Oct 30, 2013, at 1:43 AM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:

> On Tue, 2013-10-29 at 14:24 +0100, eShard wrote:
>> I have a 1 TB repository with approximately 500,000 documents (that will
>> probably grow from there) that needs to be indexed.
>
> As Shawn points out, that isn't telling us much. If you describe the
> documents, how and how often you index them, and how you query them, it
> will help a lot.
>
>
> Let me offer some observations from a related project we are starting at
> Statsbiblioteket.
>
>
> We are planning to index 20 TB of harvested web resources (*.dk from the
> last 8 years, or at least the resources our crawlers sunk their
> tentacles into). We have two text indexes generated from about 1% and 2%
> of that corpus, respectively. They are 200GB and 420GB in size and
> contain ~75 million and (whoops, offline, so guessing from memory here)
> ~150 million documents.
>
> For testing purposes we issued simple searches: 2-4 OR'ed terms, picked
> at random from a Danish dictionary. One of our test machines is a 2*8
> core Xeon machine with 32GB of RAM (~12GB free for caching) and SSDs as
> storage. We had room for a 2-shard cloud on the SSDs, so searches were
> issued against 2*200GB of index, 150 million documents in total.
> CentOS/Solr 4.3.
>
> Hammering that machine with 32 threads gave us a median response time of
> 200ms and a 99th percentile of 5-800 ms (depending on test run); a single
> thread had a median of 30ms and a 99th percentile of 70-130ms. CPU load
> peaked at 300-400% and IOWait at 30-40%, but was not closely monitored.
>
> Our current vision is to shard the projected 20TB index into ~800GB or
> ~1TB chunks (depending on which drives we choose) and put one shard on
> each physical SSD, thereby sidestepping the whole RAID & TRIM problem.
>
> We do have the great luxury of running nightly batch index updates on a
> single shard instead of continuous updates. We would probably go for
> smaller shards if they were all updated continuously.
>
> The projected price for the full setup ranges from $50,000 to $100,000,
> depending on where we land on the off-the-shelf -> enterprise scale.
>
> (I need to write a blog post on this)
>
>
> With that in mind, I urge you to do some testing on a machine with SSDs
> and modest memory vs. a machine with traditional spinning drives and
> monster memory.
>
>
> - Toke Eskildsen, State and University Library, Denmark
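A rough sketch of the kind of test described above (2-4 OR'ed dictionary terms per query, fired from many concurrent threads, with median and 99th-percentile latencies reported afterwards). The Solr URL, core name, and word-list path are assumptions:

#!/usr/bin/env python
"""Sketch of a simple Solr load test: random OR'ed dictionary terms,
many threads, then median and 99th-percentile response times."""

import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # any HTTP client would do

SOLR_URL = "http://localhost:8983/solr/webarchive/select"  # hypothetical core name
# Stand-in for a Danish dictionary; keep plain alphabetic terms only.
WORDS = [w for w in open("/usr/share/dict/words").read().split() if w.isalpha()]

def one_query(_):
    # 2-4 OR'ed terms picked at random, as in the test described above.
    terms = random.sample(WORDS, random.randint(2, 4))
    params = {"q": " OR ".join(terms), "rows": 10, "wt": "json"}
    start = time.time()
    requests.get(SOLR_URL, params=params, timeout=30)
    return (time.time() - start) * 1000.0  # latency in ms

def run(n_queries=1000, threads=32):
    with ThreadPoolExecutor(max_workers=threads) as pool:
        latencies = sorted(pool.map(one_query, range(n_queries)))
    print("median: %.0f ms" % statistics.median(latencies))
    print("99th percentile: %.0f ms" % latencies[int(len(latencies) * 0.99)])

if __name__ == "__main__":
    run()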