A flat distribution of queries is a poor test. Real query loads follow a Zipf distribution. A flat distribution gets almost no benefit from caching, so it understates performance and stresses disk IO far more than production traffic would. The 99th percentile is probably about the same for both distributions, because that is dominated by rare queries.
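As a minimal sketch of what Zipf-weighted query sampling can look like in a load-test harness (the query list, exponent, and seed here are assumptions, not anyone's production setup):

#!/usr/bin/env python
"""Sketch: draw benchmark queries with Zipf weights instead of a flat
distribution, so head queries hit caches the way they do in production."""

import random

def zipf_sample(queries, n, s=1.0, seed=42):
    """Return n queries drawn with probability proportional to 1/rank^s.

    `queries` should be ordered from most to least popular (e.g. taken
    from a real query log); s=1.0 is classic Zipf, larger s skews harder
    toward the head.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range(1, len(queries) + 1)]
    return rng.choices(queries, weights=weights, k=n)

if __name__ == "__main__":
    # Hypothetical query list, head-first; in practice use a real log.
    queries = ["cheap flights", "weather", "solr sharding", "rare term xyzzy"]
    print(zipf_sample(queries, n=20))  # head queries dominate, tail shows up rarely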
Real query loads will get a much smaller boost from SSD in the median and up to about the 75th percentile.

wunder
Search guy for Netflix and now Chegg

On Oct 30, 2013, at 1:43 AM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:

> On Tue, 2013-10-29 at 14:24 +0100, eShard wrote:
>> I have a 1 TB repository with approximately 500,000 documents (that will
>> probably grow from there) that needs to be indexed.
>
> As Shawn points out, that isn't telling us much. If you describe the
> documents, how and how often you index them, and how you query them, it
> will help a lot.
>
>
> Let me offer some observations from a related project we are starting at
> Statsbiblioteket.
>
>
> We are planning to index 20 TB of harvested web resources (*.dk from the
> last 8 years, or at least the resources our crawlers sunk their
> tentacles into). We have two text indexes generated from about 1% and 2%
> of that corpus, respectively. They are 200GB and 420GB in size and
> contain ~75 million and (whoops, offline, so guessing from memory here)
> ~150 million documents.
>
> For testing purposes we issued simple searches: 2-4 OR'ed terms, picked
> at random from a Danish dictionary. One of our test machines is a 2*8
> core Xeon machine with 32GB of RAM (~12GB free for caching) and SSDs as
> storage. We had room for a 2-shard cloud on the SSDs, so searches were
> issued against 2*200GB of index, 150 million documents in total.
> CentOS/Solr 4.3.
>
> Hammering that machine with 32 threads gave us a median response time of
> 200ms and a 99th percentile of 5-800 ms (depending on test run); a single
> thread had a median of 30ms and a 99th percentile of 70-130ms. CPU load
> peaked at 300-400% and IOWait at 30-40%, but was not closely monitored.
>
> Our current vision is to shard the projected 20TB index into ~800GB or
> ~1TB chunks (depending on which drives we choose) and put one shard on
> each physical SSD, thereby sidestepping the whole RAID & TRIM problem.
>
> We do have the great luxury of running nightly batch index updates on a
> single shard instead of continuous updates. We would probably go for
> smaller shards if they were all updated continuously.
>
> The projected price for the full setup ranges from $50,000 to $100,000,
> depending on where we land on the off-the-shelf -> enterprise scale.
>
> (I need to write a blog post on this)
>
>
> With that in mind, I urge you to do some testing on a machine with SSDs
> and modest memory vs. a machine with traditional spinning drives and
> monster memory.
>
>
> - Toke Eskildsen, State and University Library, Denmark
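A rough sketch of the kind of test described above (2-4 OR'ed dictionary terms per query, fired from many concurrent threads, with median and 99th-percentile latencies reported afterwards). The Solr URL, core name, and word-list path are assumptions:

#!/usr/bin/env python
"""Sketch of a simple Solr load test: random OR'ed dictionary terms,
many threads, then median and 99th-percentile response times."""

import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # any HTTP client would do

SOLR_URL = "http://localhost:8983/solr/webarchive/select"  # hypothetical core name
# Stand-in for a Danish dictionary; keep plain alphabetic terms only.
WORDS = [w for w in open("/usr/share/dict/words").read().split() if w.isalpha()]

def one_query(_):
    # 2-4 OR'ed terms picked at random, as in the test described above.
    terms = random.sample(WORDS, random.randint(2, 4))
    params = {"q": " OR ".join(terms), "rows": 10, "wt": "json"}
    start = time.time()
    requests.get(SOLR_URL, params=params, timeout=30)
    return (time.time() - start) * 1000.0  # latency in ms

def run(n_queries=1000, threads=32):
    with ThreadPoolExecutor(max_workers=threads) as pool:
        latencies = sorted(pool.map(one_query, range(n_queries)))
    print("median: %.0f ms" % statistics.median(latencies))
    print("99th percentile: %.0f ms" % latencies[int(len(latencies) * 0.99)])

if __name__ == "__main__":
    run()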