On Tue, 2013-10-29 at 14:24 +0100, eShard wrote:
> I have a 1 TB repository with approximately 500,000 documents (that will
> probably grow from there) that needs to be indexed.  

As Shawn pointed out, that isn't telling us much. If you describe the
documents, how and how often you index and how you query them, it will
help a lot.


Let me offer some observations from a related project we are starting at
Statsbiblioteket.


We are planning to index 20 TB harvested web resources (*.dk from the
last 8 years, or at least the resources our crawlers sunk their
tentacles into). We have two text indexes generated from about 1% and 2%
of that corpus, respectively. They are 200GB and 420GB in size and
contain ~75 million and (whoops, offline, so guessing from memory here)
~150 million documents.

For testing purposes we issued simple searches: 2-4 OR'ed terms, picked
at random from a Danish dictionary. One of our test machines is a 2*8
core Xeon machine with 32GB of RAM (~12GB free for caching) and
SSDs as storage. We had room for a 2-shard cloud on the SSDs, so
searches were issued to 2*200GB of index with a total of 150 million
documents. CentOS/Solr 4.3.
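
For anyone wanting to reproduce that kind of test, the query generation
can be sketched roughly as below. This is a minimal Python sketch and
not our actual test code: the function name random_or_query and the tiny
word list are made up for illustration, and a real run would load a full
dictionary file instead.

```python
import random


def random_or_query(words, rng, min_terms=2, max_terms=4):
    """Build a query of 2-4 distinct dictionary words joined by OR."""
    term_count = rng.randint(min_terms, max_terms)
    return " OR ".join(rng.sample(words, term_count))


# Stand-in word list (hypothetical); a real test would read a dictionary file.
WORDS = ["hus", "bil", "skole", "vand", "bog", "by"]

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a test run can be repeated
    for _ in range(3):
        print(random_or_query(WORDS, rng))
```

Sampling without replacement keeps the terms in a query distinct, so a
"4 term" query really does hit four different terms.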

Hammering that machine with 32 threads gave us a median response time of
200ms and a 99th percentile of 500-800ms (depending on test run). A single
thread gave a median of 30ms and a 99th percentile of 70-130ms. CPU load
peaked at 300-400% and IOWait at 30-40%, but was not closely monitored.
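
In case it helps when comparing numbers, here is one way to get a median
and a 99th percentile out of a list of measured response times. It is
only a sketch: the nearest-rank percentile definition and the helper
names are my assumptions here, not necessarily what our test harness
does, and the sample timings are invented.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(response_times_ms):
    """Return (median, 99th percentile) for a list of response times in ms."""
    return statistics.median(response_times_ms), percentile(response_times_ms, 99)


if __name__ == "__main__":
    # Invented timings in milliseconds, for illustration only.
    timings = [25, 28, 30, 31, 33, 35, 40, 55, 90, 120]
    median_ms, p99_ms = summarize(timings)
    print(f"median={median_ms}ms 99th percentile={p99_ms}ms")
```

With few samples the 99th percentile is just the slowest request, so it
takes a fairly long run before that number means much.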

Our current vision is to shard the projected 20TB index into ~800GB or
~1TB chunks (depending on which drives we choose) and put one shard on
each physical SSD, thereby sidestepping the whole RAID & TRIM problem.
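
The drive count follows directly from those sizes. As a back-of-envelope
sketch (assuming decimal units, i.e. 20TB = 20,000GB, and using a
hypothetical helper name shard_count), that comes to 25 shards at 800GB
or 20 shards at 1TB:

```python
import math


def shard_count(index_gb, shard_gb):
    """Number of shards (one per SSD) needed to hold the whole index."""
    return math.ceil(index_gb / shard_gb)


PROJECTED_INDEX_GB = 20_000  # 20TB, decimal units assumed

if __name__ == "__main__":
    for shard_gb in (800, 1000):
        count = shard_count(PROJECTED_INDEX_GB, shard_gb)
        print(f"{shard_gb}GB shards: {count} SSDs")
```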

We do have the great luxury of running nightly batch index updates on a
single shard instead of continuous updates. We would probably go for
smaller shards if they were all updated continuously.

The projected price for the full setup ranges from $50,000 to $100,000,
depending on where we land on the off-the-shelf -> enterprise scale.

(I need to write a blog post on this)


With that in mind, I urge you to do some testing on a machine with SSDs
and modest memory vs. a machine with traditional spinning drives and
monster memory.


- Toke Eskildsen, State and University Library, Denmark

