On Tue, 2013-10-29 at 14:24 +0100, eShard wrote:
> I have a 1 TB repository with approximately 500,000 documents (that will
> probably grow from there) that needs to be indexed.
As Shawn pointed out, that isn't telling us much. If you describe the documents, how and how often you index them, and how you query them, it will help a lot.

Let me offer some observations from a related project we are starting at Statsbiblioteket. We are planning to index 20 TB of harvested web resources (*.dk from the last 8 years, or at least the resources our crawlers sank their tentacles into). We have two text indexes generated from about 1% and 2% of that corpus, respectively. They are 200 GB and 420 GB in size and contain ~75 million and (whoops, the machine is offline, so I'm remember-guessing here) ~150 million documents.

For testing purposes we issued simple searches: 2-4 OR'ed terms, picked at random from a Danish dictionary. One of our test machines is a 2*8-core Xeon machine with 32 GB of RAM (roughly 12 GB free for caching) and SSDs as storage. We had room for a 2-shard cloud on the SSDs, so searches were issued against 2*200 GB of index holding a total of 150 million documents. CentOS / Solr 4.3.

Hammering that machine with 32 threads gave us a median response time of 200 ms and a 99th percentile of 500-800 ms (depending on the test run); a single thread had a median of 30 ms and a 99th percentile of 70-130 ms. CPU load peaked at 300-400% and IOWait at 30-40%, but neither was closely monitored.

Our current vision is to shard the projected 20 TB index into ~800 GB or ~1 TB chunks (depending on which drives we choose) and put one shard on each physical SSD, thereby sidestepping the whole RAID & TRIM problem. We have the great luxury of running nightly batch index updates on a single shard at a time instead of continuous updates; we would probably go for smaller shards if they were all updated continuously. The projected price for the full setup ranges from $50,000 to $100,000, depending on where we land on the off-the-shelf -> enterprise scale. (I need to write a blog post on this.)

With that in mind, I urge you to do some testing on a machine with SSDs and modest memory versus a machine with traditional spinning drives and monster memory.
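For the record, the shape of our test harness is roughly the following. This is a hedged sketch, not our actual code: the query-building and percentile logic mirror what I described (2-4 random OR'ed terms, median and 99th percentile over many threads), but the search function is a placeholder you would point at your own Solr endpoint.

```python
# Illustrative load-test sketch (not our production harness).
# search_fn is pluggable: against a real Solr instance it would do an
# HTTP GET to /select?q=<query>; here it is whatever callable you pass in.
import random
import time
from concurrent.futures import ThreadPoolExecutor

def make_query(words, min_terms=2, max_terms=4):
    # 2-4 terms picked at random from a dictionary, OR'ed together
    k = random.randint(min_terms, max_terms)
    return " OR ".join(random.sample(words, k))

def percentile(sorted_latencies, p):
    # Nearest-rank percentile over an already sorted list of latencies
    idx = min(len(sorted_latencies) - 1,
              int(p / 100.0 * len(sorted_latencies)))
    return sorted_latencies[idx]

def run_benchmark(search_fn, words, n_queries=1000, n_threads=32):
    def timed(_):
        query = make_query(words)
        start = time.time()
        search_fn(query)                      # fire the query, ignore result
        return (time.time() - start) * 1000.0  # latency in milliseconds
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        latencies = sorted(pool.map(timed, range(n_queries)))
    return {"median_ms": percentile(latencies, 50),
            "p99_ms": percentile(latencies, 99)}
```

With a real setup, `search_fn` might be a two-liner using urllib against `http://localhost:8983/solr/<collection>/select` (hostname and collection name are of course placeholders). The point is simply that median alone hides the tail; always report a high percentile as well.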
- Toke Eskildsen, State and University Library, Denmark