On 4/25/2014 1:48 PM, Ed Smiley wrote:
> Anyone with experience, suggestions or lessons learned in the 10-100 TB
> scale they'd like to share?
> Researching optimum design for a Solr Cloud with, say, about 20TB index.
You've gotten some good information already in the replies that have come your way. The following blog post is even more relevant (in the "we don't know" department) for large indexes than it is for small ones:

http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

My own index is nowhere near that size. It has 95 million records in seven shards. A single copy is about 108GB and lives on two servers that each have 64GB of RAM. I'm not running in SolrCloud mode.

The most important resource for Solr scalability is RAM. This includes the Java heap on each server, as well as unallocated memory that the operating system can use to cache the index data living on that server:

http://wiki.apache.org/solr/SolrPerformanceProblems

As the wiki page says, you'd ideally want as much RAM for the OS disk cache as the index takes up on disk, but 40TB of RAM across all servers just for the OS disk cache (in addition to whatever you need for the Java heap) is too expensive to contemplate. A 1:1 ratio is not an absolute requirement, although it does produce the best results. For that 40TB figure, I am assuming that a single replica of your index would be 20TB, and that you'd have two replicas.

Doing everything you can to reduce the index size will go a long way towards improving Solr performance. Having SSDs in each server for the index data would also help. If the query volume is high, you'll also need a large number of very fast CPU cores.

Thanks,
Shawn
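If it helps, here's a rough sketch of that back-of-envelope arithmetic. The 20TB-per-replica and two-replica figures come from the discussion above; the 8GB heap per server, 80-server count, and cache ratios are made-up numbers just to illustrate the calculation:

```python
# Back-of-envelope RAM sizing for a sharded Solr install.
# Assumed inputs (NOT from the thread): 8GB heap/server, 80 servers.

def total_ram_tb(replica_size_tb, replicas, cache_ratio,
                 heap_gb_per_server, servers):
    """Total RAM needed across the cluster: OS disk cache covering
    some fraction of the on-disk index, plus all the Java heaps."""
    cache_tb = replica_size_tb * replicas * cache_ratio
    heap_tb = heap_gb_per_server * servers / 1024
    return cache_tb + heap_tb

# Ideal 1:1 cache ratio -- the 40TB-of-cache case discussed above:
ideal = total_ram_tb(20, 2, 1.0, 8, 80)
# A more affordable 25% cache ratio:
modest = total_ram_tb(20, 2, 0.25, 8, 80)
print(ideal, modest)  # prints 40.625 10.625
```

The takeaway is that the cache term dwarfs the heap term at this scale, which is why shrinking the index (and how much of it actually needs to be cached for your query patterns) matters more than heap tuning.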