On 10/29/2013 7:24 AM, eShard wrote:
> Good morning,
> I have a 1 TB repository with approximately 500,000 documents (that will
> probably grow from there) that needs to be indexed.
> I'm limited to Solr 4.0 final (we're close to beta release, so I can't
> upgrade right now) and I can't use SolrCloud because work currently won't
> allow it for some reason.
>
> I found this configuration from this link:
> http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-td3656484.html#a3657056
>
> He said he was able to index 1 TB on a single server with 40 cores and 128
> GB of RAM with 10 shards.
>
> Is this my only option? Or is there a better configuration?
> Is there some formula for calculating server specifications (this much data
> and documents equals this many cores, RAM, hard disk space etc)?
Solr 4.0.0 final is OLD - it was released a year ago, on October 12th, 2012. Solr 4.5.1 is the eighth release since 4.0.0, and the number of bugs fixed and performance improvements added since then is staggering.

Planning for hardware failure is critical ... because servers and their components DO fail. SolrCloud gives you easy redundancy - it automates many functions that you'd have to manually design and write yourself if you don't use it, especially if you plan to go sharded. I know this first-hand, because I have a sharded index that was initially built using Solr 1.4.0, back when SolrCloud was nowhere near release.

Now that I've bashed the high-level details of your plan, let's talk about things that are independent of version and SolrCloud.

An important thing to say right up front is that there are so many variables involved in Solr requirements that nobody can say for sure what you will need. Until you see how your indexing and queries actually perform, hard numbers are mostly impossible to calculate.

How much of that 1TB will actually need to be in Solr? The answer to that question will drive the rest of the discussion. Have you done any experimentation yet to determine how big your actual index will get? Taking steps to reduce the index size will help performance greatly.

For the indexed fields, data has a tendency to shrink a little bit when you index it. We have an index for an archive of photos, text, and video that's over 200TB ... but the actual metadata that goes into Solr is a database that's about 200GB in size. Not all of that database gets indexed, and not all of the source fields are stored. The Solr index totals about 93GB.

Hopefully when it comes to stored data, you can get away with storing only minimal information - just enough to display search results. When someone wants detail, your system can use an ID stored in Solr to go to your canonical data source and retrieve the full record (a small sketch of that pattern follows at the end of this message).

Ideally, you want enough total RAM to cache your entire index. If the total index size on one machine is 250GB, a machine with 256GB of RAM is a good idea. Total RAM of 128 to 192GB might be enough in reality, though. If you put the index on SSD, you could get by with less RAM, but a RAID solution that works properly with SSD (TRIM support) is hard to find, so an SSD failure in most situations effectively means a server failure. Solr and Lucene have a track record of shredding SSDs to the point of failure, because typically there is a LOT of writing involved.

If I had to design an index where the *Solr* data (not the source repository) was 1TB, I would work things out so that the total RAM capacity across all servers was between 1.5 and 2 TB. The requirement is doubled because I'd have two copies of the index, for redundancy (a rough sizing calculation also follows at the end). I would also want a single-stranded (non-redundant) development environment with a total RAM capacity of at least 512GB, for planning and testing of upgrades and new features. Designing at this scale is not cheap.

Thanks,
Shawn
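
A minimal sketch of the ID-lookup pattern described above, assuming a hypothetical Solr 4.x core named "docs" with "id" and "title" fields and a placeholder canonical-store lookup; the /select request parameters and JSON response shape are standard Solr, everything else here is illustrative, not anything from the message:

import json
import urllib.parse
import urllib.request

# Hypothetical core URL; adjust host, port, and core name for your install.
SOLR_SELECT = "http://localhost:8983/solr/docs/select"

def search_ids(user_query, rows=10):
    """Ask Solr only for the ID (plus a title for the result list)."""
    params = urllib.parse.urlencode({
        "q": user_query,
        "fl": "id,title",   # just enough stored fields to display results
        "rows": rows,
        "wt": "json",
    })
    with urllib.request.urlopen(SOLR_SELECT + "?" + params) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    return data["response"]["docs"]

def fetch_full_record(doc_id):
    """Placeholder: retrieve the complete document from the canonical
    repository (database, file store, CMS) using the ID Solr returned."""
    raise NotImplementedError("replace with your repository lookup")

if __name__ == "__main__":
    for hit in search_ids("*:*"):
        print(hit["id"], hit.get("title"))
        # full_doc = fetch_full_record(hit["id"])  # only when detail is needed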
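
And a rough worked version of the sizing arithmetic above (total RAM roughly equal to index size, doubled for a redundant second copy); the headroom parameter is an assumption added for illustration, not something from the message:

def estimate_cluster_ram_gb(index_size_gb, copies=2, headroom_fraction=0.0):
    """Back-of-the-envelope: enough RAM across all servers for the OS to
    cache every copy of the index, plus optional headroom for JVM heap,
    the OS itself, and growth. A starting point, not a guarantee."""
    return index_size_gb * copies * (1.0 + headroom_fraction)

# 1 TB of Solr data with two copies for redundancy -> about 2 TB of RAM,
# the upper end of the 1.5 to 2 TB range mentioned above.
print(estimate_cluster_ram_gb(1024, copies=2))          # 2048.0

# A single machine holding a 250 GB index with no redundancy -> 250 GB,
# which a 256 GB machine covers.
print(estimate_cluster_ram_gb(250, copies=1))           # 250.0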