On 10/29/2013 7:24 AM, eShard wrote:
> Good morning,
> I have a 1 TB repository with approximately 500,000 documents (that will
> probably grow from there) that needs to be indexed.  
> I'm limited to Solr 4.0 final (we're close to beta release, so I can't
> upgrade right now) and I can't use SolrCloud because work currently won't
> allow it for some reason.
> 
> I found this configuration from this link:
> http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-td3656484.html#a3657056
>  
> He said he was able to index 1 TB on a single server with 40 cores and 128
> GB of RAM with 10 shards.
> 
> Is this my only option? Or is there a better configuration?
> Is there some formula for calculating server specifications (this much data
> and documents equals this many cores, RAM, hard disk space etc)?

Solr 4.0.0 final is OLD - it was released a year ago, on October 12th,
2012.  Solr 4.5.1 is the eighth release since 4.0.0, and the number of
bug fixes and performance improvements added since then is staggering.

Planning for hardware failure is critical ... because servers and their
components DO fail.  SolrCloud gives you easy redundancy - it automates
many functions that you'd have to manually design and write yourself if
you don't use it, especially if you plan to go sharded.  I know this
first-hand, because I have a sharded index that was initially built
using Solr 1.4.0, back when SolrCloud was nowhere near release.

Now that I've bashed the high-level details of your plan, let's talk
about things that are independent of version and SolrCloud.

An important thing to say right up front is that there are so many
variables involved in Solr requirements that nobody can say for sure
what you will need.  Until you see how your indexing and queries
actually perform, hard numbers are mostly impossible to calculate.

How much of that 1TB will actually need to be in Solr?  The answer to
that question will drive the rest of the discussion.  Have you done any
experimentation yet to determine how big your actual index will get?
Taking steps to reduce the index size will help performance greatly.
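One low-tech way to answer that: index a representative sample of your
documents, look at the size of the core's data/index directory on disk,
and extrapolate.  A rough sketch in Java - the index path and the
sample/total document counts below are made-up placeholders, not
recommendations:

  import java.io.File;

  public class IndexSizeEstimate {
    public static void main(String[] args) {
      // Index directory of the core that holds the sample documents.
      File indexDir = new File("/var/solr/data/collection1/index");
      long sampleBytes = dirSize(indexDir);
      long sampleDocs = 10000;   // documents actually indexed in the sample
      long totalDocs = 500000;   // expected total document count
      long estimate = sampleBytes * totalDocs / sampleDocs;
      System.out.println("Estimated full index size: "
          + (estimate / (1024L * 1024 * 1024)) + " GB");
    }

    // Add up the size of every file under a directory, recursively.
    static long dirSize(File dir) {
      long total = 0;
      File[] files = dir.listFiles();
      if (files == null) {
        return 0;
      }
      for (File f : files) {
        total += f.isDirectory() ? dirSize(f) : f.length();
      }
      return total;
    }
  }

The estimate is only as good as the sample, but it's usually enough to
tell you whether you're looking at 100GB of index or 1TB.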

For the indexed fields, data has a tendency to shrink a little bit when
you index it.  We have an index for an archive of photos, text, and
video that's over 200TB ... but the actual metadata that goes into Solr
is a database that's about 200GB in size.  Not all of that database gets
indexed, and not all of the source fields are stored.  The Solr index
totals about 93GB.
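In that case the index works out to a bit under half the size of the
source metadata, but your ratio will depend entirely on your schema -
which fields are indexed, which are stored, and how they're analyzed.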

Hopefully when it comes to stored data, you can get away with only
storing minimal information - just enough to display search results.
When someone wants detail, your system can use an ID stored in Solr to
go to your canonical data source to retrieve the full record.
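As a sketch of that pattern with SolrJ (the field names, core name, and
the repository lookup are made up for illustration):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;

  public class SearchThenFetch {
    public static void main(String[] args) throws Exception {
      // HttpSolrServer is the SolrJ client in the 4.x line.
      HttpSolrServer solr =
          new HttpSolrServer("http://localhost:8983/solr/collection1");

      SolrQuery query = new SolrQuery("title:report");
      // Ask Solr only for the fields needed to render the result list.
      query.setFields("id", "title", "snippet");
      query.setRows(20);

      QueryResponse response = solr.query(query);
      for (SolrDocument doc : response.getResults()) {
        String id = (String) doc.getFieldValue("id");
        System.out.println(id + " -> " + doc.getFieldValue("title"));
        // The full document lives in your canonical repository, not in
        // Solr, so a real application would now do something like:
        // FullRecord record = repository.fetchById(id);
      }
    }
  }

Keeping stored fields down to that minimum is one of the easier ways to
shrink the index.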

Ideally, you want enough total RAM for the operating system's disk
cache to hold your entire index - that cache, not the Java heap, is
what makes index access fast.  If the total index size on one machine
is 250GB, a machine with 256GB of RAM is a good idea.  Total RAM of
128 to 192GB might be enough in reality, though.

If you put the index on SSD, you could get by with less RAM, but a RAID
solution that works properly with SSD (TRIM support) is hard to find, so
in most setups an SSD failure effectively means a server failure.  Solr
and Lucene also have a track record of wearing SSDs out, because
indexing typically involves a LOT of writing.

If I had to design an index where the *Solr* data (not the source
repository) was 1TB, I would work things out so that I had enough
servers to give me a total RAM capacity (across all of them) of between
1.5 and 2TB.  The requirement is doubled because I'd have two copies of
the index, for redundancy.  I would also want a single-copy
(non-redundant) development environment with a total RAM capacity of at
least 512GB, for planning and testing upgrades and new features.
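To put rough numbers on that (a sketch, not a firm recommendation): two
copies of a 1TB index is 2TB of index data, so eight production servers
with 256GB of RAM each would let the OS cache all of it, and six such
servers (1.5TB total) might be enough if the whole index doesn't have
to stay hot at the same time.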

Designing at this scale is not cheap.

Thanks,
Shawn
