Kinda late to the party on this very interesting thread, but I'm
wondering if anyone has been using SolrCloud with HDFS at large scales?
We really like this capability since our data is inside of Hadoop and we
can run the Solr shards on the same nodes, and we only need to manage
one pool of storage (HDFS). Our current setup consists of 11 systems
(bare metal; the cluster is actually 16 nodes, but not all of them run
Solr) with a 2.9 TB index and just under 900 million docs spread
across 22 shards (2 shards per physical box) in a single collection. We
index around 3 million docs per day.
A typical query takes about 7 seconds to run, but we also do faceting
and clustering. Those can take in the 3-5 minute range depending on
what was queried, but can be as little as 10 seconds. The index contains
about 100 fields.
We are looking at switching to a different method of indexing our data
which will involve a much larger number of fields, with very little
stored in the index (indexed only) to help improve performance.
I've used SPLITSHARD with success, but when using HDFS the node doing
the split needs roughly three times the direct memory, because it ends
up running three shards (the original plus the two new sub-shards).
This can lead to 'swap hell' if you're not careful. On large indexes,
the split can take a very long time to run, much longer than the REST
timeout, but it can be monitored by checking ZooKeeper's clusterstate.json.
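If it helps anyone, another way around the REST timeout is to submit the
split asynchronously and poll REQUESTSTATUS instead of watching
clusterstate.json. The host, collection, shard and async ID below are
made-up placeholders, so treat this as a rough sketch rather than a
drop-in script:

# Rough sketch (hypothetical host/collection/shard/async id): submit
# SPLITSHARD asynchronously and poll REQUESTSTATUS so the long-running
# split doesn't have to finish within the REST timeout.
import time
import requests

SOLR = "http://localhost:8983/solr"     # any node in the cluster (assumption)
COLLECTION = "mycollection"             # hypothetical collection name
SHARD = "shard1"                        # hypothetical shard to split
ASYNC_ID = "split-shard1-001"           # caller-chosen async request id

# Kick off the split; with async= the call returns immediately.
requests.get(SOLR + "/admin/collections", params={
    "action": "SPLITSHARD",
    "collection": COLLECTION,
    "shard": SHARD,
    "async": ASYNC_ID,
}).raise_for_status()

# Poll the Collections API for the async request's state.
while True:
    r = requests.get(SOLR + "/admin/collections", params={
        "action": "REQUESTSTATUS",
        "requestid": ASYNC_ID,
        "wt": "json",
    })
    state = r.json()["status"]["state"]   # e.g. 'running', 'completed', 'failed'
    print("split state:", state)
    if state in ("completed", "failed"):
        break
    time.sleep(30)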
-Joe
On 1/7/2015 4:25 AM, Bram Van Dam wrote:
On 01/06/2015 07:54 PM, Erick Erickson wrote:
Have you considered pre-supposing SolrCloud and using the SPLITSHARD
API command?
I think that's the direction we'll probably be going. Index size (at
least for us) can be unpredictable in some cases. Some clients start
out small and then grow exponentially, while others start big and then
don't grow much at all. Starting with SolrCloud would at least give us
that flexibility.
That being said, SPLITSHARD doesn't seem ideal. If a shard reaches a
certain size, it would be better for us to simply add an extra shard,
without splitting.
On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge
<peter.stu...@gmail.com> wrote:
++1 for the automagic shard creator. We've been looking into doing this
sort of thing internally - i.e. when a shard reaches a certain size/num
docs, it creates 'sub-shards' to which new commits are sent and queries
to the 'parent' shard are included. The concept works, as long as you
don't try any non-dist stuff - it's one reason why all our fields are
always single valued.
Is there a problem with multi-valued fields and distributed queries?
A cool side-effect of sub-sharding (for lack of a snappy term) is that
the parent shard then stops suffering from auto-warming latency due to
commits (we do a fair amount of committing). In theory, you could carry
on sub-sharding until your hardware starts gasping for air.
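For what it's worth, the "query the parent plus its sub-shards" part can
be done with Solr's shards parameter on the parent core; the host and
core names below are made up for illustration:

# Rough sketch (hypothetical host and core names): a manual distributed
# query that fans out over the 'parent' core plus its sub-shard cores via
# the shards parameter, along the lines of the scheme described above.
import requests

SOLR = "http://solr-host:8983/solr"                  # assumption
PARENT = "solr-host:8983/solr/parent_core"           # hypothetical parent core
SUBS = ["solr-host:8983/solr/sub_core_1",
        "solr-host:8983/solr/sub_core_2"]            # hypothetical sub-shard cores

resp = requests.get(SOLR + "/parent_core/select", params={
    "q": "*:*",
    "shards": ",".join([PARENT] + SUBS),   # query parent and sub-shards together
    "rows": 10,
    "wt": "json",
})
print(resp.json()["response"]["numFound"])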
Sounds like you're doing something similar to us. In some cases we
have a hard commit every minute. Keeping the caches hot seems like a
very good reason to send data to a specific shard. At least I'm
assuming that when you add documents to a single shard and commit, the
other shards won't be impacted...
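In case it's useful, here's a rough sketch of sending documents to one
named shard using the implicit router's _route_ parameter (collection,
shard and field names are made up, and whether a commit stays local to
that shard depends on how the commit is issued):

# Rough sketch (hypothetical collection/shard/field names): with a
# collection created with router.name=implicit, the _route_ parameter
# targets an update at a single named shard.
import requests

SOLR = "http://localhost:8983/solr"    # assumption
COLLECTION = "events"                  # hypothetical implicit-router collection

docs = [{"id": "doc-1", "text_t": "example document"}]   # hypothetical fields

requests.post(
    SOLR + "/" + COLLECTION + "/update",
    params={"_route_": "shard_2015_01",    # hypothetical shard name
            "commitWithin": 60000},        # ask for a commit within ~1 minute
    json=docs,
).raise_for_status()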
- Bram