Kinda late to the party on this very interesting thread, but I'm wondering if anyone has been using SolrCloud with HDFS at large scale? We really like this capability: our data is already inside Hadoop, we can run the Solr shards on the same nodes, and we only need to manage one pool of storage (HDFS). Our current setup consists of 11 systems (bare metal; the cluster is actually 16 nodes, but not all of them run Solr) with a 2.9 TB index and just under 900 million docs spread across 22 shards (2 shards per physical box) in a single collection. We index around 3 million docs per day.
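
For anyone curious what the client side of that looks like, here is a minimal SolrJ sketch of indexing into a SolrCloud collection, assuming the 4.x-era CloudSolrServer client; the ZooKeeper ensemble, collection and field names are placeholders, not our actual setup.

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexingSketch {
        public static void main(String[] args) throws Exception {
            // Connect through ZooKeeper; the ensemble string is a placeholder.
            CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181/solr");
            solr.setDefaultCollection("collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("body", "example text");
            solr.add(doc);

            // Explicit commit just for the sketch; autoCommit is the usual choice for steady indexing.
            solr.commit();
            solr.shutdown();
        }
    }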

A typical query takes about 7 seconds to run, but we also do faceting and clustering; those can take 3 - 5 minutes depending on what was queried, though they can finish in as little as 10 seconds. The index contains about 100 fields. We are looking at switching to a different method of indexing our data that will involve a much larger number of fields, with very little stored in the index (indexed only), to help improve performance.
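
A minimal sketch of the kind of faceted query described above, using SolrJ; the query string and facet fields are purely illustrative (the real queries also add clustering, which is what pushes the response times up).

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FacetQuerySketch {
        // 'solr' can be the CloudSolrServer from the previous sketch.
        static void facetedSearch(SolrServer solr) throws Exception {
            SolrQuery query = new SolrQuery("body:example");   // illustrative query
            query.setRows(10);
            query.setFacet(true);
            query.addFacetField("source", "author");           // illustrative facet fields
            query.setFacetLimit(20);

            QueryResponse rsp = solr.query(query);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }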

I've used SPLITSHARD with success, but when using HDFS the server doing the split needs roughly triple the direct memory, because that node ends up running three shards (the parent plus the two new sub-shards). This can lead to 'swap hell' if you're not careful. On large indexes the split can take a very long time to run - much longer than the REST timeout - but it can be monitored by checking ZooKeeper's clusterstate.json.
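
For what it's worth, the Collections API can also run the split asynchronously and be polled with REQUESTSTATUS, which sidesteps the REST timeout; I believe that support arrived around Solr 4.8. A rough SolrJ sketch, with placeholder collection/shard names and request id:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class SplitShardSketch {
        static void splitAsync(SolrServer solr) throws Exception {
            // Kick off the split asynchronously so the HTTP call returns right away.
            ModifiableSolrParams split = new ModifiableSolrParams();
            split.set("action", "SPLITSHARD");
            split.set("collection", "collection1");   // placeholder collection
            split.set("shard", "shard1");              // placeholder shard
            split.set("async", "split-shard1");        // request id to poll on
            QueryRequest splitReq = new QueryRequest(split);
            splitReq.setPath("/admin/collections");
            solr.request(splitReq);

            // Poll the request id instead of watching clusterstate.json by hand.
            ModifiableSolrParams status = new ModifiableSolrParams();
            status.set("action", "REQUESTSTATUS");
            status.set("requestid", "split-shard1");
            QueryRequest statusReq = new QueryRequest(status);
            statusReq.setPath("/admin/collections");
            System.out.println(solr.request(statusReq));
        }
    }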

-Joe


On 1/7/2015 4:25 AM, Bram Van Dam wrote:
On 01/06/2015 07:54 PM, Erick Erickson wrote:
Have you considered pre-supposing SolrCloud and using the SPLITSHARD
API command?

I think that's the direction we'll probably be going. Index size (at least for us) can be unpredictable in some cases. Some clients start out small and then grow exponentially, while others start big and then don't grow much at all. Starting with SolrCloud would at least give us that flexibility.

That being said, SPLITSHARD doesn't seem ideal. If a shard reaches a certain size, it would be better for us to simply add an extra shard, without splitting.
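
As far as I understand it, adding a shard outright via CREATESHARD only works for collections created with the 'implicit' router, which shifts document routing onto the application. A rough sketch of both calls, with placeholder collection/shard/field names:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class AddShardSketch {
        static void createAndGrow(SolrServer solr) throws Exception {
            // Create the collection with the implicit router; names are placeholders.
            ModifiableSolrParams create = new ModifiableSolrParams();
            create.set("action", "CREATE");
            create.set("name", "logs");
            create.set("router.name", "implicit");
            create.set("shards", "shard-2015-01");
            create.set("router.field", "shard_label");   // doc field naming the target shard
            QueryRequest createReq = new QueryRequest(create);
            createReq.setPath("/admin/collections");
            solr.request(createReq);

            // Later, bolt on an extra shard without splitting anything.
            ModifiableSolrParams grow = new ModifiableSolrParams();
            grow.set("action", "CREATESHARD");
            grow.set("collection", "logs");
            grow.set("shard", "shard-2015-02");
            QueryRequest growReq = new QueryRequest(grow);
            growReq.setPath("/admin/collections");
            solr.request(growReq);
        }
    }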


On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge <peter.stu...@gmail.com> wrote:
++1 for the automagic shard creator. We've been looking into doing this sort of thing internally - i.e. when a shard reaches a certain size/number of docs, it creates 'sub-shards' to which new commits are sent, and which are included in queries against the 'parent' shard. The concept works as long as you don't try any non-distributed stuff - it's one reason why all our fields are always single-valued.

Is there a problem with multi-valued fields and distributed queries?

A cool side-effect of sub-sharding (for lack of a snappy term) is that the parent shard then stops suffering from auto-warming latency due to commits
(we do a fair amount of committing). In theory, you could carry on
sub-sharding until your hardware starts gasping for air.

Sounds like you're doing something similar to us. In some cases we have a hard commit every minute. Keeping the caches hot seems like a very good reason to send data to a specific shard. At least, I'm assuming that when you add documents to a single shard and commit, the other shards won't be impacted...
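
For the record, with the 'implicit' router you can aim an update at one specific shard, e.g. via a router.field value as in the earlier CREATE sketch; whether a subsequent commit is really isolated to that shard is exactly the question above, so treat this as a starting point rather than a confirmation. All names are placeholders.

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class RoutedUpdateSketch {
        static void addToOneShard(SolrServer solr) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-42");
            doc.addField("shard_label", "shard-2015-02");   // matches router.field in the CREATE sketch

            UpdateRequest req = new UpdateRequest();
            req.add(doc);
            req.setCommitWithin(60000);   // let Solr commit within a minute rather than committing explicitly
            req.process(solr);
        }
    }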

 - Bram


