Kinda late to the party on this very interesting thread, but I'm
wondering if anyone has been using SolrCloud with HDFS at large scales?
We really like this capability since our data is inside of Hadoop and we
can run the Solr shards on the same nodes, and we only need to manage
one pool of storage (HDFS). Our current setup consists of 11 systems
(bare metal; the cluster is actually 16 nodes, but not all of them run
Solr) with a 2.9 TB index and just under 900 million docs spread
across 22 shards (2 shards per physical box) in a single collection. We
index around 3 million docs per day.
A typical query takes about 7 seconds to run, but we also do faceting
and clustering. Those can take in the 3-5 minute range depending on
what was queried, but can be as little as 10 seconds. The index contains
about 100 fields.
We are looking at switching to a different method of indexing our data
which will involve a much larger number of fields, with very little
stored in the index (indexed only) to help improve performance.
I've used SPLITSHARD with success, but when using HDFS the node doing
the split needs roughly three times the direct memory, because it ends
up running three shards (the original plus the two new sub-shards).
This can lead to 'swap hell' if you're not careful. On large indexes,
the split can take a very long time to run, much longer than the REST
timeout, but it can be monitored by checking ZooKeeper's clusterstate.json.
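If it helps anyone, another way around the REST timeout is to submit the
split asynchronously and poll REQUESTSTATUS instead of watching
clusterstate.json. The host, collection, shard and async ID below are
made-up placeholders, so treat this as a rough sketch rather than a
drop-in script:

# Rough sketch (hypothetical host/collection/shard/async id): submit
# SPLITSHARD asynchronously and poll REQUESTSTATUS so the long-running
# split doesn't have to finish within the REST timeout.
import time
import requests

SOLR = "http://localhost:8983/solr"     # any node in the cluster (assumption)
COLLECTION = "mycollection"             # hypothetical collection name
SHARD = "shard1"                        # hypothetical shard to split
ASYNC_ID = "split-shard1-001"           # caller-chosen async request id

# Kick off the split; with async= the call returns immediately.
requests.get(SOLR + "/admin/collections", params={
    "action": "SPLITSHARD",
    "collection": COLLECTION,
    "shard": SHARD,
    "async": ASYNC_ID,
}).raise_for_status()

# Poll the Collections API for the async request's state.
while True:
    r = requests.get(SOLR + "/admin/collections", params={
        "action": "REQUESTSTATUS",
        "requestid": ASYNC_ID,
        "wt": "json",
    })
    state = r.json()["status"]["state"]   # e.g. 'running', 'completed', 'failed'
    print("split state:", state)
    if state in ("completed", "failed"):
        break
    time.sleep(30)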
-Joe
On 1/7/2015 4:25 AM, Bram Van Dam wrote:
On 01/06/2015 07:54 PM, Erick Erickson wrote:
Have you considered pre-supposing SolrCloud and using the SPLITSHARD
API command?
I think that's the direction we'll probably be going. Index size (at
least for us) can be unpredictable in some cases. Some clients start
out small and then grow exponentially, while others start big and then
don't grow much at all. Starting with SolrCloud would at least give us
that flexibility.
That being said, SPLITSHARD doesn't seem ideal. If a shard reaches a
certain size, it would be better for us to simply add an extra shard,
without splitting.
On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge
<peter.stu...@gmail.com> wrote:
++1 for the automagic shard creator. We've been looking into doing this
sort of thing internally - i.e. when a shard reaches a certain size/num
docs, it creates 'sub-shards' to which new commits are sent and queries
to the 'parent' shard are included. The concept works, as long as you
don't try any non-dist stuff - it's one reason why all our fields are
always single valued.
Is there a problem with multi-valued fields and distributed queries?
A cool side-effect of sub-sharding (for lack of a snappy term) is that
the parent shard then stops suffering from auto-warming latency due to
commits (we do a fair amount of committing). In theory, you could carry
on sub-sharding until your hardware starts gasping for air.
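For what it's worth, the "query the parent plus its sub-shards" part can
be done with Solr's shards parameter on the parent core; the host and
core names below are made up for illustration:

# Rough sketch (hypothetical host and core names): a manual distributed
# query that fans out over the 'parent' core plus its sub-shard cores via
# the shards parameter, along the lines of the scheme described above.
import requests

SOLR = "http://solr-host:8983/solr"                  # assumption
PARENT = "solr-host:8983/solr/parent_core"           # hypothetical parent core
SUBS = ["solr-host:8983/solr/sub_core_1",
        "solr-host:8983/solr/sub_core_2"]            # hypothetical sub-shard cores

resp = requests.get(SOLR + "/parent_core/select", params={
    "q": "*:*",
    "shards": ",".join([PARENT] + SUBS),   # query parent and sub-shards together
    "rows": 10,
    "wt": "json",
})
print(resp.json()["response"]["numFound"])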
Sounds like you're doing something similar to us. In some cases we
have a hard commit every minute. Keeping the caches hot seems like a
very good reason to send data to a specific shard. At least I'm
assuming that when you add documents to a single shard and commit, the
other shards won't be impacted...
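In case it's useful, here's a rough sketch of sending documents to one
named shard using the implicit router's _route_ parameter (collection,
shard and field names are made up, and whether a commit stays local to
that shard depends on how the commit is issued):

# Rough sketch (hypothetical collection/shard/field names): with a
# collection created with router.name=implicit, the _route_ parameter
# targets an update at a single named shard.
import requests

SOLR = "http://localhost:8983/solr"    # assumption
COLLECTION = "events"                  # hypothetical implicit-router collection

docs = [{"id": "doc-1", "text_t": "example document"}]   # hypothetical fields

requests.post(
    SOLR + "/" + COLLECTION + "/update",
    params={"_route_": "shard_2015_01",    # hypothetical shard name
            "commitWithin": 60000},        # ask for a commit within ~1 minute
    json=docs,
).raise_for_status()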
- Bram