bq: I'm wondering if anyone has been using SolrCloud with HDFS at large scales

Absolutely, there are several companies doing this; see Lucidworks and
Cloudera for two examples.

Solr itself has the MapReduceIndexerTool for bulk indexing into Solr
instances running on HDFS.

About needing 3x the memory: simply splitting shards is only _part_
of the solution, as you're seeing.
The next bit would be to move the sub-shards onto other hardware,
still pointing at the HDFS index, to be sure.
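
Roughly something like this against the Collections API - just a Python
sketch, untested; the host, collection, shard and replica names below are
made up and you'd look the real ones up in clusterstate.json first:

import requests

SOLR = "http://solr-target:8983/solr"

# 1) Add a replica of the new sub-shard on the less-loaded node.
requests.get(SOLR + "/admin/collections", params={
    "action": "ADDREPLICA",
    "collection": "mycollection",
    "shard": "shard1_0",              # a sub-shard created by SPLITSHARD
    "node": "solr-target:8983_solr",  # target node_name as ZK knows it
    "wt": "json",
}).raise_for_status()

# 2) Once the new replica is active, drop the copy on the overloaded node.
requests.get(SOLR + "/admin/collections", params={
    "action": "DELETEREPLICA",
    "collection": "mycollection",
    "shard": "shard1_0",
    "replica": "core_node7",          # the replica still on the original box
    "wt": "json",
}).raise_for_status()

The point being to wait until the new replica shows up as active in
clusterstate.json before issuing the DELETEREPLICA.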

FWIW,
Erick



On Wed, Jan 7, 2015 at 9:50 AM, Joseph Obernberger
<j...@lovehorsepower.com> wrote:
> Kinda late to the party on this very interesting thread, but I'm wondering
> if anyone has been using SolrCloud with HDFS at large scales?  We really
> like this capability since our data is inside of Hadoop and we can run the
> Solr shards on the same nodes, and we only need to manage one pool of
> storage (HDFS).  Our current setup consists of 11 systems (bare metal;
> the cluster is actually 16 nodes, but not all of them run Solr) with a
> 2.9 TB index and just under 900 million docs spread across 22 shards (2 shards per
> physical box) in a single collection.  We index around 3 million docs per
> day.
>
> A typical query takes about 7 seconds to run, but we also do faceting and
> clustering.  Those can take in the 3-5 minute range depending on what was
> queried, but can be as little as 10 seconds. The index contains about 100
> fields.
> We are looking at switching to a different method of indexing our data,
> which will involve a much larger number of fields with very little stored
> in the index (indexed only), to help improve performance.
>
> I've used SPLITSHARD with success, but when using HDFS the server doing the
> split needs triple the amount of direct memory, because that one node will
> temporarily be running three shards (the parent plus the two sub-shards).
> This can lead to 'swap hell' if you're not careful. On large indexes, the
> split can take a very long time to run, much longer than the REST timeout,
> but it can be monitored by checking ZooKeeper's clusterstate.json.
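>
> A rough sketch of the kind of polling I mean - this assumes the kazoo
> client, a ZK host at zk1:2181, and made-up collection/shard names:
>
> import json
> import time
> from kazoo.client import KazooClient
>
> zk = KazooClient(hosts="zk1:2181")
> zk.start()
> try:
>     while True:
>         data, _stat = zk.get("/clusterstate.json")
>         state = json.loads(data.decode("utf-8"))
>         shards = state["mycollection"]["shards"]
>         # The split is finished when the new sub-shards (shard1_0,
>         # shard1_1) go active and the parent (shard1) flips to inactive.
>         print({name: s.get("state") for name, s in shards.items()})
>         sub = [s for name, s in shards.items() if name.startswith("shard1_")]
>         if sub and all(s.get("state") == "active" for s in sub):
>             break
>         time.sleep(30)
> finally:
>     zk.stop()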
>
> -Joe
>
>
>
> On 1/7/2015 4:25 AM, Bram Van Dam wrote:
>>
>> On 01/06/2015 07:54 PM, Erick Erickson wrote:
>>>
>>> Have you considered pre-supposing SolrCloud and using the SPLITSHARD
>>> API command?
>>
>>
>> I think that's the direction we'll probably be going. Index size (at least
>> for us) can be unpredictable in some cases. Some clients start out small and
>> then grow exponentially, while others start big and then don't grow much at
>> all. Starting with SolrCloud would at least give us that flexibility.
>>
>> That being said, SPLITSHARD doesn't seem ideal. If a shard reaches a
>> certain size, it would be better for us to simply add an extra shard,
>> without splitting.
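>>
>> A sketch of what I mean in Python - note that CREATESHARD only works on
>> collections created with the "implicit" router, and the collection and
>> shard names below are just placeholders:
>>
>> import requests
>>
>> SOLR = "http://localhost:8983/solr"
>> resp = requests.get(SOLR + "/admin/collections", params={
>>     "action": "CREATESHARD",
>>     "collection": "logs",        # must use router.name=implicit
>>     "shard": "shard_2015_02",    # e.g. a new shard per month
>>     "wt": "json",
>> })
>> resp.raise_for_status()
>> print(resp.json())
>>
>> New documents would then be sent to that shard explicitly (e.g. with the
>> _route_ parameter) instead of being hashed across the existing shards.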
>>
>>
>>> On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge <peter.stu...@gmail.com>
>>> wrote:
>>>>
>>>> ++1 for the automagic shard creator. We've been looking into doing this
>>>> sort of thing internally - i.e. when a shard reaches a certain size/number
>>>> of docs, it creates 'sub-shards' to which new commits are sent and which
>>>> queries to the 'parent' shard include. The concept works, as long as you
>>>> don't try any non-distributed stuff - it's one reason why all our fields
>>>> are always single-valued.
>>
>>
>> Is there a problem with multi-valued fields and distributed queries?
>>
>>>> A cool side-effect of sub-sharding (for lack of a snappy term) is that
>>>> the parent shard then stops suffering from auto-warming latency due to
>>>> commits (we do a fair amount of committing). In theory, you could carry
>>>> on sub-sharding until your hardware starts gasping for air.
>>
>>
>> Sounds like you're doing something similar to us. In some cases we have a
>> hard commit every minute. Keeping the caches hot seems like a very good
>> reason to send data to a specific shard. At least I'm assuming that when you
>> add documents to a single shard and commit, the other shards won't be
>> impacted...
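>>
>> Something like this is what I have in mind (a sketch, assuming the default
>> compositeId router; "tenantA" is a made-up routing prefix, and docs sharing
>> a prefix all land on the same shard):
>>
>> import requests
>>
>> SOLR = "http://localhost:8983/solr/mycollection"
>> docs = [
>>     {"id": "tenantA!doc-1", "title": "first"},
>>     {"id": "tenantA!doc-2", "title": "second"},
>> ]
>> # requests sends the list as application/json, which /update accepts.
>> requests.post(SOLR + "/update", params={"commit": "true"},
>>               json=docs).raise_for_status()
>>
>> Whether an explicit commit sent that way really leaves the other shards'
>> searchers alone is the part I'd want to verify, though.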
>>
>>  - Bram
>>
>>
>
