Re: Configuring the Distributed

Jamie Johnson Thu, 01 Dec 2011 16:44:37 -0800

Hmm... this doesn't sound like the hashing algorithm that's on the
branch, right?  The algorithm you're describing sounds like it has
some logic that can tell when a particular range should be
distributed between 2 shards instead of 1.  So it seems like a
trade-off between repartitioning the entire index (on every shard) and
having a custom hashing algorithm that can handle the situation where
2 or more shards map to a particular range.
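
To make the trade-off concrete, here is a minimal sketch of the
range-splitting idea as I understand it.  It is not the algorithm on
the branch: the class, the method names, the shard names, and the use
of String.hashCode() as the document-id hash are all assumptions made
for illustration.  Each shard owns a contiguous hash range, and
splitting one shard halves only that range.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from the solrcloud branch.
class HashRangeSplitSketch {

    /** A shard that owns the inclusive hash range [min, max]. */
    static final class Range {
        final String shard;
        final int min;
        final int max;

        Range(String shard, int min, int max) {
            this.shard = shard;
            this.min = min;
            this.max = max;
        }

        boolean contains(int hash) {
            return hash >= min && hash <= max;
        }
    }

    /** Divide the full 32-bit hash space into n contiguous ranges. */
    static List<Range> partition(int n) {
        if (n < 1) throw new IllegalArgumentException("n must be >= 1");
        List<Range> ranges = new ArrayList<>();
        long step = (1L << 32) / n;
        for (int i = 0; i < n; i++) {
            long min = Integer.MIN_VALUE + i * step;
            // The last range absorbs any remainder so the whole space is covered.
            long max = (i == n - 1) ? Integer.MAX_VALUE : min + step - 1;
            ranges.add(new Range("shard" + (i + 1), (int) min, (int) max));
        }
        return ranges;
    }

    /** Return a copy of the list in which the range at index is split in half. */
    static List<Range> split(List<Range> ranges, int index) {
        Range r = ranges.get(index);
        if (r.min == r.max) throw new IllegalArgumentException("range too small to split");
        // Long arithmetic avoids int overflow when min and max are far apart.
        int mid = (int) Math.floorDiv((long) r.min + (long) r.max, 2L);
        List<Range> result = new ArrayList<>(ranges);
        result.set(index, new Range(r.shard + "_a", r.min, mid));
        result.add(index + 1, new Range(r.shard + "_b", mid + 1, r.max));
        return result;
    }

    /** Find the shard whose range contains the hash of the document id. */
    static String route(List<Range> ranges, String docId) {
        int hash = docId.hashCode(); // stand-in hash; the real one is an open assumption
        for (Range r : ranges) {
            if (r.contains(hash)) return r.shard;
        }
        throw new IllegalStateException("no range covers hash " + hash);
    }

    public static void main(String[] args) {
        List<Range> ranges = partition(3);
        String before = route(ranges, "doc-42");
        List<Range> afterSplit = split(ranges, 0);
        String after = route(afterSplit, "doc-42");
        System.out.println(before + " -> " + after);
    }
}
```

With this layout, only documents that hashed to the split shard have
to move; documents on the other shards keep their routing, which is
what avoids repartitioning every index at once.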


On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <markrmil...@gmail.com> wrote:
>
> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>
>> I am not familiar with the index splitter that is in contrib, but I'll
>> take a look at it soon.  So the process sounds like it would be to run
>> this on all of the current shards indexes based on the hash algorithm.
>
> Not something I've thought deeply about myself yet, but I think the idea 
> would be to split as many as you felt you needed to.
>
> If you wanted to keep the full balance always, this would mean splitting 
> every shard at once, yes. But this depends on how many boxes (partitions) you 
> are willing/able to add at a time.
>
> You might just split one index to start - now its hash range would be 
> handled by two shards instead of one (if you have 3 replicas per shard, this 
> would mean adding 3 more boxes). When you needed to expand again, you would 
> split another index that was still handling its full starting range. As you 
> grow, once you split every original index, you'd start again, splitting one 
> of the now half ranges.
>
>> Is there also an index merger in contrib which could be used to merge
>> indexes?  I'm assuming this would be the process?
>
> You can merge with IndexWriter.addIndexes (Solr also has an admin command 
> that can do this). But I'm not sure where this fits in?
>
> - Mark
>
>>
>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>> Not yet - we don't plan on working on this until a lot of other stuff is
>>> working solid at this point. But someone else could jump in!
>>>
>>> There are a couple ways to go about it that I know of:
>>>
>>> A more long term solution may be to start using micro shards - each index
>>> starts as multiple indexes. This makes it pretty fast to move micro shards
>>> around as you decide to change partitions. It's also less flexible as you
>>> are limited by the number of micro shards you start with.
>>>
>>> A more simple and likely first step is to use an index splitter. We
>>> already have one in lucene contrib - we would just need to modify it so
>>> that it splits based on the hash of the document id. This is super
>>> flexible, but splitting will obviously take a little while on a huge index.
>>> The current index splitter is a multi pass splitter - good enough to start
>>> with, but with most files under codec control these days, we may be able to make
>>> a single pass splitter soon as well.
>>>
>>> Eventually you could imagine using both options - micro shards that could
>>> also be split as needed. Though I still wonder if micro shards will be
>>> worth the extra complications myself...
>>>
>>> Right now though, the idea is that you should pick a good number of
>>> partitions to start given your expected data ;) Adding more replicas is
>>> trivial though.
>>>
>>> - Mark
>>>
>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>
>>>> Another question, is there any support for repartitioning of the index
>>>> if a new shard is added?  What is the recommended approach for
>>>> handling this?  It seemed that the hashing algorithm (and probably
>>>> any) would require the index to be repartitioned should a new shard be
>>>> added.
>>>>
>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>> Thanks I will try this first thing in the morning.
>>>>>
>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <markrmil...@gmail.com>
>>>> wrote:
>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <jej2...@gmail.com>
>>>> wrote:
>>>>>>
>>>>>>> I am currently looking at the latest solrcloud branch and was
>>>>>>> wondering if there was any documentation on configuring the
>>>>>>> DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
>>>>>>> to be added/modified to make distributed indexing work?
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi Jamie - take a look at solrconfig-distrib-update.xml in
>>>>>> solr/core/src/test-files
>>>>>>
>>>>>> You need to enable the update log, add an empty replication handler def,
>>>>>> and an update chain with solr.DistributedUpdateProcessorFactory in it.
>>>>>>
>>>>>> --
>>>>>> - Mark
>>>>>>
>>>>>> http://www.lucidimagination.com
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> - Mark
>>>
>>> http://www.lucidimagination.com
>>>
>
> - Mark Miller
> lucidimagination.com
