Hmm... sorry about that - so my first guess is that right now we are not
distributing a commit (easy to add, just have not done it).

Right now I explicitly commit on each server for tests.

Can you try explicitly committing on server 1 after updating the doc on server 2?
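
For example, with the test below, something like this after the second request
(just reusing your 'servers' map - a sketch, not tested):

  servers.get("http://localhost:8983/solr/collection1").commit();

should make the updated doc visible when you query the first server.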

I can start distributing commits tomorrow - been meaning to do it for my own 
convenience anyhow.

Also, you want to pass the sys property numShards=1 on startup. I think it 
defaults to 3. That will give you one leader and one replica.
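
For example, with the stock Jetty example (the exact flags may differ a bit on
the branch - this is roughly what I use, so treat it as a sketch):

  cd example
  java -Dbootstrap_confdir=./solr/conf -DzkRun -DnumShards=1 -jar start.jar

then start the second instance pointing at the same ZooKeeper and it should come
up as a replica of the first shard rather than as a second shard.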

- Mark

On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:

> So I couldn't resist - I attempted to do this tonight. I used the
> solrconfig you mentioned (as is, no modifications), set up a 2 shard
> cluster in collection1, sent 1 doc to 1 of the shards, then updated it
> and sent the update to the other.  I don't see the modifications,
> though; I only see the original document.  The following is the test
> 
> public void update() throws Exception {
> 
>     String key = "1";
> 
>     // index the initial version of the doc directly against the shard on 8983
>     SolrInputDocument solrDoc = new SolrInputDocument();
>     solrDoc.setField("key", key);
>     solrDoc.addField("content", "initial value");
> 
>     SolrServer server = servers
>             .get("http://localhost:8983/solr/collection1");
>     server.add(solrDoc);
> 
>     server.commit();
> 
>     // send an updated version of the same doc to the other shard (7574)
>     // through the distributed update chain
>     solrDoc = new SolrInputDocument();
>     solrDoc.addField("key", key);
>     solrDoc.addField("content", "updated value");
> 
>     server = servers.get("http://localhost:7574/solr/collection1");
> 
>     UpdateRequest ureq = new UpdateRequest();
>     ureq.setParam("update.chain", "distrib-update-chain");
>     ureq.add(solrDoc);
>     ureq.setParam("shards",
>             "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>     ureq.setParam("self", "foo");
>     ureq.setAction(ACTION.COMMIT, true, true);
>     server.request(ureq);
>     System.out.println("done");
> }
> 
> key is my unique field in schema.xml
> 
> What am I doing wrong?
> 
> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>> Yes, the ZK method seems much more flexible.  Adding a new shard would
>> simply be a matter of updating the range assignments in ZK.  Where is this
>> currently on the list of things to accomplish?  I don't have time to
>> work on this now, but if you (or anyone) could provide direction I'd
>> be willing to work on this when I had spare time.  I guess a JIRA
>> detailing where/how to do this could help.  Not sure if the design has
>> been thought out that far though.
>> 
>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>> Right now lets say you have one shard - everything there hashes to range X.
>>> 
>>> Now you want to split that shard with an Index Splitter.
>>> 
>>> You divide range X in two - giving you two ranges - then you start 
>>> splitting. This is where the current Splitter needs a little modification. 
>>> You decide which doc should go into which new index by rehashing each doc 
>>> id in the index you are splitting - if its hash is greater than X/2, it 
>>> goes into index1 - if it's less, index2. I think there are a couple of current 
>>> Splitter impls, but one of them does something like: give me an id - now if 
>>> the ids in the index are above that id, they go to index1; if below, index2. We 
>>> need to instead do a quick hash rather than a simple id compare.
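>>> 
>>> Roughly, the per-doc decision would be something like this (hypothetical 
>>> helper names and whatever hash we end up using for routing - just a sketch, 
>>> not the current Splitter code):
>>> 
>>>   int hash = hash(doc.get("id"));   // rehash the doc's unique id
>>>   if (hash > midpoint) {            // midpoint = X/2, the split point of the old range
>>>     addTo(index1, doc);             // upper half of the range
>>>   } else {
>>>     addTo(index2, doc);             // lower half
>>>   }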
>>> 
>>> Why do you need to do this on every shard?
>>> 
>>> The other part we need that we don't have is storing hash range assignments 
>>> in zookeeper - we don't do that yet because it's not needed yet. Instead we 
>>> currently just calculate that on the fly - too often at the moment, 
>>> on every request :) - but I intend to fix that, of course.
>>> 
>>> At the start, zk would say: for range X, go to this shard. After the split, 
>>> it would say: for ranges less than X/2, go to the old node; for ranges greater 
>>> than X/2, go to the new node.
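>>> 
>>> Just to illustrate the idea - conceptually the assignments are "upper bound 
>>> of a hash range -> shard", and routing a doc is a ceiling lookup on its 
>>> hash (nothing like this is in zk yet, purely a sketch):
>>> 
>>>   // X = upper bound of the original range, hash() = whatever routing hash we use
>>>   TreeMap<Integer, String> ranges = new TreeMap<Integer, String>();
>>>   ranges.put(X / 2, "oldNode");   // lower half stays on the old node
>>>   ranges.put(X, "newNode");       // upper half goes to the new node
>>>   String target = ranges.ceilingEntry(hash(docId)).getValue();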
>>> 
>>> - Mark
>>> 
>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>>> 
>>>> hmmm..... This doesn't sound like the hashing algorithm that's on the
>>>> branch, right?  The algorithm you're mentioning sounds like there is
>>>> some logic which is able to tell that a particular range should be
>>>> distributed between 2 shards instead of 1.  So it seems like a trade-off
>>>> between repartitioning the entire index (on every shard) and having a
>>>> custom hashing algorithm which is able to handle the situation where 2
>>>> or more shards map to a particular range.
>>>> 
>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>> 
>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>>>>> 
>>>>>> I am not familiar with the index splitter that is in contrib, but I'll
>>>>>> take a look at it soon.  So the process sounds like it would be to run
>>>>>> this on all of the current shards' indexes based on the hash algorithm.
>>>>> 
>>>>> Not something I've thought deeply about myself yet, but I think the idea 
>>>>> would be to split as many as you felt you needed to.
>>>>> 
>>>>> If you wanted to keep the full balance always, this would mean splitting 
>>>>> every shard at once, yes. But this depends on how many boxes (partitions) 
>>>>> you are willing/able to add at a time.
>>>>> 
>>>>> You might just split one index to start - now its hash range would be 
>>>>> handled by two shards instead of one (if you have 3 replicas per shard, 
>>>>> this would mean adding 3 more boxes). When you needed to expand again, 
>>>>> you would split another index that was still handling its full starting 
>>>>> range. As you grow, once you split every original index, you'd start 
>>>>> again, splitting one of the now-halved ranges.
>>>>> 
>>>>>> Is there also an index merger in contrib which could be used to merge
>>>>>> indexes?  I'm assuming this would be the process?
>>>>> 
>>>>> You can merge with IndexWriter.addIndexes (Solr also has an admin command 
>>>>> that can do this). But I'm not sure where this fits in?
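>>>>> 
>>>>> For reference, merging is just pointing a writer at the other directories 
>>>>> (the paths and the IndexWriterConfig here are placeholders):
>>>>> 
>>>>>   IndexWriter w = new IndexWriter(FSDirectory.open(new File("merged")), iwConf);
>>>>>   w.addIndexes(FSDirectory.open(new File("indexA")), FSDirectory.open(new File("indexB")));
>>>>>   w.close();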
>>>>> 
>>>>> - Mark
>>>>> 
>>>>>> 
>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <markrmil...@gmail.com> 
>>>>>> wrote:
>>>>>>> Not yet - we don't plan on working on this until a lot of other stuff is
>>>>>>> working solidly. But someone else could jump in!
>>>>>>> 
>>>>>>> There are a couple ways to go about it that I know of:
>>>>>>> 
>>>>>>> A more long-term solution may be to start using micro shards - each
>>>>>>> index starts as multiple indexes. This makes it pretty fast to move
>>>>>>> micro shards around as you decide to change partitions. It's also less
>>>>>>> flexible as you are limited by the number of micro shards you start with.
>>>>>>> 
>>>>>>> A simpler and likely first step is to use an index splitter. We
>>>>>>> already have one in lucene contrib - we would just need to modify it so
>>>>>>> that it splits based on the hash of the document id. This is super
>>>>>>> flexible, but splitting will obviously take a little while on a huge
>>>>>>> index. The current index splitter is a multi-pass splitter - good enough
>>>>>>> to start with, but with most files under codec control these days, we
>>>>>>> may be able to make a single-pass splitter soon as well.
>>>>>>> 
>>>>>>> Eventually you could imagine using both options - micro shards that
>>>>>>> could also be split as needed. Though I still wonder if micro shards
>>>>>>> will be worth the extra complications myself...
>>>>>>> 
>>>>>>> Right now, though, the idea is that you should pick a good number of
>>>>>>> partitions to start with, given your expected data ;) Adding more
>>>>>>> replicas is trivial.
>>>>>>> 
>>>>>>> - Mark
>>>>>>> 
>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Another question: is there any support for repartitioning of the index
>>>>>>>> if a new shard is added?  What is the recommended approach for
>>>>>>>> handling this?  It seems that the hashing algorithm (and probably
>>>>>>>> any) would require the index to be repartitioned should a new shard be
>>>>>>>> added.
>>>>>>>> 
>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <jej2...@gmail.com> 
>>>>>>>> wrote:
>>>>>>>>> Thanks I will try this first thing in the morning.
>>>>>>>>> 
>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <markrmil...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <jej2...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> I am currently looking at the latest solrcloud branch and was
>>>>>>>>>>> wondering if there was any documentation on configuring the
>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in solrconfig.xml 
>>>>>>>>>>> needs
>>>>>>>>>>> to be added/modified to make distributed indexing work?
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Hi Jamie - take a look at solrconfig-distrib-update.xml in
>>>>>>>>>> solr/core/src/test-files
>>>>>>>>>> 
>>>>>>>>>> You need to enable the update log, add an empty replication handler 
>>>>>>>>>> def,
>>>>>>>>>> and an update chain with solr.DistributedUpdateProcessFactory in it.
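>>>>>>>>>> 
>>>>>>>>>> Roughly along these lines (a sketch from memory - check the test file
>>>>>>>>>> for the exact element and class names on the branch):
>>>>>>>>>> 
>>>>>>>>>>   <updateHandler class="solr.DirectUpdateHandler2">
>>>>>>>>>>     <updateLog>
>>>>>>>>>>       <str name="dir">${solr.data.dir:}</str>
>>>>>>>>>>     </updateLog>
>>>>>>>>>>   </updateHandler>
>>>>>>>>>> 
>>>>>>>>>>   <requestHandler name="/replication" class="solr.ReplicationHandler" />
>>>>>>>>>> 
>>>>>>>>>>   <updateRequestProcessorChain name="distrib-update-chain">
>>>>>>>>>>     <processor class="solr.LogUpdateProcessorFactory" />
>>>>>>>>>>     <processor class="solr.DistributedUpdateProcessFactory" />
>>>>>>>>>>     <processor class="solr.RunUpdateProcessorFactory" />
>>>>>>>>>>   </updateRequestProcessorChain>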
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> - Mark
>>>>>>>>>> 
>>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> - Mark
>>>>>>> 
>>>>>>> http://www.lucidimagination.com
>>>>>>> 
>>>>> 
>>>>> - Mark Miller
>>>>> lucidimagination.com
>>>>> 
>>> 
>>> - Mark Miller
>>> lucidimagination.com
>>> 
>> 

- Mark Miller
lucidimagination.com