So I'm not sure. You are running a ZooKeeper server and running in zk mode, right?

You don't need to (and shouldn't) set the shards or self params. The shards are
figured out from ZooKeeper.

You always want to use the distrib-update-chain. Eventually it will
probably be part of the default chain and turn on automatically in zk mode.

If you are running in zk mode attached to a zk server, this should work no
problem. You can add docs to any server and they will be forwarded to the
correct shard leader and then versioned and forwarded to replicas.

You can also use the CloudSolrServer SolrJ client - that way you don't even
have to choose a server to send docs to (if the one you chose went down, you
would have to pick another manually); CloudSolrServer automatically finds one
that is up through ZooKeeper. Eventually it will also be smart enough to do the
hashing itself, so that it can send directly to the shard leader the doc would
be forwarded to anyway.
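
A rough sketch of what using it looks like (the ZooKeeper address and
collection name below are just examples - substitute your own; exception
handling omitted):

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Point the client at ZooKeeper rather than at any particular Solr instance.
    CloudSolrServer cloudServer = new CloudSolrServer("localhost:9983");
    cloudServer.setDefaultCollection("collection1");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("key", "1");
    doc.addField("content_mvtxt", "initial value");

    // The client picks a live node from the cluster state in ZooKeeper; that node
    // then forwards the update to the correct shard leader.
    cloudServer.add(doc);
    cloudServer.commit();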

- Mark

On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson <jej2...@gmail.com> wrote:

> Really just trying to do a simple add-and-update test; the missing chain
> was just proof of my not understanding exactly how this is supposed to
> work.  I modified the code to this:
>
>         String key = "1";
>
>         SolrInputDocument solrDoc = new SolrInputDocument();
>         solrDoc.setField("key", key);
>         solrDoc.addField("content_mvtxt", "initial value");
>
>         SolrServer server = servers.get("http://localhost:8983/solr/collection1");
>
>         UpdateRequest ureq = new UpdateRequest();
>         ureq.setParam("update.chain", "distrib-update-chain");
>         ureq.add(solrDoc);
>         ureq.setParam("shards",
>                 "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>         ureq.setParam("self", "foo");
>         ureq.setAction(ACTION.COMMIT, true, true);
>         server.request(ureq);
>         server.commit();
>
>         solrDoc = new SolrInputDocument();
>         solrDoc.addField("key", key);
>         solrDoc.addField("content_mvtxt", "updated value");
>
>         server = servers.get("http://localhost:7574/solr/collection1");
>
>         ureq = new UpdateRequest();
>         ureq.setParam("update.chain", "distrib-update-chain");
>         // ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
>         ureq.add(solrDoc);
>         ureq.setParam("shards",
>                 "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
>         ureq.setParam("self", "foo");
>         ureq.setAction(ACTION.COMMIT, true, true);
>         server.request(ureq);
>         // server.add(solrDoc);
>         server.commit();
>
>         server = servers.get("http://localhost:8983/solr/collection1");
>         server.commit();
>
>         System.out.println("done");
>
> but I'm still seeing the doc appear on both shards.  After the first
> commit I see the doc on 8983 with "initial value".  After the second
> commit I see the updated value on 7574 and the old value on 8983.  After
> the final commit the doc on 8983 gets updated.
>
> Is there something wrong with my test?
>
> On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller <markrmil...@gmail.com>
> wrote:
> > Getting late - didn't really pay attention to your code I guess - why
> > are you adding the first doc without specifying the distrib update chain?
> > This is not really supported. It's just going to go to the server you
> > specified - even with everything set up right, the update might then go to
> > that same server or the other one depending on how it hashes. You really
> > want to just always use the distrib update chain. I guess I don't yet
> > understand what you are trying to test.
> >
> > Sent from my iPad
> >
> > On Dec 1, 2011, at 10:57 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >
> >> Not sure offhand - but things will be funky if you don't specify the
> correct numShards.
> >>
> >> The instance-to-shard assignment should use numShards. But the
> >> hash-to-shard mapping actually goes by the number of shards it finds
> >> registered in ZK (it doesn't have to, but really these should be equal).
> >>
> >> So basically you are saying, I want 3 partitions, but you are only
> starting up 2 nodes, and the code is just not happy about that I'd guess.
> For the system to work properly, you have to fire up at least as many
> servers as numShards.
> >>
> >> What are you trying to do? 2 partitions with no replicas, or one
> partition with one replica?
> >>
> >> In either case, I think you will have better luck if you fire up at
> least as many servers as the numShards setting. Or lower the numShards
> setting.
> >>
> >> This is all a work in progress by the way - what you are trying to test
> >> should work if things are set up right though.
> >>
> >> - Mark
> >>
> >>
> >> On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:
> >>
> >>> Thanks for the quick response.  With that change (I have not set
> >>> numShards yet) shard1 got updated.  But now when executing the
> >>> following queries I get information back from both, which doesn't seem
> >>> right:
> >>>
> >>> http://localhost:7574/solr/select/?q=*:*
> >>> <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
> >>>
> >>> http://localhost:8983/solr/select?q=*:*
> >>> <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
> >>>
> >>>
> >>>
> >>> On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller <markrmil...@gmail.com>
> wrote:
> >>>> Hmm... sorry about that - so my first guess is that right now we are
> >>>> not distributing a commit (easy to add, just have not done it).
> >>>>
> >>>> Right now I explicitly commit on each server for tests.
> >>>>
> >>>> Can you try explicitly committing on server1 after updating the doc
> on server 2?
> >>>>
> >>>> I can start distributing commits tomorrow - been meaning to do it for
> my own convenience anyhow.
> >>>>
> >>>> Also, you want to pass the sys property numShards=1 on startup. I
> think it defaults to 3. That will give you one leader and one replica.
> >>>>
> >>>> - Mark
> >>>>
> >>>> On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
> >>>>
> >>>>> So I couldn't resist and attempted to do this tonight. I used the
> >>>>> solrconfig you mentioned (as is, no modifications), set up a 2-shard
> >>>>> cluster in collection1, sent 1 doc to one of the shards, then updated
> >>>>> it and sent the update to the other.  I don't see the modifications,
> >>>>> though; I only see the original document.  The following is the test:
> >>>>>
> >>>>> public void update() throws Exception {
> >>>>>     String key = "1";
> >>>>>
> >>>>>     SolrInputDocument solrDoc = new SolrInputDocument();
> >>>>>     solrDoc.setField("key", key);
> >>>>>     solrDoc.addField("content", "initial value");
> >>>>>
> >>>>>     SolrServer server = servers.get("http://localhost:8983/solr/collection1");
> >>>>>     server.add(solrDoc);
> >>>>>     server.commit();
> >>>>>
> >>>>>     solrDoc = new SolrInputDocument();
> >>>>>     solrDoc.addField("key", key);
> >>>>>     solrDoc.addField("content", "updated value");
> >>>>>
> >>>>>     server = servers.get("http://localhost:7574/solr/collection1");
> >>>>>
> >>>>>     UpdateRequest ureq = new UpdateRequest();
> >>>>>     ureq.setParam("update.chain", "distrib-update-chain");
> >>>>>     ureq.add(solrDoc);
> >>>>>     ureq.setParam("shards",
> >>>>>             "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
> >>>>>     ureq.setParam("self", "foo");
> >>>>>     ureq.setAction(ACTION.COMMIT, true, true);
> >>>>>     server.request(ureq);
> >>>>>     System.out.println("done");
> >>>>> }
> >>>>>
> >>>>> key is my unique field in schema.xml
> >>>>>
> >>>>> What am I doing wrong?
> >>>>>
> >>>>> On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <jej2...@gmail.com>
> wrote:
> >>>>>> Yes, the ZK method seems much more flexible.  Adding a new shard
> >>>>>> would simply be a matter of updating the range assignments in ZK.
> >>>>>> Where is this currently on the list of things to accomplish?  I don't
> >>>>>> have time to work on this now, but if you (or anyone) could provide
> >>>>>> direction I'd be willing to work on this when I have spare time.  I
> >>>>>> guess a JIRA detailing where/how to do this could help.  Not sure if
> >>>>>> the design has been thought out that far though.
> >>>>>>
> >>>>>> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <markrmil...@gmail.com>
> wrote:
> >>>>>>> Right now let's say you have one shard - everything there hashes to range X.
> >>>>>>>
> >>>>>>> Now you want to split that shard with an Index Splitter.
> >>>>>>>
> >>>>>>> You divide range X in two - giving you two ranges - then you start
> >>>>>>> splitting. This is where the current Splitter needs a little modification.
> >>>>>>> You decide which doc should go into which new index by rehashing each doc
> >>>>>>> id in the index you are splitting - if its hash is greater than X/2, it
> >>>>>>> goes into index1; if it's less, index2. I think there are a couple of
> >>>>>>> current Splitter impls, but one of them does something like: give me an
> >>>>>>> id - now if the ids in the index are above that id, go to index1; if
> >>>>>>> below, index2. We need to instead do a quick hash rather than a simple
> >>>>>>> id compare.
> >>>>>>>
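To make that split rule concrete, something like this (a rough, hypothetical
sketch - not the actual contrib Splitter code; hash() stands in for whatever
hash function the distributed update code uses for routing):

    // Hypothetical per-document split decision. Assumes the shard currently owns
    // the hash range [rangeStart, rangeEnd] and we cut it at the midpoint.
    boolean belongsToFirstHalf(String docId, int rangeStart, int rangeEnd) {
        int hash = hash(docId); // placeholder for the routing hash
        int mid = rangeStart + (rangeEnd - rangeStart) / 2;
        return hash <= mid;     // first half -> new index1, second half -> new index2
    }
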
> >>>>>>> Why do you need to do this on every shard?
> >>>>>>>
> >>>>>>> The other part we need that we don't have is to store hash range
> >>>>>>> assignments in ZooKeeper - we don't do that yet because it's not needed
> >>>>>>> yet. Instead we currently just calculate that on the fly (too often at
> >>>>>>> the moment - on every request :) I intend to fix that of course).
> >>>>>>>
> >>>>>>> At the start, ZK would say: for range X, go to this shard. After the
> >>>>>>> split, it would say: for hashes less than X/2 go to the old node, for
> >>>>>>> hashes greater than X/2 go to the new node.
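
The lookup against those stored assignments could be as simple as something
like this (purely hypothetical - nothing like it is stored in ZK yet, and the
shard names are made up):

    import java.util.SortedMap;
    import java.util.TreeMap;

    // Maps the upper bound of each hash range to the shard that owns it.
    // After the split above you might have: mid -> "shard1", rangeEnd -> "shard1_0".
    SortedMap<Integer, String> upperBoundToShard = new TreeMap<Integer, String>();

    String shardFor(int docHash) {
        // the smallest upper bound that is >= the doc's hash owns that doc
        return upperBoundToShard.get(upperBoundToShard.tailMap(docHash).firstKey());
    }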
> >>>>>>>
> >>>>>>> - Mark
> >>>>>>>
> >>>>>>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
> >>>>>>>
> >>>>>>>> Hmm... this doesn't sound like the hashing algorithm that's on the
> >>>>>>>> branch, right?  The algorithm you're mentioning sounds like there is
> >>>>>>>> some logic which is able to tell that a particular range should be
> >>>>>>>> distributed between 2 shards instead of 1.  So it seems like a
> >>>>>>>> trade-off between repartitioning the entire index (on every shard) and
> >>>>>>>> having a custom hashing algorithm which is able to handle the situation
> >>>>>>>> where 2 or more shards map to a particular range.
> >>>>>>>>
> >>>>>>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <
> markrmil...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
> >>>>>>>>>
> >>>>>>>>>> I am not familiar with the index splitter that is in contrib, but
> >>>>>>>>>> I'll take a look at it soon.  So the process sounds like it would be
> >>>>>>>>>> to run this on all of the current shards' indexes based on the hash
> >>>>>>>>>> algorithm.
> >>>>>>>>>
> >>>>>>>>> Not something I've thought deeply about myself yet, but I think
> the idea would be to split as many as you felt you needed to.
> >>>>>>>>>
> >>>>>>>>> If you wanted to keep the full balance always, this would mean
> splitting every shard at once, yes. But this depends on how many boxes
> (partitions) you are willing/able to add at a time.
> >>>>>>>>>
> >>>>>>>>> You might just split one index to start - now its hash range would
> >>>>>>>>> be handled by two shards instead of one (if you have 3 replicas per
> >>>>>>>>> shard, this would mean adding 3 more boxes). When you needed to expand
> >>>>>>>>> again, you would split another index that was still handling its full
> >>>>>>>>> starting range. As you grow, once you split every original index, you'd
> >>>>>>>>> start again, splitting one of the now-half ranges.
> >>>>>>>>>
> >>>>>>>>>> Is there also an index merger in contrib which could be used to
> merge
> >>>>>>>>>> indexes?  I'm assuming this would be the process?
> >>>>>>>>>
> >>>>>>>>> You can merge with IndexWriter.addIndexes (Solr also has an
> admin command that can do this). But I'm not sure where this fits in?
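
In case it helps, a bare-bones sketch of that kind of Lucene-level merge (the
paths are placeholders, and the IndexWriterConfig details will depend on the
Lucene version you're building against):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    // Merge two existing indexes into a third, empty one.
    Directory target = FSDirectory.open(new File("/path/to/merged"));
    IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_40,
            new StandardAnalyzer(Version.LUCENE_40));
    IndexWriter writer = new IndexWriter(target, conf);
    writer.addIndexes(
            FSDirectory.open(new File("/path/to/indexA")),
            FSDirectory.open(new File("/path/to/indexB")));
    writer.close();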
> >>>>>>>>>
> >>>>>>>>> - Mark
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <
> markrmil...@gmail.com> wrote:
> >>>>>>>>>>> Not yet - we don't plan on working on this until a lot of
> other stuff is
> >>>>>>>>>>> working solid at this point. But someone else could jump in!
> >>>>>>>>>>>
> >>>>>>>>>>> There are a couple ways to go about it that I know of:
> >>>>>>>>>>>
> >>>>>>>>>>> A longer-term solution may be to start using micro shards - each
> >>>>>>>>>>> index starts as multiple indexes. This makes it pretty fast to move
> >>>>>>>>>>> micro shards around as you decide to change partitions. It's also
> >>>>>>>>>>> less flexible, as you are limited by the number of micro shards you
> >>>>>>>>>>> start with.
> >>>>>>>>>>>
> >>>>>>>>>>> A simpler and likely first step is to use an index splitter. We
> >>>>>>>>>>> already have one in Lucene contrib - we would just need to modify it
> >>>>>>>>>>> so that it splits based on the hash of the document id. This is super
> >>>>>>>>>>> flexible, but splitting will obviously take a little while on a huge
> >>>>>>>>>>> index. The current index splitter is a multi-pass splitter - good
> >>>>>>>>>>> enough to start with, but with most files under codec control these
> >>>>>>>>>>> days, we may be able to make a single-pass splitter soon as well.
> >>>>>>>>>>>
> >>>>>>>>>>> Eventually you could imagine using both options - micro shards
> that could
> >>>>>>>>>>> also be split as needed. Though I still wonder if micro shards
> will be
> >>>>>>>>>>> worth the extra complications myself...
> >>>>>>>>>>>
> >>>>>>>>>>> Right now, though, the idea is that you should pick a good number
> >>>>>>>>>>> of partitions to start with, given your expected data ;) Adding more
> >>>>>>>>>>> replicas is trivial though.
> >>>>>>>>>>>
> >>>>>>>>>>> - Mark
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <
> jej2...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Another question, is there any support for repartitioning of
> the index
> >>>>>>>>>>>> if a new shard is added?  What is the recommended approach for
> >>>>>>>>>>>> handling this?  It seemed that the hashing algorithm (and
> probably
> >>>>>>>>>>>> any) would require the index to be repartitioned should a new
> shard be
> >>>>>>>>>>>> added.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <
> jej2...@gmail.com> wrote:
> >>>>>>>>>>>>> Thanks I will try this first thing in the morning.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <
> markrmil...@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <
> jej2...@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I am currently looking at the latest solrcloud branch and
> was
> >>>>>>>>>>>>>>> wondering if there was any documentation on configuring the
> >>>>>>>>>>>>>>> DistributedUpdateProcessor?  What specifically in
> solrconfig.xml needs
> >>>>>>>>>>>>>>> to be added/modified to make distributed indexing work?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Jamie - take a look at solrconfig-distrib-update.xml in
> >>>>>>>>>>>>>> solr/core/src/test-files
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> You need to enable the update log, add an empty replication
> handler def,
> >>>>>>>>>>>>>> and an update chain with
> solr.DistributedUpdateProcessFactory in it.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> --
> >>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> http://www.lucidimagination.com
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> - Mark
> >>>>>>>>>>>
> >>>>>>>>>>> http://www.lucidimagination.com
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> - Mark Miller
> >>>>>>>>> lucidimagination.com
> >>>>>>>
> >>>>>>> - Mark Miller
> >>>>>>> lucidimagination.com
> >>>>>>
> >>>>
> >>>> - Mark Miller
> >>>> lucidimagination.com
> >>
> >> - Mark Miller
> >> lucidimagination.com
> >
>



-- 
- Mark

http://www.lucidimagination.com
