Re: Configuring the Distributed
What does the version field need to look like? Something like this?

<field name="_version_" type="string" indexed="true" stored="true" required="true" />

On Sun, Dec 4, 2011 at 2:00 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller markrmil...@gmail.com wrote: You always want to use the distrib-update-chain. Eventually it will probably be part of the default chain and auto turn on in zk mode. I'm working on this now... -Yonik http://www.lucidimagination.com
Re: Configuring the Distributed
On Mon, Dec 5, 2011 at 9:21 AM, Jamie Johnson jej2...@gmail.com wrote: What does the version field need to look like? It's in the example schema:

<field name="_version_" type="long" indexed="true" stored="true"/>

-Yonik http://www.lucidimagination.com
Re: Configuring the Distributed
Thanks Yonik, I must have just missed it. A question about adding a new shard to the index: I am definitely not a hashing expert, but the goal is to have a uniform distribution of buckets based on what we're hashing. If that happens, then our shards would reach capacity at approximately the same time. In this situation I don't think splitting one shard would help us; we'd need to split every shard to reduce the load on the burdened systems, right?

On Mon, Dec 5, 2011 at 9:45 AM, Yonik Seeley yo...@lucidimagination.com wrote: On Mon, Dec 5, 2011 at 9:21 AM, Jamie Johnson jej2...@gmail.com wrote: What does the version field need to look like? It's in the example schema: <field name="_version_" type="long" indexed="true" stored="true"/> -Yonik http://www.lucidimagination.com
Re: Configuring the Distributed
On Mon, Dec 5, 2011 at 1:29 PM, Jamie Johnson jej2...@gmail.com wrote: In this situation I don't think splitting one shard would help us we'd need to split every shard to reduce the load on the burdened systems right? Sure... but if you can split one, you can split them all :-) -Yonik http://www.lucidimagination.com
Re: Configuring the Distributed
Yes completely agree, just wanted to make sure I wasn't missing the obvious :) On Mon, Dec 5, 2011 at 1:39 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Mon, Dec 5, 2011 at 1:29 PM, Jamie Johnson jej2...@gmail.com wrote: In this situation I don't think splitting one shard would help us we'd need to split every shard to reduce the load on the burdened systems right? Sure... but if you can split one, you can split them all :-) -Yonik http://www.lucidimagination.com
Re: Configuring the Distributed
On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller markrmil...@gmail.com wrote: On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson jej2...@gmail.com wrote: I am currently looking at the latest solrcloud branch and was wondering if there was any documentation on configuring the DistributedUpdateProcessor? What specifically in solrconfig.xml needs to be added/modified to make distributed indexing work? Hi Jaime - take a look at solrconfig-distrib-update.xml in solr/core/src/test-files You need to enable the update log, add an empty replication handler def, and an update chain with solr.DistributedUpdateProcessFactory in it. One also needs an indexed _version_ field defined in schema.xml for versioning to work. -Yonik http://www.lucidimagination.com
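Putting those pieces together, a minimal sketch of the configuration being described might look like the following. This is an illustration only: the element layout, the processor order, the handler name, and the update-log dir value are assumptions based on the messages above and the test config they point to, and the fully spelled-out processor class name should be checked against solrconfig-distrib-update.xml on the branch.

<!-- solrconfig.xml: enable the update log (location value is an assumption) -->
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.data.dir:}</str>
  </updateLog>
</updateHandler>

<!-- solrconfig.xml: empty replication handler definition, used for recovery -->
<requestHandler name="/replication" class="solr.ReplicationHandler" />

<!-- solrconfig.xml: update chain containing the distributed update processor -->
<updateRequestProcessorChain name="distrib-update-chain">
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- schema.xml: indexed version field used for distributed versioning -->
<field name="_version_" type="long" indexed="true" stored="true" />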
Re: Configuring the Distributed
On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller markrmil...@gmail.com wrote: You always want to use the distrib-update-chain. Eventually it will probably be part of the default chain and auto turn in zk mode. I'm working on this now... -Yonik http://www.lucidimagination.com
Re: Configuring the Distributed
bq. A few questions if a master goes down does a replica get promoted? Right - if the leader goes down there is a leader election and one of the replicas takes over. bq. If a new shard needs to be added is it just a matter of starting a new solr instance with a higher numShards? Eventually, that's the plan. The idea is, you say something like, I want 3 shards. Now if you start up 9 instances, the first 3 end up as shard leaders - the next 6 evenly come up as replicas for each shard. To change the numShards, we will need some kind of micro shards / splitting / rebalancing. bq. Last question, how do you change numShards? I think this is somewhat a work in progress, but I think Sami just made it so that numShards is stored on the collection node in zk (along with which config set to use). So you would change it there presumably. Or perhaps just start up a new server with an update numShards property and then it would realize that needs to be a new leader - of course then you'd want to rebalance probably - unless you fired up enough servers to add replicas too... bq. Is that right? Yup, sounds about right. On Fri, Dec 2, 2011 at 10:59 PM, Jamie Johnson jej2...@gmail.com wrote: So I just tried this out, seems like it does the things I asked about. Really really cool stuff, it's progressed quite a bit in the time since I took a snapshot of the branch. Last question, how do you change numShards? Is there a command you can use to do this now? I understand there will be implications for the hashing algorithm, but once the hash ranges are stored in ZK (is there a separate JIRA for this or does this fall under 2358) I assume that it would be a relatively simple index split (JIRA 2595?) and updating the hash ranges in solr, essentially splitting the range between the new and existing shard. Is that right? On Fri, Dec 2, 2011 at 10:08 PM, Jamie Johnson jej2...@gmail.com wrote: I think I see it.so if I understand this correctly you specify numShards as a system property, as new nodes come up they check ZK to see if they should be a new shard or a replica based on if numShards is met. A few questions if a master goes down does a replica get promoted? If a new shard needs to be added is it just a matter of starting a new solr instance with a higher numShards? (understanding that index rebalancing does not happen automatically now, but presumably it could). On Fri, Dec 2, 2011 at 9:56 PM, Jamie Johnson jej2...@gmail.com wrote: How does it determine the number of shards to create? How many replicas to create? On Fri, Dec 2, 2011 at 4:30 PM, Mark Miller markrmil...@gmail.com wrote: Ah, okay - you are setting the shards in solr.xml - thats still an option to force a node to a particular shard - but if you take that out, shards will be auto assigned. By the way, because of the version code, distrib deletes don't work at the moment - will get to that next week. - Mark On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson jej2...@gmail.com wrote: So I'm a fool. I did set the numShards, the issue was so trivial it's embarrassing. I did indeed have it setup as a replica, the shard names in solr.xml were both shard1. This worked as I expected now. On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller markrmil...@gmail.com wrote: They are unused params, so removing them wouldn't help anything. You might just want to wait till we are further along before playing with it. Or if you submit your full self contained test, I can see what's going on (eg its still unclear if you have started setting numShards?). 
I can do a similar set of actions in my tests and it works fine. The only reason I could see things working like this is if it thinks you have one shard - a leader and a replica. - Mark On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote: Glad to hear I don't need to set shards/self, but removing them didn't seem to change what I'm seeing. Doing this still results in 2 documents 1 on 8983 and 1 on 7574. String key = 1; SolrInputDocument solrDoc = new SolrInputDocument(); solrDoc.setField(key, key); solrDoc.addField(content_mvtxt, initial value); SolrServer server = servers.get( http://localhost:8983/solr/collection1;); UpdateRequest ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); server.commit(); solrDoc = new SolrInputDocument(); solrDoc.addField(key, key); solrDoc.addField(content_mvtxt, updated value); server = servers.get( http://localhost:7574/solr/collection1;); ureq = new UpdateRequest();
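As a rough illustration of the startup behavior Mark describes (the first numShards nodes become leaders, later ones come up as replicas), the example jetty invocation would look something like this. The property names are taken from the SolrCloud wiki of the time, not from this thread, so treat them as assumptions:

# first node: runs an embedded ZooKeeper, uploads the config set, and declares three shards
java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -DnumShards=3 -jar start.jar

# additional nodes: point at ZooKeeper; they fill the remaining leader slots first,
# then come up as replicas once numShards leaders exist
java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar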
Re: Configuring the Distributed
Again, great stuff. Once distributed update/delete works (sounds like it's not far off) I'll have to reevaluate our current stack. You had mentioned storing the shard hash assignments in ZK, is there a JIRA around this? I'll keep my eyes on the JIRA tickets. Right now the distributed update/delete are the big ones for me; the rebalancing of the cluster is a nice-to-have, but hopefully I wouldn't need that capability anytime in the near future. One other question: my current setup has replication done on a polling setup, and I didn't notice that in the updated solrconfig - how does this work now?

On Sat, Dec 3, 2011 at 9:00 AM, Mark Miller markrmil...@gmail.com wrote: bq. A few questions if a master goes down does a replica get promoted? Right - if the leader goes down there is a leader election and one of the replicas takes over. bq. If a new shard needs to be added is it just a matter of starting a new solr instance with a higher numShards? Eventually, that's the plan. The idea is, you say something like, I want 3 shards. Now if you start up 9 instances, the first 3 end up as shard leaders - the next 6 evenly come up as replicas for each shard. To change the numShards, we will need some kind of micro shards / splitting / rebalancing. bq. Last question, how do you change numShards? I think this is somewhat a work in progress, but I think Sami just made it so that numShards is stored on the collection node in zk (along with which config set to use). So you would change it there presumably. Or perhaps just start up a new server with an updated numShards property and then it would realize that it needs to be a new leader - of course then you'd want to rebalance probably - unless you fired up enough servers to add replicas too... bq. Is that right? Yup, sounds about right.

On Fri, Dec 2, 2011 at 10:59 PM, Jamie Johnson jej2...@gmail.com wrote: So I just tried this out, seems like it does the things I asked about. Really really cool stuff, it's progressed quite a bit in the time since I took a snapshot of the branch. Last question, how do you change numShards? Is there a command you can use to do this now? I understand there will be implications for the hashing algorithm, but once the hash ranges are stored in ZK (is there a separate JIRA for this or does this fall under 2358) I assume that it would be a relatively simple index split (JIRA 2595?) and updating the hash ranges in solr, essentially splitting the range between the new and existing shard. Is that right?

On Fri, Dec 2, 2011 at 10:08 PM, Jamie Johnson jej2...@gmail.com wrote: I think I see it. So if I understand this correctly, you specify numShards as a system property; as new nodes come up they check ZK to see if they should be a new shard or a replica based on whether numShards is met. A few questions: if a master goes down, does a replica get promoted? If a new shard needs to be added, is it just a matter of starting a new solr instance with a higher numShards? (Understanding that index rebalancing does not happen automatically now, but presumably it could.)

On Fri, Dec 2, 2011 at 9:56 PM, Jamie Johnson jej2...@gmail.com wrote: How does it determine the number of shards to create? How many replicas to create?

On Fri, Dec 2, 2011 at 4:30 PM, Mark Miller markrmil...@gmail.com wrote: Ah, okay - you are setting the shards in solr.xml - that's still an option to force a node to a particular shard - but if you take that out, shards will be auto assigned.
By the way, because of the version code, distrib deletes don't work at the moment - will get to that next week. - Mark On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson jej2...@gmail.com wrote: So I'm a fool. I did set the numShards, the issue was so trivial it's embarrassing. I did indeed have it setup as a replica, the shard names in solr.xml were both shard1. This worked as I expected now. On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller markrmil...@gmail.com wrote: They are unused params, so removing them wouldn't help anything. You might just want to wait till we are further along before playing with it. Or if you submit your full self contained test, I can see what's going on (eg its still unclear if you have started setting numShards?). I can do a similar set of actions in my tests and it works fine. The only reason I could see things working like this is if it thinks you have one shard - a leader and a replica. - Mark On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote: Glad to hear I don't need to set shards/self, but removing them didn't seem to change what I'm seeing. Doing this still results in 2 documents 1 on 8983 and 1 on 7574. String key = 1; SolrInputDocument solrDoc = new SolrInputDocument(); solrDoc.setField(key, key); solrDoc.addField(content_mvtxt, initial
Re: Configuring the Distributed
On Sat, Dec 3, 2011 at 1:31 PM, Jamie Johnson jej2...@gmail.com wrote: Again, great stuff. Once distributed update/delete works (sounds like it's not far off)

Yeah, I only realized it was not working with the Version code on Friday as I started adding tests for it - the work to fix it is not too difficult.

I'll have to reevaluate our current stack. You had mentioned storing the shard hash assignments in ZK, is there a JIRA around this?

I don't think there is specifically one for that yet - though it could just be part of the index splitting JIRA issue, I suppose.

I'll keep my eyes on the JIRA tickets. Right now the distributed update/delete are the big ones for me; the rebalancing of the cluster is a nice-to-have, but hopefully I wouldn't need that capability anytime in the near future. One other question: my current setup has replication done on a polling setup, and I didn't notice that in the updated solrconfig - how does this work now?

Replication is only used for recovery, because it doesn't work with Near Realtime. We want SolrCloud to work with NRT, so currently the leader versions documents and forwards them to the replicas (the versions will let us do optimistic locking as well). If you send a doc to a replica instead, it's first forwarded to the leader to get versioned. The SolrCloud solrj client will likely be smart enough to just send to the leader first. You need the replication handler defined for recovery - when a replica goes down and then comes back up, it starts buffering updates and replicates from the leader - then it applies the buffered updates and ends up current with the leader. - Mark

On Sat, Dec 3, 2011 at 9:00 AM, Mark Miller markrmil...@gmail.com wrote: bq. A few questions if a master goes down does a replica get promoted? Right - if the leader goes down there is a leader election and one of the replicas takes over. bq. If a new shard needs to be added is it just a matter of starting a new solr instance with a higher numShards? Eventually, that's the plan. The idea is, you say something like, I want 3 shards. Now if you start up 9 instances, the first 3 end up as shard leaders - the next 6 evenly come up as replicas for each shard. To change the numShards, we will need some kind of micro shards / splitting / rebalancing. bq. Last question, how do you change numShards? I think this is somewhat a work in progress, but I think Sami just made it so that numShards is stored on the collection node in zk (along with which config set to use). So you would change it there presumably. Or perhaps just start up a new server with an updated numShards property and then it would realize that it needs to be a new leader - of course then you'd want to rebalance probably - unless you fired up enough servers to add replicas too... bq. Is that right? Yup, sounds about right.

On Fri, Dec 2, 2011 at 10:59 PM, Jamie Johnson jej2...@gmail.com wrote: So I just tried this out, seems like it does the things I asked about. Really really cool stuff, it's progressed quite a bit in the time since I took a snapshot of the branch. Last question, how do you change numShards? Is there a command you can use to do this now? I understand there will be implications for the hashing algorithm, but once the hash ranges are stored in ZK (is there a separate JIRA for this or does this fall under 2358) I assume that it would be a relatively simple index split (JIRA 2595?) and updating the hash ranges in solr, essentially splitting the range between the new and existing shard. Is that right?
On Fri, Dec 2, 2011 at 10:08 PM, Jamie Johnson jej2...@gmail.com wrote: I think I see it.so if I understand this correctly you specify numShards as a system property, as new nodes come up they check ZK to see if they should be a new shard or a replica based on if numShards is met. A few questions if a master goes down does a replica get promoted? If a new shard needs to be added is it just a matter of starting a new solr instance with a higher numShards? (understanding that index rebalancing does not happen automatically now, but presumably it could). On Fri, Dec 2, 2011 at 9:56 PM, Jamie Johnson jej2...@gmail.com wrote: How does it determine the number of shards to create? How many replicas to create? On Fri, Dec 2, 2011 at 4:30 PM, Mark Miller markrmil...@gmail.com wrote: Ah, okay - you are setting the shards in solr.xml - thats still an option to force a node to a particular shard - but if you take that out, shards will be auto assigned. By the way, because of the version code, distrib deletes don't work at the moment - will get to that next week. - Mark On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson jej2...@gmail.com wrote: So I'm a fool. I did set the numShards, the issue was so trivial it's embarrassing. I
Re: Configuring the Distributed
So I dunno. You are running a zk server and running in zk mode right? You don't need to / shouldn't set a shards or self param. The shards are figured out from Zookeeper. You always want to use the distrib-update-chain. Eventually it will probably be part of the default chain and auto turn in zk mode. If you are running in zk mode attached to a zk server, this should work no problem. You can add docs to any server and they will be forwarded to the correct shard leader and then versioned and forwarded to replicas. You can also use the CloudSolrServer solrj client - that way you don't even have to choose a server to send docs too - in which case if it went down you would have to choose another manually - CloudSolrServer automatically finds one that is up through ZooKeeper. Eventually it will also be smart and do the hashing itself so that it can send directly to the shard leader that the doc would be forwarded to anyway. - Mark On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson jej2...@gmail.com wrote: Really just trying to do a simple add and update test, the chain missing is just proof of my not understanding exactly how this is supposed to work. I modified the code to this String key = 1; SolrInputDocument solrDoc = new SolrInputDocument(); solrDoc.setField(key, key); solrDoc.addField(content_mvtxt, initial value); SolrServer server = servers .get( http://localhost:8983/solr/collection1;); UpdateRequest ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setParam(shards, localhost:8983/solr/collection1,localhost:7574/solr/collection1); ureq.setParam(self, foo); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); server.commit(); solrDoc = new SolrInputDocument(); solrDoc.addField(key, key); solrDoc.addField(content_mvtxt, updated value); server = servers.get( http://localhost:7574/solr/collection1;); ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); // ureq.deleteById(8060a9eb-9546-43ee-95bb-d18ea26a6285); ureq.add(solrDoc); ureq.setParam(shards, localhost:8983/solr/collection1,localhost:7574/solr/collection1); ureq.setParam(self, foo); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); // server.add(solrDoc); server.commit(); server = servers.get( http://localhost:8983/solr/collection1;); server.commit(); System.out.println(done); but I'm still seeing the doc appear on both shards.After the first commit I see the doc on 8983 with initial value. after the second commit I see the updated value on 7574 and the old on 8983. After the final commit the doc on 8983 gets updated. Is there something wrong with my test? On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller markrmil...@gmail.com wrote: Getting late - didn't really pay attention to your code I guess - why are you adding the first doc without specifying the distrib update chain? This is not really supported. It's going to just go to the server you specified - even with everything setup right, the update might then go to that same server or the other one depending on how it hashes. You really want to just always use the distrib update chain. I guess I don't yet understand what you are trying to test. Sent from my iPad On Dec 1, 2011, at 10:57 PM, Mark Miller markrmil...@gmail.com wrote: Not sure offhand - but things will be funky if you don't specify the correct numShards. The instance to shard assignment should be using numShards to assign. 
But then the hash to shard mapping actually goes on the number of shards it finds registered in ZK (it doesn't have to, but really these should be equal). So basically you are saying, I want 3 partitions, but you are only starting up 2 nodes, and the code is just not happy about that I'd guess. For the system to work properly, you have to fire up at least as many servers as numShards. What are you trying to do? 2 partitions with no replicas, or one partition with one replica? In either case, I think you will have better luck if you fire up at least as many servers as the numShards setting. Or lower the numShards setting. This is all a work in progress by the way - what you are trying to test should work if things are setup right though. - Mark On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote: Thanks for the quick response. With that change (have not done numShards yet) shard1 got updated. But now
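A sketch of the CloudSolrServer client Mark mentions above. The constructor takes the ZooKeeper address (localhost:9983 is assumed from an embedded-zkRun setup) rather than a Solr URL; setDefaultCollection and the rest of the calls are as in later SolrJ releases and may differ slightly on the branch:

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudClientSketch {
  public static void main(String[] args) throws Exception {
    // connect through ZooKeeper; the client picks a live node instead of a fixed URL
    CloudSolrServer server = new CloudSolrServer("localhost:9983");
    server.setDefaultCollection("collection1");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("key", "1");
    doc.addField("content_mvtxt", "initial value");
    server.add(doc);
    server.commit();
  }
}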
Re: Configuring the Distributed
Glad to hear I don't need to set shards/self, but removing them didn't seem to change what I'm seeing. Doing this still results in 2 documents: 1 on 8983 and 1 on 7574.

String key = "1";
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.setField("key", key);
solrDoc.addField("content_mvtxt", "initial value");
SolrServer server = servers.get("http://localhost:8983/solr/collection1");
UpdateRequest ureq = new UpdateRequest();
ureq.setParam("update.chain", "distrib-update-chain");
ureq.add(solrDoc);
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
server.commit();

solrDoc = new SolrInputDocument();
solrDoc.addField("key", key);
solrDoc.addField("content_mvtxt", "updated value");
server = servers.get("http://localhost:7574/solr/collection1");
ureq = new UpdateRequest();
ureq.setParam("update.chain", "distrib-update-chain");
ureq.add(solrDoc);
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
server.commit();

server = servers.get("http://localhost:8983/solr/collection1");
server.commit();
System.out.println("done");

On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller markrmil...@gmail.com wrote: So I dunno. You are running a zk server and running in zk mode, right? You don't need to / shouldn't set a shards or self param. The shards are figured out from ZooKeeper. You always want to use the distrib-update-chain. Eventually it will probably be part of the default chain and auto turn on in zk mode. If you are running in zk mode attached to a zk server, this should work no problem. You can add docs to any server and they will be forwarded to the correct shard leader and then versioned and forwarded to replicas. You can also use the CloudSolrServer solrj client - that way you don't even have to choose a server to send docs to - in which case if it went down you would have to choose another manually - CloudSolrServer automatically finds one that is up through ZooKeeper. Eventually it will also be smart and do the hashing itself so that it can send directly to the shard leader that the doc would be forwarded to anyway. - Mark

On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson jej2...@gmail.com wrote: Really just trying to do a simple add and update test; the chain missing is just proof of my not understanding exactly how this is supposed to work. I modified the code to this:

String key = "1";
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.setField("key", key);
solrDoc.addField("content_mvtxt", "initial value");
SolrServer server = servers.get("http://localhost:8983/solr/collection1");
UpdateRequest ureq = new UpdateRequest();
ureq.setParam("update.chain", "distrib-update-chain");
ureq.add(solrDoc);
ureq.setParam("shards", "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
ureq.setParam("self", "foo");
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
server.commit();

solrDoc = new SolrInputDocument();
solrDoc.addField("key", key);
solrDoc.addField("content_mvtxt", "updated value");
server = servers.get("http://localhost:7574/solr/collection1");
ureq = new UpdateRequest();
ureq.setParam("update.chain", "distrib-update-chain");
// ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
ureq.add(solrDoc);
ureq.setParam("shards", "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
ureq.setParam("self", "foo");
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
// server.add(solrDoc);
server.commit();

server = servers.get("http://localhost:8983/solr/collection1");
server.commit();
System.out.println("done");

but I'm still seeing the doc appear on both shards.
After the first commit I see the doc on 8983 with initial value. after the second commit I see the updated value on 7574 and the old on 8983. After the final commit the doc on 8983 gets updated. Is there something wrong with my test? On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller markrmil...@gmail.com wrote: Getting late - didn't really pay attention to your code I guess - why are you adding the first doc without specifying the distrib update chain? This is not really
Re: Configuring the Distributed
They are unused params, so removing them wouldn't help anything. You might just want to wait till we are further along before playing with it. Or if you submit your full self contained test, I can see what's going on (eg its still unclear if you have started setting numShards?). I can do a similar set of actions in my tests and it works fine. The only reason I could see things working like this is if it thinks you have one shard - a leader and a replica. - Mark On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote: Glad to hear I don't need to set shards/self, but removing them didn't seem to change what I'm seeing. Doing this still results in 2 documents 1 on 8983 and 1 on 7574. String key = 1; SolrInputDocument solrDoc = new SolrInputDocument(); solrDoc.setField(key, key); solrDoc.addField(content_mvtxt, initial value); SolrServer server = servers.get(http://localhost:8983/solr/collection1;); UpdateRequest ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); server.commit(); solrDoc = new SolrInputDocument(); solrDoc.addField(key, key); solrDoc.addField(content_mvtxt, updated value); server = servers.get(http://localhost:7574/solr/collection1;); ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); server.commit(); server = servers.get(http://localhost:8983/solr/collection1;); server.commit(); System.out.println(done); On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller markrmil...@gmail.com wrote: So I dunno. You are running a zk server and running in zk mode right? You don't need to / shouldn't set a shards or self param. The shards are figured out from Zookeeper. You always want to use the distrib-update-chain. Eventually it will probably be part of the default chain and auto turn in zk mode. If you are running in zk mode attached to a zk server, this should work no problem. You can add docs to any server and they will be forwarded to the correct shard leader and then versioned and forwarded to replicas. You can also use the CloudSolrServer solrj client - that way you don't even have to choose a server to send docs too - in which case if it went down you would have to choose another manually - CloudSolrServer automatically finds one that is up through ZooKeeper. Eventually it will also be smart and do the hashing itself so that it can send directly to the shard leader that the doc would be forwarded to anyway. - Mark On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson jej2...@gmail.com wrote: Really just trying to do a simple add and update test, the chain missing is just proof of my not understanding exactly how this is supposed to work. 
I modified the code to this String key = 1; SolrInputDocument solrDoc = new SolrInputDocument(); solrDoc.setField(key, key); solrDoc.addField(content_mvtxt, initial value); SolrServer server = servers .get( http://localhost:8983/solr/collection1;); UpdateRequest ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setParam(shards, localhost:8983/solr/collection1,localhost:7574/solr/collection1); ureq.setParam(self, foo); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); server.commit(); solrDoc = new SolrInputDocument(); solrDoc.addField(key, key); solrDoc.addField(content_mvtxt, updated value); server = servers.get( http://localhost:7574/solr/collection1;); ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); // ureq.deleteById(8060a9eb-9546-43ee-95bb-d18ea26a6285); ureq.add(solrDoc); ureq.setParam(shards, localhost:8983/solr/collection1,localhost:7574/solr/collection1); ureq.setParam(self, foo); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); // server.add(solrDoc); server.commit(); server = servers.get( http://localhost:8983/solr/collection1;); server.commit(); System.out.println(done); but I'm
Re: Configuring the Distributed
So I'm a fool. I did set the numShards, the issue was so trivial it's embarrassing. I did indeed have it setup as a replica, the shard names in solr.xml were both shard1. This worked as I expected now. On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller markrmil...@gmail.com wrote: They are unused params, so removing them wouldn't help anything. You might just want to wait till we are further along before playing with it. Or if you submit your full self contained test, I can see what's going on (eg its still unclear if you have started setting numShards?). I can do a similar set of actions in my tests and it works fine. The only reason I could see things working like this is if it thinks you have one shard - a leader and a replica. - Mark On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote: Glad to hear I don't need to set shards/self, but removing them didn't seem to change what I'm seeing. Doing this still results in 2 documents 1 on 8983 and 1 on 7574. String key = 1; SolrInputDocument solrDoc = new SolrInputDocument(); solrDoc.setField(key, key); solrDoc.addField(content_mvtxt, initial value); SolrServer server = servers.get(http://localhost:8983/solr/collection1;); UpdateRequest ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); server.commit(); solrDoc = new SolrInputDocument(); solrDoc.addField(key, key); solrDoc.addField(content_mvtxt, updated value); server = servers.get(http://localhost:7574/solr/collection1;); ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); server.commit(); server = servers.get(http://localhost:8983/solr/collection1;); server.commit(); System.out.println(done); On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller markrmil...@gmail.com wrote: So I dunno. You are running a zk server and running in zk mode right? You don't need to / shouldn't set a shards or self param. The shards are figured out from Zookeeper. You always want to use the distrib-update-chain. Eventually it will probably be part of the default chain and auto turn in zk mode. If you are running in zk mode attached to a zk server, this should work no problem. You can add docs to any server and they will be forwarded to the correct shard leader and then versioned and forwarded to replicas. You can also use the CloudSolrServer solrj client - that way you don't even have to choose a server to send docs too - in which case if it went down you would have to choose another manually - CloudSolrServer automatically finds one that is up through ZooKeeper. Eventually it will also be smart and do the hashing itself so that it can send directly to the shard leader that the doc would be forwarded to anyway. - Mark On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson jej2...@gmail.com wrote: Really just trying to do a simple add and update test, the chain missing is just proof of my not understanding exactly how this is supposed to work. 
I modified the code to this String key = 1; SolrInputDocument solrDoc = new SolrInputDocument(); solrDoc.setField(key, key); solrDoc.addField(content_mvtxt, initial value); SolrServer server = servers .get( http://localhost:8983/solr/collection1;); UpdateRequest ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setParam(shards, localhost:8983/solr/collection1,localhost:7574/solr/collection1); ureq.setParam(self, foo); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); server.commit(); solrDoc = new SolrInputDocument(); solrDoc.addField(key, key); solrDoc.addField(content_mvtxt, updated value); server = servers.get( http://localhost:7574/solr/collection1;); ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); // ureq.deleteById(8060a9eb-9546-43ee-95bb-d18ea26a6285); ureq.add(solrDoc); ureq.setParam(shards, localhost:8983/solr/collection1,localhost:7574/solr/collection1); ureq.setParam(self, foo); ureq.setAction(ACTION.COMMIT, true, true);
Re: Configuring the Distributed
Ah, okay - you are setting the shards in solr.xml - thats still an option to force a node to a particular shard - but if you take that out, shards will be auto assigned. By the way, because of the version code, distrib deletes don't work at the moment - will get to that next week. - Mark On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson jej2...@gmail.com wrote: So I'm a fool. I did set the numShards, the issue was so trivial it's embarrassing. I did indeed have it setup as a replica, the shard names in solr.xml were both shard1. This worked as I expected now. On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller markrmil...@gmail.com wrote: They are unused params, so removing them wouldn't help anything. You might just want to wait till we are further along before playing with it. Or if you submit your full self contained test, I can see what's going on (eg its still unclear if you have started setting numShards?). I can do a similar set of actions in my tests and it works fine. The only reason I could see things working like this is if it thinks you have one shard - a leader and a replica. - Mark On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote: Glad to hear I don't need to set shards/self, but removing them didn't seem to change what I'm seeing. Doing this still results in 2 documents 1 on 8983 and 1 on 7574. String key = 1; SolrInputDocument solrDoc = new SolrInputDocument(); solrDoc.setField(key, key); solrDoc.addField(content_mvtxt, initial value); SolrServer server = servers.get( http://localhost:8983/solr/collection1;); UpdateRequest ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); server.commit(); solrDoc = new SolrInputDocument(); solrDoc.addField(key, key); solrDoc.addField(content_mvtxt, updated value); server = servers.get( http://localhost:7574/solr/collection1;); ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); server.commit(); server = servers.get( http://localhost:8983/solr/collection1;); server.commit(); System.out.println(done); On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller markrmil...@gmail.com wrote: So I dunno. You are running a zk server and running in zk mode right? You don't need to / shouldn't set a shards or self param. The shards are figured out from Zookeeper. You always want to use the distrib-update-chain. Eventually it will probably be part of the default chain and auto turn in zk mode. If you are running in zk mode attached to a zk server, this should work no problem. You can add docs to any server and they will be forwarded to the correct shard leader and then versioned and forwarded to replicas. You can also use the CloudSolrServer solrj client - that way you don't even have to choose a server to send docs too - in which case if it went down you would have to choose another manually - CloudSolrServer automatically finds one that is up through ZooKeeper. Eventually it will also be smart and do the hashing itself so that it can send directly to the shard leader that the doc would be forwarded to anyway. - Mark On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson jej2...@gmail.com wrote: Really just trying to do a simple add and update test, the chain missing is just proof of my not understanding exactly how this is supposed to work. 
I modified the code to this String key = 1; SolrInputDocument solrDoc = new SolrInputDocument(); solrDoc.setField(key, key); solrDoc.addField(content_mvtxt, initial value); SolrServer server = servers .get( http://localhost:8983/solr/collection1;); UpdateRequest ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setParam(shards, localhost:8983/solr/collection1,localhost:7574/solr/collection1); ureq.setParam(self, foo); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); server.commit(); solrDoc = new SolrInputDocument(); solrDoc.addField(key, key); solrDoc.addField(content_mvtxt, updated value); server = servers.get(
Re: Configuring the Distributed
How does it determine the number of shards to create? How many replicas to create? On Fri, Dec 2, 2011 at 4:30 PM, Mark Miller markrmil...@gmail.com wrote: Ah, okay - you are setting the shards in solr.xml - thats still an option to force a node to a particular shard - but if you take that out, shards will be auto assigned. By the way, because of the version code, distrib deletes don't work at the moment - will get to that next week. - Mark On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson jej2...@gmail.com wrote: So I'm a fool. I did set the numShards, the issue was so trivial it's embarrassing. I did indeed have it setup as a replica, the shard names in solr.xml were both shard1. This worked as I expected now. On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller markrmil...@gmail.com wrote: They are unused params, so removing them wouldn't help anything. You might just want to wait till we are further along before playing with it. Or if you submit your full self contained test, I can see what's going on (eg its still unclear if you have started setting numShards?). I can do a similar set of actions in my tests and it works fine. The only reason I could see things working like this is if it thinks you have one shard - a leader and a replica. - Mark On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote: Glad to hear I don't need to set shards/self, but removing them didn't seem to change what I'm seeing. Doing this still results in 2 documents 1 on 8983 and 1 on 7574. String key = 1; SolrInputDocument solrDoc = new SolrInputDocument(); solrDoc.setField(key, key); solrDoc.addField(content_mvtxt, initial value); SolrServer server = servers.get( http://localhost:8983/solr/collection1;); UpdateRequest ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); server.commit(); solrDoc = new SolrInputDocument(); solrDoc.addField(key, key); solrDoc.addField(content_mvtxt, updated value); server = servers.get( http://localhost:7574/solr/collection1;); ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); server.commit(); server = servers.get( http://localhost:8983/solr/collection1;); server.commit(); System.out.println(done); On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller markrmil...@gmail.com wrote: So I dunno. You are running a zk server and running in zk mode right? You don't need to / shouldn't set a shards or self param. The shards are figured out from Zookeeper. You always want to use the distrib-update-chain. Eventually it will probably be part of the default chain and auto turn in zk mode. If you are running in zk mode attached to a zk server, this should work no problem. You can add docs to any server and they will be forwarded to the correct shard leader and then versioned and forwarded to replicas. You can also use the CloudSolrServer solrj client - that way you don't even have to choose a server to send docs too - in which case if it went down you would have to choose another manually - CloudSolrServer automatically finds one that is up through ZooKeeper. Eventually it will also be smart and do the hashing itself so that it can send directly to the shard leader that the doc would be forwarded to anyway. 
- Mark On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson jej2...@gmail.com wrote: Really just trying to do a simple add and update test, the chain missing is just proof of my not understanding exactly how this is supposed to work. I modified the code to this String key = 1; SolrInputDocument solrDoc = new SolrInputDocument(); solrDoc.setField(key, key); solrDoc.addField(content_mvtxt, initial value); SolrServer server = servers .get( http://localhost:8983/solr/collection1;); UpdateRequest ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setParam(shards, localhost:8983/solr/collection1,localhost:7574/solr/collection1); ureq.setParam(self, foo); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); server.commit(); solrDoc = new SolrInputDocument();
Re: Configuring the Distributed
I think I see it. So if I understand this correctly, you specify numShards as a system property; as new nodes come up they check ZK to see if they should be a new shard or a replica based on whether numShards is met. A few questions: if a master goes down, does a replica get promoted? If a new shard needs to be added, is it just a matter of starting a new solr instance with a higher numShards? (Understanding that index rebalancing does not happen automatically now, but presumably it could.)

On Fri, Dec 2, 2011 at 9:56 PM, Jamie Johnson jej2...@gmail.com wrote: How does it determine the number of shards to create? How many replicas to create?

On Fri, Dec 2, 2011 at 4:30 PM, Mark Miller markrmil...@gmail.com wrote: Ah, okay - you are setting the shards in solr.xml - that's still an option to force a node to a particular shard - but if you take that out, shards will be auto assigned. By the way, because of the version code, distrib deletes don't work at the moment - will get to that next week. - Mark

On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson jej2...@gmail.com wrote: So I'm a fool. I did set the numShards, the issue was so trivial it's embarrassing. I did indeed have it set up as a replica, the shard names in solr.xml were both shard1. This works as I expected now.

On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller markrmil...@gmail.com wrote: They are unused params, so removing them wouldn't help anything. You might just want to wait till we are further along before playing with it. Or if you submit your full self contained test, I can see what's going on (eg it's still unclear if you have started setting numShards?). I can do a similar set of actions in my tests and it works fine. The only reason I could see things working like this is if it thinks you have one shard - a leader and a replica. - Mark

On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote: Glad to hear I don't need to set shards/self, but removing them didn't seem to change what I'm seeing. Doing this still results in 2 documents: 1 on 8983 and 1 on 7574.

String key = "1";
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.setField("key", key);
solrDoc.addField("content_mvtxt", "initial value");
SolrServer server = servers.get("http://localhost:8983/solr/collection1");
UpdateRequest ureq = new UpdateRequest();
ureq.setParam("update.chain", "distrib-update-chain");
ureq.add(solrDoc);
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
server.commit();

solrDoc = new SolrInputDocument();
solrDoc.addField("key", key);
solrDoc.addField("content_mvtxt", "updated value");
server = servers.get("http://localhost:7574/solr/collection1");
ureq = new UpdateRequest();
ureq.setParam("update.chain", "distrib-update-chain");
ureq.add(solrDoc);
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
server.commit();

server = servers.get("http://localhost:8983/solr/collection1");
server.commit();
System.out.println("done");

On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller markrmil...@gmail.com wrote: So I dunno. You are running a zk server and running in zk mode, right? You don't need to / shouldn't set a shards or self param. The shards are figured out from ZooKeeper. You always want to use the distrib-update-chain. Eventually it will probably be part of the default chain and auto turn on in zk mode. If you are running in zk mode attached to a zk server, this should work no problem. You can add docs to any server and they will be forwarded to the correct shard leader and then versioned and forwarded to replicas.
You can also use the CloudSolrServer solrj client - that way you don't even have to choose a server to send docs too - in which case if it went down you would have to choose another manually - CloudSolrServer automatically finds one that is up through ZooKeeper. Eventually it will also be smart and do the hashing itself so that it can send directly to the shard leader that the doc would be forwarded to anyway. - Mark On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson jej2...@gmail.com wrote: Really just trying to do a simple add and update test, the chain missing is just proof of my not understanding exactly how this is supposed to work. I modified the code to this String key = 1; SolrInputDocument solrDoc = new SolrInputDocument(); solrDoc.setField(key, key); solrDoc.addField(content_mvtxt, initial value); SolrServer server = servers
Re: Configuring the Distributed
So I just tried this out, seems like it does the things I asked about. Really really cool stuff, it's progressed quite a bit in the time since I took a snapshot of the branch. Last question, how do you change numShards? Is there a command you can use to do this now? I understand there will be implications for the hashing algorithm, but once the hash ranges are stored in ZK (is there a separate JIRA for this or does this fall under 2358) I assume that it would be a relatively simple index split (JIRA 2595?) and updating the hash ranges in solr, essentially splitting the range between the new and existing shard. Is that right? On Fri, Dec 2, 2011 at 10:08 PM, Jamie Johnson jej2...@gmail.com wrote: I think I see it.so if I understand this correctly you specify numShards as a system property, as new nodes come up they check ZK to see if they should be a new shard or a replica based on if numShards is met. A few questions if a master goes down does a replica get promoted? If a new shard needs to be added is it just a matter of starting a new solr instance with a higher numShards? (understanding that index rebalancing does not happen automatically now, but presumably it could). On Fri, Dec 2, 2011 at 9:56 PM, Jamie Johnson jej2...@gmail.com wrote: How does it determine the number of shards to create? How many replicas to create? On Fri, Dec 2, 2011 at 4:30 PM, Mark Miller markrmil...@gmail.com wrote: Ah, okay - you are setting the shards in solr.xml - thats still an option to force a node to a particular shard - but if you take that out, shards will be auto assigned. By the way, because of the version code, distrib deletes don't work at the moment - will get to that next week. - Mark On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson jej2...@gmail.com wrote: So I'm a fool. I did set the numShards, the issue was so trivial it's embarrassing. I did indeed have it setup as a replica, the shard names in solr.xml were both shard1. This worked as I expected now. On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller markrmil...@gmail.com wrote: They are unused params, so removing them wouldn't help anything. You might just want to wait till we are further along before playing with it. Or if you submit your full self contained test, I can see what's going on (eg its still unclear if you have started setting numShards?). I can do a similar set of actions in my tests and it works fine. The only reason I could see things working like this is if it thinks you have one shard - a leader and a replica. - Mark On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote: Glad to hear I don't need to set shards/self, but removing them didn't seem to change what I'm seeing. Doing this still results in 2 documents 1 on 8983 and 1 on 7574. 
String key = 1; SolrInputDocument solrDoc = new SolrInputDocument(); solrDoc.setField(key, key); solrDoc.addField(content_mvtxt, initial value); SolrServer server = servers.get( http://localhost:8983/solr/collection1;); UpdateRequest ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); server.commit(); solrDoc = new SolrInputDocument(); solrDoc.addField(key, key); solrDoc.addField(content_mvtxt, updated value); server = servers.get( http://localhost:7574/solr/collection1;); ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); server.commit(); server = servers.get( http://localhost:8983/solr/collection1;); server.commit(); System.out.println(done); On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller markrmil...@gmail.com wrote: So I dunno. You are running a zk server and running in zk mode right? You don't need to / shouldn't set a shards or self param. The shards are figured out from Zookeeper. You always want to use the distrib-update-chain. Eventually it will probably be part of the default chain and auto turn in zk mode. If you are running in zk mode attached to a zk server, this should work no problem. You can add docs to any server and they will be forwarded to the correct shard leader and then versioned and forwarded to replicas. You can also use the CloudSolrServer solrj client - that way you don't even have to choose a server to send docs too - in which case if it went down you would have to choose another manually - CloudSolrServer automatically finds one that is up through ZooKeeper.
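To make the range-splitting idea above concrete, here is a toy sketch of what "splitting the range between the new and existing shard" could mean. The hash function, the integer hash space, and how the ranges would be stored in ZK are all assumptions for illustration, not how the branch actually does it:

public class RangeSplitSketch {
  public static void main(String[] args) {
    // range of hash values currently owned by one shard (assumed: the lower half of the int space)
    int min = Integer.MIN_VALUE;
    int max = -1;

    // split the range down the middle; use long arithmetic to avoid overflow
    int mid = (int) (((long) min + (long) max) / 2);

    // the existing shard keeps [min, mid], the new shard takes (mid, max]
    System.out.println("existing shard: [" + min + ", " + mid + "]");
    System.out.println("new shard:      (" + mid + ", " + max + "]");
  }
}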
Re: Configuring the Distributed
On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson jej2...@gmail.com wrote: I am currently looking at the latest solrcloud branch and was wondering if there was any documentation on configuring the DistributedUpdateProcessor? What specifically in solrconfig.xml needs to be added/modified to make distributed indexing work? Hi Jaime - take a look at solrconfig-distrib-update.xml in solr/core/src/test-files You need to enable the update log, add an empty replication handler def, and an update chain with solr.DistributedUpdateProcessFactory in it. -- - Mark http://www.lucidimagination.com
Re: Configuring the Distributed
Thanks I will try this first thing in the morning. On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller markrmil...@gmail.com wrote: On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson jej2...@gmail.com wrote: I am currently looking at the latest solrcloud branch and was wondering if there was any documentation on configuring the DistributedUpdateProcessor? What specifically in solrconfig.xml needs to be added/modified to make distributed indexing work? Hi Jaime - take a look at solrconfig-distrib-update.xml in solr/core/src/test-files You need to enable the update log, add an empty replication handler def, and an update chain with solr.DistributedUpdateProcessFactory in it. -- - Mark http://www.lucidimagination.com
Re: Configuring the Distributed
Another question, is there any support for repartitioning of the index if a new shard is added? What is the recommended approach for handling this? It seemed that the hashing algorithm (and probably any) would require the index to be repartitioned should a new shard be added. On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson jej2...@gmail.com wrote: Thanks I will try this first thing in the morning. On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller markrmil...@gmail.com wrote: On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson jej2...@gmail.com wrote: I am currently looking at the latest solrcloud branch and was wondering if there was any documentation on configuring the DistributedUpdateProcessor? What specifically in solrconfig.xml needs to be added/modified to make distributed indexing work? Hi Jaime - take a look at solrconfig-distrib-update.xml in solr/core/src/test-files You need to enable the update log, add an empty replication handler def, and an update chain with solr.DistributedUpdateProcessFactory in it. -- - Mark http://www.lucidimagination.com
Re: Configuring the Distributed
Not yet - we don't plan on working on this until a lot of other stuff is working solid at this point. But someone else could jump in! There are a couple of ways to go about it that I know of:

A more long-term solution may be to start using micro shards - each index starts as multiple indexes. This makes it pretty fast to move micro shards around as you decide to change partitions. It's also less flexible, as you are limited by the number of micro shards you start with.

A simpler and likely first step is to use an index splitter. We already have one in lucene contrib - we would just need to modify it so that it splits based on the hash of the document id. This is super flexible, but splitting will obviously take a little while on a huge index. The current index splitter is a multi-pass splitter - good enough to start with, but with most files under codec control these days, we may be able to make a single-pass splitter soon as well.

Eventually you could imagine using both options - micro shards that could also be split as needed. Though I still wonder if micro shards will be worth the extra complications myself... Right now though, the idea is that you should pick a good number of partitions to start given your expected data ;) Adding more replicas is trivial though. - Mark

On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson jej2...@gmail.com wrote: Another question, is there any support for repartitioning of the index if a new shard is added? What is the recommended approach for handling this? It seemed that the hashing algorithm (and probably any) would require the index to be repartitioned should a new shard be added.

On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson jej2...@gmail.com wrote: Thanks, I will try this first thing in the morning.

On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller markrmil...@gmail.com wrote: On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson jej2...@gmail.com wrote: I am currently looking at the latest solrcloud branch and was wondering if there was any documentation on configuring the DistributedUpdateProcessor? What specifically in solrconfig.xml needs to be added/modified to make distributed indexing work? Hi Jaime - take a look at solrconfig-distrib-update.xml in solr/core/src/test-files You need to enable the update log, add an empty replication handler def, and an update chain with solr.DistributedUpdateProcessFactory in it. -- - Mark http://www.lucidimagination.com -- - Mark http://www.lucidimagination.com
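For reference, the contrib splitter Mark refers to can be driven from the command line roughly as follows. This is the stock Lucene MultiPassIndexSplitter, which splits an index into equal parts by document number rather than by hash of the id - the hash-based behavior described above would still need to be added - and the jar names and paths here are placeholders:

java -cp lucene-core.jar:lucene-misc.jar org.apache.lucene.index.MultiPassIndexSplitter \
  -out /path/to/output -num 2 /path/to/existing/index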
Re: Configuring the Distributed
I am not familiar with the index splitter that is in contrib, but I'll take a look at it soon. So the process sounds like it would be to run this on all of the current shards indexes based on the hash algorithm. Is there also an index merger in contrib which could be used to merge indexes? I'm assuming this would be the process? On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller markrmil...@gmail.com wrote: Not yet - we don't plan on working on this until a lot of other stuff is working solid at this point. But someone else could jump in! There are a couple ways to go about it that I know of: A more long term solution may be to start using micro shards - each index starts as multiple indexes. This makes it pretty fast to move mirco shards around as you decide to change partitions. It's also less flexible as you are limited by the number of micro shards you start with. A more simple and likely first step is to use an index splitter . We already have one in lucene contrib - we would just need to modify it so that it splits based on the hash of the document id. This is super flexible, but splitting will obviously take a little while on a huge index. The current index splitter is a multi pass splitter - good enough to start with, but most files under codec control these days, we may be able to make a single pass splitter soon as well. Eventually you could imagine using both options - micro shards that could also be split as needed. Though I still wonder if micro shards will be worth the extra complications myself... Right now though, the idea is that you should pick a good number of partitions to start given your expected data ;) Adding more replicas is trivial though. - Mark On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson jej2...@gmail.com wrote: Another question, is there any support for repartitioning of the index if a new shard is added? What is the recommended approach for handling this? It seemed that the hashing algorithm (and probably any) would require the index to be repartitioned should a new shard be added. On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson jej2...@gmail.com wrote: Thanks I will try this first thing in the morning. On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller markrmil...@gmail.com wrote: On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson jej2...@gmail.com wrote: I am currently looking at the latest solrcloud branch and was wondering if there was any documentation on configuring the DistributedUpdateProcessor? What specifically in solrconfig.xml needs to be added/modified to make distributed indexing work? Hi Jaime - take a look at solrconfig-distrib-update.xml in solr/core/src/test-files You need to enable the update log, add an empty replication handler def, and an update chain with solr.DistributedUpdateProcessFactory in it. -- - Mark http://www.lucidimagination.com -- - Mark http://www.lucidimagination.com
Re: Configuring the Distributed
On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote: I am not familiar with the index splitter that is in contrib, but I'll take a look at it soon. So the process sounds like it would be to run this on all of the current shards' indexes based on the hash algorithm.

Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to. If you wanted to keep the full balance always, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time. You might just split one index to start - now its hash range would be handled by two shards instead of one (if you have 3 replicas per shard, this would mean adding 3 more boxes). When you needed to expand again, you would split another index that was still handling its full starting range. As you grow, once you split every original index, you'd start again, splitting one of the now half ranges.

Is there also an index merger in contrib which could be used to merge indexes? I'm assuming this would be the process?

You can merge with IndexWriter.addIndexes (Solr also has an admin command that can do this). But I'm not sure where this fits in?

- Mark
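(For illustration only - a minimal sketch of what a merge via IndexWriter.addIndexes could look like. The Lucene version, analyzer, and paths below are assumptions, not anything from this thread; Solr's admin-level merge command wraps the same call.)

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class MergeSplitHalves {
        public static void main(String[] args) throws Exception {
            // Hypothetical paths - point these at real index directories.
            Directory target = FSDirectory.open(new File("/path/to/merged-index"));
            Directory halfA = FSDirectory.open(new File("/path/to/split-half-a"));
            Directory halfB = FSDirectory.open(new File("/path/to/split-half-b"));

            IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_35,
                    new StandardAnalyzer(Version.LUCENE_35));
            IndexWriter writer = new IndexWriter(target, cfg);
            // addIndexes copies the source segments into the target index.
            writer.addIndexes(halfA, halfB);
            writer.close();
        }
    }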
Re: Configuring the Distributed
Hmmm. This doesn't sound like the hashing algorithm that's on the branch, right? The algorithm you're mentioning sounds like there is some logic which is able to tell that a particular range should be distributed between 2 shards instead of 1. So it seems like a trade-off between repartitioning the entire index (on every shard) and having a custom hashing algorithm which is able to handle the situation where 2 or more shards map to a particular range.
Re: Configuring the Distributed
Right now let's say you have one shard - everything there hashes to range X. Now you want to split that shard with an Index Splitter. You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1; if it's less, index2. I think there are a couple of current Splitter impls, but one of them does something like: give me an id - now if the ids in the index are above that id, go to index1; if below, index2. We need to instead do a quick hash rather than a simple id compare. Why do you need to do this on every shard?

The other part we need that we don't have is to store hash range assignments in ZooKeeper - we don't do that yet because it's not needed yet. Instead we currently just calculate that on the fly (too often at the moment - on every request :) I intend to fix that of course). At the start, ZK would say: for range X, go to this shard. After the split, it would say: for range less than X/2, go to the old node; for range greater than X/2, go to the new node.

- Mark
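(A rough sketch of the routing decision such a modified splitter might make - the hash function here is just a placeholder, and the real code would reuse whatever hash the distributed update processor applies to doc ids.)

    public class HashSplitRouter {
        // X/2: the midpoint where the shard's current hash range X is cut in two.
        private final int midpoint;

        public HashSplitRouter(int midpoint) {
            this.midpoint = midpoint;
        }

        // true  -> the doc is rewritten into index1 (upper half of range X)
        // false -> the doc is rewritten into index2 (lower half of range X)
        public boolean goesToIndex1(String docId) {
            return hash(docId) > midpoint;
        }

        private int hash(String docId) {
            // Placeholder hash; a simple id compare is what the existing
            // splitter does, and this is the part that would change.
            return docId.hashCode();
        }
    }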
Re: Configuring the Distributed
Yes, the ZK method seems much more flexible. Adding a new shard would simply be a matter of updating the range assignments in ZK. Where is this currently on the list of things to accomplish? I don't have time to work on this now, but if you (or anyone) could provide direction I'd be willing to work on this when I have spare time. I guess a JIRA detailing where/how to do this could help. Not sure if the design has been thought out that far though.
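(Purely illustrative - an in-memory stand-in for the kind of hash-range-to-shard table that could eventually live in ZooKeeper. The boundaries, shard names, and the restriction to non-negative hashes are all made up for the example; the actual layout may differ.)

    import java.util.Map;
    import java.util.TreeMap;

    public class RangeAssignments {
        // key = upper bound (inclusive) of a hash range, value = shard that owns it
        private final TreeMap<Integer, String> upperBoundToShard = new TreeMap<Integer, String>();

        public void assign(int upperBound, String shard) {
            upperBoundToShard.put(upperBound, shard);
        }

        // Find the shard whose range covers this hash.
        public String shardFor(int hash) {
            Map.Entry<Integer, String> e = upperBoundToShard.ceilingEntry(hash);
            return e == null ? null : e.getValue();
        }

        public static void main(String[] args) {
            RangeAssignments zk = new RangeAssignments();
            zk.assign(Integer.MAX_VALUE, "shard1");        // shard1 owns all of range X
            System.out.println(zk.shardFor(12345));        // -> shard1

            // After splitting shard1, "adding a shard" is just updating the entries:
            // the lower half stays put, the upper half goes to the new node.
            zk.assign(Integer.MAX_VALUE / 2, "shard1");
            zk.assign(Integer.MAX_VALUE, "shard1_new");
            System.out.println(zk.shardFor(12345));                  // still shard1
            System.out.println(zk.shardFor(Integer.MAX_VALUE - 5));  // now shard1_new
        }
    }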
Re: Configuring the Distributed
Of course, resharding is almost never necessary if you use micro-shards. Micro-shards are shards small enough that you can fit 20 or more on a node. If you have that many on each node, then adding a new node consists of moving some shards to the new machine rather than moving lots of little documents. Much faster. As in thousands of times faster.
Re: Configuring the Distributed
In this case we are still talking about moving a whole index at a time rather than lots of little documents. You split the index into two, and then ship one of them off. The extra cost you can avoid with micro-sharding will be the cost of splitting the index - which could be significant for a very large index. I have not done any tests though.

The cost of 20 micro-shards is that you will always have tons of segments unless you are very heavily merging - and even in the very unusual case of each micro-shard being optimized, you have essentially 20 segments. That's the best case - the normal case is likely in the hundreds. This can be a fairly significant % hit at search time. You also have the added complexity of managing 20 indexes per node in Solr code.

I think that both options have their +/-'s and eventually we could perhaps support both. To kick things off though, adding another partition should be a rare event if you plan carefully, and I think many will be able to handle the cost of splitting (you might even mark the replica you are splitting on so that it's not part of queries while it's 'busy' splitting).

- Mark
Re: Configuring the Distributed
Sorry - missed something - you also have the added cost of shipping the new half index to all of the replicas of the original shard with the splitting method. Unless you somehow split on every replica at the same time - then of course you wouldn't be able to avoid the 'busy' replica, and it would probably be fairly hard to juggle.
Re: Configuring the Distributed
So I couldn't resist - I attempted to do this tonight. I used the solrconfig you mentioned (as is, no modifications), set up a 2 shard cluster in collection1, sent 1 doc to 1 of the shards, then updated it and sent the update to the other. I don't see the modifications though; I only see the original document. The following is the test:

    public void update() throws Exception {
        String key = "1";
        SolrInputDocument solrDoc = new SolrInputDocument();
        solrDoc.setField("key", key);
        solrDoc.addField("content", "initial value");
        SolrServer server = servers.get("http://localhost:8983/solr/collection1");
        server.add(solrDoc);
        server.commit();

        solrDoc = new SolrInputDocument();
        solrDoc.addField("key", key);
        solrDoc.addField("content", "updated value");
        server = servers.get("http://localhost:7574/solr/collection1");
        UpdateRequest ureq = new UpdateRequest();
        ureq.setParam("update.chain", "distrib-update-chain");
        ureq.add(solrDoc);
        ureq.setParam("shards", "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
        ureq.setParam("self", "foo");
        ureq.setAction(ACTION.COMMIT, true, true);
        server.request(ureq);
        System.out.println("done");
    }

key is my unique field in schema.xml. What am I doing wrong?
Re: Configuring the Distributed
It's not full of details yet, but there is a JIRA issue here: https://issues.apache.org/jira/browse/SOLR-2595
Re: Configuring the Distributed
Hmm... sorry about that - so my first guess is that right now we are not distributing a commit (easy to add, just have not done it). Right now I explicitly commit on each server for tests. Can you try explicitly committing on server1 after updating the doc on server2? I can start distributing commits tomorrow - been meaning to do it for my own convenience anyhow.

Also, you want to pass the sys property numShards=1 on startup. I think it defaults to 3. That will give you one leader and one replica.

- Mark
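(A small sketch of the workaround described above - commit explicitly on each server until commits are distributed. The URLs match the test earlier in the thread; CommonsHttpSolrServer is the SolrJ client class assumed here, and the nodes would be started with the numShards=1 system property as suggested.)

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class CommitBothServers {
        public static void main(String[] args) throws Exception {
            SolrServer shard1 = new CommonsHttpSolrServer("http://localhost:8983/solr/collection1");
            SolrServer shard2 = new CommonsHttpSolrServer("http://localhost:7574/solr/collection1");

            // ... send the add/update through the distrib-update-chain first ...

            // Commits are not forwarded yet, so issue one on each server.
            shard1.commit();
            shard2.commit();
        }
    }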
Re: Configuring the Distributed
Thanks for the quick response. With that change (have not done numShards yet) shard1 got updated. But now when executing the following queries I get information back from both, which doesn't seem right:

http://localhost:7574/solr/select/?q=*:*
    <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>

http://localhost:8983/solr/select?q=*:*
    <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
Re: Configuring the Distributed
Not sure offhand - but things will be funky if you don't specify the correct numShards. The instance-to-shard assignment should be using numShards to assign. But then the hash-to-shard mapping actually goes on the number of shards it finds registered in ZK (it doesn't have to, but really these should be equal). So basically you are saying "I want 3 partitions", but you are only starting up 2 nodes, and the code is just not happy about that, I'd guess. For the system to work properly, you have to fire up at least as many servers as numShards.

What are you trying to do? 2 partitions with no replicas, or one partition with one replica? In either case, I think you will have better luck if you fire up at least as many servers as the numShards setting. Or lower the numShards setting. This is all a work in progress by the way - what you are trying to test should work if things are set up right though.

- Mark
Re: Configuring the Distributed
Getting late - didn't really pay attention to your code, I guess - why are you adding the first doc without specifying the distrib update chain? This is not really supported. It's going to just go to the server you specified - even with everything set up right, the update might then go to that same server or the other one depending on how it hashes. You really want to just always use the distrib update chain. I guess I don't yet understand what you are trying to test.

Sent from my iPad
The following is the test:

public void update() throws Exception {
    String key = "1";
    SolrInputDocument solrDoc = new SolrInputDocument();
    solrDoc.setField("key", key);
    solrDoc.addField("content", "initial value");
    SolrServer server = servers.get("http://localhost:8983/solr/collection1");
    server.add(solrDoc);
    server.commit();

    solrDoc = new SolrInputDocument();
    solrDoc.addField("key", key);
    solrDoc.addField("content", "updated value");
    server = servers.get("http://localhost:7574/solr/collection1");
    UpdateRequest ureq = new UpdateRequest();
    ureq.setParam("update.chain", "distrib-update-chain");
    ureq.add(solrDoc);
    ureq.setParam("shards", "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
    ureq.setParam("self", "foo");
    ureq.setAction(ACTION.COMMIT, true, true);
    server.request(ureq);
    System.out.println("done");
}

key is my unique field in schema.xml. What am I doing wrong? On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson jej2...@gmail.com wrote: Yes, the ZK method seems much more flexible. Adding a new shard would be simply updating the range assignments in ZK. Where is this currently on the list of things to accomplish? I don't have time to work on this now, but if you (or anyone) could provide direction I'd be willing to work on this when I have spare time. I guess a JIRA detailing where/how to do this could help. Not sure if the design has been thought out that far though. On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote: Right now let's say you have one shard - everything there hashes to range X. Now you want to split that shard with an Index Splitter. You divide range X in two - giving you
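As an aside on the numShards discussion above: the hash-to-shard mapping Mark describes might look conceptually like the sketch below - invented names, not SolrCloud internals - which also shows why starting fewer nodes than numShards leaves part of the hash space with no registered shard to receive it:

// Illustration only: carve the 32-bit hash space into one equal range per shard.
// If numShards says 3 but only 2 nodes are registered in ZK, roughly a third of
// the hash space maps to a shard that isn't there - the mismatch warned about above.
public class HashToShard {

    /** Maps a doc id onto one of numShards equal hash ranges. */
    public static int shardFor(String docId, int numShards) {
        long h = docId.hashCode() & 0xffffffffL;               // treat as unsigned 32-bit
        long rangeSize = (0x100000000L + numShards - 1) / numShards;
        return (int) (h / rangeSize);                          // 0 .. numShards-1
    }

    public static void main(String[] args) {
        System.out.println(shardFor("1", 2));  // which of 2 shards doc "1" lands on
        System.out.println(shardFor("1", 3));  // may be a different shard when 3 ranges exist
    }
}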
Re: Configuring the Distributed
Really just trying to do a simple add and update test; the missing chain is just proof of my not understanding exactly how this is supposed to work. I modified the code to this:

String key = "1";
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.setField("key", key);
solrDoc.addField("content_mvtxt", "initial value");
SolrServer server = servers.get("http://localhost:8983/solr/collection1");
UpdateRequest ureq = new UpdateRequest();
ureq.setParam("update.chain", "distrib-update-chain");
ureq.add(solrDoc);
ureq.setParam("shards", "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
ureq.setParam("self", "foo");
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
server.commit();

solrDoc = new SolrInputDocument();
solrDoc.addField("key", key);
solrDoc.addField("content_mvtxt", "updated value");
server = servers.get("http://localhost:7574/solr/collection1");
ureq = new UpdateRequest();
ureq.setParam("update.chain", "distrib-update-chain");
// ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
ureq.add(solrDoc);
ureq.setParam("shards", "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
ureq.setParam("self", "foo");
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
// server.add(solrDoc);
server.commit();

server = servers.get("http://localhost:8983/solr/collection1");
server.commit();
System.out.println("done");

but I'm still seeing the doc appear on both shards. After the first commit I see the doc on 8983 with the initial value. After the second commit I see the updated value on 7574 and the old one on 8983. After the final commit the doc on 8983 gets updated. Is there something wrong with my test? On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller markrmil...@gmail.com wrote: Getting late - didn't really pay attention to your code I guess - why are you adding the first doc without specifying the distrib update chain? This is not really supported. It's going to just go to the server you specified - even with everything set up right, the update might then go to that same server or the other one depending on how it hashes. You really want to just always use the distrib update chain. I guess I don't yet understand what you are trying to test. Sent from my iPad On Dec 1, 2011, at 10:57 PM, Mark Miller markrmil...@gmail.com wrote: Not sure offhand - but things will be funky if you don't specify the correct numShards. The instance-to-shard assignment should be using numShards to assign. But then the hash-to-shard mapping actually goes on the number of shards it finds registered in ZK (it doesn't have to, but really these should be equal). So basically you are saying, I want 3 partitions, but you are only starting up 2 nodes, and the code is just not happy about that I'd guess. For the system to work properly, you have to fire up at least as many servers as numShards. What are you trying to do? 2 partitions with no replicas, or one partition with one replica? In either case, I think you will have better luck if you fire up at least as many servers as the numShards setting. Or lower the numShards setting. This is all a work in progress by the way - what you are trying to test should work if things are set up right though. - Mark On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote: Thanks for the quick response. With that change (have not done numShards yet) shard1 got updated.
But now when executing the following queries I get information back from both, which doesn't seem right:

http://localhost:7574/solr/select/?q=*:*
<doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>

http://localhost:8983/solr/select?q=*:*
<doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>

On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller markrmil...@gmail.com wrote: Hmm... sorry bout that - so my first guess is that right now we are not distributing a commit (easy to add, just have not done it). Right now I explicitly commit on each server for tests. Can you try explicitly committing on server1 after updating the doc on server 2? I can start distributing commits tomorrow - been meaning to do it for my own convenience anyhow. Also, you want to pass the sys property numShards=1 on startup. I think it defaults to 3. That will give you one leader and one replica. - Mark On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote: So I
Re: Configuring the Distributed
Well, this goes both ways. It is not that unusual to take a node down for maintenance of some kind, or even to have a node failure. In that case, it is very nice to have the load from the lost node be spread fairly evenly across the remaining cluster. Regarding the cost of having several micro-shards, they are also an opportunity for threading the search. Most sites don't have enough queries coming in to occupy all of the cores in modern machines, so threading each query can actually be a substantial benefit in terms of query time. On Thu, Dec 1, 2011 at 6:37 PM, Mark Miller markrmil...@gmail.com wrote: To kick things off though, adding another partition should be a rare event if you plan carefully, and I think many will be able to handle the cost of splitting (you might even mark the replica you are splitting on so that it's not part of queries while it's 'busy' splitting).
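To make the threading idea concrete, a minimal sketch - not Solr code; the Shard interface and its search() method are invented stand-ins - of fanning one query out across many micro-shards and merging the results:

// Sketch of parallelizing a single query across micro-shards with plain
// java.util.concurrent: one task per shard, results merged naively at the end.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

class ParallelShardSearch {

    interface Shard {
        List<String> search(String q) throws Exception;
    }

    static List<String> searchAll(List<Shard> shards, String q, ExecutorService pool)
            throws Exception {
        // submit one task per micro-shard so a single query can use many cores
        List<Future<List<String>>> futures = new ArrayList<>();
        for (Shard s : shards) {
            Callable<List<String>> task = () -> s.search(q);
            futures.add(pool.submit(task));
        }
        // naive merge - a real implementation would re-sort/score the combined hits
        List<String> merged = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            merged.addAll(f.get());
        }
        return merged;
    }
}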
Re: Configuring the Distributed
With micro-shards, you can use random numbers for all placements, with minor constraints like avoiding replicas sitting in the same rack. Since the number of shards never changes, things stay very simple. On Thu, Dec 1, 2011 at 6:44 PM, Mark Miller markrmil...@gmail.com wrote: Sorry - missed something - you also have the added cost of shipping the new half index to all of the replicas of the original shard with the splitting method. Unless you somehow split on every replica at the same time - then of course you wouldn't be able to avoid the 'busy' replica, and it would probably be fairly hard to juggle. On Dec 1, 2011, at 9:37 PM, Mark Miller wrote: In this case we are still talking about moving a whole index at a time rather than lots of little documents. You split the index into two, and then ship one of them off. The extra cost you can avoid with micro-sharding will be the cost of splitting the index - which could be significant for a very large index. I have not done any tests though. The cost of 20 micro-shards is that you will always have tons of segments unless you are very heavily merging - and even in the very unusual case of each micro-shard being optimized, you have essentially 20 segments. That's best case - the normal case is likely in the hundreds. This can be a fairly significant % hit at search time. You also have the added complexity of managing 20 indexes per node in Solr code. I think that both options have their +/-'s, and eventually we could perhaps support both. To kick things off though, adding another partition should be a rare event if you plan carefully, and I think many will be able to handle the cost of splitting (you might even mark the replica you are splitting on so that it's not part of queries while it's 'busy' splitting). - Mark On Dec 1, 2011, at 9:17 PM, Ted Dunning wrote: Of course, resharding is almost never necessary if you use micro-shards. Micro-shards are shards small enough that you can fit 20 or more on a node. If you have that many on each node, then adding a new node consists of moving some shards to the new machine rather than moving lots of little documents. Much faster. As in thousands of times faster. On Thu, Dec 1, 2011 at 5:51 PM, Jamie Johnson jej2...@gmail.com wrote: Yes, the ZK method seems much more flexible. Adding a new shard would be simply updating the range assignments in ZK. Where is this currently on the list of things to accomplish? I don't have time to work on this now, but if you (or anyone) could provide direction I'd be willing to work on this when I have spare time. I guess a JIRA detailing where/how to do this could help. Not sure if the design has been thought out that far though. On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote: Right now let's say you have one shard - everything there hashes to range X. Now you want to split that shard with an Index Splitter. You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1 - if it's less, index2. I think there are a couple of current Splitter impls, but one of them does something like: give me an id - now if the ids in the index are above that id, go to index1; if below, index2. We need to instead do a quick hash rather than a simple id compare. Why do you need to do this on every shard?
The other part we need that we don't have is to store hash range assignments in ZooKeeper - we don't do that yet because it's not needed yet. Instead we currently just calculate that on the fly (too often at the moment - on every request :) I intend to fix that, of course). At the start, ZK would say: for range X, go to this shard. After the split, it would say: for the range less than X/2, go to the old node; for the range greater than X/2, go to the new node. - Mark On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote: Hmmm. This doesn't sound like the hashing algorithm that's on the branch, right? The algorithm you're mentioning sounds like there is some logic which is able to tell that a particular range should be distributed between 2 shards instead of 1. So it seems like a trade-off between repartitioning the entire index (on every shard) and having a custom hashing algorithm which is able to handle the situation where 2 or more shards map to a particular range. On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com wrote: On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote: I am not familiar with the index splitter that is in contrib, but I'll take a look at it soon. So the process sounds like it would be to run this on all of the current shards
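As a rough illustration of the range-to-shard assignments described above - a sketch only, where the sorted-map layout, the class names, and the assumption of non-negative hashes are mine, not SolrCloud's:

// Invented model of "range assignments in ZooKeeper": a sorted map keyed by
// range start. Before a split one entry covers range X; after the split the
// upper half points at the new node, and routing stays a single floor lookup.
import java.util.TreeMap;

public class RangeAssignments {

    // key = inclusive start of a hash range, value = shard that owns it
    private final TreeMap<Integer, String> rangeStarts = new TreeMap<>();

    public void assign(int rangeStart, String shard) {
        rangeStarts.put(rangeStart, shard);
    }

    public String shardFor(int hash) {
        // the owning shard is the one whose range starts at or below this hash
        return rangeStarts.floorEntry(hash).getValue();
    }

    public static void main(String[] args) {
        RangeAssignments assignments = new RangeAssignments();
        assignments.assign(0, "shard1");              // before the split: shard1 owns all of range X
        assignments.assign(1 << 30, "shard1_split");  // after the split: upper half -> new node
        System.out.println(assignments.shardFor(42));                 // shard1
        System.out.println(assignments.shardFor(Integer.MAX_VALUE));  // shard1_split
    }
}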