Re: basic solr cloud questions
SOLR-2355 is definitely a step in the right direction but something I would like to get clarified: a) There were some fixes to it that went on the 3.4 3.5 branch based on the comments section ... are they not available or not needed on 4.x trunk? b) Does this basic implementation distribute across shards or across cores? I think that distributing across all the cores in a shard is the key towards using it successfully with SolrCloud and I really don't know if this does this right now as I am not familiar with the source code. If someone could answer this it would be great otherwise I'll post back eventually when I do become familiar. Cheers, - Pulkit
Re: basic solr cloud questions
BTW, I updated the wiki with the following; I hope it keeps things simple for others starting out:
Example B: Simple two shard cluster with shard replicas
Note: This setup leverages copy/paste to set up 2 cores per shard, and distributed searches validate a successful completion of this example/exercise. But DO NOT assume that any new data that you index will be distributed across and indexed at each core of a given shard. That will not happen. Distributed Indexing is not part of SolrCloud yet. You may however adapt a basic implementation of distributed indexing by referring to SOLR-2355.
On Fri, Sep 30, 2011 at 11:26 AM, Pulkit Singhal pulkitsing...@gmail.com wrote:
> SOLR-2355 is definitely a step in the right direction but something I would like to get clarified: a) There were some fixes to it that went on the 3.4 3.5 branch based on the comments section ... are they not available or not needed on 4.x trunk? b) Does this basic implementation distribute across shards or across cores? I think that distributing across all the cores in a shard is the key towards using it successfully with SolrCloud and I really don't know if this does this right now as I am not familiar with the source code. If someone could answer this it would be great otherwise I'll post back eventually when I do become familiar. Cheers, - Pulkit
Re: basic solr cloud questions
Thanks Pulkit! I'd actually been meaning to add the post.jar commands needed to index a doc to each shard to the wiki. Waiting till I streamline a few things though.
- Mark
On Sep 30, 2011, at 12:35 PM, Pulkit Singhal wrote:
> BTW, I updated the wiki with the following; I hope it keeps things simple for others starting out:
> Example B: Simple two shard cluster with shard replicas
> Note: This setup leverages copy/paste to set up 2 cores per shard, and distributed searches validate a successful completion of this example/exercise. But DO NOT assume that any new data that you index will be distributed across and indexed at each core of a given shard. That will not happen. Distributed Indexing is not part of SolrCloud yet. You may however adapt a basic implementation of distributed indexing by referring to SOLR-2355.
> On Fri, Sep 30, 2011 at 11:26 AM, Pulkit Singhal pulkitsing...@gmail.com wrote:
>> SOLR-2355 is definitely a step in the right direction but something I would like to get clarified: a) There were some fixes to it that went on the 3.4 3.5 branch based on the comments section ... are they not available or not needed on 4.x trunk? b) Does this basic implementation distribute across shards or across cores? I think that distributing across all the cores in a shard is the key towards using it successfully with SolrCloud and I really don't know if this does this right now as I am not familiar with the source code. If someone could answer this it would be great otherwise I'll post back eventually when I do become familiar. Cheers, - Pulkit
- Mark Miller lucidimagination.com 2011.lucene-eurocon.org | Oct 17-20 | Barcelona
Re: basic solr cloud questions
On 9/30/2011 12:26 PM, Pulkit Singhal wrote:
> SOLR-2355 is definitely a step in the right direction but something I would like to get clarified:
Questions about SOLR-2355 are best asked in SOLR-2355 :)
> b) Does this basic implementation distribute across shards or across cores?
From a brief look, it seems to assume shard=core. You list all cores in the config file under shards.
Re: basic solr cloud questions
That was kinda my point. The new cloud implementation is not about replication, nor should it be. But rather about horizontal scalability where nodes manage different parts of a unified index. One of the design goals of the new cloud implementation is for this to happen more or less automatically. To me that means one does not have to manually distribute documents or enforce replication as Yury suggests. Replication is different to me than what was being asked. And perhaps I misunderstood the original question. Yury's response introduced the term core where the original person was referring to nodes. For all I know, those are two different things in the new cloud design terminology (I believe they are). I guess understanding cores vs. nodes vs. shards is helpful. :)
cheers! Darren
On 09/29/2011 12:00 AM, Pulkit Singhal wrote:
> @Darren: I feel that the question itself is misleading. Creating shards is meant to separate out the data ... not keep the exact same copy of it. I think the two node setup that was attempted by Sam misled him and us into thinking that configuring two nodes which are to be named shard1 ... somehow means that they are instantly replicated too ... this is not the case! I can see how this misunderstanding can develop as I too was confused until Yury cleared it up.
> @Sam: If you are interested in performing a quick exercise to understand the pieces involved for replication rather than sharding ... perhaps this link would be of help in taking you through it: http://pulkitsinghal.blogspot.com/2011/09/setup-solr-master-slave-replication.html
> - Pulkit
> 2011/9/27 Yury Kats yuryk...@yahoo.com:
>> On 9/27/2011 5:16 PM, Darren Govoni wrote:
>>> On 09/27/2011 05:05 PM, Yury Kats wrote:
>>>> You need to either submit the docs to both nodes, or have a replication setup between the two. Otherwise they are not in sync.
>>> I hope that's not the case. :/ My understanding (or hope maybe) is that the new Solr Cloud implementation will support auto-sharding and distributed indexing. This means that shards will receive different documents regardless of which node received the submitted document (spread evenly based on a hash-node assignment). Distributed queries will thus merge all the solr shard/node responses.
>> All cores in the same shard must somehow have the same index. Only then can you continue servicing searches when individual cores fail. Auto-sharding and distributed indexing don't have anything to do with this. In the future, SolrCloud may be managing replication between cores in the same shard automatically. But right now it does not.
Re: basic solr cloud questions
On 9/29/2011 7:22 AM, Darren Govoni wrote:
> That was kinda my point. The new cloud implementation is not about replication, nor should it be. But rather about horizontal scalability where nodes manage different parts of a unified index.
It's about many things. You stated one, but there are other goals, one of them being tolerance to node outages. In a cloud, when one of your many nodes fails, you don't want to stop querying and indexing. For this to happen, you need to maintain redundant copies of the same pieces of the index, hence you need to replicate.
> One of the design goals of the new cloud implementation is for this to happen more or less automatically.
True, but there is a big gap between goals and current state. Right now, there is distributed search, but not distributed indexing, auto-sharding, or auto-replication. So if you want to use SolrCloud now (as many of us do), you need to do a number of things yourself, even if they might be done by SolrCloud automatically in the future.
> To me that means one does not have to manually distribute documents or enforce replication as Yury suggests. Replication is different to me than what was being asked. And perhaps I misunderstood the original question. Yury's response introduced the term core where the original person was referring to nodes. For all I know, those are two different things in the new cloud design terminology (I believe they are). I guess understanding cores vs. nodes vs. shards is helpful. :)
A shard is a slice of the index. An index is managed/stored in a core. Nodes are Solr instances, usually physical machines. Each node can host multiple shards, and each shard can consist of multiple cores. However, all cores within the same shard must have the same content.
This is where the OP ran into the problem. The OP had 1 shard, consisting of two cores on two nodes. Since there is no distributed indexing yet, all documents were indexed into a single core. However, there is distributed search, therefore queries were sent randomly to different cores of the same shard. Since one core in the shard had documents and the other didn't, the query result was random.
To solve this problem, the OP must make sure all cores within the same shard (be they on the same node or not) have the same content. This can currently be achieved by:
a) setting up replication between cores: you index into one core and the other core replicates the content
b) indexing into both cores
Hope this clarifies.
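Option b) can be sketched in a few lines of Python. This is only an illustration, not something posted in the thread: the core URLs in the usage note are hypothetical, the helper names `post_xml` and `index_into_all_cores` are made up for this sketch, and it assumes each core exposes the standard XML update handler at `/update`.

```python
import urllib.request
from typing import Callable, Dict, Iterable


def post_xml(update_url: str, xml_body: str) -> int:
    """POST an XML update message to one update handler; return the HTTP status."""
    request = urllib.request.Request(
        update_url,
        data=xml_body.encode("utf-8"),  # supplying data makes this a POST
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_into_all_cores(
    core_urls: Iterable[str],
    xml_body: str,
    post: Callable[[str, str], int] = post_xml,
) -> Dict[str, int]:
    """Send the same update to every core of a shard so their content matches.

    `post` is injectable (hypothetical design choice) so the fan-out logic
    can be exercised without a running Solr instance.
    """
    statuses = {}
    for core_url in core_urls:
        update_url = core_url.rstrip("/") + "/update"
        statuses[core_url] = post(update_url, xml_body)
    return statuses
```

A call might look like `index_into_all_cores(["http://localhost:8983/solr/collection1", "http://localhost:7574/solr/collection1"], '<add><doc><field name="id">1</field></doc></add>')`, followed by the same call with `<commit/>`. Note that this does nothing about partial failures: if one core rejects the update, the cores drift apart again, which is the argument for option a).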
Re: basic solr cloud questions
Agree. Thanks also for clarifying. It helps.
On 09/29/2011 08:50 AM, Yury Kats wrote:
> On 9/29/2011 7:22 AM, Darren Govoni wrote:
>> That was kinda my point. The new cloud implementation is not about replication, nor should it be. But rather about horizontal scalability where nodes manage different parts of a unified index.
> It's about many things. You stated one, but there are other goals, one of them being tolerance to node outages. In a cloud, when one of your many nodes fails, you don't want to stop querying and indexing. For this to happen, you need to maintain redundant copies of the same pieces of the index, hence you need to replicate.
>> One of the design goals of the new cloud implementation is for this to happen more or less automatically.
> True, but there is a big gap between goals and current state. Right now, there is distributed search, but not distributed indexing, auto-sharding, or auto-replication. So if you want to use SolrCloud now (as many of us do), you need to do a number of things yourself, even if they might be done by SolrCloud automatically in the future.
>> To me that means one does not have to manually distribute documents or enforce replication as Yury suggests. Replication is different to me than what was being asked. And perhaps I misunderstood the original question. Yury's response introduced the term core where the original person was referring to nodes. For all I know, those are two different things in the new cloud design terminology (I believe they are). I guess understanding cores vs. nodes vs. shards is helpful. :)
> A shard is a slice of the index. An index is managed/stored in a core. Nodes are Solr instances, usually physical machines. Each node can host multiple shards, and each shard can consist of multiple cores. However, all cores within the same shard must have the same content.
> This is where the OP ran into the problem. The OP had 1 shard, consisting of two cores on two nodes. Since there is no distributed indexing yet, all documents were indexed into a single core. However, there is distributed search, therefore queries were sent randomly to different cores of the same shard. Since one core in the shard had documents and the other didn't, the query result was random.
> To solve this problem, the OP must make sure all cores within the same shard (be they on the same node or not) have the same content. This can currently be achieved by:
> a) setting up replication between cores: you index into one core and the other core replicates the content
> b) indexing into both cores
> Hope this clarifies.
Re: basic solr cloud questions
2011/9/29 Yury Kats yuryk...@yahoo.com:
> True, but there is a big gap between goals and current state. Right now, there is distributed search, but not distributed indexing, auto-sharding, or auto-replication. So if you want to use SolrCloud now (as many of us do), you need to do a number of things yourself, even if they might be done by SolrCloud automatically in the future.
There is a patch in Jira: https://issues.apache.org/jira/browse/SOLR-2355 that adds an update processor suitable for doing simple distributed indexing with the current version of Solr.
-- Sami Siren
Re: basic solr cloud questions
@Darren: I feel that the question itself is misleading. Creating shards is meant to separate out the data ... not keep the exact same copy of it. I think the two node setup that was attempted by Sam misled him and us into thinking that configuring two nodes which are to be named shard1 ... somehow means that they are instantly replicated too ... this is not the case! I can see how this misunderstanding can develop as I too was confused until Yury cleared it up.
@Sam: If you are interested in performing a quick exercise to understand the pieces involved for replication rather than sharding ... perhaps this link would be of help in taking you through it: http://pulkitsinghal.blogspot.com/2011/09/setup-solr-master-slave-replication.html
- Pulkit
2011/9/27 Yury Kats yuryk...@yahoo.com:
> On 9/27/2011 5:16 PM, Darren Govoni wrote:
>> On 09/27/2011 05:05 PM, Yury Kats wrote:
>>> You need to either submit the docs to both nodes, or have a replication setup between the two. Otherwise they are not in sync.
>> I hope that's not the case. :/ My understanding (or hope maybe) is that the new Solr Cloud implementation will support auto-sharding and distributed indexing. This means that shards will receive different documents regardless of which node received the submitted document (spread evenly based on a hash-node assignment). Distributed queries will thus merge all the solr shard/node responses.
> All cores in the same shard must somehow have the same index. Only then can you continue servicing searches when individual cores fail. Auto-sharding and distributed indexing don't have anything to do with this. In the future, SolrCloud may be managing replication between cores in the same shard automatically. But right now it does not.
basic solr cloud questions
Hi all, I'm a relatively new Solr user, and recently I discovered the interesting SolrCloud feature. I have some basic questions (please excuse me if I get the terminology wrong):
- From my understanding, this is still a work in progress. How mature is it? Is there any estimate on the official release?
- Has the solr_cluster.properties configuration been implemented? It's mentioned in http://wiki.apache.org/solr/NewSolrCloudDesign. I was trying to play with it a bit but I couldn't find the file.
- I tried to set up a two-node, 1-shard cluster, e.g. active-active Solr with fault tolerance. (This isn't possible with the old replication feature, right?) I have both instances of Solr configured to use <core name="collection1" instanceDir="." shard="shard1"/>, and I started each instance with its own instance of ZooKeeper to form an ensemble. From the ZooKeeper admin page, I can see both nodes under shard1. I can submit documents fine. However, when I do a search, it appears that only one node has the submitted documents (e.g. if I keep refreshing, I get different results depending on which node gets assigned the work). My search URL is http://localhost:8983/solr/collection1/select?distrib=true&q=*:*. Did I miss something?
thanks
Re: basic solr cloud questions
On 09/27/2011 05:05 PM, Yury Kats wrote:
> You need to either submit the docs to both nodes, or have a replication setup between the two. Otherwise they are not in sync.
I hope that's not the case. :/ My understanding (or hope maybe) is that the new Solr Cloud implementation will support auto-sharding and distributed indexing. This means that shards will receive different documents regardless of which node received the submitted document (spread evenly based on a hash-node assignment). Distributed queries will thus merge all the solr shard/node responses. This is similar in theory to how memcache and other big-scale DHTs work. If it's just manually replicated indexes then it's not really a step forward from current Solr. :/
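The hash-based assignment hoped for above can be illustrated with a small Python sketch. To be clear about the assumptions: this is not SolrCloud's algorithm (the thread establishes that distributed indexing does not exist yet), and the function names, the choice of SHA-256, and the simple modulo scheme are all picked here purely for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard number in [0, num_shards)."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # Use a stable hash: Python's built-in hash() of str varies per process.
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(doc_ids: Iterable[str], num_shards: int) -> Dict[int, List[str]]:
    """Group document ids by the shard each one hashes to."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for doc_id in doc_ids:
        buckets[shard_for(doc_id, num_shards)].append(doc_id)
    return dict(buckets)
```

Because the shard depends only on the document id, any node that receives a document can compute where it belongs. A plain modulo has a known drawback: changing the number of shards reassigns most documents, which is why DHT-style systems tend to use consistent hashing instead.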
Re: basic solr cloud questions
On 9/27/2011 5:16 PM, Darren Govoni wrote:
> On 09/27/2011 05:05 PM, Yury Kats wrote:
>> You need to either submit the docs to both nodes, or have a replication setup between the two. Otherwise they are not in sync.
> I hope that's not the case. :/ My understanding (or hope maybe) is that the new Solr Cloud implementation will support auto-sharding and distributed indexing. This means that shards will receive different documents regardless of which node received the submitted document (spread evenly based on a hash-node assignment). Distributed queries will thus merge all the solr shard/node responses.
All cores in the same shard must somehow have the same index. Only then can you continue servicing searches when individual cores fail. Auto-sharding and distributed indexing don't have anything to do with this. In the future, SolrCloud may be managing replication between cores in the same shard automatically. But right now it does not.
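A toy Python model may make the point about cores within a shard concrete. It is a sketch under stated assumptions, not Solr code: `Core`, `Shard`, and `match_all` are hypothetical names, a core's index is reduced to a set of document ids, and a query is answered by one randomly chosen live core, as the thread describes.

```python
import random
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Core:
    """One physical index; `docs` stands in for the indexed document ids."""
    name: str
    docs: Set[str] = field(default_factory=set)
    alive: bool = True


@dataclass
class Shard:
    """A slice of the index, served by one or more cores."""
    cores: List[Core]

    def in_sync(self) -> bool:
        # True when every core holds exactly the same documents.
        return all(core.docs == self.cores[0].docs for core in self.cores)

    def match_all(self, rng: random.Random) -> Set[str]:
        # A query is answered by one randomly chosen live core of the shard.
        live = [core for core in self.cores if core.alive]
        if not live:
            raise RuntimeError("no live core left in this shard")
        return set(rng.choice(live).docs)
```

With `Shard([Core("node1", {"doc1"}), Core("node2")])`, repeated `match_all` calls return either `{"doc1"}` or an empty set depending on which core is picked, which matches the symptom in the original question. Once both cores hold the same documents, every call returns the same result, and marking one core as not alive no longer changes what a search sees.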