SolrCloud OR distributed Solr
Hello members,

Is there any difference between distributed Solr and SolrCloud? I have product data for three countries. I have indexed one country's data, and its index size is 160 GB+. Now we have to add the other two countries, and I am confused. My client asked me what the difference would be if we procured another Solr server and indexed the new countries separately; I was thinking of SolrCloud. Can someone explain how to describe these two approaches in simple words? If there are any reading links, please share.

Thanks
Re: SolrCloud OR distributed Solr
On 30 March 2014 23:12, Priti Solanki pritiatw...@gmail.com wrote:
> Hello Member, Is there any difference between distributed solr solrCloud ?

You might be confusing the older Solr distributed search with the new SolrCloud:
* Older distributed search: https://wiki.apache.org/solr/DistributedSearch
* SolrCloud: https://cwiki.apache.org/confluence/display/solr/SolrCloud

> My client ask me what is the difference if we procure another Solr server and indexed separately. I was thinking for solrcloud. [...]

With 4.0+ versions of Solr, you probably want to go for SolrCloud.

Regards,
Gora
Re: SolrCloud OR distributed Solr
Distributed Solr is simply the ability of Solr to take the incoming query, send it to multiple shards, and then aggregate the responses. Here a shard is a physical partition of a single logical index. The assumption is that you can't fit the entire index on a single machine and still get the performance you need, so you use N smaller parts. There has to be some mechanism to send the request to each sub-index, assemble the response, and give it back to the client. That's distributed Solr.

Before 4.0, splitting the index up was entirely manual: _you_ decided what document went to what shard, _you_ configured Solr to know where the other shards were, and _you_ handled the situation where a node went down and you had to heal the network. But it was still using distributed search.

As of 4.0, SolrCloud happens. The differences are:
1. You can have Solr automatically distribute the docs to the right shard.
2. When a node goes down, Solr can automatically compensate (assuming more than one replica per shard).
3. When the node comes back up, Solr will automatically re-synchronize the node before (automatically) bringing it back into service.

NOTE: you can still use old-style manual sharding if you choose; it's available in 4.x.

But be careful here and draw a distinction between distributed search and federated search.

Distributed search: what we've been talking about. The underlying assumption is that the sub-indexes are all substantially similar.

Federated search: the sub-indexes (or, indeed, complete self-contained indexes) may have no relation to each other, and you're somehow expected to search them all and return the results. In this case you'll probably be firing off N separate queries (one to each of N indexes) and assembling the results at the app layer.

Best,
Erick

On Sun, Mar 30, 2014 at 1:42 PM, Priti Solanki pritiatw...@gmail.com wrote:
> Hello Member, Is there any difference between distributed solr solrCloud ? [...]
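To make the pre-4.0 picture above a bit more concrete, here is a minimal Python sketch of the three things Erick describes: deciding by hand which shard a document goes to, asking one node to fan a query out through the `shards` request parameter, and merging separately fetched result lists at the app layer. The host names, the `pick_shard`, `distributed_query_url` and `merge_results` helpers, and the use of CRC32 for routing are all assumptions made for illustration; none of them come from the thread, and nothing here talks to a real Solr server.

```python
import zlib
from urllib.parse import urlencode

# Hypothetical shard addresses; substitute your own hosts and cores.
SHARDS = ["solr1:8983/solr", "solr2:8983/solr", "solr3:8983/solr"]


def pick_shard(doc_id, shards=SHARDS):
    """Manual routing: map a document id to one shard, deterministically.

    CRC32 is an arbitrary choice for this sketch, not what Solr uses.
    """
    return shards[zlib.crc32(doc_id.encode("utf-8")) % len(shards)]


def distributed_query_url(entry_point, query, shards=SHARDS, rows=10):
    """Build a select URL that asks one node to query all listed shards."""
    params = {"q": query, "rows": rows, "shards": ",".join(shards)}
    return "http://%s/select?%s" % (entry_point, urlencode(params))


def merge_results(per_shard_hits, rows=10):
    """App-layer merge: flatten the per-index hit lists and sort by score."""
    merged = [hit for hits in per_shard_hits for hit in hits]
    merged.sort(key=lambda hit: hit["score"], reverse=True)
    return merged[:rows]
```

A call such as `distributed_query_url(SHARDS[0], "title:phone")` yields a URL you could send to the first node. Note that `merge_results` sorts on raw scores, which is only reasonable when the indexes are substantially similar; for truly federated indexes the scores are generally not comparable, and the application has to choose its own merge rule.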
SolrCloud vs Distributed Solr
Hi all,

I started following this mailing list about a month ago and I have read many threads about SolrCloud and distributed Solr. I just want to check whether I understood correctly and, if so, ask about some architectural decisions I have to make:

1) At the moment, in order to design a scalable Solr deployment, one can choose between setting up SolrCloud (where the servers are transparent to the client) and a SolrCloud-like setup (distributed mode) where the client has to know which server to contact, right?

2) If so, I don't fully understand why the clients should be made aware of the Solr servers. Why should a client decide on its own where to index or query? Is it because of backward compatibility, performance, or similar issues? From what I understood, SolrCloud does all the magic of hiding the real deployment from the user (with all the resulting benefits of bringing servers up and down, and so on), right?

3) When configuring SolrCloud, I put in solrconfig.xml the list of the shards supporting my collection distribution. E.g.:

    <str name="self">localhost:8983/solr</str>
    <arr name="shards">
      <str>localhost:8983/solr</str>
      <str>someotherhost:7574/solr</str>
    </arr>

How does the splitting work behind the scenes (a link to a detailed explanation is sufficient)?

4) If one day I decide to add one more server to distribute the load, what is the correct procedure to deploy such a change? Does SolrCloud automatically redistribute the index across all shards?

Best,
Flavio
Re: SolrCloud vs Distributed Solr
Flavio:

I think you're missing a critical bit about SolrCloud, namely ZooKeeper (ZK); see here on the SolrCloud page for a start: http://wiki.apache.org/solr/SolrCloud#ZooKeeper

You'll notice that each Solr node, when it is started, requires the address of your ZK ensemble, NOT of a Solr node. That allows ZK to know where all the nodes in your cluster are. Each node knows where all the other shards are because that info is kept in ZK, so any request to any node in the cluster does the right thing, whether update or query. Updates are forwarded to the correct leaders, queries are sent to a member of each shard, and so on, all automatically.

Now take a look at CloudSolrServer (assuming that you're using SolrJ from your client). The constructor takes the address of ZK too. Using this info, the client code has access to information about the state of the entire cluster, so you don't have to do anything; the client code will just know how to connect to Solr.

So for 1, 2 and 3 above, don't do anything <G>. Just start up all the Solr nodes with the proper zkHost (or zkRun) parameter and send requests to any node. You do NOT have to configure shards in solrconfig.xml or anywhere else.

For 4, I'm going to pass on the shard splitting details since I haven't had time to dive into that yet. But increasing capacity comes in two flavors. If you simply need more query throughput, just add more nodes. Solr will assign them to the right shard (although you can control this), copy the index for that shard down, and start automatically routing new requests to that node too.

The second flavor is when your index is too big to fit on your physical hardware and you need more shards (as opposed to more replicas). Then you need to do the shard splitting thing, which I'm going to skip rather than mislead you.

Final note: the other thing that I think is confusing you is the distinction between SolrCloud and Solr Master/Slave. SolrCloud is the new way of doing things. Master/Slave is a setup in which all the automatic stuff you get with SolrCloud must be done manually: assigning documents to particular shards, configuring solrconfig.xml with the addresses of all the other shards, all that stuff.

Best,
Erick
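As a rough illustration of the routing Erick describes (updates go to shard leaders, queries go to one member of each shard), here is a small Python sketch that works from a cluster-state table. The `CLUSTER_STATE` dictionary is a simplified, made-up stand-in for the information SolrCloud keeps in ZooKeeper; its layout, the host names, the `products` collection, and the `leaders` and `query_targets` helpers are assumptions for this example and are not the real ZooKeeper schema or the SolrJ API.

```python
import random

# Simplified, hypothetical cluster state. Real SolrCloud keeps richer
# data in ZooKeeper; hosts and names here are invented.
CLUSTER_STATE = {
    "products": {
        "shard1": [
            {"base_url": "http://solr1:8983/solr", "state": "active", "leader": True},
            {"base_url": "http://solr2:8983/solr", "state": "down", "leader": False},
        ],
        "shard2": [
            {"base_url": "http://solr3:8983/solr", "state": "active", "leader": True},
            {"base_url": "http://solr4:8983/solr", "state": "active", "leader": False},
        ],
    }
}


def leaders(collection, state=CLUSTER_STATE):
    """Return the leader URL of every shard; updates are sent there."""
    return {
        shard: next(r["base_url"] for r in replicas if r["leader"])
        for shard, replicas in state[collection].items()
    }


def query_targets(collection, state=CLUSTER_STATE, rng=random):
    """Pick one active replica per shard; a query must cover every shard."""
    targets = {}
    for shard, replicas in state[collection].items():
        live = [r["base_url"] for r in replicas if r["state"] == "active"]
        if not live:
            raise RuntimeError("no active replica for " + shard)
        targets[shard] = rng.choice(live)
    return targets
```

With the table above, `query_targets("products")` always picks solr1 for shard1, because solr2 is marked down, and either solr3 or solr4 for shard2. That is the "automatically compensate" behaviour in miniature: a node that is down simply stops being chosen, and a client that reads the shared state needs no hard-coded shard list.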
Re: SolrCloud vs Distributed Solr
Thanks for the detailed response Erick, you helped me a lot in clarifying many Solr concepts!

Best,
Flavio

On Mon, Jul 8, 2013 at 1:59 PM, Erick Erickson erickerick...@gmail.com wrote:
> Flavio: I think you're missing a critical bit about SolrCloud, namely Zookeeper (ZK) [...]

--
Flavio Pompermaier
Development Department
OKKAM Srl - www.okkam.it