Re: Replicating Between Solr Clouds
Are there any more OOB solutions for inter-SolrCloud replication now? Our indexing is so slow that we cannot rely on a complete re-index of data from our DB of record (SQL) to recover data in the Solr indices. -- View this message in context: http://lucene.472066.n3.nabble.com/Replicating-Between-Solr-Clouds-tp4121196p4153856.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Replicating Between Solr Clouds
I've been working on this tool, which wraps the collections API to do more advanced cluster-management operations: https://github.com/whitepages/solrcloud_manager

One of the operations I've added (copy) is a deployment mechanism that uses the replication handler's snap puller to hot-load a pre-indexed collection from one SolrCloud cluster into another. You create the same collection name with the same shard count in two clusters, index into one, and copy from that into the other.

This method won't work as a method of active replication, since it copies the whole index. If you only need a periodic copy between data centers, though, or want someplace to restore from in case of critical failure (until you can properly rebuild), there might be something you can use here.

On 8/19/14, 12:45 PM, reparker23 reparke...@gmail.com wrote:
Are there any more OOB solutions for inter-SolrCloud replication now? Our indexing is so slow that we cannot rely on a complete re-index of data from our DB of record (SQL) to recover data in the Solr indices.
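For reference, the snap-puller copy the tool wraps can also be driven by hand through the replication handler's fetchindex command. This is a minimal illustrative sketch, not the tool's own code; the hostnames and core names are hypothetical, and it assumes the two clusters' cores line up shard-for-shard:

```python
# Illustrative only: make a core in the target cluster pull the index a
# matching core in the source cluster is serving, via the replication
# handler's fetchindex command. Hostnames and core names are hypothetical.

def fetchindex_url(target_core_url, source_core_url):
    """Build the URL that makes the target core fetch the source core's
    index on demand (masterUrl points at the source's replication handler)."""
    return (f"{target_core_url}/replication?command=fetchindex"
            f"&masterUrl={source_core_url}/replication")

if __name__ == "__main__":
    import urllib.request
    # Shard 1 of a "stuff" collection in each data center (invented names)
    url = fetchindex_url(
        "http://dc-b-solr1:8983/solr/stuff_shard1_replica1",
        "http://dc-a-solr1:8983/solr/stuff_shard1_replica1")
    print(url)
    # urllib.request.urlopen(url)  # uncomment against live clusters
```

Each shard would need its own fetchindex call, which is essentially the 1:1 node mapping discussed later in this thread.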
Re: Replicating Between Solr Clouds
Toby Lazar wrote:
Unless Solr is your system of record, aren't you already replicating your source data across the WAN? If so, could you load Solr in colo B from your colo B data source? You may be duplicating some indexing work, but at least your colo B Solr would be more closely in sync with your colo B data.

Our system of record exists in a SQL DB that is indeed replicated via always-on mirroring to the failover data center. However, a complete forced re-index of all of the data could take hours, and our SLA requires us to be back up with searchable indices in minutes. Because we may have to replicate multiple data centers' data (three-plus data centers: A, B, and the failover DC) into this failover data center, we can't dedicate the failover data center's SolrCloud to constantly re-index data from a single SQL mirror when we could potentially need it to take over for any given one.

One thought we had was to have DCs A and B run a cron job that forces a backup of the indices using the replication?command=backup API command, and then sync those backup snapshots to the failover DC's shut-down SolrCloud instance, into a separate filesystem directory dedicated to DC A's or DC B's indices. Then, in the case of a failover, we would run a script that symlinks the snapshots for the particular DC we want to fail over for into the index dir of the failover DC's SolrCloud, and then start up the nodes.

The problem comes with how to handle different indices on different nodes in the SolrCloud, since we have two shards. We would have to do a 1:1 copy of each of the four nodes in DCs A and B to each corresponding node in the failover DC. Sounds pretty ugly. Looking at this thread, even this plan may not work: http://lucene.472066.n3.nabble.com/solrcloud-shards-backup-restoration-td4088447.html

As far as the SolrEntityProcessor, I'm not sure how you would configure it.
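The symlink step in the failover plan above could be scripted roughly like this. The paths are hypothetical, and it assumes Solr on the failover node is stopped while the link is swapped:

```python
import os

# Illustrative only: the failover-time symlink swap described above.
# Paths are invented for the example; run with Solr stopped.

def link_snapshot(snapshot_dir, core_data_dir):
    """Point <core_data_dir>/index at the synced snapshot for the DC
    being failed over, replacing any previous link."""
    index_link = os.path.join(core_data_dir, "index")
    if os.path.islink(index_link):
        os.remove(index_link)  # drop the link to the other DC's snapshot
    os.symlink(snapshot_dir, index_link)
```

A failover script would call this once per core on each node, e.g. `link_snapshot("/backups/dca/snapshot.20140306", "/var/solr/data/stuff_shard1_replica1/data")` (both paths invented), then start the nodes.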
From what I gather, you have to configure a new requestHandler section in your solrconfig.xml like this:

  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">/data/solr/mysolr/conf/data-config.xml</str>
    </lst>
  </requestHandler>

And then you have to configure a /data/solr/mysolr/conf/data-config.xml with the following contents:

  <dataConfig>
    <document>
      <entity name="sep" processor="SolrEntityProcessor"
              url="http://solrsource.example.com:8983/solr/"
              query="*:*"/>
    </document>
  </dataConfig>

However, this doesn't seem to work for me, as I'm using a SolrCloud with ZooKeeper. I created these files in my conf directory and uploaded them to ZooKeeper, then reloaded the collection/cores, but all I got were initialization errors. I don't think the docs assume you'll be doing this under a SolrCloud scenario. Any other insight?
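Once a /dataimport handler like the above does load, the import itself is driven over HTTP. A hypothetical sketch (host and core names invented; the "busy"/"idle" reading of the status payload is my understanding of the DIH docs and worth verifying):

```python
import json

# Illustrative only: trigger and monitor a SolrEntityProcessor import
# through the /dataimport handler configured above.

def dih_command_url(core_base_url, command):
    """URL for a DataImportHandler command (full-import, status, abort)."""
    return f"{core_base_url}/dataimport?command={command}"

def import_finished(status_response_json):
    """True once a /dataimport?command=status&wt=json response reports the
    handler idle again (it reports "busy" while an import runs)."""
    return json.loads(status_response_json).get("status") == "idle"

if __name__ == "__main__":
    base = "http://failover-solr:8983/solr/mysolr"  # hypothetical
    print(dih_command_url(base, "full-import"))  # kick off the import
    print(dih_command_url(base, "status"))       # poll until finished
```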
Re: Replicating Between Solr Clouds
On 3/6/2014 7:54 AM, perdurabo wrote:
However, a complete forced re-index of all of the data could take hours and our SLA requires us to be back up with searchable indices in minutes.

There are a lot of issues with availability and multiple data centers that must be addressed before SolrCloud can handle this all internally. Until that day comes, here's what I would do:

Have a SolrCloud install at each online data center, just as you already do. It should have collection names that are unique to the functions of that DC, and may include the DC name. If you MUST have the same collection name in all online data centers despite there being different data, you can use collection aliasing. The actual collection name would be something like stuff_dca, but you'd have an alias called stuff that can be used for both indexing and querying.

You would need to index the data for all data centers to the SolrCloud install at the failover DC. Ideally that would be done from the failover DC's SQL, not over the WAN ... but it really wouldn't matter. Because each production DC collection will have a unique name, all collections can coexist on the failover SolrCloud.
If a failover becomes necessary, you can make or change any required collection aliases on the fly.

Although I don't use SolrCloud, and I don't have multiple data centers, my own index uses a similar paradigm. I have two completely independent copies of my index. My indexing program knows about them both and indexes them independently. There is another benefit to this: I can make changes (Solr upgrades, new config/schema, a complete rebuild, etc.) to one copy of my index without affecting the search application at all. By simply enabling or disabling the ping handler in Solr, my load balancer will keep requests going to whichever copy I choose.

Thanks,
Shawn
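The alias switch Shawn describes is a single Collections API call at failover time. A sketch with invented host, alias, and collection names:

```python
# Illustrative only: repoint a collection alias at failover time via the
# Collections API. CREATEALIAS creates the alias, or moves it if it
# already exists. Names below are invented for the example.

def create_alias_url(solr_base_url, alias, collection):
    return (f"{solr_base_url}/admin/collections?action=CREATEALIAS"
            f"&name={alias}&collections={collection}")

if __name__ == "__main__":
    # DC A has failed: point the shared "stuff" alias at the failover
    # cluster's copy of DC A's collection.
    print(create_alias_url("http://failover-solr:8983/solr",
                           "stuff", "stuff_dca"))
```

Because indexing and querying both go through the alias, clients don't need reconfiguration when it moves.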
Re: Replicating Between Solr Clouds
Well, I think I finally figured out how to get SolrEntityProcessor to work, but there are still some issues. I had to add a library path to solrconfig.xml, but the cores are finally coming up, and I am now manually able to run a data import that does seem to index all of the documents on the remote SolrCloud. I ran into the issue here, where I got version conflicts: http://lucene.472066.n3.nabble.com/Version-conflict-during-data-import-from-another-Solr-instance-into-clean-Solr-td4046937.html

I used the suggestion of adding fl=*,old_version:_version_ to the data-config.xml entity config line. This seems to be working, but I don't know if this will cause a problem. When I do a manual data import, I get the correct number of documents from the source SolrCloud (the total number of docs added up between both shards is 6357 in this test case):

Indexing completed. Added/Updated: 6,357 documents. Deleted 0 documents. (Duration: 22s)
Requests: 0 (0/s), Fetched: 6,357 (289/s), Skipped: 0, Processed: 6,357

However, when I check the number of docs indexed for each shard in the core admin UI on the destination SolrCloud, the numbers are way off and a lot less than 6357. There's nothing in the logs to indicate collisions or dropped documents. What could account for the disparity?

I would assume down the road what I need to do is configure multiple collections/cores on the failover cluster representing each DC it's replicating from, but how would you create multiple collections when using ZooKeeper? How do you upload multiple sets of config files for each one and keep them separate?
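One way to chase a disparity like that is to count documents per core directly, bypassing distributed search, and compare source against destination shard by shard. A sketch with hypothetical core URLs; the fetcher is injected so the comparison logic can be exercised without a live cluster:

```python
# Illustrative only: compare per-core document counts by querying each
# core directly with distrib=false (so numFound covers just that core,
# not the whole collection). Core URLs below are hypothetical.

def count_url(core_url):
    return f"{core_url}/select?q=*:*&rows=0&distrib=false&wt=json"

def shard_counts(core_urls, fetch_json):
    """Return {core_url: numFound}; fetch_json(url) -> parsed response."""
    return {u: fetch_json(count_url(u))["response"]["numFound"]
            for u in core_urls}
```

Running this against both clusters' shard cores, with something like urllib plus json.load as the fetcher, would show which shard the missing documents belong to.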
Re: Replicating Between Solr Clouds
Unless Solr is your system of record, aren't you already replicating your source data across the WAN? If so, could you load Solr in colo B from your colo B data source? You may be duplicating some indexing work, but at least your colo B Solr would be more closely in sync with your colo B data.

Toby

Sent via BlackBerry by AT&T

-----Original Message-----
From: Tim Potter tim.pot...@lucidworks.com
Date: Wed, 5 Mar 2014 02:51:21
To: solr-user@lucene.apache.org
Reply-To: solr-user@lucene.apache.org
Subject: RE: Replicating Between Solr Clouds
RE: Replicating Between Solr Clouds
Unfortunately, there is no out-of-the-box solution for this at the moment. In the past, I solved this using a couple of different approaches, which weren't all that elegant but served the purpose and were simple enough to allow the ops folks to set up monitors and alerts if things didn't work.

1) Use DIH's SolrEntityProcessor to pull data from one Solr to another; see: http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor This only works if you store all fields, which in my use case was OK because I also did lots of partial document updates, which also required me to store all fields.

2) Use the replication handler's snapshot support to create snapshots on a regular basis and then move the files over the network. This one works but required the use of read and write aliases and two collections on the remote (slave) data center, so that I could rebuild my write collection from the snapshots and then update the aliases to point the reads at the updated collection.

Work on an automated backup/restore solution is planned (see https://issues.apache.org/jira/browse/SOLR-5750), but if you need something sooner, you can write a backup driver using SolrJ that uses CloudSolrServer to get the address of all the shard leaders, initiate the backup command on each leader, poll the replication details handler for snapshot completion on each shard, and then ship the files across the network. Obviously, this isn't a solution for NRT multi-homing ;-)

Lastly, these aren't the only ways to go about this; I just wanted to share some high-level details about what has worked.

Timothy Potter
Sr. Software Engineer, LucidWorks
www.lucidworks.com

From: perdurabo robert_par...@volusion.com
Sent: Tuesday, March 04, 2014 1:04 PM
To: solr-user@lucene.apache.org
Subject: Replicating Between Solr Clouds

We are looking to set up a highly available failover site across a WAN for our SolrCloud instance. The main production instance is at colo center A and consists of a 3-node ZooKeeper ensemble managing configs for a 4-node SolrCloud running Solr 4.6.1. We only have one collection among the 4 cores, and there are two shards in the collection, one master node and one replica node for each shard. Our search and indexing services address the SolrCloud through a load balancer VIP, not a compound API call.

Anyway, the Solr wiki explains fairly well how to replicate single-node Solr collections, but I do not see an obvious way to replicate a SolrCloud's indices over a WAN to another SolrCloud. I need a SolrCloud in another data center to be able to replicate both shards of the collection in the other data center over a WAN. It needs to be able to replicate from a load balancer VIP, not a single named server of the SolrCloud, which round-robins across all four nodes/2 shards for high availability. I've searched high and low for a white paper or some discussion of how to do this and haven't found anything. Any ideas? Thanks in advance.
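Tim's backup-driver outline can be sketched in Python rather than SolrJ. Leader discovery via CloudSolrServer/ZooKeeper is assumed to happen elsewhere (the leader core URLs are simply passed in), and the shape of the replication details payload, including the "backup" section and its "status" field, is an assumption to check against your Solr version:

```python
import time

# Illustrative only: trigger a backup on each shard leader, then poll the
# replication details handler until every snapshot reports completion.
# fetch_json(url) -> parsed JSON response; injected so the loop can be
# exercised without a cluster. Leader URLs are hypothetical.

def _nl_get(nl, key):
    """Solr NamedLists serialize as {"k": v} or as a flat ["k", v, ...]
    list depending on the json.nl setting; handle both."""
    if isinstance(nl, dict):
        return nl.get(key)
    it = iter(nl or [])
    return dict(zip(it, it)).get(key)

def drive_backups(leader_urls, fetch_json, poll_interval=5, timeout=600):
    for u in leader_urls:                      # 1. start a snapshot per shard
        fetch_json(f"{u}/replication?command=backup&wt=json")
    deadline = time.time() + timeout
    pending = set(leader_urls)
    while pending and time.time() < deadline:  # 2. poll for completion
        for u in list(pending):
            details = fetch_json(f"{u}/replication?command=details&wt=json")
            backup = _nl_get(details.get("details", {}), "backup")
            if _nl_get(backup or {}, "status") == "success":
                pending.discard(u)
        if pending:
            time.sleep(poll_interval)
    return not pending  # True when every shard's snapshot finished in time
```

Once this returns, a separate step (rsync or similar) would ship the snapshot directories across the network, as Tim describes.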