I am trying to get to the bottom of why configurations for Solr cloud stored in a Zookeeper ensemble are being lost. We have been running 4 Solr clouds in our data centers for about 5 years now with no problems. About 2 years ago we started adding more clouds, specifically in AWS. During those two years, we have had instances where the Solr configurations stored in Zookeeper have just disappeared. About a year ago we added some new Solr clouds to our own data centers and experienced two instances of the Solr configurations disappearing from Zookeeper. The difference between our original Solr cloud instances and the ones we have spun up in the past two years is that the newer ones use Exhibitor for Zookeeper ensemble management.

We have not been able to find anything in the logs indicating why this problem happens, and we have not been able to replicate it reliably. The closest I have come is when adding new Zookeepers to an ensemble and performing a rolling restart via Exhibitor: there have been a few instances where pretty much everything stored in Zookeeper has been deleted, everything except the Zookeeper information itself. We have asked around on Exhibitor support channels and done a lot of searching, but have come up empty-handed with regard to a solution or finding other people who have had this issue.

What I suspect is happening is this: during a rolling restart, if the node that becomes the leader is a new node that has not yet had the data replicated to it, then when the other nodes join this leader they see that the leader lacks the data they have stored, and so they delete that data.

In the cases where we are not adding new nodes, I suspect there might be an issue causing a Zookeeper node to fail, or to appear failed to Exhibitor. A rolling restart occurs to remove this node. When Exhibitor registers that the Zookeeper node is available again, it initiates another rolling restart to bring the node back in. For some reason the data on that node is corrupted or lost, and this is the node that becomes the leader. The remaining nodes that join this leader then dump their data to match the leader.

Does this scenario sound plausible? If a newly added node that has not had data replicated to it is added to a Zookeeper ensemble, and the Zookeepers are restarted with the new node becoming the leader, could this prompt the data stored in Zookeeper to be deleted?

--
Daniel S Washko
Solutions Architect
Phone: 757 667 1463
[email protected]
gannett.com<http://www.gannett.com/>
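
One way to gather evidence for or against this hypothesis would be to record each node's role, last transaction id, and znode count just before and just after a rolling restart. The following is only a minimal sketch of that idea, not something we currently run. It assumes the `srvr` four-letter command is reachable on the client port (2181 is an assumption here, and some Zookeeper versions restrict four-letter commands through configuration). The function names, the `ratio` threshold, and the hostnames in the comments are hypothetical.

```python
import socket


def four_letter_word(host, port=2181, cmd=b"srvr", timeout=5.0):
    """Send a four-letter command to one Zookeeper node and return the reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_srvr(text):
    """Pull Mode, Zxid and Node count out of `srvr` output.

    Returns an empty dict if none of those fields are present,
    for example when the node is not currently serving requests.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Mode":
            info["mode"] = value
        elif key == "Zxid":
            info["zxid"] = int(value, 16)  # reported as hex, e.g. 0x100000002
        elif key == "Node count":
            info["node_count"] = int(value)
    return info


def empty_looking(states, ratio=0.5):
    """Return hosts whose znode count is below `ratio` of the largest count seen.

    Hosts that reported no node count at all are also returned.
    The 0.5 default is an arbitrary illustration, not a recommended value.
    """
    counts = [s["node_count"] for s in states.values() if "node_count" in s]
    if not counts:
        return sorted(states)
    threshold = max(counts) * ratio
    return sorted(h for h, s in states.items()
                  if s.get("node_count", -1) < threshold)


# Example use (hostnames are placeholders):
# states = {h: parse_srvr(four_letter_word(h)) for h in ("zk1", "zk2", "zk3")}
# print(states)
# print("nodes that look empty:", empty_looking(states))
```

If a node that shows up as empty-looking before the restart is the one reporting `Mode: leader` afterwards, that would at least be consistent with the scenario described above. It would not by itself prove the cause.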
