I am trying to get to the bottom of what causes the loss of Solr Cloud 
configurations stored in a Zookeeper ensemble. We have been running 4 Solr 
clouds in our data centers for about 5 years now with no problems. About 2 
years ago 
we started adding more clouds specifically in AWS.  During those two years, we 
have had instances where the Solr configurations stored in Zookeeper have just 
disappeared. About a year ago we added some new Solr clouds to our own 
data centers and experienced two instances of the Solr configurations 
disappearing from Zookeeper. The difference between our original Solr cloud 
instances and the ones we have spun up in the past two years is that the newer 
ones use Exhibitor for Zookeeper ensemble management.

We have not been able to find anything in the logs indicating why this problem 
happens, and we have not been able to replicate it reliably. The closest I 
have come is when adding new Zookeepers to an ensemble and performing a rolling 
restart via Exhibitor: there have been a few instances where pretty much 
everything stored in Zookeeper has been deleted, everything except the 
Zookeeper information itself. We have asked around on Exhibitor support 
channels and done a lot of searching but have come up empty-handed with regard 
to a solution or finding other people who have had this issue.

What I suspect is happening is that when a rolling restart happens, if the node 
that becomes the leader is a new node that has not yet had the data replicated 
to it, the other nodes that join this leader see that the leader is without the 
data they have stored and conclude they should delete said data. In the cases 
where we are not adding new nodes, I suspect that there might be an issue 
causing a Zookeeper node to fail or appear failed to Exhibitor. A rolling 
restart occurs to remove this node. When Exhibitor registers that the Zookeeper 
is available again, Exhibitor initiates a rolling restart to bring the node 
back in. For some reason the data is corrupted or lost on that node and this is 
the node that becomes the leader. The remaining nodes that join this leader 
then dump their data to match the leader.
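
One way to gather evidence for or against this hypothesis is to record each 
node's role, last transaction id (zxid) and znode count just before and just 
after a rolling restart. The sketch below is only an illustration, not 
something we run today: it sends Zookeeper's `srvr` four-letter command to each 
node over a plain socket and flags nodes whose zxid is behind the highest one 
seen. The host names are hypothetical, the default client port 2181 is assumed, 
and depending on the Zookeeper version the four-letter commands may need to be 
enabled in the server configuration.

```python
import socket


def parse_srvr(text):
    """Extract mode, zxid and znode count from the text returned by `srvr`."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Mode":
            info["mode"] = value                # leader, follower or standalone
        elif key == "Zxid":
            info["zxid"] = int(value, 16)       # reported as hex, e.g. 0x100000002
        elif key == "Node count":
            info["node_count"] = int(value)
    return info


def query_srvr(host, port=2181, timeout=5.0):
    """Send the `srvr` command to one node and return the parsed fields."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:                        # server closes after replying
                break
            chunks.append(data)
    return parse_srvr(b"".join(chunks).decode("utf-8", "replace"))


def find_stale_nodes(states):
    """Return the hosts whose zxid is lower than the highest zxid seen."""
    newest = max(state["zxid"] for state in states.values())
    return sorted(host for host, state in states.items() if state["zxid"] < newest)


# Example use (hypothetical host names):
#   hosts = ["zk1.example.com", "zk2.example.com", "zk3.example.com"]
#   states = {host: query_srvr(host) for host in hosts}
#   print(states)
#   print("behind the newest zxid:", find_stale_nodes(states))
```

On a busy ensemble a follower can trail the leader by a few transactions, so a 
small gap is not alarming by itself. A leader that reports a zxid of 0 or a very 
low znode count right after a restart would be much stronger evidence for the 
scenario described above.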

Does this scenario sound plausible? If a newly added node that does not yet 
have data replicated to it is added to a Zookeeper ensemble and the Zookeepers 
are restarted with the new node becoming the leader, could this prompt the data 
stored in Zookeeper to be deleted?


--
Daniel S Washko
Solutions Architect

Phone: 757 667 1463
[email protected]


gannett.com<http://www.gannett.com/>


