Solr Cloud: Zookeeper failure modes

Pavel Micka Wed, 02 Jan 2019 00:36:22 -0800

Hi,
We are currently implementing Solr Cloud, and as part of this effort we are 
investigating which failure modes can occur between Solr and Zookeeper.


We have found quite a lot of articles describing the "happy path" failure, where 
ZK stops (loses its majority) and the Solr cluster ceases to serve write requests 
(while reads continue to work as expected). Once the ZK cluster is reconciled and 
a majority is achieved again, everything continues working as expected.
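
For reference, the majority rule behind this "happy path" can be written down as a small sketch. It only assumes that a ZK ensemble needs a strict majority of its servers to be up; the function names are made up for illustration and are not part of any ZK or Solr API.

```python
def has_quorum(ensemble_size: int, alive: int) -> bool:
    """Return True if `alive` servers form a strict majority of the ensemble."""
    return alive > ensemble_size // 2


def tolerated_failures(ensemble_size: int) -> int:
    """Number of servers that can be lost while a strict majority remains."""
    return (ensemble_size - 1) // 2


# A 3-node ensemble survives one failure; a 4-node ensemble also survives only one.
print(has_quorum(3, 2), has_quorum(3, 1))            # True False
print(tolerated_failures(3), tolerated_failures(4))  # 1 1
```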

What we have not been able to find is what happens when the ZK cluster 
fails catastrophically and either loses its data completely (scenario A) or is 
restarted from a backup (scenario B).

So now the questions:

1)      Scenario A - Is an existing Solr Cloud cluster able to start against a 
clean Zookeeper and reconstruct all the ZK data from its internal state (using 
some kind of emergency recovery; it may take a long time)?

2)      Scenario B - What is the worst-case backup/restore scenario? For 
example, when:

a.       ZK is backed up

b.       The cluster performs some transition between states "X -> Y" (such as 
a shard commit, a new leader election, etc.)

c.       ZK fails completely

d.       ZK is restored from backup created in step a

e.       Solr Cloud is in state "Y", while ZK is in state "X"
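
To make the sequence a-e concrete, here is a toy sketch of the divergence we are worried about. It is not based on how Solr or ZK actually store their data: ToyCoordinator, diverged_keys and the key "shard1/leader" are hypothetical names used only to model "ZK holds state X while the nodes are in state Y".

```python
import copy


class ToyCoordinator:
    """Hypothetical stand-in for ZK: a plain dict of cluster metadata."""

    def __init__(self):
        self.state = {}

    def backup(self):
        # Return an independent snapshot of the current metadata.
        return copy.deepcopy(self.state)

    def restore(self, snapshot):
        # Replace the current metadata with a previously taken snapshot.
        self.state = copy.deepcopy(snapshot)


def diverged_keys(coordinator_state, node_state):
    """List the keys on which the coordinator and the nodes disagree."""
    keys = set(coordinator_state) | set(node_state)
    return sorted(k for k in keys if coordinator_state.get(k) != node_state.get(k))


zk = ToyCoordinator()
zk.state = {"shard1/leader": "node1"}    # state X (made-up key and values)
node_view = {"shard1/leader": "node1"}   # what the Solr nodes believe

snapshot = zk.backup()                   # a. ZK is backed up
zk.state["shard1/leader"] = "node2"      # b. transition X -> Y
node_view["shard1/leader"] = "node2"
zk.state = {}                            # c. ZK fails completely
zk.restore(snapshot)                     # d. ZK is restored from the backup
stale = diverged_keys(zk.state, node_view)  # e. nodes in Y, ZK in X
print(stale)  # ['shard1/leader']
```

The question in scenario B is essentially what Solr does with every entry that would show up in such a "stale" list.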

Thanks in advance,

Pavel
