Re: Solr Cloud: Zookeeper failure modes

Erick Erickson Wed, 02 Jan 2019 09:25:47 -0800

1> No. At one point this could be done, in the sense that the
collections would be reconstructed (legacyCloud), but that turned out
to have side effects. Even in that case, though, Solr couldn't
reconstruct the configsets. (Insert rant that you really must store
your configsets in a VCS somewhere, IMO.)
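
To make the rant concrete, here is a minimal sketch of pulling a configset out of ZK and committing it. It assumes Solr's `bin/solr zk downconfig` command and a git working tree; the ZK address, the paths and the function names are placeholders of mine, not anything specific to your setup.

```python
import subprocess


def downconfig_cmd(zk_host, configset, dest_dir, solr_bin="bin/solr"):
    """Build the argv for `bin/solr zk downconfig` (copies a configset from ZK to disk)."""
    return [solr_bin, "zk", "downconfig", "-z", zk_host, "-n", configset, "-d", dest_dir]


def export_configset(zk_host, configset, repo_dir, run=subprocess.run):
    """Download one configset into repo_dir/<configset> and commit it to git.

    `run` is injectable so the function can be exercised without Solr or git.
    """
    dest = f"{repo_dir}/{configset}"
    run(downconfig_cmd(zk_host, configset, dest), check=True)
    run(["git", "-C", repo_dir, "add", "-A"], check=True)
    # `git commit` exits non-zero when nothing changed, so that is not treated as an error.
    run(["git", "-C", repo_dir, "commit", "-m", f"Snapshot configset {configset}"], check=False)
```

Run from cron or a deploy hook, something like `export_configset("zk1:2181", "myconf", "/srv/solr-configs")` would keep a history of every configset change.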

2> Should be fine, as long as the state changes don't include things
like adding replicas or collections, or changing your configsets.
ZK has nothing to do with commits, for instance. Leader election is
recorded in ZK, but other leaders will be elected if necessary. Again,
though, if you've changed the topology (added replicas and/or
collections and/or shards if using implicit routing) between the time
you took the snapshot and the time ZK failed, you'll have an
incomplete restored state.

Now, all that said, ZooKeeper data is "just data". Apart from blobs
stored in ZK, you can manually reconstruct the whole thing with a
text editor and upload it. This would be tedious and error-prone, to be
sure, but do-able. Periodically storing away a copy of the Collections
API CLUSTERSTATUS output would help a lot.
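
A minimal sketch of that periodic copy, assuming Solr is reachable over HTTP at a base URL such as `http://host:8983/solr`. The function names, output directory and file naming are illustrative choices, not a Solr convention.

```python
import json
import time
import urllib.request
from pathlib import Path


def clusterstatus_url(solr_base):
    """Collections API CLUSTERSTATUS URL for a Solr base such as http://host:8983/solr."""
    return solr_base.rstrip("/") + "/admin/collections?action=CLUSTERSTATUS&wt=json"


def fetch_json(url):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def save_clusterstatus(solr_base, out_dir, fetch=fetch_json, now=None):
    """Write the current CLUSTERSTATUS response to a UTC-timestamped file.

    `fetch` and `now` are injectable so this can be tested without a cluster.
    """
    status = fetch(clusterstatus_url(solr_base))
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime(now))
    out = Path(out_dir) / f"clusterstatus-{stamp}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(status, indent=2, sort_keys=True))
    return out
```

Keeping these files next to the configsets in version control gives you the shard names, hash ranges and replica placement to work from if ZK is ever lost.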

Another approach would be to simply re-create your collections with
the exact same shard count. That'll create replicas with the same
ranges etc. Then shut your Solr instances down and copy the data
directory from the correct old replica to the correct new replica.
Once you're satisfied that things are running, you can delete the old
(unused) data. As an aside, in this case I'd create my new
collection(s) as leader-only (1 replica), then copy as necessary and
verify that things were as expected. Once that was done, I'd use
ADDREPLICA to build out the new collection(s). This presupposes you
can get your configsets back from VCS, as well as any binary data
you've stored in ZK (e.g. jar files for custom code and the like).
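
The re-create-and-copy approach above can be sketched roughly as follows. This assumes compositeId routing (so every shard has a hash range) and a CLUSTERSTATUS response saved before the failure. Which old replica counts as the "correct" one is a judgment call; picking the one recorded as leader is my assumption here. The helper names are hypothetical.

```python
from urllib.parse import urlencode


def create_url(solr_base, name, num_shards, configset):
    """Collections API CREATE for a leader-only collection (replicationFactor=1)."""
    params = {"action": "CREATE", "name": name, "numShards": num_shards,
              "replicationFactor": 1, "collection.configName": configset}
    return f"{solr_base}/admin/collections?{urlencode(params)}"


def addreplica_url(solr_base, collection, shard):
    """Collections API ADDREPLICA, used once the copied data has been verified."""
    params = {"action": "ADDREPLICA", "collection": collection, "shard": shard}
    return f"{solr_base}/admin/collections?{urlencode(params)}"


def pick_replica(shard):
    """Prefer the replica recorded as leader; otherwise take the first one listed."""
    replicas = list(shard["replicas"].values())
    leaders = [r for r in replicas if r.get("leader") == "true"]
    return (leaders or replicas)[0]


def copy_plan(old_status, new_status, collection):
    """Pair each old core with the new core that covers the same hash range."""
    old = old_status["cluster"]["collections"][collection]["shards"]
    new = new_status["cluster"]["collections"][collection]["shards"]
    new_by_range = {shard["range"]: shard for shard in new.values()}
    plan = []
    for name, shard in sorted(old.items()):
        target = new_by_range.get(shard["range"])
        if target is None:
            raise ValueError(f"no new shard covers range {shard['range']} ({name})")
        src, dst = pick_replica(shard), pick_replica(target)
        plan.append({"range": shard["range"],
                     "from": (src["node_name"], src["core"]),
                     "to": (dst["node_name"], dst["core"])})
    return plan
```

The plan only names nodes and cores. Where each core's data directory lives on disk is not part of CLUSTERSTATUS, so the actual copy (with Solr shut down) is still a manual step.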

So overall it's do-able even without ZK snapshots _assuming_ you can
find copies of your configsets and any custom code you've stored in
ZK. Not something I'd really _like_ to do, but in an emergency you
have options.

But backing up ZK snapshots in a safe place would be, by far, the
easiest and safest thing to do....
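
A minimal sketch of such a backup, assuming the standard ZooKeeper on-disk layout where `dataDir/version-2` holds `snapshot.<zxid>` and `log.<zxid>` files with the zxid in hex. It also assumes the transaction logs live in the same directory (no separate `dataLogDir`). The function names and the retention count are illustrative.

```python
import shutil
from pathlib import Path


def zxid(path):
    """Parse the hex zxid suffix of a snapshot.<zxid> or log.<zxid> file name."""
    return int(path.name.split(".", 1)[1], 16)


def backup_zk_data(version_dir, backup_dir, keep_snapshots=3):
    """Copy the newest snapshots plus all transaction logs to backup_dir.

    Returns the copied file names: snapshots first, then logs, each oldest to newest.
    """
    version_dir, backup_dir = Path(version_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    snapshots = sorted(version_dir.glob("snapshot.*"), key=zxid)[-keep_snapshots:]
    logs = sorted(version_dir.glob("log.*"), key=zxid)
    copied = []
    for src in snapshots + logs:
        shutil.copy2(src, backup_dir / src.name)  # copy2 preserves timestamps
        copied.append(src.name)
    return copied
```

Files copied from a running server may include a log that is still being written, so it is worth test-restoring a backup into a scratch ZK now and then rather than trusting the copy blindly.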

HTH,
Erick

On Wed, Jan 2, 2019 at 12:36 AM Pavel Micka <pavel.mi...@zoomint.com> wrote:
>
> Hi,
> We are currently implementing Solr cloud and as part of this effort we are 
> investigating which failure modes may happen between Solr and Zookeeper.
>
> We have found quite a lot of articles describing the "happy path" failure, when 
> ZK stops (loses majority) and the Solr Cluster ceases to serve write requests 
> (& read continues to work as expected). Once ZK cluster is reconciled and 
> majority achieved again, everything continues working as expected.
>
> What we have not been able to find is what happens when ZK cluster 
> catastrophically fails and loses its data. Either completely (scenario A) or 
> is restarted from backup (scenario B).
>
> So now the questions:
>
> 1)      Scenario A - Is existing Solr Cloud cluster able to start against a 
> clean Zookeeper and reconstruct all the ZK data from its internal state 
> (using some kind of emergency recovery; it may take long)?
>
> 2)      Scenario B - What is the worst case backup/restore scenario? For 
> example when
>
> a.       ZK is backed up
>
> b.       Cluster performs some transition between states "X -> Y" (such as 
> commit shard, elect new leader etc.)
>
> c.       ZK fails completely
>
> d.       ZK is restored from backup created in step a
>
> e.       Solr Cloud is in state "Y", while ZK is in state "X"
>
> Thanks in advance,
>
> Pavel
>
