Hello,

I am experiencing a failure mode where a replica is unable to recover and will 
try to do so forever.  In writing this email I want to make sure that I 
haven't missed something obvious, or overlooked a configurable option that 
could help.  If anything about this looks off, I would really like to hear 
from you.

Relevant details:
- Solr 5.3.1
- Java 1.8
- Ubuntu Linux 14.04 LTS
- the cluster is a single SolrCloud collection with 100 shards, backed by a 
3-node ZooKeeper ensemble
- there are 200 Solr servers in the cluster; each shard has a leader plus one 
other replica, one core per server
- each shard replica is larger than 50% of the available disk on its server 
(see the quick check after this list)
- ~40M docs are added per day; total indexing time is 8-10 hours spread over 
the day
- autoCommit is set to 15s
- softCommit is not defined
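
For context on the disk point above, here is the kind of quick check I mean 
(a minimal Python sketch; the index path is made up):

    # Rough check of whether a node could hold a second full copy of its
    # index next to the existing one.  The path below is made up; point it
    # at the core's data/index directory.
    import os
    import shutil

    INDEX_DIR = "/var/solr/data/mycollection_shard1_replica1/data/index"

    def dir_size_bytes(path):
        # sum the sizes of every regular file under path
        total = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
        return total

    index_bytes = dir_size_bytes(INDEX_DIR)
    free_bytes = shutil.disk_usage(INDEX_DIR).free

    print("index: %.1f GB, free: %.1f GB"
          % (index_bytes / 1e9, free_bytes / 1e9))
    if index_bytes > free_bytes:
        print("a second full copy of this index cannot fit on the disk")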
        
I think I have traced the failure to the following sequence of events, but 
would appreciate feedback:

1. new documents are being indexed
2. the leader of a shard, server A, fails for any reason (Java crashes, times 
out with ZooKeeper, etc.)
3. the other replica of the shard, server B, is elected leader (via 
ZooKeeper) and indexing resumes
4. server A comes back online (typically tens of seconds later) and reports 
to ZooKeeper
5. via the cluster state in ZooKeeper, server A learns that it is no longer 
the leader and must sync with server B
6. server A checks with server B but finds that server B's index version is 
different from its own (a sketch of this check is below the list)
7. server A begins replicating a full copy of the index from server B using 
the (legacy?) replication handler
8. the original index on server A is not deleted first, so server A runs out 
of disk space mid-replication
9. server A throws an error, deletes the partially replicated index, and then 
tries to replicate again

At this point I think steps 6 => 9 will loop forever.
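
To make step 6 concrete, here is roughly how the comparison can be reproduced 
by hand against the replication handler (a minimal Python sketch; host names 
and core names are made up):

    # Ask each replica of one shard for its index version, the same
    # information a recovering replica compares in step 6.  Host names and
    # core names are made up.
    import json
    from urllib.request import urlopen

    def index_version(base_url, core):
        url = ("%s/solr/%s/replication?command=indexversion&wt=json"
               % (base_url, core))
        with urlopen(url) as resp:
            body = json.loads(resp.read().decode("utf-8"))
        return body["indexversion"], body["generation"]

    a = index_version("http://server-a:8983", "mycollection_shard1_replica1")
    b = index_version("http://server-b:8983", "mycollection_shard1_replica2")
    print("server A (came back):", a)
    print("server B (new leader):", b)
    if a != b:
        print("versions differ; server A falls back to full replication")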

If the actual errors from solr.log would be useful, let me know; I'm leaving 
them out for now since this email is already pretty long.  In a nutshell, and 
in order, on server A I can find: the error that took it down, the 
post-recovery instruction from ZK to unregister itself as leader, the corrupt 
index error message, and then the (start, whoops out of disk, stop) loop of 
replication messages.

First, is what I described even possible, or did I get lost somewhere along 
the way while reading the docs?  Is there any reason to think that Solr 
should not behave this way?

If my version of events is plausible, I have a few other questions:

1. What happens to the docs that were indexed on server A but never 
replicated to server B before the failure?  Assuming the replica on server A 
eventually completes recovery, would those docs appear in the index, or are 
they gone for good?

2. I am guessing that the corrupt replica on server A is not deleted because 
it is still potentially viable: if server B had a catastrophic failure, you 
could pick up the pieces from server A.  If so, is that behavior configurable 
anywhere?  I'd rather take my chances on server B going down before 
replication finishes than be stuck in this state and have to intervene 
manually.  Besides, I have disaster recovery backups for exactly this 
situation.

3. Is there anything I can do to prevent this type of failure?  It seems to 
me that if server B accepts even one new document as leader, the shard will 
enter this state.  My only thought right now is to stop sending documents for 
indexing the instant a leader goes down (a rough sketch of that guard is 
below), but on the surface that sounds tough to implement perfectly (and it 
would have to be perfect).
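
For what it's worth, the guard I'm imagining looks roughly like the 
following, run before each batch of updates (a minimal Python sketch; the 
host and collection name are made up, and there is an obvious race between 
the check and the updates that follow it):

    # Before sending a batch of documents, check that every shard in the
    # collection reports an active leader.  Host and collection name are
    # made up; note the race between this check and the updates after it.
    import json
    from urllib.request import urlopen

    SOLR = "http://server-a:8983"
    COLLECTION = "mycollection"

    def every_shard_has_active_leader():
        url = ("%s/solr/admin/collections?action=CLUSTERSTATUS"
               "&collection=%s&wt=json" % (SOLR, COLLECTION))
        with urlopen(url) as resp:
            status = json.loads(resp.read().decode("utf-8"))
        shards = status["cluster"]["collections"][COLLECTION]["shards"]
        for shard_name, shard in shards.items():
            leaders = [r for r in shard["replicas"].values()
                       if r.get("leader") == "true"
                       and r.get("state") == "active"]
            if not leaders:
                print("shard %s has no active leader" % shard_name)
                return False
        return True

    if not every_shard_has_active_leader():
        print("holding indexing until the cluster settles")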

If you got this far, thanks for sticking with me.

Cheers,
Brian
