Whoops, the description of my setup below should say 2 replicas per shard.
Every server hosts one replica.


> On Oct 27, 2015, at 20:16, Brian Scholl <bsch...@legendary.com> wrote:
> 
> Hello,
> 
> I am experiencing a failure mode where a replica is unable to recover and
> will keep trying to do so forever.  In writing this email I want to make sure
> I haven't overlooked anything obvious or a configuration option that could
> help.  If anything about this looks off, I would really like to hear from
> you.
> 
> Relevant details:
> - solr 5.3.1
> - java 1.8
> - ubuntu linux 14.04 lts
> - the cluster is composed of 1 SolrCloud collection with 100 shards backed by 
> a 3 node zookeeper ensemble
> - there are 200 solr servers in the cluster, 1 replica per shard
> - each shard replica is larger than 50% of the available disk on its server
> - ~40M docs added per day, total indexing time is 8-10 hours spread over the 
> day
> - autoCommit is set to 15s
> - softCommit is not defined (see the config check right after this list)
>       
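> For what it's worth, the effective commit settings can be double-checked per
> node via the Config API; the collection name below is just a placeholder:
>
>   # dump the effective config and look at the autoCommit / autoSoftCommit values
>   curl -s "http://localhost:8983/solr/mycollection/config?wt=json&indent=true" | grep -A3 -i commit
>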
> I think I have traced the failure to the following set of events but would 
> appreciate feedback:
> 
> 1. new documents are being indexed
> 2. the leader of a shard, server A, fails for any reason (java crashes, times 
> out with zookeeper, etc)
> 3. zookeeper promotes the other replica of the shard, server B, to the leader 
> position and indexing resumes
> 4. server A comes back online (typically 10s of seconds later) and reports to 
> zookeeper
> 5. zookeeper tells server A that it is no longer the leader and to sync with 
> server B
> 6. server A checks with server B but finds that server B's index version is
> different from its own (the replication handler commands after this list show
> how the two versions can be compared)
> 7. server A begins replicating a new copy of the index from server B using 
> the (legacy?) replication handler
> 8. the original index on server A was not deleted so it runs out of disk 
> space mid-replication
> 9. server A throws an error, deletes the partially replicated index, and then 
> tries to replicate again
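>
> For reference, the index version comparison in step 6 can be seen directly
> via the replication handler on each core; the hostnames and core names here
> are made up:
>
>   # index version and generation as reported by each replica
>   curl -s "http://serverA:8983/solr/mycollection_shard1_replica1/replication?command=indexversion&wt=json"
>   curl -s "http://serverB:8983/solr/mycollection_shard1_replica2/replication?command=indexversion&wt=json"
>
>   # 'details' also reports index size on disk and whether a replication is in progress
>   curl -s "http://serverB:8983/solr/mycollection_shard1_replica2/replication?command=details&wt=json"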
> 
> At this point I think steps 6 through 9 will loop forever.
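>
> For anyone who wants to watch this from the outside, CLUSTERSTATUS reports
> each replica's state, and a replica stuck in this loop shows as "recovering".
> Host and collection names below are placeholders:
>
>   # replica states for every shard of the collection
>   curl -s "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycollection&wt=json"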
> 
> If the actual errors from solr.log would be useful let me know; I am leaving
> them out for now for brevity since this email is already pretty long.  In a
> nutshell, and in order, on server A I can find the error that took it down,
> the post-recovery instruction from ZK to unregister itself as leader, the
> corrupt index error message, and then the (start, whoops out of disk, stop)
> loop of the replication messages.
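>
> If anyone wants to look for the same pattern in their own logs, this is
> roughly what I mean (the log path is wherever your solr.log lives):
>
>   # the recovery attempts and the out-of-disk failures during the index fetch
>   grep -i "RecoveryStrategy" /var/solr/logs/solr.log | tail -n 50
>   grep -i "No space left on device" /var/solr/logs/solr.log | tail -n 20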
> 
> First, is what I described actually possible, or did I get lost somewhere
> along the way reading the docs?  Is there any reason to think that Solr
> should not behave this way?
> 
> If my version of events is feasible I have a few other questions:
> 
> 1. What happens to the docs that were indexed on server A but never
> replicated to server B before the failure?  Assuming the replica on server A
> eventually completes the recovery process, would those docs appear in the
> index, or are they gone for good?
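>
> One thing I can think to check is what is still sitting in server A's
> transaction log; the core path below is hypothetical:
>
>   # list the transaction log files for the affected core on server A
>   ls -lh /var/solr/data/mycollection_shard1_replica1/data/tlog/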
> 
> 2. I am guessing that the corrupt replica on server A is not deleted because
> it is still viable: if server B had a catastrophic failure you could pick up
> the pieces from server A.  If so, is this behavior configurable anywhere?
> I'd rather take my chances on server B going down before replication finishes
> than be stuck in this state and have to intervene manually.  Besides, I have
> disaster recovery backups for exactly this situation.
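>
> To be concrete, the manual intervention I have in mind is roughly the
> Collections API calls below; collection, shard, and replica names are
> placeholders, and the real replica name would have to be looked up first
> (e.g. via CLUSTERSTATUS):
>
>   # drop the broken replica on server A ...
>   curl -s "http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycollection&shard=shard1&replica=core_node1&wt=json"
>
>   # ... then add a fresh one, optionally pinning it back onto server A
>   curl -s "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&node=serverA:8983_solr&wt=json"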
> 
> 3. Is there anything I can do to prevent this type of failure?  It seems to
> me that if server B receives even one new document as leader, the shard will
> enter this state.  My only thought right now is to stop sending documents for
> indexing the instant a leader goes down, but on the surface that sounds tough
> to implement perfectly (and it would have to be perfect).
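>
> The sort of check I mean would be something like this before every indexing
> batch, holding off whenever anything in the collection is recovering or down;
> names are placeholders again:
>
>   # crude pre-flight check before sending the next batch of documents
>   curl -s "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycollection&wt=json" | grep -Ei '"recovering"|"down"' && echo "collection degraded, pausing indexing"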
> 
> If you got this far thanks for sticking with me.
> 
> Cheers,
> Brian
> 
