Whoops, the description of my setup below should say 2 replicas per shard; every server hosts one replica. Also, below the quoted message I've spelled out the commit settings I was referring to and sketched what I mean in question 3.
> On Oct 27, 2015, at 20:16, Brian Scholl <bsch...@legendary.com> wrote:
>
> Hello,
>
> I am experiencing a failure mode where a replica is unable to recover and it will try to do so forever. In writing this email I want to make sure that I haven't missed anything obvious or missed a configurable option that could help. If something about this looks funny, I would really like to hear from you.
>
> Relevant details:
> - solr 5.3.1
> - java 1.8
> - ubuntu linux 14.04 lts
> - the cluster is composed of 1 SolrCloud collection with 100 shards backed by a 3 node zookeeper ensemble
> - there are 200 solr servers in the cluster, 1 replica per shard
> - a shard replica is larger than 50% of the available disk
> - ~40M docs added per day, total indexing time is 8-10 hours spread over the day
> - autoCommit is set to 15s
> - softCommit is not defined
>
> I think I have traced the failure to the following set of events but would appreciate feedback:
>
> 1. new documents are being indexed
> 2. the leader of a shard, server A, fails for any reason (java crashes, times out with zookeeper, etc)
> 3. zookeeper promotes the other replica of the shard, server B, to the leader position and indexing resumes
> 4. server A comes back online (typically tens of seconds later) and reports to zookeeper
> 5. zookeeper tells server A that it is no longer the leader and to sync with server B
> 6. server A checks with server B but finds that server B's index version is different from its own
> 7. server A begins replicating a new copy of the index from server B using the (legacy?) replication handler
> 8. the original index on server A was not deleted, so it runs out of disk space mid-replication
> 9. server A throws an error, deletes the partially replicated index, and then tries to replicate again
>
> At this point I think steps 6 => 9 will loop forever.
>
> If the actual errors from solr.log are useful let me know; I'm not including them now for brevity since this email is already pretty long. In a nutshell and in order, on server A I can find the error that took it down, the post-recovery instruction from ZK to unregister itself as a leader, the corrupt index error message, and then the (start - whoops, out of disk - stop) loop of the replication messages.
>
> I first want to ask: is what I described possible, or did I get lost somewhere along the way reading the docs? Is there any reason to think that solr should not do this?
>
> If my version of events is feasible I have a few other questions:
>
> 1. What happens to the docs that were indexed on server A but never replicated to server B before the failure? Assuming the replica on server A were to complete the recovery process, would those docs appear in the index or are they gone for good?
>
> 2. I am guessing that the corrupt replica on server A is not deleted because it is still viable; if server B had a catastrophic failure you could pick up the pieces from server A. If so, is this a configurable option somewhere? I'd rather take my chances on server B going down before replication finishes than be stuck in this state and have to manually intervene. Besides, I have disaster recovery backups for exactly this situation.
>
> 3. Is there anything I can do to prevent this type of failure? It seems to me that if server B gets even 1 new document as a leader the shard will enter this state.
> My only thought right now is to try to stop sending documents for indexing the instant a leader goes down, but on the surface this solution sounds tough to implement perfectly (and it would have to be perfect).
>
> If you got this far, thanks for sticking with me.
>
> Cheers,
> Brian
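For reference, the commit settings mentioned above boil down to roughly this in solrconfig.xml (hard commits every 15 seconds, and no autoSoftCommit section at all):

    <autoCommit>
      <maxTime>15000</maxTime>  <!-- hard commit every 15s -->
    </autoCommit>
    <!-- no <autoSoftCommit> is configured -->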
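To make question 3 a little more concrete, here is the rough guard I had in mind on the client that feeds documents. It is only an untested sketch using SolrJ (the LeaderGuard class and allLeadersUp method are mine, not anything that ships with Solr): before sending a batch, read the cluster state from ZooKeeper and hold off if any shard lacks an active leader on a live node. Even then there is a window between the check and the update, which is why I said it would have to be perfect.

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.cloud.ClusterState;
    import org.apache.solr.common.cloud.DocCollection;
    import org.apache.solr.common.cloud.Replica;
    import org.apache.solr.common.cloud.Slice;

    // Hypothetical helper on the indexing client, not part of Solr.
    public class LeaderGuard {

        // True only if every shard of the collection has an ACTIVE leader
        // running on a node that ZooKeeper currently lists as live.
        // Assumes client.connect() has already been called.
        static boolean allLeadersUp(CloudSolrClient client, String collection) {
            ClusterState state = client.getZkStateReader().getClusterState();
            DocCollection coll = state.getCollection(collection);
            for (Slice shard : coll.getSlices()) {
                Replica leader = shard.getLeader();
                if (leader == null
                        || leader.getState() != Replica.State.ACTIVE
                        || !state.getLiveNodes().contains(leader.getNodeName())) {
                    return false;
                }
            }
            return true;
        }
    }

The feeder would then call allLeadersUp(client, collection) before each client.add(docs) and park the batch whenever it returns false, which is why I suspect the check-then-send gap can never be closed completely.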