How many ZooKeeper nodes are in your ensemble? ZooKeeper needs a strict majority of nodes up to keep a quorum, so you need five nodes to survive two failures.
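
For reference, here is the arithmetic behind that five-node figure as a minimal Python sketch. The function names are hypothetical (not part of any ZooKeeper or Solr API), and the sketch assumes ZooKeeper's default majority-quorum behavior.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble (assumes default majority quorums)."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of nodes that can be down while a quorum still exists."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Hypothetical ensemble sizes, chosen only for illustration.
    for n in (3, 4, 5):
        print(f"{n} nodes: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

A three-node ensemble tolerates one failure and a five-node ensemble tolerates two. A fourth node adds no tolerance over three, which is why odd ensemble sizes are the usual recommendation.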

Are your Solr instances started with a zkHost that lists all five ZooKeeper nodes?

What version of ZooKeeper?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 30, 2018, at 1:45 PM, Jack Schlederer
> <jack.schlede...@directsupply.com> wrote:
>
> Hi all,
>
> My team is attempting to spin up a SolrCloud cluster with an external
> ZooKeeper ensemble. We're trying to engineer our solution to be HA and
> fault-tolerant such that we can lose either 1 Solr instance or 1 ZooKeeper
> and not take downtime. We use chaos engineering to randomly kill instances
> to test our fault-tolerance. Killing Solr instances seems to be solved, as
> we use a high enough replication factor and Solr's built-in autoscaling to
> ensure that new Solr nodes added to the cluster get the replicas that were
> lost from the killed node. However, ZooKeeper seems to be a different
> story. We can kill 1 ZooKeeper instance and still maintain, and everything
> is good. It comes back and starts participating in leader elections, etc.
> Kill 2, however, and we lose the quorum and we have collections/replicas
> that appear as "gone" on the Solr Admin UI's cloud graph display, and we
> get Java errors in the log reporting that collections can't be read from
> ZK. This means we aren't servicing search requests. We found an open JIRA
> that reports this same issue, but its only affected version is 5.3.1. We
> are experiencing this problem in 7.3.1. Has there been any progress or
> potential workarounds on this issue since?
>
> Thanks,
> Jack
>
> Reference:
> https://issues.apache.org/jira/browse/SOLR-8868