Re: ZooKeeper issues with AWS

Walter Underwood Thu, 30 Aug 2018 14:13:13 -0700

How many ZooKeeper nodes are in your ensemble? ZooKeeper needs a strict
majority of nodes up to keep quorum, so you need five nodes to handle two
failures.
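
The majority rule can be checked with a few lines of arithmetic. This is
only a sketch of the quorum math, not anything taken from ZooKeeper itself;
the function names are made up for illustration.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many nodes can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nodes: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

A three-node ensemble survives one failure, a fourth node adds nothing, and
five nodes are the smallest ensemble that survives two.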


Are your Solr instances started with a zkHost that lists all five ZooKeeper
nodes?
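
For reference, a zkHost value is a comma-separated list of host:port pairs
with an optional chroot path appended once at the end. The sketch below just
builds such a string; the host names and the /solr chroot are placeholders,
and 2181 is assumed as the client port because it is ZooKeeper's default.

```python
def build_zk_host(hosts, port=2181, chroot=""):
    """Join ZooKeeper hosts into a zkHost connection string.

    hosts  -- iterable of host names (placeholders in this example)
    port   -- client port; 2181 is ZooKeeper's default
    chroot -- optional path such as "/solr", appended once at the end
    """
    servers = ",".join(f"{host}:{port}" for host in hosts)
    return servers + chroot


if __name__ == "__main__":
    # Hypothetical five-node ensemble.
    ensemble = ["zk1", "zk2", "zk3", "zk4", "zk5"]
    print(build_zk_host(ensemble, chroot="/solr"))
```

If a Solr node is started with only a subset of the ensemble in this string,
it has no way to reach the remaining ZooKeeper nodes when the listed ones go
down.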

What version of ZooKeeper are you running?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 30, 2018, at 1:45 PM, Jack Schlederer 
> <jack.schlede...@directsupply.com> wrote:
> 
> Hi all,
> 
> My team is attempting to spin up a SolrCloud cluster with an external
> ZooKeeper ensemble. We're trying to engineer our solution to be HA and
> fault-tolerant such that we can lose either 1 Solr instance or 1 ZooKeeper
> and not take downtime. We use chaos engineering to randomly kill instances
> to test our fault-tolerance. Killing Solr instances seems to be solved, as
> we use a high enough replication factor and Solr's built in autoscaling to
> ensure that new Solr nodes added to the cluster get the replicas that were
> lost from the killed node. However, ZooKeeper seems to be a different
> story. We can kill 1 ZooKeeper instance and still maintain quorum, and everything
> is good. It comes back and starts participating in leader elections, etc.
> Kill 2, however, and we lose the quorum and we have collections/replicas
> that appear as "gone" on the Solr Admin UI's cloud graph display, and we
> get Java errors in the log reporting that collections can't be read from
> ZK. This means we aren't servicing search requests. We found an open JIRA
> that reports this same issue, but its only affected version is 5.3.1. We
> are experiencing this problem in 7.3.1. Has there been any progress or
> potential workarounds on this issue since?
> 
> Thanks,
> Jack
> 
> Reference:
> https://issues.apache.org/jira/browse/SOLR-8868
