Re: ZooKeeper issues with AWS

2018-09-05 Thread Erick Erickson
Jack: Thanks for letting us know; that provides evidence that will help prioritize upgrading ZK.

Erick

On Wed, Sep 5, 2018 at 7:15 AM Jack Schlederer wrote:
> Ah, yes. We use ZK 3.4.13 for our ZK server nodes, but we never thought to
> upgrade the ZK JAR within Solr. We included that in our

Re: ZooKeeper issues with AWS

2018-09-05 Thread Jack Schlederer
Ah, yes. We use ZK 3.4.13 for our ZK server nodes, but we never thought to upgrade the ZK JAR within Solr. We included that in our Solr image, and it's working like a charm, re-resolving DNS names when new ZKs come up with different IPs. Thanks for the help, guys!

--Jack

On Sat, Sep 1, 2018 at
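[Editor's note: for anyone following along, a minimal sketch of that image change. Everything here is an assumption — the base tag, the jar path, the bundled ZK version, and wget availability all depend on your Solr release, so verify against your own image:]

    FROM solr:7.4
    USER root
    # Swap the bundled ZK client jar for 3.4.13, which re-resolves DNS
    # on connection failure (ZOOKEEPER-2184). Path/version are assumptions.
    RUN rm /opt/solr/server/solr-webapp/webapp/WEB-INF/lib/zookeeper-3.4.*.jar && \
        wget -q -O /opt/solr/server/solr-webapp/webapp/WEB-INF/lib/zookeeper-3.4.13.jar \
          https://repo1.maven.org/maven2/org/apache/zookeeper/zookeeper/3.4.13/zookeeper-3.4.13.jar
    USER solr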

Re: ZooKeeper issues with AWS

2018-09-01 Thread Shawn Heisey
On 9/1/2018 3:42 AM, Björn Häuser wrote:
> as far as I can see the required fix for this is finally in 3.4.13:
> - https://issues.apache.org/jira/browse/ZOOKEEPER-2184
> Would be great to have this in the next solr update.

Issue created.

Re: ZooKeeper issues with AWS

2018-09-01 Thread Björn Häuser
Hello,

> On 31. Aug 2018, at 21:53, Shawn Heisey wrote:
>
> As Walter hinted, ZooKeeper 3.4.x is not capable of dynamically
> adding/removing servers to/from the ensemble. To do this successfully, all
> ZK servers and all ZK clients must be upgraded to 3.5.x. Solr is a ZK client
> when
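[Editor's note: the 3.5.x feature Shawn is describing is dynamic reconfiguration, driven from zkCli.sh. A hedged sketch — hostnames are hypothetical, and 3.5.3+ additionally requires reconfigEnabled=true in zoo.cfg plus an authorized user:]

    # add a new ensemble member: id 4, quorum port 2888,
    # leader-election port 3888, client port 2181
    reconfig -add server.4=zk4.example.com:2888:3888;2181

    # remove the member with id 1
    reconfig -remove 1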

Re: ZooKeeper issues with AWS

2018-08-31 Thread Erick Erickson
Jack: Yeah, I understood that you were only killing one ZK at a time. I think Walter and Shawn are pointing you in the right direction.

On Fri, Aug 31, 2018 at 12:53 PM Shawn Heisey wrote:
> On 8/31/2018 12:14 PM, Jack Schlederer wrote:
> > Our working hypothesis is that Solr's JVM is caching

Re: ZooKeeper issues with AWS

2018-08-31 Thread Shawn Heisey
On 8/31/2018 12:14 PM, Jack Schlederer wrote:
> Our working hypothesis is that Solr's JVM is caching the IP addresses for
> the ZK hosts' DNS names when it starts up, and doesn't re-query DNS for
> some reason when it finds that that IP address is no longer reachable
> (i.e., when a ZooKeeper node
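[Editor's note: JVM-level DNS caching is tunable if you want to rule it in or out. A sketch for solr.in.sh — sun.net.inetaddr.ttl is a JDK-specific system property; the supported knob is the networkaddress.cache.ttl security property in java.security. Note, though, that the pre-3.4.13 ZK client resolved each ensemble hostname once at startup, so as the rest of the thread shows, JVM tuning alone would not have fixed this:]

    # cache successful DNS lookups for at most 30 seconds (JDK-specific)
    SOLR_OPTS="$SOLR_OPTS -Dsun.net.inetaddr.ttl=30"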

Re: ZooKeeper issues with AWS

2018-08-31 Thread Walter Underwood
I would not run Zookeeper in a container. That seems like a very bad idea. Each Zookeeper node has an identity. They are not interchangeable.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Aug 31, 2018, at 11:14 AM, Jack Schlederer wrote:
>
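[Editor's note: the identity Walter means is baked into static config on every node. An illustrative sketch with hypothetical hostnames and paths:]

    # zoo.cfg -- identical on all three nodes; server ids are fixed
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888

    # each node declares which server it is via a myid file, e.g. on zk1:
    echo 1 > /var/lib/zookeeper/myid

[A respawned node is only a drop-in replacement if it comes back with the same id and a hostname the others can resolve.]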

Re: ZooKeeper issues with AWS

2018-08-31 Thread Jack Schlederer
Thanks Erick. After some more testing, I'd like to correct the failure case we're seeing. It's not when 2 ZK nodes are killed that we have trouble recovering, but rather when all 3 ZK nodes that came up when the cluster was initially started get killed at some point. Even if it's one at a time,

Re: ZooKeeper issues with AWS

2018-08-31 Thread Erick Erickson
Jack:

Is it possible to reproduce "manually"? By that I mean without the chaos bit, by the following:

- Start 3 ZK nodes.
- Create a multi-node, multi-shard Solr collection.
- Sequentially stop and start the ZK nodes, waiting for the ZK quorum to recover between restarts.
- Solr does not reconnect
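[Editor's note: a hand-rolled version of that restart loop, for anyone who wants to script it. Hostnames, paths, and ports are hypothetical; assumes ssh access to each node and nc on the admin box:]

    # restart each ZK member in turn, waiting for it to serve again
    for zk in zk1.example.com zk2.example.com zk3.example.com; do
      ssh "$zk" '/opt/zookeeper/bin/zkServer.sh restart'
      # 'stat' only reports a Mode: line once the node has rejoined
      # a serving quorum
      until echo stat | nc "$zk" 2181 | grep -q '^Mode:'; do sleep 2; done
    done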

Re: ZooKeeper issues with AWS

2018-08-30 Thread Jack Schlederer
We run a 3-node ZK cluster, but I'm not concerned about 2 nodes failing at the same time. Our chaos process only kills approximately one node per hour, and our cloud service provider automatically spins up another ZK node when one goes down. All 3 ZK nodes are back up within 2 minutes, talking to

Re: ZooKeeper issues with AWS

2018-08-30 Thread Walter Underwood
How many Zookeeper nodes in your ensemble? You need five nodes to handle two failures. Are your Solr instances started with a zkHost that lists all five Zookeeper nodes? What version of Zookeeper?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On
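[Editor's note: the arithmetic behind "five nodes to handle two failures" — ZooKeeper keeps serving only while a strict majority of the configured ensemble is up:]

    ensemble size    majority needed    failures tolerated
    3                2                  1
    5                3                  2
    7                4                  3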

ZooKeeper issues with AWS

2018-08-30 Thread Jack Schlederer
Hi all,

My team is attempting to spin up a SolrCloud cluster with an external ZooKeeper ensemble. We're trying to engineer our solution to be HA and fault-tolerant such that we can lose either 1 Solr instance or 1 ZooKeeper and not take downtime. We use chaos engineering to randomly kill
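[Editor's note: for context, this is the shape of the zkHost wiring discussed above — every ensemble member listed in one connection string, optionally with a chroot. Hostnames and the /solr chroot are hypothetical:]

    bin/solr start -c -z zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/solr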