Thanks Erick. After some more testing, I'd like to correct the failure case
we're seeing. It's not when 2 ZK nodes are killed that we have trouble
recovering, but rather when all 3 ZK nodes that came up when the cluster
was initially started get killed at some point. Even if it's one at a time,
and we wait for a new one to spin up and join the cluster before killing
the next one, we get into a bad state when none of the 3 nodes that were in
the cluster initially are there anymore, even though they've been replaced
by our cloud provider spinning up new ZKs. We assign DNS names to the
ZooKeepers as they spin up, with a 10 second TTL, and those are what get
set as the ZK_HOST environment variable on the Solr hosts (i.e., ZK_HOST=
zk1.foo.com:2182,zk2.foo.com:2182,zk3.foo.com:2182). Our working hypothesis
is that Solr's JVM caches the IP addresses for the ZK hosts' DNS names
when it starts up, and for some reason doesn't re-query DNS when it finds
that an IP address is no longer reachable (i.e., when a ZooKeeper node
dies and its replacement spins up at a different IP). Our current plan is
to find a way to assign known static IPs to the ZK nodes upon startup and
set those IPs in the ZK_HOST env var, so we can take DNS lookups out of
the picture entirely.
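
As a data point on that hypothesis: the JVM's positive DNS cache is
controlled by the networkaddress.cache.ttl security property (with a
security manager installed it defaults to caching forever), so capping it
is one knob we may try before falling back to static IPs. A minimal sketch
of the idea, assuming nothing else in the stack overrides name resolution
(the class name is just illustrative):

    import java.security.Security;

    public class DnsCacheTtl {
        public static void main(String[] args) {
            // Show the current setting; null means the JVM default applies.
            System.out.println("networkaddress.cache.ttl = "
                    + Security.getProperty("networkaddress.cache.ttl"));

            // Cap positive-lookup caching at 30 seconds so re-resolution can
            // pick up a replacement ZK node's new IP. This has to run before
            // the JVM caches its first lookup (i.e., very early in startup).
            Security.setProperty("networkaddress.cache.ttl", "30");
        }
    }

The property can also be set in the JVM's java.security file. This only
covers the JVM-level cache, though; if the ZooKeeper client itself holds on
to the addresses it resolved at startup, this alone wouldn't fix it.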

We can reproduce this in our cloud environment, as each ZK node has its own
IP and DNS name, but it's difficult to reproduce locally because all the
ZooKeeper containers share the same IP (127.0.0.1).

Please let us know if you have insight into this issue.

Thanks,
Jack

On Fri, Aug 31, 2018 at 10:40 AM Erick Erickson <erickerick...@gmail.com>
wrote:

> Jack:
>
> Is it possible to reproduce "manually"? By that I mean without the
> chaos bit, by doing the following:
>
> - Start 3 ZK nodes
> - Create a multi-node, multi-shard Solr collection.
> - Sequentially stop and start the ZK nodes, waiting for the ZK quorum
> to recover between restarts.
> - Check whether Solr fails to reconnect to the restarted ZK nodes and
> thinks it's lost quorum after the second node is restarted.
>
> bq. Kill 2, however, and we lose the quorum and we have
> collections/replicas that appear as "gone" on the Solr Admin UI's
> cloud graph display.
>
> It's odd that replicas appear as "gone", and suggests that your ZK
> ensemble is possibly not correctly configured, although exactly how is
> a mystery. Solr pulls its picture of the topology of the network from
> ZK, establishes watches and the like. For most operations, Solr
> doesn't even ask ZooKeeper for anything, since its picture of the
> cluster is stored locally. ZK's job is to inform the various Solr nodes
> when the topology changes, i.e. _Solr_ nodes change state. For
> querying and indexing, there's no ZK involved at all. Even if _all_
> ZooKeeper nodes disappear, Solr should still be able to talk to other
> Solr nodes and shouldn't show them as down just because it can't talk
> to ZK. Indeed, querying should be OK although indexing will fail if
> quorum is lost.
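>
> To make that concrete: conceptually it's just a watch on the live_nodes
> znode. A rough sketch with the plain ZooKeeper client (not Solr's actual
> code; connect string and timeouts are made up):
>
>     import java.util.List;
>     import org.apache.zookeeper.WatchedEvent;
>     import org.apache.zookeeper.Watcher;
>     import org.apache.zookeeper.ZooKeeper;
>
>     public class LiveNodesWatch {
>         public static void main(String[] args) throws Exception {
>             ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181",
>                     15000, event -> { /* session-level events */ });
>             Watcher watcher = new Watcher() {
>                 public void process(WatchedEvent event) {
>                     // Fires when a Solr node's ephemeral entry appears or
>                     // disappears; a real client re-reads the children and
>                     // re-registers the watch here.
>                     System.out.println("live_nodes changed: " + event);
>                 }
>             };
>             List<String> liveNodes = zk.getChildren("/live_nodes", watcher);
>             System.out.println("live nodes: " + liveNodes);
>         }
>     }
>
> The point is that Solr reacts to those notifications rather than asking
> ZK on every request.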
>
> But you say you see the restarted ZK nodes rejoin the ZK ensemble, so
> the ZK config seems right. Is there any chance your chaos testing
> "somehow" restarts the ZK nodes with any changes to the configs?
> Shooting in the dark here.
>
> For a replica to be "gone", the host node should _also_ be removed
> from the "live_nodes" znode. Hmmmm. I do wonder if what you're
> observing is a consequence of both killing ZK nodes and Solr nodes.
> I'm not saying this is what _should_ happen, just trying to understand
> what you're reporting.
>
> My theory here is that your chaos testing kills some Solr nodes and
> that fact is correctly propagated to the remaining Solr nodes. Then
> your ZK nodes are killed and somehow Solr doesn't reconnect to ZK
> appropriately, so its picture of the cluster has the node as
> permanently down. Then you restart the Solr node and that information
> isn't propagated to the other Solr nodes since they didn't reconnect. If
> that were the case, then I'd expect the admin UI to correctly show the
> state of the cluster when hit on a Solr node that has never been
> restarted.
>
> As you can tell, I'm using something of a scattergun approach here b/c
> this isn't what _should_ happen given what you describe.
> Theoretically, all the ZK nodes should be able to go away and come
> back, and Solr should reconnect...
>
> As an aside, if you are ever in the code you'll see that for a replica
> to be usable, it must both have its state set to "active" _and_ have its
> corresponding node present as an ephemeral child of the live_nodes
> znode.
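>
> A rough SolrJ-style sketch of that check, just to illustrate (collection
> name and ZK addresses are placeholders, method names from memory):
>
>     import java.util.Arrays;
>     import java.util.Optional;
>     import java.util.Set;
>     import org.apache.solr.client.solrj.impl.CloudSolrClient;
>     import org.apache.solr.common.cloud.ClusterState;
>     import org.apache.solr.common.cloud.Replica;
>
>     public class UsableReplicas {
>         public static void main(String[] args) throws Exception {
>             try (CloudSolrClient client = new CloudSolrClient.Builder(
>                     Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"),
>                     Optional.empty()).build()) {
>                 client.connect();
>                 ClusterState state =
>                         client.getZkStateReader().getClusterState();
>                 Set<String> liveNodes = state.getLiveNodes();
>                 for (Replica r :
>                         state.getCollection("mycollection").getReplicas()) {
>                     // Usable = marked active AND its node is in live_nodes.
>                     boolean usable = r.getState() == Replica.State.ACTIVE
>                             && liveNodes.contains(r.getNodeName());
>                     System.out.println(r.getName() + " usable=" + usable);
>                 }
>             }
>         }
>     }
>
> That live_nodes half of the check is why a node vanishing from live_nodes
> makes its replicas effectively gone even if their recorded state is still
> "active".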
>
> Is there any chance you could try the manual steps above (AWS isn't
> necessary here) and let us know what happens? And if we can get a
> reproducible set of steps, feel free to open a JIRA.
> On Thu, Aug 30, 2018 at 10:11 PM Jack Schlederer
> <jack.schlede...@directsupply.com> wrote:
> >
> > We run a 3 node ZK cluster, but I'm not concerned about 2 nodes failing
> > at the same time. Our chaos process only kills approximately one node
> > per hour, and our cloud service provider automatically spins up another
> > ZK node when one goes down. All 3 ZK nodes are back up within 2 minutes,
> > talking to each other and syncing data. It's just that Solr doesn't seem
> > to recognize it. We'd have to restart Solr to get it to recognize the new
> > Zookeepers, which we can't do without taking downtime or losing data
> > that's stored on non-persistent disk within the container.
> >
> > The ZK_HOST environment variable lists all 3 ZK nodes.
> >
> > We're running ZooKeeper version 3.4.13.
> >
> > Thanks,
> > Jack
> >
> > On Thu, Aug 30, 2018 at 4:12 PM Walter Underwood <wun...@wunderwood.org>
> > wrote:
> >
> > > How many Zookeeper nodes in your ensemble? You need five nodes to
> > > handle two failures.
> > >
> > > Are your Solr instances started with a zkHost that lists all five
> > > Zookeeper nodes?
> > >
> > > What version of Zookeeper?
> > >
> > > wunder
> > > Walter Underwood
> > > wun...@wunderwood.org
> > > http://observer.wunderwood.org/  (my blog)
> > >
> > > > On Aug 30, 2018, at 1:45 PM, Jack Schlederer <jack.schlede...@directsupply.com> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > My team is attempting to spin up a SolrCloud cluster with an external
> > > > ZooKeeper ensemble. We're trying to engineer our solution to be HA and
> > > > fault-tolerant such that we can lose either 1 Solr instance or 1
> > > > ZooKeeper and not take downtime. We use chaos engineering to randomly
> > > > kill instances to test our fault-tolerance. Killing Solr instances
> > > > seems to be solved, as we use a high enough replication factor and
> > > > Solr's built-in autoscaling to ensure that new Solr nodes added to the
> > > > cluster get the replicas that were lost from the killed node. However,
> > > > ZooKeeper seems to be a different story. We can kill 1 ZooKeeper
> > > > instance and still maintain quorum, and everything is good. It comes
> > > > back and starts participating in leader elections, etc. Kill 2,
> > > > however, and we lose the quorum and we have collections/replicas that
> > > > appear as "gone" on the Solr Admin UI's cloud graph display, and we
> > > > get Java errors in the log reporting that collections can't be read
> > > > from ZK. This means we aren't servicing search requests. We found an
> > > > open JIRA that reports this same issue, but its only affected version
> > > > is 5.3.1. We are experiencing this problem in 7.3.1. Has there been
> > > > any progress or potential workarounds on this issue since?
> > > >
> > > > Thanks,
> > > > Jack
> > > >
> > > > Reference:
> > > > https://issues.apache.org/jira/browse/SOLR-8868
> > >
> > >
>
