Jack:

Is it possible to reproduce "manually"? By that I mean without the
chaos bit, by doing the following:

- Start 3 ZK nodes
- Create a multi-node, multi-shard Solr collection.
- Sequentially stop and start the ZK nodes, waiting for the ZK quorum
to recover between restarts.
- Solr does not reconnect to the restarted ZK nodes and thinks it's
lost quorum after the second node is restarted (the sketch below is
one way to check what a node believes).
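
If it helps while stepping through that, here's a rough SolrJ sketch
(the host/port are placeholders for one of your Solr nodes) that hits
CLUSTERSTATUS on a single node over plain HTTP, so it reports what
_that node_ currently believes rather than pulling a fresh view from
ZK:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;
    import org.apache.solr.client.solrj.response.CollectionAdminResponse;
    import org.apache.solr.common.util.NamedList;

    public class ClusterStatusCheck {
      public static void main(String[] args) throws Exception {
        // Ask one Solr node directly (plain HTTP, no ZK on the client
        // side) what it thinks the cluster looks like.
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
          CollectionAdminResponse rsp =
              CollectionAdminRequest.getClusterStatus().process(solr);
          NamedList<?> cluster = (NamedList<?>) rsp.getResponse().get("cluster");
          System.out.println("live_nodes according to this node: "
              + cluster.get("live_nodes"));
        }
      }
    }

Run it against the same node after each ZK restart; if the live_nodes
list it reports drifts away from reality, that's the reconnect problem
showing up.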

bq. Kill 2, however, and we lose the quorum and we have
collections/replicas that appear as "gone" on the Solr Admin UI's
cloud graph display.

It's odd that replicas appear as "gone"; that suggests your ZK
ensemble is possibly not correctly configured, although exactly how is
a mystery. Solr pulls its picture of the network topology from
ZK, establishes watches and the like. For most operations, Solr
doesn't even ask ZooKeeper for anything since its picture of the
cluster is stored locally. ZK's job is to inform the various Solr nodes
when the topology changes, i.e. when _Solr_ nodes change state. For
querying and indexing, there's no ZK involved at all. Even if _all_
ZooKeeper nodes disappear, Solr should still be able to talk to other
Solr nodes and shouldn't show them as down just because it can't talk
to ZK. Indeed, querying should be OK although indexing will fail if
quorum is lost.
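
If you want to convince yourselves of that during a chaos run, query
one node directly and bypass ZK entirely. A minimal sketch (the
collection name and host are placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DirectQuery {
      public static void main(String[] args) throws Exception {
        // Talk straight to one Solr node; the client never touches
        // ZooKeeper, and the node serves the query from its locally
        // cached cluster state.
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                 "http://localhost:8983/solr/mycollection").build()) {
          QueryResponse rsp = solr.query(new SolrQuery("*:*"));
          System.out.println("numFound: " + rsp.getResults().getNumFound());
        }
      }
    }

If that keeps working while the admin UI shows replicas as "gone", the
problem is on the cluster-state/ZK side rather than the nodes
themselves.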

But you say you see the restarted ZK nodes rejoin the ZK ensemble, so
the ZK config seems right. Is there any chance your chaos testing
"somehow" restarts the ZK nodes with any changes to the configs?
Shooting in the dark here.
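
One cheap way to rule that out is to pull the running config off each
ZK node with the "conf" four-letter command after a restart and diff
it against what you expect. A rough sketch (host names are
placeholders, and this assumes four-letter-word commands aren't
blocked by the 4lw whitelist):

    import java.io.InputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class ZkConfCheck {
      public static void main(String[] args) throws Exception {
        String[] hosts = {"zk1", "zk2", "zk3"}; // placeholders
        for (String host : hosts) {
          try (Socket s = new Socket(host, 2181)) {
            // "conf" makes the server dump its running configuration
            // and then close the connection.
            s.getOutputStream().write("conf\n".getBytes(StandardCharsets.US_ASCII));
            s.getOutputStream().flush();
            InputStream in = s.getInputStream();
            StringBuilder sb = new StringBuilder();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
              sb.append(new String(buf, 0, n, StandardCharsets.US_ASCII));
            }
            System.out.println("== " + host + " ==");
            System.out.println(sb);
          }
        }
      }
    }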

For a replica to be "gone", the host node should _also_ be removed
form the "live_nodes" znode, Hmmmm. I do wonder if what you're
observing is a consequence of both killing ZK nodes and Solr nodes.
I'm not saying this is what _should_ happen, just trying to understand
what you're reporting.

My theory here is that your chaos testing kills some Solr nodes and
that fact is correctly propagated to the remaining Solr nodes. Then
your ZK nodes are killed and somehow Solr doesn't reconnect to ZK
appropriately, so its picture of the cluster has the node as
permanently down. Then you restart the Solr node, and that information
isn't propagated to the other Solr nodes since they never reconnected
to ZK. If
that were the case, then I'd expect the admin UI to correctly show the
state of the cluster when hit on a Solr node that has never been
restarted.
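
If you want to test that theory, compare what's actually under ZK's
live_nodes with what each Solr node's admin UI (or CLUSTERSTATUS)
reports. Something like this with a plain ZooKeeper client; the
connect string and the /solr chroot are assumptions, adjust for your
setup:

    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class LiveNodesCheck {
      public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181/solr", 15000,
            event -> {
              if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
              }
            });
        connected.await();
        // The children of /live_nodes are the ephemeral znodes each
        // live Solr node registers; this is ZK's view, independent of
        // any particular Solr node.
        List<String> liveNodes = zk.getChildren("/live_nodes", false);
        System.out.println("live_nodes in ZK: " + liveNodes);
        zk.close();
      }
    }

If the restarted ZK ensemble is healthy, that list should match the
set of Solr nodes that are actually up, even when a stale Solr node's
admin UI says otherwise.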

As you can tell, I'm using something of a scattergun approach here b/c
this isn't what _should_ happen given what you describe.
Theoretically, all the ZK nodes should be able to go away and come
back, and Solr should reconnect...

As an aside, if you are ever in the code you'll see that for a replica
to be usable, it must both have its state set to "active" _and_ have
its corresponding node present in the live_nodes ephemeral znode.
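
Through SolrJ that check looks roughly like this (collection name and
ZK hosts are placeholders):

    import java.util.Collections;
    import java.util.Optional;
    import java.util.Set;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.cloud.ClusterState;
    import org.apache.solr.common.cloud.DocCollection;
    import org.apache.solr.common.cloud.Replica;

    public class ReplicaUsability {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                 Collections.singletonList("zk1:2181,zk2:2181,zk3:2181"),
                 Optional.empty()).build()) {
          client.connect();
          ClusterState cs = client.getZkStateReader().getClusterState();
          Set<String> liveNodes = cs.getLiveNodes();
          DocCollection coll = cs.getCollection("mycollection");
          for (Replica r : coll.getReplicas()) {
            // A replica only counts if its state is ACTIVE *and* its
            // host node shows up under /live_nodes.
            boolean usable = r.getState() == Replica.State.ACTIVE
                && liveNodes.contains(r.getNodeName());
            System.out.println(r.getName() + " usable=" + usable);
          }
        }
      }
    }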

Is there any chance you could try the manual steps above (AWS isn't
necessary here) and let us know what happens? And if we can get a
reproducible set of steps, feel free to open a JIRA.
On Thu, Aug 30, 2018 at 10:11 PM Jack Schlederer
<jack.schlede...@directsupply.com> wrote:
>
> We run a 3 node ZK cluster, but I'm not concerned about 2 nodes failing at
> the same time. Our chaos process only kills approximately one node per
> hour, and our cloud service provider automatically spins up another ZK node
> when one goes down. All 3 ZK nodes are back up within 2 minutes, talking to
> each other and syncing data. It's just that Solr doesn't seem to recognize
> it. We'd have to restart Solr to get it to recognize the new Zookeepers,
> which we can't do without taking downtime or losing data that's stored on
> non-persistent disk within the container.
>
> The ZK_HOST environment variable lists all 3 ZK nodes.
>
> We're running ZooKeeper version 3.4.13.
>
> Thanks,
> Jack
>
> On Thu, Aug 30, 2018 at 4:12 PM Walter Underwood <wun...@wunderwood.org>
> wrote:
>
> > How many Zookeeper nodes in your ensemble? You need five nodes to
> > handle two failures.
> >
> > Are your Solr instances started with a zkHost that lists all five
> > Zookeeper nodes?
> >
> > What version of Zookeeper?
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> > > On Aug 30, 2018, at 1:45 PM, Jack Schlederer <
> > jack.schlede...@directsupply.com> wrote:
> > >
> > > Hi all,
> > >
> > > My team is attempting to spin up a SolrCloud cluster with an external
> > > ZooKeeper ensemble. We're trying to engineer our solution to be HA and
> > > fault-tolerant such that we can lose either 1 Solr instance or 1
> > ZooKeeper
> > > and not take downtime. We use chaos engineering to randomly kill
> > instances
> > > to test our fault-tolerance. Killing Solr instances seems to be solved,
> > as
> > > we use a high enough replication factor and Solr's built in autoscaling
> > to
> > > ensure that new Solr nodes added to the cluster get the replicas that
> > were
> > > lost from the killed node. However, ZooKeeper seems to be a different
> > > story. We can kill 1 ZooKeeper instance and still maintain, and
> > everything
> > > is good. It comes back and starts participating in leader elections, etc.
> > > Kill 2, however, and we lose the quorum and we have collections/replicas
> > > that appear as "gone" on the Solr Admin UI's cloud graph display, and we
> > > get Java errors in the log reporting that collections can't be read from
> > > ZK. This means we aren't servicing search requests. We found an open JIRA
> > > that reports this same issue, but its only affected version is 5.3.1. We
> > > are experiencing this problem in 7.3.1. Has there been any progress or
> > > potential workarounds on this issue since?
> > >
> > > Thanks,
> > > Jack
> > >
> > > Reference:
> > > https://issues.apache.org/jira/browse/SOLR-8868
> >
> >
