Hi all,

It looks like, due to a security scan sending "bad" traffic to the leader
election port, we have clusters in which the leader election Listener thread
is dead (an unchecked exception was thrown and the thread died, as seen in
the log).
(This seems to be fixed in
https://issues.apache.org/jira/browse/ZOOKEEPER-2186)

In this state, when a healthy server comes up and tries to connect to the
quorum, it gets stuck in leader election. It establishes TCP connections to
the other servers, but any traffic it sends seems to get stuck in the
receiver's TCP Recv queue (seen with netstat) and is not read/processed by
ZooKeeper.
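
For anyone who wants to check for the same symptom, here is a rough sketch
of how the netstat check could be scripted. It assumes Linux-style
`netstat -tan` output and that the election port is 3888 (the port used in
the usual server.N=host:2888:3888 examples; adjust to your zoo.cfg). The
function name and the sample addresses are made up for illustration, not
taken from our clusters.

```python
def stuck_election_sockets(netstat_output, election_port=3888, min_recv_q=1):
    """Return (local, remote, recv_q) for election-port sockets with unread bytes.

    Assumes Linux `netstat -tan` columns:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    stuck = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip banner/header lines and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        try:
            recv_q = int(fields[1])
        except ValueError:
            continue
        local, remote = fields[3], fields[4]
        # The port is whatever follows the last ':' (works for IPv4 and IPv6).
        port = local.rsplit(":", 1)[-1]
        if port == str(election_port) and recv_q >= min_recv_q:
            stuck.append((local, remote, recv_q))
    return stuck


# Hypothetical sample output, for illustration only.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:3888            0.0.0.0:*               LISTEN
tcp      212      0 10.0.0.2:3888           10.0.0.3:51234          ESTABLISHED
tcp        0      0 10.0.0.2:2181           10.0.0.9:40000          ESTABLISHED
"""

if __name__ == "__main__":
    for local, remote, recv_q in stuck_election_sockets(SAMPLE):
        print(f"{remote} -> {local}: {recv_q} bytes unread")
```

On a live server the input would be the text captured from `netstat -tan`.
A Recv-Q on the election port that stays non-zero across repeated runs would
be consistent with the dead Listener thread described above, though it is
only a hint and not proof.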

Not a good place to be :)

This is with ZooKeeper 3.4.6.

Is there a way to get such clusters back to a healthy state without loss of
quorum / client impact?
Is there some way of restarting the Listener thread, or of restarting the
servers in a certain order?
E.g., if I restart a minority, say the ones with lower server IDs, is there
a way to get the majority servers to re-initiate leader election connections
with them so as to connect them to the quorum (and to do this without the
majority losing quorum)?

Thanks,
Guy
