Hi all,

It looks like, due to a security scan sending "bad" traffic to the leader election port, we have clusters in which the leader election Listener thread is dead (an unchecked exception was thrown and the thread died, as seen in the log). This appears to be fixed in https://issues.apache.org/jira/browse/ZOOKEEPER-2186.

In this state, when a healthy server comes up and tries to connect to the quorum, it gets stuck in leader election. It establishes TCP connections to the other servers, but any traffic it sends seems to get stuck in the receiver's TCP Recv queue (seen with netstat) and is not read or processed by ZooKeeper. Not a good place to be :)

This is with 3.4.6.

Is there a way to get such clusters back to a healthy state without loss of quorum / client impact? Some way of restarting the listener thread? Or restarting the servers in a certain order? E.g., if I restart a minority, say the ones with lower server IDs, is there a way to get the majority servers to re-initiate leader election connections with them so as to connect them to the quorum (and to do this without the majority losing quorum)?

Thanks,
Guy
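
For anyone wanting to check for the same symptom, here is a rough sketch of how the netstat observation above could be automated. It assumes Linux-style `netstat -tn` output and an election port of 3888 (the value used in typical `server.N=host:2888:3888` examples; adjust to your config). The function name and the sample data are made up for illustration. Note that a briefly non-zero Recv-Q is normal; only a value that stays non-zero across repeated runs suggests nothing is reading the socket.

```python
# Hypothetical helper: flag established connections on the election port
# whose receive queue holds unread bytes.
def stuck_election_connections(netstat_output, election_port=3888):
    """Return (remote_address, recv_q) pairs for ESTABLISHED connections
    on the local election port that have a non-zero Recv-Q."""
    stuck = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        recv_q, local, remote, state = fields[1], fields[3], fields[4], fields[5]
        if not recv_q.isdigit():
            continue
        local_port = local.rsplit(":", 1)[-1]
        if (local_port == str(election_port)
                and state == "ESTABLISHED"
                and int(recv_q) > 0):
            stuck.append((remote, int(recv_q)))
    return stuck


# Made-up sample in the shape of `netstat -tn` output (not real data).
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.1:3888           10.0.0.2:45678          ESTABLISHED
tcp      312      0 10.0.0.1:3888           10.0.0.3:51234          ESTABLISHED
tcp        0      0 10.0.0.1:2181           10.0.0.9:40000          ESTABLISHED
"""

if __name__ == "__main__":
    for remote, queued in stuck_election_connections(SAMPLE):
        print(f"{remote}: {queued} bytes waiting in Recv-Q")
```

On a real server, the text passed in would be the captured output of `netstat -tn`, sampled a few times some seconds apart.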
