[ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930774#action_12930774 ]
Flavio Junqueira commented on ZOOKEEPER-928:
--------------------------------------------

I've just seen the messages on zookeeper-dev, and I'm not sure this is right:

# readPacket is implemented in Learner.java, and the socket read is performed in this line: leaderIs.readRecord(pp, "packet");
# leaderIs is an InputArchive instance instantiated in Learner.connectToLeader;
# The socket used to instantiate leaderIs has its SO_TIMEOUT value set right before that, in connectToLeader: sock.setSoTimeout(self.tickTime * self.initLimit).

Consequently, the read should not block indefinitely and should return after self.tickTime * self.initLimit.

This discussion on SO_TIMEOUT sounds familiar, huh? ;-)

> Follower should stop following and start FLE if it does not receive pings from the leader
> -----------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-928
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-928
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.3.2
>            Reporter: Vishal K
>            Priority: Critical
>             Fix For: 3.3.3, 3.4.0
>
>
> In Follower.followLeader(), after syncing with the leader, the follower does:
>     while (self.isRunning()) {
>         readPacket(qp);
>         processPacket(qp);
>     }
> It looks like it relies on socket timeout expiry to figure out if the
> connection with the leader has gone down. So a follower *with no clients*
> may never notice a faulty leader if the Leader has a software hang but the TCP
> connections with the peers are still valid. Since it has no clients, it won't
> heartbeat with the Leader. If a majority of followers are not connected to any
> clients, then FLE will fail even if other followers attempt to elect a new
> leader.
> We should keep track of pings received from the leader, and if we haven't seen
> a ping packet from the leader for (syncLimit * tickTime), give up following the
> leader.

-- 
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
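
The SO_TIMEOUT argument in the comment above can be illustrated with a small, self-contained sketch. This is not ZooKeeper code: the class name SoTimeoutSketch, the helper readTimesOut, and the tickTime/initLimit values in main are hypothetical and chosen only for illustration. A local server socket that accepts the connection but never writes stands in for a hung leader whose TCP connection is still valid.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, not part of ZooKeeper.
class SoTimeoutSketch {

    /**
     * Opens a connection to a local peer that accepts but never sends data,
     * sets SO_TIMEOUT on the reading side, and performs a blocking read.
     * Returns true if the read ended with a SocketTimeoutException.
     */
    static boolean readTimesOut(int timeoutMs) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentLeader = new ServerSocket(0, 1, loopback);
             Socket sock = new Socket(loopback, silentLeader.getLocalPort());
             Socket accepted = silentLeader.accept()) {
            // Same call the comment points to; here with an arbitrary value.
            sock.setSoTimeout(timeoutMs);
            InputStream in = sock.getInputStream();
            try {
                in.read();      // blocks: the peer is connected but silent
                return false;   // data or end-of-stream arrived instead
            } catch (SocketTimeoutException e) {
                return true;    // the blocking read gave up after timeoutMs
            }
        }
    }

    public static void main(String[] args) throws IOException {
        int tickTime = 50;   // illustrative values, not ZooKeeper defaults
        int initLimit = 4;
        System.out.println("timed out: " + readTimesOut(tickTime * initLimit));
    }
}
```

In this sketch the timeout surfaces as an exception from the blocking read, so a loop like the one quoted in the issue would stop waiting on a silent peer once that exception propagates, provided SO_TIMEOUT is set on the socket being read.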