Hi Vishal,

There are periodic pings sent from the leader to the followers.
Take a look at Leader.java:

    syncedSet.add(self.getId());
    synchronized (learners) {
        for (LearnerHandler f : learners) {
            if (f.synced()) {
                syncedCount++;
                syncedSet.add(f.getSid());
            }
            f.ping();
        }
    }

This code sends periodic pings to the followers to make sure they are running fine. On the follower side, we should keep track of these pings and give up following the leader if we haven't seen a ping packet from it for a long time. This is definitely worth fixing, since we pride ourselves on being a highly available and reliable service.

Please feel free to open a jira and work on it. 3.4 would be a good target for this.

Thanks
mahadev

On 11/10/10 12:26 PM, "Vishal Kher" <vishalm...@gmail.com> wrote:

> Hi,
>
> In Follower.followLeader() after syncing with the leader, the follower does:
>
>     while (self.isRunning()) {
>         readPacket(qp);
>         processPacket(qp);
>     }
>
> It looks like it relies on socket timeout expiry to figure out if the
> connection with the leader has gone down. So a follower *with no clients*
> may never notice a faulty leader if the leader has a software hang but the
> TCP connections with the peers are still valid. Since it has no clients, it
> won't heartbeat with the leader. If a majority of followers are not connected
> to any clients, then even if other followers attempt to elect a new leader
> after detecting that the leader is unresponsive.
>
> Please correct me if I am wrong. If I am not mistaken, should we add code at
> the follower to monitor the heartbeat messages that it receives from the
> leader and take action if it misses heartbeats for time > (syncLimit *
> tickTime)? This certainly is a hypothetical case, however, I think it is
> worth a fix.
>
> Thanks.
> -Vishal
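
To make the proposal in this thread concrete, here is a minimal sketch of the follower-side check: remember when the last packet from the leader arrived, and report a timeout once more than syncLimit * tickTime has passed. The class LeaderHeartbeatMonitor, its method names, and the injected clock are all hypothetical and are not part of ZooKeeper; this only illustrates the idea and is not the actual fix.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not a ZooKeeper class): tracks when the follower
 * last heard from the leader and flags a timeout after syncLimit * tickTime.
 */
class LeaderHeartbeatMonitor {
    private final long timeoutMs;       // syncLimit * tickTime, in milliseconds
    private final LongSupplier clock;   // injected so the logic can be tested
    private volatile long lastHeardMs;  // time of the most recent leader packet

    LeaderHeartbeatMonitor(int tickTimeMs, int syncLimit, LongSupplier clock) {
        this.timeoutMs = (long) tickTimeMs * syncLimit;
        this.clock = clock;
        this.lastHeardMs = clock.getAsLong();
    }

    /** Call whenever any packet (ping or otherwise) arrives from the leader. */
    void onPacketFromLeader() {
        lastHeardMs = clock.getAsLong();
    }

    /** True once the leader has been silent for longer than the timeout. */
    boolean leaderTimedOut() {
        return clock.getAsLong() - lastHeardMs > timeoutMs;
    }
}
```

In the follower loop quoted above, onPacketFromLeader() would be called after each successful readPacket(qp). Because readPacket blocks, leaderTimedOut() would have to be checked either from a separate thread or after a socket read timeout fires; which of these fits the real code is left open here.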