Yes, that's what I was planning to do. At the follower, start FLE (fast leader election) if the follower does not receive a ping for longer than (syncLimit * tickTime).
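
A minimal sketch of that check, for illustration only. LeaderPingMonitor and its methods are hypothetical names, not existing ZooKeeper classes, and the sketch assumes tickTime is in milliseconds and that the caller passes in a monotonic clock reading. The follower would call recordPing() each time a ping packet arrives from the leader and would leave the follow loop to start FLE once leaderTimedOut() returns true.

```java
/**
 * Hypothetical helper (not part of ZooKeeper): remembers when the follower
 * last received a ping from the leader and reports when the silence has
 * lasted longer than syncLimit * tickTime.
 */
final class LeaderPingMonitor {
    private final long timeoutMs;      // syncLimit * tickTime, in milliseconds
    private volatile long lastPingMs;  // clock reading at the most recent ping

    LeaderPingMonitor(int tickTimeMs, int syncLimit, long nowMs) {
        if (tickTimeMs <= 0 || syncLimit <= 0) {
            throw new IllegalArgumentException("tickTime and syncLimit must be positive");
        }
        // Widen to long before multiplying to avoid int overflow.
        this.timeoutMs = (long) syncLimit * tickTimeMs;
        // Treat the moment syncing finished as the first "ping".
        this.lastPingMs = nowMs;
    }

    /** Call whenever a ping packet from the leader is processed. */
    void recordPing(long nowMs) {
        lastPingMs = nowMs;
    }

    /** True once more than syncLimit * tickTime has passed without a ping. */
    boolean leaderTimedOut(long nowMs) {
        return nowMs - lastPingMs > timeoutMs;
    }

    long timeoutMs() {
        return timeoutMs;
    }
}
```

Since readPacket() blocks, the check would presumably have to run either when a socket read times out or from a separate timer thread; that wiring is left out here. With tickTime = 2000 and syncLimit = 5, for example, the follower would give up on the leader after more than 10 seconds without a ping.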

On Wed, Nov 10, 2010 at 2:48 PM, Mahadev Konar <maha...@yahoo-inc.com> wrote:

> Hi Vishal,
> There are periodic pings sent from the leader to the followers.
>
> Take a look at Leader.java:
>
>     syncedSet.add(self.getId());
>     synchronized (learners) {
>         for (LearnerHandler f : learners) {
>             if (f.synced()) {
>                 syncedCount++;
>                 syncedSet.add(f.getSid());
>             }
>             f.ping();
>         }
>     }
>
> This code sends periodic pings to the followers to make sure they are
> running fine. We should keep track of these pings and give up following
> the leader if we haven't seen a ping packet from it for a long time.
> This is definitely worth fixing since we pride ourselves on being a
> highly available and reliable service.
>
> Please feel free to open a jira and work on it.
> 3.4 would be a good target for this.
>
> Thanks
> mahadev
>
> On 11/10/10 12:26 PM, "Vishal Kher" <vishalm...@gmail.com> wrote:
>
> > Hi,
> >
> > In Follower.followLeader(), after syncing with the leader, the
> > follower does:
> >
> >     while (self.isRunning()) {
> >         readPacket(qp);
> >         processPacket(qp);
> >     }
> >
> > It looks like it relies on socket timeout expiry to figure out if the
> > connection with the leader has gone down. So a follower *with no
> > clients* may never notice a faulty leader if the leader has a software
> > hang but the TCP connections with the peers are still valid. Since it
> > has no clients, it won't heartbeat with the leader. If a majority of
> > followers are not connected to any clients, then even if other
> > followers attempt to elect a new leader after detecting that the
> > leader is unresponsive, they will not succeed.
> >
> > Please correct me if I am wrong. If I am not mistaken, should we add
> > code at the follower to monitor the heartbeat messages that it
> > receives from the leader and take action if it misses heartbeats for
> > time > (syncLimit * tickTime)? This is certainly a hypothetical case;
> > however, I think it is worth a fix.
> >
> > Thanks.
> > -Vishal