Hi Vishal,
There are periodic pings sent from the leader to the followers.
Take a look at Leader.java:
    syncedSet.add(self.getId());
    synchronized (learners) {
        for (LearnerHandler f : learners) {
            if (f.synced()) {
                syncedCount++;
                syncedSet.add(f.getSid());
            }
            f.ping();
        }
    }
This code sends periodic pings to the followers to make sure they are
running fine. On the follower side, we should keep track of these pings and
give up following the leader if we haven't seen a ping packet from it for a
long time. This is definitely worth fixing, since we pride ourselves on
being a highly available and reliable service.
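Something along these lines might work on the follower side. This is only a
rough sketch, not existing ZooKeeper code: the LeaderPingMonitor class, its
method names, and the injected clock are all made up for illustration, and
the timeout of syncLimit * tickTime is taken from the suggestion in the
quoted mail below.

```java
import java.util.function.LongSupplier;

// Hypothetical helper (not part of ZooKeeper): remembers when the follower
// last heard a ping from the leader and reports when that is too long ago.
final class LeaderPingMonitor {
    private final long timeoutMs;        // assumed limit: tickTime * syncLimit
    private final LongSupplier clockMs;  // injected clock, so it can be tested
    private long lastPingMs;

    LeaderPingMonitor(int tickTimeMs, int syncLimit, LongSupplier clockMs) {
        this.timeoutMs = (long) tickTimeMs * syncLimit;
        this.clockMs = clockMs;
        this.lastPingMs = clockMs.getAsLong();
    }

    // Call this whenever a ping packet from the leader is processed.
    synchronized void onPing() {
        lastPingMs = clockMs.getAsLong();
    }

    // True once more than tickTime * syncLimit ms have passed without a ping.
    synchronized boolean leaderTimedOut() {
        return clockMs.getAsLong() - lastPingMs > timeoutMs;
    }
}
```

The follower could call onPing() from its packet processing and check
leaderTimedOut() periodically. Since readPacket() blocks, that check would
probably have to run on a separate thread or be combined with a read
timeout on the socket.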
Please feel free to open a JIRA and work on it.
3.4 would be a good target for this.
Thanks
mahadev
On 11/10/10 12:26 PM, "Vishal Kher" <[email protected]> wrote:
> Hi,
>
> In Follower.followLeader() after syncing with the leader, the follower does:
>     while (self.isRunning()) {
>         readPacket(qp);
>         processPacket(qp);
>     }
>
> It looks like it relies on socket timeout expiry to figure out if the
> connection with the leader has gone down. So a follower *with no clients*
> may never notice a faulty leader if the leader has a software hang but the
> TCP connections with the peers are still valid. Since it has no clients, it
> won't heartbeat with the leader. If a majority of followers are not
> connected to any clients, then even if other followers attempt to elect a
> new leader after detecting that the leader is unresponsive, they will not
> be able to form a quorum.
>
> Please correct me if I am wrong. If I am not mistaken, should we add code at
> the follower to monitor the heartbeat messages that it receives from the
> leader and take action if it misses heartbeats for time > (syncLimit *
> tickTime)? This certainly is a hypothetical case, however, I think it is
> worth a fix.
>
> Thanks.
> -Vishal
>