Hi Vishal,
 There are periodic pings sent from the leader to the followers.

Take a look at Leader.java:

    syncedSet.add(self.getId());
    synchronized (learners) {
        for (LearnerHandler f : learners) {
            if (f.synced()) {
                syncedCount++;
                syncedSet.add(f.getSid());
            }
            f.ping();
        }
    }


This code sends periodic pings to the followers to make sure they are
running fine. On the follower side, we should keep track of these pings and
give up following the leader if we haven't seen a ping packet from it for a
long time. This is definitely worth fixing since we pride ourselves on being
a highly available and reliable service.
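
A minimal sketch of such follower-side tracking, purely for illustration.
The class LeaderPingMonitor and its methods are hypothetical and are not part
of ZooKeeper. The limit of syncLimit * tickTime is taken from the suggestion
in the quoted message below, and the injectable clock is an assumption made
so the logic can be exercised without sleeping.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not a ZooKeeper class): remembers when the follower
 * last heard from the leader and reports when the silence exceeds a limit.
 */
final class LeaderPingMonitor {
    private final long timeoutMs;      // assumed limit: syncLimit * tickTime
    private final LongSupplier clock;  // supplies the current time in ms
    private volatile long lastHeardMs; // time of the last packet from the leader

    LeaderPingMonitor(int tickTimeMs, int syncLimit, LongSupplier clock) {
        this.timeoutMs = (long) tickTimeMs * syncLimit;
        this.clock = clock;
        this.lastHeardMs = clock.getAsLong();
    }

    /** Call whenever any packet, including a ping, arrives from the leader. */
    void recordPacketFromLeader() {
        lastHeardMs = clock.getAsLong();
    }

    /** True once nothing has arrived from the leader for longer than the limit. */
    boolean leaderTimedOut() {
        return clock.getAsLong() - lastHeardMs > timeoutMs;
    }
}
```

One way this might be wired in: the follower's read loop would call
recordPacketFromLeader() after each readPacket(qp), and a separate timer
thread would check leaderTimedOut(), since a blocking read cannot poll the
monitor itself. With System::currentTimeMillis as the clock, tickTime = 2000
and syncLimit = 5, the follower would give up after 10 seconds of silence.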

Please feel free to open a jira and work on it.
3.4 would be a good target for this.

Thanks
mahadev

On 11/10/10 12:26 PM, "Vishal Kher" <vishalm...@gmail.com> wrote:

> Hi,
> 
> In Follower.followLeader() after syncing with the leader, the follower does:
>                 while (self.isRunning()) {
>                     readPacket(qp);
>                     processPacket(qp);
>                 }
> 
> It looks like it relies on socket timeout expiry to figure out if the
> connection with the leader has gone down.  So a follower *with no clients*
> may never notice a faulty leader if the leader has a software hang but the
> TCP connections with the peers are still valid. Since it has no clients, it
> won't heartbeat with the leader. If a majority of followers are not connected
> to any clients, then even if the other followers attempt to elect a new
> leader after detecting that the leader is unresponsive, they will not be
> able to form a quorum.
> 
> Please correct me if I am wrong. If I am not mistaken, should we add code at
> the follower to monitor the heartbeat messages that it receives from the
> leader and take action if it misses heartbeats for time > (syncLimit *
> tickTime)? This certainly is a hypothetical case, however, I think it is
> worth a fix.
> 
> Thanks.
> -Vishal
> 
