Have you been able to make this happen? The behavior you are suggesting is exactly what should be happening. When we sync with the leader, we set the socket timeout:

    sock.setSoTimeout(self.tickTime * self.syncLimit);

If the leader hangs, we should get a timeout and disconnect from the leader.
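
To illustrate the timeout behavior described above, here is a minimal standalone sketch. It is not ZooKeeper code: the class name SoTimeoutSketch, the method readTimesOut, and the values 50 and 4 are made up for the example. A local server socket accepts the connection but never writes, standing in for a leader whose process hangs while its TCP connection stays open.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SoTimeoutSketch {
    /**
     * Connects to a local peer that accepts the connection but never writes,
     * and returns true if the blocking read gives up with a timeout.
     * tickTime and syncLimit are illustrative parameters, not ZooKeeper fields.
     */
    static boolean readTimesOut(int tickTime, int syncLimit) throws IOException {
        // The server socket stands in for a hung leader: connected but silent.
        try (ServerSocket silentLeader =
                     new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket sock = new Socket(silentLeader.getInetAddress(),
                     silentLeader.getLocalPort());
             Socket accepted = silentLeader.accept()) {
            // Same shape as the call quoted above: timeout in milliseconds.
            sock.setSoTimeout(tickTime * syncLimit);
            InputStream in = sock.getInputStream();
            try {
                in.read(); // blocks, because the peer never sends anything
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the point where a follower would drop the leader
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + readTimesOut(50, 4));
    }
}
```

With SO_TIMEOUT set, a blocking read on a connected but silent socket raises SocketTimeoutException after the configured interval, so the read loop does not need a separate watchdog as long as the leader is expected to send something within that interval.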

ben


On 11/10/2010 11:57 AM, Vishal Kher wrote:
Yes, that's what I was planning to do. At the follower, start FLE if the
follower does not receive a ping for > (syncLimit * tickTime).
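
A rough sketch of the check described above, assuming the follower records the arrival time of each ping. LeaderPingMonitor, onPing, and leaderSuspected are hypothetical names for illustration and do not exist in ZooKeeper. Timestamps are passed in explicitly so the logic can be exercised without real clocks.

```java
import java.util.concurrent.atomic.AtomicLong;

public class LeaderPingMonitor {
    private final long limitMillis;          // syncLimit * tickTime
    private final AtomicLong lastPingMillis; // arrival time of the latest ping

    LeaderPingMonitor(int tickTime, int syncLimit, long nowMillis) {
        this.limitMillis = (long) tickTime * syncLimit;
        this.lastPingMillis = new AtomicLong(nowMillis);
    }

    /** Called whenever a ping from the leader is processed. */
    void onPing(long nowMillis) {
        lastPingMillis.set(nowMillis);
    }

    /** True once the silence exceeds syncLimit * tickTime. */
    boolean leaderSuspected(long nowMillis) {
        return nowMillis - lastPingMillis.get() > limitMillis;
    }

    public static void main(String[] args) {
        // Assumed values: tickTime = 2000 ms, syncLimit = 5, so the limit is 10000 ms.
        LeaderPingMonitor monitor = new LeaderPingMonitor(2000, 5, 0L);
        monitor.onPing(3000L);
        System.out.println(monitor.leaderSuspected(12000L)); // 9000 ms of silence
        System.out.println(monitor.leaderSuspected(13001L)); // 10001 ms of silence
    }
}
```

When leaderSuspected returns true, the follower would leave the read loop and go back to leader election. How that hooks into Follower.followLeader() is left open here.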


On Wed, Nov 10, 2010 at 2:48 PM, Mahadev Konar <maha...@yahoo-inc.com> wrote:

Hi Vishal,
  There are periodic pings sent from the leader to the followers.

Take a look at Leader.java:

    syncedSet.add(self.getId());
    synchronized (learners) {
        for (LearnerHandler f : learners) {
            if (f.synced()) {
                syncedCount++;
                syncedSet.add(f.getSid());
            }
            f.ping();
        }
    }


This code sends periodic pings to the followers to make sure they are
running fine. We should keep track of these pings at the follower and give up
following the leader if we haven't seen a ping packet from it for a long
time. This is definitely worth fixing since we pride ourselves on being a
highly available and reliable service.

Please feel free to open a jira and work on it.
3.4 would be a good target for this.

Thanks
mahadev

On 11/10/10 12:26 PM, "Vishal Kher" <vishalm...@gmail.com> wrote:

Hi,

In Follower.followLeader() after syncing with the leader, the follower
does:
    while (self.isRunning()) {
        readPacket(qp);
        processPacket(qp);
    }

It looks like it relies on socket timeout expiry to figure out if the
connection with the leader has gone down. So a follower *with no clients*
may never notice a faulty leader if the leader has a software hang but the
TCP connections with the peers are still valid. Since it has no clients, it
won't heartbeat with the leader. If a majority of followers are not connected
to any clients, then even if other followers attempt to elect a new leader
after detecting that the leader is unresponsive, they will not be able to
form a quorum.
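
To make the arithmetic of this scenario concrete, here is a small sketch assuming a simple majority quorum. QuorumSketch and canElect are hypothetical names, and the ensemble size of 5 is chosen only for the example.

```java
public class QuorumSketch {
    /**
     * With majority quorums, an election can only complete if strictly more
     * than half of the ensemble takes part in it.
     */
    static boolean canElect(int ensembleSize, int serversSeekingElection) {
        return serversSeekingElection > ensembleSize / 2;
    }

    public static void main(String[] args) {
        // 5 servers: 1 hung leader, 3 idle followers that notice nothing,
        // and 1 follower that detected the hang and started an election.
        System.out.println(canElect(5, 1)); // false: 1 is not more than 2
        // If 3 followers notice the hang, a majority is available.
        System.out.println(canElect(5, 3)); // true: 3 is more than 2
    }
}
```

In the first case the lone follower keeps looking for a leader while the rest of the ensemble stays attached to the hung one.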

Please correct me if I am wrong. If I am not mistaken, should we add code at
the follower to monitor the heartbeat messages that it receives from the
leader and take action if it misses heartbeats for time > (syncLimit *
tickTime)? This is certainly a hypothetical case; however, I think it is
worth a fix.

Thanks.
-Vishal


