What happens to a follower if leader hangs?

2010-11-10 Thread Vishal Kher
Hi,

In Follower.followLeader() after syncing with the leader, the follower does:
    while (self.isRunning()) {
        readPacket(qp);
        processPacket(qp);
    }

It looks like it relies on socket timeout expiry to figure out if the
connection with the leader has gone down. So a follower *with no clients*
may never notice a faulty leader if the leader has a software hang but the
TCP connections with the peers are still valid. Since it has no clients, it
won't heartbeat with the leader. If a majority of followers are not connected
to any clients, then even if other followers attempt to elect a new leader
after detecting that the leader is unresponsive, they will not be able to
form a quorum.

Please correct me if I am wrong. If I am not mistaken, should we add code at
the follower to monitor the heartbeat messages that it receives from the
leader and take action if it misses heartbeats for longer than
(syncLimit * tickTime)? This certainly is a hypothetical case; however, I
think it is worth a fix.

Thanks.
-Vishal
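
A minimal sketch of the follower-side check proposed above. LeaderPingMonitor
is a hypothetical helper invented for illustration, not an existing ZooKeeper
class; the only values taken from the thread are tickTime and syncLimit. The
caller passes in the current time so the logic can be exercised without
waiting.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of ZooKeeper): tracks the last leader ping. */
final class LeaderPingMonitor {
    private final long timeoutNanos;
    private long lastPingNanos;

    LeaderPingMonitor(int tickTimeMs, int syncLimit, long nowNanos) {
        // Allowed silence is syncLimit * tickTime, as suggested in the thread.
        this.timeoutNanos = TimeUnit.MILLISECONDS.toNanos((long) tickTimeMs * syncLimit);
        this.lastPingNanos = nowNanos;
    }

    /** Call whenever a ping packet from the leader has been processed. */
    synchronized void recordPing(long nowNanos) {
        lastPingNanos = nowNanos;
    }

    /** True if no ping has been recorded for longer than syncLimit * tickTime. */
    synchronized boolean leaderTimedOut(long nowNanos) {
        return nowNanos - lastPingNanos > timeoutNanos;
    }

    public static void main(String[] args) {
        // Example values only: tickTime = 2000 ms, syncLimit = 5.
        LeaderPingMonitor monitor = new LeaderPingMonitor(2000, 5, System.nanoTime());
        monitor.recordPing(System.nanoTime());
        System.out.println("leader timed out: " + monitor.leaderTimedOut(System.nanoTime()));
    }
}
```

A follower loop could consult leaderTimedOut() from a separate thread or
between reads and, when it returns true, stop following and start a new
election. How that hooks into followLeader() is left open here.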


Re: What happens to a follower if leader hangs?

2010-11-10 Thread Mahadev Konar
Hi Vishal,
 There are periodic pings sent from the leader to the followers.

Take a look at Leader.java:

    syncedSet.add(self.getId());
    synchronized (learners) {
        for (LearnerHandler f : learners) {
            if (f.synced()) {
                syncedCount++;
                syncedSet.add(f.getSid());
            }
            f.ping();
        }
    }


This code sends periodic pings to the followers to make sure they are
running fine. The follower should keep track of these pings and give up
following the leader if it hasn't seen a ping packet from the leader for a
long time. This is definitely worth fixing since we pride ourselves on being
a highly available and reliable service.

Please feel free to open a jira and work on it.
3.4 would be a good target for this.

Thanks
mahadev


Re: What happens to a follower if leader hangs?

2010-11-10 Thread Vishal Kher
Yes, that's what I was planning to do. At the follower, start FLE (fast
leader election) if the follower does not receive a ping for longer than
(syncLimit * tickTime).


Re: What happens to a follower if leader hangs?

2010-11-10 Thread Patrick Hunt
I'd go 3.3.3 and 3.4.0. Does any of this (incl. the other issues
Vishal/others have been finding recently) point to some particular set
of testing we might add to find problems like this? What are we
missing?

Once 3.3.2 is out and immediate TLP issues are addressed I'm going to
start pushing for 3.4 regardless of whether everything is in yet or
not.

Patrick


Re: What happens to a follower if leader hangs?

2010-11-10 Thread Benjamin Reed
Have you been able to make this happen? The behavior you are suggesting
is exactly what should be happening. When we sync with the leader we set
the socket timeout:

    sock.setSoTimeout(self.tickTime * self.syncLimit);

If the leader hangs, we should get a timeout and disconnect from the leader.

ben
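
A small standalone sketch of the mechanism described above, using only
java.net. HungPeerTimeoutDemo and readTimesOut are hypothetical names, and
the "leader" is just a local ServerSocket that accepts a connection and then
never writes, standing in for a peer whose TCP connection is alive but whose
process is hung. It is meant to illustrate setSoTimeout(), not to reproduce
ZooKeeper's follower code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class HungPeerTimeoutDemo {
    /** Returns true if a blocking read on a silent but open connection times out. */
    static boolean readTimesOut(int tickTimeMs, int syncLimit) {
        try (ServerSocket silentLeader = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket sock = new Socket(silentLeader.getInetAddress(), silentLeader.getLocalPort());
             Socket accepted = silentLeader.accept()) {
            // Same shape as the call quoted in the thread.
            sock.setSoTimeout(tickTimeMs * syncLimit);
            try {
                // The peer never writes, so this blocks until the timeout fires.
                sock.getInputStream().read();
                return false;
            } catch (SocketTimeoutException e) {
                // The connection is still open; only the read gave up.
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Small example values so the demo finishes quickly.
        System.out.println("read timed out: " + readTimesOut(100, 2));
    }
}
```

The timeout applies per blocking read, so it only fires when nothing at all
arrives on the socket for syncLimit * tickTime.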