I'd go with 3.3.3 and 3.4.0. Does any of this (incl. the other issues
Vishal/others have been finding recently) point to some particular set
of tests we might add to find problems like this? What are we
missing?

Once 3.3.2 is out and immediate TLP issues are addressed I'm going to
start pushing for 3.4 regardless of whether "everything" is in yet or
not.

Patrick

On Wed, Nov 10, 2010 at 11:48 AM, Mahadev Konar <maha...@yahoo-inc.com> wrote:
> Hi Vishal,
>  There are periodic pings sent from the leader to the followers.
>
> Take a look at Leader.java:
>
> syncedSet.add(self.getId());
> synchronized (learners) {
>     for (LearnerHandler f : learners) {
>         if (f.synced()) {
>             syncedCount++;
>             syncedSet.add(f.getSid());
>         }
>         f.ping();
>     }
> }
>
>
> This code sends periodic pings to the followers to make sure they are
> running fine. We should keep track of these pings on the follower side and
> give up following the leader if we haven't seen a ping packet from it for a
> long time. This is definitely worth fixing since we pride ourselves on being
> a highly available and reliable service.
>
> Please feel free to open a jira and work on it.
> 3.4 would be a good target for this.
>
> Thanks
> mahadev
>
> On 11/10/10 12:26 PM, "Vishal Kher" <vishalm...@gmail.com> wrote:
>
>> Hi,
>>
>> In Follower.followLeader() after syncing with the leader, the follower does:
>>                 while (self.isRunning()) {
>>                     readPacket(qp);
>>                     processPacket(qp);
>>                 }
>>
>> It looks like it relies on socket timeout expiry to figure out if the
>> connection with the leader has gone down.  So a follower *with no clients*
>> may never notice a faulty leader if the leader has a software hang but the
>> TCP connections with the peers are still valid. Since it has no clients, it
>> won't heartbeat with the leader. If a majority of followers are not connected
>> to any clients, then even if other followers attempt to elect a new leader
>> after detecting that the leader is unresponsive, they will not be able to
>> form a quorum.
>>
>> Please correct me if I am wrong. If I am not mistaken, should we add code at
>> the follower to monitor the heartbeat messages that it receives from the
>> leader and take action if it misses heartbeats for time > (syncLimit *
>> tickTime)? This certainly is a hypothetical case, however, I think it is
>> worth a fix.
>>
>> Thanks.
>> -Vishal
>>
>
>
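
The follower-side check discussed in this thread could be sketched roughly as
below. This is only an illustration under stated assumptions, not ZooKeeper
code: the class LeaderHeartbeatMonitor, its methods onPing() and
leaderTimedOut(), and the injected clock are all hypothetical names. The only
part taken from the thread is the limit itself, syncLimit * tickTime.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of ZooKeeper): remembers when the last ping
 * from the leader was seen and reports whether the leader has been silent
 * for longer than syncLimit * tickTime.
 */
final class LeaderHeartbeatMonitor {
    private final long timeoutMs;      // syncLimit * tickTime, in milliseconds
    private final LongSupplier clock;  // injected so the check can be tested
    private volatile long lastPingMs;  // time the last leader ping was seen

    LeaderHeartbeatMonitor(int syncLimit, int tickTimeMs, LongSupplier clock) {
        this.timeoutMs = (long) syncLimit * tickTimeMs;
        this.clock = clock;
        this.lastPingMs = clock.getAsLong(); // start counting from creation
    }

    /** Call whenever a ping packet from the leader is processed. */
    void onPing() {
        lastPingMs = clock.getAsLong();
    }

    /** True once the leader has been silent for more than the timeout. */
    boolean leaderTimedOut() {
        return clock.getAsLong() - lastPingMs > timeoutMs;
    }
}
```

A follower would presumably construct this with System::currentTimeMillis as
the clock, call onPing() when a ping arrives, and stop following the leader
once leaderTimedOut() returns true. Since readPacket() blocks, the check would
have to run from another thread or after a bounded read; that wiring is left
out here.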
