AndyPang created MESOS-7847:
-------------------------------

             Summary: Master failover: the 'slaves.recovered' struct contains 
an invalid slaveID when a slave re-registers.
                 Key: MESOS-7847
                 URL: https://issues.apache.org/jira/browse/MESOS-7847
             Project: Mesos
          Issue Type: Bug
          Components: master
    Affects Versions: 1.0.0
         Environment: OS: Ubuntu 14.04, Mesos version 1.0.0
2 Mesos masters + 1 Mesos agent
            Reporter: AndyPang


We run two mesos-masters and one mesos-slave in order to test Mesos HA. The 
steps are as follows:
1. Start master1 (leader), master2 (follower) and the agent; the agent 
registers successfully with master1.
2. Shut down the agent and, after 75s (the master-agent ping-pong timeout is 
15s * 5), shut down master1. By that point master1 has removed the agent from 
the 'registry' and sent a 'ShutdownMessage' to it, but the agent process has 
already terminated, so it cannot receive this message.
3. master1 is terminated and master2 is now the leader; it recovers from the 
registry with 0 agents. Meanwhile the agent is restarted, and its 
re-registration with master2 succeeds.
So the issue is: when the master recovers from the registry with 0 agents, the 
'slaves.recovered' struct does not contain this slave, so the master should 
not accept the re-registration as successful; perhaps it should send a 
'ShutdownMessage' to the agent instead.
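
To make the expected behaviour concrete, here is a minimal sketch of the check 
I have in mind. It is only a simplified model, not the actual master code: 
MasterModel, Decision, recover() and reregister() are hypothetical names, and 
agent IDs are plain strings. The only thing taken from the scenario above is 
the idea that an agent which is neither registered nor in the recovered set 
should be told to shut down.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical, simplified model of the master's agent bookkeeping.
// The names only mirror the terms used in this report; they are not
// the real Mesos classes.
enum class Decision { ACCEPT, SHUTDOWN };

struct MasterModel {
  std::set<std::string> recovered;   // agent IDs loaded from the registry
  std::set<std::string> registered;  // agent IDs currently registered
  bool recoveryDone = false;

  // Called once after this master becomes the leader.
  void recover(const std::set<std::string>& registryAgents) {
    assert(!recoveryDone);  // recovery is expected to run only once
    recovered = registryAgents;
    recoveryDone = true;
  }

  // Proposed handling of a re-registration request.
  Decision reregister(const std::string& agentId) {
    // Already known to this master: accept again.
    if (registered.count(agentId) > 0) {
      return Decision::ACCEPT;
    }

    // Present in the registry at recovery time: move it to 'registered'.
    if (recovered.erase(agentId) > 0) {
      registered.insert(agentId);
      return Decision::ACCEPT;
    }

    // Unknown to the registry (e.g. removed by the previous leader):
    // reject; the caller would send a 'ShutdownMessage' to the agent.
    return Decision::SHUTDOWN;
  }
};
```

In the scenario above, master2 would call recover() with an empty set, so 
reregister() for the restarted agent would return Decision::SHUTDOWN instead 
of accepting it.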



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
