AndyPang created MESOS-7847:
-------------------------------

             Summary: Master failover, the 'slaves.recovered' struct contains invalid slaveID when slave reregistered.
                 Key: MESOS-7847
                 URL: https://issues.apache.org/jira/browse/MESOS-7847
             Project: Mesos
          Issue Type: Bug
          Components: master
    Affects Versions: 1.0.0
         Environment: os: ubuntu 14.04, mesos-1.0.0, mesos master(2) + mesos agent(1)
            Reporter: AndyPang

We run two mesos-masters and one mesos-slave in order to test Mesos HA. We follow these steps:

1. Start master1 (leader), master2 (follower) and the agent; the agent successfully registers with master1.
2. Shut down the agent and wait 75s (the master and agent ping-pong is 15s*5) before shutting down master1. As a result, master1 removes the agent from the 'registry' and sends a 'ShutdownMessage' to the agent, but the agent process has already terminated, so it cannot receive this message.
3. master1 is terminated and master2 is now the leader; it recovers from the registry with 0 agents. Meanwhile the agent is restarted, and its reregistration with master2 succeeds.

So the issue is: when the master recovers from the registry with 0 agents, the 'slaves.recovered' struct does not contain this slave, so the master should not accept the reregistration as successful; maybe it should send a 'ShutdownMessage' to the agent instead.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
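
The scenario in the report can be modeled with a small sketch. This is not Mesos code: the `Master` class, the `reregister` method and the `reject_unknown_agents` switch are hypothetical names made up for illustration, and the only values taken from the report are the 15s ping interval, the 5 missed pings and the 'slaves.recovered' set. It shows the observed behavior (an agent missing from the recovered set is still admitted) next to the behavior the reporter suggests (answer with a shutdown).

```python
from dataclasses import dataclass, field

# Values from the report: ping-pong is 15s * 5.
PING_INTERVAL_SECS = 15
MAX_MISSED_PINGS = 5
REMOVAL_TIMEOUT_SECS = PING_INTERVAL_SECS * MAX_MISSED_PINGS  # 75 seconds


@dataclass
class Master:
    """Toy model of a master after failover (hypothetical, not the Mesos API)."""

    # Agent IDs read back from the registry; stands in for 'slaves.recovered'.
    recovered: set = field(default_factory=set)
    # Agents currently admitted by this master.
    registered: set = field(default_factory=set)
    # Hypothetical switch: refuse agents that the registry does not know.
    reject_unknown_agents: bool = False

    def reregister(self, agent_id: str) -> str:
        if agent_id in self.recovered:
            # Normal failover path: the agent was in the registry.
            self.recovered.discard(agent_id)
            self.registered.add(agent_id)
            return "REREGISTERED"
        if self.reject_unknown_agents:
            # Behavior suggested in the report: send a shutdown instead.
            return "SHUTDOWN"
        # Behavior observed in the report: the unknown agent is admitted.
        self.registered.add(agent_id)
        return "REREGISTERED"


# master2 recovers from the registry with 0 agents, then agent "S1" returns.
observed = Master(recovered=set()).reregister("S1")
suggested = Master(recovered=set(), reject_unknown_agents=True).reregister("S1")
```

In this model `observed` is the reply for the reported behavior and `suggested` is the reply if the master refused agents absent from the recovered set. Whether a real master should do the latter is a design question for the issue, not something this sketch settles.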