Benjamin Mahler created MESOS-1668:
--------------------------------------

             Summary: Handle a temporary one-way master --> slave socket closure.
                 Key: MESOS-1668
                 URL: https://issues.apache.org/jira/browse/MESOS-1668
             Project: Mesos
          Issue Type: Bug
          Components: master, slave
            Reporter: Benjamin Mahler
            Priority: Minor


In MESOS-1529, we realized that it's possible for a slave to remain 
disconnected in the master if the following occurs:

→ Master and slave are connected and operating normally.
→ Temporary one-way network failure, master→slave link breaks.
→ Master marks slave as disconnected.
→ Network is restored and health checking continues normally, so the slave is 
not removed. The slave does not attempt to re-register since it is receiving 
pings once again.
→ Slave remains disconnected according to the master, and the slave does not 
try to re-register. Bad!
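
The sequence above can be reproduced with a toy model. This is only a sketch 
for illustration, not Mesos code: the `Master` and `Slave` classes, their 
method names, and the `max_missed` threshold are all hypothetical, and the 
model assumes the slave re-registers only after a sustained loss of pings.

```python
from dataclasses import dataclass


@dataclass
class Slave:
    """Hypothetical slave: re-registers only after many missed pings."""
    missed_pings: int = 0

    def on_ping(self) -> None:
        # Any ping from the master resets the slave's failure counter.
        self.missed_pings = 0

    def on_ping_timeout(self) -> None:
        self.missed_pings += 1

    def wants_to_reregister(self, max_missed: int = 5) -> bool:
        # Assumed threshold; a brief outage never reaches it.
        return self.missed_pings >= max_missed


@dataclass
class Master:
    """Hypothetical master tracking one slave's connection state."""
    slave_connected: bool = True

    def on_socket_closed(self) -> None:
        self.slave_connected = False

    def on_reregister(self) -> None:
        self.slave_connected = True


def simulate():
    master, slave = Master(), Slave()
    # Temporary one-way failure: the master sees the socket close and
    # the slave misses a single ping.
    master.on_socket_closed()
    slave.on_ping_timeout()
    # Network restored: pings arrive again, so the slave sees no problem.
    slave.on_ping()
    if slave.wants_to_reregister():
        master.on_reregister()
    return master.slave_connected, slave.wants_to_reregister()
```

In this model `simulate()` ends with the master still holding the slave as 
disconnected while the slave has no reason to re-register, which is the stuck 
state described above.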

We were originally thinking of using a failover timeout in the master to remove 
these slaves that don't re-register. However, this can be dangerous when 
ZooKeeper issues are preventing the slave from re-registering with the master; 
we do not want to remove a ton of slaves in this situation.

Rather, when the slave is health checking correctly but does not re-register 
within a timeout, we could send a registration request from the master to the 
slave, telling the slave that it must re-register. This message could also be 
used when receiving status updates (or other messages) from slaves that are 
disconnected in the master.
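
A minimal sketch of this proposal, under stated assumptions: the class and 
method names, the `reregister_timeout` parameter, and the "REREGISTER" message 
label are hypothetical and do not correspond to actual Mesos code or protobuf 
messages. Messages the master would send are simply collected in `outbox`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SlaveRecord:
    connected: bool = True
    disconnected_at: Optional[float] = None


@dataclass
class Master:
    reregister_timeout: float  # seconds; hypothetical setting
    slaves: Dict[str, SlaveRecord] = field(default_factory=dict)
    outbox: List[Tuple[str, str]] = field(default_factory=list)

    def on_socket_closed(self, slave_id: str, now: float) -> None:
        rec = self.slaves[slave_id]
        rec.connected = False
        rec.disconnected_at = now

    def on_health_check_ok(self, slave_id: str, now: float) -> None:
        # The slave answers pings but is still disconnected in the master.
        rec = self.slaves[slave_id]
        if rec.connected:
            return
        if now - rec.disconnected_at >= self.reregister_timeout:
            self._request_reregistration(slave_id)
            # Restart the clock so the request is not sent on every ping.
            rec.disconnected_at = now

    def on_status_update(self, slave_id: str) -> None:
        # Any message from a disconnected slave also triggers the request.
        if not self.slaves[slave_id].connected:
            self._request_reregistration(slave_id)

    def on_reregistered(self, slave_id: str) -> None:
        rec = self.slaves[slave_id]
        rec.connected = True
        rec.disconnected_at = None

    def _request_reregistration(self, slave_id: str) -> None:
        self.outbox.append((slave_id, "REREGISTER"))
```

Note that the master never removes the slave in this sketch; it only asks the 
slave to re-register, so a ZooKeeper problem cannot cause mass removal.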



--
This message was sent by Atlassian JIRA
(v6.2#6252)
