[ https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072537#comment-14072537 ]
Benjamin Mahler edited comment on MESOS-1529 at 7/24/14 2:55 AM:
-----------------------------------------------------------------

For now we will proceed by adding a ping timeout on the slave to ensure that the slave re-registers when the master is no longer pinging it. This will resolve the case that motivated this ticket:

https://reviews.apache.org/r/23874/
https://reviews.apache.org/r/23875/
https://reviews.apache.org/r/23866/
https://reviews.apache.org/r/23867/
https://reviews.apache.org/r/23868/

I decided to punt on the failover timeout in the master in the first pass because it can be dangerous when ZooKeeper issues are preventing the slave from re-registering with the master; we do not want to remove a ton of slaves in this situation.

Rather, when the slave is health checking correctly but does not re-register within a timeout, we could send a registration request from the master to the slave, telling the slave that it must re-register. This message could also be used when receiving status updates (or other messages) from slaves that are disconnected in the master.


was (Author: bmahler):
For now we will proceed by adding a ping timeout on the slave to ensure that the slave re-registers when the master is no longer pinging it. This will resolve the case that motivated this ticket:

https://reviews.apache.org/r/23866/
https://reviews.apache.org/r/23867/
https://reviews.apache.org/r/23868/

I decided to punt on the failover timeout in the master in the first pass because it can be dangerous when ZooKeeper issues are preventing the slave from re-registering with the master; we do not want to remove a ton of slaves in this situation.

Rather, when the slave is health checking correctly but does not re-register within a timeout, we could send a registration request from the master to the slave, telling the slave that it must re-register. This message could also be used when receiving status updates (or other messages) from slaves that are disconnected in the master.

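The slave-side ping timeout described in the comment can be pictured roughly as follows. This is only a minimal sketch in Python, not the Mesos implementation (which lives in the C++ patches linked in the reviews): the class name `PingWatchdog`, its methods, the injected clock, and the 10-second value are all hypothetical and chosen purely for illustration.

```python
import time
from typing import Callable


class PingWatchdog:
    """Hypothetical slave-side watchdog; an illustration, not Mesos code."""

    def __init__(self, timeout_secs: float, on_timeout: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_secs = timeout_secs
        self.on_timeout = on_timeout    # e.g. start re-registration
        self.clock = clock              # injectable so it can be tested
        self.last_ping = clock()

    def ping_received(self) -> None:
        """Record a ping from the master and restart the timeout window."""
        self.last_ping = self.clock()

    def check(self) -> bool:
        """Call periodically; fires on_timeout if the master went quiet."""
        if self.clock() - self.last_ping >= self.timeout_secs:
            self.on_timeout()
            # Restart the window so the callback fires at most once
            # per timeout period while the master stays silent.
            self.last_ping = self.clock()
            return True
        return False


if __name__ == "__main__":
    # Assumed timeout of 10 seconds; the comment does not give a value.
    watchdog = PingWatchdog(10.0, lambda: print("re-registering with master"))
    watchdog.ping_received()
    print("timed out:", watchdog.check())
```

The point of the sketch is that the slave detects the partition on its own, from the absence of pings, rather than waiting for the master to tell it anything. The master-initiated "you must re-register" message proposed above would be a separate, complementary path and is not shown here.
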
> Handle a network partition between Master and Slave
> ---------------------------------------------------
>
>                 Key: MESOS-1529
>                 URL: https://issues.apache.org/jira/browse/MESOS-1529
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Dominic Hamon
>            Assignee: Benjamin Mahler
>
> If a network partition occurs between a Master and Slave, the Master will
> remove the Slave (as it fails health check) and mark the tasks being run
> there as LOST. However, the Slave is not aware that it has been removed so
> the tasks will continue to run.
> (To clarify a little bit: neither the master nor the slave receives 'exited'
> event, indicating that the connection between the master and slave is not
> closed).
> There are at least two possible approaches to solving this issue:
> 1. Introduce a health check from Slave to Master so they have a consistent
> view of a network partition. We may still see this issue should a one-way
> connection error occur.
> 2. Be less aggressive about marking tasks and Slaves as lost. Wait until the
> Slave reappears and reconcile then. We'd still need to mark Slaves and tasks
> as potentially lost (zombie state) but maybe the Scheduler can make a more
> intelligent decision.



--
This message was sent by Atlassian JIRA
(v6.2#6252)