[ 
https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072537#comment-14072537
 ] 

Benjamin Mahler edited comment on MESOS-1529 at 7/24/14 2:55 AM:
-----------------------------------------------------------------

For now we will proceed by adding a ping timeout on the slave to ensure that 
the slave re-registers when the master is no longer pinging it. This will 
resolve the case that motivated this ticket:

https://reviews.apache.org/r/23874/
https://reviews.apache.org/r/23875/
https://reviews.apache.org/r/23866/
https://reviews.apache.org/r/23867/
https://reviews.apache.org/r/23868/
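
As a rough illustration of the slave-side ping timeout described above, here is a minimal Python sketch. It is not the Mesos implementation (that lives in the reviews linked above); the class name, method names, and the 30-second timeout are all hypothetical and chosen only for the example.

```python
class SlavePingWatchdog:
    """Hypothetical slave-side watchdog; an illustration, not Mesos code."""

    def __init__(self, ping_timeout_secs, now):
        self.ping_timeout_secs = ping_timeout_secs  # assumed tunable
        self.last_ping = now
        self.registered = True

    def on_ping(self, now):
        # Any ping from the master resets the timer.
        self.last_ping = now

    def on_reregistered(self, now):
        # The master acknowledged the re-registration.
        self.registered = True
        self.last_ping = now

    def check(self, now):
        """Return True exactly once when the slave should re-register."""
        silent_for = now - self.last_ping
        if self.registered and silent_for > self.ping_timeout_secs:
            # The master has stopped pinging us. It may have removed this
            # slave, so stop assuming we are registered and re-register.
            self.registered = False
            return True
        return False


watchdog = SlavePingWatchdog(ping_timeout_secs=30.0, now=0.0)
watchdog.on_ping(now=20.0)
first = watchdog.check(now=45.0)   # 25s since last ping: within the timeout
second = watchdog.check(now=51.0)  # 31s since last ping: timeout exceeded
```

The point of the sketch is that the slave no longer relies on an 'exited' event to learn about the partition; silence from the master is enough to trigger re-registration.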

I decided to defer the failover timeout in the master in the first pass
because it can be dangerous when ZooKeeper issues are preventing the slave from
re-registering with the master; we do not want to remove a large number of
slaves in this situation. Rather, when the slave is passing health checks but
does not re-register within a timeout, the master could send the slave a
message telling it that it must re-register. This message could also be sent
when the master receives status updates (or other messages) from slaves that
it considers disconnected.
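
The master-side idea in the paragraph above could look roughly like the following Python sketch. This is only an assumption about how the proposal might be structured, since it was not implemented in this pass; every name here (`MasterSlaveTracker`, `on_pong`, `on_message`, and the timeout value) is hypothetical.

```python
class MasterSlaveTracker:
    """Hypothetical master-side bookkeeping; an illustration, not Mesos code."""

    def __init__(self, reregister_timeout_secs):
        self.reregister_timeout_secs = reregister_timeout_secs
        # slave_id -> time at which the master marked the slave disconnected
        self.disconnected_since = {}

    def mark_disconnected(self, slave_id, now):
        # Keep the earliest disconnection time if already recorded.
        self.disconnected_since.setdefault(slave_id, now)

    def mark_reregistered(self, slave_id):
        self.disconnected_since.pop(slave_id, None)

    def on_pong(self, slave_id, now):
        """The slave answered a health check.

        Return True if the master should tell the slave to re-register:
        it is healthy but has stayed disconnected past the timeout.
        The slave is never removed here.
        """
        since = self.disconnected_since.get(slave_id)
        return since is not None and now - since > self.reregister_timeout_secs

    def on_message(self, slave_id):
        """A status update (or other message) arrived from the slave.

        Return True if the sender is disconnected in the master's view,
        in which case the master asks it to re-register.
        """
        return slave_id in self.disconnected_since


tracker = MasterSlaveTracker(reregister_timeout_secs=60.0)
tracker.mark_disconnected("slave-1", now=0.0)
ask_early = tracker.on_pong("slave-1", now=30.0)   # still within the timeout
ask_late = tracker.on_pong("slave-1", now=61.0)    # timeout exceeded
ask_on_update = tracker.on_message("slave-1")      # disconnected sender
```

Note that in this sketch the master only requests re-registration and never removes the slave, which is the property that makes the approach safer than a failover timeout during ZooKeeper trouble.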


> Handle a network partition between Master and Slave
> ---------------------------------------------------
>
>                 Key: MESOS-1529
>                 URL: https://issues.apache.org/jira/browse/MESOS-1529
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Dominic Hamon
>            Assignee: Benjamin Mahler
>
> If a network partition occurs between a Master and Slave, the Master will
> remove the Slave (as it fails the health check) and mark the tasks running
> there as LOST. However, the Slave is not aware that it has been removed, so
> the tasks will continue to run.
> (To clarify: neither the master nor the slave receives an 'exited' event,
> which indicates that the connection between the master and slave has not
> been closed.)
> There are at least two possible approaches to solving this issue:
> 1. Introduce a health check from Slave to Master so they have a consistent 
> view of a network partition. We may still see this issue should a one-way 
> connection error occur.
> 2. Be less aggressive about marking tasks and Slaves as lost. Wait until the 
> Slave reappears and reconcile then. We'd still need to mark Slaves and tasks 
> as potentially lost (zombie state) but maybe the Scheduler can make a more 
> intelligent decision.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
