The 'marking' of the task is not immediate: the Master actually waits a beat or two to see if the Agent reconnects, and there are various flags that control the behavior around this [0].
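To put a rough number on that "beat or two", here is a minimal Python sketch. It assumes the Master gives up on an Agent after `max_slave_ping_timeouts` consecutive missed pings, each waited on for `slave_ping_timeout`, so the grace period is approximately their product. The function name `ping_grace_seconds` is hypothetical, and the default values (15 seconds, 5 timeouts) are what I recall from the configuration docs; please check them against your Mesos version.

```python
def ping_grace_seconds(slave_ping_timeout_secs=15.0, max_slave_ping_timeouts=5):
    """Approximate time (seconds) before the Master treats an Agent as failed.

    Assumption: the Agent is considered failed after `max_slave_ping_timeouts`
    consecutive pings each go unanswered for `slave_ping_timeout_secs`.
    The defaults here are assumed, not read from a running Master.
    """
    if slave_ping_timeout_secs <= 0:
        raise ValueError("slave_ping_timeout_secs must be positive")
    if max_slave_ping_timeouts < 1:
        raise ValueError("max_slave_ping_timeouts must be at least 1")
    return slave_ping_timeout_secs * max_slave_ping_timeouts


if __name__ == "__main__":
    # Grace period with the assumed defaults.
    print(ping_grace_seconds())
    # A more tolerant setup for long-running tasks: 30s pings, 20 misses allowed.
    print(ping_grace_seconds(30, 20))
```

Raising either value widens the window in which a short network outage goes unnoticed by the Master, at the cost of detecting genuinely dead Agents more slowly.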
Naive question: I am assuming that you have already looked into a combination of:

  --max_slave_ping_timeouts=VALUE
  --slave_ping_timeout=VALUE
  --slave_removal_rate_limit=VALUE
  --slave_reregister_timeout=VALUE

that may help with your use case?

I'm not really an expert on these flags, so I'm not entirely sure whether a combination of them would work for your scenario.

[0] http://mesos.apache.org/documentation/latest/configuration/

*Marco Massenzio*
*Distributed Systems Engineer*
http://codetrips.com

On Fri, Oct 9, 2015 at 11:48 AM, Marcus Larsson <[email protected]> wrote:

> Hi,
>
> I'm part of a project investigating the use of Mesos for a distributed
> build and test system. For some of our tasks we would like to have more
> control over the slave recovery policy. Currently, when a slave fails its
> health check, it seems Mesos will always mark any task on the slave as
> lost, and shut down the slave when (or if) it reconnects. We would like the
> framework to have more information and control over this.
>
> I found an issue [1] in JIRA that mentions implementing something like
> this, but it seems only the part with the slave removal rate limiter was
> implemented. What I'm wondering is if there is any support in Mesos for
> letting the framework decide how to handle slave removal/recovery?
>
> For our case, we would like the framework to be notified when a slave
> fails its health check, so that the appropriate action for the task running
> on that slave can be taken. Some of our tasks will be very long running and
> we don't want to restart a few days' worth of work because the network was
> down for a while.
>
> Thanks,
> Marcus
>
> [1]: https://issues.apache.org/jira/browse/MESOS-2246

