Framework control over slave recovery

Marcus Larsson Fri, 09 Oct 2015 03:49:32 -0700

Hi,

I'm part of a project investigating the use of Mesos for a distributed build and test system. For some of our tasks we would like to have more control over the slave recovery policy. Currently, when a slave fails its health check, it seems Mesos will always mark any task on the slave as lost, and shut down the slave when (or if) it reconnects. We would like the framework to have more information and control over this.

I found an issue [1] in JIRA that mentions implementing something like this, but it seems only the part with the slave removal rate limiter was implemented. What I'm wondering is if there is any support in Mesos for letting the framework decide how to handle slave removal/recovery?

For our case, we would like the framework to be notified when a slave fails its health check, so that the appropriate action for the task running on that slave can be taken. Some of our tasks will be very long-running and we don't want to restart a few days' worth of work because the network was down for a while.


Thanks,
Marcus

[1]: https://issues.apache.org/jira/browse/MESOS-2246
