Hi,

I'm part of a project investigating the use of Mesos for a distributed build and test system. For some of our tasks we would like to have more control over the slave recovery policy. Currently, when a slave fails its health check, it seems Mesos will always mark any task on the slave as lost, and shut down the slave when (or if) it reconnects. We would like the framework to have more information and control over this.

I found an issue [1] in JIRA that mentions implementing something like this, but it seems only the slave removal rate limiter part was implemented. Is there any support in Mesos for letting the framework decide how to handle slave removal and recovery?

In our case, we would like the framework to be notified when a slave fails its health check, so that it can take the appropriate action for the tasks running on that slave. Some of our tasks will be very long-running, and we don't want to redo a few days' worth of work because the network was down for a while.

Thanks,
Marcus

[1]: https://issues.apache.org/jira/browse/MESOS-2246
