Hi,
I'm part of a project investigating the use of Mesos for a distributed
build and test system. For some of our tasks we would like to have more
control over the slave recovery policy. Currently, when a slave fails
its health check, it seems Mesos will always mark any task on the slave
as lost and shut down the slave when (or if) it reconnects. We would
like the framework to have more information and control over this.
I found an issue [1] in JIRA that mentions implementing something like
this, but it seems only the part with the slave removal rate limiter was
implemented. Is there any support in Mesos for letting the framework
decide how to handle slave removal and recovery?
For our case, we would like the framework to be notified when a slave
fails its health check, so that the appropriate action for the task
running on that slave can be taken. Some of our tasks will be very
long-running, and we don't want to restart a few days' worth of work
because the network was down for a while.
Thanks,
Marcus

[1]: https://issues.apache.org/jira/browse/MESOS-2246