The 'marking' of the task is not immediate: the Master actually waits a
while to see whether the Agent reconnects. Various flags control the
behavior around this [0].

Naive question: I am assuming that you already looked into a combination of:

--max_slave_ping_timeouts=VALUE
--slave_ping_timeout=VALUE
--slave_removal_rate_limit=VALUE
--slave_reregister_timeout=VALUE

that may help with your use case?
I'm not really an expert on these flags, so I'm not entirely sure whether a
combination of them would work for your scenario.
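
As a rough illustration of how the first two flags interact, here is a
small Python sketch. It assumes (my reading of the docs [0], not something
I have verified in the code) that the Master gives up on an Agent after
--max_slave_ping_timeouts consecutive missed pings, each of which waits
--slave_ping_timeout. The helper names are made up for this example, and
the values 15 and 5 are what I believe the defaults to be, so please
double-check them.

```python
from typing import List


def removal_window_secs(slave_ping_timeout_secs: int,
                        max_slave_ping_timeouts: int) -> int:
    """Approximate time a silent Agent is tolerated before removal.

    Assumption: window = ping timeout * number of allowed missed pings.
    """
    if slave_ping_timeout_secs <= 0 or max_slave_ping_timeouts < 1:
        raise ValueError("timeout must be positive and count at least 1")
    return slave_ping_timeout_secs * max_slave_ping_timeouts


def master_flags(slave_ping_timeout_secs: int,
                 max_slave_ping_timeouts: int) -> List[str]:
    """Render the two values as mesos-master command-line flags."""
    return [
        "--slave_ping_timeout={}secs".format(slave_ping_timeout_secs),
        "--max_slave_ping_timeouts={}".format(max_slave_ping_timeouts),
    ]


if __name__ == "__main__":
    # Believed defaults: 15secs per ping, 5 missed pings allowed.
    print(removal_window_secs(15, 5))
    # A more tolerant setup for long network blips (illustrative values).
    print(removal_window_secs(60, 10), master_flags(60, 10))
```

Under that assumption, stretching either value widens the window in which
a network outage does not cost you the task, at the price of the Master
taking longer to notice an Agent that is really gone.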

[0] http://mesos.apache.org/documentation/latest/configuration/




*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Fri, Oct 9, 2015 at 11:48 AM, Marcus Larsson <[email protected]>
wrote:

> Hi,
>
> I'm part of a project investigating the use of Mesos for a distributed
> build and test system. For some of our tasks we would like to have more
> control over the slave recovery policy. Currently, when a slave fails its
> health check, it seems Mesos will always mark any task on the slave as
> lost, and shut down the slave when (or if) it reconnects. We would like the
> framework to have more information and control over this.
>
> I found an issue [1] in JIRA that mentions implementing something like
> this, but it seems only the part with the slave removal rate limiter was
> implemented. What I'm wondering is if there is any support in Mesos for
> letting the framework decide how to handle slave removal/recovery?
>
> For our case, we would like the framework to be notified when a slave
> fails its health check, so that the appropriate action for the task running
> on that slave can be taken. Some of our tasks will be very long running and
> we don't want to restart a few days' worth of work because the network was
> down for a while.
>
> Thanks,
> Marcus
>
> [1]: https://issues.apache.org/jira/browse/MESOS-2246
>
