Re: Framework control over slave recovery

Marcus Larsson Fri, 09 Oct 2015 07:30:18 -0700

Hi,

On 2015-10-09 15:26, Marco Massenzio wrote:

The 'marking' of the task is not immediate: Master actually waits a beat or two to see if the Agent reconnects, there are various flags that control behavior around this [0].
Naive question: I am assuming that you already looked into a combination of:
--max_slave_ping_timeouts=VALUE
--slave_ping_timeout=VALUE
--slave_removal_rate_limit=VALUE
--slave_reregister_timeout=VALUE

that may help with your use case?
I'm not really an expert on these flags, so I'm not entirely sure whether a combination thereof may work with your scenario.

Yeah I've seen and tried using these flags. While they can be used to prevent Mesos from killing the agents too quickly, the framework will not be notified about the slave failing the health checks unless it times out completely and the task is lost. Also, ideally we would want per-task timeouts, whereas these settings are global.
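
To make the "global" point concrete, here is a rough sketch of how I understand the first two flags to combine: the master gives up on a slave after max_slave_ping_timeouts consecutive missed pings, each waited on for slave_ping_timeout. The function name is made up for illustration, and the values (15 seconds, 5 pings) are only examples; I believe they match the defaults, but please check the configuration page [0].

```python
from datetime import timedelta


def agent_failure_window(slave_ping_timeout: timedelta,
                         max_slave_ping_timeouts: int) -> timedelta:
    """Approximate time before the master treats a slave as failed.

    Assumption: the slave is considered failed after
    `max_slave_ping_timeouts` consecutive pings each go unanswered
    for `slave_ping_timeout`. This is a hypothetical helper, not a
    Mesos API.
    """
    if max_slave_ping_timeouts < 1:
        raise ValueError("max_slave_ping_timeouts must be at least 1")
    return slave_ping_timeout * max_slave_ping_timeouts


# Example values only; see the configuration page for the real defaults.
window = agent_failure_window(timedelta(seconds=15), 5)
print("Slave treated as failed after roughly", window)
```

Whatever values are chosen, the resulting window applies to every slave and every task in the cluster, which is why it does not help with per-task timeouts.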


Thanks,
Marcus


[0] http://mesos.apache.org/documentation/latest/configuration/




/Marco Massenzio/
/Distributed Systems Engineer/
http://codetrips.com/

On Fri, Oct 9, 2015 at 11:48 AM, Marcus Larsson wrote:


    Hi,

    I'm part of a project investigating the use of Mesos for a
    distributed build and test system. For some of our tasks we would
    like to have more control over the slave recovery policy.
    Currently, when a slave fails its health check, it seems Mesos
    will always mark any task on the slave as lost, and shut down the
    slave when (or if) it reconnects. We would like the framework to
    have more information and control over this.

    I found an issue [1] in JIRA that mentions implementing something
    like this, but it seems only the part with the slave removal rate
    limiter was implemented. What I'm wondering is if there is any
    support in Mesos for letting the framework decide how to handle
    slave removal/recovery?

    For our case, we would like the framework to be notified when a
    slave fails its health check, so that the appropriate action for
    the task running on that slave can be taken. Some of our tasks
    will be very long-running and we don't want to restart a few days'
    worth of work because the network was down for a while.

    Thanks,
    Marcus

    [1]: https://issues.apache.org/jira/browse/MESOS-2246
