Re: Understanding Slave Recovery Timeouts

Adam Bordelon Sat, 20 Jun 2015 22:56:51 -0700

FYI, the 75s (hard coded) timeout is being made configurable in MESOS-2110
<https://issues.apache.org/jira/browse/MESOS-2110>, hopefully landing in
0.23.


On Fri, Jun 19, 2015 at 4:18 PM, Roger Ignazio <[email protected]> wrote:

> On Fri, Jun 19, 2015 at 3:46 PM, Vinod Kone <[email protected]> wrote:
>
>>
>>> *If* the 75 seconds is exceeded but we're within the recovery_timeout,
>>> the slave *should* register with a new slave ID. The slave daemon (with
>>> the new slave ID) reconnects to the old executors and updates them to use
>>> the new slave ID.
>>>
>>
>> This is not true. 'recovery_timeout' was added to make sure that if a
>> slave is down for a long time (>10 mins), the executors commit suicide. It
>> is better for the executor/task to die than keep running because the
>> framework might have already launched another replica of that instance.
>> This was not tied to the 75s timeout (hard coded) because it is possible
>> for a slave to successfully re-register with a master after 75s (e.g., both
>> master and slave are down for 5 min).
>>
>> Also, a slave cannot connect to old executors with a new slave id.
>>
>
> Perfect, thanks for the quick response Vinod!
>
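
For readers of the archive, the two independent timeouts Vinod describes above can be sketched as a small model. This is only an illustration and not Mesos code: the function name, the `Outcome` type, and the idea of subtracting the master's own downtime are assumptions made here to mirror the "both master and slave are down for 5 min" example. The 75s figure comes from this thread; the 600s recovery_timeout used in the usage lines is just the ">10 mins" figure from the thread, not a confirmed default.

```python
from dataclasses import dataclass

# The hard-coded master-side timeout discussed in this thread.
MASTER_SLAVE_TIMEOUT_S = 75.0


@dataclass
class Outcome:
    executors_survive: bool      # executors did not hit recovery_timeout
    slave_can_reregister: bool   # master has not timed the slave out


def recovery_outcome(slave_down_s: float,
                     master_down_s: float,
                     recovery_timeout_s: float) -> Outcome:
    """Hypothetical model of the two timeouts (not the Mesos implementation).

    slave_down_s:       how long the slave process was down
    master_down_s:      how much of that time the master was down as well
                        (assumed to overlap fully with the slave outage)
    recovery_timeout_s: value of the slave's recovery_timeout
    """
    # Only time during which the master was up counts against the 75s.
    observed_absence_s = max(0.0, slave_down_s - min(master_down_s, slave_down_s))
    return Outcome(
        executors_survive=slave_down_s <= recovery_timeout_s,
        slave_can_reregister=observed_absence_s <= MASTER_SLAVE_TIMEOUT_S,
    )


# Both master and slave down for 5 min: re-registration can still succeed.
both_down = recovery_outcome(300, 300, 600)
# Slave down for 2 min while the master stays up: the 75s is exceeded.
slave_only = recovery_outcome(120, 0, 600)
# Slave down longer than recovery_timeout: executors terminate themselves.
long_outage = recovery_outcome(700, 700, 600)
```

The point of keeping the two checks separate is the one Vinod makes: exceeding the 75s does not depend on recovery_timeout, and neither check ever lets a slave with a new ID reconnect to old executors.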
