> > I'd like to prevent having two running at the same time.
I've experienced that as well, but it's hard to prevent in the real world. Sometimes you can't even tell a network partition apart from other failures, e.g. the slave OS hanging due to a kernel bug, in which case the task really is lost and it's correct for the framework to relaunch it. The best you can do is treat TASK_LOST as ambiguous and shrink the window; see the sketch at the bottom of this message.

On Wed, Jan 20, 2016 at 12:09 AM, Mauricio Garavaglia <[email protected]> wrote:

> Thanks. I'm trying to prevent the case where TASK_LOST is issued to the
> framework while the task is still running on the slave. This happened
> during a network partition where the slave got deregistered. Until the
> slave came back and killed the tasks, they were marked as LOST and
> rescheduled on a different slave. I'd like to prevent having two running
> at the same time.
>
> On Tue, Jan 19, 2016 at 12:33 PM, Vinod Kone <[email protected]> wrote:
>
>> Killing is done by the agent/slave, so a network partition doesn't
>> affect the killing. When the agent eventually reconnects with the master
>> or times out, TASK_LOST is sent to the framework.
>>
>> @vinodkone
>>
>> > On Jan 19, 2016, at 6:46 AM, Mauricio Garavaglia <[email protected]> wrote:
>> >
>> > Hi,
>> > According to the docs, the --recover=cleanup option will "Kill any old
>> > live executors and exit". In the case of a network partition that
>> > prevents the slave from reaching the master, when does the killing of
>> > the executors happen?
>> >
>> > Thanks
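
FWIW, here is a rough, untested sketch of how a framework might shrink that window, using the Java scheduler driver API (reconcileTasks and killTask are real driver calls; the grace-period helper at the end is hypothetical, something your framework would have to provide):

import java.util.Arrays;

import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

public class LostTaskHandler {

  // Call this from your Scheduler.statusUpdate() callback.
  public static void handleUpdate(SchedulerDriver driver, TaskStatus status) {
    if (status.getState() != TaskState.TASK_LOST) {
      return;
    }

    // Ask the master what it currently knows about this task; the answer
    // comes back as another statusUpdate() callback (explicit reconciliation),
    // so don't relaunch synchronously here.
    driver.reconcileTasks(Arrays.asList(status));

    // Also issue a kill for the lost task ID, so that if the partitioned
    // slave later re-registers with the task still running, the master can
    // forward the kill rather than leave two copies running.
    driver.killTask(status.getTaskId());

    // Relaunch only after a grace period longer than the slave
    // re-registration timeout, not immediately on TASK_LOST.
    // scheduleRelaunchAfterGracePeriod(status.getTaskId()); // hypothetical
  }
}

None of this makes a brief double-run impossible; if you need hard mutual exclusion, the task itself should take a lock (e.g. in ZooKeeper) before doing real work.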

