It's the executor's responsibility to forcefully kill a task once the task
kill grace period expires. However, in your case it sounds like the executor
itself is getting stuck? What is happening in the executor? If the executor
is alive but doesn't implement the grace-period force-kill logic, the fix is
to update the executor to handle grace periods and to pass the grace period
from the scheduler side.
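
To make that concrete, here is a minimal sketch (not from your setup) of what
passing a grace period from the scheduler side can look like with the v1 HTTP
API, written as Python dicts. The task/agent/framework ids and the 30-second
value are assumptions for illustration; double-check the field names against
the scheduler.proto for your Mesos version:

    # Sketch only: a TaskInfo fragment with a kill policy, as it might appear
    # inside the LAUNCH operation of an ACCEPT call.
    task_info = {
        "task_id": {"value": "spark-driver-1"},   # hypothetical task id
        "agent_id": {"value": "<agent-id>"},
        "name": "spark-driver",
        "kill_policy": {
            # DurationInfo is expressed in nanoseconds; this is 30 seconds.
            "grace_period": {"nanoseconds": 30 * 10**9}
        },
        # ... command/executor, resources, etc. elided ...
    }

    # The KILL call can also carry a kill policy to override the one set at
    # launch time (if I remember the proto correctly).
    kill_call = {
        "framework_id": {"value": "<framework-id>"},
        "type": "KILL",
        "kill": {
            "task_id": {"value": "spark-driver-1"},
            "agent_id": {"value": "<agent-id>"},
            "kill_policy": {"grace_period": {"nanoseconds": 30 * 10**9}},
        },
    }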

If it's the executor that is stuck, the scheduler can issue a SHUTDOWN call,
and after an agent-configured timeout the executor will be forcefully killed:
https://github.com/apache/mesos/blob/1.5.0/include/mesos/v1/scheduler/scheduler.proto#L363-L373

However, this API cannot be used reliably until MESOS-8167 is in place.
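
For reference, a SHUTDOWN call over the v1 scheduler HTTP API looks roughly
like the sketch below. The master endpoint, the ids, and the requests-based
plumbing are assumptions; the framework id is just a placeholder:

    import requests

    SCHEDULER_API = "http://<mesos-master>:5050/api/v1/scheduler"  # assumed

    shutdown_call = {
        "framework_id": {"value": "<framework-id>"},
        "type": "SHUTDOWN",
        "shutdown": {
            "executor_id": {"value": "<executor-id>"},
            "agent_id": {"value": "<agent-id>"},
        },
    }

    # Non-SUBSCRIBE calls are sent as separate POSTs on behalf of an active
    # subscription, so the Mesos-Stream-Id returned by SUBSCRIBE is required.
    requests.post(
        SCHEDULER_API,
        json=shutdown_call,
        headers={"Mesos-Stream-Id": "<stream-id-from-subscribe>"},
    )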

There is also a KILL_CONTAINER agent API that allows you to manually kill a
stuck container as an operator:
https://github.com/apache/mesos/blob/1.5.0/include/mesos/v1/agent/agent.proto#L90
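
Again just as an illustrative sketch, an operator could POST that call to the
agent's v1 operator API directly; the agent URL and container id below are
assumptions (the container id should be discoverable via the agent's
GET_CONTAINERS call or its /containers endpoint):

    import requests

    AGENT_API = "http://<agent-host>:5051/api/v1"  # assumed agent endpoint

    kill_container_call = {
        "type": "KILL_CONTAINER",
        "kill_container": {
            "container_id": {"value": "<container-id>"},
            # "signal": 9,  # optionally specify a signal; otherwise the
            #               # agent's default kill behavior is used
        },
    }

    requests.post(AGENT_API, json=kill_container_call)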

On Tue, Apr 3, 2018 at 8:59 PM, Venkat Morampudi <venkatmoramp...@gmail.com>
wrote:

> Hi,
>
> We have a framework that launches Spark jobs on our Mesos cluster. We are
> currently having an issue where Spark jobs get stuck due to a timeout
> issue. We have cancel functionality that sends a task_kill message to the
> master. When a job gets stuck, the Spark driver task is not getting killed
> even though the agent on the node the driver is running on receives the
> kill request. Is there any timeout that I can set so that the Mesos agent
> can force kill the task in this scenario? I'd really appreciate your help.
>
> Thanks,
> Venkat
>
>
> Log entry from agent logs:
>
> I0404 03:44:47.367276 55066 slave.cpp:2035] Asked to kill task 79668.0.0
> of framework 35e600c2-6f43-402c-856f-9084c0040187-002
>
>