When a framework executor terminates, Mesos sends TASK_LOST status updates
for tasks that were running. However, if a task has processes that do not
terminate when the executor dies, we have a problem: Mesos considers the
slave resources assigned to those tasks released, whereas the task
processes keep running and holding those resources.

While it is a good practice for the task processes to exit when their
executor dies, I am not sure that can be guaranteed. I am wondering how
others are dealing with such "illegal" processes - that is, processes that
once belonged to Mesos-run tasks but no longer do.

Conceivably, a per-slave reaper/GC process can periodically scan the
slave's process tree to ensure all processes are 'legal'. Even assuming
such a reaper exists on the slave (which could be tricky in a
multi-framework environment) and can kill illegal processes without risk,
there is still a time window until the reaper completes its next cleanup
pass. In the meantime, new tasks can land and fail trying to use a resource
that Mesos assumed to be free. This is especially problematic for ports,
less so for CPU and memory.
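
To make the reaper idea concrete, here is a minimal sketch in Python of the
scan step only. It is not anything Mesos provides. It assumes, hypothetically,
that every task process carries a tag naming the executor that launched it
(for example via an environment variable or a cgroup name) and that the
reaper can obtain the set of executors that are still alive. The names Proc,
executor_id and find_illegal are made up for illustration. Matching on a tag
rather than on parent links is deliberate, since an orphaned process is
typically reparented and its original ancestry is lost.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Proc:
    pid: int
    # Hypothetical tag: id of the executor that launched this process,
    # or None for processes that never belonged to a Mesos task.
    executor_id: Optional[str]


def find_illegal(procs: Iterable[Proc], live_executors: Set[str]) -> List[int]:
    """Return pids tagged with an executor that is no longer alive."""
    return sorted(
        p.pid
        for p in procs
        if p.executor_id is not None and p.executor_id not in live_executors
    )


if __name__ == "__main__":
    snapshot = [
        Proc(pid=1, executor_id=None),        # system process, never tagged
        Proc(pid=100, executor_id="exec-a"),  # executor still alive
        Proc(pid=101, executor_id="exec-a"),
        Proc(pid=200, executor_id="exec-b"),  # executor gone, process remains
    ]
    print(find_illegal(snapshot, live_executors={"exec-a"}))
```

The sketch only reports candidates. Whether to kill them, and how to close
the window before the next scan, is exactly the open question above.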

Would love to hear thoughts on how you are handling this scenario.
