Have you checked the mesos-slave and mesos-master logs for that task id?
There should be logs in there for task state updates, including FINISHED.
In some cases the task status is not reliably delivered to your scheduler
(for example, after mesos-master restarts or leader election changes).
Mesos supports task reconciliation for this, and a periodic call from the
scheduler to reconcile tasks can help. Further enhancements to task
reconciliation are on the way. In the meantime, there are other strategies,
such as the one I use: periodic out-of-band heartbeats from my custom
executor to my scheduler. Timeouts on task runtimes are similar to
heartbeats, except that you need a priori knowledge of every task's
runtime.
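
To make the heartbeat idea concrete, here is a minimal sketch of the
scheduler-side bookkeeping. It is not taken from any real framework: the
HeartbeatTracker class, its method names, and the timeout value are all
hypothetical, and how the heartbeats actually reach the scheduler (the
out-of-band channel) is left out entirely.

```python
import time


class HeartbeatTracker:
    """Tracks the last heartbeat seen per task (hypothetical helper)."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock      # injectable so the logic can be tested
        self.last_seen = {}     # task_id -> time of the last heartbeat

    def record(self, task_id):
        """Call when a task is launched and on every heartbeat received."""
        self.last_seen[task_id] = self.clock()

    def forget(self, task_id):
        """Call when a terminal status update arrives for the task."""
        self.last_seen.pop(task_id, None)

    def stale_tasks(self):
        """Return ids of tasks whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            task_id
            for task_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

The scheduler would call stale_tasks() on a timer and then decide what to
do with each id, for example asking the master to reconcile that task or
treating it as lost. That policy is a design choice the sketch does not
make for you.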

Task runtime limits are not supported natively, as far as I know. Your
executor can implement them, and that may be one simple way to do it. That
could also be a good way to implement the shell's rlimit*, in general.
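
As a rough sketch of an executor-side runtime limit, assuming the task is
an external command started by the executor: run_with_limit and its
parameters are hypothetical names, and mapping the outcome onto the
TASK_FINISHED / TASK_FAILED / TASK_KILLED state names is my assumption
about what the executor would then report, not something Mesos does for
you. Sending the actual status update through the executor driver is
omitted.

```python
import subprocess
import sys


def run_with_limit(argv, limit_seconds):
    """Run argv and kill it if it exceeds limit_seconds.

    Returns the name of the task state the executor might report.
    """
    proc = subprocess.Popen(argv)
    try:
        returncode = proc.wait(timeout=limit_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()   # runtime limit exceeded: stop the process
        proc.wait()   # reap it so no zombie is left behind
        return "TASK_KILLED"
    return "TASK_FINISHED" if returncode == 0 else "TASK_FAILED"


if __name__ == "__main__":
    # Example: a trivial command that exits well within the limit.
    print(run_with_limit([sys.executable, "-c", "pass"], limit_seconds=5))
```

The limit itself would have to travel with the task, for example in the
data the scheduler hands to the executor, since TaskInfo has no timeout
field.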



On Wed, Jan 21, 2015 at 1:22 AM, Itamar Ostricher <[email protected]>
wrote:

> I'm using a custom internal framework, loosely based on MesosSubmit.
> The phenomenon I'm seeing is something like this:
> 1. Task X is assigned to slave S.
> 2. I know this task should run for ~10minutes.
> 3. On the master dashboard, I see that task X is in the "Running" state
> for several *hours*.
> 4. I SSH into slave S, and see that task X is *not* running. According to
> the local logs on that slave, task X finished a long time ago, and seemed
> to finish OK.
> 5. According to the scheduler logs, it never got any update from task X
> after the Staging->Running update.
>
> The phenomenon occurs pretty often, but it's not consistent or
> deterministic.
>
> I'd appreciate your input on how to go about debugging it, and/or
> implement a workaround to avoid wasted resources.
>
> I'm pretty sure the executor on the slave sends the TASK_FINISHED status
> update (how can I verify that beyond my own logging?).
> I'm pretty sure the scheduler never receives that update (again, how can I
> verify that beyond my own logging?).
> I have no idea if the master got the update and passed it through (how can
> I check that?).
> My scheduler and executor are written in Python.
>
> As for a workaround - setting a timeout on a task should do the trick. I
> did not see any timeout field in the TaskInfo message. Does mesos support
> the concept of per-task timeouts? Or should I implement my own task
> tracking and timeouting mechanism in the scheduler?
>
