I'm using a custom internal framework, loosely based on MesosSubmit. The phenomenon I'm seeing is something like this:

1. Task X is assigned to slave S.
2. I know this task should run for ~10 minutes.
3. On the master dashboard, I see that task X has been in the "Running" state for several *hours*.
4. I SSH into slave S and see that task X is *not* running. According to the local logs on that slave, task X finished a long time ago and seemed to finish OK (the executor's send path is sketched after this list).
5. According to the scheduler logs, the scheduler never got any update from task X after the Staging->Running update.
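For context, the executor's success path is just the standard status-update pattern from the old `mesos.interface` Python bindings, roughly like this simplified sketch (`do_work` stands in for the actual task logic):

```python
# Simplified sketch of the executor's finish path (old mesos.interface
# Python bindings); do_work() stands in for the real task logic.
import threading

import mesos.interface
from mesos.interface import mesos_pb2


class MyExecutor(mesos.interface.Executor):
    def launchTask(self, driver, task):
        def run():
            update = mesos_pb2.TaskStatus()
            update.task_id.value = task.task_id.value
            update.state = mesos_pb2.TASK_RUNNING
            driver.sendStatusUpdate(update)  # the Staging -> Running update

            do_work(task)  # runs for ~10 minutes

            update = mesos_pb2.TaskStatus()
            update.task_id.value = task.task_id.value
            update.state = mesos_pb2.TASK_FINISHED
            driver.sendStatusUpdate(update)  # this is the update that seems to get lost

        threading.Thread(target=run).start()
```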
The phenomenon occurs fairly often, but it's not consistent or deterministic. I'd appreciate your input on how to go about debugging it, and/or how to implement a workaround so these phantom tasks stop wasting resources. What I know so far:

1. I'm pretty sure the executor on the slave sends the TASK_FINISHED status update (how can I verify that beyond my own logging?).
2. I'm pretty sure the scheduler never receives that update (again, how can I verify that beyond my own logging?).
3. I have no idea whether the master got the update and passed it through (how can I check that?).

My scheduler and executor are written in Python.

As for a workaround: setting a timeout on a task should do the trick, but I did not see any timeout field in the TaskInfo message. Does Mesos support the concept of per-task timeouts, or should I implement my own task tracking and timeout mechanism in the scheduler? Something like the sketch below is what I have in mind.
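A minimal sketch of the scheduler-side workaround, against the old `mesos.interface` bindings. `TASK_TIMEOUT_SECS`, `_record_launch`, and `reap_stale_tasks` are names I made up, not part of the Mesos API; the logging in `statusUpdate` doubles as my receive-side verification:

```python
# Sketch of a scheduler-side timeout: track launch times, kill anything
# stuck in "Running" past a deadline.
import logging
import threading
import time

import mesos.interface
from mesos.interface import mesos_pb2

TASK_TIMEOUT_SECS = 30 * 60  # generous bound over the expected ~10 minutes

TERMINAL_STATES = (mesos_pb2.TASK_FINISHED, mesos_pb2.TASK_FAILED,
                   mesos_pb2.TASK_KILLED, mesos_pb2.TASK_LOST)


class TimeoutTrackingScheduler(mesos.interface.Scheduler):
    def __init__(self):
        self._lock = threading.Lock()
        self._launched = {}  # task id (str) -> launch timestamp

    def _record_launch(self, task_id_value):
        # Call this right after driver.launchTasks(...).
        with self._lock:
            self._launched[task_id_value] = time.time()

    def statusUpdate(self, driver, update):
        # Log every update that actually reaches the scheduler -- this is
        # the receive-side evidence for whether TASK_FINISHED ever arrived.
        logging.info("status update: task %s -> %s",
                     update.task_id.value,
                     mesos_pb2.TaskState.Name(update.state))
        if update.state in TERMINAL_STATES:
            with self._lock:
                self._launched.pop(update.task_id.value, None)

    def reap_stale_tasks(self, driver):
        # Drive this from a timer thread, e.g. once a minute.
        now = time.time()
        with self._lock:
            stale = [tid for tid, started in self._launched.items()
                     if now - started > TASK_TIMEOUT_SECS]
        for tid in stale:
            logging.warning("task %s still 'Running' after %ss; killing it",
                            tid, TASK_TIMEOUT_SECS)
            task_id = mesos_pb2.TaskID()
            task_id.value = tid
            driver.killTask(task_id)
            with self._lock:
                self._launched.pop(tid, None)
```

Is this the right direction, or is there a built-in mechanism I'm missing?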

