I'm using a custom internal framework, loosely based on MesosSubmit. The phenomenon I'm seeing is something like this:

1. Task X is assigned to slave S.
2. I know this task should run for ~10 minutes.
3. On the master dashboard, I see that task X has been in the "Running" state for several *hours*.
4. I SSH into slave S and see that task X is *not* running. According to the local logs on that slave, task X finished a long time ago and seemed to finish OK (the executor's send path is sketched after this list).
5. According to the scheduler logs, the scheduler never got any update from task X after the Staging->Running update.
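For context, the executor's success path is just the standard status-update pattern from the old `mesos.interface` Python bindings, roughly like this simplified sketch (`do_work` stands in for the actual task logic):

```python
# Simplified sketch of the executor's finish path (old mesos.interface
# Python bindings); do_work() stands in for the real task logic.
import threading

import mesos.interface
from mesos.interface import mesos_pb2


class MyExecutor(mesos.interface.Executor):
    def launchTask(self, driver, task):
        def run():
            update = mesos_pb2.TaskStatus()
            update.task_id.value = task.task_id.value
            update.state = mesos_pb2.TASK_RUNNING
            driver.sendStatusUpdate(update)  # the Staging -> Running update

            do_work(task)  # runs for ~10 minutes

            update = mesos_pb2.TaskStatus()
            update.task_id.value = task.task_id.value
            update.state = mesos_pb2.TASK_FINISHED
            driver.sendStatusUpdate(update)  # this is the update that seems to get lost

        threading.Thread(target=run).start()
```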
The phenomenon occurs fairly often, but it's not consistent or deterministic. I'd appreciate your input on how to go about debugging it, and/or how to implement a workaround so these phantom tasks stop wasting resources. What I know so far:

1. I'm pretty sure the executor on the slave sends the TASK_FINISHED status update (how can I verify that beyond my own logging?).
2. I'm pretty sure the scheduler never receives that update (again, how can I verify that beyond my own logging?).
3. I have no idea whether the master got the update and passed it through (how can I check that?).

My scheduler and executor are written in Python.

As for a workaround: setting a timeout on a task should do the trick, but I did not see any timeout field in the TaskInfo message. Does Mesos support the concept of per-task timeouts, or should I implement my own task tracking and timeout mechanism in the scheduler? Something like the sketch below is what I have in mind.
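A minimal sketch of the scheduler-side workaround, against the old `mesos.interface` bindings. `TASK_TIMEOUT_SECS`, `_record_launch`, and `reap_stale_tasks` are names I made up, not part of the Mesos API; the logging in `statusUpdate` doubles as my receive-side verification:

```python
# Sketch of a scheduler-side timeout: track launch times, kill anything
# stuck in "Running" past a deadline.
import logging
import threading
import time

import mesos.interface
from mesos.interface import mesos_pb2

TASK_TIMEOUT_SECS = 30 * 60  # generous bound over the expected ~10 minutes

TERMINAL_STATES = (mesos_pb2.TASK_FINISHED, mesos_pb2.TASK_FAILED,
                   mesos_pb2.TASK_KILLED, mesos_pb2.TASK_LOST)


class TimeoutTrackingScheduler(mesos.interface.Scheduler):
    def __init__(self):
        self._lock = threading.Lock()
        self._launched = {}  # task id (str) -> launch timestamp

    def _record_launch(self, task_id_value):
        # Call this right after driver.launchTasks(...).
        with self._lock:
            self._launched[task_id_value] = time.time()

    def statusUpdate(self, driver, update):
        # Log every update that actually reaches the scheduler -- this is
        # the receive-side evidence for whether TASK_FINISHED ever arrived.
        logging.info("status update: task %s -> %s",
                     update.task_id.value,
                     mesos_pb2.TaskState.Name(update.state))
        if update.state in TERMINAL_STATES:
            with self._lock:
                self._launched.pop(update.task_id.value, None)

    def reap_stale_tasks(self, driver):
        # Drive this from a timer thread, e.g. once a minute.
        now = time.time()
        with self._lock:
            stale = [tid for tid, started in self._launched.items()
                     if now - started > TASK_TIMEOUT_SECS]
        for tid in stale:
            logging.warning("task %s still 'Running' after %ss; killing it",
                            tid, TASK_TIMEOUT_SECS)
            task_id = mesos_pb2.TaskID()
            task_id.value = tid
            driver.killTask(task_id)
            with self._lock:
                self._launched.pop(tid, None)
```

Is this the right direction, or is there a built-in mechanism I'm missing?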

