Itamar, beyond checking the master and slave logs, could you please verify that your executor actually sends the TASK_FINISHED update? You may want to add some logging and then check the executor log. Mesos guarantees the delivery of status updates, so I suspect the problem is on the executor's side.
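For reference, here is a minimal sketch of an executor that does this, assuming the mesos.interface Python bindings (the actual task-running code is a placeholder):

    import logging
    import threading

    from mesos.interface import Executor, mesos_pb2

    class MyExecutor(Executor):
        def launchTask(self, driver, task):
            def run():
                # ... run the actual work for the task here ...
                update = mesos_pb2.TaskStatus()
                update.task_id.value = task.task_id.value
                update.state = mesos_pb2.TASK_FINISHED
                logging.info("sending TASK_FINISHED for %s",
                             task.task_id.value)
                driver.sendStatusUpdate(update)
            # launchTask must return quickly, so do the work on a thread.
            threading.Thread(target=run).start()

If the log line shows up but the scheduler never gets the update, the problem is between the slave and your scheduler; if it doesn't show up, the problem is in the executor.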
On Wed, Jan 21, 2015 at 6:58 PM, Sharma Podila <spod...@netflix.com> wrote:
> Have you checked the mesos-slave and mesos-master logs for that task id?
> There should be logs in there for task state updates, including FINISHED.
> There can be specific cases where the task status is not reliably sent to
> your scheduler (due to mesos-master restarts, leader election changes,
> etc.). There is task reconciliation support in Mesos. A periodic call to
> reconcile tasks from the scheduler can be helpful. There are also newer
> enhancements coming to task reconciliation. In the meantime, there are
> other strategies, such as the one I use: periodic heartbeats from my
> custom executor to my scheduler (out of band). Timeouts for task runtimes
> are similar to heartbeats, except that you need a priori knowledge of all
> tasks' runtimes.
>
> Task runtime limits are not supported inherently, as far as I know. Your
> executor can implement them, and that may be one simple way to do it.
> That could also be a good way to implement the shell's rlimit* in
> general.
>
> On Wed, Jan 21, 2015 at 1:22 AM, Itamar Ostricher <ita...@yowza3d.com>
> wrote:
>>
>> I'm using a custom internal framework, loosely based on MesosSubmit.
>> The phenomenon I'm seeing is something like this:
>> 1. Task X is assigned to slave S.
>> 2. I know this task should run for ~10 minutes.
>> 3. On the master dashboard, I see that task X is in the "Running" state
>> for several *hours*.
>> 4. I SSH into slave S, and see that task X is *not* running. According
>> to the local logs on that slave, task X finished a long time ago, and
>> seemed to finish OK.
>> 5. According to the scheduler logs, it never got any update from task X
>> after the Staging->Running update.
>>
>> The phenomenon occurs pretty often, but it's not consistent or
>> deterministic.
>>
>> I'd appreciate your input on how to go about debugging it, and/or
>> implementing a workaround to avoid wasted resources.
>>
>> I'm pretty sure the executor on the slave sends the TASK_FINISHED
>> status update (how can I verify that beyond my own logging?).
>> I'm pretty sure the scheduler never receives that update (again, how
>> can I verify that beyond my own logging?).
>> I have no idea if the master got the update and passed it through (how
>> can I check that?).
>> My scheduler and executor are written in Python.
>>
>> As for a workaround: setting a timeout on a task should do the trick. I
>> did not see any timeout field in the TaskInfo message. Does Mesos
>> support the concept of per-task timeouts? Or should I implement my own
>> task-tracking and timeout mechanism in the scheduler?
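To expand on the reconciliation Sharma mentioned above: a rough sketch with the Python bindings could look like the following, where pending_task_ids is a placeholder for however your scheduler tracks the tasks it believes are still running. The master answers reconcileTasks() through the regular statusUpdate() callback.

    import threading

    from mesos.interface import mesos_pb2

    def start_reconciliation(driver, pending_task_ids, interval=300.0):
        def loop():
            statuses = []
            for task_id in pending_task_ids():
                status = mesos_pb2.TaskStatus()
                status.task_id.value = task_id
                # Set the last state we know about; the master replies
                # with the authoritative state via statusUpdate().
                status.state = mesos_pb2.TASK_RUNNING
                statuses.append(status)
            driver.reconcileTasks(statuses)
            threading.Timer(interval, loop).start()
        loop()

That should surface tasks the master already considers finished or lost within one reconciliation interval, instead of leaving them "Running" for hours.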
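And since TaskInfo indeed has no timeout field, a scheduler-side timeout can be as simple as recording a deadline per launched task and calling killTask() on stragglers. A sketch (the integration points with your scheduler are up to you):

    import time

    from mesos.interface import mesos_pb2

    class TaskTimeouts(object):
        def __init__(self, driver):
            self.driver = driver
            self.deadlines = {}  # task id value -> absolute deadline

        def launched(self, task_id, max_runtime_secs):
            self.deadlines[task_id] = time.time() + max_runtime_secs

        def finished(self, task_id):  # call on any terminal status update
            self.deadlines.pop(task_id, None)

        def check(self):  # call periodically from your scheduler's loop
            now = time.time()
            for task_id, deadline in list(self.deadlines.items()):
                if now > deadline:
                    tid = mesos_pb2.TaskID()
                    tid.value = task_id
                    self.driver.killTask(tid)

As Sharma notes, this needs a priori knowledge of each task's runtime, but since you already know task X should run for ~10 minutes, a generous multiple of that makes a reasonable deadline.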