Itamar, you are right: the Mesos executor and containerizer cannot distinguish between "busy" and "stuck" processes. However, since you use your own custom executor, you may want to implement some sort of health check. What exactly it should look like depends on what your task processes are doing.
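For illustration, here is a minimal sketch of what such an executor-side check could look like. It assumes the mesos.interface Python bindings (your executor is in Python, per the thread below) and that each task maps to a single subprocess whose shell command arrives in task.data; WatchdogExecutor and RUNTIME_LIMIT_SECS are illustrative names, not anything from your actual setup:

# A minimal sketch, not a drop-in: assumes the mesos.interface Python bindings
# and that each task maps to one subprocess whose command arrives in task.data.
# WatchdogExecutor and RUNTIME_LIMIT_SECS are illustrative names.
import subprocess
import threading
import time

from mesos.interface import Executor, mesos_pb2


class WatchdogExecutor(Executor):
    # Generous upper bound for a task that normally runs ~10 minutes.
    RUNTIME_LIMIT_SECS = 30 * 60

    def launchTask(self, driver, task):
        def run():
            self._send(driver, task, mesos_pb2.TASK_RUNNING)
            proc = subprocess.Popen(['/bin/sh', '-c', task.data])
            # Python 2's Popen.wait() has no timeout, so poll in a loop.
            deadline = time.time() + self.RUNTIME_LIMIT_SECS
            while proc.poll() is None and time.time() < deadline:
                time.sleep(5)
            if proc.poll() is None:
                proc.kill()
                self._send(driver, task, mesos_pb2.TASK_FAILED,
                           'watchdog: runtime limit exceeded, process killed')
            elif proc.returncode == 0:
                self._send(driver, task, mesos_pb2.TASK_FINISHED)
            else:
                self._send(driver, task, mesos_pb2.TASK_FAILED,
                           'exit code %d' % proc.returncode)

        threading.Thread(target=run).start()

    def _send(self, driver, task, state, message=''):
        status = mesos_pb2.TaskStatus()
        status.task_id.value = task.task_id.value
        status.state = state
        status.message = message
        driver.sendStatusUpdate(status)

The same idea extends to liveness checks on the log file, or whatever progress signal your processes emit. Rough sketches of the scheduler-side ideas discussed further down the thread (reconciliation, timeouts, heartbeats) follow after the quoted messages below.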
There are hundreds of reasons why an OS process may "get stuck"; it doesn't look like it's Mesos-related in this case.

On Sat, Jan 24, 2015 at 9:17 PM, Itamar Ostricher <ita...@yowza3d.com> wrote:
> Alex, Sharma, thanks for your input!
>
> Trying to recreate the issue with a small cluster for the last few days, I
> was not able to observe a scenario where I could be sure that my executor
> sent the TASK_FINISHED update but the scheduler did not receive it.
> I did observe, multiple times, a scenario where a task seemed to be "stuck"
> in the TASK_RUNNING state, but when I SSH'ed into the slave running the
> task, I always saw that the process related to that task was still running
> (by grepping `ps aux`). Most of the time it seemed that the process had done
> the work (judging by the logs produced by the PID), but for some reason it
> was "stuck" without exiting cleanly. Sometimes it seemed that the process
> hadn't done any work (an empty log file for the PID). In all cases, as soon
> as I killed the PID, a TASK_FAILED update was sent and received successfully.
>
> So it seems that the problem is in the processes spawned by my executor,
> but I don't fully understand why this happens.
> Any ideas why a process would do some of the work (either 1% (just creating
> a log file) or 99% (doing everything but not exiting)) and then "get stuck"?
>
> On Fri, Jan 23, 2015 at 1:01 PM, Alex Rukletsov <a...@mesosphere.io> wrote:
>>
>> Itamar,
>>
>> beyond checking the master and slave logs, could you please verify that
>> your executor does send the TASK_FINISHED update? You may want to add some
>> logging and then check the executor log. Mesos guarantees the delivery of
>> status updates, so I suspect the problem is on the executor's side.
>>
>> On Wed, Jan 21, 2015 at 6:58 PM, Sharma Podila <spod...@netflix.com> wrote:
>> > Have you checked the mesos-slave and mesos-master logs for that task id?
>> > There should be logs in there for task state updates, including FINISHED.
>> > There are specific cases where the task status is not reliably sent to
>> > your scheduler (due to mesos-master restarts, leader election changes,
>> > etc.). There is task reconciliation support in Mesos. A periodic call to
>> > reconcile tasks from the scheduler can be helpful. There are also newer
>> > enhancements coming to task reconciliation. In the meantime, there are
>> > other strategies, such as the one I use: periodic heartbeats from my
>> > custom executor to my scheduler (out of band). Timeouts on task runtimes
>> > are similar to heartbeats, except that you need a priori knowledge of
>> > all tasks' runtimes.
>> >
>> > Task runtime limits are not supported inherently, as far as I know. Your
>> > executor can implement them, and that may be one simple way to do it.
>> > That could also be a good way to implement the shell's rlimit*, in
>> > general.
>> >
>> > On Wed, Jan 21, 2015 at 1:22 AM, Itamar Ostricher <ita...@yowza3d.com>
>> > wrote:
>> >>
>> >> I'm using a custom internal framework, loosely based on MesosSubmit.
>> >> The phenomenon I'm seeing is something like this:
>> >> 1. Task X is assigned to slave S.
>> >> 2. I know this task should run for ~10 minutes.
>> >> 3. On the master dashboard, I see that task X is in the "Running" state
>> >> for several *hours*.
>> >> 4. I SSH into slave S, and see that task X is *not* running. According
>> >> to the local logs on that slave, task X finished a long time ago, and
>> >> seemed to finish OK.
>> >> 5. According to the scheduler logs, it never got any update from task X
>> >> after the Staging->Running update.
>> >>
>> >> The phenomenon occurs pretty often, but it's not consistent or
>> >> deterministic.
>> >>
>> >> I'd appreciate your input on how to go about debugging it, and/or
>> >> implementing a workaround to avoid wasting resources.
>> >>
>> >> I'm pretty sure the executor on the slave sends the TASK_FINISHED
>> >> status update (how can I verify that beyond my own logging?).
>> >> I'm pretty sure the scheduler never receives that update (again, how
>> >> can I verify that beyond my own logging?).
>> >> I have no idea if the master got the update and passed it through (how
>> >> can I check that?).
>> >> My scheduler and executor are written in Python.
>> >>
>> >> As for a workaround - setting a timeout on a task should do the trick.
>> >> I did not see any timeout field in the TaskInfo message. Does Mesos
>> >> support the concept of per-task timeouts? Or should I implement my own
>> >> task-tracking and timeout mechanism in the scheduler?
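Regarding Sharma's reconciliation suggestion and the timeout question above, here is a rough sketch of what both could look like on the scheduler side, again assuming the mesos.interface Python bindings. ReconcilingScheduler, EXPECTED_RUNTIME_SECS and GRACE_SECS are illustrative, and periodic_check() is assumed to be invoked from a timer thread in your framework:

# A sketch of the scheduler-side ideas from the thread: periodic explicit
# reconciliation plus a home-grown per-task deadline (TaskInfo has no timeout
# field). The class and constant names here are illustrative only.
import time

from mesos.interface import Scheduler, mesos_pb2

TERMINAL_STATES = (mesos_pb2.TASK_FINISHED, mesos_pb2.TASK_FAILED,
                   mesos_pb2.TASK_KILLED, mesos_pb2.TASK_LOST)


class ReconcilingScheduler(Scheduler):
    EXPECTED_RUNTIME_SECS = 10 * 60
    GRACE_SECS = 5 * 60

    def __init__(self):
        self.running = {}  # task id value -> (launch timestamp, SlaveID)

    def statusUpdate(self, driver, update):
        task_id = update.task_id.value
        if update.state == mesos_pb2.TASK_RUNNING:
            self.running.setdefault(task_id, (time.time(), update.slave_id))
        elif update.state in TERMINAL_STATES:
            self.running.pop(task_id, None)

    def periodic_check(self, driver):
        # Explicit reconciliation: ask the master for its latest view of every
        # task we still believe is running; answers arrive via statusUpdate().
        statuses = []
        for task_id, (_, slave_id) in self.running.items():
            status = mesos_pb2.TaskStatus()
            status.task_id.value = task_id
            status.slave_id.CopyFrom(slave_id)
            status.state = mesos_pb2.TASK_RUNNING
            statuses.append(status)
        driver.reconcileTasks(statuses)

        # Home-grown per-task timeout: kill tasks that have overstayed.
        now = time.time()
        for task_id, (started, _) in list(self.running.items()):
            if now - started > self.EXPECTED_RUNTIME_SECS + self.GRACE_SECS:
                kill_id = mesos_pb2.TaskID()
                kill_id.value = task_id
                driver.killTask(kill_id)

Note that explicit reconciliation only helps when the master still knows the task's true state; combining it with the deadline check covers the case where the task is genuinely wedged on the slave.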
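Sharma also mentions out-of-band heartbeats from the executor to the scheduler. The thread does not say which channel he uses; one possible channel, sketched below purely as an assumption, is Mesos framework messages. Keep in mind framework messages are delivered best-effort, so a missed heartbeat is a hint, not proof, that a task is stuck. HEARTBEAT_INTERVAL_SECS and the message format are made up for the example:

# A rough sketch of executor -> scheduler heartbeats over framework messages
# (one option among several; not necessarily what Sharma's setup does).
import threading
import time

from mesos.interface import Executor, Scheduler


class HeartbeatingExecutor(Executor):
    HEARTBEAT_INTERVAL_SECS = 30  # illustrative value

    def registered(self, driver, executorInfo, frameworkInfo, slaveInfo):
        def beat():
            while True:
                # Made-up message format: "heartbeat <executor id> <unix time>".
                driver.sendFrameworkMessage(
                    'heartbeat %s %d' % (executorInfo.executor_id.value,
                                         int(time.time())))
                time.sleep(self.HEARTBEAT_INTERVAL_SECS)

        t = threading.Thread(target=beat)
        t.daemon = True
        t.start()


class HeartbeatAwareScheduler(Scheduler):
    def __init__(self):
        self.last_heartbeat = {}  # executor id value -> timestamp

    def frameworkMessage(self, driver, executorId, slaveId, message):
        if message.startswith('heartbeat'):
            self.last_heartbeat[executorId.value] = time.time()
        # A periodic job elsewhere can flag executors whose last heartbeat is
        # older than some threshold, then reconcile or kill their tasks.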