I deal with Java programs running in my executor that spawn various "service/daemon threads". So, when I know the task is complete, I tend to explicitly send the TASK_FINISHED update and then call System.exit() (after a short sleep, to give Mesos time to communicate the task update) instead of waiting for all threads to exit naturally.
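Since your executor is in Python, here is a rough sketch of the same pattern there. The names `finish_and_exit` and `send_update` are made up for illustration; `send_update` stands in for whatever your executor already uses to emit a status update, and the default grace period is just a guess, not a Mesos-defined value.

```python
import os
import time


def finish_and_exit(send_update, task_id, grace_seconds=2.0, exit_fn=os._exit):
    """Report completion, wait briefly, then force the process to exit.

    send_update(task_id, state) is a hypothetical stand-in for the
    executor's own status-update call. exit_fn is injectable so the
    function can be exercised without killing the interpreter.
    """
    # 1. Tell the scheduler the task is done before doing anything else.
    send_update(task_id, "TASK_FINISHED")
    # 2. Give the update a moment to leave the process.
    time.sleep(grace_seconds)
    # 3. Hard exit: os._exit() does not wait for lingering non-daemon
    #    threads. It also skips cleanup handlers and does not flush
    #    stdio buffers, so flush your logs before calling this.
    exit_fn(0)
```

You would call this at the point where your task logic knows it is done, rather than relying on the interpreter to shut down once every thread has returned.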
Of course, this may not apply to your situation, but, just in case...

On Mon, Jan 26, 2015 at 4:43 AM, Itamar Ostricher <[email protected]> wrote:

> Thanks Alex.
> I agree that it looks like it's not mesos-related. It's probably some
> dead-lock.
>
> On Mon, Jan 26, 2015 at 1:31 PM, Alex Rukletsov <[email protected]>
> wrote:
>
>> Itamar,
>>
>> you are right, Mesos executor and containerizer cannot distinguish
>> between "busy" and "stuck" processes. However, since you use your own
>> custom executor, you may want to implement a sort of health checks. It
>> depends on what your task processes are doing.
>>
>> There are hundreds of reasons why an OS process may "get stuck"; it
>> doesn't look like it's Mesos-related in this case.
>>
>> On Sat, Jan 24, 2015 at 9:17 PM, Itamar Ostricher <[email protected]>
>> wrote:
>> > Alex, Sharma, thanks for your input!
>> >
>> > Trying to recreate the issue with a small cluster for the last few
>> > days, I was not able to observe a scenario that I can be sure that my
>> > executor sent the TASK_FINISHED update, but the scheduler did not
>> > receive it.
>> > I did observe multiple times a scenario that a task seemed to be
>> > "stuck" in TASK_RUNNING state, but when I SSH'ed into the slave that
>> > has the task, I always saw that the process related to that task is
>> > still running (by grepping `ps aux`). Most of the times, it seemed
>> > that the process did the work (by examining the logs produced by the
>> > PID), but for some reason it was "stuck" without exiting cleanly. Some
>> > times it seemed that the process didn't do any work (an empty log file
>> > with the PID). All times, as soon as I killed the PID, a TASK_FAILED
>> > update was sent and received successfully.
>> >
>> > So, it seems that the problem is in processes spawned by my executor,
>> > but I don't fully understand why this happens.
>> > Any ideas why a process would do some work (either 1% (just creating a
>> > log file) or 99% (doing everything but not exiting)) and "get stuck"?
>> >
>> > On Fri, Jan 23, 2015 at 1:01 PM, Alex Rukletsov <[email protected]>
>> > wrote:
>> >>
>> >> Itamar,
>> >>
>> >> beyond checking master and slave logs, could you please verify your
>> >> executor does send the TASK_FINISHED update? You may want to add some
>> >> logging and then check the executor log. Mesos guarantees the delivery
>> >> of status updates, so I suspect the problem is on the executor's side.
>> >>
>> >> On Wed, Jan 21, 2015 at 6:58 PM, Sharma Podila <[email protected]>
>> >> wrote:
>> >> > Have you checked the mesos-slave and mesos-master logs for that task
>> >> > id? There should be logs in there for task state updates, including
>> >> > FINISHED.
>> >> > There can be specific cases where sometimes the task status is not
>> >> > reliably sent to your scheduler (due to mesos-master restarts,
>> >> > leader election changes, etc.). There is task reconciliation support
>> >> > in Mesos. A periodic call to reconcile tasks from the scheduler can
>> >> > be helpful. There are also newer enhancements coming to the task
>> >> > reconciliation. In the meantime, there are other strategies such as
>> >> > what I use, which is periodic heartbeats from my custom executor to
>> >> > my scheduler (out of band). The timeouts for task runtimes are
>> >> > similar to heartbeats, except, you need a priori knowledge of all
>> >> > tasks' runtimes.
>> >> >
>> >> > Task runtime limits are not supported inherently, as far as I know.
>> >> > Your executor can implement it, and that may be one simple way to do
>> >> > it. That could also be a good way to implement shell's rlimit*, in
>> >> > general.
>> >> >
>> >> > On Wed, Jan 21, 2015 at 1:22 AM, Itamar Ostricher
>> >> > <[email protected]> wrote:
>> >> >>
>> >> >> I'm using a custom internal framework, loosely based on MesosSubmit.
>> >> >> The phenomenon I'm seeing is something like this:
>> >> >> 1. Task X is assigned to slave S.
>> >> >> 2. I know this task should run for ~10minutes.
>> >> >> 3. On the master dashboard, I see that task X is in the "Running"
>> >> >>    state for several *hours*.
>> >> >> 4. I SSH into slave S, and see that task X is *not* running.
>> >> >>    According to the local logs on that slave, task X finished a
>> >> >>    long time ago, and seemed to finish OK.
>> >> >> 5. According to the scheduler logs, it never got any update from
>> >> >>    task X after the Staging->Running update.
>> >> >>
>> >> >> The phenomenon occurs pretty often, but it's not consistent or
>> >> >> deterministic.
>> >> >>
>> >> >> I'd appreciate your input on how to go about debugging it, and/or
>> >> >> implement a workaround to avoid wasted resources.
>> >> >>
>> >> >> I'm pretty sure the executor on the slave sends the TASK_FINISHED
>> >> >> status update (how can I verify that beyond my own logging?).
>> >> >> I'm pretty sure the scheduler never receives that update (again,
>> >> >> how can I verify that beyond my own logging?).
>> >> >> I have no idea if the master got the update and passed it through
>> >> >> (how can I check that?).
>> >> >> My scheduler and executor are written in Python.
>> >> >>
>> >> >> As for a workaround - setting a timeout on a task should do the
>> >> >> trick. I did not see any timeout field in the TaskInfo message.
>> >> >> Does mesos support the concept of per-task timeouts? Or should I
>> >> >> implement my own task tracking and timeouting mechanism in the
>> >> >> scheduler?
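On the per-task timeout question at the bottom of the thread: if you end up tracking timeouts yourself in the scheduler, a minimal sketch might look like the one below. `TaskTimeoutTracker` and its method names are invented for illustration and are not part of Mesos. It assumes the scheduler knows each task's maximum expected runtime up front.

```python
import time


class TaskTimeoutTracker:
    """Scheduler-side bookkeeping of per-task deadlines (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        # clock is injectable so the tracker can be tested with a fake time.
        self._clock = clock
        self._deadlines = {}  # task_id -> absolute deadline on that clock

    def task_started(self, task_id, max_runtime_seconds):
        # Call this when the scheduler sees the task enter TASK_RUNNING.
        self._deadlines[task_id] = self._clock() + max_runtime_seconds

    def task_ended(self, task_id):
        # Call this on any terminal update (finished, failed, killed, lost).
        self._deadlines.pop(task_id, None)

    def overdue(self):
        # Task ids whose deadline has passed, in sorted order.
        now = self._clock()
        return sorted(t for t, d in self._deadlines.items() if d <= now)
```

The scheduler would poll `overdue()` on a timer and ask Mesos to kill whatever it returns. For a ~10 minute task you would register something like `tracker.task_started(task_id, 15 * 60)` to leave some slack.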

