Thanks Alex. I agree that it looks like it's not Mesos-related. It's probably some deadlock.
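One cheap way to confirm a suspected deadlock, assuming the stuck task processes are themselves Python (the thread only states that the scheduler and executor are), is to have them register a handler that dumps every thread's stack on demand. A minimal sketch of the idea, not code from this thread:

```python
# Sketch: add this near the top of the task process's entry point so a
# "stuck" process can be asked for a traceback instead of being killed.
# faulthandler is stdlib in Python 3.3+; on 2.7 it is available as a
# backport package of the same name.
import faulthandler
import signal

# After this, `kill -USR1 <pid>` makes the process dump the stack of every
# thread to stderr (which ends up in the task sandbox), showing exactly
# which lock, join() or read() each thread is blocked on.
faulthandler.register(signal.SIGUSR1)
```

Sending `kill -USR1` to one of the stuck PIDs then shows where each thread is blocked, which usually settles whether it is a real deadlock or just a process that never exits.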
On Mon, Jan 26, 2015 at 1:31 PM, Alex Rukletsov <[email protected]> wrote:
> Itamar,
>
> you are right, the Mesos executor and containerizer cannot distinguish
> between "busy" and "stuck" processes. However, since you use your own
> custom executor, you may want to implement a sort of health check. It
> depends on what your task processes are doing.
>
> There are hundreds of reasons why an OS process may "get stuck"; it
> doesn't look like it's Mesos-related in this case.
>
> On Sat, Jan 24, 2015 at 9:17 PM, Itamar Ostricher <[email protected]> wrote:
> > Alex, Sharma, thanks for your input!
> >
> > Trying to recreate the issue with a small cluster for the last few days,
> > I was not able to observe a scenario where I could be sure that my
> > executor sent the TASK_FINISHED update but the scheduler did not receive
> > it.
> > I did observe multiple times a scenario where a task seemed to be "stuck"
> > in TASK_RUNNING state, but when I SSH'ed into the slave that has the
> > task, I always saw that the process related to that task was still
> > running (by grepping `ps aux`). Most of the time, it seemed that the
> > process did the work (by examining the logs produced by the PID), but for
> > some reason it was "stuck" without exiting cleanly. Sometimes it seemed
> > that the process didn't do any work (an empty log file with the PID).
> > Every time, as soon as I killed the PID, a TASK_FAILED update was sent
> > and received successfully.
> >
> > So, it seems that the problem is in processes spawned by my executor, but
> > I don't fully understand why this happens.
> > Any ideas why a process would do some work (either 1% (just creating a
> > log file) or 99% (doing everything but not exiting)) and "get stuck"?
> >
> > On Fri, Jan 23, 2015 at 1:01 PM, Alex Rukletsov <[email protected]> wrote:
> >> Itamar,
> >>
> >> beyond checking master and slave logs, could you please verify your
> >> executor does send the TASK_FINISHED update? You may want to add some
> >> logging and check the executor log. Mesos guarantees the delivery of
> >> status updates, so I suspect the problem is on the executor's side.
> >>
> >> On Wed, Jan 21, 2015 at 6:58 PM, Sharma Podila <[email protected]> wrote:
> >> > Have you checked the mesos-slave and mesos-master logs for that task
> >> > id? There should be logs in there for task state updates, including
> >> > FINISHED. There can be specific cases where sometimes the task status
> >> > is not reliably sent to your scheduler (due to mesos-master restarts,
> >> > leader election changes, etc.). There is task reconciliation support in
> >> > Mesos. A periodic call to reconcile tasks from the scheduler can be
> >> > helpful. There are also newer enhancements coming to task
> >> > reconciliation. In the meantime, there are other strategies such as
> >> > what I use, which is periodic heartbeats from my custom executor to my
> >> > scheduler (out of band). Timeouts for task runtimes are similar to
> >> > heartbeats, except you need a priori knowledge of all tasks' runtimes.
> >> >
> >> > Task runtime limits are not supported inherently, as far as I know.
> >> > Your executor can implement it, and that may be one simple way to do
> >> > it. That could also be a good way to implement the shell's rlimit*, in
> >> > general.
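As a rough illustration of the periodic reconciliation Sharma suggests above (not code from this thread): the scheduler can, on a timer, hand the master the last state it knows for each task it still believes is running, and the master answers through the normal statusUpdate() callback. A minimal sketch, assuming the old Python bindings (mesos.interface / mesos_pb2) and an invented `running_tasks` bookkeeping dict:

```python
# Sketch of periodic explicit reconciliation from a Python scheduler.
# `running_tasks` is a hypothetical mapping of task-id string -> slave-id
# string maintained by the scheduler; real bookkeeping will differ.
import threading

from mesos.interface import mesos_pb2


def reconcile_periodically(driver, running_tasks, interval=300):
    """Every `interval` seconds, ask the master for the current state of every
    task the scheduler still believes is running. The master replies through
    the scheduler's statusUpdate() callback, so a TASK_FINISHED (or TASK_LOST)
    that was missed the first time eventually reaches the scheduler anyway."""
    def tick():
        statuses = []
        for task_id, slave_id in running_tasks.items():
            status = mesos_pb2.TaskStatus()
            status.task_id.value = task_id
            status.slave_id.value = slave_id
            status.state = mesos_pb2.TASK_RUNNING  # the last state we saw
            statuses.append(status)
        if statuses:
            driver.reconcileTasks(statuses)
        timer = threading.Timer(interval, tick)
        timer.daemon = True
        timer.start()

    tick()
```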
> >> > On Wed, Jan 21, 2015 at 1:22 AM, Itamar Ostricher <[email protected]> wrote:
> >> >> I'm using a custom internal framework, loosely based on MesosSubmit.
> >> >> The phenomenon I'm seeing is something like this:
> >> >> 1. Task X is assigned to slave S.
> >> >> 2. I know this task should run for ~10 minutes.
> >> >> 3. On the master dashboard, I see that task X is in the "Running"
> >> >> state for several *hours*.
> >> >> 4. I SSH into slave S, and see that task X is *not* running. According
> >> >> to the local logs on that slave, task X finished a long time ago, and
> >> >> seemed to finish OK.
> >> >> 5. According to the scheduler logs, it never got any update from task
> >> >> X after the Staging->Running update.
> >> >>
> >> >> The phenomenon occurs pretty often, but it's not consistent or
> >> >> deterministic.
> >> >>
> >> >> I'd appreciate your input on how to go about debugging it, and/or
> >> >> implementing a workaround to avoid wasted resources.
> >> >>
> >> >> I'm pretty sure the executor on the slave sends the TASK_FINISHED
> >> >> status update (how can I verify that beyond my own logging?).
> >> >> I'm pretty sure the scheduler never receives that update (again, how
> >> >> can I verify that beyond my own logging?).
> >> >> I have no idea if the master got the update and passed it through
> >> >> (how can I check that?).
> >> >> My scheduler and executor are written in Python.
> >> >>
> >> >> As for a workaround - setting a timeout on a task should do the trick.
> >> >> I did not see any timeout field in the TaskInfo message. Does Mesos
> >> >> support the concept of per-task timeouts? Or should I implement my own
> >> >> task tracking and timeout mechanism in the scheduler?
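On Itamar's last question: TaskInfo indeed carries no timeout field, so a per-task runtime limit has to live in the framework, and the custom executor is a natural place for it, as Sharma notes. A minimal, hypothetical sketch against the old Python bindings; the shell command and the 600-second limit are placeholders, not anything from this thread:

```python
# Sketch of an executor-side per-task runtime limit.
import subprocess
import threading

from mesos.interface import Executor, mesos_pb2


class TimeoutExecutor(Executor):
    def launchTask(self, driver, task):
        def send(state):
            status = mesos_pb2.TaskStatus()
            status.task_id.value = task.task_id.value
            status.state = state
            driver.sendStatusUpdate(status)

        def run():
            send(mesos_pb2.TASK_RUNNING)
            # Placeholder command; a real executor would take it from
            # task.data or task.command.
            proc = subprocess.Popen(["/bin/sh", "-c", "run-the-task"])
            # Watchdog: kill the process if it is still alive after the limit
            # (Python 2.7-era pattern, since Popen.wait() has no timeout there).
            watchdog = threading.Timer(600, proc.kill)
            watchdog.daemon = True
            watchdog.start()
            returncode = proc.wait()
            watchdog.cancel()
            send(mesos_pb2.TASK_FINISHED if returncode == 0
                 else mesos_pb2.TASK_FAILED)

        # launchTask must not block, so do the work on a separate thread.
        worker = threading.Thread(target=run)
        worker.daemon = True
        worker.start()
```

If the watchdog fires, proc.wait() returns a non-zero code and the executor reports TASK_FAILED, so the scheduler stops waiting on a task that would otherwise sit in TASK_RUNNING indefinitely.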

