Re: Trying to debug an issue in mesos task tracking
Itamar, you are right: the Mesos executor and containerizer cannot distinguish between busy and stuck processes. However, since you use your own custom executor, you may want to implement some sort of health check; what it should look like depends on what your task processes are doing. There are hundreds of reasons why an OS process may get stuck, and it doesn't look like a Mesos problem in this case.

On Sat, Jan 24, 2015 at 9:17 PM, Itamar Ostricher ita...@yowza3d.com wrote: [...]
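A minimal sketch of the executor-side health check Alex suggests above, assuming the mesos.interface Python bindings of that era; the progress signal (the task's log file mtime), the timeout threshold, and the `watch_task` helper are all made-up placeholders for whatever your tasks actually expose:

```python
import os
import threading
import time

from mesos.interface import mesos_pb2

PROGRESS_TIMEOUT_SECS = 600  # assumed threshold; tune per task type


def watch_task(driver, task_id, process, log_path):
    """Kill a task and report TASK_FAILED if its log file (our stand-in
    progress signal) stops advancing; `process` is the subprocess.Popen
    handle the executor spawned for the task."""
    def loop():
        while process.poll() is None:  # child is still running
            idle = time.time() - os.path.getmtime(log_path)
            if idle > PROGRESS_TIMEOUT_SECS:
                process.kill()  # stuck, not busy: reclaim the resources
                update = mesos_pb2.TaskStatus()
                update.task_id.value = task_id
                update.state = mesos_pb2.TASK_FAILED
                update.message = "no progress for %d seconds" % int(idle)
                driver.sendStatusUpdate(update)
                return
            time.sleep(30)
    threading.Thread(target=loop).start()
```

Killing the stuck process rather than merely reporting it matches what Itamar observed: as soon as the PID dies, a terminal status update flows through normally.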
Re: Trying to debug an issue in mesos task tracking
Alex, Sharma, thanks for your input! Trying to recreate the issue on a small cluster over the last few days, I was not able to observe a scenario where I could be sure that my executor sent the TASK_FINISHED update but the scheduler did not receive it. I did observe, multiple times, a scenario where a task seemed to be stuck in the TASK_RUNNING state, but when I SSH'ed into the slave running the task, I always saw that the process related to that task was still running (by grepping `ps aux`). Most of the time the process appeared to have done its work (judging by the log produced by that PID) but was stuck without exiting cleanly; sometimes it appeared to have done no work at all (an empty log file for that PID). In all cases, as soon as I killed the PID, a TASK_FAILED update was sent and received successfully. So it seems the problem is in the processes spawned by my executor, but I don't fully understand why this happens. Any ideas why a process would do some of the work (either 1%, just creating the log file, or 99%, doing everything except exiting) and then get stuck?

On Fri, Jan 23, 2015 at 1:01 PM, Alex Rukletsov a...@mesosphere.io wrote: [...]
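One way to answer "where is it stuck?" is to let the task process dump its own stack on demand. A minimal sketch using the faulthandler module (in the standard library since Python 3.3; a backport package of the same name exists for Python 2), installed once at task startup; the choice of SIGUSR1 is just a convention:

```python
import signal
import sys

import faulthandler  # stdlib on 3.3+; `pip install faulthandler` on 2.x

# After this, `kill -USR1 <pid>` makes the process print the traceback of
# every thread to stderr instead of doing nothing.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)
```

The next time a task lingers after apparently finishing its work, `kill -USR1 $PID` on the slave shows exactly which line each thread is blocked on; a non-daemon thread that never exits, or a wait on a full pipe, are common reasons a process does 99% of the work and then hangs.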
Re: Trying to debug an issue in mesos task tracking
Itamar, beyond checking the master and slave logs, could you please verify that your executor does send the TASK_FINISHED update? You may want to add some logging and then check the executor log. Mesos guarantees the delivery of status updates, so I suspect the problem is on the executor's side.

On Wed, Jan 21, 2015 at 6:58 PM, Sharma Podila spod...@netflix.com wrote: [...]
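To make "did the executor send it?" answerable after the fact, log on both sides of the send call. A minimal sketch assuming the mesos.interface Python bindings; `send_finished` is a hypothetical helper inside your executor:

```python
import logging

from mesos.interface import mesos_pb2


def send_finished(driver, task_id):
    """Report a task as finished, logging before and after the call so the
    executor log proves whether the update ever left our code."""
    update = mesos_pb2.TaskStatus()
    update.task_id.value = task_id
    update.state = mesos_pb2.TASK_FINISHED
    logging.info("sending TASK_FINISHED for task %s", task_id)
    driver.sendStatusUpdate(update)
    logging.info("sendStatusUpdate for task %s returned", task_id)
```

Cross-checking is then a matter of grepping the task id: the executor log shows the send, the slave log shows whether the update was forwarded and acknowledged, and the master log shows what was relayed to the scheduler.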
Re: Trying to debug an issue in mesos task tracking
Have you checked the mesos-slave and mesos-master logs for that task id? There should be log lines in there for task state updates, including FINISHED. There are specific cases where the task status is not reliably delivered to your scheduler (mesos-master restarts, leader election changes, etc.). Mesos has task reconciliation support: a periodic call from the scheduler to reconcile tasks can be helpful, and newer enhancements to reconciliation are on the way. In the meantime there are other strategies, such as the one I use: periodic heartbeats from my custom executor to my scheduler, out of band. Timeouts on task runtimes are similar to heartbeats, except that you need a priori knowledge of each task's runtime. Task runtime limits are not supported natively, as far as I know; your executor can implement them, and that may be the simplest way to do it. It could also be a good way to implement the shell's resource limits (rlimit*) in general.

On Wed, Jan 21, 2015 at 1:22 AM, Itamar Ostricher ita...@yowza3d.com wrote:

I'm using a custom internal framework, loosely based on MesosSubmit. The phenomenon I'm seeing is something like this:

1. Task X is assigned to slave S.
2. I know this task should run for ~10 minutes.
3. On the master dashboard, I see that task X stays in the Running state for several *hours*.
4. I SSH into slave S and see that task X is *not* running. According to the local logs on that slave, task X finished a long time ago, and seemed to finish OK.
5. According to the scheduler logs, the scheduler never got any update for task X after the Staging→Running update.

The phenomenon occurs pretty often, but it's not consistent or deterministic. I'd appreciate your input on how to go about debugging it, and/or how to implement a workaround to avoid wasting resources. I'm pretty sure the executor on the slave sends the TASK_FINISHED status update (how can I verify that beyond my own logging?). I'm pretty sure the scheduler never receives that update (again, how can I verify that beyond my own logging?). I have no idea whether the master got the update and passed it through (how can I check that?). My scheduler and executor are written in Python. As for a workaround: setting a timeout on a task should do the trick, but I did not see any timeout field in the TaskInfo message. Does Mesos support the concept of per-task timeouts, or should I implement my own task tracking and timeout mechanism in the scheduler?
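A minimal sketch of the periodic reconciliation Sharma mentions, assuming the mesos.interface Python bindings; `pending_task_ids` is a hypothetical accessor for whatever bookkeeping the scheduler already keeps:

```python
import threading

from mesos.interface import mesos_pb2

RECONCILE_INTERVAL_SECS = 300  # assumed cadence; tune for your cluster


def start_reconciliation(driver, pending_task_ids):
    """Periodically ask the master for its view of every task the
    scheduler still believes is running; the answers arrive through the
    scheduler's normal statusUpdate() callback."""
    def loop():
        statuses = []
        for task_id in pending_task_ids():
            status = mesos_pb2.TaskStatus()
            status.task_id.value = task_id
            # state is a required field; the master replies with the
            # authoritative state regardless of what we put here.
            status.state = mesos_pb2.TASK_RUNNING
            statuses.append(status)
        driver.reconcileTasks(statuses)
        threading.Timer(RECONCILE_INTERVAL_SECS, loop).start()
    loop()
```

Combined with a per-task deadline recorded at launch time, this also gives the timeout workaround asked about in the original question: any task still unresolved after reconciliation and past its deadline can be killed via driver.killTask and marked failed by the scheduler itself.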