Thanks Alex. I agree that it looks like it's not Mesos-related. It's probably some deadlock.
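One cheap way to confirm a suspected deadlock, assuming the stuck task processes are themselves Python (the thread only states that the scheduler and executor are), is to have them register a handler that dumps every thread's stack on demand. A minimal sketch of the idea, not code from this thread:

```python
# Sketch: add this near the top of the task process's entry point so a
# "stuck" process can be asked for a traceback instead of being killed.
# faulthandler is stdlib in Python 3.3+; on 2.7 it is available as a
# backport package of the same name.
import faulthandler
import signal

# After this, `kill -USR1 <pid>` makes the process dump the stack of every
# thread to stderr (which ends up in the task sandbox), showing exactly
# which lock, join() or read() each thread is blocked on.
faulthandler.register(signal.SIGUSR1)
```

Sending `kill -USR1` to one of the stuck PIDs then shows where each thread is blocked, which usually settles whether it is a real deadlock or just a process that never exits.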
On Mon, Jan 26, 2015 at 1:31 PM, Alex Rukletsov <[email protected]> wrote:
> Itamar,
>
> you are right, the Mesos executor and containerizer cannot distinguish
> between "busy" and "stuck" processes. However, since you use your own
> custom executor, you may want to implement a sort of health check. It
> depends on what your task processes are doing.
>
> There are hundreds of reasons why an OS process may "get stuck"; it
> doesn't look like it's Mesos-related in this case.
>
> On Sat, Jan 24, 2015 at 9:17 PM, Itamar Ostricher <[email protected]> wrote:
> > Alex, Sharma, thanks for your input!
> >
> > Trying to recreate the issue with a small cluster for the last few days,
> > I was not able to observe a scenario where I could be sure that my
> > executor sent the TASK_FINISHED update but the scheduler did not receive
> > it.
> > I did observe multiple times a scenario where a task seemed to be "stuck"
> > in TASK_RUNNING state, but when I SSH'ed into the slave that has the
> > task, I always saw that the process related to that task was still
> > running (by grepping `ps aux`). Most of the time, it seemed that the
> > process did the work (by examining the logs produced by the PID), but for
> > some reason it was "stuck" without exiting cleanly. Sometimes it seemed
> > that the process didn't do any work (an empty log file with the PID).
> > Every time, as soon as I killed the PID, a TASK_FAILED update was sent
> > and received successfully.
> >
> > So, it seems that the problem is in processes spawned by my executor, but
> > I don't fully understand why this happens.
> > Any ideas why a process would do some work (either 1% (just creating a
> > log file) or 99% (doing everything but not exiting)) and "get stuck"?
> >
> > On Fri, Jan 23, 2015 at 1:01 PM, Alex Rukletsov <[email protected]> wrote:
> >> Itamar,
> >>
> >> beyond checking master and slave logs, could you please verify your
> >> executor does send the TASK_FINISHED update? You may want to add some
> >> logging and check the executor log. Mesos guarantees the delivery of
> >> status updates, so I suspect the problem is on the executor's side.
> >>
> >> On Wed, Jan 21, 2015 at 6:58 PM, Sharma Podila <[email protected]> wrote:
> >> > Have you checked the mesos-slave and mesos-master logs for that task
> >> > id? There should be logs in there for task state updates, including
> >> > FINISHED. There can be specific cases where sometimes the task status
> >> > is not reliably sent to your scheduler (due to mesos-master restarts,
> >> > leader election changes, etc.). There is task reconciliation support in
> >> > Mesos. A periodic call to reconcile tasks from the scheduler can be
> >> > helpful. There are also newer enhancements coming to task
> >> > reconciliation. In the meantime, there are other strategies such as
> >> > what I use, which is periodic heartbeats from my custom executor to my
> >> > scheduler (out of band). Timeouts for task runtimes are similar to
> >> > heartbeats, except you need a priori knowledge of all tasks' runtimes.
> >> >
> >> > Task runtime limits are not supported inherently, as far as I know.
> >> > Your executor can implement it, and that may be one simple way to do
> >> > it. That could also be a good way to implement the shell's rlimit*, in
> >> > general.
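As a rough illustration of the periodic reconciliation Sharma suggests above (not code from this thread): the scheduler can, on a timer, hand the master the last state it knows for each task it still believes is running, and the master answers through the normal statusUpdate() callback. A minimal sketch, assuming the old Python bindings (mesos.interface / mesos_pb2) and an invented `running_tasks` bookkeeping dict:

```python
# Sketch of periodic explicit reconciliation from a Python scheduler.
# `running_tasks` is a hypothetical mapping of task-id string -> slave-id
# string maintained by the scheduler; real bookkeeping will differ.
import threading

from mesos.interface import mesos_pb2


def reconcile_periodically(driver, running_tasks, interval=300):
    """Every `interval` seconds, ask the master for the current state of every
    task the scheduler still believes is running. The master replies through
    the scheduler's statusUpdate() callback, so a TASK_FINISHED (or TASK_LOST)
    that was missed the first time eventually reaches the scheduler anyway."""
    def tick():
        statuses = []
        for task_id, slave_id in running_tasks.items():
            status = mesos_pb2.TaskStatus()
            status.task_id.value = task_id
            status.slave_id.value = slave_id
            status.state = mesos_pb2.TASK_RUNNING  # the last state we saw
            statuses.append(status)
        if statuses:
            driver.reconcileTasks(statuses)
        timer = threading.Timer(interval, tick)
        timer.daemon = True
        timer.start()

    tick()
```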
> >> > On Wed, Jan 21, 2015 at 1:22 AM, Itamar Ostricher <[email protected]> wrote:
> >> >> I'm using a custom internal framework, loosely based on MesosSubmit.
> >> >> The phenomenon I'm seeing is something like this:
> >> >> 1. Task X is assigned to slave S.
> >> >> 2. I know this task should run for ~10 minutes.
> >> >> 3. On the master dashboard, I see that task X is in the "Running"
> >> >> state for several *hours*.
> >> >> 4. I SSH into slave S, and see that task X is *not* running. According
> >> >> to the local logs on that slave, task X finished a long time ago, and
> >> >> seemed to finish OK.
> >> >> 5. According to the scheduler logs, it never got any update from task
> >> >> X after the Staging->Running update.
> >> >>
> >> >> The phenomenon occurs pretty often, but it's not consistent or
> >> >> deterministic.
> >> >>
> >> >> I'd appreciate your input on how to go about debugging it, and/or
> >> >> implementing a workaround to avoid wasted resources.
> >> >>
> >> >> I'm pretty sure the executor on the slave sends the TASK_FINISHED
> >> >> status update (how can I verify that beyond my own logging?).
> >> >> I'm pretty sure the scheduler never receives that update (again, how
> >> >> can I verify that beyond my own logging?).
> >> >> I have no idea if the master got the update and passed it through
> >> >> (how can I check that?).
> >> >> My scheduler and executor are written in Python.
> >> >>
> >> >> As for a workaround - setting a timeout on a task should do the trick.
> >> >> I did not see any timeout field in the TaskInfo message. Does Mesos
> >> >> support the concept of per-task timeouts? Or should I implement my own
> >> >> task tracking and timeout mechanism in the scheduler?
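On Itamar's last question: TaskInfo indeed carries no timeout field, so a per-task runtime limit has to live in the framework, and the custom executor is a natural place for it, as Sharma notes. A minimal, hypothetical sketch against the old Python bindings; the shell command and the 600-second limit are placeholders, not anything from this thread:

```python
# Sketch of an executor-side per-task runtime limit.
import subprocess
import threading

from mesos.interface import Executor, mesos_pb2


class TimeoutExecutor(Executor):
    def launchTask(self, driver, task):
        def send(state):
            status = mesos_pb2.TaskStatus()
            status.task_id.value = task.task_id.value
            status.state = state
            driver.sendStatusUpdate(status)

        def run():
            send(mesos_pb2.TASK_RUNNING)
            # Placeholder command; a real executor would take it from
            # task.data or task.command.
            proc = subprocess.Popen(["/bin/sh", "-c", "run-the-task"])
            # Watchdog: kill the process if it is still alive after the limit
            # (Python 2.7-era pattern, since Popen.wait() has no timeout there).
            watchdog = threading.Timer(600, proc.kill)
            watchdog.daemon = True
            watchdog.start()
            returncode = proc.wait()
            watchdog.cancel()
            send(mesos_pb2.TASK_FINISHED if returncode == 0
                 else mesos_pb2.TASK_FAILED)

        # launchTask must not block, so do the work on a separate thread.
        worker = threading.Thread(target=run)
        worker.daemon = True
        worker.start()
```

If the watchdog fires, proc.wait() returns a non-zero code and the executor reports TASK_FAILED, so the scheduler stops waiting on a task that would otherwise sit in TASK_RUNNING indefinitely.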

