Itamar,

beyond checking master and slave logs, could you please verify that your
executor does send the TASK_FINISHED update? You may want to add some
logging and then check the executor log. Mesos guarantees the delivery of
status updates, so I suspect the problem is on the executor's side.
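
Something along these lines would do for a quick check (a rough, untested
sketch using the stock Python bindings; the class name and the actual
task-running details are placeholders to adapt to your executor):

import logging

import mesos.interface
import mesos.native
from mesos.interface import mesos_pb2

logging.basicConfig(level=logging.INFO)

class MyExecutor(mesos.interface.Executor):
    def launchTask(self, driver, task):
        # In a real executor the work should run in a separate thread so
        # launchTask returns quickly; kept inline here for brevity.
        logging.info("task %s starting", task.task_id.value)
        # ... run the task ...
        update = mesos_pb2.TaskStatus()
        update.task_id.value = task.task_id.value
        update.state = mesos_pb2.TASK_FINISHED
        logging.info("sending TASK_FINISHED for %s", task.task_id.value)
        driver.sendStatusUpdate(update)

if __name__ == "__main__":
    mesos.native.MesosExecutorDriver(MyExecutor()).run()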

On Wed, Jan 21, 2015 at 6:58 PM, Sharma Podila <spod...@netflix.com> wrote:
> Have you checked the mesos-slave and mesos-master logs for that task id?
> There should be logs in there for task state updates, including FINISHED.
> There are specific cases where the task status is not reliably delivered
> to your scheduler (due to mesos-master restarts, leader election changes,
> etc.). There is task reconciliation support in Mesos: a periodic call to
> reconcile tasks from the scheduler can be helpful, and newer enhancements
> to task reconciliation are on the way. In the meantime, there are other
> strategies such as what I use, which is periodic heartbeats from my custom
> executor to my scheduler (out of band). Timeouts on task runtimes are
> similar to heartbeats, except that you need a priori knowledge of all
> tasks' runtimes.
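>
> To give a rough idea (a sketch only, not my actual code; running_task_ids
> and the interval are whatever bookkeeping your scheduler already keeps):
>
> import threading
> from mesos.interface import mesos_pb2
>
> RECONCILE_INTERVAL_SECS = 600  # arbitrary; tune for your framework
>
> def reconcile_periodically(driver, running_task_ids):
>     # Ask the master for the latest state of every task we still believe
>     # is running; the answers arrive via the scheduler's statusUpdate().
>     statuses = []
>     for task_id in running_task_ids:
>         status = mesos_pb2.TaskStatus()
>         status.task_id.value = task_id
>         status.state = mesos_pb2.TASK_RUNNING
>         statuses.append(status)
>     driver.reconcileTasks(statuses)
>     threading.Timer(RECONCILE_INTERVAL_SECS, reconcile_periodically,
>                     args=(driver, running_task_ids)).start()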
>
> Task runtime limits are not supported inherently, as far as I know. Your
> executor can implement them, and that may be one simple way to do it. That
> could also be a good way to implement the shell's rlimit*, in general.
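>
> For example, an executor could enforce a limit with something like this
> (again just a sketch; the limit value, the 'process' handle, and the kill
> logic are placeholders to adapt):
>
> import threading
> from mesos.interface import mesos_pb2
>
> TASK_RUNTIME_LIMIT_SECS = 15 * 60  # placeholder limit
>
> def enforce_runtime_limit(driver, task, process):
>     # 'process' is assumed to be the subprocess.Popen the executor started.
>     def kill_if_still_running():
>         if process.poll() is None:  # still alive past the limit
>             process.kill()
>             update = mesos_pb2.TaskStatus()
>             update.task_id.value = task.task_id.value
>             update.state = mesos_pb2.TASK_FAILED
>             update.message = "task exceeded runtime limit"
>             driver.sendStatusUpdate(update)
>     threading.Timer(TASK_RUNTIME_LIMIT_SECS, kill_if_still_running).start()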
>
>
>
> On Wed, Jan 21, 2015 at 1:22 AM, Itamar Ostricher <ita...@yowza3d.com>
> wrote:
>>
>> I'm using a custom internal framework, loosely based on MesosSubmit.
>> The phenomenon I'm seeing is something like this:
>> 1. Task X is assigned to slave S.
>> 2. I know this task should run for ~10 minutes.
>> 3. On the master dashboard, I see that task X is in the "Running" state
>> for several *hours*.
>> 4. I SSH into slave S, and see that task X is *not* running. According to
>> the local logs on that slave, task X finished a long time ago, and seemed to
>> finish OK.
>> 5. According to the scheduler logs, it never got any update from task X
>> after the Staging->Running update.
>>
>> The phenomenon occurs pretty often, but it's not consistent or
>> deterministic.
>>
>> I'd appreciate your input on how to go about debugging it, and/or
>> implementing a workaround to avoid wasted resources.
>>
>> I'm pretty sure the executor on the slave sends the TASK_FINISHED status
>> update (how can I verify that beyond my own logging?).
>> I'm pretty sure the scheduler never receives that update (again, how can I
>> verify that beyond my own logging?).
>> I have no idea if the master got the update and passed it through (how can
>> I check that?).
>> My scheduler and executor are written in Python.
>>
>> As for a workaround - setting a timeout on a task should do the trick. I
>> did not see any timeout field in the TaskInfo message. Does Mesos support
>> the concept of per-task timeouts? Or should I implement my own task tracking
>> and timeout mechanism in the scheduler?
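>>
>> (Roughly what I have in mind for scheduler-side tracking; the names and
>> the deadline bookkeeping here are hypothetical:)
>>
>> import time
>> from mesos.interface import mesos_pb2
>>
>> task_deadlines = {}  # task_id -> absolute deadline, recorded at launch
>>
>> def check_task_timeouts(driver):
>>     # Called periodically; kills any task that outlived its deadline.
>>     now = time.time()
>>     for task_id, deadline in list(task_deadlines.items()):
>>         if now > deadline:
>>             tid = mesos_pb2.TaskID()
>>             tid.value = task_id
>>             driver.killTask(tid)
>>             del task_deadlines[task_id]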
>
>
