The command executor was probably fixed somewhere between 0.21 and 1.3. The
only reason I mentioned 1.3+ is that any release before that is out of its
support period. If you can repro the issue with 1.3+ and paste the logs
here or in a JIRA, we can help you debug it.
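To make the race concrete, here is a toy model in plain Java threads (this is not the Mesos API; all the names are made up for illustration). An "agent" thread processes a terminal update after a small delay; if the executor's container has already exited by then, the agent only sees a dead executor and synthesizes TASK_LOST. The fix is for the executor to wait for the acknowledgement before exiting:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Toy model of the race: an "executor" sends a terminal TASK_FINISHED
// update, and an "agent" thread processes it a moment later. If the
// executor exits before the agent handles the update, the agent
// reports TASK_LOST instead.
public class StatusUpdateRace {
    static volatile String delivered = null;      // update the agent saw
    static volatile boolean executorAlive = true; // container still up?

    public static void main(String[] args) throws Exception {
        CountDownLatch acked = new CountDownLatch(1);

        Thread agent = new Thread(() -> {
            try {
                Thread.sleep(100); // update delivery/processing delay
            } catch (InterruptedException ignored) {}
            // The agent only accepts the update if the executor is still
            // up; otherwise it reports TASK_LOST, as in the log below.
            delivered = executorAlive ? "TASK_FINISHED" : "TASK_LOST";
            acked.countDown(); // acknowledge back to the executor
        });
        agent.start();

        // Executor side: after sending TASK_FINISHED, WAIT for the ack
        // instead of exiting immediately. Commenting out the await() and
        // exiting right away reproduces the TASK_LOST outcome.
        boolean gotAck = acked.await(5, TimeUnit.SECONDS);
        executorAlive = false; // only exit after the ack (or timeout)

        agent.join();
        System.out.println(delivered + " ack=" + gotAck);
    }
}
```

With the wait in place the agent sees TASK_FINISHED; without it, the executor flag flips first and the agent reports TASK_LOST.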

On Wed, Jan 10, 2018 at 9:47 AM, Ajay V <[email protected]> wrote:

> Thanks for getting back, Vinod. So does that mean these race conditions
> (where the command executor doesn't stay up long enough) existed even in
> v1.2, and that 1.3 fixes them? I ask because I did try upgrading to v1.2
> and still saw very similar issues.
>
> Regards,
> Ajay
>
> On Tue, Jan 9, 2018 at 6:48 PM, Vinod Kone <[email protected]> wrote:
>
>> 0.21 is really old and not supported. I highly recommend you upgrade to
>> 1.3+.
>>
>> Regarding what you are seeing, we definitely had issues in the past where
>> the command executor didn't stay up long enough to guarantee that
>> TASK_FINISHED was delivered to the agent, so races like the one above were possible.
>>
>> On Tue, Jan 9, 2018 at 5:33 PM, Ajay V <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I'm trying to debug a TASK_LOST that's generated on the agent, which I
>>> see on rare occasions.
>>>
>>> Following is a log that I'm trying to understand. This happens after
>>> driver.sendStatusUpdate() has been called with a task state of
>>> TASK_FINISHED from a Java executor. It looks to me like the container
>>> has already exited before the TASK_FINISHED is processed. Is there a
>>> timing issue in this version of Mesos that is causing this? The effect
>>> of this problem is that, even though the executor's work is complete and
>>> the executor calls sendStatusUpdate with TASK_FINISHED, the task is
>>> marked as LOST and the actual TASK_FINISHED update is ignored.
>>>
>>> I0108 10:16:51.388300 37272 containerizer.cpp:1117] Executor for
>>> container 'bb0e5f2d-4bdb-479c-b829-4741993c4109' has exited
>>>
>>> I0108 10:16:51.388741 37272 containerizer.cpp:946] Destroying container
>>> 'bb0e5f2d-4bdb-479c-b829-4741993c4109'
>>>
>>> W0108 10:16:52.159241 37260 posix.hpp:192] No resource usage for unknown
>>> container 'bb0e5f2d-4bdb-479c-b829-4741993c4109'
>>>
>>> W0108 10:16:52.803463 37255 containerizer.cpp:888] Skipping resource
>>> statistic for container bb0e5f2d-4bdb-479c-b829-4741993c4109 because:
>>> Failed to get usage: No process found at 28952
>>>
>>> I0108 10:16:52.899657 37278 slave.cpp:2898] Executor
>>> 'ff631ad1-cfab-493e-be18-961581abcf3d' of framework
>>> 20171208-050805-140555025-5050-3470-0000 exited with status 0
>>>
>>> I0108 10:16:52.901736 37278 slave.cpp:2215] Handling status update
>>> TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470-0000 from @0.0.0.0:0
>>>
>>> I0108 10:16:52.901978 37278 slave.cpp:4305] Terminating task
>>> ff631ad1-cfab-493e-be18-961581abcf3d
>>>
>>> W0108 10:16:52.902793 37274 containerizer.cpp:852] Ignoring update for
>>> unknown container: bb0e5f2d-4bdb-479c-b829-4741993c4109
>>>
>>> I0108 10:16:52.903230 37274 status_update_manager.cpp:317] Received
>>> status update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5)
>>> for task ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470-0000
>>>
>>> I0108 10:16:52.904119 37274 status_update_manager.cpp:371] Forwarding
>>> update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470-0000 to the slave
>>>
>>> I0108 10:16:52.905725 37282 slave.cpp:2458] Forwarding the update
>>> TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470-0000 to [email protected]:5050
>>>
>>> I0108 10:16:52.906025 37282 slave.cpp:2385] Status update manager
>>> successfully handled status update TASK_LOST (UUID:
>>> f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470-0000
>>>
>>> I0108 10:16:52.956588 37280 status_update_manager.cpp:389] Received
>>> status update acknowledgement (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5)
>>> for task ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470-0000
>>>
>>> I0108 10:16:52.956841 37280 status_update_manager.cpp:525] Cleaning up
>>> status update stream for task ff631ad1-cfab-493e-be18-961581abcf3d of
>>> framework 20171208-050805-140555025-5050-3470-0000
>>>
>>> I0108 10:16:52.957608 37268 slave.cpp:1800] Status update manager
>>> successfully handled status update acknowledgement (UUID:
>>> f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470-0000
>>>
>>> I0108 10:16:52.958693 37268 slave.cpp:4344] Completing task
>>> ff631ad1-cfab-493e-be18-961581abcf3d
>>>
>>> I0108 10:16:52.960364 37268 slave.cpp:3007] Cleaning up executor
>>> 'ff631ad1-cfab-493e-be18-961581abcf3d' of framework
>>> 20171208-050805-140555025-5050-3470-0000
>>>
>>> Regards,
>>> Ajay
>>>
>>
>>
>
