The command executor issue was probably fixed somewhere between 0.21 and 1.3. The only reason I mentioned 1.3+ is that any releases before that are out of the support period. If you can reproduce the issue with 1.3+ and paste the logs here or in a JIRA, we can help debug it for you.
On Wed, Jan 10, 2018 at 9:47 AM, Ajay V <[email protected]> wrote:

> Thanks for getting back, Vinod. So, does that mean that even for v1.2,
> these race conditions (where the command executor doesn't stay up long
> enough) existed and that the 1.3 versions fix them? The reason for asking
> is that I did try an upgrade to v1.2 and still found very similar issues.
>
> Regards,
> Ajay
>
> On Tue, Jan 9, 2018 at 6:48 PM, Vinod Kone <[email protected]> wrote:
>
>> 0.21 is really old and not supported. I highly recommend you upgrade to
>> 1.3+.
>>
>> Regarding what you are seeing, we definitely had issues in the past where
>> the command executor didn't stay up long enough to guarantee that
>> TASK_FINISHED was delivered to the agent; so races like above were possible.
>>
>> On Tue, Jan 9, 2018 at 5:33 PM, Ajay V <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I'm trying to debug a TASK_LOST that's generated on the agent that I see
>>> on rare occasions.
>>>
>>> Following is a log that I'm trying to understand. This is happening
>>> after driver.sendStatusUpdate() has been called with a task state of
>>> TASK_FINISHED from a Java executor. It looks to me like the container has
>>> already exited before the TASK_FINISHED is processed. Is there a timing
>>> issue in this version of Mesos that is causing this? The effect of
>>> this problem is that, even though the work of the executor is complete and
>>> the executor calls sendStatusUpdate with a TASK_FINISHED, the task is
>>> marked as LOST and the actual update of TASK_FINISHED is ignored.
>>>
>>> I0108 10:16:51.388300 37272 containerizer.cpp:1117] Executor for container 'bb0e5f2d-4bdb-479c-b829-4741993c4109' has exited
>>>
>>> I0108 10:16:51.388741 37272 containerizer.cpp:946] Destroying container 'bb0e5f2d-4bdb-479c-b829-4741993c4109'
>>>
>>> W0108 10:16:52.159241 37260 posix.hpp:192] No resource usage for unknown container 'bb0e5f2d-4bdb-479c-b829-4741993c4109'
>>>
>>> W0108 10:16:52.803463 37255 containerizer.cpp:888] Skipping resource statistic for container bb0e5f2d-4bdb-479c-b829-4741993c4109 because: Failed to get usage: No process found at 28952
>>>
>>> I0108 10:16:52.899657 37278 slave.cpp:2898] Executor 'ff631ad1-cfab-493e-be18-961581abcf3d' of framework 20171208-050805-140555025-5050-3470-0000 exited with status 0
>>>
>>> I0108 10:16:52.901736 37278 slave.cpp:2215] Handling status update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-0000 from @0.0.0.0:0
>>>
>>> I0108 10:16:52.901978 37278 slave.cpp:4305] Terminating task ff631ad1-cfab-493e-be18-961581abcf3d
>>>
>>> W0108 10:16:52.902793 37274 containerizer.cpp:852] Ignoring update for unknown container: bb0e5f2d-4bdb-479c-b829-4741993c4109
>>>
>>> I0108 10:16:52.903230 37274 status_update_manager.cpp:317] Received status update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-0000
>>>
>>> I0108 10:16:52.904119 37274 status_update_manager.cpp:371] Forwarding update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-0000 to the slave
>>>
>>> I0108 10:16:52.905725 37282 slave.cpp:2458] Forwarding the update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-0000 to [email protected]:5050
>>>
>>> I0108 10:16:52.906025 37282 slave.cpp:2385] Status update manager successfully handled status update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-0000
>>>
>>> I0108 10:16:52.956588 37280 status_update_manager.cpp:389] Received status update acknowledgement (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-0000
>>>
>>> I0108 10:16:52.956841 37280 status_update_manager.cpp:525] Cleaning up status update stream for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-0000
>>>
>>> I0108 10:16:52.957608 37268 slave.cpp:1800] Status update manager successfully handled status update acknowledgement (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-0000
>>>
>>> I0108 10:16:52.958693 37268 slave.cpp:4344] Completing task ff631ad1-cfab-493e-be18-961581abcf3d
>>>
>>> I0108 10:16:52.960364 37268 slave.cpp:3007] Cleaning up executor 'ff631ad1-cfab-493e-be18-961581abcf3d' of framework 20171208-050805-140555025-5050-3470-0000
>>>
>>> Regards,
>>> Ajay
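
For anyone collecting logs to reproduce this, here is a rough sketch of a filter that flags the pattern in the quoted log: the agent handles a TASK_LOST for a task whose executor has already exited with status 0. It is only an illustration and rests on several assumptions: the regexes are modelled on the wording of the agent log lines in this thread (other Mesos versions may word them differently), each log entry sits on a single line, and the executor ID equals the task ID, as it does in the log above. The function name `find_suspect_lost_tasks` is made up for this example and is not part of Mesos.

```python
import re

# Assumed wording, copied from the agent log in this thread:
#   Executor '<id>' of framework <fw> exited with status <n>
EXITED = re.compile(
    r"Executor '([^']+)' of framework (\S+) exited with status (\d+)"
)
# Assumed wording, copied from the agent log in this thread:
#   Handling status update TASK_X (UUID: ...) for task <id> of framework <fw>
UPDATE = re.compile(
    r"Handling status update (TASK_\w+) \(UUID: [^)]+\) "
    r"for task (\S+) of framework (\S+)"
)


def find_suspect_lost_tasks(lines):
    """Return (task_id, framework_id) pairs where a TASK_LOST update
    is handled after the matching executor exited with status 0.

    Assumes executor ID == task ID (hypothetical simplification).
    """
    clean_exits = set()
    suspects = []
    for line in lines:
        exited = EXITED.search(line)
        if exited:
            if exited.group(3) == "0":
                clean_exits.add((exited.group(1), exited.group(2)))
            continue
        update = UPDATE.search(line)
        if update and update.group(1) == "TASK_LOST":
            key = (update.group(2), update.group(3))
            if key in clean_exits:
                suspects.append(key)
    return suspects
```

Running it over the agent log (for example `find_suspect_lost_tasks(open(path))`, with `path` pointing at your own log file) yields the task and framework IDs worth including when you paste logs here or in a JIRA.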

