Re: Mesos rare TASK_LOST scenario v 0.21.0
The command executor was probably fixed somewhere between 0.21 and 1.3. The only reason I mentioned 1.3+ is that any release before that is past its support period. If you can repro the issue with 1.3+ and paste the logs here or in a JIRA, we can help debug it for you.

On Wed, Jan 10, 2018 at 9:47 AM, Ajay V wrote:

> Thanks for getting back, Vinod. So does that mean that even for v1.2 these
> race conditions (where the command executor doesn't stay up long enough)
> existed, and that 1.3 fixes them? I'm asking because I did try an upgrade
> to v1.2 and still saw very similar issues.
>
> Regards,
> Ajay
>
> On Tue, Jan 9, 2018 at 6:48 PM, Vinod Kone wrote:
>
>> 0.21 is really old and not supported. I highly recommend you upgrade to
>> 1.3+.
>>
>> Regarding what you are seeing, we definitely had issues in the past where
>> the command executor didn't stay up long enough to guarantee that
>> TASK_FINISHED was delivered to the agent, so races like the above were
>> possible.
>>
>> On Tue, Jan 9, 2018 at 5:33 PM, Ajay V wrote:
>>
>>> Hello,
>>>
>>> I'm trying to debug a TASK_LOST that is generated on the agent on rare
>>> occasions.
>>>
>>> Following is a log that I'm trying to understand. This happens after
>>> driver.sendStatusUpdate() has been called with a task state of
>>> TASK_FINISHED from a Java executor. It looks to me like the container
>>> has already exited before the TASK_FINISHED is processed. Is there a
>>> timing issue in this version of Mesos that is causing this? The effect
>>> is that, even though the executor's work is complete and the executor
>>> calls sendStatusUpdate() with TASK_FINISHED, the task is marked LOST
>>> and the actual TASK_FINISHED update is ignored.
>>>
>>> [agent log snipped; see the original message below]
Re: Mesos rare TASK_LOST scenario v 0.21.0
Thanks for getting back, Vinod. So does that mean that even for v1.2 these race conditions (where the command executor doesn't stay up long enough) existed, and that 1.3 fixes them? I'm asking because I did try an upgrade to v1.2 and still saw very similar issues.

Regards,
Ajay

On Tue, Jan 9, 2018 at 6:48 PM, Vinod Kone wrote:

> 0.21 is really old and not supported. I highly recommend you upgrade to
> 1.3+.
>
> Regarding what you are seeing, we definitely had issues in the past where
> the command executor didn't stay up long enough to guarantee that
> TASK_FINISHED was delivered to the agent, so races like the above were
> possible.
>
> On Tue, Jan 9, 2018 at 5:33 PM, Ajay V wrote:
>
>> Hello,
>>
>> I'm trying to debug a TASK_LOST that is generated on the agent on rare
>> occasions.
>>
>> Following is a log that I'm trying to understand. This happens after
>> driver.sendStatusUpdate() has been called with a task state of
>> TASK_FINISHED from a Java executor. It looks to me like the container
>> has already exited before the TASK_FINISHED is processed. Is there a
>> timing issue in this version of Mesos that is causing this? The effect
>> is that, even though the executor's work is complete and the executor
>> calls sendStatusUpdate() with TASK_FINISHED, the task is marked LOST
>> and the actual TASK_FINISHED update is ignored.
>>
>> [agent log snipped; see the original message below]
>>
>> Regards,
>> Ajay
Re: Mesos rare TASK_LOST scenario v 0.21.0
0.21 is really old and not supported. I highly recommend you upgrade to 1.3+.

Regarding what you are seeing, we definitely had issues in the past where the command executor didn't stay up long enough to guarantee that TASK_FINISHED was delivered to the agent, so races like the above were possible.

On Tue, Jan 9, 2018 at 5:33 PM, Ajay V wrote:

> Hello,
>
> I'm trying to debug a TASK_LOST that is generated on the agent on rare
> occasions.
>
> Following is a log that I'm trying to understand. This happens after
> driver.sendStatusUpdate() has been called with a task state of
> TASK_FINISHED from a Java executor. It looks to me like the container
> has already exited before the TASK_FINISHED is processed. Is there a
> timing issue in this version of Mesos that is causing this? The effect
> is that, even though the executor's work is complete and the executor
> calls sendStatusUpdate() with TASK_FINISHED, the task is marked LOST
> and the actual TASK_FINISHED update is ignored.
>
> [agent log snipped; see the original message below]
>
> Regards,
> Ajay
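The race Vinod describes can also bite a custom executor: if the executor process exits as soon as driver.sendStatusUpdate() returns, the agent may only observe the executor's termination and generate TASK_LOST itself, which matches the log in the original message below. The following is a minimal, hypothetical Java sketch against the org.apache.mesos executor API (the class name DelayedExitExecutor and the 5-second delay are illustrative assumptions, not anything from this thread) of keeping the executor alive briefly after sending the terminal update. On the affected versions this only narrows the window; it does not guarantee delivery.

import org.apache.mesos.Executor;
import org.apache.mesos.ExecutorDriver;
import org.apache.mesos.MesosExecutorDriver;
import org.apache.mesos.Protos;

// Hypothetical sketch: send TASK_FINISHED, then keep the executor process
// alive briefly so the agent has a chance to process the update before the
// container is torn down.
public class DelayedExitExecutor implements Executor {
  @Override
  public void launchTask(final ExecutorDriver driver, final Protos.TaskInfo task) {
    new Thread(() -> {
      // ... do the actual work of the task here ...

      Protos.TaskStatus finished = Protos.TaskStatus.newBuilder()
          .setTaskId(task.getTaskId())
          .setState(Protos.TaskState.TASK_FINISHED)
          .build();
      // Asynchronous: returns before the update has reached the agent.
      driver.sendStatusUpdate(finished);

      try {
        // Crude mitigation for the race discussed above: give the agent a
        // few seconds before this process exits. Narrows the window only.
        Thread.sleep(5000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      driver.stop();
    }).start();
  }

  @Override public void registered(ExecutorDriver driver, Protos.ExecutorInfo executorInfo,
      Protos.FrameworkInfo frameworkInfo, Protos.SlaveInfo slaveInfo) {}
  @Override public void reregistered(ExecutorDriver driver, Protos.SlaveInfo slaveInfo) {}
  @Override public void disconnected(ExecutorDriver driver) {}
  @Override public void killTask(ExecutorDriver driver, Protos.TaskID taskId) {}
  @Override public void frameworkMessage(ExecutorDriver driver, byte[] data) {}
  @Override public void shutdown(ExecutorDriver driver) {}
  @Override public void error(ExecutorDriver driver, String message) {}

  public static void main(String[] args) {
    MesosExecutorDriver driver = new MesosExecutorDriver(new DelayedExitExecutor());
    System.exit(driver.run() == Protos.Status.DRIVER_STOPPED ? 0 : 1);
  }
}

The real fix, as Vinod notes, went into the executors shipped with newer Mesos releases, so upgrading remains the recommended path.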
Mesos rare TASK_LOST scenario v 0.21.0
Hello,

I'm trying to debug a TASK_LOST that is generated on the agent on rare occasions.

Following is a log that I'm trying to understand. This happens after driver.sendStatusUpdate() has been called with a task state of TASK_FINISHED from a Java executor. It looks to me like the container has already exited before the TASK_FINISHED is processed. Is there a timing issue in this version of Mesos that is causing this? The effect is that, even though the executor's work is complete and the executor calls sendStatusUpdate() with TASK_FINISHED, the task is marked LOST and the actual TASK_FINISHED update is ignored.

I0108 10:16:51.388300 37272 containerizer.cpp:1117] Executor for container 'bb0e5f2d-4bdb-479c-b829-4741993c4109' has exited

I0108 10:16:51.388741 37272 containerizer.cpp:946] Destroying container 'bb0e5f2d-4bdb-479c-b829-4741993c4109'

W0108 10:16:52.159241 37260 posix.hpp:192] No resource usage for unknown container 'bb0e5f2d-4bdb-479c-b829-4741993c4109'

W0108 10:16:52.803463 37255 containerizer.cpp:888] Skipping resource statistic for container bb0e5f2d-4bdb-479c-b829-4741993c4109 because: Failed to get usage: No process found at 28952

I0108 10:16:52.899657 37278 slave.cpp:2898] Executor 'ff631ad1-cfab-493e-be18-961581abcf3d' of framework 20171208-050805-140555025-5050-3470- exited with status 0

I0108 10:16:52.901736 37278 slave.cpp:2215] Handling status update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470- from @0.0.0.0:0

I0108 10:16:52.901978 37278 slave.cpp:4305] Terminating task ff631ad1-cfab-493e-be18-961581abcf3d

W0108 10:16:52.902793 37274 containerizer.cpp:852] Ignoring update for unknown container: bb0e5f2d-4bdb-479c-b829-4741993c4109

I0108 10:16:52.903230 37274 status_update_manager.cpp:317] Received status update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-

I0108 10:16:52.904119 37274 status_update_manager.cpp:371] Forwarding update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470- to the slave

I0108 10:16:52.905725 37282 slave.cpp:2458] Forwarding the update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470- to master@17.179.96.8:5050

I0108 10:16:52.906025 37282 slave.cpp:2385] Status update manager successfully handled status update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-

I0108 10:16:52.956588 37280 status_update_manager.cpp:389] Received status update acknowledgement (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-

I0108 10:16:52.956841 37280 status_update_manager.cpp:525] Cleaning up status update stream for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-

I0108 10:16:52.957608 37268 slave.cpp:1800] Status update manager successfully handled status update acknowledgement (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-

I0108 10:16:52.958693 37268 slave.cpp:4344] Completing task ff631ad1-cfab-493e-be18-961581abcf3d
I0108 10:16:52.960364 37268 slave.cpp:3007] Cleaning up executor 'ff631ad1-cfab-493e-be18-961581abcf3d' of framework 20171208-050805-140555025-5050-3470-

Regards,
Ajay
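For debugging on the framework side, it can help to look at more than the task state when an update like the TASK_LOST above arrives. Below is a hedged, illustrative helper (StatusUpdateClassifier is a made-up name, not part of Mesos or this thread) built only on the standard Protos.TaskStatus accessors from the Java bindings: the source and, in newer releases, the reason fields distinguish an agent-generated TASK_LOST from an update the executor actually sent, which is exactly the ambiguity being discussed here.

import org.apache.mesos.Protos.TaskStatus;

// Hypothetical helper for a framework's statusUpdate() callback: summarise
// a status update so an agent-generated TASK_LOST can be told apart from a
// TASK_FINISHED that the executor itself sent.
public final class StatusUpdateClassifier {
  public static String describe(TaskStatus status) {
    StringBuilder sb = new StringBuilder();
    sb.append("task ").append(status.getTaskId().getValue())
      .append(" -> ").append(status.getState());

    // 'source' says who generated the update (executor, agent/slave, or
    // master); for the race in this thread it is the agent, not the executor.
    if (status.hasSource()) {
      sb.append(", source=").append(status.getSource());
    }
    // 'reason' (populated in newer releases) narrows it further, e.g.
    // REASON_EXECUTOR_TERMINATED when the executor exited before the
    // terminal update was processed.
    if (status.hasReason()) {
      sb.append(", reason=").append(status.getReason());
    }
    if (status.hasMessage()) {
      sb.append(", message=\"").append(status.getMessage()).append("\"");
    }
    return sb.toString();
  }

  private StatusUpdateClassifier() {}
}

Logging this for every terminal update makes it easier to correlate framework-side observations with agent log lines like the ones above when filing a JIRA.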