[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827045#comment-16827045 ]

Eric Yang commented on YARN-9486:
---------------------------------

Thank you [~Jim_Brennan] for the review, and [~ebadger] for the commit.

> Docker container exited with failure does not get clean up correctly
>
> Key: YARN-9486
> URL: https://issues.apache.org/jira/browse/YARN-9486
> Project: Hadoop YARN
> Issue Type: Sub-task
> Affects Versions: 3.2.0
> Reporter: Eric Yang
> Assignee: Eric Yang
> Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9486.001.patch, YARN-9486.002.patch,
> YARN-9486.003.patch, YARN-9486.004.patch, YARN-9486.005.patch
>
> When a Docker container encounters an error and exits prematurely
> (EXITED_WITH_FAILURE), ContainerCleanup does not remove the container; instead we
> get messages that look like this:
> {code}
> java.io.IOException: Could not find nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid in any of the directories
> 2019-04-15 20:42:16,454 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1555111445937_0008_01_07 transitioned from RELAUNCHING to EXITED_WITH_FAILURE
> 2019-04-15 20:42:16,455 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: Cleaning up container container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,455 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: Container container_1555111445937_0008_01_07 not launched. No cleanup needed to be done
> 2019-04-15 20:42:16,455 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1555111445937_0008 CONTAINERID=container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1555111445937_0008_01_07 transitioned from EXITED_WITH_FAILURE to DONE
> 2019-04-15 20:42:16,458 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Removing container_1555111445937_0008_01_07 from application application_1555111445937_0008
> 2019-04-15 20:42:16,458 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1555111445937_0008_01_07 for log-aggregation
> 2019-04-15 20:42:16,804 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Getting container-status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Getting localization status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Returning ContainerStatus: [ContainerId: container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: COMPLETE, Capability: , Diagnostics: ..., ExitStatus: -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE]
> 2019-04-15 20:42:18,464 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1555111445937_0008_01_07]
> 2019-04-15 20:43:50,476 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1555111445937_0008_01_07
> {code}
> There is no docker rm command performed.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826580#comment-16826580 ]

Hudson commented on YARN-9486:
------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #16466 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/16466/])
YARN-9486. Docker container exited with failure does not get clean up (ebadger: rev 79d3d35398cb5348cfd62e41e3318ec7a337421a)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerCleanup.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerRelaunch.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerCleanup.java
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826234#comment-16826234 ]

Hadoop QA commented on YARN-9486:
---------------------------------

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 14s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 16m 42s | trunk passed |
| +1 | compile | 1m 0s | trunk passed |
| +1 | checkstyle | 0m 21s | trunk passed |
| +1 | mvnsite | 0m 37s | trunk passed |
| +1 | shadedclient | 11m 22s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 59s | trunk passed |
| +1 | javadoc | 0m 29s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 36s | the patch passed |
| +1 | compile | 0m 59s | the patch passed |
| +1 | javac | 0m 59s | the patch passed |
| +1 | checkstyle | 0m 16s | the patch passed |
| +1 | mvnsite | 0m 34s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 12m 13s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 4s | the patch passed |
| +1 | javadoc | 0m 19s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 20m 48s | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 | asflicense | 0m 25s | The patch does not generate ASF License warnings. |
| | | 69m 15s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9486 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12967031/YARN-9486.005.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 1fc77044de83 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / b5dcf64 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/24022/testReport/ |
| Max. process+thread count | 446 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/24022/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826198#comment-16826198 ]

Jim Brennan commented on YARN-9486:
-----------------------------------

[~eyang] thanks for updating the comment. +1 (non-binding) on patch 005.
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826192#comment-16826192 ]

Eric Yang commented on YARN-9486:
---------------------------------

[~Jim_Brennan] Thank you for the review. Patch 005 is the same as patch 004, with a comment added to explain the corner cases.
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826142#comment-16826142 ]

Jim Brennan commented on YARN-9486:
-----------------------------------

[~eyang], I am +1 (non-binding) on patch 004.
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826101#comment-16826101 ]

Jim Brennan commented on YARN-9486:
-----------------------------------

{quote}
As a result, we need to check both markedLaunched and isLaunchCompleted to get a better picture of whether the container failed to launch, is still running, or has not started at all.
{quote}
[~eyang] Thanks again for the follow-up. I agree that adding the isLaunchCompleted check is warranted to cover all cases. It might be helpful to add a comment about the relaunch case where containerAlreadyLaunched is false but isCompleted is true, which seems counter-intuitive.
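The two-flag check discussed in the comment above can be sketched roughly as follows. This is only an illustration, not the code from the YARN-9486 patch: the class `CleanupDecision`, the method `needsCleanup`, and its boolean parameters are hypothetical stand-ins for the markedLaunched/containerAlreadyLaunched and isLaunchCompleted flags named in the thread. It assumes cleanup may be skipped only when the container was neither marked as launched nor finished a launch attempt.

```java
/**
 * Hypothetical sketch of the cleanup decision discussed in this thread.
 * None of these names come from the Hadoop code base.
 */
public class CleanupDecision {

    /**
     * Returns true when cleanup (for example, removing the Docker container)
     * should proceed.
     *
     * Assumption: a relaunch can leave the "marked launched" flag false while
     * the launch attempt has already completed (and failed), so either flag
     * on its own is enough to require cleanup.
     */
    public static boolean needsCleanup(boolean markedLaunched,
                                       boolean launchCompleted) {
        return markedLaunched || launchCompleted;
    }

    public static void main(String[] args) {
        // Container never started: nothing to clean up.
        System.out.println("not started:      " + needsCleanup(false, false));
        // Launch in progress or container running.
        System.out.println("launched:         " + needsCleanup(true, false));
        // Relaunch corner case: not marked launched, but the attempt completed.
        System.out.println("relaunch failed:  " + needsCleanup(false, true));
    }
}
```

Under this assumption, checking only the "marked launched" flag would classify the relaunch corner case as "not launched" and skip cleanup, which matches the "No cleanup needed to be done" message in the reported log.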
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825581#comment-16825581 ]

Hadoop QA commented on YARN-9486:

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 14s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 17m 8s | trunk passed |
| +1 | compile | 1m 4s | trunk passed |
| +1 | checkstyle | 0m 29s | trunk passed |
| +1 | mvnsite | 0m 44s | trunk passed |
| +1 | shadedclient | 12m 7s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 57s | trunk passed |
| +1 | javadoc | 0m 22s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 34s | the patch passed |
| +1 | compile | 0m 57s | the patch passed |
| +1 | javac | 0m 57s | the patch passed |
| +1 | checkstyle | 0m 19s | the patch passed |
| +1 | mvnsite | 0m 36s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 11m 25s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 1s | the patch passed |
| +1 | javadoc | 0m 25s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 21m 24s | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 | asflicense | 0m 29s | The patch does not generate ASF License warnings. |
| | | 70m 16s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9486 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12966949/YARN-9486.004.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 9d1db0582e98 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / a703dae |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/24020/testReport/ |
| Max. process+thread count | 446 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/24020/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825534#comment-16825534 ]

Eric Yang commented on YARN-9486:

I also tried:
{code}
boolean alreadyLaunched = launch.isLaunchCompleted();
{code}
This prevents the container relaunch from happening. The completed flag is not set when relaunching a container that is still running. As a result, we need to check both markedLaunched and isLaunchCompleted to get a better picture of whether the container failed to launch, is still running, or has not started at all.
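As an illustration of the two-flag check described in this comment, here is a minimal sketch. It is not the code from any YARN-9486 patch: the class `LaunchFlags`, its `State` enum, and the mapping from flag combinations to states are assumptions made for this example, based only on the three cases the comment lists (failed to launch, still running, not started).

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy stand-in for a launch task; hypothetical, not the real ContainerLaunch class. */
final class LaunchFlags {
  /** Assumed interpretation of the two flags; the names are illustrative only. */
  enum State { NOT_STARTED, FAILED_TO_LAUNCH, RUNNING, FINISHED }

  private final AtomicBoolean alreadyLaunched = new AtomicBoolean(false);
  private final AtomicBoolean completed = new AtomicBoolean(false);

  LaunchFlags(boolean launched, boolean done) {
    alreadyLaunched.set(launched);
    completed.set(done);
  }

  /** Combines both flags, since neither flag alone separates the cases. */
  State classify() {
    boolean launched = alreadyLaunched.get();
    boolean done = completed.get();
    if (!launched && !done) {
      return State.NOT_STARTED;
    }
    if (!launched) {
      return State.FAILED_TO_LAUNCH; // completed without ever being marked launched
    }
    if (!done) {
      return State.RUNNING; // launched, completed flag not yet set
    }
    return State.FINISHED;
  }

  /** In this sketch, only a container that never started is skipped by cleanup. */
  boolean needsCleanup() {
    return classify() != State.NOT_STARTED;
  }
}
```

Under these assumptions, a container whose relaunch failed before launching still reaches cleanup, while a container that never started is left alone.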
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825503#comment-16825503 ]

Eric Yang commented on YARN-9486:

{quote}The only option I can think of other than adding the isLaunchCompleted check in ContainerCleanup would be to call markLaunched() when you catch an exception in ContainerRelaunch.call(). That's a little unexpected, so you'd need to add a comment to say we need to mark isLaunched in this case to ensure the original container is cleaned up.{quote}

I tried this, and the approach creates another problem. If the container relaunch fails, the container is still marked as launched. The failed container is not reattempted, and it reports that it is running. The decision to launch a container is based on the containerAlreadyLaunched flag, so manually changing the state of this flag can create undesired side effects. For cleanup, it may be cleaner to rely on isLaunchCompleted, because it is always set even if the container failed to launch. Thoughts?
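The side effect described in this comment can be reproduced with a small model. This is a sketch under stated assumptions, not NodeManager code: `RelaunchGate`, `tryLaunch()`, `reportsRunning()`, and `attempts()` are hypothetical names, and the use of a compare-and-set on a single flag to gate launches is an assumption about how such a guard typically works.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical model of a launch guarded by a single "already launched" flag. */
final class RelaunchGate {
  private final AtomicBoolean containerAlreadyLaunched = new AtomicBoolean(false);
  private int launchAttempts = 0;

  /** Returns true only for the first caller that flips the flag. */
  boolean markLaunched() {
    return containerAlreadyLaunched.compareAndSet(false, true);
  }

  /** Launches only if nobody has marked the container as launched yet. */
  boolean tryLaunch() {
    if (!markLaunched()) {
      return false; // flag already set, so the attempt is skipped
    }
    launchAttempts++;
    return true;
  }

  /** In this model, the flag is also what status reporting looks at. */
  boolean reportsRunning() {
    return containerAlreadyLaunched.get();
  }

  int attempts() {
    return launchAttempts;
  }
}
```

If an exception handler calls `markLaunched()` after a failed relaunch, a later `tryLaunch()` returns false without launching anything, yet `reportsRunning()` is true. That matches the behavior reported above: no reattempt, and a container that claims to be running.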
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825465#comment-16825465 ]

Jim Brennan commented on YARN-9486:

{quote}Patch 003 added the safeguard for the missing pid file, and reverted the isLaunchCompleted logic. If an IOException is thrown by the disk health check, it will leave containers behind. Is that ok? I would feel safer checking the isLaunchCompleted flag to catch the corner cases, but I understand it may not help code readability.{quote}

Yeah, really anything that throws before you actually call relaunchContainer() will put you in that state; the new call to getLocalPathForWrite() can throw IOException as well. I don't think it's ok to leave containers behind.

The only option I can think of other than adding the isLaunchCompleted check in ContainerCleanup would be to call markLaunched() when you catch an exception in ContainerRelaunch.call(). That's a little unexpected, so you'd need to add a comment to say we need to mark isLaunched in this case to ensure the original container is cleaned up.

My concern about the isLaunchCompleted check is that we always set that flag in the finally clause of ContainerLaunch.call(), so any failure before the launchContainer() call will now cause a cleanup where it didn't before (for example, if we fail on the areDisksHealthy() check, as you mentioned for the relaunch case).
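The concern about the finally clause can be shown with a toy version of the call path. This is only a sketch: `LaunchSketch` and its fields are hypothetical, and the disk health check is reduced to a constructor argument. The real ContainerLaunch.call() is considerably more involved.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical, simplified launch task; not the real ContainerLaunch. */
final class LaunchSketch {
  final AtomicBoolean launched = new AtomicBoolean(false);
  final AtomicBoolean completed = new AtomicBoolean(false);
  private final boolean disksHealthy;

  LaunchSketch(boolean disksHealthy) {
    this.disksHealthy = disksHealthy;
  }

  /** Returns 0 if the launch step ran, -1 if a pre-launch check failed. */
  int call() {
    try {
      if (!disksHealthy) {
        // Stands in for a pre-launch failure such as a disk health check.
        throw new IOException("disk health check failed");
      }
      launched.set(true); // stands in for the actual launch step
      return 0;
    } catch (IOException e) {
      return -1;
    } finally {
      completed.set(true); // runs on every path, including pre-launch failures
    }
  }
}
```

With `disksHealthy` set to false, `call()` returns -1 with `launched` still false and `completed` true. A cleanup that keys only on the completed flag would therefore also run for containers that never reached the launch step.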
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825460#comment-16825460 ]

Hadoop QA commented on YARN-9486:

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 16s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 17m 39s | trunk passed |
| +1 | compile | 1m 8s | trunk passed |
| +1 | checkstyle | 0m 29s | trunk passed |
| +1 | mvnsite | 0m 42s | trunk passed |
| +1 | shadedclient | 11m 35s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 59s | trunk passed |
| +1 | javadoc | 0m 25s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 35s | the patch passed |
| +1 | compile | 0m 58s | the patch passed |
| +1 | javac | 0m 58s | the patch passed |
| -0 | checkstyle | 0m 18s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 3 new + 48 unchanged - 0 fixed = 51 total (was 48) |
| +1 | mvnsite | 0m 35s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 11m 9s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 4s | the patch passed |
| +1 | javadoc | 0m 21s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 21m 6s | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 | asflicense | 0m 32s | The patch does not generate ASF License warnings. |
| | | 69m 42s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9486 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12966930/YARN-9486.003.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 6f9c9c79b953 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / a703dae |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/24018/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/24018/testReport/ |
| Max. process+thread count | 412 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U:
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825416#comment-16825416 ]

Eric Yang commented on YARN-9486:

[~Jim_Brennan] Patch 003 added the safeguard for the missing pid file, and reverted the isLaunchCompleted logic. If an IOException is thrown by the disk health check, it will leave containers behind. Is that ok? I would feel safer checking the isLaunchCompleted flag to catch the corner cases, but I understand it may not help code readability.
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825404#comment-16825404 ]

Jim Brennan commented on YARN-9486:

[~eyang]
{quote} The right logic is probably try to locate it first, if it is not found, then create a new path. {quote}
I agree. I think if we fix this, we won't need to change the cleanup logic.
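The locate-first logic agreed on in this comment can be sketched with plain `java.nio.file` calls. This is an assumption-laden illustration, not the patch: `PidPathResolver` and `resolve` are hypothetical names, the local directories are passed in as a simple list, and Hadoop's own path allocation helpers are deliberately not used here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper: reuse an existing pid file path before choosing a new one. */
final class PidPathResolver {
  private PidPathResolver() {
  }

  /**
   * Returns the path of an existing file under one of the local dirs, or,
   * if none exists, a path under the first writable dir.
   */
  static Path resolve(List<Path> localDirs, String relativePath) throws IOException {
    // Step 1: try to locate a file that is already on some disk.
    for (Path dir : localDirs) {
      Path candidate = dir.resolve(relativePath);
      if (Files.exists(candidate)) {
        return candidate;
      }
    }
    // Step 2: nothing found, so pick the first directory that accepts writes.
    for (Path dir : localDirs) {
      if (Files.isDirectory(dir) && Files.isWritable(dir)) {
        return dir.resolve(relativePath);
      }
    }
    throw new IOException("No writable local directory for " + relativePath);
  }
}
```

Looking up the existing file first avoids creating a second pid file on a different disk during a relaunch, and falling back to a writable directory avoids failing when no pid file exists yet.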
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825344#comment-16825344 ]

Eric Yang commented on YARN-9486:
---------------------------------

[~Jim_Brennan] getLocalPathForWrite chooses the first disk that is writable, whereas getLocalPathForRead locates the file if it exists and throws an IOException if it does not. If the relaunch path is changed to use getLocalPathForWrite, we may end up with pid files on multiple disks.
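The read/write distinction in the comment above can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: LocalDirsSketch, pathForRead and pathForWrite are hypothetical names, not the Hadoop LocalDirsHandlerService API, and the "first writable directory" rule simply follows the description in the comment.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of a multi-disk local-dirs lookup; not the Hadoop class. */
final class LocalDirsSketch {
  private final List<Path> localDirs;

  LocalDirsSketch(List<Path> localDirs) {
    this.localDirs = localDirs;
  }

  /** Read-style lookup: returns the existing file, or throws if no disk has it. */
  Path pathForRead(String subpath) throws IOException {
    for (Path dir : localDirs) {
      Path candidate = dir.resolve(subpath);
      if (Files.exists(candidate)) {
        return candidate;
      }
    }
    throw new IOException("Could not find " + subpath + " in any of the directories");
  }

  /** Write-style lookup: returns a path under the first writable disk,
   *  regardless of whether the file already exists on another disk. */
  Path pathForWrite(String subpath) throws IOException {
    for (Path dir : localDirs) {
      if (Files.isWritable(dir)) {
        return dir.resolve(subpath);
      }
    }
    throw new IOException("No writable directory for " + subpath);
  }

  public static void main(String[] args) throws IOException {
    Path disk1 = Files.createTempDirectory("disk1");
    Path disk2 = Files.createTempDirectory("disk2");
    // Assume an earlier launch left the pid file on the second disk.
    Files.createFile(disk2.resolve("container.pid"));

    LocalDirsSketch dirs = new LocalDirsSketch(Arrays.asList(disk1, disk2));
    System.out.println("read  -> " + dirs.pathForRead("container.pid"));
    System.out.println("write -> " + dirs.pathForWrite("container.pid"));
  }
}
```

In this model the read lookup resolves to the second disk while the write lookup resolves to the first, which is how a second pid file could appear on a different disk.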
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825333#comment-16825333 ]

Jim Brennan commented on YARN-9486:
-----------------------------------

[~eyang] I am not too familiar with the ContainerRelaunch path, but why is it using getLocalPathForRead()? Doesn't it need to overwrite that file? ContainerLaunch is using:
{noformat}
String pidFileSubpath = getPidFileSubpath(appIdStr, containerIdStr);
pidFilePath = dirsHandler.getLocalPathForWrite(pidFileSubpath);
{noformat}
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825312#comment-16825312 ]

Eric Yang commented on YARN-9486:
---------------------------------

[~Jim_Brennan] This stack trace tells the whole story:
{code}
2019-04-23 22:34:08,919 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch: Failed to relaunch container.
java.io.IOException: Could not find nmPrivate/application_1556058714621_0001/container_1556058714621_0001_01_02//container_1556058714621_0001_01_02.pid in any of the directories
    at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getPathToRead(LocalDirsHandlerService.java:597)
    at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getLocalPathForRead(LocalDirsHandlerService.java:612)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.getPidFilePath(ContainerRelaunch.java:200)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:90)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:47)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{code}
It got an IOException because pidFilePath does not exist, and this causes the relaunchContainer logic to skip prepareForLaunch and set the container completed status to true. This means that if the pid file does not exist, the relaunch logic cannot work. This is problematic for containers that fail to start, because relaunch will not retry them. It looks like we may want to write an empty pid file so that the pid file path lookup works even when no pid file could be found in Docker. We may also want to remove the Docker container in the cleanupContainerFiles method of the ContainerLaunch class. Otherwise, the existence of the previous Docker container will prevent the relaunch from happening as well.
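The failure mode described in the comment above can be modelled in a few lines. This is a simplified sketch under stated assumptions, not the actual ContainerRelaunch code: RelaunchFlowSketch, findPidFile and the two flags are hypothetical stand-ins for the pid file lookup, prepareForLaunch and the completed status mentioned in the comment.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical model of the relaunch flow described in the thread. */
final class RelaunchFlowSketch {
  final AtomicBoolean launched = new AtomicBoolean(false);
  final AtomicBoolean completed = new AtomicBoolean(false);

  /** Stand-in for the pid file lookup: throws when no pid file exists. */
  private void findPidFile(boolean pidFileExists) throws IOException {
    if (!pidFileExists) {
      throw new IOException("Could not find pid file in any of the directories");
    }
  }

  void relaunch(boolean pidFileExists) {
    try {
      findPidFile(pidFileExists);
      launched.set(true);  // stands in for prepareForLaunch marking the launch
    } catch (IOException e) {
      System.out.println("Failed to relaunch container: " + e.getMessage());
    } finally {
      completed.set(true); // the completed status is set either way
    }
  }

  public static void main(String[] args) {
    RelaunchFlowSketch missingPid = new RelaunchFlowSketch();
    missingPid.relaunch(false);
    System.out.println("launched=" + missingPid.launched.get()
        + " completed=" + missingPid.completed.get());
  }
}
```

With a missing pid file the model ends up completed but never marked as launched, which is the combination that a later cleanup step would read as "not launched".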
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825222#comment-16825222 ]

Jim Brennan commented on YARN-9486:
-----------------------------------

[~eyang] I'm not sure I agree. This suggests that containerAlreadyLaunched has not been set yet when we get here. It seems to me that the bug is in the relaunch case - shouldn't we be marking the container launched when we relaunch it? It looks like the ContainerLaunch.relaunchContainer() calls prepareForLaunch(), which should set it. Do you know why this is not happening in this case?
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824644#comment-16824644 ]

Eric Yang commented on YARN-9486:
---------------------------------

[~Jim_Brennan] I added a couple of debug statements:
{code:java}
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerCleanup.java
@@ -96,9 +96,10 @@ public void run() {
     }
     // launch flag will be set to true if process already launched
     boolean alreadyLaunched = !launch.markLaunched();
+    LOG.info("alreadyLaunched: "+alreadyLaunched+" isLaunchCompleted: "+launch.isLaunchCompleted());
     if (!alreadyLaunched) {
+      LOG.info("!alreadyLaunched: "+!alreadyLaunched);
       LOG.info("Container " + containerIdStr + " not launched."
           + " No cleanup needed to be done");
       return;
{code}
The node manager log output looks like this:
{code:java}
2019-04-23 22:34:08,919 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch: Failed to relaunch container.
java.io.IOException: Could not find nmPrivate/application_1556058714621_0001/container_1556058714621_0001_01_02//container_1556058714621_0001_01_02.pid in any of the directories
    at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getPathToRead(LocalDirsHandlerService.java:597)
    at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getLocalPathForRead(LocalDirsHandlerService.java:612)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.getPidFilePath(ContainerRelaunch.java:200)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:90)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:47)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2019-04-23 22:34:08,922 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1556058714621_0001_01_02 transitioned from RELAUNCHING to EXITED_WITH_FAILURE
2019-04-23 22:34:08,925 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: Cleaning up container container_1556058714621_0001_01_02
2019-04-23 22:34:08,926 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: alreadyLaunched: false isLaunchCompleted: true
2019-04-23 22:34:08,926 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: !alreadyLaunched: true
2019-04-23 22:34:08,926 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: Container container_1556058714621_0001_01_02 not launched. No cleanup needed to be done
2019-04-23 22:34:08,963 INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting absolute path : /tmp/hadoop-yarn/nm-local-dir/usercache/hbase/appcache/application_1556058714621_0001/container_1556058714621_0001_01_02
2019-04-23 22:34:08,963 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Privileged Execution Command Array: [/usr/local/hadoop-3.3.0-SNAPSHOT/bin/container-executor, hbase, hbase, 3, /tmp/hadoop-yarn/nm-local-dir/usercache/hbase/appcache/application_1556058714621_0001/container_1556058714621_0001_01_02]
2019-04-23 22:34:08,963 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1556058714621_0001 CONTAINERID=container_1556058714621_0001_01_02
2019-04-23 22:34:08,967 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1556058714621_0001_01_02 transitioned from EXITED_WITH_FAILURE to DONE
{code}
If the container is set to relaunch, markLaunched will return false because the flag was previously set by prepareForLaunch when the container was launched. The atomic boolean's compareAndSet(false, true) will then return false. Negating false twice still yields false. This causes the failure to clean up the previous instance of the container. I think the added logic is necessary to ensure that relaunch proceeds with the Docker container cleanup logic, by checking whether the container had completed. Do you agree with this analysis?
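To make the logged values concrete, the guard from the diff above can be compared with a variant that also consults the completed status. This is a sketch only: CleanupGuardSketch and both method names are hypothetical, and the way the completed check is combined with the launch flag is an assumption for illustration, since the patch itself is not shown in this message.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical comparison of two cleanup guards; not the Hadoop code. */
final class CleanupGuardSketch {

  /** Guard shaped like the quoted diff: skip cleanup unless the flag was already set. */
  static boolean skipCleanupOriginal(AtomicBoolean launchedFlag) {
    boolean alreadyLaunched = !launchedFlag.compareAndSet(false, true);
    return !alreadyLaunched;
  }

  /** Assumed variant: only skip cleanup if the launch has not completed either. */
  static boolean skipCleanupWithCompletedCheck(AtomicBoolean launchedFlag,
                                               boolean launchCompleted) {
    boolean alreadyLaunched = !launchedFlag.compareAndSet(false, true);
    return !alreadyLaunched && !launchCompleted;
  }

  public static void main(String[] args) {
    // Values taken from the log: launch flag unset, isLaunchCompleted true.
    System.out.println("original guard skips cleanup: "
        + skipCleanupOriginal(new AtomicBoolean(false)));
    System.out.println("variant guard skips cleanup:  "
        + skipCleanupWithCompletedCheck(new AtomicBoolean(false), true));
  }
}
```

With the logged combination, the original guard skips cleanup and the assumed variant does not, which matches the behaviour the comment is arguing for.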
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823395#comment-16823395 ]

Eric Yang commented on YARN-9486:
---------------------------------

{quote}and true if it was true. In either case, it will be true after this call. I was accounting for this in my comment above.{quote}
[~Jim_Brennan] I agree with you that the negation logic in this code should work as written. ContainerLaunch should be in the right state when it reaches the cleanup logic. I am not sure whether we ever clone the ContainerLaunch object, which could cause this issue. More investigation is required.
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823309#comment-16823309 ]

Jim Brennan commented on YARN-9486:
-----------------------------------

{quote}
It looks like the problem is in the usage of compareAndSet(false, true).
{code:java}
/**
 * Marks the container to be launched only if it was not launched.
 *
 * @return true if successful; false otherwise.
 */
boolean markLaunched() {
  return containerAlreadyLaunched.compareAndSet(false, true);
}
{code}
This will return false if the actual value is not equal to the expected value. The person who coded this is assuming it will return the value of containerAlreadyLaunched.
{quote}
This is why the code negates the return value of markLaunched(). !launch.markLaunched() will be false if containerAlreadyLaunched was false, and true if it was true. In either case, it will be true after this call. I was accounting for this in my comment above.
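The compareAndSet behaviour discussed in this message can be checked in isolation. The following is a minimal sketch: LaunchFlagDemo and isFlagSet are hypothetical names introduced for illustration, while markLaunched has the same body as the method quoted in the thread and AtomicBoolean.compareAndSet is the standard java.util.concurrent API.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the launch flag; not the Hadoop ContainerLaunch class. */
final class LaunchFlagDemo {
  private final AtomicBoolean containerAlreadyLaunched = new AtomicBoolean(false);

  /** Same shape as the quoted method: true only if this call flipped the flag. */
  boolean markLaunched() {
    return containerAlreadyLaunched.compareAndSet(false, true);
  }

  boolean isFlagSet() {
    return containerAlreadyLaunched.get();
  }

  public static void main(String[] args) {
    // Case 1: the flag was never set before the cleanup-style check.
    LaunchFlagDemo neverLaunched = new LaunchFlagDemo();
    boolean alreadyLaunched = !neverLaunched.markLaunched();
    System.out.println("never launched: alreadyLaunched=" + alreadyLaunched
        + ", flag afterwards=" + neverLaunched.isFlagSet());

    // Case 2: an earlier call (simulating the launch path) already set the flag.
    LaunchFlagDemo launched = new LaunchFlagDemo();
    launched.markLaunched();
    alreadyLaunched = !launched.markLaunched();
    System.out.println("launched:       alreadyLaunched=" + alreadyLaunched
        + ", flag afterwards=" + launched.isFlagSet());
  }
}
```

In both cases the flag is true afterwards, and the negated return value equals the flag's value before the call, which is the point made in the comment above.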
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823222#comment-16823222 ]

Eric Yang commented on YARN-9486:
---------------------------------

It looks like the problem is in the usage of compareAndSet(false, true).
{code}
  /**
   * Marks the container to be launched only if it was not launched.
   *
   * @return true if successful; false otherwise.
   */
  boolean markLaunched() {
    return containerAlreadyLaunched.compareAndSet(false, true);
  }
{code}
compareAndSet returns false if the actual value is not equal to the expected value. The person who coded this seems to have assumed it would return the value of containerAlreadyLaunched.

> Docker container exited with failure does not get clean up correctly
> ---------------------------------------------------------------------
>
>                 Key: YARN-9486
>                 URL: https://issues.apache.org/jira/browse/YARN-9486
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: 3.2.0
>            Reporter: Eric Yang
>            Assignee: Eric Yang
>            Priority: Major
>         Attachments: YARN-9486.001.patch, YARN-9486.002.patch
>
> When docker container encounters error and exit prematurely
> (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we
> get messages that look like this:
> {code}
> java.io.IOException: Could not find nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid in any of the directories
> 2019-04-15 20:42:16,454 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1555111445937_0008_01_07 transitioned from RELAUNCHING to EXITED_WITH_FAILURE
> 2019-04-15 20:42:16,455 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: Cleaning up container container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,455 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: Container container_1555111445937_0008_01_07 not launched. No cleanup needed to be done
> 2019-04-15 20:42:16,455 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1555111445937_0008 CONTAINERID=container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1555111445937_0008_01_07 transitioned from EXITED_WITH_FAILURE to DONE
> 2019-04-15 20:42:16,458 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Removing container_1555111445937_0008_01_07 from application application_1555111445937_0008
> 2019-04-15 20:42:16,458 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1555111445937_0008_01_07 for log-aggregation
> 2019-04-15 20:42:16,804 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Getting container-status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Getting localization status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Returning ContainerStatus: [ContainerId: container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: COMPLETE, Capability: , Diagnostics: ..., ExitStatus: -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE]
> 2019-04-15 20:42:18,464 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1555111445937_0008_01_07]
> 2019-04-15 20:43:50,476 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1555111445937_0008_01_07
> {code}
> There is no docker rm command performed.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
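The return-value behavior of compareAndSet(false, true) discussed in Eric Yang's comment can be checked in isolation. The following is a minimal sketch, not code from the patch: the class name MarkLaunchedDemo and the helper isAlreadyLaunched() are hypothetical, and only the body of markLaunched() is taken from the quoted snippet.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the object that owns containerAlreadyLaunched.
public class MarkLaunchedDemo {
    private final AtomicBoolean containerAlreadyLaunched = new AtomicBoolean(false);

    // Same body as the markLaunched() quoted in the comment.
    public boolean markLaunched() {
        return containerAlreadyLaunched.compareAndSet(false, true);
    }

    // Hypothetical helper that exposes the current flag value for the demo.
    public boolean isAlreadyLaunched() {
        return containerAlreadyLaunched.get();
    }

    public static void main(String[] args) {
        MarkLaunchedDemo launch = new MarkLaunchedDemo();
        // Flag is false, so the swap succeeds and the call returns true.
        boolean first = launch.markLaunched();
        // Flag is now true, so the swap fails and the call returns false.
        boolean second = launch.markLaunched();
        System.out.println("first=" + first + " second=" + second
            + " flag=" + launch.isAlreadyLaunched());
    }
}
```

The return value reports whether the swap happened, not the flag's value: a true result means the flag was false before the call, which is why callers negate it to obtain "already launched".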
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823213#comment-16823213 ]

Eric Yang commented on YARN-9486:
---------------------------------

[~Jim_Brennan] I did not fully understand it myself when I put up this patch. It looks like a race condition in which the launching thread and the cleanup thread get inconsistent values for the given ContainerLaunch object. I am guessing this is the result of merging the new implementation of the ContainerCleanup class with the upgrade logic in ContainerLauncher: access to ContainerLauncher internal state is not guarded correctly, which lets the race condition surface.

The test used YARN service and intentionally failed the container to reproduce the Docker container cleanup failure. I printed the values of alreadyLaunched and launch.isLaunchCompleted and saw that isLaunchCompleted was set to true while alreadyLaunched was false.
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823189#comment-16823189 ]

Jim Brennan commented on YARN-9486:
-----------------------------------

[~eyang] I'm just trying to understand the logic here.

containerAlreadyLaunched is initialized as false. In prepareForLaunch(), it is set to true. In signalContainer(), it is set to true. So it will be true if we attempted a container launch, or if we have signaled it (presumably for killing).

In the containerCleanup thread, we currently have:
{noformat}
    boolean alreadyLaunched = !launch.markLaunched();
    if (!alreadyLaunched) {
      // skip
{noformat}
Which will also set it to true. If it was previously false, then we skip, so either it was never launched, or it was signaled.

The patch adds this check:
{noformat}
    boolean alreadyLaunched = !launch.markLaunched() || launch.isLaunchCompleted();
    if (!alreadyLaunched) {
      // skip
{noformat}
The completed flag is set after a container returns from launchContainer. So basically any container that has fully completed will set alreadyLaunched to true here.

The part I am not following is how launch.isLaunchCompleted() can ever be true while containerAlreadyLaunched is false. That is the only case that is changing here.

In the current code, if containerAlreadyLaunched is false, then launch.markLaunched() will return true, so alreadyLaunched will be false and we will skip. And if containerAlreadyLaunched is true, then launch.markLaunched() will return false, so alreadyLaunched will be true, and we will not skip.

In the patch, if launch.isLaunchCompleted() returns false, then the behavior is unchanged. If launch.isLaunchCompleted() returns true, it will affect the case where containerAlreadyLaunched is false, setting alreadyLaunched to true instead of false, and we won't skip.

So the question remains: how can isLaunchCompleted() return true while containerAlreadyLaunched is false?
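Jim Brennan's case analysis of the current and patched cleanup checks can be enumerated mechanically. This is a sketch under stated assumptions, not code from the patch: the class CleanupCheckDemo and its methods are hypothetical, and markLaunched() is modeled as a pure function of the flag's prior value (true only when the flag was false), ignoring the side effect of setting it.

```java
// Hypothetical model of the two checks quoted in the comment.
public class CleanupCheckDemo {
    // Models launch.markLaunched(): compareAndSet(false, true) returns true
    // only when the flag was false before the call.
    public static boolean markLaunchedResult(boolean containerAlreadyLaunched) {
        return !containerAlreadyLaunched;
    }

    // Current check: alreadyLaunched = !launch.markLaunched()
    public static boolean currentCheck(boolean containerAlreadyLaunched) {
        return !markLaunchedResult(containerAlreadyLaunched);
    }

    // Patched check:
    // alreadyLaunched = !launch.markLaunched() || launch.isLaunchCompleted()
    public static boolean patchedCheck(boolean containerAlreadyLaunched,
                                       boolean launchCompleted) {
        return !markLaunchedResult(containerAlreadyLaunched) || launchCompleted;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean launched : values) {
            for (boolean completed : values) {
                // Cleanup is skipped when alreadyLaunched evaluates to false.
                System.out.printf(
                    "containerAlreadyLaunched=%b isLaunchCompleted=%b"
                        + " -> current skip=%b, patched skip=%b%n",
                    launched, completed,
                    !currentCheck(launched),
                    !patchedCheck(launched, completed));
            }
        }
    }
}
```

Under this model the two checks disagree in exactly one row, containerAlreadyLaunched=false with isLaunchCompleted=true, which is the combination the comment asks about.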
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822175#comment-16822175 ]

Eric Yang commented on YARN-9486:
---------------------------------

[~Jim_Brennan] containerAlreadyLaunched defaults to false. If the container has been marked for killing before containerAlreadyLaunched is set (i.e. the pid file doesn't exist for a period of 30 seconds), then it will return false and never set containerAlreadyLaunched to true.
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822028#comment-16822028 ]

Jim Brennan commented on YARN-9486:
-----------------------------------

[~eyang], why is launch.markLaunched() returning false in this case?
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821339#comment-16821339 ]

Hadoop QA commented on YARN-9486:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 17s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 16m 54s | trunk passed |
| +1 | compile | 1m 5s | trunk passed |
| +1 | checkstyle | 0m 27s | trunk passed |
| +1 | mvnsite | 0m 41s | trunk passed |
| +1 | shadedclient | 11m 51s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 3s | trunk passed |
| +1 | javadoc | 0m 30s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 35s | the patch passed |
| +1 | compile | 0m 58s | the patch passed |
| +1 | javac | 0m 58s | the patch passed |
| +1 | checkstyle | 0m 21s | the patch passed |
| +1 | mvnsite | 0m 36s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 12m 24s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 7s | the patch passed |
| +1 | javadoc | 0m 25s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 25m 20s | hadoop-yarn-server-nodemanager in the patch failed. |
| +1 | asflicense | 0m 30s | The patch does not generate ASF License warnings. |
| | | 75m 14s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.nodemanager.containermanager.TestContainerManager |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9486 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12966384/YARN-9486.002.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux c052e090c920 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / b979fdd |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-YARN-Build/23996/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23996/testReport/ |
| Max. process+thread count | 412 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U:
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818468#comment-16818468 ]

Hadoop QA commented on YARN-9486:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 15s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 16m 54s | trunk passed |
| +1 | compile | 1m 0s | trunk passed |
| +1 | checkstyle | 0m 28s | trunk passed |
| +1 | mvnsite | 0m 40s | trunk passed |
| +1 | shadedclient | 12m 9s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 59s | trunk passed |
| +1 | javadoc | 0m 27s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 35s | the patch passed |
| +1 | compile | 0m 59s | the patch passed |
| +1 | javac | 0m 59s | the patch passed |
| +1 | checkstyle | 0m 21s | the patch passed |
| +1 | mvnsite | 0m 35s | the patch passed |
| -1 | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| +1 | shadedclient | 12m 12s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 1s | the patch passed |
| +1 | javadoc | 0m 25s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 21m 25s | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 | asflicense | 0m 27s | The patch does not generate ASF License warnings. |
| | | 70m 58s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9486 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12965999/YARN-9486.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux ec542941ad8c 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 5583e1b |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/23962/artifact/out/whitespace-eol.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23962/testReport/ |
| Max. process+thread count | 447 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output |
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818440#comment-16818440 ]

Eric Yang commented on YARN-9486:
---------------------------------

Patch 001 implements a simple check to verify whether the container has completed. If the container has been marked as completed, cleanup goes through the flow of removing the container.

> Docker container exited with failure does not get clean up correctly
> --------------------------------------------------------------------
>
> Key: YARN-9486
> URL: https://issues.apache.org/jira/browse/YARN-9486
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Eric Yang
> Assignee: Eric Yang
> Priority: Major
> Attachments: YARN-9486.001.patch
>
> When a docker container encounters an error and exits prematurely (EXITED_WITH_FAILURE), ContainerCleanup does not remove the container; instead we get messages that look like this:
> {code}
> java.io.IOException: Could not find nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid in any of the directories
> 2019-04-15 20:42:16,454 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1555111445937_0008_01_07 transitioned from RELAUNCHING to EXITED_WITH_FAILURE
> 2019-04-15 20:42:16,455 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: Cleaning up container container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,455 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: Container container_1555111445937_0008_01_07 not launched. No cleanup needed to be done
> 2019-04-15 20:42:16,455 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1555111445937_0008 CONTAINERID=container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1555111445937_0008_01_07 transitioned from EXITED_WITH_FAILURE to DONE
> 2019-04-15 20:42:16,458 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Removing container_1555111445937_0008_01_07 from application application_1555111445937_0008
> 2019-04-15 20:42:16,458 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1555111445937_0008_01_07 for log-aggregation
> 2019-04-15 20:42:16,804 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Getting container-status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Getting localization status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Returning ContainerStatus: [ContainerId: container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: COMPLETE, Capability: , Diagnostics: ..., ExitStatus: -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE]
> 2019-04-15 20:42:18,464 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1555111445937_0008_01_07]
> 2019-04-15 20:43:50,476 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1555111445937_0008_01_07
> {code}
> There is no docker rm command performed.
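
The check Eric Yang describes for Patch 001 can be pictured with a small standalone model. This is only a sketch and not the code in the patch: the class `CleanupSketch`, the `Action` enum, and the `decide` method are hypothetical names, and treating "launched" as "a pid file was found" is an assumption drawn from the log above, where the missing .pid file is followed by "not launched. No cleanup needed to be done".

```java
/** Hypothetical model of the cleanup decision; not the NodeManager's real code. */
final class CleanupSketch {

    /** What a cleanup pass decides to do with a container. */
    enum Action { SKIP, REMOVE }

    /**
     * @param launched  assumed to mean a pid file was found for the container
     * @param completed whether the container has been marked as completed
     */
    static Action decide(boolean launched, boolean completed) {
        // Assumption: without the completed check, an unlaunched container is
        // skipped outright, which matches the "No cleanup needed" log line.
        // With the check, a completed container stays in the removal flow.
        if (!launched && !completed) {
            return Action.SKIP;
        }
        return Action.REMOVE;
    }

    public static void main(String[] args) {
        // The case from the log: no pid file, but the container exited with failure.
        System.out.println("failed before pid file: " + decide(false, true));
        // Never launched and not completed: nothing to remove.
        System.out.println("never launched: " + decide(false, false));
    }
}
```

In this model the container from the report (no pid file, state EXITED_WITH_FAILURE) would reach the removal step instead of being skipped, which is the step where a docker rm would be expected.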