[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871697#comment-17871697 ]
ASF GitHub Bot commented on YARN-11709:
---------------------------------------

ferdelyi commented on code in PR #6960:
URL: https://github.com/apache/hadoop/pull/6960#discussion_r1707211380

##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java:
##########
@@ -451,8 +451,10 @@ public void startLocalizer(LocalizerStartContext ctx)
     } catch (PrivilegedOperationException e) {
       int exitCode = e.getExitCode();
-      LOG.warn("Exit code from container {} startLocalizer is : {}",
-          locId, exitCode, e);
+      LOG.error("Unrecoverable issue occurred. Marking the node as unhealthy to prevent "
+          + "further containers to get scheduled on the node and cause application failures. "
+          + "Exit code from the container " + locId + "startLocalizer is : " + exitCode, e);
+      nmContext.getNodeStatusUpdater().reportException(e);

Review Comment:
   @zeekling thank you for looking into this change! Yes, when we hit an unrecoverable issue with the NM, the root cause needs to be fixed and the NM restarted manually. Marking the node as unhealthy ensures that the RM will not schedule applications on the node while the issue is present. If we let the RM place containers on the faulty NM, it can lead to application failures. For example, an application can reach its maximum number of application attempts when the AM is scheduled on the same faulty node twice.
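   To illustrate the idea in the comment above, here is a minimal, self-contained sketch of the pattern: an unrecoverable failure is reported once, the node is flagged unhealthy, and a scheduler check then skips it. All class and method names below (`UnhealthyNodeSketch`, `NodeStatus`, `canSchedule`, and this simplified `startLocalizer`) are hypothetical stand-ins chosen for the example. They are not the YARN API, and the real `reportException` in the PR may behave differently.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

public class UnhealthyNodeSketch {

  /** Hypothetical stand-in for a node status updater; not the YARN class. */
  static final class NodeStatus {
    private final AtomicReference<Exception> fatal = new AtomicReference<>();

    /** Records the first unrecoverable exception; later reports are ignored. */
    void reportException(Exception e) {
      fatal.compareAndSet(null, e);
    }

    /** The node counts as healthy until an unrecoverable exception is reported. */
    boolean isHealthy() {
      return fatal.get() == null;
    }

    /** Returns the recorded failure message, or an empty string if healthy. */
    String healthReport() {
      Exception e = fatal.get();
      return e == null ? "" : e.getMessage();
    }
  }

  /** Hypothetical scheduler check: only healthy nodes receive containers. */
  static boolean canSchedule(NodeStatus node) {
    return node.isHealthy();
  }

  /** Simulates a localizer start that fails when the executor binary is missing. */
  static void startLocalizer(NodeStatus node, boolean executorPresent) {
    try {
      if (!executorPresent) {
        throw new IOException("Cannot run program \"container-executor\": "
            + "error=2, No such file or directory");
      }
      // Localization would proceed here in a real implementation.
    } catch (IOException e) {
      // Instead of only logging, report the failure so the node is taken
      // out of scheduling until an operator fixes the root cause.
      node.reportException(e);
    }
  }

  public static void main(String[] args) {
    NodeStatus node = new NodeStatus();
    startLocalizer(node, true);
    System.out.println("schedulable after success: " + canSchedule(node));
    startLocalizer(node, false);
    System.out.println("schedulable after failure: " + canSchedule(node));
    System.out.println("health report: " + node.healthReport());
  }
}
```

   In this sketch the unhealthy flag is never cleared, which mirrors the point that the root cause has to be fixed and the NM restarted before the node takes work again.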

> NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-11709
>                 URL: https://issues.apache.org/jira/browse/YARN-11709
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: container-executor
>            Reporter: Ferenc Erdelyi
>            Assignee: Ferenc Erdelyi
>            Priority: Major
>              Labels: pull-request-available
>
> When the NodeManager encounters the "No such file or directory" error below,
> reported against the "container-executor", it should stop participating in
> the cluster, because it cannot run any containers and only fails the jobs.
> {code:java}
> 2023-01-18 10:08:10,600 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_e159_1673543180101_9407_02_000014 startLocalizer is : -1
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: java.io.IOException: Cannot run program "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:183)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:403)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1250)
> Caused by: java.io.IOException: Cannot run program "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory
> {code}