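[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"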

ASF GitHub Bot (Jira) Wed, 07 Aug 2024 08:06:05 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871697#comment-17871697
 ]


ASF GitHub Bot commented on YARN-11709:
---------------------------------------

ferdelyi commented on code in PR #6960:
URL: https://github.com/apache/hadoop/pull/6960#discussion_r1707211380


##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java:
##########
@@ -451,8 +451,10 @@ public void startLocalizer(LocalizerStartContext ctx)
 
     } catch (PrivilegedOperationException e) {
       int exitCode = e.getExitCode();
-      LOG.warn("Exit code from container {} startLocalizer is : {}",
-          locId, exitCode, e);
+      LOG.error("Unrecoverable issue occurred. Marking the node as unhealthy to prevent "
+          + "further containers to get scheduled on the node and cause application failures. " +
+          "Exit code from the container " + locId + "startLocalizer is : " + exitCode, e);
+      nmContext.getNodeStatusUpdater().reportException(e);

Review Comment:
   @zeekling thank you for looking into this change! Yes, when we hit an 
unrecoverable issue with the NM, the root cause needs to be fixed and the NM 
manually restarted. Marking the node as unhealthy ensures that the RM does not 
schedule applications on it while the issue is present. If we let the RM place 
containers on the faulty NM, it can lead to application failures, for example 
by reaching the maximum number of application attempts when the AM is 
scheduled on the same node twice.
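
To illustrate the idea, here is a minimal, self-contained sketch rather than the actual NodeManager code. `NodeHealth`, `canSchedule` and the `executorPresent` flag are hypothetical names invented for this example; only the pattern (report the exception, then stop accepting containers) mirrors the change under review.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

public class UnhealthyNodeSketch {

    /** Hypothetical stand-in for the NM's health state; not a Hadoop class. */
    static final class NodeHealth {
        // Holds the first unrecoverable exception reported, or null while healthy.
        private final AtomicReference<Exception> fatal = new AtomicReference<>();

        void reportException(Exception e) {
            // Keep the first reported cause; later reports do not overwrite it.
            fatal.compareAndSet(null, e);
        }

        boolean isHealthy() {
            return fatal.get() == null;
        }
    }

    /** Hypothetical scheduler-side check: only healthy nodes receive containers. */
    static boolean canSchedule(NodeHealth node) {
        return node.isHealthy();
    }

    /** Simulates a localizer start that fails when the executor binary is missing. */
    static void startLocalizer(NodeHealth health, boolean executorPresent) {
        try {
            if (!executorPresent) {
                throw new IOException("Cannot run program \"container-executor\": "
                    + "error=2, No such file or directory");
            }
            // Normal localization work would happen here.
        } catch (IOException e) {
            // Unrecoverable: flag the node instead of only logging a warning.
            health.reportException(e);
        }
    }

    public static void main(String[] args) {
        NodeHealth health = new NodeHealth();
        System.out.println("schedulable before failure: " + canSchedule(health));
        startLocalizer(health, false);
        System.out.println("schedulable after failure: " + canSchedule(health));
    }
}
```

In this sketch the node stays unschedulable until the process is restarted with a fresh `NodeHealth`, which matches the "fix the root cause and restart the NM manually" workflow described above.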





> NodeManager should be shut down or blacklisted when it cannot run program 
> "/var/lib/yarn-ce/bin/container-executor"
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-11709
>                 URL: https://issues.apache.org/jira/browse/YARN-11709
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: container-executor
>            Reporter: Ferenc Erdelyi
>            Assignee: Ferenc Erdelyi
>            Priority: Major
>              Labels: pull-request-available
>
> When NodeManager encounters the "No such file or directory" error below, 
> reported against the "container-executor", it should stop participating in 
> the cluster, since it cannot run any containers and only causes jobs to fail.
> {code:java}
> 2023-01-18 10:08:10,600 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_e159_1673543180101_9407_02_000014 startLocalizer is : -1
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: java.io.IOException: Cannot run program "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:183)
> at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:403)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1250)
> Caused by: java.io.IOException: Cannot run program "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
