[jira] [Updated] (YARN-11534) Incorrect exception handling in RecoveredContainerLaunch

Peter Szucs (Jira) Tue, 18 Jul 2023 04:21:04 -0700


     [ https://issues.apache.org/jira/browse/YARN-11534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Peter Szucs updated YARN-11534:
-------------------------------
    Description: 
When the NM is restarted during container recovery, the restart can 
interrupt container reacquisition inside LinuxContainerExecutor's 
signalContainer method. In this case we get the following exception:
java.io.InterruptedIOException: java.lang.InterruptedException
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:1011)
    at org.apache.hadoop.util.Shell.run(Shell.java:901)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:177)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:184)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:735)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.isContainerAlive(LinuxContainerExecutor.java:887)
    at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:291)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:708)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.InterruptedException
    at java.base/java.lang.Object.wait(Native Method)
    at java.base/java.lang.Object.wait(Object.java:328)
    at java.base/java.lang.ProcessImpl.waitFor(ProcessImpl.java:495)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:1001)
    ... 15 more
Later this InterruptedIOException gets caught and wrapped inside a 
PrivilegedOperationException and then a ContainerExecutionException. In 
LinuxContainerExecutor's 
[signalContainer|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java#L790]
 method we catch this exception again and throw a plain IOException from it, 
which produces the following stack trace:
org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Signal container failed
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:183)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:184)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:735)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.isContainerAlive(LinuxContainerExecutor.java:887)
    at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:291)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:708)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
2023-06-20 18:24:31,777 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Unable to recover container container_e03_1687266197584_0033_01_000001
java.io.IOException: Problem signalling container 256974 with NULL; output: null and exitCode: -1
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:746)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.isContainerAlive(LinuxContainerExecutor.java:887)
    at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:291)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:708)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Signal container failed
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:183)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:184)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:735)
    ... 9 more
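
The effect of the wrapping can be reproduced in isolation. The following is a minimal sketch, not the Hadoop code: the class WrappedInterruptDemo and its methods are hypothetical stand-ins, and the real chain has more layers (PrivilegedOperationException and ContainerExecutionException) than the single wrap shown here. It only illustrates that once an InterruptedIOException is wrapped in a plain IOException, a caller's catch (InterruptedIOException) branch no longer matches.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

class WrappedInterruptDemo {
  // Hypothetical stand-in for the interrupted low-level shell call.
  static void runCommand() throws IOException {
    throw new InterruptedIOException("java.lang.InterruptedException");
  }

  // Hypothetical stand-in for signalContainer: any failure is rethrown
  // as a plain IOException, keeping the original only as the cause.
  static void signalContainer() throws IOException {
    try {
      runCommand();
    } catch (IOException e) {
      throw new IOException("Problem signalling container", e);
    }
  }

  // Reports which catch branch a caller ends up in.
  static String classify() {
    try {
      signalContainer();
      return "ok";
    } catch (InterruptedIOException e) {
      return "interrupted";
    } catch (IOException e) {
      // The interruption is only visible through getCause().
      return "generic failure, cause="
          + e.getCause().getClass().getSimpleName();
    }
  }
}
```

In this sketch classify() takes the generic IOException branch even though the root cause is an interruption, which mirrors the "Unable to recover container" log entry above.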
 

Since YARN-2846, we are using a "nonInterrupted" flag in 
RecoveredContainerLaunch's call method.

We mark the recovery as interrupted when we catch an InterruptedException or 
an InterruptedIOException; see 
[this|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/RecoveredContainerLaunch.java#L95]
 code part. By default every container has exit code 154 (LOST), and if the 
recovery is interrupted we keep this value. When the flag indicates that an 
interruption happened, we do not persist the exit code in the NM state store. 
However, because LinuxContainerExecutor throws a plain IOException in the case 
above, the failure is not treated as an interruption. The default "LOST" state 
is therefore persisted for the container, and after the NM restarts the RM 
kills it.
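
The flag logic described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual RecoveredContainerLaunch code: the class RecoveryFlagSketch, the recover method, and the in-memory stateStore map are all hypothetical, and the flag is named after the wording used in this ticket.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;

class RecoveryFlagSketch {
  static final int LOST = 154;

  // Hypothetical in-memory stand-in for the NM state store.
  static final Map<String, Integer> stateStore = new HashMap<>();

  // Runs the reacquire step and persists the exit code unless the
  // recovery was recognised as interrupted.
  static int recover(String containerId, Callable<Integer> reacquire) {
    int retCode = LOST;            // default exit code
    boolean nonInterrupted = true; // flag named as in the ticket text
    try {
      retCode = reacquire.call();
    } catch (InterruptedException | InterruptedIOException e) {
      nonInterrupted = false;      // recognised interruption
    } catch (Exception e) {
      // Any other failure keeps LOST and still counts as non-interrupted.
    } finally {
      if (nonInterrupted) {
        stateStore.put(containerId, retCode);
      }
    }
    return retCode;
  }
}
```

With this sketch, a reacquire step that throws InterruptedIOException directly leaves the store untouched, while one that throws an IOException wrapping the same interruption stores 154, which is the behaviour this ticket reports.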

 

The goal of this ticket is to improve the exception handling here and to 
propagate the interruption in some form when the signalContainer method cannot 
complete successfully.
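
One possible direction, shown only as a sketch and not as the change made for this ticket, is to inspect the cause chain before rethrowing and to surface an InterruptedIOException when an interruption is found. The class InterruptCauseSketch and both of its methods are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

class InterruptCauseSketch {
  // Walks the cause chain (bounded, in case of a cyclic chain) looking
  // for an interruption.
  static boolean causedByInterrupt(Throwable t) {
    int depth = 0;
    for (Throwable cur = t; cur != null && depth < 32;
         cur = cur.getCause(), depth++) {
      if (cur instanceof InterruptedException
          || cur instanceof InterruptedIOException) {
        return true;
      }
    }
    return false;
  }

  // Builds the exception to rethrow: an InterruptedIOException when the
  // failure was caused by an interruption, a plain IOException otherwise.
  static IOException toSignalFailure(String message, Exception cause) {
    if (causedByInterrupt(cause)) {
      InterruptedIOException iioe = new InterruptedIOException(message);
      iioe.initCause(cause);
      return iioe;
    }
    return new IOException(message, cause);
  }
}
```

A caller that already handles InterruptedIOException would then see the interruption without having to unwrap the exception itself; whether this belongs in signalContainer or in the caller is a design choice left open here.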

> Incorrect exception handling in RecoveredContainerLaunch
> --------------------------------------------------------
>
>                 Key: YARN-11534
>                 URL: https://issues.apache.org/jira/browse/YARN-11534
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Peter Szucs
>            Assignee: Peter Szucs
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

