[ https://issues.apache.org/jira/browse/YARN-10205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anil Sadineni resolved YARN-10205.
----------------------------------
    Resolution: Not A Problem

> NodeManager stateful restart feature did not work as expected - information 
> only (Resolved)
> -------------------------------------------------------------------------------------------
>
>                 Key: YARN-10205
>                 URL: https://issues.apache.org/jira/browse/YARN-10205
>             Project: Hadoop YARN
>          Issue Type: Test
>          Components: graceful, nodemanager, rolling upgrade, yarn
>            Reporter: Anil Sadineni
>            Priority: Major
>
> *TL;DR* This is an information-only Jira about the stateful restart feature of the NodeManager. In this case, the unexpected behavior of the feature was due to the systemd process configuration. Please read below for more details - 
> Stateful restart of the NodeManager (YARN-1336) was introduced in Hadoop 2.6. This feature worked as expected for us in Hadoop 2.6. Recently we upgraded our clusters from 2.6 to 2.9.2, along with some OS upgrades, and the feature was broken after the upgrade. One of the initial suspects was the LinuxContainerExecutor, as we started using it in this upgrade. 
> yarn-site.xml has all of the configurations required to enable this feature (a sketch in XML form follows the list) - 
> {{yarn.nodemanager.recovery.enabled: 'true'}}
> {{yarn.nodemanager.recovery.dir:'<nm_recovery_dir>'}}
> {{yarn.nodemanager.recovery.supervised: 'true'}}
> {{yarn.nodemanager.address: '0.0.0.0:8041'}}
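> For reference, a minimal sketch of how these settings might look in yarn-site.xml form (the recovery path below just spells out the <nm_recovery_dir> placeholder; everything else is taken from the list above):
> {code:xml}
> <property>
>   <name>yarn.nodemanager.recovery.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.nodemanager.recovery.dir</name>
>   <!-- any local path that survives NM restarts; stands in for <nm_recovery_dir> -->
>   <value>/path/to/nm-recovery-dir</value>
> </property>
> <property>
>   <name>yarn.nodemanager.recovery.supervised</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.nodemanager.address</name>
>   <!-- fixed port so running containers can be reacquired after a restart -->
>   <value>0.0.0.0:8041</value>
> </property>
> {code}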
> While containers were running and the NM was restarted, the exception below was observed repeatedly in the NodeManager logs - 
> {quote}
> 2020-03-05 17:45:18,241 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Unable to recover container container_e37_1583181000856_0008_01_000043
> java.io.IOException: *Timeout while waiting for exit code from container_e37_1583181000856_0008_01_000043*
>         at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:274)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:631)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2020-03-05 17:45:18,241 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Unable to recover container container_e37_1583181000856_0008_01_000018
> java.io.IOException: Timeout while waiting for exit code from container_e37_1583181000856_0008_01_000018
>         at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:274)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:631)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2020-03-05 17:45:18,242 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Recovered container exited with a non-zero exit code 154
> {quote}
> {quote}
> 2020-03-05 17:45:18,243 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Recovered container exited with a non-zero exit code 154
> {quote}
> After some digging into why the exit file was missing, we identified at the OS level that the running container processes were going down as soon as the NM went down (a couple of commands that make this easy to observe are sketched after this paragraph). The process tree itself looked perfectly fine, as the container-executor forks child processes as expected. We then dug deeper into various parts of the code to see if anything there caused the failure. 
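> As a side note, a quick way to observe this (a sketch; "hadoop-nodemanager" is the unit name in our setup and may differ in yours) is to look at which processes systemd keeps in the NM unit's control group and how the unit is configured to kill them:
> {code:bash}
> # Show the unit's cgroup tree: the container-executor and the container JVMs
> # show up under the same control group as the NM daemon itself.
> systemctl status hadoop-nodemanager
> 
> # Show the effective KillMode of the unit (the default is control-group).
> systemctl show hadoop-nodemanager --property=KillMode
> {code}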
> One question was whether we had broken anything in our internal repo after forking 2.9.2 from open source. We started looking into the code in different areas, such as the NM shutdown hook and cleanup process, the NM state store on container launch, the NM aux services, the container-executor, and the shell launch and cleanup related hooks. Everything there looked fine, as expected. 
> It was finally identified that the hadoop-nodemanager systemd unit was configured with the default KillMode, which is control-group. 
> [https://www.freedesktop.org/software/systemd/man/systemd.kill.html#KillMode=]
> With that setting, systemd sends a termination signal to every process in the unit's control group, i.e. to all child processes including the running containers, as soon as the NM daemon goes down, whether via a stop command or via kill -9. Changing the KillMode fixes this; a sketch of a drop-in override follows. 
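> A minimal sketch of such an override (assuming the unit is named hadoop-nodemanager.service and that KillMode=process, i.e. only the NM JVM itself is signalled on stop, is acceptable; the file name and chosen value are illustrative, this Jira does not prescribe a specific one):
> {code:bash}
> # Create /etc/systemd/system/hadoop-nodemanager.service.d/killmode.conf containing:
> #   [Service]
> #   KillMode=process
> sudo mkdir -p /etc/systemd/system/hadoop-nodemanager.service.d
> printf '[Service]\nKillMode=process\n' | \
>   sudo tee /etc/systemd/system/hadoop-nodemanager.service.d/killmode.conf
> sudo systemctl daemon-reload
> sudo systemctl restart hadoop-nodemanager
> {code}
> With KillMode=process, stopping or restarting the NM no longer signals the container processes.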
> With the KillMode adjusted, NM stateful restart works as expected again. As part of the migration we had moved all daemons from monit to systemd, and this problem appears to have been introduced around that time. 
> I am sharing this information here in the hope that it will be helpful to anyone who runs into a similar problem.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
