[
https://issues.apache.org/jira/browse/YARN-10205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anil Sadineni resolved YARN-10205.
----------------------------------
Resolution: Not A Problem
> NodeManager stateful restart feature did not work as expected - information
> only (Resolved)
> -------------------------------------------------------------------------------------------
>
> Key: YARN-10205
> URL: https://issues.apache.org/jira/browse/YARN-10205
> Project: Hadoop YARN
> Issue Type: Test
> Components: graceful, nodemanager, rolling upgrade, yarn
> Reporter: Anil Sadineni
> Priority: Major
>
> *TL;DR* This is an information-only Jira on the stateful restart of the Node
> Manager feature. The unexpected behavior of this feature was due to systemd
> process configuration in this case. Please read below for more details -
> Stateful restart of the Node Manager (YARN-1336) was introduced in Hadoop 2.6.
> This feature worked as expected for us in Hadoop 2.6. Recently we upgraded our
> clusters from 2.6 to 2.9.2, along with some OS upgrades. This feature was
> broken after the upgrade. One of the initial suspicions was
> LinuxContainerExecutor, as we started using it in this upgrade.
> yarn-site.xml has all required configurations to enable this feature -
> {{yarn.nodemanager.recovery.enabled: 'true'}}
> {{yarn.nodemanager.recovery.dir:'<nm_recovery_dir>'}}
> {{yarn.nodemanager.recovery.supervised: 'true'}}
> {{yarn.nodemanager.address: '0.0.0.0:8041'}}
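> For reference, the same settings expressed as yarn-site.xml properties; the
> recovery directory value is a placeholder path, to be replaced with the actual
> {{<nm_recovery_dir>}} used on the cluster:
> {code:xml}
> <property>
>   <name>yarn.nodemanager.recovery.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.nodemanager.recovery.dir</name>
>   <!-- placeholder for the NM recovery directory -->
>   <value>/path/to/nm_recovery_dir</value>
> </property>
> <property>
>   <name>yarn.nodemanager.recovery.supervised</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.nodemanager.address</name>
>   <value>0.0.0.0:8041</value>
> </property>
> {code}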
> While containers were running and the NM was restarted, the exception below
> was constantly observed in the Node Manager logs -
> {quote}
> 2020-03-05 17:45:18,241 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Unable to recover container container_e37_1583181000856_0008_01_000043
> java.io.IOException: *Timeout while waiting for exit code from container_e37_1583181000856_0008_01_000043*
> at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:274)
> at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:631)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2020-03-05 17:45:18,241 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Unable to recover container container_e37_1583181000856_0008_01_000018
> java.io.IOException: Timeout while waiting for exit code from container_e37_1583181000856_0008_01_000018
> at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:274)
> at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:631)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2020-03-05 17:45:18,242 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Recovered container exited with a non-zero exit code 154
> 2020-03-05 17:45:18,243 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Recovered container exited with a non-zero exit code 154
> {quote}
> After some digging into what was causing the exit code file to go missing, we
> identified at the OS level that running container processes were going down as
> soon as the NM went down. The process tree looked perfectly fine, as the
> container-executor forks child processes as expected. We dug deeper into
> various parts of the code to see if anything had caused the failure.
> One question was whether we had broken anything in our internal repo after
> forking 2.9.2 from open source. We started looking into the code in different
> areas, such as the NM shutdown hook and cleanup process, the NM state store on
> container launch, NM aux services, the container-executor, and the Shell
> launch and cleanup hooks. Everything looked fine, as expected.
> It was identified that the hadoop-nodemanager systemd unit was configured with
> the default KillMode, which is control-group.
> [https://www.freedesktop.org/software/systemd/man/systemd.kill.html#KillMode=]
> This caused systemd to send a terminate signal to all child processes as soon
> as the NM daemon went down, whether through a stop command or via kill -9.
> After adjusting the KillMode, NM stateful restart works as expected. As part
> of the migration we had moved all daemons from monit to systemd, and this bug
> appears to have been introduced around that time.
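> For illustration, a minimal systemd drop-in that keeps container processes
> alive across an NM restart could look like the sketch below. The unit name,
> drop-in path, and the choice of KillMode=process are assumptions for this
> example; the ticket only states that the default control-group mode was the
> cause.
> {code}
> # Hypothetical drop-in: /etc/systemd/system/hadoop-nodemanager.service.d/override.conf
> [Service]
> # The default, KillMode=control-group, signals every process in the unit's
> # cgroup (including running containers) when the NM stops.
> # KillMode=process signals only the main NM daemon, so container processes
> # keep running and the restarted NM can reacquire them.
> KillMode=process
> {code}
> After adding such a drop-in, a systemctl daemon-reload and a restart of the
> unit would apply the change.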
> I am sharing this information here so that it will be helpful if anyone goes
> through a similar problem.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)