Anubhav Dhoot created YARN-4046:
-----------------------------------
Summary: NM container recovery is broken on some linux distro
because of syntax of signal
Key: YARN-4046
URL: https://issues.apache.org/jira/browse/YARN-4046
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
Priority: Critical
On a Debian machine we have seen NodeManager recovery of containers fail
because the signal syntax used for a process group may not work. The check
for whether the process is alive fails during container recovery, which
causes the container to be declared LOST (154) on a NodeManager restart.
The application then fails with the error:
{noformat}
Application application_1439244348718_0001 failed 1 times due to Attempt
recovered after RM restartAM Container for appattempt_1439244348718_0001_000001
exited with exitCode: 154
{noformat}
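To illustrate the kind of process-group liveness check described above, here is a minimal sketch. It assumes the check is done by sending signal 0 to a negative PID through the bash builtin `kill`, with `--` placed before the argument so that `-<pgid>` is read as a process group and not as an option. The class and method names (ProcessGroupSignal, buildCommand, isGroupAlive) are hypothetical and are not taken from the NodeManager code. Whether this matches the fix eventually adopted for this issue is also an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not the NodeManager's actual code.
final class ProcessGroupSignal {
    private ProcessGroupSignal() {}

    // Builds a shell command that sends `signal` to the process group `pgid`.
    static List<String> buildCommand(int signal, long pgid) {
        if (signal < 0 || pgid <= 0) {
            throw new IllegalArgumentException("signal must be >= 0 and pgid > 0");
        }
        // "--" ends option parsing, so "-<pgid>" cannot be mistaken for a flag.
        // Running through "bash -c" selects the bash builtin kill (assumption:
        // bash is installed on the host).
        return Arrays.asList("bash", "-c", "kill -" + signal + " -- -" + pgid);
    }

    // Returns true if signal 0 could be delivered to the group, i.e. at least
    // one process in it exists and the caller is permitted to signal it.
    static boolean isGroupAlive(long pgid) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(0, pgid))
                .redirectErrorStream(true)
                .start();
        // Drain and discard any output so the child never blocks on a full pipe.
        try (InputStream in = p.getInputStream()) {
            while (in.read() != -1) {
                // discard
            }
        }
        return p.waitFor() == 0;
    }
}
```

For example, `ProcessGroupSignal.buildCommand(0, 1234L)` yields the command `bash -c "kill -0 -- -1234"`. A non-zero exit status from that command would be treated as "not alive", which is the condition that leads to a container being declared LOST in the scenario above.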
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)