[
https://issues.apache.org/jira/browse/YARN-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682284#comment-14682284
]
Anubhav Dhoot commented on YARN-4046:
-------------------------------------
The error in the NodeManager log shows:
{noformat}
2015-08-10 15:14:05,567 ERROR
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch:
Unable to recover container container_e45_1439244348718_0001_01_000001
java.io.IOException: Timeout while waiting for exit code from
container_e45_1439244348718_0001_01_000001
at
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:199)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}
Stepping through with a debugger, the shell command used to check whether the
container is alive fails because the kill syntax "kill -0 -20773" is rejected:
{noformat}
his = {org.apache.hadoop.util.Shell$ShellCommandExecutor@6740} "kill -0 -20773 "
builder = {java.lang.ProcessBuilder@6789}
command = {java.util.ArrayList@6813} size = 3
directory = null
environment = null
redirectErrorStream = false
redirects = null
timeOutTimer = null
timeoutTimerTask = null
errReader = {java.io.BufferedReader@6830}
inReader = {java.io.BufferedReader@6833}
errMsg = {java.lang.StringBuffer@6836} "kill: invalid option -- '2'\n\nUsage:\n
kill [options] <pid> [...]\n\nOptions:\n <pid> [...] send signal to
every <pid> listed\n -<signal>, -s, --signal <signal>\n
specify the <signal> to be sent\n -l, --list=[<signal>] list all signal names,
or convert one to a name\n -L, --table list all signal names in a
nice table\n\n -h, --help display this help and exit\n -V, --version
output version information and exit\n\nFor more details see kill(1).\n"
errThread = {org.apache.hadoop.util.Shell$1@6839} "Thread[Thread-102,5,]"
line = null
exitCode = 1
completed = {java.util.concurrent.atomic.AtomicBoolean@6806} "true"
{noformat}
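The failure comes down to how the distro's {{kill}} parses a negative PID (the process-group form) after the signal option. A minimal sketch, assuming a getopt-style {{kill}} where {{--}} ends option parsing (the PGID lookup here is illustrative, not what the NodeManager actually runs):

```shell
# Our own process group id, purely for illustration.
pgid=$(ps -o pgid= -p $$ | tr -d ' ')

# Ambiguous form, as issued by the NodeManager: some kill
# implementations parse "-<pgid>" as an option string and
# reject it with "kill: invalid option", as seen in the dump above.
kill -0 -"$pgid" 2>/dev/null && echo "ambiguous form accepted" \
                             || echo "ambiguous form rejected"

# Unambiguous form: "--" marks the end of options, so "-<pgid>"
# is read as an operand (send signal 0 to the whole process group).
kill -0 -- -"$pgid" && echo "process group is alive"
```

Whether the ambiguous form is accepted depends on which {{kill}} runs (shell builtin vs. /bin/kill) and its implementation, which is why the behavior varies across distros.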
This causes DefaultContainerExecutor#containerIsAlive to catch the
ExitCodeException thrown by ShellCommandExecutor.execute, making it assume the
container is lost.
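This is why a syntax error gets read as a lost container: the liveness check treats any nonzero exit from the probe as "process gone". A hypothetical sketch of that logic in shell ({{probe}} is an illustrative stand-in, not NodeManager code):

```shell
# Sketch: any nonzero exit from the kill probe is read as "dead",
# so a usage error ("invalid option") is indistinguishable from a
# genuinely exited process group.
probe() {
  if kill -0 "$1" 2>/dev/null; then
    echo "alive"
  else
    echo "assumed dead"
  fi
}

probe $$         # a live pid: prints "alive"
probe -20773x    # malformed argument: also prints "assumed dead",
                 # exactly as a dead container would
```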
> Applications fail on NM restart on some Linux distros because NM container
> recovery declares AM container as LOST
> ----------------------------------------------------------------------------------------------------------------
>
> Key: YARN-4046
> URL: https://issues.apache.org/jira/browse/YARN-4046
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Anubhav Dhoot
> Assignee: Anubhav Dhoot
> Priority: Critical
>
> On a Debian machine we have seen NodeManager recovery of containers fail
> because the kill syntax for signaling a process group may not be supported.
> Errors while checking whether the process is alive during container recovery
> cause the container to be declared LOST (154) on a NodeManager restart.
> The application then fails with the error
> {noformat}
> Application application_1439244348718_0001 failed 1 times due to Attempt
> recovered after RM restartAM Container for
> appattempt_1439244348718_0001_000001 exited with exitCode: 154
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)