Anubhav Dhoot commented on YARN-4046:

The NodeManager log shows the following error:
2015-08-10 15:14:05,567 ERROR 
 Unable to recover container container_e45_1439244348718_0001_01_000001
java.io.IOException: Timeout while waiting for exit code from 
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.lang.Thread.run(Thread.java:745)

Under the debugger, the shell command that checks whether the container is 
alive fails: kill rejects the invocation "kill -0 -20773" because it parses 
the negative process group ID as an option (see errMsg in the dump below).
this = {org.apache.hadoop.util.Shell$ShellCommandExecutor@6740} "kill -0 -20773 "
builder = {java.lang.ProcessBuilder@6789} 
 command = {java.util.ArrayList@6813}  size = 3
 directory = null
 environment = null
 redirectErrorStream = false
 redirects = null
timeOutTimer = null
timeoutTimerTask = null
errReader = {java.io.BufferedReader@6830} 
inReader = {java.io.BufferedReader@6833} 
errMsg = {java.lang.StringBuffer@6836} "kill: invalid option -- '2'\n\nUsage:\n 
kill [options] <pid> [...]\n\nOptions:\n <pid> [...]            send signal to 
every <pid> listed\n -<signal>, -s, --signal <signal>\n                        
specify the <signal> to be sent\n -l, --list=[<signal>]  list all signal names, 
or convert one to a name\n -L, --table            list all signal names in a 
nice table\n\n -h, --help     display this help and exit\n -V, --version  
output version information and exit\n\nFor more details see kill(1).\n"
errThread = {org.apache.hadoop.util.Shell$1@6839} "Thread[Thread-102,5,]"
line = null
exitCode = 1
completed = {java.util.concurrent.atomic.AtomicBoolean@6806} "true"

Because kill exits with code 1, ShellCommandExecutor.execute throws 
ExitCodeException. DefaultContainerExecutor#containerIsAlive catches it and 
treats it as "process not alive", so the container is assumed to be lost.
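
To illustrate the difference in syntax, here is a minimal sketch of the two 
ways the check-alive command could be built. The class and method names are 
hypothetical and are not the Hadoop Shell API. The second form assumes that 
the platform's kill honors "--" as the end-of-options marker, which would let 
the negative process group ID be read as a target rather than an option. This 
is one possible workaround, not necessarily the fix that was committed.

```java
public class KillCommandSketch {

    /**
     * Builds "kill -0 -<pgid>", the form seen in the debugger dump.
     * Some kill implementations try to parse "-<pgid>" as an option.
     */
    static String[] ambiguousCheckAlive(String pgid) {
        return new String[] {"kill", "-0", "-" + pgid};
    }

    /**
     * Builds "kill -0 -- -<pgid>". Assumption: kill treats "--" as the
     * end of options, so "-<pgid>" is taken as a process group target.
     */
    static String[] explicitCheckAlive(String pgid) {
        return new String[] {"kill", "-0", "--", "-" + pgid};
    }

    public static void main(String[] args) {
        // Print both command lines for the process group from the dump.
        System.out.println(String.join(" ", ambiguousCheckAlive("20773")));
        System.out.println(String.join(" ", explicitCheckAlive("20773")));
    }
}
```

The sketch only constructs the argument arrays and does not run kill, so it 
can be tried safely on any machine.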

> Applications fail on NM restart on some linux distro because NM container 
> recovery declares AM container as LOST
> ----------------------------------------------------------------------------------------------------------------
>                 Key: YARN-4046
>                 URL: https://issues.apache.org/jira/browse/YARN-4046
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Anubhav Dhoot
>            Assignee: Anubhav Dhoot
>            Priority: Critical
> On a Debian machine we have seen NodeManager recovery of containers fail 
> because the signal syntax for a process group may not work. The check for 
> whether the process is alive fails during container recovery, which causes 
> the container to be declared LOST (154) on a NodeManager restart.
> The application then fails with the error:
> {noformat}
> Application application_1439244348718_0001 failed 1 times due to Attempt 
> recovered after RM restartAM Container for 
> appattempt_1439244348718_0001_000001 exited with exitCode: 154
> {noformat}
