[ 
https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15957837#comment-15957837
 ] 

Eric Badger commented on YARN-6091:
-----------------------------------

I think the issue is that the container-executor is calling pclose() before the 
command has finished. According to 
http://man7.org/linux/man-pages/man3/pclose.3p.html, pclose() closes the 
stream before waiting for the command started by popen() to terminate. So if 
that command is still running when we call pclose() and then tries to write to 
stdout, it will receive SIGPIPE (signal number 13), and pclose() will report 
that in its return status. Since this depends on how long the command takes to 
execute, it is non-deterministic between nodes and won't always produce the 
SIGPIPE. 
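
A minimal sketch of the behavior described above, assuming a POSIX system. 
The helper name pclose_without_reading is hypothetical and is not taken from 
container-executor.c; it only shows popen() followed by an immediate pclose() 
with no reads in between. 

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/*
 * Hypothetical helper, not taken from container-executor.c.
 * Starts cmd with popen() and calls pclose() right away without reading
 * any of its output. Returns the raw wait status from pclose(), or -1 if
 * popen() or pclose() fails.
 */
int pclose_without_reading(const char *cmd) {
    FILE *stream = popen(cmd, "r");
    if (stream == NULL) {
        return -1;
    }
    /* pclose() closes the read end of the pipe before waiting. A command
     * that is still writing to stdout gets SIGPIPE on a later write. */
    return pclose(stream);
}
```

With a command that keeps writing, such as "yes", the returned status should 
be nonzero. A command that finishes writing before pclose() is reached may 
still return 0, which would be consistent with the differences between nodes. 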

I believe a fix would be to redirect stdout to /dev/null in the instances of 
popen() where we do not read the stdout of the command. I have already 
verified locally that the stderr of the underlying command from popen() is 
passed to the stderr of the process running popen(). As a result, the 
command's stderr will be printed to the stderr of the NM, but I think that is 
better than silently swallowing stderr altogether. 
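
The redirect idea could look roughly like the sketch below. This is not the 
actual patch: run_discarding_stdout and the 4096-byte buffer are assumptions 
made for illustration, and the plain "> /dev/null" suffix only covers a 
single simple command. 

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/*
 * Hypothetical helper, not the actual YARN-6091 patch.
 * Runs cmd through popen() with its stdout sent to /dev/null, so the
 * command never writes into the pipe that pclose() closes. stderr is left
 * alone and still goes to the stderr of the calling process.
 * Returns the raw wait status from pclose(), or -1 on failure.
 */
int run_discarding_stdout(const char *cmd) {
    char full_cmd[4096]; /* buffer size is an arbitrary assumption */
    int n = snprintf(full_cmd, sizeof(full_cmd), "%s > /dev/null", cmd);
    if (n < 0 || (size_t)n >= sizeof(full_cmd)) {
        return -1; /* formatting error or command too long */
    }
    FILE *stream = popen(full_cmd, "r");
    if (stream == NULL) {
        return -1;
    }
    return pclose(stream);
}
```

With the redirect in place, a command that prints a lot of output, for 
example "seq 1 100000", should exit normally instead of being hit by SIGPIPE. 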

I'll test out this approach locally and then put up a patch once I'm done. 

> the AppMaster register failed when use Docker on LinuxContainer 
> ----------------------------------------------------------------
>
>                 Key: YARN-6091
>                 URL: https://issues.apache.org/jira/browse/YARN-6091
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, yarn
>    Affects Versions: 2.8.1
>         Environment: CentOS
>            Reporter: zhengchenyu
>            Priority: Critical
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> On some servers, when I use Docker on LinuxContainer, I found that the 
> AppMaster failed to register with the ResourceManager. This didn't happen on 
> other servers. 
> I found that pclose (in container-executor.c) returns different values on 
> different servers, even though the process launched by popen is running 
> normally. Some servers return 0, and others return 13. 
> Because YARN regards the application as failed when pclose returns a 
> nonzero value, YARN removes the AMRMToken, and the AppMaster registration 
> then fails because the ResourceManager has removed this application's token. 
> In container-executor.c, the condition checked is whether the return code 
> is zero. But according to the pclose man page, a return value of -1 
> indicates an error. So I changed the condition, which solved the 
> problem. 
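
For reference, the value returned by pclose() is a wait status rather than a 
plain exit code, so it can be decoded with the standard macros from 
<sys/wait.h>. The sketch below is only an illustration; the cmd_outcome enum 
and classify_pclose_status are hypothetical names that do not appear in 
container-executor.c. 

```c
#include <sys/wait.h>

/* Hypothetical classification of a pclose() return value. */
typedef enum {
    CMD_EXITED,        /* command exited; detail holds the exit code */
    CMD_SIGNALED,      /* command was killed; detail holds the signal */
    CMD_PCLOSE_FAILED, /* pclose() itself returned -1 */
    CMD_OTHER          /* any other wait status */
} cmd_outcome;

/* Decodes the status returned by pclose() into an outcome plus a detail. */
cmd_outcome classify_pclose_status(int status, int *detail) {
    *detail = 0;
    if (status == -1) {
        return CMD_PCLOSE_FAILED;
    }
    if (WIFEXITED(status)) {
        *detail = WEXITSTATUS(status);
        return CMD_EXITED;
    }
    if (WIFSIGNALED(status)) {
        *detail = WTERMSIG(status);
        return CMD_SIGNALED;
    }
    return CMD_OTHER;
}
```

On Linux, a raw status of 13 decodes as CMD_SIGNALED with detail 13 
(SIGPIPE), while a raw status of 0 decodes as CMD_EXITED with detail 0. A 
check that only treats -1 as failure would therefore also accept a command 
that was killed by a signal. 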



