[ 
https://issues.apache.org/jira/browse/YARN-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17296034#comment-17296034
 ] 

Peter Bacsko edited comment on YARN-8786 at 3/5/21, 2:14 PM:
-------------------------------------------------------------

[~jlowe] [~ebadger] [~jonbender-stripe] do we still need this JIRA open? Is the 
issue still happening after YARN-9833? (as it turned out that fix is still not 
100% perfect, but very close enough to 100% which makes it acceptable).


was (Author: pbacsko):
[~jlowe] [~ebadger] [~jonbender-stripe] do we still need this JIRA open? Is the 
issue still happening after YARN-9833 (as it turned out that fix is still not 
100% perfect, but very close enough to 100% which makes it acceptable).

> LinuxContainerExecutor fails sporadically in create_local_dirs
> --------------------------------------------------------------
>
>                 Key: YARN-8786
>                 URL: https://issues.apache.org/jira/browse/YARN-8786
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Jon Bender
>            Priority: Major
>
> We started using CGroups with LinuxContainerExecutor recently, running Apache 
> Hadoop 3.0.0. Occasionally (once out of many millions of tasks) a yarn 
> container will fail with a message like the following:
> {code:java}
> [2018-09-02 23:48:02.458691] 18/09/02 23:48:02 INFO container.ContainerImpl: 
> Container container_1530684675517_516620_01_020846 transitioned from 
> SCHEDULED to RUNNING
> [2018-09-02 23:48:02.458874] 18/09/02 23:48:02 INFO 
> monitor.ContainersMonitorImpl: Starting resource-monitoring for 
> container_1530684675517_516620_01_020846
> [2018-09-02 23:48:02.506114] 18/09/02 23:48:02 WARN 
> privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 
> 35. Privileged Execution Operation Stderr:
> [2018-09-02 23:48:02.506159] Could not create container dirsCould not create 
> local files and directories
> [2018-09-02 23:48:02.506220]
> [2018-09-02 23:48:02.506238] Stdout: main : command provided 1
> [2018-09-02 23:48:02.506258] main : run as user is nobody
> [2018-09-02 23:48:02.506282] main : requested yarn user is root
> [2018-09-02 23:48:02.506294] Getting exit code file...
> [2018-09-02 23:48:02.506307] Creating script paths...
> [2018-09-02 23:48:02.506330] Writing pid file...
> [2018-09-02 23:48:02.506366] Writing to tmp file 
> /path/to/hadoop/yarn/local/nmPrivate/application_1530684675517_516620/container_1530684675517_516620_01_020846/container_1530684675517_516620_01_020846.pid.tmp
> [2018-09-02 23:48:02.506389] Writing to cgroup task files...
> [2018-09-02 23:48:02.506402] Creating local dirs...
> [2018-09-02 23:48:02.506414] Getting exit code file...
> [2018-09-02 23:48:02.506435] Creating script paths...
> {code}
> Looking at the container executor source it's traceable to errors here: 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L1604]
>  And ultimately to 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L672]
> The root failure seems to be in the underlying mkdir call, but that exit code 
> / errno is swallowed so we don't have more details. We tend to see this when 
> many containers start at the same time for the same application on a host, 
> and suspect it may be related to some race conditions around those shared 
> directories between containers for the same application.
> For example, this is a typical pattern in the audit logs:
> {code:java}
> [2018-09-07 17:16:38.447654] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root      IP=<> Container Request 
> TARGET=ContainerManageImpl      RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> [2018-09-07 17:16:38.492298] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root      IP=<> Container Request 
> TARGET=ContainerManageImpl      RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012870
> [2018-09-07 17:16:38.614044] 18/09/07 17:16:38 WARN 
> nodemanager.NMAuditLogger: USER=root      OPERATION=Container Finished - 
> Failed   TARGET=ContainerImpl    RESULT=FAILURE  DESCRIPTION=Container failed 
> with state: EXITED_WITH_FAILURE    APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> {code}
> Two containers for the same application starting in quick succession followed 
> by the EXITED_WITH_FAILURE step (exit code 35).
> We plan to upgrade to 3.1.x soon but I don't expect this to be fixed by this, 
> the only major JIRAs that affected the executor since 3.0.0 seem unrelated 
> ([https://github.com/apache/hadoop/commit/bc285da107bb84a3c60c5224369d7398a41db2d8]
>  and 
> [https://github.com/apache/hadoop/commit/a82be7754d74f4d16b206427b91e700bb5f44d56])



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to