[ 
https://issues.apache.org/jira/browse/YARN-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reopened YARN-8786:
------------------------------

This should be left open to track the sporadic failure in creating directories.
YARN-8751 may keep this sporadic problem from taking down the NM, but it still
causes the requested container launch to fail.  That should be fixed, and this
JIRA can track that effort.

> LinuxContainerExecutor fails sporadically in create_local_dirs
> --------------------------------------------------------------
>
>                 Key: YARN-8786
>                 URL: https://issues.apache.org/jira/browse/YARN-8786
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Jon Bender
>            Priority: Major
>
> We started using CGroups with LinuxContainerExecutor recently, running Apache 
> Hadoop 3.0.0. Occasionally (once out of many millions of tasks) a yarn 
> container will fail with a message like the following:
> {code:java}
> [2018-09-02 23:48:02.458691] 18/09/02 23:48:02 INFO container.ContainerImpl: 
> Container container_1530684675517_516620_01_020846 transitioned from 
> SCHEDULED to RUNNING
> [2018-09-02 23:48:02.458874] 18/09/02 23:48:02 INFO 
> monitor.ContainersMonitorImpl: Starting resource-monitoring for 
> container_1530684675517_516620_01_020846
> [2018-09-02 23:48:02.506114] 18/09/02 23:48:02 WARN 
> privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 
> 35. Privileged Execution Operation Stderr:
> [2018-09-02 23:48:02.506159] Could not create container dirsCould not create 
> local files and directories
> [2018-09-02 23:48:02.506220]
> [2018-09-02 23:48:02.506238] Stdout: main : command provided 1
> [2018-09-02 23:48:02.506258] main : run as user is nobody
> [2018-09-02 23:48:02.506282] main : requested yarn user is root
> [2018-09-02 23:48:02.506294] Getting exit code file...
> [2018-09-02 23:48:02.506307] Creating script paths...
> [2018-09-02 23:48:02.506330] Writing pid file...
> [2018-09-02 23:48:02.506366] Writing to tmp file 
> /path/to/hadoop/yarn/local/nmPrivate/application_1530684675517_516620/container_1530684675517_516620_01_020846/container_1530684675517_516620_01_020846.pid.tmp
> [2018-09-02 23:48:02.506389] Writing to cgroup task files...
> [2018-09-02 23:48:02.506402] Creating local dirs...
> [2018-09-02 23:48:02.506414] Getting exit code file...
> [2018-09-02 23:48:02.506435] Creating script paths...
> {code}
> Looking at the container executor source, the failure is traceable to errors here: 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L1604]
>  and ultimately to 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L672]
> The root failure seems to be in the underlying mkdir call, but that exit code 
> / errno is swallowed so we don't have more details. We tend to see this when 
> many containers start at the same time for the same application on a host, 
> and suspect it may be related to some race conditions around those shared 
> directories between containers for the same application.
> For example, this is a typical pattern in the audit logs:
> {code:java}
> [2018-09-07 17:16:38.447654] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root      IP=<> Container Request 
> TARGET=ContainerManageImpl      RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> [2018-09-07 17:16:38.492298] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root      IP=<> Container Request 
> TARGET=ContainerManageImpl      RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012870
> [2018-09-07 17:16:38.614044] 18/09/07 17:16:38 WARN 
> nodemanager.NMAuditLogger: USER=root      OPERATION=Container Finished - 
> Failed   TARGET=ContainerImpl    RESULT=FAILURE  DESCRIPTION=Container failed 
> with state: EXITED_WITH_FAILURE    APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> {code}
> Here, two containers for the same application start in quick succession, 
> followed by the EXITED_WITH_FAILURE step (exit code 35).
> We plan to upgrade to 3.1.x soon, but I don't expect the upgrade to fix this; 
> the only major JIRAs that affected the executor since 3.0.0 seem unrelated 
> ([https://github.com/apache/hadoop/commit/bc285da107bb84a3c60c5224369d7398a41db2d8]
>  and 
> [https://github.com/apache/hadoop/commit/a82be7754d74f4d16b206427b91e700bb5f44d56])
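
The description above suspects a race between containers that create the same
shared per-application directories, and notes that the errno from the failing
mkdir call is lost. The following is only a minimal sketch of one way to
address both points, not the actual container-executor code: ensure_dir is a
hypothetical helper name, and treating EEXIST as success is an assumption
about the desired behavior. It accepts an already existing directory (another
container may have won the race) and reports the errno text otherwise.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper, not taken from container-executor.c.
 * Creates a single directory level and tolerates a concurrent creator.
 * Returns 0 if the directory exists when the call returns, -1 otherwise.
 * On failure, errno holds the error reported by mkdir.
 */
static int ensure_dir(const char *path, mode_t perm) {
  if (mkdir(path, perm) == 0) {
    return 0;  /* we created it */
  }

  /* Save errno right away; later calls (stat, fprintf) may overwrite it. */
  int saved_errno = errno;

  if (saved_errno == EEXIST) {
    struct stat sb;
    if (stat(path, &sb) == 0 && S_ISDIR(sb.st_mode)) {
      return 0;  /* someone else created the directory first */
    }
    /* The path exists but is not a directory; fall through and report. */
  }

  /* Surface the underlying cause instead of a generic message. */
  fprintf(stderr, "Could not create directory %s: %s\n",
          path, strerror(saved_errno));
  errno = saved_errno;
  return -1;
}
```

With a helper along these lines, two launches that race on the same
application directory would both succeed, and any other failure would show
the underlying errno text in the NodeManager log next to the exit code.
Ownership and permission checks on a pre-existing directory are out of scope
for this sketch.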



