[ 
https://issues.apache.org/jira/browse/YARN-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397068#comment-16397068
 ] 

Jason Lowe commented on YARN-7999:
----------------------------------

The container executor should always create at least one log directory.  
{{launch_docker_container_as_user}} calls {{create_local_dirs}} which in turn 
calls {{create_container_dirs}} and that creates the container log directories 
(well, at least one after YARN-7590).  Docker has always been mounting the log 
directories, even before YARN-7815, so I can't readily explain how this failure 
is new.
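
To illustrate the "at least one log directory" behavior described above, here is a minimal sketch in Python rather than the C of the actual container executor. The function name {{create_container_log_dirs}}, its parameters, and the directory layout are hypothetical and chosen only for illustration; this is not the real {{create_container_dirs}} implementation.

```python
import os


def create_container_log_dirs(log_roots, app_id, container_id):
    """Try to create the container log dir under every configured log root.

    Hypothetical sketch: succeeds as long as at least one directory can be
    created, and fails only when every log root is unusable.
    """
    created = []
    for root in log_roots:
        path = os.path.join(root, app_id, container_id)
        try:
            os.makedirs(path, exist_ok=True)
            created.append(path)
        except OSError:
            # A bad log root (missing disk, wrong type, no permission)
            # is skipped rather than failing the whole launch.
            continue
    if not created:
        raise RuntimeError("could not create any container log directory")
    return created
```

Under this assumption, a launch would only fail at this step if none of the configured log roots were usable on that node.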

Are you able to run any containers on trunk, entry point or not, and with or 
without this patch?  Do you have any details on how to reproduce this?  We were 
able to readily reproduce the original failure described in the JIRA, but we 
cannot reproduce this new failure mode, and I cannot explain it based on how 
the container executor code in trunk is written.

bq. The container attempted on the faulty node, and initialized logging 
directory on the faulty node. When the same attempt is started on other nodes, 
it does not initialize logging directory on other node which leads to the 
failure.

There's normally no state shared between nodes, so I can't explain how a faulty 
node could change the container initializing behavior on another node unless 
they are sharing NM directories via NFS or a similarly odd setup.  Do you have 
any idea how one node's failure could affect the behavior of other nodes?


> Docker launch fails when user private filecache directory is missing
> --------------------------------------------------------------------
>
>                 Key: YARN-7999
>                 URL: https://issues.apache.org/jira/browse/YARN-7999
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.0
>            Reporter: Eric Yang
>            Assignee: Jason Lowe
>            Priority: Major
>         Attachments: YARN-7999.001.patch, YARN-7999.002.patch
>
>
> Docker container is failing to launch in trunk.  The root cause is:
> {code}
> [COMPINSTANCE sleeper-1 : container_1520032931921_0001_01_000020]: 
> [2018-03-02 23:26:09.196]Exception from container-launch.
> Container id: container_1520032931921_0001_01_000020
> Exit code: 29
> Exception message: image: hadoop/centos:latest is trusted in hadoop registry.
> Could not determine real path of mount 
> '/tmp/hadoop-yarn/nm-local-dir/usercache/hbase/filecache'
> Could not determine real path of mount 
> '/tmp/hadoop-yarn/nm-local-dir/usercache/hbase/filecache'
> Invalid docker mount 
> '/tmp/hadoop-yarn/nm-local-dir/usercache/hbase/filecache:/tmp/hadoop-yarn/nm-local-dir/usercache/hbase/filecache',
>  realpath=/tmp/hadoop-yarn/nm-local-dir/usercache/hbase/filecache
> Error constructing docker command, docker error code=12, error 
> message='Invalid docker mount'
> Shell output: main : command provided 4
> main : run as user is hbase
> main : requested yarn user is hbase
> Creating script paths...
> Creating local dirs...
> [2018-03-02 23:26:09.240]Diagnostic message from attempt 0 : [2018-03-02 
> 23:26:09.240]
> [2018-03-02 23:26:09.240]Container exited with a non-zero exit code 29.
> [2018-03-02 23:26:39.278]Could not find 
> nmPrivate/application_1520032931921_0001/container_1520032931921_0001_01_000020//container_1520032931921_0001_01_000020.pid
>  in any of the directories
> [COMPONENT sleeper]: Failed 11 times, exceeded the limit - 10. Shutting down 
> now...
> {code}
> The filecache cannot be mounted because it doesn't exist.
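
The log above suggests that a mount is rejected when the real path of its source cannot be determined. A minimal sketch of that kind of check follows, written in Python rather than the C of the container executor. The names {{resolve_mount_source}} and {{validate_mount}} are hypothetical, and the logic is an assumption inferred from the error messages, not the actual implementation.

```python
import os


def resolve_mount_source(path):
    """Return the canonical path of an existing directory, else None.

    Mirrors the general behavior of C realpath(3), which fails when the
    path does not exist.
    """
    if not os.path.isdir(path):
        return None
    return os.path.realpath(path)


def validate_mount(source, dest):
    """Build a 'source:dest' mount string, rejecting unresolvable sources."""
    real = resolve_mount_source(source)
    if real is None:
        # Corresponds loosely to "Invalid docker mount" in the log above.
        raise ValueError(f"Invalid docker mount '{source}:{dest}'")
    return f"{real}:{dest}"
```

With a check of this shape, a user filecache directory that has not been created yet would cause the mount to be rejected before the docker command is ever built.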



