[
https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jim Brennan updated YARN-8515:
------------------------------
Attachment: YARN-8515.001.patch
> container-executor can crash with SIGPIPE after nodemanager restart
> -------------------------------------------------------------------
>
> Key: YARN-8515
> URL: https://issues.apache.org/jira/browse/YARN-8515
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Jim Brennan
> Assignee: Jim Brennan
> Priority: Major
> Labels: Docker
> Attachments: YARN-8515.001.patch
>
>
> When running with docker on large clusters, we have noticed that sometimes
> docker containers are not removed - they remain in the exited state, and the
> corresponding container-executor is no longer running. Upon investigation,
> we noticed that this always seemed to happen after a nodemanager restart.
> The sequence leading to the stranded docker containers is:
> # Nodemanager restarts
> # Containers are recovered and then run for a while
> # Containers are killed for some (legitimate) reason
> # Container-executor exits without removing the docker container.
> After reproducing this on a test cluster, we found that the
> container-executor was exiting due to a SIGPIPE.
> What is happening is that the shell command executor used to start
> container-executor has threads reading from c-e's stdout and stderr. When
> the NM is restarted, these threads are killed, so nothing is reading from
> the other end of those pipes. Then, when the container-executor continues
> executing after the container exits with an error, it tries to write to
> stderr (ERRORFILE) and gets a SIGPIPE. Since SIGPIPE is not handled, this
> crashes the container-executor before it can actually remove the docker
> container.
> We ran into this in branch 2.8. The way docker containers are removed has
> been completely redesigned in trunk, so I don't think it will lead to this
> exact failure, but after an NM restart, potentially any write to stderr or
> stdout in the container-executor could cause it to crash.
>
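The failure mode described above can be reproduced outside of Hadoop. The
sketch below is not the attached patch and not container-executor code;
write_after_reader_gone is a hypothetical helper written only for
illustration. It forks a child that stands in for container-executor and has
it write to a pipe that no longer has a reader. With the default SIGPIPE
disposition the child is killed by the signal. With SIGPIPE ignored (one
common mitigation, assumed here for illustration rather than taken from the
patch), write() fails with EPIPE instead and the child can carry on with its
cleanup.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Hadoop).
 * Returns the number of the signal that killed the child, 0 if the child
 * survived the write and saw EPIPE, or -1 on any other outcome.
 */
int write_after_reader_gone(int ignore_sigpipe) {
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }

    /* Close the read end before forking so that no process holds it.
     * This stands in for the NM reader threads having gone away. */
    close(fds[0]);

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: plays the role of container-executor. */
        signal(SIGPIPE, ignore_sigpipe ? SIG_IGN : SIG_DFL);

        const char msg[] = "container exited with error\n";
        ssize_t n = write(fds[1], msg, sizeof(msg) - 1);

        /* Only reached if SIGPIPE did not terminate the process. */
        _exit((n == -1 && errno == EPIPE) ? 0 : 1);
    }

    /* Parent: report how the child ended. */
    close(fds[1]);
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }
    if (WIFSIGNALED(status)) {
        return WTERMSIG(status);
    }
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        return 0;
    }
    return -1;
}
```

Calling write_after_reader_gone(0) should return SIGPIPE, which matches the
crash described in the report, while write_after_reader_gone(1) should
return 0, meaning the writer survived and could have gone on to remove the
docker container.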