Jim Brennan created YARN-8515:
---------------------------------
Summary: container-executor can crash with SIGPIPE after nodemanager restart
Key: YARN-8515
URL: https://issues.apache.org/jira/browse/YARN-8515
Project: Hadoop YARN
Issue Type: Bug
Reporter: Jim Brennan
Assignee: Jim Brennan
When running with docker on large clusters, we have noticed that sometimes
docker containers are not removed - they remain in the exited state, and the
corresponding container-executor is no longer running. Upon investigation, we
noticed that this always seemed to happen after a nodemanager restart. The
sequence leading to the stranded docker containers is:
# Nodemanager restarts
# Containers are recovered and then run for a while
# Containers are killed for some (legitimate) reason
# Container-executor exits without removing the docker container.
After reproducing this on a test cluster, we found that the container-executor
was exiting due to a SIGPIPE.
The shell command executor that the NM uses to start container-executor has
threads reading from c-e's stdout and stderr. When the NM is restarted, these
threads are killed, so nothing is left reading from the other end of those
pipes. When the container later exits with an error, the container-executor
tries to write to stderr (ERRORFILE) and receives a SIGPIPE. Since
container-executor does not handle SIGPIPE, the default action terminates the
process before it can remove the docker container.
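
To illustrate the failure mode, here is a minimal C sketch of one possible
mitigation: ignoring SIGPIPE so that a write to a pipe with no reader fails
with EPIPE instead of killing the process. This is not taken from the
container-executor source and is not necessarily the fix that will be applied;
the function names ignore_sigpipe, write_after_reader_gone and demo are
hypothetical, and the pipe here only stands in for the stdout/stderr pipes
that the NM reader threads had been draining.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: set SIGPIPE to be ignored, so that writing to a
 * pipe whose reader is gone returns -1 with errno == EPIPE instead of
 * terminating the process. Returns 0 on success, -1 on failure. */
int ignore_sigpipe(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGPIPE, &sa, NULL);
}

/* Simulates the state after an NM restart: the read end of the pipe is
 * closed (the reader threads are gone) and the process then writes to
 * the write end. Returns 0 if the write succeeded, the errno value if
 * it failed, or -1 if the pipe could not be created. */
int write_after_reader_gone(void) {
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }
    close(fds[0]); /* no reader left on this pipe */

    static const char msg[] = "container exited with error\n";
    ssize_t n = write(fds[1], msg, sizeof(msg) - 1);
    int err = (n < 0) ? errno : 0;

    close(fds[1]);
    return err;
}

/* Usage sketch: with SIGPIPE ignored, the failed write is reported as
 * EPIPE and the caller is still alive to continue with its cleanup. */
void demo(void) {
    int rc = ignore_sigpipe();
    assert(rc == 0);

    int err = write_after_reader_gone();
    assert(err == EPIPE);
}
```

Without the call to ignore_sigpipe(), the same write would deliver SIGPIPE and,
under the default disposition, terminate the process at that point, which
matches the behavior described above.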
We ran into this in branch 2.8. The way docker containers are removed has been
completely redesigned in trunk, so I don't think it will lead to this exact
failure, but after an NM restart, potentially any write to stderr or stdout in
the container-executor could cause it to crash.