[jira] [Created] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart

Jim Brennan (JIRA) Tue, 10 Jul 2018 14:52:25 -0700

Jim Brennan created YARN-8515:
---------------------------------

             Summary: container-executor can crash with SIGPIPE after nodemanager restart
                 Key: YARN-8515
                 URL: https://issues.apache.org/jira/browse/YARN-8515
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Jim Brennan
            Assignee: Jim Brennan



When running with docker on large clusters, we have noticed that docker 
containers are sometimes not removed; they remain in the exited state, and the 
corresponding container-executor is no longer running.  Upon investigation, we 
found that this always seemed to happen after a nodemanager restart.  The 
sequence leading to the stranded docker containers is:
 # Nodemanager restarts
 # Containers are recovered and then run for a while
 # Containers are killed for some (legitimate) reason
 # Container-executor exits without removing the docker container.

After reproducing this on a test cluster, we found that the container-executor 
was exiting due to a SIGPIPE.

What is happening is that the shell command executor that is used to start 
container-executor has threads reading from the container-executor's stdout and 
stderr.  When the NM is restarted, these threads die with it, so the pipes are 
left with no reader.  Then, when the container-executor continues executing 
after the container exits with an error, it tries to write to stderr 
(ERRORFILE) and receives a SIGPIPE.  Since SIGPIPE is not handled, its default 
action terminates the container-executor before it can actually remove the 
docker container.
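
To illustrate the failure mode, here is a minimal sketch in C.  It is not the 
container-executor code and not necessarily the fix that will be applied; the 
function names ignore_sigpipe and write_to_orphaned_pipe are hypothetical.  It 
assumes that one way to avoid the crash is to ignore SIGPIPE, in which case a 
write to a pipe with no reader fails with EPIPE instead of killing the process.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: ignore SIGPIPE so that writing to a pipe with no
 * reader returns -1 with errno == EPIPE instead of terminating the process.
 * Returns 0 on success, -1 on failure (as sigaction does).
 */
int ignore_sigpipe(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGPIPE, &sa, NULL);
}

/*
 * Hypothetical demo: simulate the state after an NM restart by creating a
 * pipe and closing its read end before writing to it.
 * Returns 0 if the write succeeded, the errno value if the write failed,
 * or -1 if the pipe could not be created.
 */
int write_to_orphaned_pipe(void) {
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }
    close(fds[0]); /* the reader is gone, as after the NM restart */

    const char msg[] = "container exited with error\n";
    ssize_t n = write(fds[1], msg, sizeof(msg) - 1);
    int err = (n < 0) ? errno : 0;

    close(fds[1]);
    return err;
}
```

If ignore_sigpipe() is called first, write_to_orphaned_pipe() returns EPIPE and 
the caller can carry on with cleanup.  Without that call, the same write 
raises SIGPIPE, and the default action ends the process at that point.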

We ran into this in branch 2.8.  The way docker containers are removed has been 
completely redesigned in trunk, so I don't think it will lead to this exact 
failure, but after an NM restart, potentially any write to stderr or stdout in 
the container-executor could cause it to crash.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
