[ https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542176#comment-16542176 ]
Jason Lowe commented on YARN-8515:
----------------------------------

Thanks for the patch!

+1 lgtm. I'll commit this tomorrow if there are no objections.

> container-executor can crash with SIGPIPE after nodemanager restart
> -------------------------------------------------------------------
>
>                 Key: YARN-8515
>                 URL: https://issues.apache.org/jira/browse/YARN-8515
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jim Brennan
>            Assignee: Jim Brennan
>            Priority: Major
>              Labels: Docker
>         Attachments: YARN-8515.001.patch
>
>
> When running with docker on large clusters, we have noticed that sometimes
> docker containers are not removed - they remain in the exited state, and the
> corresponding container-executor is no longer running. Upon investigation,
> we noticed that this always seemed to happen after a nodemanager restart.
> The sequence leading to the stranded docker containers is:
> # Nodemanager restarts
> # Containers are recovered and then run for a while
> # Containers are killed for some (legitimate) reason
> # Container-executor exits without removing the docker container.
> After reproducing this on a test cluster, we found that the
> container-executor was exiting due to a SIGPIPE.
> What is happening is that the shell command executor that is used to start
> container-executor has threads reading from c-e's stdout and stderr. When
> the NM is restarted, these threads are killed. Then when the
> container-executor continues executing after the container exits with error,
> it tries to write to stderr (ERRORFILE) and gets a SIGPIPE. Since SIGPIPE is
> not handled, this crashes the container-executor before it can actually
> remove the docker container.
> We ran into this in branch 2.8. The way docker containers are removed has
> been completely redesigned in trunk, so I don't think it will lead to this
> exact failure, but after an NM restart, potentially any write to stderr or
> stdout in the container-executor could cause it to crash.