[ https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539286#comment-16539286 ]
Jim Brennan commented on YARN-8515:
-----------------------------------

Here is an example case that we saw.

Docker ps info for this container:
{noformat}
968e4a1a0fca 90188f3d752e "bash /grid/4/tmp/..." 6 days ago Exited (143) 6 days ago container_e07_1528760012992_2875921_01_000069
{noformat}

NM log, with some added lines from the Docker container and journalctl to show where the docker container started and exited:
{noformat}
2018-06-27 16:32:48,779 [IPC Server handler 9 on 8041] INFO containermanager.ContainerManagerImpl: Start request for container_e07_1528760012992_2875921_01_000069 by user p_condor
2018-06-27 16:32:48,782 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Adding container_e07_1528760012992_2875921_01_000069 to application application_1528760012992_2875921
2018-06-27 16:32:48,783 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_000069 transitioned from NEW to LOCALIZING
2018-06-27 16:32:48,783 [AsyncDispatcher event handler] INFO yarn.YarnShuffleService: Initializing container container_e07_1528760012992_2875921_01_000069
2018-06-27 16:32:48,786 [AsyncDispatcher event handler] INFO localizer.ResourceLocalizationService: Created localizer for container_e07_1528760012992_2875921_01_000069
2018-06-27 16:32:48,786 [LocalizerRunner for container_e07_1528760012992_2875921_01_000069] INFO localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /grid/4/tmp/yarn-local/nmPrivate/container_e07_1528760012992_2875921_01_000069.tokens. Credentials list:
2018-06-27 16:32:52,654 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_000069 transitioned from LOCALIZING to LOCALIZED
2018-06-27 16:32:52,684 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_000069 transitioned from LOCALIZED to RUNNING
2018-06-27 16:32:52,684 [AsyncDispatcher event handler] INFO monitor.ContainersMonitorImpl: Starting resource-monitoring for container_e07_1528760012992_2875921_01_000069
2018-06-27 16:32:53.345 Docker container started
2018-06-27 16:32:54,429 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 103072 for container-id container_e07_1528760012992_2875921_01_000069: 132.5 MB of 3 GB physical memory used; 4.3 GB of 6.3 GB virtual memory used
2018-06-27 16:33:25,422 [main] INFO nodemanager.NodeManager: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NodeManager
STARTUP_MSG:   user = mapred
STARTUP_MSG:   host = gsbl607n22.blue.ygrid.yahoo.com/10.213.59.232
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.8.3.2.1806111934
2018-06-27 16:33:31,140 [main] INFO containermanager.ContainerManagerImpl: Recovering container_e07_1528760012992_2875921_01_000069 in state LAUNCHED with exit code -1000
2018-06-27 16:33:31,140 [main] INFO application.ApplicationImpl: Adding container_e07_1528760012992_2875921_01_000069 to application application_1528760012992_2875921
2018-06-27 16:33:32,771 [main] INFO containermanager.ContainerManagerImpl: Waiting for containers:
2018-06-27 16:33:33,280 [main] INFO containermanager.ContainerManagerImpl: Waiting for containers:
2018-06-27 16:33:33,178 [main] INFO containermanager.ContainerManagerImpl: Waiting for containers:
2018-06-27 16:33:33,776 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_000069 transitioned from NEW to LOCALIZING
2018-06-27 16:33:34,393 [AsyncDispatcher event handler] INFO yarn.YarnShuffleService: Initializing container container_e07_1528760012992_2875921_01_000069
2018-06-27 16:33:34,433 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_000069 transitioned from LOCALIZING to LOCALIZED
2018-06-27 16:33:34,461 [ContainersLauncher #23] INFO nodemanager.ContainerExecutor: Reacquiring container_e07_1528760012992_2875921_01_000069 with pid 103072
2018-06-27 16:33:34,463 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_000069 transitioned from LOCALIZED to RUNNING
2018-06-27 16:33:34,482 [AsyncDispatcher event handler] INFO monitor.ContainersMonitorImpl: Starting resource-monitoring for container_e07_1528760012992_2875921_01_000069
2018-06-27 16:33:35,304 [main] INFO nodemanager.NodeStatusUpdaterImpl: Sending out 598 NM container statuses:
2018-06-27 16:33:35,356 [main] INFO nodemanager.NodeStatusUpdaterImpl: Registering with RM using containers
2018-06-27 16:33:35,902 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 103072 for container-id container_e07_1528760012992_2875921_01_000069: 1.1 GB of 3 GB physical memory used; 4.5 GB of 6.3 GB virtual memory used
2018-06-27 16:34:22,480 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 103072 for container-id container_e07_1528760012992_2875921_01_000069: 1.1 GB of 3 GB physical memory used; 4.5 GB of 6.3 GB virtual memory used
2018-06-27 16:34:25,738 [IPC Server handler 6 on 8041] INFO containermanager.ContainerManagerImpl: Stopping container with container Id: container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:25,739 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_000069 transitioned from RUNNING to KILLING
2018-06-27 16:34:25,739 [AsyncDispatcher event handler] INFO launcher.ContainerLaunch: Cleaning up container container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26.276 Docker container exited
Jun 27 16:34:26 gsbl607n22.blue.ygrid.yahoo.com dockerd-current[17973]: time="2018-06-27T16:34:26.275701908Z" level=error msg="containerd: deleting container" error="exit status 1: \"container 968e4a1a0fcafa7e01a6ab688a86a86615369b644061e90939f60542
Jun 27 16:34:28 gsbl607n22.blue.ygrid.yahoo.com dockerd-current[17973]: time="2018-06-27T16:34:28.470794342Z" level=warning msg="968e4a1a0fcafa7e01a6ab688a86a86615369b644061e90939f60542468dcb1b cleanup: failed to unmount secrets: invalid argument"
2018-06-27 16:34:26,516 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 103072 for container-id container_e07_1528760012992_2875921_01_000069: -1B of 3 GB physical memory used; -1B of 6.3 GB virtual memory used
2018-06-27 16:34:26,769 [ContainersLauncher #23] INFO nodemanager.ContainerExecutor: container_e07_1528760012992_2875921_01_000069 was deactivated
2018-06-27 16:34:26,870 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_000069 transitioned from KILLING to EXITED_WITH_FAILURE
2018-06-27 16:34:26,872 [DeletionService #1] INFO nodemanager.LinuxContainerExecutor: Deleting absolute path : /grid/3/tmp/yarn-local/usercache/p_condor/appcache/application_1528760012992_2875921/container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,872 [DeletionService #3] INFO nodemanager.LinuxContainerExecutor: Deleting absolute path : /grid/5/tmp/yarn-local/usercache/p_condor/appcache/application_1528760012992_2875921/container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,872 [DeletionService #0] INFO nodemanager.LinuxContainerExecutor: Deleting absolute path : /grid/2/tmp/yarn-local/usercache/p_condor/appcache/application_1528760012992_2875921/container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,873 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_000069 transitioned from EXITED_WITH_FAILURE to DONE
2018-06-27 16:34:26,873 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Removing container_e07_1528760012992_2875921_01_000069 from application application_1528760012992_2875921
2018-06-27 16:34:26,873 [AsyncDispatcher event handler] INFO monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,873 [AsyncDispatcher event handler] INFO logaggregation.AppLogAggregatorImpl: Considering container container_e07_1528760012992_2875921_01_000069 for log-aggregation
2018-06-27 16:34:26,873 [AsyncDispatcher event handler] INFO containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1528760012992_2875921
2018-06-27 16:34:26,873 [AsyncDispatcher event handler] INFO yarn.YarnShuffleService: Stopping container container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,875 [DeletionService #0] INFO nodemanager.LinuxContainerExecutor: Deleting absolute path : /grid/4/tmp/yarn-local/usercache/p_condor/appcache/application_1528760012992_2875921/container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,875 [DeletionService #1] INFO nodemanager.LinuxContainerExecutor: Deleting absolute path : /grid/1/tmp/yarn-local/usercache/p_condor/appcache/application_1528760012992_2875921/container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,876 [AsyncDispatcher event handler] WARN containermanager.ContainerManagerImpl: couldn't find container container_e07_1528760012992_2875921_01_000069 while processing FINISH_CONTAINERS event
2018-06-27 16:34:26,877 [DeletionService #1] INFO nodemanager.LinuxContainerExecutor: Deleting absolute path : /grid/0/tmp/yarn-local/usercache/p_condor/appcache/application_1528760012992_2875921/container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:30,557 [Node Status Updater] INFO nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_e07_1528760012992_2875921_01_000069]
{noformat}

Docker state info:
{noformat}
"State": {
    "Status": "exited",
    "Running": false,
    "Paused": false,
    "Restarting": false,
    "OOMKilled": false,
    "Dead": false,
    "Pid": 0,
    "ExitCode": 143,
    "Error": "",
    "StartedAt": "2018-06-27T16:32:53.345303052Z",
    "FinishedAt": "2018-06-27T16:34:26.276037261Z"
},
{noformat}

> container-executor can crash with SIGPIPE after nodemanager restart
> -------------------------------------------------------------------
>
>                 Key: YARN-8515
>                 URL: https://issues.apache.org/jira/browse/YARN-8515
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jim Brennan
>            Assignee: Jim Brennan
>            Priority: Major
>
> When running with docker on large clusters, we have noticed that sometimes docker containers are not removed - they remain in the exited state, and the corresponding container-executor is no longer running. Upon investigation, we noticed that this always seemed to happen after a nodemanager restart.
> The sequence leading to the stranded docker containers is:
> # Nodemanager restarts
> # Containers are recovered and then run for a while
> # Containers are killed for some (legitimate) reason
> # Container-executor exits without removing the docker container.
> After reproducing this on a test cluster, we found that the container-executor was exiting due to a SIGPIPE.
> What is happening is that the shell command executor that is used to start container-executor has threads reading from c-e's stdout and stderr. When the NM is restarted, these threads are killed. Then, when the container-executor continues executing after the container exits with an error, it tries to write to stderr (ERRORFILE) and gets a SIGPIPE. Since SIGPIPE is not handled, this crashes the container-executor before it can actually remove the docker container.
> We ran into this in branch 2.8.
> The way docker containers are removed has been completely redesigned in trunk, so I don't think it will lead to this exact failure, but after an NM restart, potentially any write to stderr or stdout in the container-executor could cause it to crash.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)