[ https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539286#comment-16539286 ]

Jim Brennan commented on YARN-8515:
-----------------------------------

Here is an example of a case that we saw.
Docker ps info for this container:
{noformat}
968e4a1a0fca 90188f3d752e "bash /grid/4/tmp/..." 6 days ago Exited (143) 6 days 
ago container_e07_1528760012992_2875921_01_000069
{noformat}
NM log, interleaved with Docker container and journalctl entries to show where 
the docker container started and exited:
{noformat}
2018-06-27 16:32:48,779 [IPC Server handler 9 on 8041] INFO 
containermanager.ContainerManagerImpl: Start request for 
container_e07_1528760012992_2875921_01_000069 by user p_condor
2018-06-27 16:32:48,782 [AsyncDispatcher event handler] INFO 
application.ApplicationImpl: Adding 
container_e07_1528760012992_2875921_01_000069 to application 
application_1528760012992_2875921
2018-06-27 16:32:48,783 [AsyncDispatcher event handler] INFO 
container.ContainerImpl: Container 
container_e07_1528760012992_2875921_01_000069 transitioned from NEW to 
LOCALIZING
2018-06-27 16:32:48,783 [AsyncDispatcher event handler] INFO 
yarn.YarnShuffleService: Initializing container 
container_e07_1528760012992_2875921_01_000069
2018-06-27 16:32:48,786 [AsyncDispatcher event handler] INFO 
localizer.ResourceLocalizationService: Created localizer for 
container_e07_1528760012992_2875921_01_000069
2018-06-27 16:32:48,786 [LocalizerRunner for 
container_e07_1528760012992_2875921_01_000069] INFO 
localizer.ResourceLocalizationService: Writing credentials to the nmPrivate 
file 
/grid/4/tmp/yarn-local/nmPrivate/container_e07_1528760012992_2875921_01_000069.tokens.
 Credentials list: 
2018-06-27 16:32:52,654 [AsyncDispatcher event handler] INFO 
container.ContainerImpl: Container 
container_e07_1528760012992_2875921_01_000069 transitioned from LOCALIZING to 
LOCALIZED
2018-06-27 16:32:52,684 [AsyncDispatcher event handler] INFO 
container.ContainerImpl: Container 
container_e07_1528760012992_2875921_01_000069 transitioned from LOCALIZED to 
RUNNING
2018-06-27 16:32:52,684 [AsyncDispatcher event handler] INFO 
monitor.ContainersMonitorImpl: Starting resource-monitoring for 
container_e07_1528760012992_2875921_01_000069

2018-06-27 16:32:53.345 Docker container started

2018-06-27 16:32:54,429 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 103072 for container-id 
container_e07_1528760012992_2875921_01_000069: 132.5 MB of 3 GB physical memory 
used; 4.3 GB of 6.3 GB virtual memory used

2018-06-27 16:33:25,422 [main] INFO nodemanager.NodeManager: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NodeManager
STARTUP_MSG: user = mapred
STARTUP_MSG: host = gsbl607n22.blue.ygrid.yahoo.com/10.213.59.232
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.8.3.2.1806111934

2018-06-27 16:33:31,140 [main] INFO containermanager.ContainerManagerImpl: 
Recovering container_e07_1528760012992_2875921_01_000069 in state LAUNCHED with 
exit code -1000
2018-06-27 16:33:31,140 [main] INFO application.ApplicationImpl: Adding 
container_e07_1528760012992_2875921_01_000069 to application 
application_1528760012992_2875921

2018-06-27 16:33:32,771 [main] INFO containermanager.ContainerManagerImpl: 
Waiting for containers: 
2018-06-27 16:33:33,280 [main] INFO containermanager.ContainerManagerImpl: 
Waiting for containers: 
2018-06-27 16:33:33,178 [main] INFO containermanager.ContainerManagerImpl: 
Waiting for containers:

2018-06-27 16:33:33,776 [AsyncDispatcher event handler] INFO 
container.ContainerImpl: Container 
container_e07_1528760012992_2875921_01_000069 transitioned from NEW to 
LOCALIZING
2018-06-27 16:33:34,393 [AsyncDispatcher event handler] INFO 
yarn.YarnShuffleService: Initializing container 
container_e07_1528760012992_2875921_01_000069
2018-06-27 16:33:34,433 [AsyncDispatcher event handler] INFO 
container.ContainerImpl: Container 
container_e07_1528760012992_2875921_01_000069 transitioned from LOCALIZING to 
LOCALIZED
2018-06-27 16:33:34,461 [ContainersLauncher #23] INFO 
nodemanager.ContainerExecutor: Reacquiring 
container_e07_1528760012992_2875921_01_000069 with pid 103072
2018-06-27 16:33:34,463 [AsyncDispatcher event handler] INFO 
container.ContainerImpl: Container 
container_e07_1528760012992_2875921_01_000069 transitioned from LOCALIZED to 
RUNNING
2018-06-27 16:33:34,482 [AsyncDispatcher event handler] INFO 
monitor.ContainersMonitorImpl: Starting resource-monitoring for 
container_e07_1528760012992_2875921_01_000069

2018-06-27 16:33:35,304 [main] INFO nodemanager.NodeStatusUpdaterImpl: Sending 
out 598 NM container statuses: 
2018-06-27 16:33:35,356 [main] INFO nodemanager.NodeStatusUpdaterImpl: 
Registering with RM using containers 
2018-06-27 16:33:35,902 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 103072 for container-id 
container_e07_1528760012992_2875921_01_000069: 1.1 GB of 3 GB physical memory 
used; 4.5 GB of 6.3 GB virtual memory used
2018-06-27 16:34:22,480 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 103072 for container-id 
container_e07_1528760012992_2875921_01_000069: 1.1 GB of 3 GB physical memory 
used; 4.5 GB of 6.3 GB virtual memory used
2018-06-27 16:34:25,738 [IPC Server handler 6 on 8041] INFO 
containermanager.ContainerManagerImpl: Stopping container with container Id: 
container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:25,739 [AsyncDispatcher event handler] INFO 
container.ContainerImpl: Container 
container_e07_1528760012992_2875921_01_000069 transitioned from RUNNING to 
KILLING
2018-06-27 16:34:25,739 [AsyncDispatcher event handler] INFO 
launcher.ContainerLaunch: Cleaning up container 
container_e07_1528760012992_2875921_01_000069

2018-06-27 16:34:26.276 Docker container exited
Jun 27 16:34:26 gsbl607n22.blue.ygrid.yahoo.com dockerd-current[17973]: 
time="2018-06-27T16:34:26.275701908Z" level=error msg="containerd: deleting 
container" error="exit status 1: \"container 
968e4a1a0fcafa7e01a6ab688a86a86615369b644061e90939f60542
Jun 27 16:34:28 gsbl607n22.blue.ygrid.yahoo.com dockerd-current[17973]: 
time="2018-06-27T16:34:28.470794342Z" level=warning 
msg="968e4a1a0fcafa7e01a6ab688a86a86615369b644061e90939f60542468dcb1b cleanup: 
failed to unmount secrets: invalid argument"

2018-06-27 16:34:26,516 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 103072 for container-id 
container_e07_1528760012992_2875921_01_000069: -1B of 3 GB physical memory 
used; -1B of 6.3 GB virtual memory used
2018-06-27 16:34:26,769 [ContainersLauncher #23] INFO 
nodemanager.ContainerExecutor: container_e07_1528760012992_2875921_01_000069 
was deactivated
2018-06-27 16:34:26,870 [AsyncDispatcher event handler] INFO 
container.ContainerImpl: Container 
container_e07_1528760012992_2875921_01_000069 transitioned from KILLING to 
EXITED_WITH_FAILURE
2018-06-27 16:34:26,872 [DeletionService #1] INFO 
nodemanager.LinuxContainerExecutor: Deleting absolute path : 
/grid/3/tmp/yarn-local/usercache/p_condor/appcache/application_1528760012992_2875921/container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,872 [DeletionService #3] INFO 
nodemanager.LinuxContainerExecutor: Deleting absolute path : 
/grid/5/tmp/yarn-local/usercache/p_condor/appcache/application_1528760012992_2875921/container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,872 [DeletionService #0] INFO 
nodemanager.LinuxContainerExecutor: Deleting absolute path : 
/grid/2/tmp/yarn-local/usercache/p_condor/appcache/application_1528760012992_2875921/container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,873 [AsyncDispatcher event handler] INFO 
container.ContainerImpl: Container 
container_e07_1528760012992_2875921_01_000069 transitioned from 
EXITED_WITH_FAILURE to DONE
2018-06-27 16:34:26,873 [AsyncDispatcher event handler] INFO 
application.ApplicationImpl: Removing 
container_e07_1528760012992_2875921_01_000069 from application 
application_1528760012992_2875921
2018-06-27 16:34:26,873 [AsyncDispatcher event handler] INFO 
monitor.ContainersMonitorImpl: Stopping resource-monitoring for 
container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,873 [AsyncDispatcher event handler] INFO 
logaggregation.AppLogAggregatorImpl: Considering container 
container_e07_1528760012992_2875921_01_000069 for log-aggregation
2018-06-27 16:34:26,873 [AsyncDispatcher event handler] INFO 
containermanager.AuxServices: Got event CONTAINER_STOP for appId 
application_1528760012992_2875921
2018-06-27 16:34:26,873 [AsyncDispatcher event handler] INFO 
yarn.YarnShuffleService: Stopping container 
container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,875 [DeletionService #0] INFO 
nodemanager.LinuxContainerExecutor: Deleting absolute path : 
/grid/4/tmp/yarn-local/usercache/p_condor/appcache/application_1528760012992_2875921/container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,875 [DeletionService #1] INFO 
nodemanager.LinuxContainerExecutor: Deleting absolute path : 
/grid/1/tmp/yarn-local/usercache/p_condor/appcache/application_1528760012992_2875921/container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,876 [AsyncDispatcher event handler] WARN 
containermanager.ContainerManagerImpl: couldn't find container 
container_e07_1528760012992_2875921_01_000069 while processing 
FINISH_CONTAINERS event
2018-06-27 16:34:26,877 [DeletionService #1] INFO 
nodemanager.LinuxContainerExecutor: Deleting absolute path : 
/grid/0/tmp/yarn-local/usercache/p_condor/appcache/application_1528760012992_2875921/container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:30,557 [Node Status Updater] INFO 
nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM 
context: [container_e07_1528760012992_2875921_01_000069]
{noformat}
Docker state info:
{noformat}
 "State": {
 "Status": "exited",
 "Running": false,
 "Paused": false,
 "Restarting": false,
 "OOMKilled": false,
 "Dead": false,
 "Pid": 0,
 "ExitCode": 143,
 "Error": "",
 "StartedAt": "2018-06-27T16:32:53.345303052Z",
 "FinishedAt": "2018-06-27T16:34:26.276037261Z"
 },
{noformat}

> container-executor can crash with SIGPIPE after nodemanager restart
> -------------------------------------------------------------------
>
>                 Key: YARN-8515
>                 URL: https://issues.apache.org/jira/browse/YARN-8515
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jim Brennan
>            Assignee: Jim Brennan
>            Priority: Major
>
> When running with docker on large clusters, we have noticed that sometimes 
> docker containers are not removed - they remain in the exited state, and the 
> corresponding container-executor is no longer running.  Upon investigation, 
> we noticed that this always seemed to happen after a nodemanager restart. 
> The sequence leading to the stranded docker containers is:
>  # Nodemanager restarts
>  # Containers are recovered and then run for a while
>  # Containers are killed for some (legitimate) reason
>  # Container-executor exits without removing the docker container
> After reproducing this on a test cluster, we found that the 
> container-executor was exiting due to a SIGPIPE.
> What is happening is that the shell command executor used to start 
> container-executor keeps threads reading from container-executor's stdout and 
> stderr pipes.  When the NM is restarted, those reader threads die with it.  
> Later, when the container exits with an error and container-executor tries to 
> write to stderr (ERRORFILE), the write to the now-readerless pipe delivers a 
> SIGPIPE.  Since SIGPIPE is not handled, the default action kills the 
> container-executor before it can actually remove the docker container.
> We ran into this in branch 2.8.  The way docker containers are removed has 
> been completely redesigned in trunk, so I don't think it will lead to this 
> exact failure, but after an NM restart, potentially any write to stderr or 
> stdout in the container-executor could cause it to crash.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
