[
https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539286#comment-16539286
]
Jim Brennan commented on YARN-8515:
-----------------------------------
Here is an example case that we saw:
Docker ps info for this container:
{noformat}
968e4a1a0fca 90188f3d752e "bash /grid/4/tmp/..." 6 days ago Exited (143) 6 days ago container_e07_1528760012992_2875921_01_000069
{noformat}
NM Log with some added info from Docker container and journalctl to show where
the docker container started/exited:
{noformat}
2018-06-27 16:32:48,779 [IPC Server handler 9 on 8041] INFO
containermanager.ContainerManagerImpl: Start request for
container_e07_1528760012992_2875921_01_000069 by user p_condor
2018-06-27 16:32:48,782 [AsyncDispatcher event handler] INFO
application.ApplicationImpl: Adding
container_e07_1528760012992_2875921_01_000069 to application
application_1528760012992_2875921
2018-06-27 16:32:48,783 [AsyncDispatcher event handler] INFO
container.ContainerImpl: Container
container_e07_1528760012992_2875921_01_000069 transitioned from NEW to
LOCALIZING
2018-06-27 16:32:48,783 [AsyncDispatcher event handler] INFO
yarn.YarnShuffleService: Initializing container
container_e07_1528760012992_2875921_01_000069
2018-06-27 16:32:48,786 [AsyncDispatcher event handler] INFO
localizer.ResourceLocalizationService: Created localizer for
container_e07_1528760012992_2875921_01_000069
2018-06-27 16:32:48,786 [LocalizerRunner for
container_e07_1528760012992_2875921_01_000069] INFO
localizer.ResourceLocalizationService: Writing credentials to the nmPrivate
file
/grid/4/tmp/yarn-local/nmPrivate/container_e07_1528760012992_2875921_01_000069.tokens.
Credentials list:
2018-06-27 16:32:52,654 [AsyncDispatcher event handler] INFO
container.ContainerImpl: Container
container_e07_1528760012992_2875921_01_000069 transitioned from LOCALIZING to
LOCALIZED
2018-06-27 16:32:52,684 [AsyncDispatcher event handler] INFO
container.ContainerImpl: Container
container_e07_1528760012992_2875921_01_000069 transitioned from LOCALIZED to
RUNNING
2018-06-27 16:32:52,684 [AsyncDispatcher event handler] INFO
monitor.ContainersMonitorImpl: Starting resource-monitoring for
container_e07_1528760012992_2875921_01_000069
2018-06-27 16:32:53.345 Docker container started
2018-06-27 16:32:54,429 [Container Monitor] DEBUG ContainersMonitorImpl.audit:
Memory usage of ProcessTree 103072 for container-id
container_e07_1528760012992_2875921_01_000069: 132.5 MB of 3 GB physical memory
used; 4.3 GB of 6.3 GB virtual memory used
2018-06-27 16:33:25,422 [main] INFO nodemanager.NodeManager: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NodeManager
STARTUP_MSG: user = mapred
STARTUP_MSG: host = gsbl607n22.blue.ygrid.yahoo.com/10.213.59.232
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.8.3.2.1806111934
2018-06-27 16:33:31,140 [main] INFO containermanager.ContainerManagerImpl:
Recovering container_e07_1528760012992_2875921_01_000069 in state LAUNCHED with
exit code -1000
2018-06-27 16:33:31,140 [main] INFO application.ApplicationImpl: Adding
container_e07_1528760012992_2875921_01_000069 to application
application_1528760012992_2875921
2018-06-27 16:33:32,771 [main] INFO containermanager.ContainerManagerImpl:
Waiting for containers:
2018-06-27 16:33:33,280 [main] INFO containermanager.ContainerManagerImpl:
Waiting for containers:
2018-06-27 16:33:33,178 [main] INFO containermanager.ContainerManagerImpl:
Waiting for containers:
2018-06-27 16:33:33,776 [AsyncDispatcher event handler] INFO
container.ContainerImpl: Container
container_e07_1528760012992_2875921_01_000069 transitioned from NEW to
LOCALIZING
2018-06-27 16:33:34,393 [AsyncDispatcher event handler] INFO
yarn.YarnShuffleService: Initializing container
container_e07_1528760012992_2875921_01_000069
2018-06-27 16:33:34,433 [AsyncDispatcher event handler] INFO
container.ContainerImpl: Container
container_e07_1528760012992_2875921_01_000069 transitioned from LOCALIZING to
LOCALIZED
2018-06-27 16:33:34,461 [ContainersLauncher #23] INFO
nodemanager.ContainerExecutor: Reacquiring
container_e07_1528760012992_2875921_01_000069 with pid 103072
2018-06-27 16:33:34,463 [AsyncDispatcher event handler] INFO
container.ContainerImpl: Container
container_e07_1528760012992_2875921_01_000069 transitioned from LOCALIZED to
RUNNING
2018-06-27 16:33:34,482 [AsyncDispatcher event handler] INFO
monitor.ContainersMonitorImpl: Starting resource-monitoring for
container_e07_1528760012992_2875921_01_000069
2018-06-27 16:33:35,304 [main] INFO nodemanager.NodeStatusUpdaterImpl: Sending
out 598 NM container statuses:
2018-06-27 16:33:35,356 [main] INFO nodemanager.NodeStatusUpdaterImpl:
Registering with RM using containers
2018-06-27 16:33:35,902 [Container Monitor] DEBUG ContainersMonitorImpl.audit:
Memory usage of ProcessTree 103072 for container-id
container_e07_1528760012992_2875921_01_000069: 1.1 GB of 3 GB physical memory
used; 4.5 GB of 6.3 GB virtual memory used
2018-06-27 16:34:22,480 [Container Monitor] DEBUG ContainersMonitorImpl.audit:
Memory usage of ProcessTree 103072 for container-id
container_e07_1528760012992_2875921_01_000069: 1.1 GB of 3 GB physical memory
used; 4.5 GB of 6.3 GB virtual memory used
2018-06-27 16:34:25,738 [IPC Server handler 6 on 8041] INFO
containermanager.ContainerManagerImpl: Stopping container with container Id:
container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:25,739 [AsyncDispatcher event handler] INFO
container.ContainerImpl: Container
container_e07_1528760012992_2875921_01_000069 transitioned from RUNNING to
KILLING
2018-06-27 16:34:25,739 [AsyncDispatcher event handler] INFO
launcher.ContainerLaunch: Cleaning up container
container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26.276 Docker container exited
Jun 27 16:34:26 gsbl607n22.blue.ygrid.yahoo.com dockerd-current[17973]:
time="2018-06-27T16:34:26.275701908Z" level=error msg="containerd: deleting
container" error="exit status 1: \"container
968e4a1a0fcafa7e01a6ab688a86a86615369b644061e90939f60542
Jun 27 16:34:28 gsbl607n22.blue.ygrid.yahoo.com dockerd-current[17973]:
time="2018-06-27T16:34:28.470794342Z" level=warning
msg="968e4a1a0fcafa7e01a6ab688a86a86615369b644061e90939f60542468dcb1b cleanup:
failed to unmount secrets: invalid argument"
2018-06-27 16:34:26,516 [Container Monitor] DEBUG ContainersMonitorImpl.audit:
Memory usage of ProcessTree 103072 for container-id
container_e07_1528760012992_2875921_01_000069: -1B of 3 GB physical memory
used; -1B of 6.3 GB virtual memory used
2018-06-27 16:34:26,769 [ContainersLauncher #23] INFO
nodemanager.ContainerExecutor: container_e07_1528760012992_2875921_01_000069
was deactivated
2018-06-27 16:34:26,870 [AsyncDispatcher event handler] INFO
container.ContainerImpl: Container
container_e07_1528760012992_2875921_01_000069 transitioned from KILLING to
EXITED_WITH_FAILURE
2018-06-27 16:34:26,872 [DeletionService #1] INFO
nodemanager.LinuxContainerExecutor: Deleting absolute path :
/grid/3/tmp/yarn-local/usercache/p_condor/appcache/application_1528760012992_2875921/container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,872 [DeletionService #3] INFO
nodemanager.LinuxContainerExecutor: Deleting absolute path :
/grid/5/tmp/yarn-local/usercache/p_condor/appcache/application_1528760012992_2875921/container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,872 [DeletionService #0] INFO
nodemanager.LinuxContainerExecutor: Deleting absolute path :
/grid/2/tmp/yarn-local/usercache/p_condor/appcache/application_1528760012992_2875921/container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,873 [AsyncDispatcher event handler] INFO
container.ContainerImpl: Container
container_e07_1528760012992_2875921_01_000069 transitioned from
EXITED_WITH_FAILURE to DONE
2018-06-27 16:34:26,873 [AsyncDispatcher event handler] INFO
application.ApplicationImpl: Removing
container_e07_1528760012992_2875921_01_000069 from application
application_1528760012992_2875921
2018-06-27 16:34:26,873 [AsyncDispatcher event handler] INFO
monitor.ContainersMonitorImpl: Stopping resource-monitoring for
container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,873 [AsyncDispatcher event handler] INFO
logaggregation.AppLogAggregatorImpl: Considering container
container_e07_1528760012992_2875921_01_000069 for log-aggregation
2018-06-27 16:34:26,873 [AsyncDispatcher event handler] INFO
containermanager.AuxServices: Got event CONTAINER_STOP for appId
application_1528760012992_2875921
2018-06-27 16:34:26,873 [AsyncDispatcher event handler] INFO
yarn.YarnShuffleService: Stopping container
container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,875 [DeletionService #0] INFO
nodemanager.LinuxContainerExecutor: Deleting absolute path :
/grid/4/tmp/yarn-local/usercache/p_condor/appcache/application_1528760012992_2875921/container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,875 [DeletionService #1] INFO
nodemanager.LinuxContainerExecutor: Deleting absolute path :
/grid/1/tmp/yarn-local/usercache/p_condor/appcache/application_1528760012992_2875921/container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:26,876 [AsyncDispatcher event handler] WARN
containermanager.ContainerManagerImpl: couldn't find container
container_e07_1528760012992_2875921_01_000069 while processing
FINISH_CONTAINERS event
2018-06-27 16:34:26,877 [DeletionService #1] INFO
nodemanager.LinuxContainerExecutor: Deleting absolute path :
/grid/0/tmp/yarn-local/usercache/p_condor/appcache/application_1528760012992_2875921/container_e07_1528760012992_2875921_01_000069
2018-06-27 16:34:30,557 [Node Status Updater] INFO
nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM
context: [container_e07_1528760012992_2875921_01_000069]
{noformat}
Docker state info:
{noformat}
"State": {
"Status": "exited",
"Running": false,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 0,
"ExitCode": 143,
"Error": "",
"StartedAt": "2018-06-27T16:32:53.345303052Z",
"FinishedAt": "2018-06-27T16:34:26.276037261Z"
},
{noformat}
> container-executor can crash with SIGPIPE after nodemanager restart
> -------------------------------------------------------------------
>
> Key: YARN-8515
> URL: https://issues.apache.org/jira/browse/YARN-8515
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Jim Brennan
> Assignee: Jim Brennan
> Priority: Major
>
> When running with docker on large clusters, we have noticed that sometimes
> docker containers are not removed - they remain in the exited state, and the
> corresponding container-executor is no longer running. Upon investigation,
> we noticed that this always seemed to happen after a nodemanager restart.
> The sequence leading to the stranded docker containers is:
> # Nodemanager restarts
> # Containers are recovered and then run for a while
> # Containers are killed for some (legitimate) reason
> # Container-executor exits without removing the docker container.
> After reproducing this on a test cluster, we found that the
> container-executor was exiting due to a SIGPIPE.
> What is happening is that the shell command executor used to start the
> container-executor has threads reading from c-e's stdout and stderr. When
> the NM is restarted, those threads are killed. When the container-executor
> continues executing after the container exits with an error, it tries to
> write to stderr (ERRORFILE) and receives a SIGPIPE. Since SIGPIPE is not
> handled, the container-executor crashes before it can actually remove the
> docker container.
> We ran into this in branch 2.8. The way docker containers are removed has
> been completely redesigned in trunk, so I don't think it will lead to this
> exact failure, but after an NM restart, potentially any write to stderr or
> stdout in the container-executor could cause it to crash.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]