[ https://issues.apache.org/jira/browse/YARN-7644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642576#comment-16642576 ]
Jason Lowe commented on YARN-7644: ---------------------------------- Thanks for the patch! I'm a little concerned about the container executor lock that was added in YARN-8160 and used as part of this patch. IIUC the launchContainer method for the executor is a synchronous, blocking call that won't return until the container completes. For example, see DefaultContainerExecutor#launchContainer where it invokes Shell.CommandExecutor#execute. That means the executor lock would be held continuously while the container is running. Therefore I'm not sure how the thread running ContainerLaunch#reapContainer is going to obtain the executor lock to be able to proceed to kill the container. Seems like it would just hang, but maybe I'm missing something. This may be more of an issue with YARN-8160 than this one, as it looks like this mostly just refactored existing code to move it into a ContainerCleanup class. To be honest I'm not quite sure what the purpose of the lock is, since there are many places we invoke the executor without the lock like deactivating and signalling. The use of the lock seems inconsistent if it's supposed to guard when we are invoking the executor. Nit: It feels a little off for ContainerCleanup to reach into fields directly, e.g.: launch.pidFilePath, launch.containerAlreadyLaunched, launch.completed, etc. It made more sense when this code was part of ContainerLaunch because the field was private and no code other than the implementation details of ContainerLaunch needed to know. Now there's another, separate class that is reaching in to grab all these fields. Seems like either ContainerCleanup should be a private static class of ContainerLaunch or there should be accessors so we can keep these fields private. ContainerLaunch.sleepDelayBeforeSigKill should be final like the other properties. The assignment to 250 will always be clobbered by the constructor anyway. > NM gets backed up deleting docker containers > -------------------------------------------- > > Key: YARN-7644 > URL: https://issues.apache.org/jira/browse/YARN-7644 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager > Reporter: Eric Badger > Assignee: Chandni Singh > Priority: Major > Labels: Docker > Attachments: YARN-7644.001.patch, YARN-7644.002.patch > > > We are sending a {{docker stop}} to the docker container with a timeout of 10 > seconds when we shut down a container. If the container does not stop after > 10 seconds then we force kill it. However, the {{docker stop}} command is a > blocking call. So in cases where lots of containers don't go down with the > initial SIGTERM, we have to wait 10+ seconds for the {{docker stop}} to > return. This ties up the ContainerLaunch handler and so these kill events > back up. It also appears to be backing up new container launches as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org