[
https://issues.apache.org/jira/browse/YARN-7644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642595#comment-16642595
]
Chandni Singh edited comment on YARN-7644 at 10/9/18 12:43 AM:
---------------------------------------------------------------
[~jlowe] please see my response below
{quote}IIUC the launchContainer method for the executor is a synchronous,
blocking call that won't return until the container completes. For example, see
DefaultContainerExecutor#launchContainer where it invokes
Shell.CommandExecutor#execute. That means the executor lock would be held
continuously while the container is running. Therefore I'm not sure how the
thread running ContainerLaunch#reapContainer is going to obtain the executor
lock to be able to proceed to kill the container. Seems like it would just
hang, but maybe I'm missing something. This may be more of an issue with
YARN-8160 than this one, as it looks like this mostly just refactored existing
code to move it into a ContainerCleanup class.
{quote}
Before {{reapContainer()}}, the container TERM/KILL signal is always sent, and sending it is not blocked. With YARN-8160, after the signal is sent we wait for {{launchContainer()}} to complete and then perform {{reapContainer()}}.
Note: {{reapContainer()}} removes the container. Stopping the container by sending KILL/TERM is not part of {{reapContainer()}}; it is done before the reap.
{quote}To be honest I'm not quite sure what the purpose of the lock is, since
there are many places we invoke the executor without the lock like deactivating
and signalling. The use of the lock seems inconsistent if it's supposed to
guard when we are invoking the executor.
{quote}
This is the comment that describes the issue fixed by the change in YARN-8160:
https://issues.apache.org/jira/browse/YARN-8160?focusedCommentId=16570774&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16570774
I will summarize it here.
* Container is launched
* Re-init of the container is requested
* Re-init triggers a container stop and removes the container
* Meanwhile the container launch exits with 255 because the container files were already cleaned up by the reap. This happens because, after the executor exits the launch, it performs a {{docker inspect}}
With the {{executorLock}}, we wait for {{executor.launchContainer()}} to complete after the TERM/KILL signal is sent to the container. Once the launch completes, we have the correct exit code from the container; only then is the reap performed.
Possibly the name {{executorLock}} is confusing, which I can change?
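To make the intended ordering concrete, here is a minimal, self-contained sketch of the pattern described above. All names here ({{ExecutorLockSketch}}, {{signalled}}, etc.) are hypothetical stand-ins, not the actual NM code: the launch thread holds the lock for the duration of {{launchContainer()}}, while the cleanup thread first delivers the signal without the lock and only then blocks on the lock, so the reap cannot run before the launch has returned with the real exit code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the YARN-8160 ordering, not the actual NM classes.
public class ExecutorLockSketch {
    final ReentrantLock executorLock = new ReentrantLock();
    final CountDownLatch signalled = new CountDownLatch(1);
    volatile int exitCode = -1;

    // Launch thread: holds executorLock for the whole (blocking) launch.
    void launchContainer() throws InterruptedException {
        executorLock.lock();
        try {
            // Stand-in for the blocking container run; it returns once the
            // TERM/KILL signal delivered by cleanup() stops the container.
            signalled.await();
            exitCode = 143; // 128 + SIGTERM: the real exit code we want to keep
        } finally {
            executorLock.unlock();
        }
    }

    // Cleanup thread: signal first (never blocked by the lock), then wait.
    void cleanup() {
        signalled.countDown();   // 1. send TERM/KILL; not gated on the lock
        executorLock.lock();     // 2. blocks until launchContainer() returns
        try {
            // 3. only now is it safe to remove container files (the "reap"),
            //    because the exit code has already been captured above.
        } finally {
            executorLock.unlock();
        }
    }
}
```

The key point the sketch illustrates: step 1 is outside the lock, so cleanup never deadlocks against a running launch; step 2 is what prevents the reap from racing with {{docker inspect}} after the launch exits.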
I will address your other comments in the next patch.
> NM gets backed up deleting docker containers
> --------------------------------------------
>
> Key: YARN-7644
> URL: https://issues.apache.org/jira/browse/YARN-7644
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Reporter: Eric Badger
> Assignee: Chandni Singh
> Priority: Major
> Labels: Docker
> Attachments: YARN-7644.001.patch, YARN-7644.002.patch
>
>
> We are sending a {{docker stop}} to the docker container with a timeout of 10
> seconds when we shut down a container. If the container does not stop after
> 10 seconds then we force kill it. However, the {{docker stop}} command is a
> blocking call. So in cases where lots of containers don't go down with the
> initial SIGTERM, we have to wait 10+ seconds for the {{docker stop}} to
> return. This ties up the ContainerLaunch handler and so these kill events
> back up. It also appears to be backing up new container launches as well.
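The backup described in the issue can be sketched in miniature: because the blocking stop runs on the event-handler thread, every stubborn container delays all queued events behind it. One common mitigation (a hypothetical illustration only, not the attached patches) is to submit the blocking stop to a worker pool so the handler thread returns immediately:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: offload the blocking "docker stop -t 10" so the
// event-handler thread is not tied up while slow containers shut down.
public class AsyncDockerStop {
    private final ExecutorService stopPool = Executors.newFixedThreadPool(4);

    // Returns immediately; the worker pool absorbs the up-to-10s wait.
    Future<Integer> stopContainer(String containerId) {
        return stopPool.submit(() -> {
            // Stand-in for running "docker stop -t 10 <containerId>",
            // which blocks while the container ignores the initial SIGTERM.
            TimeUnit.MILLISECONDS.sleep(50);
            return 0; // exit code of the docker stop command
        });
    }

    void shutdown() {
        stopPool.shutdown();
    }
}
```

With this shape, kill events no longer serialize behind each 10+ second stop; only the pool's worker threads wait, and the handler keeps draining its queue.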