[
https://issues.apache.org/jira/browse/YARN-7644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642595#comment-16642595
]
Chandni Singh edited comment on YARN-7644 at 10/9/18 12:43 AM:
---------------------------------------------------------------
[~jlowe] please see my response below
{quote}IIUC the launchContainer method for the executor is a synchronous,
blocking call that won't return until the container completes. For example, see
DefaultContainerExecutor#launchContainer where it invokes
Shell.CommandExecutor#execute. That means the executor lock would be held
continuously while the container is running. Therefore I'm not sure how the
thread running ContainerLaunch#reapContainer is going to obtain the executor
lock to be able to proceed to kill the container. Seems like it would just
hang, but maybe I'm missing something. This may be more of an issue with
YARN-8160 than this one, as it looks like this mostly just refactored existing
code to move it into a ContainerCleanup class.
{quote}
Before {{reapContainer()}}, the container TERM/KILL signal is always sent, and sending it is not blocked. With YARN-8160, after the signal is sent we wait for {{launchContainer()}} to complete and then perform {{reapContainer()}}.
Note: {{reapContainer()}} removes the container. Stopping the container by sending KILL/TERM is not part of {{reapContainer()}}; it is done before the reap.
{quote}To be honest I'm not quite sure what the purpose of the lock is, since
there are many places we invoke the executor without the lock like deactivating
and signalling. The use of the lock seems inconsistent if it's supposed to
guard when we are invoking the executor.
{quote}
This is the comment that describes the issue fixed by the change in YARN-8160:
https://issues.apache.org/jira/browse/YARN-8160?focusedCommentId=16570774&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16570774
I will summarize it here.
* Container is launched
* Re-init of the container is requested
* Re-init triggers a container stop and removes the container
* Meanwhile the container launch exits with 255 because the container files were already cleaned up by the reap. This happens because, after the executor exits the launch, it performs a {{docker inspect}}
With the {{executorLock}}, we wait for {{executor.launchContainer()}} to complete after the TERM/KILL signal is sent to the container. Once the launch completes, we have the correct exit code from the container; only then is the reap performed.
Possibly the name {{executorLock}} is confusing, which I can change?
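To make the intended ordering concrete, here is a minimal, self-contained sketch of the pattern described above. All names here ({{ExecutorLockSketch}}, {{signalled}}, etc.) are hypothetical stand-ins, not the actual NM code: the launch thread holds the lock for the duration of {{launchContainer()}}, while the cleanup thread first delivers the signal without the lock and only then blocks on the lock, so the reap cannot run before the launch has returned with the real exit code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the YARN-8160 ordering, not the actual NM classes.
public class ExecutorLockSketch {
    final ReentrantLock executorLock = new ReentrantLock();
    final CountDownLatch signalled = new CountDownLatch(1);
    volatile int exitCode = -1;

    // Launch thread: holds executorLock for the whole (blocking) launch.
    void launchContainer() throws InterruptedException {
        executorLock.lock();
        try {
            // Stand-in for the blocking container run; it returns once the
            // TERM/KILL signal delivered by cleanup() stops the container.
            signalled.await();
            exitCode = 143; // 128 + SIGTERM: the real exit code we want to keep
        } finally {
            executorLock.unlock();
        }
    }

    // Cleanup thread: signal first (never blocked by the lock), then wait.
    void cleanup() {
        signalled.countDown();   // 1. send TERM/KILL; not gated on the lock
        executorLock.lock();     // 2. blocks until launchContainer() returns
        try {
            // 3. only now is it safe to remove container files (the "reap"),
            //    because the exit code has already been captured above.
        } finally {
            executorLock.unlock();
        }
    }
}
```

The key point the sketch illustrates: step 1 is outside the lock, so cleanup never deadlocks against a running launch; step 2 is what prevents the reap from racing with {{docker inspect}} after the launch exits.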
I will address your other comments in the next patch.
> NM gets backed up deleting docker containers
> --------------------------------------------
>
> Key: YARN-7644
> URL: https://issues.apache.org/jira/browse/YARN-7644
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Reporter: Eric Badger
> Assignee: Chandni Singh
> Priority: Major
> Labels: Docker
> Attachments: YARN-7644.001.patch, YARN-7644.002.patch
>
>
> We are sending a {{docker stop}} to the docker container with a timeout of 10
> seconds when we shut down a container. If the container does not stop after
> 10 seconds then we force kill it. However, the {{docker stop}} command is a
> blocking call. So in cases where lots of containers don't go down with the
> initial SIGTERM, we have to wait 10+ seconds for the {{docker stop}} to
> return. This ties up the ContainerLaunch handler and so these kill events
> back up. It also appears to be backing up new container launches as well.
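The backup described in the issue can be sketched in miniature: because the blocking stop runs on the event-handler thread, every stubborn container delays all queued events behind it. One common mitigation (a hypothetical illustration only, not the attached patches) is to submit the blocking stop to a worker pool so the handler thread returns immediately:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: offload the blocking "docker stop -t 10" so the
// event-handler thread is not tied up while slow containers shut down.
public class AsyncDockerStop {
    private final ExecutorService stopPool = Executors.newFixedThreadPool(4);

    // Returns immediately; the worker pool absorbs the up-to-10s wait.
    Future<Integer> stopContainer(String containerId) {
        return stopPool.submit(() -> {
            // Stand-in for running "docker stop -t 10 <containerId>",
            // which blocks while the container ignores the initial SIGTERM.
            TimeUnit.MILLISECONDS.sleep(50);
            return 0; // exit code of the docker stop command
        });
    }

    void shutdown() {
        stopPool.shutdown();
    }
}
```

With this shape, kill events no longer serialize behind each 10+ second stop; only the pool's worker threads wait, and the handler keeps draining its queue.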