[ 
https://issues.apache.org/jira/browse/YARN-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194012#comment-14194012
 ] 

Vinod Kumar Vavilapalli commented on YARN-1922:
-----------------------------------------------

Here's my notes trying to understand the scenario
 - If the cleanup request came in before the launch happens, we simply skip 
both launch and cleanup. This is taken care of by ContainersLauncher together 
with ContainerLaunch.shouldLaunchContainer.
 - The interesting case is when both launch and cleanup came in a the same 
time. This can be split into more cases, depending on _when_ the 
cleanup-request came in.
    ## Launch failed and pid-file is not written, and _completed_ is true.
    ## Launch happened, pid-file is written but entire process-tree crashed 
immediately after, so _completed_ is true.
    ## Launch happened, pid-file is written, root process crashed immediately 
after, but the remaining process-tree is still running, so _completed_ is true.
    ## Launch happened, pid-file is written and process-tree is running, so and 
_completed_ is false.

>From the code, the current behavior is fine for (1), (2) and (4), but (3) is 
>leaking processes before the patch.

I don't think what the patch is doing is sufficient - it does address (3) but 
only if pid-file was written with 100ms before the root-process crashes.

For comprehensively tackling (3), what we really need is to just wait till 
either pid-file is written or for the timer to expire maxKillWaitTime. Waiting 
for once (100ms) is not enough - I don't think your test is catching this, but 
it should be possible write a test for this that will prove that the patch is 
not sufficient.

Agreed?

> Process group remains alive after container process is killed externally
> ------------------------------------------------------------------------
>
>                 Key: YARN-1922
>                 URL: https://issues.apache.org/jira/browse/YARN-1922
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.4.0
>         Environment: CentOS 6.4
>            Reporter: Billie Rinaldi
>            Assignee: Billie Rinaldi
>         Attachments: YARN-1922.1.patch, YARN-1922.2.patch, YARN-1922.3.patch, 
> YARN-1922.4.patch, YARN-1922.5.patch
>
>
> If the main container process is killed externally, ContainerLaunch does not 
> kill the rest of the process group.  Before sending the event that results in 
> the ContainerLaunch.containerCleanup method being called, ContainerLaunch 
> sets the "completed" flag to true.  Then when cleaning up, it doesn't try to 
> read the pid file if the completed flag is true.  If it read the pid file, it 
> would proceed to send the container a kill signal.  In the case of the 
> DefaultContainerExecutor, this would kill the process group.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to