[ 
https://issues.apache.org/jira/browse/YARN-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547997#comment-14547997
 ] 

gu-chi commented on YARN-1922:
------------------------------

Hi, I see you commented here that YARN-1922.5.patch would be checked in, so
why was YARN-1922.6.patch merged instead? What was the concern?
I think this solution may have a defect.
Suppose a container finishes and its cleanup starts. The PID file still
exists, so cleanup calls signalContainer once, which kills the process whose
PID is recorded in the PID file. But because the container has already
finished, that PID may by then have been reused by another process, and
killing it could cause a serious problem.
My NM was killed unexpectedly, and what I described could be the cause, even
though it would occur only rarely.
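To make the PID reuse concern concrete, here is a minimal sketch of one way to
guard a signal against it. This is not what YARN or the attached patches do;
the function names (read_start_time, signal_if_same_process) and the idea of
storing the process start time next to the PID are assumptions made only for
illustration. The start time is read from /proc/<pid>/stat, so that part is
Linux-specific.

```python
import os
import signal
from typing import Callable, Optional


def read_start_time(pid: int) -> Optional[str]:
    """Return the start time of `pid` from /proc/<pid>/stat (Linux), else None."""
    try:
        with open(f"/proc/{pid}/stat") as stat_file:
            stat = stat_file.read()
    except OSError:
        return None  # no such process, or no /proc on this platform
    # Field 2 (the command name) is parenthesised and may contain spaces,
    # so parse only what follows the last ')'. That remainder starts at
    # field 3, which puts starttime (field 22) at index 19.
    fields = stat[stat.rfind(")") + 1:].split()
    return fields[19] if len(fields) > 19 else None


def signal_if_same_process(
    pid: int,
    recorded_start: str,
    sig: int = signal.SIGTERM,
    start_time_of: Callable[[int], Optional[str]] = read_start_time,
    send: Callable[[int, int], None] = os.kill,
) -> bool:
    """Send `sig` to `pid` only if it still has the recorded start time.

    Returns True if the signal was sent, False if the process is gone or
    the PID now belongs to a process with a different start time.
    """
    current_start = start_time_of(pid)
    if current_start is None or current_start != recorded_start:
        return False  # PID is free or has been reused: do not signal
    send(pid, sig)
    return True
```

In this sketch the start time would be recorded when the PID file is written
and compared again just before signalling. A short window still remains
between the check and the kill, so this narrows the race but does not remove
it.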
Below is the error scenario: the task cleanup had not finished when the NM
was killed, and the NM then started again.

2015-05-14 21:49:03,063 | INFO  | DeletionService #1 | Deleting absolute path : /export/data1/yarn/nm/localdir/usercache/omm/appcache/application_1430456703237_8047/container_1430456703237_8047_01_12582917 | org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:400)
2015-05-14 21:49:03,063 | INFO  | AsyncDispatcher event handler | Container container_1430456703237_8047_01_12582917 transitioned from EXITED_WITH_SUCCESS to DONE | org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:918)
2015-05-14 21:49:03,064 | INFO  | AsyncDispatcher event handler | Removing container_1430456703237_8047_01_12582917 from application application_1430456703237_8047 | org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl$ContainerDoneTransition.transition(ApplicationImpl.java:340)
2015-05-14 21:49:03,064 | INFO  | AsyncDispatcher event handler | Considering container container_1430456703237_8047_01_12582917 for log-aggregation | org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.startContainerLogAggregation(AppLogAggregatorImpl.java:342)
2015-05-14 21:49:03,064 | INFO  | AsyncDispatcher event handler | Got event CONTAINER_STOP for appId application_1430456703237_8047 | org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.handle(AuxServices.java:196)
2015-05-14 21:49:03,152 | INFO  | Node Status Updater | Removed completed containers from NM context: [container_1430456703237_8047_01_12582917] | org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.removeCompletedContainersFromContext(NodeStatusUpdaterImpl.java:417)
2015-05-14 21:49:03,293 | INFO  | Task killer for 26924 | Using linux-container-executor.users as omm | org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:349)
2015-05-14 21:49:20,667 | INFO  | main | STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NodeManager
STARTUP_MSG:   host = SR6S11/192.168.10.21
STARTUP_MSG:   args = []
STARTUP_MSG:   version = V100R001C00
STARTUP_MSG:   classpath = 

> Process group remains alive after container process is killed externally
> ------------------------------------------------------------------------
>
>                 Key: YARN-1922
>                 URL: https://issues.apache.org/jira/browse/YARN-1922
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.4.0
>         Environment: CentOS 6.4
>            Reporter: Billie Rinaldi
>            Assignee: Billie Rinaldi
>             Fix For: 2.6.0
>
>         Attachments: YARN-1922.1.patch, YARN-1922.2.patch, YARN-1922.3.patch, 
> YARN-1922.4.patch, YARN-1922.5.patch, YARN-1922.6.patch
>
>
> If the main container process is killed externally, ContainerLaunch does not 
> kill the rest of the process group.  Before sending the event that results in 
> the ContainerLaunch.containerCleanup method being called, ContainerLaunch 
> sets the "completed" flag to true.  Then when cleaning up, it doesn't try to 
> read the pid file if the completed flag is true.  If it read the pid file, it 
> would proceed to send the container a kill signal.  In the case of the 
> DefaultContainerExecutor, this would kill the process group.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
