[ 
https://issues.apache.org/jira/browse/YARN-9071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709089#comment-16709089
 ] 

Eric Yang commented on YARN-9071:
---------------------------------

In my local testing, a container failed to start on node A and was moved to 
node B.  With patch 004, the reinit performed during an upgrade tries to 
relaunch the container on node A.  The default readiness check only looks at 
the IP address, and ContainersMonitor still holds the IP address of the 
previous container instance because the new instance never refreshes it.  As 
a result, the AM incorrectly concludes that the reinit succeeded even though 
no container was actually launched.
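
The false positive described above can be sketched roughly as follows. This is 
a simplified model, not the actual service AM code: ContainerStatusView, 
onReinitWithoutLaunch and isReadyByIp are hypothetical names, and the IP value 
is made up for illustration.

```java
import java.util.Optional;

// Hypothetical model for illustration; these are not the real YARN service classes.
class StaleIpReadinessSketch {

    /** Status as the AM might see it: the IP is only replaced when a new one is reported. */
    static final class ContainerStatusView {
        private String ip;               // last IP reported for this container ID
        private boolean processRunning;  // whether a process actually exists right now

        void onLaunched(String newIp) {
            this.ip = newIp;
            this.processRunning = true;
        }

        // Models a reinit where no new process came up and nothing refreshed the IP.
        void onReinitWithoutLaunch() {
            this.processRunning = false;
        }

        Optional<String> ip() {
            return Optional.ofNullable(ip);
        }

        boolean processRunning() {
            return processRunning;
        }
    }

    /** IP-only readiness check (assumed behavior): ready as soon as any IP is known. */
    static boolean isReadyByIp(ContainerStatusView status) {
        return status.ip().isPresent();
    }

    public static void main(String[] args) {
        ContainerStatusView status = new ContainerStatusView();
        status.onLaunched("10.0.0.5");   // previous instance reported an IP
        status.onReinitWithoutLaunch();  // reinit, but nothing was launched
        System.out.println("ready=" + isReadyByIp(status)
            + ", running=" + status.processRunning());
    }
}
```

In this model the check reports ready while nothing is running, which matches 
the symptom: the stale IP alone is enough to satisfy the readiness check.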

> NM and service AM don't have updated status for reinitialized containers
> ------------------------------------------------------------------------
>
>                 Key: YARN-9071
>                 URL: https://issues.apache.org/jira/browse/YARN-9071
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Billie Rinaldi
>            Assignee: Chandni Singh
>            Priority: Critical
>         Attachments: YARN-9071.001.patch, YARN-9071.002.patch, 
> YARN-9071.003.patch, YARN-9071.004.patch, q.log
>
>
> Container resource monitoring is not stopped during the reinitialization 
> process, and this prevents the NM from obtaining updated process tree 
> information when the container starts running again. I observed a 
> reinitialized container go from RUNNING to REINITIALIZING to 
> REINITIALIZING_AWAITING_KILL to SCHEDULED to RUNNING. Container monitoring 
> was then started for a second time, but since the trackingContainers entry 
> had already been initialized for the container, ContainersMonitor skipped 
> finding the new PID and IP for the container. A possible solution would be to 
> stop the container monitoring in the reinitialization process so that the 
> process tree information would be initialized properly when monitoring is 
> restarted. When the same container was stopped by the NM later, the NM did 
> not kill the container, and the service AM received an unexpected event (stop 
> at reinitializing).
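
A minimal sketch of the skipped re-initialization described in the issue, and 
of the suggested fix of stopping monitoring during reinit.  The Monitor class, 
its methods and the reinit helper are hypothetical stand-ins; only the 
trackingContainers name comes from the description, and the real 
ContainersMonitor is event-driven and considerably more involved.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model for illustration; not the real ContainersMonitor implementation.
class MonitorReinitSketch {

    static final class Monitor {
        // Loosely mirrors the trackingContainers map: container ID -> cached PID.
        private final Map<String, String> trackingContainers = new HashMap<>();

        // If an entry already exists, the new PID is never looked up or stored.
        void startMonitoring(String containerId, String currentPid) {
            trackingContainers.putIfAbsent(containerId, currentPid);
        }

        void stopMonitoring(String containerId) {
            trackingContainers.remove(containerId);
        }

        String trackedPid(String containerId) {
            return trackingContainers.get(containerId);
        }
    }

    /** Reinitializes a container; stopFirst toggles the suggested fix. */
    static String reinit(Monitor monitor, String containerId, String newPid, boolean stopFirst) {
        if (stopFirst) {
            monitor.stopMonitoring(containerId);  // drop the stale entry before relaunch
        }
        monitor.startMonitoring(containerId, newPid);
        return monitor.trackedPid(containerId);
    }

    public static void main(String[] args) {
        Monitor withoutStop = new Monitor();
        withoutStop.startMonitoring("container_1", "1001");
        // Old PID is retained because the entry was already initialized.
        System.out.println(reinit(withoutStop, "container_1", "2002", false));

        Monitor withStop = new Monitor();
        withStop.startMonitoring("container_1", "1001");
        // Entry is rebuilt, so the new PID is tracked.
        System.out.println(reinit(withStop, "container_1", "2002", true));
    }
}
```

Without the stop, the monitor keeps reporting the PID of the old process; with 
it, the entry is rebuilt when monitoring restarts.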



