[
https://issues.apache.org/jira/browse/YARN-9071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709089#comment-16709089
]
Eric Yang commented on YARN-9071:
---------------------------------
In my local testing, a container failed to start on node A and was moved to
node B. With patch 004, when an upgrade is performed, the reinit tries to
relaunch the container on node A. The default readiness check looks at the
container's IP address, but ContainersMonitor still holds the IP address of
the previous container instance; it is not refreshed by the new instance. As
a result, the AM incorrectly determines that the reinit of the container
succeeded, even though no container was actually launched.
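
To make the failure mode concrete, here is a minimal toy model of it. This is
only a sketch: ToyMonitor, ProcessInfo, and reinit are hypothetical names made
up for illustration, not the NodeManager or service AM code, and the readiness
check is reduced to "a tracked IP exists". It assumes the monitor keeps one
entry per container ID and skips initialization when an entry already exists,
as the issue description reports.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ProcessInfo:
    """Hypothetical per-container record: PID and IP seen by the monitor."""
    pid: Optional[int]
    ip: Optional[str]


class ToyMonitor:
    """Toy stand-in for a container monitor; not the real NodeManager class."""

    def __init__(self) -> None:
        self.tracking: Dict[str, ProcessInfo] = {}

    def start(self, container_id: str, pid: Optional[int], ip: Optional[str]) -> None:
        # Models the reported behavior: an existing entry is left untouched,
        # so a relaunched instance never replaces the old PID and IP.
        if container_id in self.tracking:
            return
        self.tracking[container_id] = ProcessInfo(pid, ip)

    def stop(self, container_id: str) -> None:
        # Dropping the entry forces the next start() to record fresh values.
        self.tracking.pop(container_id, None)

    def is_ready(self, container_id: str) -> bool:
        # Simplified readiness check: ready if an IP is on record.
        info = self.tracking.get(container_id)
        return info is not None and info.ip is not None


def reinit(monitor: ToyMonitor, container_id: str, new_pid: Optional[int],
           new_ip: Optional[str], stop_first: bool) -> bool:
    """Simulate a reinit and return what the readiness check would report."""
    if stop_first:
        monitor.stop(container_id)
    monitor.start(container_id, new_pid, new_ip)
    return monitor.is_ready(container_id)


# The relaunch fails, so the new instance has no PID and no IP.
stale = ToyMonitor()
stale.start("c1", 100, "10.0.0.5")
print(reinit(stale, "c1", None, None, stop_first=False))  # True (false positive)

fresh = ToyMonitor()
fresh.start("c1", 100, "10.0.0.5")
print(reinit(fresh, "c1", None, None, stop_first=True))   # False
```

In the first run the old entry survives the reinit, so the check passes on the
previous instance's IP. In the second run the entry is cleared before
monitoring restarts, which is the approach the issue description suggests, and
the failed relaunch is reported as not ready.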
> NM and service AM don't have updated status for reinitialized containers
> ------------------------------------------------------------------------
>
> Key: YARN-9071
> URL: https://issues.apache.org/jira/browse/YARN-9071
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Billie Rinaldi
> Assignee: Chandni Singh
> Priority: Critical
> Attachments: YARN-9071.001.patch, YARN-9071.002.patch,
> YARN-9071.003.patch, YARN-9071.004.patch, q.log
>
>
> Container resource monitoring is not stopped during the reinitialization
> process, and this prevents the NM from obtaining updated process tree
> information when the container starts running again. I observed a
> reinitialized container go from RUNNING to REINITIALIZING to
> REINITIALIZING_AWAITING_KILL to SCHEDULED to RUNNING. Container monitoring
> was then started for a second time, but since the trackingContainers entry
> had already been initialized for the container, ContainersMonitor skipped
> finding the new PID and IP for the container. A possible solution would be to
> stop the container monitoring in the reinitialization process so that the
> process tree information would be initialized properly when monitoring is
> restarted. When the same container was stopped by the NM later, the NM did
> not kill the container, and the service AM received an unexpected event (stop
> at reinitializing).