[ https://issues.apache.org/jira/browse/YARN-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341894#comment-15341894 ]
Jason Lowe commented on YARN-5197:
----------------------------------
bq. is this possible that container info disappear from node update?
Yes, it is definitely possible; we've seen it in practice. If the
application is tearing down, the RM will tell the NM to clean up the
application. There are scenarios where the NM can fail to report a completed
container for an application that is being cleaned up, since it's removing all
the app state and the containers that go with it. Since the app is cleaning
up, there's no AM around to ack. And if the NM never reports a completion
event for a container, then without this patch RMNodeImpl leaks that container
in its launchedContainers map.
The patch also covers the corner case where the NM failed to record state for a
container somehow (I/O error or other state store failure) and reconnected with
partial state. In that scenario the RM will properly detect that the container
is no longer being tracked by the NM and report the completion to the
application (as well as preventing the leak in launchedContainers).
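
To illustrate the idea (this is not the patch itself), here is a minimal sketch
of how a per-node tracker could treat a container that vanishes from a
heartbeat as completed. The class NodeContainerTracker, the method
handleHeartbeat, and the use of plain strings for container IDs are all
hypothetical simplifications; the real RMNodeImpl works with ContainerId and
ContainerStatus objects and differs in detail.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the per-node tracking done by RMNodeImpl. */
class NodeContainerTracker {
  // Containers the node has reported as running (the role of launchedContainers).
  private final Set<String> launchedContainers = new HashSet<>();

  /**
   * Processes one node heartbeat.
   *
   * @param reported container ID mapped to true if running, false if completed
   * @return IDs that should be reported to the scheduler as completed
   */
  List<String> handleHeartbeat(Map<String, Boolean> reported) {
    List<String> completed = new ArrayList<>();
    for (Map.Entry<String, Boolean> entry : reported.entrySet()) {
      if (entry.getValue()) {
        launchedContainers.add(entry.getKey());
      } else if (launchedContainers.remove(entry.getKey())) {
        // Normal path: the node explicitly reported the completion.
        completed.add(entry.getKey());
      }
    }

    // Leak fix (sketch): anything still tracked but absent from this
    // heartbeat is assumed lost by the node and is treated as completed.
    List<String> vanished = new ArrayList<>();
    for (String id : launchedContainers) {
      if (!reported.containsKey(id)) {
        vanished.add(id);
      }
    }
    Collections.sort(vanished); // deterministic order for this example only
    launchedContainers.removeAll(vanished);
    completed.addAll(vanished);
    return completed;
  }

  int trackedCount() {
    return launchedContainers.size();
  }

  public static void main(String[] args) {
    NodeContainerTracker tracker = new NodeContainerTracker();
    tracker.handleHeartbeat(Map.of("c1", true, "c2", true));
    // c2 disappears without a completion report; it is still released.
    System.out.println(tracker.handleHeartbeat(Map.of("c1", true))); // [c2]
  }
}
```

Without the second loop, c2 would stay in the set forever and the scheduler
would never hear that it finished, which is the leak described in this issue.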
> RM leaks containers if running container disappears from node update
> --------------------------------------------------------------------
>
> Key: YARN-5197
> URL: https://issues.apache.org/jira/browse/YARN-5197
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.2, 2.6.4
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Fix For: 2.6.5, 2.7.4
>
> Attachments: YARN-5197-branch-2.7.003.patch,
> YARN-5197-branch-2.8.003.patch, YARN-5197.001.patch, YARN-5197.002.patch,
> YARN-5197.003.patch
>
>
> Once a node reports a container running in a status update, the corresponding
> RMNodeImpl will track the container in its launchedContainers map. If the
> node somehow misses sending the completed container status to the RM and the
> container simply disappears from subsequent heartbeats, the container will
> leak in launchedContainers forever and the container completion event will
> not be sent to the scheduler.