[ 
https://issues.apache.org/jira/browse/YARN-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341894#comment-15341894
 ] 

Jason Lowe commented on YARN-5197:
----------------------------------

bq. is this possible that container info disappear from node update?

Yes, it is definitely possible; we've seen it in practice.  If the 
application is tearing down, the RM will tell the NM to clean up the 
application.  There are scenarios where the NM can fail to report a completed 
container for an application that is being cleaned up, since it's removing all 
the app state and the containers that go with it.  Since the app is cleaning 
up, there's no AM around to ack.  And if the NM never reports a completion 
event for a container, then without this patch RMNodeImpl leaks that 
container's entry in the launchedContainers map.

The patch also covers the corner case where the NM failed to record state for a 
container somehow (I/O error or other state store failure) and reconnected with 
partial state.  In that scenario the RM will properly detect that the container 
is no longer being tracked by the NM and report the completion to the 
application (as well as preventing the leak in launchedContainers).
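
To make the reconciliation idea concrete, here is a minimal sketch of the 
behavior described above.  It is not the actual patch or the real RMNodeImpl 
code: the class LaunchedContainerTracker, its methods, and the use of plain 
strings as container IDs are all hypothetical simplifications.  The assumption 
is only that each heartbeat carries the set of containers the NM still reports 
as running, so anything tracked but missing from that set can be treated as 
completed.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical, simplified stand-in for per-node container tracking.
 * This is an illustration only, not the real RMNodeImpl.
 */
class LaunchedContainerTracker {
    // Containers the node has reported as running at some point.
    private final Set<String> launchedContainers = new HashSet<>();

    /**
     * Processes one node heartbeat.
     *
     * @param running   IDs the node reports as currently running
     * @param completed IDs the node explicitly reports as completed
     * @return IDs to hand on as completed: the explicit completions plus
     *         any tracked container that vanished from the heartbeat
     */
    Set<String> onHeartbeat(Collection<String> running,
                            Collection<String> completed) {
        // Explicit completions are always passed along and untracked.
        Set<String> finished = new HashSet<>(completed);
        launchedContainers.removeAll(completed);

        // Containers still reported as running (ignoring any that were
        // also reported as completed in the same heartbeat).
        Set<String> reported = new HashSet<>(running);
        reported.removeAll(completed);

        // Tracked but no longer reported: the node lost track of these,
        // so treat them as completed instead of leaking them.
        Set<String> lost = new HashSet<>(launchedContainers);
        lost.removeAll(reported);
        launchedContainers.removeAll(lost);
        finished.addAll(lost);

        // Start (or keep) tracking everything reported as running.
        launchedContainers.addAll(reported);
        return finished;
    }

    /** Number of containers currently tracked for this node. */
    int trackedCount() {
        return launchedContainers.size();
    }
}
```

In this sketch, if a node reports c1 and c2 as running and a later heartbeat 
mentions only c1, the second call returns c2 as completed and the tracker is 
left holding just c1.  Without the "lost" step, c2 would stay in the set 
indefinitely, which is the leak this issue describes.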


> RM leaks containers if running container disappears from node update
> --------------------------------------------------------------------
>
>                 Key: YARN-5197
>                 URL: https://issues.apache.org/jira/browse/YARN-5197
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.2, 2.6.4
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>             Fix For: 2.6.5, 2.7.4
>
>         Attachments: YARN-5197-branch-2.7.003.patch, 
> YARN-5197-branch-2.8.003.patch, YARN-5197.001.patch, YARN-5197.002.patch, 
> YARN-5197.003.patch
>
>
> Once a node reports a container running in a status update, the corresponding 
> RMNodeImpl will track the container in its launchedContainers map.  If the 
> node somehow misses sending the completed container status to the RM and the 
> container simply disappears from subsequent heartbeats, the container will 
> leak in launchedContainers forever and the container completion event will 
> not be sent to the scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

