[ https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324955#comment-14324955 ]
Jason Lowe commented on YARN-3194:
----------------------------------
bq. Jason Lowe, I remember we discussed this case in some JIRA under YARN-1336,
did you see this problem before?
I didn't see this problem originally, and I suspect that is because two
things masked it. As mentioned above, this problem doesn't manifest
before YARN-2997. In addition, I was testing it with MapReduce applications,
and the MR AM will explicitly kill containers for tasks that have completed (as
reported by the umbilical connection between the AM and tasks).
I agree that we should be processing the container report sent with the NM
registration, and it appears that the report is being dropped when the
reconnected event is handled.
Comments on the patch:
I noticed that the container status processing code is _almost_ a duplicate of
the same code in StatusUpdateWhenHealthyTransition. One difference is that we
don't remove containers that have completed from the launchedContainers map,
which seems wrong. I don't see why we would process container status sent
during a reconnect differently than a regular status update from the NM.
Therefore I think we should refactor the code to reuse this logic, as it should
apply here just as it does for StatusUpdateWhenHealthyTransition.
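To illustrate the kind of refactoring I mean, here is a minimal, self-contained sketch. It is not the actual RM code: NodeContainerTracker, ContainerReport, processContainerReports, onStatusUpdate, and onReconnect are hypothetical names, and the model tracks only container IDs. The point is simply that both the regular status update path and the reconnect path go through one shared method, so completed containers are removed from launchedContainers in both cases.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model for illustration only; NOT the real RM node code.
class NodeContainerTracker {
    enum ContainerState { RUNNING, COMPLETE }

    // Stand-in for a container status reported by the NM (assumed shape).
    static final class ContainerReport {
        final String containerId;
        final ContainerState state;

        ContainerReport(String containerId, ContainerState state) {
            this.containerId = containerId;
            this.state = state;
        }
    }

    private final Set<String> launchedContainers = new HashSet<>();

    // Shared logic: track running containers, and drop completed ones from
    // launchedContainers while collecting them so they can be released.
    private List<String> processContainerReports(List<ContainerReport> reports) {
        List<String> completed = new ArrayList<>();
        for (ContainerReport report : reports) {
            if (report.state == ContainerState.RUNNING) {
                launchedContainers.add(report.containerId);
            } else {
                launchedContainers.remove(report.containerId);
                completed.add(report.containerId);
            }
        }
        return completed;
    }

    // Regular heartbeat path.
    List<String> onStatusUpdate(List<ContainerReport> reports) {
        return processContainerReports(reports);
    }

    // Reconnect path (NM re-registration): same processing, no special case.
    List<String> onReconnect(List<ContainerReport> reports) {
        return processContainerReports(reports);
    }

    // Defensive copy so callers cannot mutate internal state.
    Set<String> launchedContainers() {
        return new HashSet<>(launchedContainers);
    }

    public static void main(String[] args) {
        NodeContainerTracker tracker = new NodeContainerTracker();
        List<ContainerReport> beforeRestart = new ArrayList<>();
        beforeRestart.add(new ContainerReport("c1", ContainerState.RUNNING));
        beforeRestart.add(new ContainerReport("c2", ContainerState.RUNNING));
        tracker.onStatusUpdate(beforeRestart);

        // c1 finished while the NM was down; it is reported at re-registration.
        List<ContainerReport> atReconnect = new ArrayList<>();
        atReconnect.add(new ContainerReport("c1", ContainerState.COMPLETE));
        atReconnect.add(new ContainerReport("c2", ContainerState.RUNNING));
        System.out.println("completed: " + tracker.onReconnect(atReconnect));
        System.out.println("still launched: " + tracker.launchedContainers());
    }
}
```

With this shape, a container that completed while the NM was down is handled at re-registration the same way it would be on a normal heartbeat.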
> After NM restart,completed containers are not released by RM which are sent
> during NM registration
> --------------------------------------------------------------------------------------------------
>
> Key: YARN-3194
> URL: https://issues.apache.org/jira/browse/YARN-3194
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.6.0
> Environment: NM restart is enabled
> Reporter: Rohith
> Assignee: Rohith
> Attachments: 0001-yarn-3194-v1.patch
>
>
> On NM restart, the NM sends all outstanding NMContainerStatus to the RM, but
> the RM processes only those in ContainerState.RUNNING. If a container
> completed while the NM was down, its resources won't be released, which
> causes applications to hang.