Jason Lowe commented on YARN-3194:

bq. Jason Lowe, I remember we discussed this case in some JIRA under YARN-1336, 
did you see this problem before?

I didn't see this problem originally, and I suspect that is because two things 
masked it.  As mentioned above, this problem doesn't manifest before YARN-2997.  
In addition, I was testing with MapReduce applications, and the MR AM 
explicitly kills containers for tasks that have completed (as reported over the 
umbilical connection between the AM and tasks).

I agree that we should be processing the container report sent with the NM 
registration, and it appears that report is being dropped when the reconnected 
event is handled.

Comments on the patch:

I noticed that the container status processing code is _almost_ a duplicate of 
the same code in StatusUpdateWhenHealthyTransition.  One difference is that we 
don't remove completed containers from the launchedContainers map, which seems 
wrong.  I don't see why we would process container status sent during a 
reconnect differently than a regular status update from the NM.  Therefore I 
think we should refactor the code so that both paths share this logic.
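
To make the suggestion concrete, here is a rough sketch of the shape I have in 
mind.  It is not the actual RMNodeImpl code: every class and method name below 
(SketchNode, handleContainerStatus, onStatusUpdate, onReconnect) is a 
simplified, hypothetical stand-in, and I am assuming the shared logic only needs 
to maintain the launchedContainers set and collect completed containers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-ins for illustration only; these are NOT the real YARN classes.
enum SketchContainerState { RUNNING, COMPLETE }

final class SketchContainerStatus {
    final String containerId;
    final SketchContainerState state;

    SketchContainerStatus(String containerId, SketchContainerState state) {
        this.containerId = containerId;
        this.state = state;
    }
}

final class SketchNode {
    // Containers the RM currently believes are running on this node.
    final Set<String> launchedContainers = new HashSet<>();
    // Completed containers whose resources should be released.
    final List<String> completedContainers = new ArrayList<>();

    // Hypothetical shared helper: a single code path for every container report.
    private void handleContainerStatus(List<SketchContainerStatus> statuses) {
        for (SketchContainerStatus status : statuses) {
            if (status.state == SketchContainerState.RUNNING) {
                launchedContainers.add(status.containerId);
            } else {
                // Completed containers leave the launched set in both paths.
                launchedContainers.remove(status.containerId);
                completedContainers.add(status.containerId);
            }
        }
    }

    // Regular heartbeat from the NM.
    void onStatusUpdate(List<SketchContainerStatus> statuses) {
        handleContainerStatus(statuses);
    }

    // NM re-registration after a restart; same processing, no special case.
    void onReconnect(List<SketchContainerStatus> statuses) {
        handleContainerStatus(statuses);
    }
}
```

With this shape, a container that finished while the NM was down and is 
reported as COMPLETE at re-registration is removed from launchedContainers and 
queued for release exactly as it would be on a normal heartbeat.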

> After NM restart,completed containers are not released by RM which are sent 
> during NM registration
> --------------------------------------------------------------------------------------------------
>                 Key: YARN-3194
>                 URL: https://issues.apache.org/jira/browse/YARN-3194
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>         Environment: NM restart is enabled
>            Reporter: Rohith
>            Assignee: Rohith
>         Attachments: 0001-yarn-3194-v1.patch
> On NM restart, the NM sends all outstanding NMContainerStatus reports to the 
> RM, but the RM processes only those in ContainerState.RUNNING. If a container 
> completed while the NM was down, its resources are not released, which causes 
> applications to hang.

This message was sent by Atlassian JIRA
