[ 
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325610#comment-14325610
 ] 

Rohith commented on YARN-3194:
------------------------------

Thanks [~jlowe] [~djp] [~jianhe] for the detailed review :-)

bq. the container status processing code is almost a duplicate of the same code 
in StatusUpdateWhenHealthyTransition
Agreed, this has to be refactored. Most of the containerStatus processing code 
is the same.

bq. we don't remove containers that have completed from the launchedContainers 
map which seems wrong
I see, yes. Completed containers should be removed from launchedContainers.
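
To illustrate the two points above, here is a minimal sketch of a single helper shared by the regular status update path and the reconnect path, which also drops completed containers from launchedContainers. All type and method names below (SketchContainerState, SketchContainerStatus, SketchNodeTracker, processContainerStatuses) are hypothetical stand-ins for illustration, not the real RMNodeImpl code or the attached patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-ins for the YARN types discussed here.
enum SketchContainerState { RUNNING, COMPLETE }

final class SketchContainerStatus {
    final String containerId;
    final SketchContainerState state;

    SketchContainerStatus(String containerId, SketchContainerState state) {
        this.containerId = containerId;
        this.state = state;
    }
}

final class SketchNodeTracker {
    // Containers the RM currently believes are running on this node.
    final Set<String> launchedContainers = new HashSet<>();

    // Assumed shared entry point for both heartbeat and reconnect handling.
    // Returns the ids of completed containers so the caller can release them.
    List<String> processContainerStatuses(List<SketchContainerStatus> statuses) {
        List<String> completed = new ArrayList<>();
        for (SketchContainerStatus status : statuses) {
            if (status.state == SketchContainerState.RUNNING) {
                launchedContainers.add(status.containerId);
            } else {
                // A finished container must no longer be tracked as launched.
                launchedContainers.remove(status.containerId);
                completed.add(status.containerId);
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        SketchNodeTracker node = new SketchNodeTracker();
        // Regular heartbeat: both containers are running.
        node.processContainerStatuses(Arrays.asList(
            new SketchContainerStatus("container_1", SketchContainerState.RUNNING),
            new SketchContainerStatus("container_2", SketchContainerState.RUNNING)));
        // Reconnect after NM restart: container_1 finished while the NM was down.
        List<String> completed = node.processContainerStatuses(Arrays.asList(
            new SketchContainerStatus("container_1", SketchContainerState.COMPLETE),
            new SketchContainerStatus("container_2", SketchContainerState.RUNNING)));
        System.out.println("completed=" + completed
            + " launched=" + node.launchedContainers);
    }
}
```

Because both transitions would call the same method, the reconnect path cannot drift from the heartbeat path again.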

bq. I don't see why we would process container status sent during a reconnect 
differently than a regular status update from the NM
IIUC it is only to deal with the difference between NMContainerStatus and 
ContainerStatus. But I am not sure why the two were created separately. What I 
see is that ContainerStatus is a subset of NMContainerStatus. I think 
ContainerStatus could have been placed inside NMContainerStatus. 
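
As a rough sketch of that idea, the record sent at registration could embed the basic status instead of duplicating its fields, so that one code path can consume both. The types and fields below (BasicStatusSketch, NodeReportedStatusSketch, allocatedMB, creationTime) are hypothetical simplifications, not the actual YARN records.

```java
// Hypothetical, simplified types; these are not the real YARN records.
final class BasicStatusSketch {
    final String containerId;
    final String state;       // e.g. "RUNNING" or "COMPLETE"
    final int exitStatus;
    final String diagnostics;

    BasicStatusSketch(String containerId, String state,
                      int exitStatus, String diagnostics) {
        this.containerId = containerId;
        this.state = state;
        this.exitStatus = exitStatus;
        this.diagnostics = diagnostics;
    }
}

final class NodeReportedStatusSketch {
    // The shared subset is embedded rather than duplicated field by field.
    final BasicStatusSketch status;
    // Extra details that only the registration-time record carries (assumed).
    final int allocatedMB;
    final long creationTime;

    NodeReportedStatusSketch(BasicStatusSketch status,
                             int allocatedMB, long creationTime) {
        this.status = status;
        this.allocatedMB = allocatedMB;
        this.creationTime = creationTime;
    }

    // Lets status-processing code accept either record through one type.
    BasicStatusSketch toBasicStatus() {
        return status;
    }

    public static void main(String[] args) {
        BasicStatusSketch basic =
            new BasicStatusSketch("container_1", "COMPLETE", 0, "");
        NodeReportedStatusSketch reported =
            new NodeReportedStatusSketch(basic, 2048, 0L);
        System.out.println(reported.toBasicStatus().containerId
            + " " + reported.toBasicStatus().state);
    }
}
```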

bq. Is below condition valid for the newly added code in 
ReconnectNodeTransition too ? 
Yes, it is applicable since we are keeping the old RMNode object.

bq. Add timeout to the test, testAppCleanupWhenNMRstarts -> 
testProcessingContainerStatusesOnNMRestart ? and add more detailed comments 
about what the test is doing too ? 
Agreed. 

bq. Could you add a validation that ApplicationMasterService#allocate indeed 
receives the completed container in this scenario?
Agreed, I will add it.

bq. Question: does the 3072 include 1024 for the AM container and 2048 for the 
allocated container ? 
AM memory is 1024 MB and the additionally requested container memory is 2048 MB. 
In the test, the number of requested containers is 1, so AllocatedMB should be 
AM + requested, i.e. 1024 + 2048 = 3072.
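
The same arithmetic as a small sketch, in case the test parameters change later. The class and method names (AllocatedMemorySketch, expectedAllocatedMB) are hypothetical and are not part of the test in the patch.

```java
final class AllocatedMemorySketch {
    // Expected AllocatedMB = AM container memory + memory of all requested containers.
    static int expectedAllocatedMB(int amMB, int containerMB, int containerCount) {
        return amMB + containerMB * containerCount;
    }

    public static void main(String[] args) {
        // Values from the test: 1024 MB AM, one requested container of 2048 MB.
        System.out.println(expectedAllocatedMB(1024, 2048, 1)); // prints 3072
    }
}
```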

> After NM restart,completed containers are not released by RM which are sent 
> during NM registration
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3194
>                 URL: https://issues.apache.org/jira/browse/YARN-3194
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.0
>         Environment: NM restart is enabled
>            Reporter: Rohith
>            Assignee: Rohith
>            Priority: Blocker
>         Attachments: 0001-yarn-3194-v1.patch
>
>
> On NM restart, the NM sends all outstanding NMContainerStatus to the RM. But 
> the RM processes only ContainerState.RUNNING. If a container completed while 
> the NM was down, its resources won't be released, which causes applications 
> to hang.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
