[ https://issues.apache.org/jira/browse/YARN-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523389#comment-14523389 ]
Junping Du commented on YARN-958:
---------------------------------

Sounds like the problem is still valid. Shall we delay cleaning up finishedApplications in RMNodeImpl until we hear back from the NM in the next heartbeat? However, that heartbeat could also be lost.

> NM may miss a heartbeat response from RM resulting into missed finished
> applications information.
> -------------------------------------------------------------------------------------------------
>
>                 Key: YARN-958
>                 URL: https://issues.apache.org/jira/browse/YARN-958
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Omkar Vinit Joshi
>
> Today, whenever the RM receives a heartbeat from an NM, it computes a new
> heartbeat response and sends it back to the NM. Internally, this response is
> also sent to RMNodeImpl as an RMNodeEvent via the dispatcher queue. If for
> some reason the NM did not get the older heartbeat response, it will try to
> heartbeat again. The RM will in turn compute another response (if it has not
> already handled the event from the queue) and add this duplicate response to
> the dispatcher queue. Today, we remove completed applications from RMNodeImpl
> while computing the response. So if the NM gets a response without the
> finished applications, it will never realize that those applications
> finished.
> Solution:-
> * We should synchronously update the newly computed response.
> * lastResponse should be moved out of RMNodeImpl and stored in
> ResourceTrackerService itself, just like ApplicationMasterService.
> * Like YARN-744, we should introduce locking while computing the response.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
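
For illustration only, here is a minimal Java sketch of the idea in the proposed solution: the last response is stored next to the code that answers heartbeats, it is updated under a lock, and it is resent when the NM reports that it never saw it. This is not the actual YARN code. The class names (HeartbeatResponse, NodeHeartbeatTracker), the method names, and the exact response-ID comparison are all hypothetical and chosen only to show the pattern.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical, simplified stand-in for a heartbeat response (not the real YARN type). */
final class HeartbeatResponse {
    final int responseId;
    final List<String> finishedApps;

    HeartbeatResponse(int responseId, List<String> finishedApps) {
        this.responseId = responseId;
        // Defensive copy so later changes to the pending list cannot alter a stored response.
        this.finishedApps = Collections.unmodifiableList(new ArrayList<String>(finishedApps));
    }
}

/** Hypothetical per-node state kept by whatever service answers heartbeats. */
final class NodeHeartbeatTracker {
    private final List<String> pendingFinishedApps = new ArrayList<String>();
    private HeartbeatResponse lastResponse =
        new HeartbeatResponse(0, Collections.<String>emptyList());

    /** Records a finished application that the node still has to be told about. */
    synchronized void applicationFinished(String appId) {
        pendingFinishedApps.add(appId);
    }

    /**
     * Handles one heartbeat. lastSeenResponseId is the ID of the last response
     * the node actually received. The whole method holds the lock, so computing
     * the response and storing it as lastResponse happen as one step.
     */
    synchronized HeartbeatResponse heartbeat(int lastSeenResponseId) {
        if (lastSeenResponseId == lastResponse.responseId - 1) {
            // The node never saw our last response: resend it unchanged,
            // including the finished applications it carried.
            return lastResponse;
        }
        if (lastSeenResponseId != lastResponse.responseId) {
            throw new IllegalStateException(
                "Unexpected response ID " + lastSeenResponseId
                    + ", last sent was " + lastResponse.responseId);
        }
        // The node has seen the last response, so it is now safe to hand out
        // the pending finished applications and clear them.
        HeartbeatResponse next =
            new HeartbeatResponse(lastResponse.responseId + 1, pendingFinishedApps);
        pendingFinishedApps.clear();
        lastResponse = next;
        return next;
    }
}
```

In this sketch, a node that heartbeats twice with the same lastSeenResponseId gets the identical response both times, so the finished applications are not dropped if the first reply is lost. A lost heartbeat request is covered the same way, since the node retries with the same ID.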