Junping Du commented on YARN-958:

Sounds like the problem is still valid. Should we delay cleaning up 
finishedApplications in RMNodeImpl until we hear back from the NM in its next 
heartbeat? However, that heartbeat could also be lost.
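
To make the idea concrete, here is a minimal sketch of acknowledgement-based 
cleanup. It is not the RMNodeImpl code: the class FinishedAppTracker, its 
methods, and the assumption that each NM heartbeat reports the id of the last 
response it actually received are all hypothetical. Finished applications are 
resent in every response until the NM confirms a response that carried them, 
so a single lost response or heartbeat does not drop them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the per-node finished-application bookkeeping.
class FinishedAppTracker {
    // appId -> id of the first response that carried it (0 = not sent yet).
    private final Map<String, Integer> pending = new LinkedHashMap<>();
    private int lastResponseId = 0;

    synchronized void appFinished(String appId) {
        pending.putIfAbsent(appId, 0);
    }

    // ackedResponseId: id of the last response the NM says it received
    // (assumed to be carried in the heartbeat request).
    // Returns the finished apps to include in the new response.
    synchronized List<String> onHeartbeat(int ackedResponseId) {
        // Clean up only the apps that the NM has confirmed receiving.
        pending.values().removeIf(sent -> sent != 0 && sent <= ackedResponseId);

        lastResponseId++;
        // Everything still pending is (re)sent, whether or not it was sent before.
        List<String> toSend = new ArrayList<>(pending.keySet());
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) {
                e.setValue(lastResponseId); // remember the first response carrying it
            }
        }
        return toSend;
    }
}
```

With this shape, a lost heartbeat only delays the cleanup: the entry stays in 
the pending map until a later heartbeat acknowledges it.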

> NM may miss a heartbeat response from RM resulting into missed finished 
> applications information.
> -------------------------------------------------------------------------------------------------
>                 Key: YARN-958
>                 URL: https://issues.apache.org/jira/browse/YARN-958
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Omkar Vinit Joshi
> Today, whenever the RM receives a heartbeat from an NM, it computes a new 
> heartbeat response and sends it back to the NM. Internally, this response is 
> also sent to RMNodeImpl as an RMNodeEvent via the dispatcher queue. If for 
> some reason the NM did not receive the earlier heartbeat response, it will 
> heartbeat again. The RM will then compute another response (if it has not 
> already handled the event from the queue) and add this duplicate response to 
> the dispatcher queue. Today, while computing the response, we remove completed 
> applications from RMNodeImpl. So if the NM gets a response without the 
> finished applications, it will never realize that those applications finished.
> Solution:
> * We should synchronously update the newly computed response.
> * lastResponse should be moved out of RMNodeImpl and stored in 
> ResourceTrackerService itself, just as in ApplicationMasterService.
> * As in YARN-744, we should introduce locking while computing the response.
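
The three points in the quoted solution could be sketched roughly as follows. 
This is an illustration under assumptions, not the actual 
ResourceTrackerService: the classes TrackerServiceSketch, NodeState and 
HeartbeatResponse, and the rule that a request whose last known response id is 
one behind the stored response is treated as a duplicate, are hypothetical. 
The last response is kept in the service, computed and stored under a per-node 
lock, and replayed unchanged when the NM repeats a heartbeat.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical immutable heartbeat response.
class HeartbeatResponse {
    final int responseId;
    final List<String> finishedApps;

    HeartbeatResponse(int responseId, List<String> finishedApps) {
        this.responseId = responseId;
        this.finishedApps = Collections.unmodifiableList(finishedApps);
    }
}

// Hypothetical per-node state held by the service; also serves as the lock.
class NodeState {
    HeartbeatResponse lastResponse =
        new HeartbeatResponse(0, Collections.<String>emptyList());
    final List<String> finishedApps = new ArrayList<>();
}

class TrackerServiceSketch {
    private final ConcurrentMap<String, NodeState> nodes = new ConcurrentHashMap<>();

    void appFinished(String nodeId, String appId) {
        NodeState s = nodes.computeIfAbsent(nodeId, k -> new NodeState());
        synchronized (s) {
            s.finishedApps.add(appId);
        }
    }

    // lastKnownResponseId: id of the last response the NM received (assumed field).
    HeartbeatResponse heartbeat(String nodeId, int lastKnownResponseId) {
        NodeState s = nodes.computeIfAbsent(nodeId, k -> new NodeState());
        synchronized (s) { // lock while computing and storing the response
            if (lastKnownResponseId + 1 == s.lastResponse.responseId) {
                // The NM missed our previous response: replay it as-is, so the
                // finished applications it carried are not lost.
                return s.lastResponse;
            }
            // Out-of-sync ids are not handled in this sketch.
            HeartbeatResponse r = new HeartbeatResponse(
                s.lastResponse.responseId + 1, new ArrayList<>(s.finishedApps));
            s.finishedApps.clear();
            s.lastResponse = r; // updated synchronously, not via a queued event
            return r;
        }
    }
}
```

Because the stored response is updated inside the same critical section that 
drains the finished applications, a retried heartbeat can no longer observe a 
state in which the applications were removed but the response was never 
delivered.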
