Omkar Vinit Joshi created YARN-958:
--------------------------------------

             Summary: NM may miss a heartbeat from RM resulting into missed 
finished applications information.
                 Key: YARN-958
                 URL: https://issues.apache.org/jira/browse/YARN-958
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Omkar Vinit Joshi


Today, whenever the RM receives a heartbeat from an NM, it computes a new 
heartbeat response and sends it back to the NM. Internally, this response is 
also sent to RMNodeImpl as an RMNodeEvent via the dispatcher queue. If for some 
reason the NM does not receive that response, it will heartbeat again. The RM 
will then compute another response (if it has not already handled the earlier 
event from the queue) and add this duplicate response to the dispatcher queue. 
Because we remove completed applications from RMNodeImpl while computing a 
response, the finished applications go out only in the first, lost response. 
The NM then gets a response without them and will never realize that those 
applications finished.
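
The failure mode described above can be shown with a very small model. This is 
only an illustrative sketch: LossyHeartbeatModel and its methods are 
hypothetical names, not YARN classes, and the dispatcher queue is left out. The 
only property it borrows from the description is that computing a response 
drains the list of finished applications.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the race; this is not the RMNodeImpl code.
class LossyHeartbeatModel {
    // Applications that finished on the node and have not yet been reported.
    private final List<String> finishedApps = new ArrayList<>();

    void applicationFinished(String appId) {
        finishedApps.add(appId);
    }

    // Computing a response removes the finished applications from the node
    // state, so a second call no longer knows about them.
    List<String> computeResponse() {
        List<String> response = new ArrayList<>(finishedApps);
        finishedApps.clear();
        return response;
    }

    public static void main(String[] args) {
        LossyHeartbeatModel rm = new LossyHeartbeatModel();
        rm.applicationFinished("app_1");

        // First response carries app_1, but assume it never reaches the NM.
        List<String> lost = rm.computeResponse();
        // The NM heartbeats again and receives a freshly computed response.
        List<String> retried = rm.computeResponse();

        System.out.println("lost response:    " + lost);
        System.out.println("retried response: " + retried);
    }
}
```

In this model the retried response is empty, which corresponds to the NM never 
learning that the application finished.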

Proposed solution:
* We should record the newly computed response synchronously, rather than via 
the dispatcher queue.
* lastResponse should be moved out of RMNodeImpl and stored in 
ResourceTrackerService itself, just like in ApplicationMasterService.
* As in YARN-744, we should introduce locking while computing the response.
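
A rough sketch of how the three points above could fit together is given 
below. It is not a patch: HeartbeatTrackerSketch, Response, NodeState and the 
lastSeenResponseId parameter are all hypothetical names, and the sketch assumes 
each response carries an increasing id that the NM echoes back on its next 
heartbeat. The last response is stored by the tracker itself, it is recorded 
before the lock is released, and a retried heartbeat gets the stored response 
back instead of a freshly computed one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the proposal; names are illustrative, not YARN APIs.
class HeartbeatTrackerSketch {
    // Immutable response: an id plus the applications finished since the
    // previous response.
    static final class Response {
        final int id;
        final List<String> finishedApps;

        Response(int id, List<String> finishedApps) {
            this.id = id;
            this.finishedApps =
                Collections.unmodifiableList(new ArrayList<>(finishedApps));
        }
    }

    // Per-node state kept by the tracker rather than by the node object.
    private static final class NodeState {
        Response lastResponse = new Response(0, Collections.<String>emptyList());
        final List<String> pendingFinishedApps = new ArrayList<>();
    }

    private final Map<String, NodeState> nodes = new HashMap<>();

    synchronized void applicationFinished(String nodeId, String appId) {
        nodes.computeIfAbsent(nodeId, k -> new NodeState())
             .pendingFinishedApps.add(appId);
    }

    // lastSeenResponseId is the id of the last response the NM received.
    // A single coarse lock is used here; per-node locking would also work.
    synchronized Response heartbeat(String nodeId, int lastSeenResponseId) {
        NodeState state = nodes.computeIfAbsent(nodeId, k -> new NodeState());
        if (lastSeenResponseId + 1 == state.lastResponse.id) {
            // The NM never saw the previous response: resend it unchanged,
            // including its finished applications.
            return state.lastResponse;
        }
        // Otherwise build the next response and record it while still holding
        // the lock. Out-of-sync ids are not treated specially in this sketch.
        Response next =
            new Response(state.lastResponse.id + 1, state.pendingFinishedApps);
        state.pendingFinishedApps.clear();
        state.lastResponse = next;
        return next;
    }
}
```

With this arrangement, a heartbeat retried with the same lastSeenResponseId 
returns the same Response object, so the finished applications are delivered 
again rather than dropped.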

