Omkar Vinit Joshi created YARN-958:
--------------------------------------
Summary: NM may miss a heartbeat from RM, resulting in missed
finished applications information.
Key: YARN-958
URL: https://issues.apache.org/jira/browse/YARN-958
Project: Hadoop YARN
Issue Type: Bug
Reporter: Omkar Vinit Joshi
Today, whenever the RM receives a heartbeat from an NM, it computes a new
heartbeat response and sends it back to the NM. Internally, this response is
also sent to RMNodeImpl as an RMNodeEvent via the dispatcher queue. If for some
reason the NM does not receive that response, it will heartbeat again. The RM
will then compute another response (if it has not yet handled the earlier event
from the queue) and add this duplicate response to the dispatcher queue.
Because we remove completed applications from RMNodeImpl while computing a
response, the second response no longer lists them. If the NM only receives a
response without the finished applications, it will never realize that those
applications finished.
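
The sequence above can be sketched with a small single-threaded model. This is
only an illustration under stated assumptions, not the actual YARN code: the
class AsyncNodeTrackerSketch, its Response type, and the drainDispatcher()
method are hypothetical stand-ins for RMNodeImpl, the heartbeat response, and
the dispatcher queue. The duplicate check on response ids is likewise a
simplification.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Simplified, hypothetical model of the race; these are not the real YARN classes.
class AsyncNodeTrackerSketch {
    static final class Response {
        final int id;
        final List<String> finishedApps;

        Response(int id, List<String> finishedApps) {
            this.id = id;
            this.finishedApps = finishedApps;
        }
    }

    // Stand-in for the finished-application list kept on the node object.
    private final List<String> finishedApps = new ArrayList<>();
    // Stand-in for the dispatcher queue that carries computed responses.
    private final Queue<Response> dispatcherQueue = new ArrayDeque<>();
    // Only updated when the dispatcher event is handled (asynchronously).
    private Response lastResponse = new Response(0, new ArrayList<>());

    void appFinished(String appId) {
        finishedApps.add(appId);
    }

    Response heartbeat(int nmLastResponseId) {
        // A retry is recognised only if lastResponse has already been updated.
        if (nmLastResponseId + 1 == lastResponse.id) {
            return lastResponse;
        }
        // Computing a response drains the finished-application list.
        Response fresh =
            new Response(lastResponse.id + 1, new ArrayList<>(finishedApps));
        finishedApps.clear();
        dispatcherQueue.add(fresh); // lastResponse is NOT updated here
        return fresh;
    }

    // Models the dispatcher thread eventually handling the queued events.
    void drainDispatcher() {
        while (!dispatcherQueue.isEmpty()) {
            lastResponse = dispatcherQueue.poll();
        }
    }
}
```

In this model, if the first response is lost and the NM retries before
drainDispatcher() runs, the retry is treated as a fresh heartbeat and returns
an empty finished-application list. If the queue is drained first, the retry is
recognised and the original response is resent.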
Solution:-
* We should record the newly computed response synchronously, rather than
through the dispatcher queue.
* lastResponse should be moved out of RMNodeImpl and stored in
ResourceTrackerService itself, just like ApplicationMasterService.
* As in YARN-744, we should introduce locking while computing the response.
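
A minimal sketch of how the three points could fit together, assuming a
per-node state object held by the tracker service. The names SyncTrackerSketch,
NodeState, and Response are hypothetical and are not taken from the YARN code
base; the actual patch may look quite different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the proposed fix; not the real ResourceTrackerService.
class SyncTrackerSketch {
    static final class Response {
        final int id;
        final List<String> finishedApps;

        Response(int id, List<String> finishedApps) {
            this.id = id;
            this.finishedApps = finishedApps;
        }
    }

    // Per-node state kept by the service itself; also serves as the lock object.
    private static final class NodeState {
        final List<String> finishedApps = new ArrayList<>();
        Response lastResponse = new Response(0, new ArrayList<>());
    }

    private final ConcurrentHashMap<String, NodeState> nodes =
        new ConcurrentHashMap<>();

    void appFinished(String nodeId, String appId) {
        NodeState state = nodes.computeIfAbsent(nodeId, k -> new NodeState());
        synchronized (state) {
            state.finishedApps.add(appId);
        }
    }

    Response heartbeat(String nodeId, int nmLastResponseId) {
        NodeState state = nodes.computeIfAbsent(nodeId, k -> new NodeState());
        // Lock while computing, so duplicate heartbeats cannot interleave.
        synchronized (state) {
            // Retry of a heartbeat whose response was lost: resend it unchanged.
            if (nmLastResponseId + 1 == state.lastResponse.id) {
                return state.lastResponse;
            }
            Response fresh = new Response(
                state.lastResponse.id + 1, new ArrayList<>(state.finishedApps));
            state.finishedApps.clear();
            // Synchronous update: the next heartbeat always sees this response.
            state.lastResponse = fresh;
            return fresh;
        }
    }
}
```

Because lastResponse is updated inside the same critical section that drains
the finished-application list, a retried heartbeat gets the original response
back, including the finished applications.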