[ 
https://issues.apache.org/jira/browse/YARN-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166923#comment-15166923
 ] 

sandflee commented on YARN-4673:
--------------------------------

Hi [~ozawa], in ResourceTrackerService we may concurrently process two 
nodeHeartBeat() calls with the same nodeId and responseId. Both may pass the 
lastResponseId check, which would cause an RM message to be lost. With the 
node lock, heartbeats are processed one at a time, so the second call is caught by the duplicate check:
{code}
      if (remoteNodeStatus.getResponseId() + 1 == lastNodeHeartbeatResponse
          .getResponseId()) {
        LOG.info("Received duplicate heartbeat from node " +
            rmNode.getNodeAddress() + " responseId=" +
            remoteNodeStatus.getResponseId());
        return lastNodeHeartbeatResponse;
      }
{code}

Actually, I have not encountered a bug caused by this, but it may be a risk.
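
To illustrate the idea, here is a minimal sketch of serializing heartbeats per node. It is not the code from the attached patch: the class `NodeHeartbeatSketch`, the `Result` enum, and the two maps are hypothetical stand-ins, and the real RM keeps this state in RMNode and the last NodeHeartbeatResponse. Only the duplicate check mirrors the snippet above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical stand-in for the RM-side heartbeat handling; not the actual YARN code. */
class NodeHeartbeatSketch {
    enum Result { PROCESSED, DUPLICATE, OUT_OF_SYNC }

    // One lock object per nodeId (an assumed scheme; the patch may lock differently).
    private final ConcurrentMap<String, Object> nodeLocks = new ConcurrentHashMap<>();
    // responseId of the last response sent to each node; nodes start at 0.
    private final ConcurrentMap<String, Integer> lastResponseIds = new ConcurrentHashMap<>();

    Result nodeHeartbeat(String nodeId, int remoteResponseId) {
        Object lock = nodeLocks.computeIfAbsent(nodeId, k -> new Object());
        // Check and update happen under the same lock, so two heartbeats
        // carrying the same responseId cannot both pass the check.
        synchronized (lock) {
            int last = lastResponseIds.getOrDefault(nodeId, 0);
            if (remoteResponseId + 1 == last) {
                // Same condition as the duplicate check quoted above.
                return Result.DUPLICATE;
            }
            if (remoteResponseId != last) {
                // The node and the RM disagree about the sequence.
                return Result.OUT_OF_SYNC;
            }
            // Normal case: process the heartbeat and advance the responseId.
            lastResponseIds.put(nodeId, last + 1);
            return Result.PROCESSED;
        }
    }
}
```

With the lock held across both the check and the update, two concurrent calls with the same nodeId and responseId yield one PROCESSED and one DUPLICATE. Without it, both could read the same last responseId and both be processed.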

> race condition in ResourceTrackerService#nodeHeartBeat while processing 
> deduplicated msg
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-4673
>                 URL: https://issues.apache.org/jira/browse/YARN-4673
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: sandflee
>            Assignee: sandflee
>         Attachments: YARN-4673.01.patch
>
>
> We could add a lock like the one in ApplicationMasterService#allocate



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
