[ 
https://issues.apache.org/jira/browse/YARN-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249671#comment-16249671
 ] 

Jason Lowe commented on YARN-7102:
----------------------------------

Thanks for updating the patches!  Unfortunately, the branch-2.8 test failure in 
TestResourceTrackerService is related to the patch.  
TestResourceTrackerService#testReconnectedNode hits an NPE because the node was 
temporarily kicked out of the cluster by the new heartbeat response ID 
handling:
{noformat}
2017-11-12 17:27:01,019 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl (RMNodeImpl.java:handle(626)) - Processing host1:1234 of type RECONNECTED
2017-11-12 17:27:01,020 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(175)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeRemovedSchedulerEvent.EventType: NODE_REMOVED
2017-11-12 17:27:01,020 INFO  [main] resourcemanager.ResourceTrackerService (ResourceTrackerService.java:nodeHeartbeat(505)) - Too far behind rm response id:0 nm response id:1
2017-11-12 17:27:01,020 INFO  [AsyncDispatcher event handler] capacity.CapacityScheduler (CapacityScheduler.java:removeNode(1551)) - Removed node host1:1234 clusterResource: <memory:15360, vCores:2>
{noformat}

I believe the issue is that ResourceTrackerService could be updating the 
response ID on an RMNodeImpl that is in the process of being swapped out by the 
AsyncDispatcher as it processes a reconnect event.  If that happens, the last 
response ID update could be lost.  One way to address it is to move at least 
the portion of the reconnect logic that decides whether to use the new node or 
keep the old one into ResourceTrackerService.  Then we no longer have one 
thread swapping out the current object just as another thread is trying to 
update it.
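
To make the suspected interleaving concrete, here is a minimal sketch.  It is 
not the actual YARN code: Node, ReconnectRaceSketch, and both methods are 
hypothetical stand-ins, and the two "threads" are replayed step by step on a 
single thread so the outcome is deterministic.  It assumes the replacement 
node starts from the response ID that was current when it was built.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class ReconnectRaceSketch {
    /** Hypothetical stand-in for RMNodeImpl; it tracks only the last response ID. */
    static final class Node {
        private int lastResponseId;
        Node(int lastResponseId) { this.lastResponseId = lastResponseId; }
        synchronized int getLastResponseId() { return lastResponseId; }
        synchronized void setLastResponseId(int id) { lastResponseId = id; }
    }

    static final String NODE_ID = "host1:1234";

    /** Replays the suspected bad interleaving; returns the ID the RM ends up holding. */
    static int racyInterleaving() {
        ConcurrentMap<String, Node> nodes = new ConcurrentHashMap<>();
        nodes.put(NODE_ID, new Node(0));

        // Heartbeat thread: looks up the node it believes is current.
        Node seenByHeartbeat = nodes.get(NODE_ID);
        // Dispatcher thread: builds the replacement for the reconnect event.
        Node replacement = new Node(seenByHeartbeat.getLastResponseId());
        // Heartbeat thread: records the response ID it just sent to the NM.
        seenByHeartbeat.setLastResponseId(1);
        // Dispatcher thread: publishes the replacement, so the update above is lost.
        nodes.put(NODE_ID, replacement);

        return nodes.get(NODE_ID).getLastResponseId();
    }

    /** Same work, but the swap decision and the update run on one thread. */
    static int singleThreadedSwap() {
        ConcurrentMap<String, Node> nodes = new ConcurrentHashMap<>();
        nodes.put(NODE_ID, new Node(0));

        // Tracker thread: decides to use the new node and publishes it first.
        Node old = nodes.get(NODE_ID);
        nodes.put(NODE_ID, new Node(old.getLastResponseId()));
        // Tracker thread: then updates whichever node is current now.
        nodes.get(NODE_ID).setLastResponseId(1);

        return nodes.get(NODE_ID).getLastResponseId();
    }

    public static void main(String[] args) {
        System.out.println("racy:  RM holds " + racyInterleaving() + ", NM holds 1");
        System.out.println("fixed: RM holds " + singleThreadedSwap() + ", NM holds 1");
    }
}
```

In the racy replay the RM is left holding response ID 0 while the NM holds 1, 
which matches the "Too far behind" message in the log above.  When the swap 
decision and the update happen on the same thread, the update always lands on 
the node that is current.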

> NM heartbeat stuck when responseId overflows MAX_INT
> ----------------------------------------------------
>
>                 Key: YARN-7102
>                 URL: https://issues.apache.org/jira/browse/YARN-7102
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Botong Huang
>            Assignee: Botong Huang
>            Priority: Critical
>         Attachments: YARN-7102-branch-2.8.v10.patch, 
> YARN-7102-branch-2.8.v11.patch, YARN-7102-branch-2.8.v14.patch, 
> YARN-7102-branch-2.8.v14.patch, YARN-7102-branch-2.8.v9.patch, 
> YARN-7102-branch-2.v14.patch, YARN-7102-branch-2.v14.patch, 
> YARN-7102-branch-2.v9.patch, YARN-7102-branch-2.v9.patch, 
> YARN-7102-branch-2.v9.patch, YARN-7102.v1.patch, YARN-7102.v12.patch, 
> YARN-7102.v13.patch, YARN-7102.v14.patch, YARN-7102.v2.patch, 
> YARN-7102.v3.patch, YARN-7102.v4.patch, YARN-7102.v5.patch, 
> YARN-7102.v6.patch, YARN-7102.v7.patch, YARN-7102.v8.patch, YARN-7102.v9.patch
>
>
> ResponseId overflow problem in the NM-RM heartbeat. This is the same as the 
> AM-RM heartbeat problem in YARN-6640; please refer to YARN-6640 for details. 



