[ https://issues.apache.org/jira/browse/YARN-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249671#comment-16249671 ]
Jason Lowe commented on YARN-7102:
----------------------------------
Thanks for updating the patches! Unfortunately the branch-2.8 test failure in
TestResourceTrackerService is related.
TestResourceTrackerService#testReconnectedNode gets an NPE because the node was
temporarily kicked out of the cluster due to the new heartbeat response ID
handling:
{noformat}
2017-11-12 17:27:01,019 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl (RMNodeImpl.java:handle(626)) - Processing host1:1234 of type RECONNECTED
2017-11-12 17:27:01,020 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(175)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeRemovedSchedulerEvent.EventType: NODE_REMOVED
2017-11-12 17:27:01,020 INFO [main] resourcemanager.ResourceTrackerService (ResourceTrackerService.java:nodeHeartbeat(505)) - Too far behind rm response id:0 nm response id:1
2017-11-12 17:27:01,020 INFO [AsyncDispatcher event handler] capacity.CapacityScheduler (CapacityScheduler.java:removeNode(1551)) - Removed node host1:1234 clusterResource: <memory:15360, vCores:2>
{noformat}
I believe the issue is that ResourceTrackerService can update the response ID
on an RMNodeImpl while the AsyncDispatcher is swapping that object out as part
of processing a reconnect event. If that happens, the last response ID update
is lost. One way to address it is to move at least the portion of the reconnect
logic that decides whether to use the new node or keep the old one into
ResourceTrackerService. Then we no longer have one thread swapping out the
current object just as another thread is trying to update it.
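
To illustrate the idea, here is a minimal sketch in which the keep-or-replace
decision and the response ID update go through the same per-node critical
section, so an update can never land on an object that is being swapped out.
This is not the patch: ReconnectSketch, Node, and the "HTTP port changed"
replacement rule are hypothetical stand-ins for ResourceTrackerService,
RMNodeImpl, and the real reconnect checks.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class ReconnectSketch {
  /** Hypothetical stand-in for RMNodeImpl; holds only what the sketch needs. */
  static final class Node {
    final String id;
    final int httpPort; // assumption: a changed port means "use a new node object"
    volatile int lastResponseId;

    Node(String id, int httpPort) {
      this.id = id;
      this.httpPort = httpPort;
      this.lastResponseId = 0;
    }
  }

  private final ConcurrentMap<String, Node> nodes = new ConcurrentHashMap<>();

  void register(String id, int httpPort) {
    nodes.put(id, new Node(id, httpPort));
  }

  /** Heartbeat path: update the response ID on whichever node is current. */
  boolean recordResponseId(String id, int responseId) {
    // computeIfPresent runs atomically per key, so it cannot interleave
    // with the swap performed in reconnect() for the same node.
    return nodes.computeIfPresent(id, (k, cur) -> {
      cur.lastResponseId = responseId;
      return cur;
    }) != null;
  }

  /** Reconnect path: decide keep-vs-replace under the same per-key atomicity. */
  Node reconnect(String id, int httpPort) {
    return nodes.compute(id, (k, cur) -> {
      if (cur != null && cur.httpPort == httpPort) {
        return cur; // keep the old node, including its last response ID
      }
      return new Node(id, httpPort); // replace; the new node starts at ID 0
    });
  }

  int lastResponseId(String id) {
    Node n = nodes.get(id);
    return n == null ? -1 : n.lastResponseId;
  }

  public static void main(String[] args) {
    ReconnectSketch rts = new ReconnectSketch();
    rts.register("host1:1234", 8042);
    rts.recordResponseId("host1:1234", 5);
    rts.reconnect("host1:1234", 8042); // same port: old node is kept
    System.out.println("after reconnect: " + rts.lastResponseId("host1:1234"));
  }
}
```

The point of the sketch is only that both paths are serialized on the same
node entry; whether that is done with a map operation, a lock, or by handling
the decision on the heartbeat thread is a separate design choice.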
> NM heartbeat stuck when responseId overflows MAX_INT
> ----------------------------------------------------
>
> Key: YARN-7102
> URL: https://issues.apache.org/jira/browse/YARN-7102
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Botong Huang
> Assignee: Botong Huang
> Priority: Critical
> Attachments: YARN-7102-branch-2.8.v10.patch,
> YARN-7102-branch-2.8.v11.patch, YARN-7102-branch-2.8.v14.patch,
> YARN-7102-branch-2.8.v14.patch, YARN-7102-branch-2.8.v9.patch,
> YARN-7102-branch-2.v14.patch, YARN-7102-branch-2.v14.patch,
> YARN-7102-branch-2.v9.patch, YARN-7102-branch-2.v9.patch,
> YARN-7102-branch-2.v9.patch, YARN-7102.v1.patch, YARN-7102.v12.patch,
> YARN-7102.v13.patch, YARN-7102.v14.patch, YARN-7102.v2.patch,
> YARN-7102.v3.patch, YARN-7102.v4.patch, YARN-7102.v5.patch,
> YARN-7102.v6.patch, YARN-7102.v7.patch, YARN-7102.v8.patch, YARN-7102.v9.patch
>
>
> ResponseId overflow problem in NM-RM heartbeat. This is the same problem as
> the AM-RM heartbeat overflow in YARN-6640; please refer to YARN-6640 for details.
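
For readers unfamiliar with the overflow, a minimal sketch of the arithmetic
follows. A plain `id + 1` on Integer.MAX_VALUE goes negative, so the sketch
wraps back to 0 by masking off the sign bit. The wrap-to-zero rule, the class
name ResponseIdSketch, and the three-way classification are assumptions made
for illustration; they are not copied from the attached patches.

```java
class ResponseIdSketch {
  enum Status { IN_SYNC, DUPLICATE, OUT_OF_SYNC }

  /** Next response ID, wrapping from Integer.MAX_VALUE to 0 instead of overflowing. */
  static int nextResponseId(int id) {
    // MAX_VALUE + 1 overflows to MIN_VALUE; clearing the sign bit yields 0.
    return (id + 1) & Integer.MAX_VALUE;
  }

  /** Compare the ID the NM sent with the ID of the RM's last response. */
  static Status classify(int nmResponseId, int rmLastResponseId) {
    if (nmResponseId == rmLastResponseId) {
      return Status.IN_SYNC; // NM saw the last response
    }
    if (nextResponseId(nmResponseId) == rmLastResponseId) {
      return Status.DUPLICATE; // NM is one behind: resend the last response
    }
    return Status.OUT_OF_SYNC; // anything else: the two sides disagree
  }

  public static void main(String[] args) {
    System.out.println(nextResponseId(Integer.MAX_VALUE)); // prints 0
    System.out.println(classify(Integer.MAX_VALUE, 0));    // prints DUPLICATE
    System.out.println(classify(1, 0));                    // prints OUT_OF_SYNC
  }
}
```

Under these assumptions, the last call mirrors the "rm response id:0 nm
response id:1" line in the log above: the RM side is back at 0 while the NM
has already moved on to 1.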