[
https://issues.apache.org/jira/browse/YARN-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234235#comment-16234235
]
Jason Lowe commented on YARN-7102:
----------------------------------
bq. it is indeed a race condition between node heartbeat vs node remove and
add. The correct fix is for TestResourceTrackerService.testReconnect to create
MockNM by calling MockRM.registerNode, in which a RM drain is called before
return.
I do not follow the logic here. This looks like a race condition that could
happen outside the unit tests as well, so we need more than a unit test update
to address it. The problem is that both heartbeat processing and node reconnect
processing can modify the response ID. One of them is processed synchronously
and the other isn't, so heartbeats can race ahead of the reconnect. That needs
to be fixed.
One way to address it is to move at least part of the reconnect logic to be
processed synchronously in ResourceTrackerService. At a minimum, it seems we
need to know which RMNodeImpl we're going with so we can get the right response
ID tracked for the next heartbeat from the node. That way, even if the heartbeat
arrives before the reconnect event asynchronously arrives at RMNodeImpl, we have
the proper response ID in place to handle the heartbeat correctly.
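To illustrate the idea, here is a minimal sketch, not the actual
ResourceTrackerService code. The class ResponseIdTracker and its method names
are hypothetical, and duplicate-heartbeat handling is left out. It only shows
the response ID being reset synchronously on reconnect, so a heartbeat that
beats the asynchronous reconnect event is still checked against the right value.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for tracking that would live in the RPC handler. */
class ResponseIdTracker {
  // Last response ID handed to each node, keyed by a node ID string.
  private final Map<String, Integer> lastResponseId = new HashMap<>();

  /**
   * Called synchronously from the register/reconnect RPC, before any
   * asynchronous reconnect event is dispatched to the node state machine.
   */
  synchronized void onRegisterOrReconnect(String nodeId) {
    lastResponseId.put(nodeId, 0);
  }

  /**
   * Called synchronously from the heartbeat RPC.
   * Returns the next response ID, or -1 if the node is out of sync.
   */
  synchronized int onHeartbeat(String nodeId, int idFromNode) {
    Integer expected = lastResponseId.get(nodeId);
    if (expected == null || expected != idFromNode) {
      return -1; // unknown node or mismatched ID: ask the node to resync
    }
    // Mask off the sign bit so the ID wraps to 0 after Integer.MAX_VALUE
    // instead of going negative (an assumption about how overflow is handled).
    int next = (expected + 1) & Integer.MAX_VALUE;
    lastResponseId.put(nodeId, next);
    return next;
  }
}
```

Because both methods update the same map under the same lock, a heartbeat
handled right after a reconnect sees the reset ID regardless of when the
asynchronous event is delivered.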
> NM heartbeat stuck when responseId overflows MAX_INT
> ----------------------------------------------------
>
> Key: YARN-7102
> URL: https://issues.apache.org/jira/browse/YARN-7102
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Botong Huang
> Assignee: Botong Huang
> Priority: Critical
> Attachments: YARN-7102-branch-2.8.v10.patch,
> YARN-7102-branch-2.8.v11.patch, YARN-7102-branch-2.8.v9.patch,
> YARN-7102-branch-2.v9.patch, YARN-7102-branch-2.v9.patch,
> YARN-7102-branch-2.v9.patch, YARN-7102.v1.patch, YARN-7102.v12.patch,
> YARN-7102.v2.patch, YARN-7102.v3.patch, YARN-7102.v4.patch,
> YARN-7102.v5.patch, YARN-7102.v6.patch, YARN-7102.v7.patch,
> YARN-7102.v8.patch, YARN-7102.v9.patch
>
>
> ResponseId overflow problem in NM-RM heartbeat. This is same as AM-RM
> heartbeat in YARN-6640, please refer to YARN-6640 for details.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)