[
https://issues.apache.org/jira/browse/YARN-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16176756#comment-16176756
]
Botong Huang edited comment on YARN-7102 at 9/22/17 5:21 PM:
-------------------------------------------------------------
Note that in the existing code:
{code}
if ((request.getResponseId() + 1) == lastResponse.getResponseId()) {
  /* heartbeat one step old, simply return lastResponse */
  return lastResponse;
} else if (request.getResponseId() + 1 < lastResponse.getResponseId()) {
  // (resync NM...)
}
// (process the heartbeat...)
{code}
The RM does *accept and process* a heartbeat if {{request.getResponseId() >
lastResponse.getResponseId()}}, rather than rejecting it and resyncing the NM.
As I mentioned, after my proposed fix for the overflow (adding an equality
check), the scenario you mentioned can happen, but only when responseId has just
wrapped around: while {{lastResponse.getResponseId()}} is still MAX_INT, the RM
will accept a responseId of 0, but will resync on a responseId of 1.
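
To make the branch behaviour of the existing check concrete, here is a minimal sketch that copies only the comparison structure quoted above onto plain ints. The class name, the enum and the {{classify}} helper are hypothetical and exist only for illustration; this is not the actual RM code and it does not include the proposed fix.

```java
// Hypothetical stand-alone sketch; only the two comparisons come from the
// snippet quoted above, everything else is an assumption for illustration.
final class HeartbeatCheckSketch {

    enum Outcome { RETURN_LAST_RESPONSE, RESYNC, PROCESS }

    // requestId stands in for request.getResponseId(),
    // lastResponseId for lastResponse.getResponseId().
    static Outcome classify(int requestId, int lastResponseId) {
        if (requestId + 1 == lastResponseId) {
            // heartbeat one step old: the last response is returned again
            return Outcome.RETURN_LAST_RESPONSE;
        } else if (requestId + 1 < lastResponseId) {
            // heartbeat more than one step old: NM is asked to resync
            return Outcome.RESYNC;
        }
        // everything else, including requestId > lastResponseId, is processed
        return Outcome.PROCESS;
    }

    public static void main(String[] args) {
        System.out.println(classify(5, 5)); // PROCESS
        System.out.println(classify(4, 5)); // RETURN_LAST_RESPONSE
        System.out.println(classify(2, 5)); // RESYNC
        System.out.println(classify(9, 5)); // PROCESS (request ahead of RM)
        // requestId + 1 overflows to Integer.MIN_VALUE, so an in-sync
        // heartbeat at the maximum id falls into the resync branch.
        System.out.println(classify(Integer.MAX_VALUE, Integer.MAX_VALUE)); // RESYNC
    }
}
```

The fourth call shows the case discussed above (a request id greater than the last response id is processed, not rejected), and the last call shows how int overflow in {{requestId + 1}} sends an otherwise in-sync heartbeat to the resync branch.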
> NM heartbeat stuck when responseId overflows MAX_INT
> ----------------------------------------------------
>
> Key: YARN-7102
> URL: https://issues.apache.org/jira/browse/YARN-7102
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Botong Huang
> Assignee: Botong Huang
> Priority: Critical
> Attachments: YARN-7102.v1.patch, YARN-7102.v2.patch,
> YARN-7102.v3.patch, YARN-7102.v4.patch, YARN-7102.v5.patch, YARN-7102.v6.patch
>
>
> ResponseId overflow problem in the NM-RM heartbeat. This is the same problem
> as in the AM-RM heartbeat (YARN-6640); please refer to YARN-6640 for details.