[
https://issues.apache.org/jira/browse/YARN-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16137475#comment-16137475
]
Wangda Tan commented on YARN-6640:
----------------------------------
Thanks [~botong] for reporting and working on this case. I think this is a
pretty severe issue that we should fix and backport to previous releases.
Marked as a blocker for all > 2.8 releases.
We have two choices. One is to change the type of the response id from int to
long. I personally don't prefer that, because even a long could be exhausted if
an app runs for a hundred years :), and it has compatibility issues as well.
I prefer to reuse int as you did in the patch: if the value equals MAX_INT,
we set it back to 0 and handle the special checking logic. I don't suggest
having a special reserved id, since that makes the code confusing.
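A minimal sketch of the wrap-around idea, assuming response ids stay in the
range [0, MAX_INT]. The class and method names ({{ResponseIdWrap}}, {{next}})
are hypothetical and are not taken from the patch:

```java
// Hypothetical helper for illustration only; not the code from the patch.
final class ResponseIdWrap {
    private ResponseIdWrap() {}

    /** Returns the id that follows {@code id}, wrapping MAX_VALUE back to 0. */
    static int next(int id) {
        return id == Integer.MAX_VALUE ? 0 : id + 1;
    }

    public static void main(String[] args) {
        int last = Integer.MAX_VALUE;
        // Plain int arithmetic silently overflows to Integer.MIN_VALUE.
        System.out.println(last + 1 == Integer.MIN_VALUE); // prints "true"
        // The wrapping helper goes back to 0 instead.
        System.out.println(next(last)); // prints "0"
    }
}
```

With a helper like this, no id value needs to be reserved: every comparison
that currently uses {{responseId + 1}} would go through the wrapping increment.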
Another potential problem: we only fail when request.responseId <
lastResponse.responseId - 1. I think we should also fail when
{{request.responseId > lastResponse.responseId}}.
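One possible shape for the stricter check, as a sketch only. It assumes ids
stay in [0, MAX_INT] and wrap from MAX_INT to 0; {{HeartbeatCheck}},
{{Result}} and {{classify}} are hypothetical names, not existing YARN APIs:

```java
// Hypothetical sketch; the real check lives in ApplicationMasterService.
final class HeartbeatCheck {
    private HeartbeatCheck() {}

    enum Result { PROCESS, DUPLICATE, INVALID }

    /** Wrapping increment: MAX_VALUE is followed by 0. */
    static int next(int id) {
        return id == Integer.MAX_VALUE ? 0 : id + 1;
    }

    static Result classify(int requestId, int lastResponseId) {
        if (requestId == lastResponseId) {
            return Result.PROCESS;      // the expected, current heartbeat
        }
        if (next(requestId) == lastResponseId) {
            return Result.DUPLICATE;    // exactly one step old: resend lastResponse
        }
        // Everything else is rejected: ids two or more steps old,
        // and ids ahead of the last response.
        return Result.INVALID;
    }
}
```

Because only two ids are accepted, both the too-old and the too-new cases fail
without any ordered comparison, which is what keeps the check correct across
the wrap. One caveat of this sketch: a request carrying MAX_INT while the last
id is 0 is always treated as a one-step-old duplicate.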
Thoughts?
+ [~jlowe].
> AM heartbeat stuck when responseId overflows MAX_INT
> -----------------------------------------------------
>
> Key: YARN-6640
> URL: https://issues.apache.org/jira/browse/YARN-6640
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Botong Huang
> Assignee: Botong Huang
> Priority: Blocker
> Fix For: 2.9.0, 3.0.0-beta1, 2.8.2
>
> Attachments: YARN-6640.v1.patch
>
>
> The current code in {{ApplicationMasterService}}:
> if ((request.getResponseId() + 1) == lastResponse.getResponseId()) {
>   /* old heartbeat */
>   return lastResponse;
> } else if (request.getResponseId() + 1 < lastResponse.getResponseId()) {
>   throw ...
> }
> process the heartbeat...
> When a heartbeat comes in, in the usual case we expect
> request.getResponseId() == lastResponse.getResponseId(). The "if" handles a
> duplicate heartbeat that is one step old, and the "else if" throws and
> complains about heartbeats two or more steps old; otherwise we accept the new
> heartbeat and process it.
> So the bug is: when lastResponse.getResponseId() == MAX_INT, the newest
> heartbeat comes in with responseId == MAX_INT. However, responseId + 1
> overflows to MIN_INT, so we fall into the "else if" case and the RM throws.
> Then we are stuck here...