[
https://issues.apache.org/jira/browse/YARN-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe updated YARN-6640:
-----------------------------
Target Version/s: 2.9.0, 3.0.0-beta1, 2.8.2
Fix Version/s: (was: 2.8.2)
(was: 3.0.0-beta1)
(was: 2.9.0)
I agree we don't need to change from int to long here, we can just manage the
wrap. It would be simpler if we didn't have to manage the special case of the
response being -1 to indicate we've never heard from this AM. Otherwise we
would only need to handle to valid cases: current == last or current + 1 ==
last. Anything else would be an error.
Given we need to reserve a value for "not registered" we can use Wangda's idea
to explicitly manage the looping at some convenient positive value rather than
allowing the response ID to ever go negative after it has registered. Then we
leave all the negative values for other "special values" if we ever need them.
It's not like we need all these possible values to know when we're out of sync
given we only expect this value or the previous.
For example, I think we could do something like the following (note: haven't
tested, straight off the top of my head):
{code}
if (((request.getResponseId() + 1) & Integer.MAX_VALUE) ==
lastResponse.getResponseId()) {
/* old heartbeat */
return lastResponse;
} else if (request.getResponseId() != lastResponse.getResponseId()) {
String message =
"Invalid responseId in AllocateRequest from application attempt: "
+ appAttemptId + ", expect responseId to be "
+ (lastResponse.getResponseId() + 1);
throw new InvalidApplicationMasterRequestException(message);
}
[...]
response.setResponseId((lastResponse.getResponseId() + 1) &
Integer.MAX_VALUE);
{code}
There should be a little helper function that takes the current response ID and
returns the next one to make it easier to read.
> AM heartbeat stuck when responseId overflows MAX_INT
> -----------------------------------------------------
>
> Key: YARN-6640
> URL: https://issues.apache.org/jira/browse/YARN-6640
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Botong Huang
> Assignee: Botong Huang
> Priority: Blocker
> Attachments: YARN-6640.v1.patch
>
>
> The current code in {{ApplicationMasterService}}:
> if ((request.getResponseId() + 1) == lastResponse.getResponseId()) {/* old
> heartbeat */ return lastResponse;}
> else if (request.getResponseId() + 1 < lastResponse.getResponseId()) { throw
> ... }
> process the heartbeat...
> When a heartbeat comes in, in usual case we are expecting
> request.getResponseId() == lastResponse.getResponseId(). The “if“ is for the
> duplicate heartbeat that’s one step old, the “else if” is to throw and
> complain for heartbeats more than two steps old, otherwise we accept the new
> heartbeat and process it.
> So the bug is: when lastResponse.getResponseId() == MAX_INT, the newest
> heartbeat comes in with responseId == MAX_INT. However reponseId + 1 will be
> MIN_INT, and we will fall into the “else if” case and RM will throw. Then we
> are stuck here…
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]