[ 
https://issues.apache.org/jira/browse/YARN-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16137475#comment-16137475
 ] 

Wangda Tan commented on YARN-6640:
----------------------------------

Thanks [~botong] for reporting and working on this case. I think this is a 
pretty severe issue which we should fix and backport to previous releases. 
Marked as blocker for all > 2.8 releases.

We have two choices. One is to change the type of the response id from int to 
long. I personally don't prefer that because even a long could be exhausted if 
an app runs for a hundred years :), and it has a compatibility issue as well.

I prefer to keep using int like you did in the patch: if the value equals 
MAX_INT, we set it back to 0 and handle the special checking logic. I don't 
suggest having a special reserved id, since that makes the code confusing.
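
A rough sketch of the wrap-around I have in mind. The class and method names 
here are hypothetical and only for illustration; this is not the code from the 
patch:

```java
public class ResponseIdWrap {
    // Hypothetical helper: returns the id that follows `current`,
    // wrapping from Integer.MAX_VALUE back to 0 instead of overflowing.
    static int nextResponseId(int current) {
        return current == Integer.MAX_VALUE ? 0 : current + 1;
    }

    public static void main(String[] args) {
        System.out.println(nextResponseId(41));
        System.out.println(nextResponseId(Integer.MAX_VALUE));
    }
}
```

With this, ids stay in the range 0..MAX_INT and never become negative, so no 
value needs to be reserved.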

Another potential problem is that we only fail when request.responseId < 
lastResponse.responseId - 1. I think we should also fail when 
{{request.responseId > lastResponse.responseId}}.
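
One possible shape for the stricter check, assuming ids wrap from MAX_INT back 
to 0. The enum, class, and method names are hypothetical; the idea is that only 
the current id and the immediately previous id are accepted, and everything 
else (older or ahead of the last response) fails:

```java
public class HeartbeatCheck {
    // Hypothetical outcomes of validating an incoming heartbeat.
    enum Outcome { PROCESS, RESEND_LAST, REJECT }

    // Assumed wrap-around successor: MAX_INT is followed by 0.
    static int next(int id) {
        return id == Integer.MAX_VALUE ? 0 : id + 1;
    }

    static Outcome classify(int requestId, int lastResponseId) {
        if (requestId == lastResponseId) {
            return Outcome.PROCESS;      // the expected, up-to-date heartbeat
        }
        if (next(requestId) == lastResponseId) {
            return Outcome.RESEND_LAST;  // duplicate heartbeat, one step old
        }
        return Outcome.REJECT;           // too old, or ahead of the last response
    }
}
```

Because the comparison uses equality on the wrapped successor rather than 
{{<}} or {{>}}, it keeps working across the MAX_INT to 0 boundary.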

Thoughts? 

+ [~jlowe]. 

>  AM heartbeat stuck when responseId overflows MAX_INT
> -----------------------------------------------------
>
>                 Key: YARN-6640
>                 URL: https://issues.apache.org/jira/browse/YARN-6640
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Botong Huang
>            Assignee: Botong Huang
>            Priority: Blocker
>             Fix For: 2.9.0, 3.0.0-beta1, 2.8.2
>
>         Attachments: YARN-6640.v1.patch
>
>
> The current code in {{ApplicationMasterService}}: 
> if ((request.getResponseId() + 1) == lastResponse.getResponseId()) {
>   /* old heartbeat */
>   return lastResponse;
> } else if (request.getResponseId() + 1 < lastResponse.getResponseId()) {
>   throw ...
> }
> // process the heartbeat...
> When a heartbeat comes in, in the usual case we expect 
> request.getResponseId() == lastResponse.getResponseId(). The "if" handles a 
> duplicate heartbeat that is one step old, and the "else if" throws for 
> heartbeats that are two or more steps old; otherwise we accept the new 
> heartbeat and process it.
> So the bug is: when lastResponse.getResponseId() == MAX_INT, the newest 
> heartbeat comes in with responseId == MAX_INT. However, responseId + 1 
> overflows to MIN_INT, so we fall into the "else if" case and the RM throws. 
> Then we are stuck here...
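
The overflow in the description above can be reproduced in isolation. This is a 
minimal sketch, assuming a hypothetical helper that mirrors only the quoted 
"else if" comparison; it is not the actual {{ApplicationMasterService}} code:

```java
public class OverflowDemo {
    // Mirrors the quoted "else if" condition: true means the request
    // would be treated as too old and the RM would throw.
    static boolean rejectedAsTooOld(int requestId, int lastResponseId) {
        return requestId + 1 < lastResponseId;
    }

    public static void main(String[] args) {
        // int addition wraps around silently in Java.
        System.out.println(Integer.MAX_VALUE + 1 == Integer.MIN_VALUE);
        // A normal, up-to-date heartbeat is not rejected.
        System.out.println(rejectedAsTooOld(7, 7));
        // At MAX_INT the same up-to-date heartbeat is rejected.
        System.out.println(rejectedAsTooOld(Integer.MAX_VALUE, Integer.MAX_VALUE));
    }
}
```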


