[
https://issues.apache.org/jira/browse/YARN-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16176821#comment-16176821
]
Jason Lowe commented on YARN-7102:
----------------------------------
Ah sorry, so maybe the current code is OK in this scenario as far as throwing
away heartbeats goes, and instead trades that for not always being able to
detect a duplicate heartbeat. That's less severe than a dropped heartbeat but
still potentially problematic.
ResourceTrackerService is synchronously handing the updated response to the
RMNodeImpl, so there is really no reason to wait for the asynchronous message
containing the response to arrive at the RMNodeImpl before the last response
ID is updated properly. As I mentioned above, we should never return a
response for the current heartbeat request until we are ready to receive the
next heartbeat request. I don't understand the appeal of going with the "take
anything greater than" approach, which has corner cases that fail (like
wrap-around, or an NM heartbeating much farther ahead because it really is
out of sync), given we can cover all those cases in a straightforward way
without the caveats.
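
To make the argument above concrete, here is a minimal sketch of the
exact-match check, not the actual ResourceTrackerService or RMNodeImpl code.
HeartbeatSequencer, Outcome, next() and onHeartbeat() are hypothetical names,
and wrapping from MAX_INT back to 0 is an assumed ID scheme. The point it
illustrates is that the last response ID is advanced synchronously, before
the response would be returned, so the next heartbeat can always be
classified exactly, including across the wrap-around.

```java
// Hypothetical sketch; names and the wrap-to-zero scheme are assumptions.
final class HeartbeatSequencer {
    enum Outcome { ACCEPTED, DUPLICATE, OUT_OF_SYNC }

    // ID of the most recent response handed back to the node.
    private int lastResponseId;

    HeartbeatSequencer(int initialResponseId) {
        this.lastResponseId = initialResponseId;
    }

    // Next ID in sequence; Integer.MAX_VALUE wraps to 0 instead of going negative.
    static int next(int id) {
        return (id + 1) & Integer.MAX_VALUE;
    }

    // Classifies a heartbeat carrying the last response ID the node has seen.
    // The stored ID is advanced here, synchronously, so the caller is ready
    // for the next heartbeat before it returns the response for this one.
    synchronized Outcome onHeartbeat(int idFromNode) {
        if (idFromNode == lastResponseId) {
            lastResponseId = next(lastResponseId);
            return Outcome.ACCEPTED;
        }
        if (next(idFromNode) == lastResponseId) {
            // Node is exactly one behind: it resent the previous heartbeat.
            return Outcome.DUPLICATE;
        }
        // Anything else, whether ahead or behind, is treated as out of sync.
        return Outcome.OUT_OF_SYNC;
    }

    synchronized int lastResponseId() {
        return lastResponseId;
    }
}
```

With exact matching, a node that is far ahead is reported as out of sync
rather than silently accepted, and the MAX_INT to 0 transition needs no
special case beyond next().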
> NM heartbeat stuck when responseId overflows MAX_INT
> ----------------------------------------------------
>
> Key: YARN-7102
> URL: https://issues.apache.org/jira/browse/YARN-7102
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Botong Huang
> Assignee: Botong Huang
> Priority: Critical
> Attachments: YARN-7102.v1.patch, YARN-7102.v2.patch,
> YARN-7102.v3.patch, YARN-7102.v4.patch, YARN-7102.v5.patch, YARN-7102.v6.patch
>
>
> ResponseId overflow problem in NM-RM heartbeat. This is same as AM-RM
> heartbeat in YARN-6640, please refer to YARN-6640 for details.