[ https://issues.apache.org/jira/browse/YARN-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173875#comment-16173875 ]
Jason Lowe commented on YARN-7102:
----------------------------------
Sorry for the delay.
bq. After a more strict responseId check in NM heartbeat, we need to drain the
RM dispatcher events after every MockNM heartbeat. Otherwise, two sequential
MockNM heartbeat will fail on the second heartbeat, because RM is still
processing the first heartbeat event.
This worries me. The fact that we have to go update a ton of tests makes me
think that we're susceptible to seeing incorrect behavior in a "real" cluster
when the RM goes into a full GC cycle. If that GC cycle is long enough, then I
could see this change causing every nodemanager in the cluster to go through a
reboot because the RM mistakenly believes their heartbeats are out of sync.
IMHO the response ID needs to be handled inline rather than asynchronously --
we should never return a response for the current heartbeat request until we
are ready to receive the next heartbeat request. It sounds like that's not the
case with this patch. I'm OK if we want to use the RMNode as the place where
we store this bookkeeping information for each node, but I don't think the
response ID handling should be completely asynchronous as it is today,
especially since this JIRA is going to clamp down on the allowed values.
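
To make the inline idea concrete, here is a minimal sketch, not code from the
patch or from the RM. All names (InlineResponseIdSketch, heartbeat, Action,
Result) are hypothetical, the per-node map is only a stand-in for bookkeeping
that could live on the RMNode, and the wrap-around mask is an assumption about
how the overflow gets clamped. The point is just that the stored response ID
is updated before the response is returned, so the next heartbeat can never
race with a pending event.

```java
import java.util.HashMap;
import java.util.Map;

final class InlineResponseIdSketch {
  enum Action { NORMAL, DUPLICATE, RESYNC }

  static final class Result {
    final Action action;
    final int responseId;

    Result(Action action, int responseId) {
      this.action = action;
      this.responseId = responseId;
    }
  }

  // Last response ID handed out per node (hypothetical stand-in for
  // per-node bookkeeping such as what could be kept on the RMNode).
  private final Map<String, Integer> lastResponseIds = new HashMap<>();

  // Assumed wrap-around rule: mask off the sign bit so the ID stays
  // non-negative and rolls over from Integer.MAX_VALUE to 0.
  static int nextResponseId(int id) {
    return (id + 1) & Integer.MAX_VALUE;
  }

  // The check and the update happen under one lock, before the response
  // is returned, so no later heartbeat can observe a stale ID.
  synchronized Result heartbeat(String nodeId, int requestResponseId) {
    // Simplification: an unknown node is treated as registered with ID 0.
    int last = lastResponseIds.getOrDefault(nodeId, 0);
    if (requestResponseId == last) {
      int next = nextResponseId(last);
      lastResponseIds.put(nodeId, next);
      return new Result(Action.NORMAL, next);
    }
    if (nextResponseId(requestResponseId) == last) {
      // The node missed our previous response; replay it, change nothing.
      return new Result(Action.DUPLICATE, last);
    }
    // Anything else is out of sync under the strict check.
    return new Result(Action.RESYNC, last);
  }

  public static void main(String[] args) {
    InlineResponseIdSketch rm = new InlineResponseIdSketch();
    for (int id : new int[] {0, 1, 1, 7}) {
      Result r = rm.heartbeat("nm1", id);
      System.out.println("request " + id + " -> " + r.action
          + ", responseId " + r.responseId);
    }
  }
}
```

With this shape, two back-to-back heartbeats from the same node succeed
without draining any dispatcher in between, because nothing about the ID check
depends on an event that may still be queued.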
> NM heartbeat stuck when responseId overflows MAX_INT
> ----------------------------------------------------
>
> Key: YARN-7102
> URL: https://issues.apache.org/jira/browse/YARN-7102
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Botong Huang
> Assignee: Botong Huang
> Priority: Critical
> Attachments: YARN-7102.v1.patch, YARN-7102.v2.patch,
> YARN-7102.v3.patch, YARN-7102.v4.patch, YARN-7102.v5.patch, YARN-7102.v6.patch
>
>
> ResponseId overflow problem in the NM-RM heartbeat. This is the same as the
> AM-RM heartbeat issue in YARN-6640; please refer to YARN-6640 for details.