[ 
https://issues.apache.org/jira/browse/YARN-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173875#comment-16173875
 ] 

Jason Lowe commented on YARN-7102:
----------------------------------

Sorry for the delay.

bq. After a more strict responseId check in NM heartbeat, we need to drain the 
RM dispatcher events after every MockNM heartbeat. Otherwise, two sequential 
MockNM heartbeats will fail on the second heartbeat, because the RM is still 
processing the first heartbeat event.

This worries me.  The fact that we have to go update a ton of tests makes me 
think that we're susceptible to seeing incorrect behavior in a "real" cluster 
when the RM goes into a full GC cycle.  If that GC cycle is long enough then I 
could see this change causing every nodemanager in the cluster to go through a 
reboot because the RM mistakenly believes their heartbeats are out of sync.
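
The failure mode described above can be illustrated with a small, deterministic 
model. This is only a sketch under assumptions: DeferredResponseIdModel, its 
fields, and its methods are hypothetical names and are not taken from the 
YARN-7102 patch or from the RM code. It assumes a strict equality check against 
a stored response ID that is only updated once a queued event is processed, 
which is how a stalled dispatcher (or a test that does not drain events) could 
make a well-behaved second heartbeat look out of sync.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical model of deferred response-ID bookkeeping; not the actual patch. */
final class DeferredResponseIdModel {
  // The response ID currently recorded for the node (stands in for RMNode state).
  private int storedResponseId = 0;
  // Pending updates (stands in for events waiting in the RM dispatcher queue).
  private final Queue<Integer> pending = new ArrayDeque<>();

  /** Returns true if the heartbeat passes a strict check against the stored ID. */
  boolean heartbeat(int requestId) {
    if (requestId != storedResponseId) {
      return false; // the node would be treated as out of sync
    }
    // The next ID is queued here and only recorded when drain() runs.
    pending.add((requestId + 1) & Integer.MAX_VALUE);
    return true;
  }

  /** Applies queued updates, as the dispatcher would once it gets to run. */
  void drain() {
    Integer id;
    while ((id = pending.poll()) != null) {
      storedResponseId = id;
    }
  }
}
```

In this model, heartbeat(0) followed immediately by heartbeat(1) rejects the 
second call, while calling drain() between the two lets it pass.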

IMHO the response ID needs to be handled inline rather than asynchronously -- 
we should never return a response for the current heartbeat request until we 
are ready to receive the next heartbeat request.  It sounds like that's not the 
case with this patch.  I'm OK if we want to use the RMNode as the place where 
we store this bookkeeping information for each node, but I don't think the 
response ID handling should be completely asynchronous as it is today, 
especially since this JIRA is going to clamp down on the allowed values.
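
A minimal sketch of what inline handling could look like follows. It is 
illustrative only: InlineResponseIdTracker, Result, register, and heartbeat are 
hypothetical names, not the ResourceTrackerService or RMNode API. It assumes 
the node echoes back the last response ID it received and that IDs wrap to 0 
after Integer.MAX_VALUE. The point is that the stored ID is advanced before the 
method returns, so the next heartbeat can be validated immediately.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of synchronous response-ID bookkeeping per node. */
final class InlineResponseIdTracker {
  /** Outcome of validating a heartbeat's response ID. */
  enum Result { ACCEPTED, DUPLICATE, OUT_OF_SYNC }

  // Last response ID handed back to each node, keyed by a node name.
  private final Map<String, Integer> lastResponseId = new HashMap<>();

  /** Wraps to 0 after Integer.MAX_VALUE instead of overflowing to a negative. */
  static int nextResponseId(int id) {
    return (id + 1) & Integer.MAX_VALUE;
  }

  /** Starts tracking a node with an initial response ID of 0. */
  synchronized void register(String node) {
    lastResponseId.put(node, 0);
  }

  /** Validates the request and advances the stored ID before returning. */
  synchronized Result heartbeat(String node, int requestId) {
    Integer last = lastResponseId.get(node);
    if (last == null) {
      return Result.OUT_OF_SYNC; // unknown node
    }
    if (requestId == last) {
      // Update inline, so the next heartbeat sees the new ID right away.
      lastResponseId.put(node, nextResponseId(last));
      return Result.ACCEPTED;
    }
    if (nextResponseId(requestId) == last) {
      return Result.DUPLICATE; // node resent its previous heartbeat
    }
    return Result.OUT_OF_SYNC;
  }
}
```

With this shape, two back-to-back heartbeats from the same node both succeed 
without any event draining in between, because nothing about the ID check 
depends on asynchronous processing.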


> NM heartbeat stuck when responseId overflows MAX_INT
> ----------------------------------------------------
>
>                 Key: YARN-7102
>                 URL: https://issues.apache.org/jira/browse/YARN-7102
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Botong Huang
>            Assignee: Botong Huang
>            Priority: Critical
>         Attachments: YARN-7102.v1.patch, YARN-7102.v2.patch, 
> YARN-7102.v3.patch, YARN-7102.v4.patch, YARN-7102.v5.patch, YARN-7102.v6.patch
>
>
> ResponseId overflow problem in the NM-RM heartbeat. This is the same as the 
> AM-RM heartbeat problem in YARN-6640; please refer to YARN-6640 for details. 



