Botong Huang created YARN-8451:
----------------------------------
Summary: Multiple NM heartbeat thread created when a slow NM
resync with RM
Key: YARN-8451
URL: https://issues.apache.org/jira/browse/YARN-8451
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
During an NM resync with the RM (say, after the RM did a master/slave switch),
if the NM is running slowly, the existing heartbeat thread may put more than
one RESYNC event into the NM dispatcher before any of them is processed. As a
result, multiple new heartbeat threads are later created and start
heartbeating to the RM concurrently, each with its own responseId. If at some
point one thread falls more than one step behind the others, the RM sends back
a resync signal in that thread's heartbeat response, killing all containers on
this NM.
See comments below for details on how this can happen.
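
To make the failure mode concrete, here is a minimal, deterministic sketch of
the responseId bookkeeping described above. It is a toy model, not the actual
NM or RM code: the class and method names (HeartbeatRaceSketch, ToyRm,
ToyHeartbeatThread, run) are made up for illustration, and the RM rule used
here (same id = normal, one step behind = duplicate, further behind = resync)
is an assumption derived from the "more than one step behind" wording in this
description. Real threads are replaced by an explicit schedule string so the
interleaving is reproducible.

```java
import java.util.ArrayList;
import java.util.List;

public class HeartbeatRaceSketch {
    public enum Action { NORMAL, DUPLICATE, RESYNC }

    /** Hypothetical toy RM: remembers the last response id it issued to one NM. */
    static final class ToyRm {
        private int lastResponseId = 0;

        Action heartbeat(int nmResponseId) {
            if (nmResponseId == lastResponseId) {
                lastResponseId++;             // in step: issue the next response id
                return Action.NORMAL;
            }
            if (nmResponseId + 1 == lastResponseId) {
                return Action.DUPLICATE;      // one step behind: treated as a retransmit
            }
            return Action.RESYNC;             // more than one step behind (assumed rule)
        }

        int lastResponseId() {
            return lastResponseId;
        }
    }

    /** Hypothetical toy heartbeat thread: each instance keeps its own responseId. */
    static final class ToyHeartbeatThread {
        private int responseId = 0;

        Action beat(ToyRm rm) {
            Action action = rm.heartbeat(responseId);
            // Adopt the id carried by the response. A real NM would re-register
            // on RESYNC; that step is deliberately not modelled here.
            responseId = rm.lastResponseId();
            return action;
        }
    }

    /** Runs heartbeats in the given order: 'A' = first thread, anything else = second. */
    public static List<Action> run(String schedule) {
        ToyRm rm = new ToyRm();
        ToyHeartbeatThread a = new ToyHeartbeatThread();
        ToyHeartbeatThread b = new ToyHeartbeatThread();
        List<Action> out = new ArrayList<>();
        for (char c : schedule.toCharArray()) {
            out.add((c == 'A' ? a : b).beat(rm));
        }
        return out;
    }

    public static void main(String[] args) {
        // One heartbeat thread: every heartbeat is in step.
        System.out.println(run("AAAA"));
        // Two threads: B stalls while A advances twice, so B's next heartbeat
        // is two steps behind and draws a RESYNC.
        System.out.println(run("ABAAB"));
    }
}
```

With a single thread every heartbeat stays in step. With two threads, the
second schedule lets thread B fall two ids behind thread A, which is the
condition that triggers the resync in this model.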