Botong Huang created YARN-8451:
----------------------------------

             Summary: Multiple NM heartbeat thread created when a slow NM resync with RM
                 Key: YARN-8451
                 URL: https://issues.apache.org/jira/browse/YARN-8451
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Botong Huang
            Assignee: Botong Huang


During an NM resync with the RM (say, after the RM did a master-slave switch), if the
NM is running slowly, the existing heartbeat thread may put more than one RESYNC
event into the NM dispatcher before any of them is processed. As a result,
multiple new heartbeat threads are later created, and they heartbeat to the RM
concurrently, each tracking its own responseId. If at some point one thread falls
more than one step behind the others, the RM sends back a resync signal in that
heartbeat response, which kills all containers on this NM.
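To make the failure concrete, here is a minimal, deterministic Java sketch that replays one interleaving of two heartbeat threads against a simplified RM. All class and method names (`SimulatedRm`, `HeartbeatThread`, `runScenario`) are hypothetical, not Hadoop APIs, and the RM rule is an assumption distilled from the description above: a responseId equal to the RM's last one is accepted, one that is exactly one step behind is treated as a duplicate, and anything else triggers a resync.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical, simplified model of the race described in YARN-8451.
 * None of these classes are Hadoop APIs; they only mimic responseId bookkeeping.
 */
public class HeartbeatRaceSketch {

    enum Action { NORMAL, DUPLICATE, RESYNC }

    /** Stand-in for the RM's per-node state: the last responseId it sent. */
    static final class SimulatedRm {
        private int lastResponseId = 0;

        Action heartbeat(int nmResponseId) {
            if (nmResponseId == lastResponseId) {
                lastResponseId++;          // accept and hand out the next id
                return Action.NORMAL;
            }
            if (nmResponseId + 1 == lastResponseId) {
                return Action.DUPLICATE;   // one step behind: resend the last response
            }
            return Action.RESYNC;          // assumed rule: anything else forces a resync
        }

        int lastResponseId() {
            return lastResponseId;
        }
    }

    /** One NM heartbeat thread; each instance keeps its own responseId. */
    static final class HeartbeatThread {
        private int responseId = 0;

        Action beat(SimulatedRm rm) {
            Action action = rm.heartbeat(responseId);
            if (action != Action.RESYNC) {
                responseId = rm.lastResponseId(); // adopt the id from the response
            }
            return action;
        }
    }

    /** Deterministic interleaving in which thread B falls two steps behind A. */
    static List<Action> runScenario() {
        SimulatedRm rm = new SimulatedRm();
        HeartbeatThread a = new HeartbeatThread();
        HeartbeatThread b = new HeartbeatThread();
        List<Action> actions = new ArrayList<>();
        actions.add(a.beat(rm)); // A sends 0: NORMAL, RM now at 1
        actions.add(b.beat(rm)); // B sends 0: DUPLICATE, B catches up to 1
        actions.add(a.beat(rm)); // A sends 1: NORMAL, RM now at 2
        actions.add(a.beat(rm)); // A sends 2: NORMAL, RM now at 3
        actions.add(b.beat(rm)); // B sends 1: two steps behind, RESYNC
        return actions;
    }

    public static void main(String[] args) {
        System.out.println(runScenario());
    }
}
```

Running `main` prints `[NORMAL, DUPLICATE, NORMAL, NORMAL, RESYNC]`: thread B survives while it is at most one step behind, but once A heartbeats twice in a row, B's next heartbeat draws a resync, which in the scenario above is what kills every container on the NM.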

See comments below for details on how this can happen. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
