[
https://issues.apache.org/jira/browse/YARN-8451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520727#comment-16520727
]
Botong Huang commented on YARN-8451:
------------------------------------
Here’s an example where more than one heartbeat thread is created:
1. A YARN RM failover (master/slave switch) happens. When the new RM comes up,
it tells the NM to resync (without killing its containers) on the NM's first
heartbeat.
2. Every time the NM heartbeats to the RM and gets a resync signal, it
dispatches a NodeManagerEventType.RESYNC event and moves on.
3. NodeManager.resyncWithRM() is the handler that processes this event.
4. When the NM dispatcher is running slowly, by the time the first event is
processed, the NM heartbeat thread has already heartbeated several more times
and put more NodeManagerEventType.RESYNC events into the dispatcher event queue.
5. Multiple threads are created inside NodeManager.resyncWithRM(), and all of
them block at statusUpdater.join() inside
NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM().
6. When the previous heartbeat thread exits, every blocked thread gets released
and creates a new heartbeat thread.
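
Steps 5 and 6 can be sketched as a small standalone Java simulation. This is
not the NodeManager code: the class ResyncRaceSketch, the method simulate(),
and the counter are hypothetical names for illustration, and "starting a new
heartbeat thread" is modeled as incrementing a counter. It only shows that N
threads joining the same old heartbeat thread each go on to create their own
replacement once it exits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the race, not actual Hadoop code.
class ResyncRaceSketch {

    /** Returns how many "new heartbeat threads" result from the queued RESYNC events. */
    static int simulate(int queuedResyncEvents) {
        CountDownLatch oldHeartbeatMayExit = new CountDownLatch(1);
        AtomicInteger newHeartbeatThreads = new AtomicInteger();

        // Stand-in for the existing heartbeat thread (statusUpdater).
        Thread statusUpdater = new Thread(() -> {
            try {
                oldHeartbeatMayExit.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        statusUpdater.start();

        // Step 5: one resync thread per queued RESYNC event, all joining the old thread.
        List<Thread> resyncThreads = new ArrayList<>();
        for (int i = 0; i < queuedResyncEvents; i++) {
            Thread t = new Thread(() -> {
                try {
                    statusUpdater.join();
                    // Step 6: each released thread "starts" its own heartbeat thread.
                    newHeartbeatThreads.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            resyncThreads.add(t);
        }

        // The previous heartbeat thread exits, releasing every blocked thread.
        oldHeartbeatMayExit.countDown();
        try {
            for (Thread t : resyncThreads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return newHeartbeatThreads.get();
    }

    public static void main(String[] args) {
        System.out.println(simulate(3)); // prints 3: one new heartbeat thread per RESYNC event
    }
}
```

In this model the only way to end up with a single replacement thread is to
make sure at most one resync is in flight at a time, for example by ignoring
RESYNC events while one is already being handled. That is one possible
approach under the assumptions of this sketch, not necessarily the fix
adopted for this issue.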
> Multiple NM heartbeat thread created when a slow NM resync with RM
> ------------------------------------------------------------------
>
> Key: YARN-8451
> URL: https://issues.apache.org/jira/browse/YARN-8451
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Botong Huang
> Assignee: Botong Huang
> Priority: Major
>
> During an NM resync with the RM (say the RM did a master/slave switch), if
> the NM is running slowly, more than one RESYNC event may be put into the NM
> dispatcher by the existing heartbeat thread before they are processed. As a
> result, multiple new heartbeat threads are later created and start to
> heartbeat to the RM concurrently, each with its own responseId. If at some
> point one thread falls more than one step behind the others, the RM sends
> back a resync signal in that heartbeat response, killing all containers on
> this NM.
> See comments below for details on how this can happen.
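
The responseId failure mode described above can be illustrated with a
simplified model. This is not the RM code: the class ResponseIdSketch, the
Action enum, and the demo() method are hypothetical. The treatment of a
heartbeat that is exactly one step behind (the last response is resent) is an
assumption of this sketch. The source only states that falling more than one
step behind triggers a resync.

```java
// Hypothetical model of RM-side responseId checking, not actual Hadoop code.
class ResponseIdSketch {

    enum Action { NORMAL, RESEND_LAST_RESPONSE, RESYNC }

    // The responseId the RM expects in the next heartbeat from this NM.
    private int lastResponseId = 0;

    /** Processes one heartbeat carrying the sender's current responseId. */
    Action heartbeat(int nmResponseId) {
        if (nmResponseId == lastResponseId) {
            lastResponseId++;                    // in step: advance normally
            return Action.NORMAL;
        }
        if (nmResponseId + 1 == lastResponseId) {
            return Action.RESEND_LAST_RESPONSE;  // one step behind (assumed handling)
        }
        return Action.RESYNC;                    // more than one step behind
    }

    /** Two heartbeat threads, A and B, each tracking its own responseId. */
    static Action demo() {
        ResponseIdSketch rm = new ResponseIdSketch();
        int a = 0;
        int b = 0;

        rm.heartbeat(a); a++;   // A: NORMAL, RM now expects 1
        rm.heartbeat(b); b++;   // B: one behind, gets the last response, so B is at 1
        rm.heartbeat(a); a++;   // A: NORMAL, RM now expects 2
        rm.heartbeat(a); a++;   // A: NORMAL, RM now expects 3
        return rm.heartbeat(b); // B sends 1 while the RM expects 3
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints RESYNC
    }
}
```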
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)