[jira] [Commented] (YARN-6102) On failover RM can crash due to unregistered event to AsyncDispatcher

Sunil G (JIRA) Tue, 17 Jan 2017 02:30:53 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-6102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15825802#comment-15825802
 ]


Sunil G commented on YARN-6102:
-------------------------------

In resetDispatcher,
{code}
(1)    removeService((Service)rmDispatcher);
    // Need to stop previous rmDispatcher before assigning new dispatcher
    // otherwise causes "AsyncDispatcher event handler" thread leak
(2)    ((Service) rmDispatcher).stop();
    rmDispatcher = dispatcher;
    addIfService(rmDispatcher);
(3)  rmContext.setDispatcher(rmDispatcher);
{code}

RPC is already stopped. At (2), dispatcher is also stopped. Altogether, there 
wont be any more new events cached, and all events would have been drained. 
Later at (3), we set a new dispatcher. At this time, ideally there should not 
be any events sent to old dispatcher. Could you attach more logs etc.

> On failover RM can crash due to unregistered event to AsyncDispatcher
> ---------------------------------------------------------------------
>
>                 Key: YARN-6102
>                 URL: https://issues.apache.org/jira/browse/YARN-6102
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.8.0, 2.7.3
>            Reporter: Ajith S
>            Assignee: Ajith S
>            Priority: Critical
>
> {code}2017-01-17 16:42:17,911 FATAL [AsyncDispatcher event handler] 
> event.AsyncDispatcher (AsyncDispatcher.java:dispatch(200)) - Error in 
> dispatcher thread
> java.lang.Exception: No handler for registered for class 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:196)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:120)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-01-17 16:42:17,914 INFO  [AsyncDispatcher ShutDown handler] 
> event.AsyncDispatcher (AsyncDispatcher.java:run(303)) - Exiting, bbye..{code}
> The same stack i was also noticed in {{TestResourceTrackerOnHA}} exits 
> abnormally, after some analysis, i was able to reproduce.
> Once the nodeHeartBeat is sent to RM, inside 
> {{org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.nodeHeartbeat(NodeHeartbeatRequest)}},
>  before sending it to dispatcher through
> {{this.rmContext.getDispatcher().getEventHandler().handle(nodeStatusEvent);}} 
> if RM failover is called, the dispatcher is reset
> The new dispatcher is however first started and then the events are 
> registered at 
> {{org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.reinitialize(boolean)}}
> So event order will look like
> 1. Send Node heartbeat to {{ResourceTrackerService}}
> 2. In {{ResourceTrackerService.nodeHeartbeat}}, before passing to dispatcher 
> call RM failover
> 3. In RM Failover, current active will reset dispatcher @reinitialize i.e ( 
> {{resetDispatcher();}} + {{createAndInitActiveServices();}} )
> Now between {{resetDispatcher();}} and {{createAndInitActiveServices();}} , 
> the {{ResourceTrackerService.nodeHeartbeat}} invokes dipatcher
> This will cause the above error as at point of time when {{STATUS_UPDATE}} 
> event is given to dispatcher in {{ResourceTrackerService}} , the new 
> dispatcher(from the failover) may be started but not yet registered for events
> Using same steps(with pausing JVM at debug), i was able to reproduce this in 
> production cluster also. for {{STATUS_UPDATE}} active service event, when the 
> service is yet to forward the event to RM dispatcher but a failover is called 
> and dispatcher reset is between {{resetDispatcher();}} & 
> {{createAndInitActiveServices();}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (YARN-6102) On failover RM can crash due to unregistered event to AsyncDispatcher

Reply via email to