Ajith S created YARN-6102:
-----------------------------

             Summary: On failover RM can crash due to unregistered event to 
AsyncDispatcher
                 Key: YARN-6102
                 URL: https://issues.apache.org/jira/browse/YARN-6102
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Ajith S
            Assignee: Ajith S
            Priority: Critical


{code}2017-01-17 16:42:17,911 FATAL [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(200)) - Error in 
dispatcher thread
java.lang.Exception: No handler for registered for class 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType
        at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:196)
        at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:120)
        at java.lang.Thread.run(Thread.java:745)
2017-01-17 16:42:17,914 INFO  [AsyncDispatcher ShutDown handler] 
event.AsyncDispatcher (AsyncDispatcher.java:run(303)) - Exiting, bbye..{code}

The same stack trace was also seen when {{TestResourceTrackerOnHA}} exits 
abnormally. After some analysis, I was able to reproduce it.

After a node heartbeat reaches the RM, inside 
{{org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.nodeHeartbeat(NodeHeartbeatRequest)}},
 and before the event is handed to the dispatcher through 
{{this.rmContext.getDispatcher().getEventHandler().handle(nodeStatusEvent);}}, 
an RM failover can be triggered, which resets the dispatcher.
The new dispatcher, however, is started first, and the event handlers are 
registered only afterwards, in 
{{org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.reinitialize(boolean)}}

So the sequence of events looks like this:
1. A node heartbeat is sent to {{ResourceTrackerService}}.
2. In {{ResourceTrackerService.nodeHeartbeat}}, before the event is passed to 
the dispatcher, an RM failover is triggered.
3. During the failover, the current active RM resets the dispatcher in 
{{reinitialize}}, i.e. ( {{resetDispatcher();}} + 
{{createAndInitActiveServices();}} )

Now, between {{resetDispatcher();}} and {{createAndInitActiveServices();}}, 
{{ResourceTrackerService.nodeHeartbeat}} invokes the dispatcher.

This causes the above error: at the point when the {{STATUS_UPDATE}} event is 
given to the dispatcher in {{ResourceTrackerService}}, the new dispatcher (from 
the failover) may be started but not yet have its event handlers registered.
Using the same steps (pausing the JVM in a debugger), I was able to reproduce 
this in a production cluster as well: for the {{STATUS_UPDATE}} active service 
event, the service has yet to forward the event to the RM dispatcher when a 
failover is called, and the heartbeat thread then runs between 
{{resetDispatcher();}} and {{createAndInitActiveServices();}}
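
The interleaving described above can be modelled without Hadoop. The following is only a minimal sketch, not the actual {{AsyncDispatcher}} or {{ResourceManager}} code: {{ToyDispatcher}}, {{ToyContext}}, {{ToyNodeEventType}} and {{heartbeatSurvives}} are hypothetical names invented for illustration, and the threads are replaced by a fixed, deterministic ordering of the steps. It shows that an event dispatched after the new dispatcher is published but before its handlers are registered fails, while registering the handlers before publishing would avoid the window. That second ordering is an assumption about one possible direction for a fix, not a description of the eventual patch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Toy model of the race; none of these types are Hadoop classes.
class DispatcherRaceSketch {

    // Hypothetical stand-in for RMNodeEventType.
    enum ToyNodeEventType { STATUS_UPDATE }

    // Hypothetical dispatcher: looks up a handler by the event's enum class
    // and fails if none has been registered yet.
    static class ToyDispatcher {
        private final Map<Class<?>, Consumer<Enum<?>>> handlers =
            new ConcurrentHashMap<>();

        void register(Class<?> eventTypeClass, Consumer<Enum<?>> handler) {
            handlers.put(eventTypeClass, handler);
        }

        void dispatch(Enum<?> event) {
            Consumer<Enum<?>> handler = handlers.get(event.getDeclaringClass());
            if (handler == null) {
                throw new IllegalStateException(
                    "No handler registered for "
                        + event.getDeclaringClass().getName());
            }
            handler.accept(event);
        }
    }

    // Hypothetical stand-in for the shared context that the heartbeat path
    // reads the current dispatcher from.
    static class ToyContext {
        volatile ToyDispatcher dispatcher;
    }

    // Replays the interleaving in a fixed order and reports whether the
    // heartbeat's dispatch succeeded.
    static boolean heartbeatSurvives(boolean registerBeforePublish) {
        ToyContext ctx = new ToyContext();
        ToyDispatcher fresh = new ToyDispatcher();

        if (registerBeforePublish) {
            // Assumed safer ordering: handlers first, then publish.
            fresh.register(ToyNodeEventType.class, e -> { });
            ctx.dispatcher = fresh;
        } else {
            // Ordering from the report: the reset publishes the new
            // dispatcher; registration only happens later.
            ctx.dispatcher = fresh;
        }

        boolean delivered;
        try {
            // The heartbeat path runs here, inside the window.
            ctx.dispatcher.dispatch(ToyNodeEventType.STATUS_UPDATE);
            delivered = true;
        } catch (IllegalStateException e) {
            delivered = false;
        }

        if (!registerBeforePublish) {
            // Registration arrives too late for the event above.
            fresh.register(ToyNodeEventType.class, e -> { });
        }
        return delivered;
    }

    public static void main(String[] args) {
        System.out.println("publish then register: "
            + heartbeatSurvives(false));
        System.out.println("register then publish: "
            + heartbeatSurvives(true));
    }
}
```

In the real RM the failure is fatal rather than a caught exception, since the dispatcher thread logs the error and shuts the process down, as the log above shows.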



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
