[
https://issues.apache.org/jira/browse/YARN-6102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Subru Krishnan reopened YARN-6102:
----------------------------------
Reopening as [~rohithsharma] has another addendum patch due to ATSv2 merge to
branch-2.
> RMActiveService context to be updated with new RMContext on failover
> --------------------------------------------------------------------
>
> Key: YARN-6102
> URL: https://issues.apache.org/jira/browse/YARN-6102
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.8.0, 2.7.3
> Reporter: Ajith S
> Assignee: Rohith Sharma K S
> Priority: Critical
> Fix For: 2.9.0, 3.0.0-beta1
>
> Attachments: YARN-6102-YARN-5355-branch-2.addendum.patch,
> YARN-6102-branch-2.001.patch, YARN-6102-branch-2.002-addednum.patch,
> YARN-6102-branch-2.002.patch, YARN-6102.01.patch, YARN-6102.02.patch,
> YARN-6102.03.patch, YARN-6102.04.patch, YARN-6102.05.patch,
> YARN-6102.06.patch, YARN-6102.07.patch, eventOrder.JPG
>
>
> {code}2017-01-17 16:42:17,911 FATAL [AsyncDispatcher event handler]
> event.AsyncDispatcher (AsyncDispatcher.java:dispatch(200)) - Error in
> dispatcher thread
> java.lang.Exception: No handler for registered for class
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:196)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:120)
> at java.lang.Thread.run(Thread.java:745)
> 2017-01-17 16:42:17,914 INFO [AsyncDispatcher ShutDown handler]
> event.AsyncDispatcher (AsyncDispatcher.java:run(303)) - Exiting, bbye..{code}
> The same stack trace was also seen when {{TestResourceTrackerOnHA}} exits
> abnormally. After some analysis, I was able to reproduce it.
> Once the node heartbeat is sent to the RM, inside
> {{org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.nodeHeartbeat(NodeHeartbeatRequest)}},
> if an RM failover is triggered before the event is passed to the dispatcher through
> {{this.rmContext.getDispatcher().getEventHandler().handle(nodeStatusEvent);}},
> the dispatcher is reset.
> However, the new dispatcher is started first, and only then are the event
> handlers registered, in
> {{org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.reinitialize(boolean)}}.
> So the event order looks like this:
> 1. Send the node heartbeat to {{ResourceTrackerService}}.
> 2. In {{ResourceTrackerService.nodeHeartbeat}}, before the event is passed to
> the dispatcher, RM failover is triggered.
> 3. During RM failover, the current active RM resets the dispatcher in
> {{reinitialize}}, i.e. ({{resetDispatcher();}} + {{createAndInitActiveServices();}}).
> Now, between {{resetDispatcher();}} and {{createAndInitActiveServices();}},
> {{ResourceTrackerService.nodeHeartbeat}} invokes the dispatcher.
> This causes the above error because, at the point in time when the
> {{STATUS_UPDATE}} event is given to the dispatcher in {{ResourceTrackerService}},
> the new dispatcher (from the failover) may be started but not yet registered
> for events.
> Using the same steps (pausing the JVM in a debugger), I was able to reproduce
> this in a production cluster as well: for the {{STATUS_UPDATE}} active service
> event, the service has yet to forward the event to the RM dispatcher, but a
> failover is triggered and is between {{resetDispatcher();}} and
> {{createAndInitActiveServices();}}.
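
The ordering described in the report can be illustrated with a small, deterministic sketch. This is not Hadoop code: {{ToyDispatcher}}, {{ToyNodeEventType}} and {{simulate}} are hypothetical names made up for illustration, and the real {{AsyncDispatcher}} is asynchronous and considerably more involved. The sketch only shows why an event that arrives after a fresh dispatcher is started, but before its handlers are registered, fails with a "no handler" error.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

public class DispatcherRaceSketch {

  /** Hypothetical stand-in for the RM node event type (not the real RMNodeEventType). */
  enum ToyNodeEventType { STATUS_UPDATE }

  /** Hypothetical, synchronous stand-in for a dispatcher; not the real AsyncDispatcher. */
  static class ToyDispatcher {
    private final Map<Class<?>, Consumer<Enum<?>>> handlers = new HashMap<>();
    private boolean started;

    void start() {
      started = true;
    }

    void register(Class<? extends Enum<?>> eventType, Consumer<Enum<?>> handler) {
      handlers.put(eventType, handler);
    }

    void dispatch(Enum<?> event) {
      if (!started) {
        throw new IllegalStateException("Dispatcher not started");
      }
      Consumer<Enum<?>> handler = handlers.get(event.getDeclaringClass());
      if (handler == null) {
        throw new IllegalStateException(
            "No handler registered for " + event.getDeclaringClass().getSimpleName());
      }
      handler.accept(event);
    }
  }

  /**
   * Simulates one heartbeat hitting a freshly reset dispatcher.
   * The flag decides whether handler registration (the analogue of
   * createAndInitActiveServices) happens before or after the heartbeat's dispatch.
   */
  static String simulate(boolean handlersRegisteredBeforeHeartbeat) {
    StringBuilder log = new StringBuilder();

    // Analogue of resetDispatcher(): a new dispatcher that is already started.
    ToyDispatcher fresh = new ToyDispatcher();
    fresh.start();

    Runnable registerHandlers =
        () -> fresh.register(ToyNodeEventType.class, e -> log.append("handled ").append(e));

    if (handlersRegisteredBeforeHeartbeat) {
      registerHandlers.run();
    }

    try {
      // Analogue of the heartbeat path handing STATUS_UPDATE to the dispatcher.
      fresh.dispatch(ToyNodeEventType.STATUS_UPDATE);
    } catch (IllegalStateException ex) {
      log.append(ex.getMessage());
    }

    if (!handlersRegisteredBeforeHeartbeat) {
      registerHandlers.run(); // registration arrives too late for the event above
    }
    return log.toString();
  }

  public static void main(String[] args) {
    System.out.println(simulate(false));
    System.out.println(simulate(true));
  }
}
```

In this sketch the failing case corresponds to the window between {{resetDispatcher();}} and {{createAndInitActiveServices();}}; when registration happens first, the same event is handled normally.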