[ 
https://issues.apache.org/jira/browse/YARN-3222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326965#comment-14326965
 ] 

Rohith commented on YARN-3222:
------------------------------

Attaching logs that give more information about the issue. In the log below, 
the RM shuts down with an NPE while updating the node resource. Note the 
scheduler events dispatched from AsyncDispatcher in 
*org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.\**. The 
order is NODE_REMOVED --> NODE_RESOURCE_UPDATE --> NODE_ADDED --> 
NODE_LABELS_UPDATE:
{noformat}
2015-02-19 09:14:57,212 INFO  [main] util.RackResolver 
(RackResolver.java:coreResolve(109)) - Resolved 127.0.0.1 to /default-rack
2015-02-19 09:14:57,213 INFO  [main] resourcemanager.ResourceTrackerService 
(ResourceTrackerService.java:registerNodeManager(313)) - Reconnect from the 
node at: 127.0.0.1
2015-02-19 09:14:57,215 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeReconnectEvent.EventType:
 RECONNECTED
2015-02-19 09:14:57,215 INFO  [main] resourcemanager.ResourceTrackerService 
(ResourceTrackerService.java:registerNodeManager(343)) - NodeManager from node 
127.0.0.1(cmPort: 1234 httpPort: 3) registered with capability: <memory:16384, 
vCores:16>, assigned nodeId 127.0.0.1:1234
2015-02-19 09:14:57,215 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl 
(RMNodeImpl.java:handle(412)) - Processing 127.0.0.1:1234 of type RECONNECTED
2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeRemovedSchedulerEvent.EventType:
 NODE_REMOVED
2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeStartedEvent.EventType:
 STARTED
2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl 
(RMNodeImpl.java:handle(412)) - Processing 127.0.0.1:1234 of type STARTED
2015-02-19 09:14:57,266 INFO  [AsyncDispatcher event handler] rmnode.RMNodeImpl 
(RMNodeImpl.java:handle(424)) - 127.0.0.1:1234 Node Transitioned from NEW to 
RUNNING
2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.NodesListManagerEvent.EventType: 
NODE_USABLE
2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeResourceUpdateSchedulerEvent.EventType:
 NODE_RESOURCE_UPDATE
2015-02-19 09:14:57,267 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeAddedSchedulerEvent.EventType:
 NODE_ADDED
2015-02-19 09:14:57,267 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.NodesListManagerEvent.EventType: 
NODE_USABLE
2015-02-19 09:14:57,267 DEBUG [AsyncDispatcher event handler] 
event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the 
event 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeLabelsUpdateSchedulerEvent.EventType:
 NODE_LABELS_UPDATE
2015-02-19 09:14:57,267 INFO  [ResourceManager Event Processor] 
capacity.CapacityScheduler (CapacityScheduler.java:removeNode(1267)) - Removed 
node 127.0.0.1:1234 clusterResource: <memory:0, vCores:0>
2015-02-19 09:14:57,267 FATAL [ResourceManager Event Processor] 
resourcemanager.ResourceManager (ResourceManager.java:run(688)) - Error in 
handling event type NODE_RESOURCE_UPDATE to the scheduler
java.lang.NullPointerException
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:548)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeAndQueueResource(CapacityScheduler.java:992)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1119)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:120)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:679)
        at java.lang.Thread.run(Thread.java:745)
2015-02-19 09:14:57,280 INFO  [ResourceManager Event Processor] 
resourcemanager.ResourceManager (ResourceManager.java:run(692)) - Exiting, 
bbye..
{noformat}
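
To make the failure mode concrete, here is a minimal toy model of the ordering 
problem. It is only a sketch and not the actual CapacityScheduler code: 
ToyScheduler, firstFailure and the int[] capability table are hypothetical 
names made up for illustration. It assumes, as the issue description states, 
that the NPE comes from the scheduler not finding the node when 
NODE_RESOURCE_UPDATE is handled before NODE_ADDED.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these types are the real YARN classes.
public class ReconnectOrderSketch {

    public enum EventType { NODE_ADDED, NODE_REMOVED, NODE_RESOURCE_UPDATE }

    private static final String NODE_ID = "127.0.0.1:1234";

    // Hypothetical stand-in for the scheduler's node table.
    static final class ToyScheduler {
        // nodeId -> {memoryMb, vCores}
        private final Map<String, int[]> nodes = new HashMap<>();

        void handle(EventType type, String nodeId, int memoryMb, int vCores) {
            switch (type) {
                case NODE_ADDED: {
                    nodes.put(nodeId, new int[] {memoryMb, vCores});
                    break;
                }
                case NODE_REMOVED: {
                    nodes.remove(nodeId);
                    break;
                }
                case NODE_RESOURCE_UPDATE: {
                    // Returns null when the node has already been removed.
                    int[] capability = nodes.get(nodeId);
                    // Dereferencing null here throws NullPointerException.
                    capability[0] = memoryMb;
                    capability[1] = vCores;
                    break;
                }
            }
        }
    }

    /**
     * Replays the events in the given order against a scheduler that already
     * knows the node. Returns the index of the first event that fails with a
     * NullPointerException, or -1 if every event is handled.
     */
    public static int firstFailure(List<EventType> order) {
        ToyScheduler scheduler = new ToyScheduler();
        // The node was registered before the reconnect.
        scheduler.handle(EventType.NODE_ADDED, NODE_ID, 16384, 16);
        for (int i = 0; i < order.size(); i++) {
            try {
                scheduler.handle(order.get(i), NODE_ID, 16384, 16);
            } catch (NullPointerException e) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Order seen in the log: the update arrives while the node is absent.
        System.out.println(firstFailure(Arrays.asList(
            EventType.NODE_REMOVED,
            EventType.NODE_RESOURCE_UPDATE,
            EventType.NODE_ADDED)));            // prints 1
        // Sequential order: the update arrives after the node is re-added.
        System.out.println(firstFailure(Arrays.asList(
            EventType.NODE_REMOVED,
            EventType.NODE_ADDED,
            EventType.NODE_RESOURCE_UPDATE)));  // prints -1
    }
}
```

In the toy model the second event of the logged order fails, while the same 
three events replayed with NODE_ADDED ahead of NODE_RESOURCE_UPDATE are all 
handled.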

> RMNodeImpl#ReconnectNodeTransition should send scheduler events in sequential 
> order
> -----------------------------------------------------------------------------------
>
>                 Key: YARN-3222
>                 URL: https://issues.apache.org/jira/browse/YARN-3222
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Rohith
>            Assignee: Rohith
>            Priority: Critical
>
> When a node is reconnected, RMNodeImpl#ReconnectNodeTransition notifies the 
> scheduler with the events node_added, node_removed or node_resource_update. 
> These events should be sent in sequential order, i.e. the node_added event 
> first and then the node_resource_update event.
> But if the node is reconnected with a different http port, the order of 
> scheduler events is node_removed --> node_resource_update --> node_added, 
> which causes the scheduler to not find the node and throw an NPE, and the RM 
> exits.
> The node_resource_update event should always be triggered via 
> RMNodeEventType.RESOURCE_UPDATE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
