[ 
https://issues.apache.org/jira/browse/YARN-3222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14344468#comment-14344468
 ] 

Rohith commented on YARN-3222:
------------------------------

bq. I think we may not need to call sendNodeUsableEventIfNodeStateIsNotRunning 
to send the node_usable event in ReconnectEvent. As you said earlier, the next 
heartbeat will trigger this event based on the node's own health report. 
Right, it is not required. I will remove it.

bq. The transition is invoked only at running and unhealthy state, so I think 
this is not possible? 
I see. 

bq. Even by sending an event it's still possible that removeNode was removing 
new capability from cluster resource ?
I see a potential risk even if RMNodeResourceUpdateEvent has been sent. Say the 
AsyncDispatcher holds the events Node_removed and RMNodeResourceUpdate. The 
AsyncDispatcher fetches Node_removed and puts it on the SchedulerEventDispatcher 
queue. In any case, if the SchedulerEventDispatcher is delayed in processing 
node_removed, perhaps because many scheduler events are pending, then 
RMNodeResourceUpdate is processed first. So there is a chance of removing the 
new capability from the cluster resource. 
Any thoughts on handling this issue?
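
To make the interleaving concrete, here is a minimal single-threaded sketch of the two-stage dispatch. It is a toy model, not the RM code: DispatchOrderSketch, Node, Event and clusterAfterRace are hypothetical names, and the assumption that the scheduler reads the node's current capability at the moment it handles node_removed is mine.

```java
import java.util.ArrayDeque;
import java.util.Queue;

class DispatchOrderSketch {
    // Hypothetical stand-in for the RM-side node record (not RMNodeImpl).
    static final class Node {
        int capabilityMb;
        Node(int capabilityMb) { this.capabilityMb = capabilityMb; }
    }

    enum EventType { NODE_REMOVED, NODE_RESOURCE_UPDATE }

    static final class Event {
        final EventType type;
        final Node node;
        final int newCapabilityMb; // only meaningful for NODE_RESOURCE_UPDATE
        Event(EventType type, Node node, int newCapabilityMb) {
            this.type = type;
            this.node = node;
            this.newCapabilityMb = newCapabilityMb;
        }
    }

    // Returns the cluster resource left after the delayed node_removed is handled.
    static int clusterAfterRace(int clusterMb, int oldMb, int newMb) {
        Node node = new Node(oldMb);
        Queue<Event> central = new ArrayDeque<>();        // models the AsyncDispatcher queue
        Queue<Event> schedulerQueue = new ArrayDeque<>(); // models the scheduler's own queue

        central.add(new Event(EventType.NODE_REMOVED, node, 0));
        central.add(new Event(EventType.NODE_RESOURCE_UPDATE, node, newMb));

        // Stage 1: the central dispatcher drains its queue in order.
        while (!central.isEmpty()) {
            Event e = central.poll();
            if (e.type == EventType.NODE_REMOVED) {
                schedulerQueue.add(e);                    // only handed off, not yet processed
            } else {
                e.node.capabilityMb = e.newCapabilityMb;  // applied immediately
            }
        }

        // Stage 2: the scheduler gets to node_removed late and subtracts
        // whatever capability the node reports now (assumption of this model).
        int remaining = clusterMb;
        while (!schedulerQueue.isEmpty()) {
            Event e = schedulerQueue.poll();
            remaining -= e.node.capabilityMb;
        }
        return remaining;
    }

    public static void main(String[] args) {
        System.out.println(clusterAfterRace(16384, 8192, 4096));
    }
}
```

In this model the scheduler subtracts newMb rather than oldMb, which is the "removing new capability from cluster resource" case. Whether the real scheduler reads the capability this late depends on the scheduler implementation.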

> RMNodeImpl#ReconnectNodeTransition should send scheduler events in sequential 
> order
> -----------------------------------------------------------------------------------
>
>                 Key: YARN-3222
>                 URL: https://issues.apache.org/jira/browse/YARN-3222
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Rohith
>            Assignee: Rohith
>            Priority: Critical
>         Attachments: 0001-YARN-3222.patch, 0002-YARN-3222.patch, 
> 0003-YARN-3222.patch
>
>
> When a node is reconnected, RMNodeImpl#ReconnectNodeTransition notifies the 
> scheduler through the events node_added, node_removed or node_resource_update. 
> These events should be sent in sequential order, i.e. the node_added event 
> first and the node_resource_update event next.
> But if the node is reconnected with a different http port, the order of 
> scheduler events is node_removed --> node_resource_update --> node_added, 
> which causes the scheduler to not find the node, throw an NPE, and the RM to exit.
> The node_resource_update event should always be triggered via 
> RMNodeEventType.RESOURCE_UPDATE
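
The failing order in the description can be reproduced with a toy scheduler that keeps nodes in a map. This is only a sketch under assumptions: ReconnectOrderSketch, ToyScheduler and failsWithNpe are hypothetical names, and the real scheduler's node lookup is not shown in this thread.

```java
import java.util.HashMap;
import java.util.Map;

class ReconnectOrderSketch {
    enum Ev { NODE_ADDED, NODE_REMOVED, NODE_RESOURCE_UPDATE }

    // Hypothetical scheduler that tracks capability (MB) per node id.
    static final class ToyScheduler {
        private final Map<String, Integer> nodes = new HashMap<>();

        void nodeAdded(String id, int mb) { nodes.put(id, mb); }

        void nodeRemoved(String id) { nodes.remove(id); }

        // Returns the capability delta; fails if the node is unknown.
        int nodeResourceUpdate(String id, int mb) {
            Integer current = nodes.get(id); // null when the node was removed
            int delta = mb - current;        // unboxing null throws NullPointerException
            nodes.put(id, mb);
            return delta;
        }
    }

    // Replays the given event order for one node that was registered before
    // the reconnect, and reports whether the scheduler hit an NPE.
    static boolean failsWithNpe(Ev... order) {
        ToyScheduler scheduler = new ToyScheduler();
        scheduler.nodeAdded("host:1", 8192);
        try {
            for (Ev e : order) {
                switch (e) {
                    case NODE_ADDED:
                        scheduler.nodeAdded("host:1", 4096);
                        break;
                    case NODE_REMOVED:
                        scheduler.nodeRemoved("host:1");
                        break;
                    case NODE_RESOURCE_UPDATE:
                        scheduler.nodeResourceUpdate("host:1", 4096);
                        break;
                }
            }
            return false;
        } catch (NullPointerException npe) {
            return true;
        }
    }
}
```

With node_removed --> node_resource_update --> node_added the update hits a missing node, while node_removed --> node_added --> node_resource_update completes normally.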



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
