[ 
https://issues.apache.org/jira/browse/YARN-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864989#comment-13864989
 ] 

Junping Du commented on YARN-1506:
----------------------------------

Hi [~jianhe], Thanks again for your review and comments:
bq. Instead of the check here, I think we can send the event and make 
RMNodeTransition ignore this event. This can prevent the case where 
isUnusable returns true right before the node is about to become usable, since 
the events will be processed sequentially.
Good point. We should just let the event mechanism handle this concurrency 
issue.
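
To illustrate the point about sequential event processing, here is a minimal 
sketch. It is not the YARN code: NodeEventSketch, Node, Dispatcher, and the 
event and state names are all hypothetical stand-ins. The sender never checks 
the node state itself; it only enqueues events, and the node's handler decides 
whether a resource update applies. Because events are handled in order, an 
update queued after a "become usable" event is applied rather than dropped.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Simplified, hypothetical model; these are not the real YARN classes.
class NodeEventSketch {
    enum State { RUNNING, UNUSABLE }
    enum EventType { BECOME_USABLE, BECOME_UNUSABLE, RESOURCE_UPDATE }

    static final class Event {
        final EventType type;
        final int memoryMb; // only meaningful for RESOURCE_UPDATE

        Event(EventType type, int memoryMb) {
            this.type = type;
            this.memoryMb = memoryMb;
        }
    }

    static final class Node {
        State state = State.UNUSABLE;
        int memoryMb = 4096;

        // The transition, not the sender, decides whether the update applies.
        void handle(Event e) {
            switch (e.type) {
                case BECOME_USABLE:
                    state = State.RUNNING;
                    break;
                case BECOME_UNUSABLE:
                    state = State.UNUSABLE;
                    break;
                case RESOURCE_UPDATE:
                    if (state == State.RUNNING) {
                        memoryMb = e.memoryMb;
                    } // in any other state the event is ignored
                    break;
            }
        }
    }

    // Single-consumer queue: events are handled strictly in arrival order.
    static final class Dispatcher {
        private final Queue<Event> queue = new ArrayDeque<>();
        private final Node node;

        Dispatcher(Node node) {
            this.node = node;
        }

        void send(Event e) {
            queue.add(e);
        }

        void drain() {
            Event e;
            while ((e = queue.poll()) != null) {
                node.handle(e);
            }
        }
    }

    public static void main(String[] args) {
        Node node = new Node();
        Dispatcher d = new Dispatcher(node);
        // The node is still unusable when the update is sent, but the
        // "become usable" event is already ahead of it in the queue.
        d.send(new Event(EventType.BECOME_USABLE, 0));
        d.send(new Event(EventType.RESOURCE_UPDATE, 8192));
        d.drain();
        System.out.println(node.state + " " + node.memoryMb);
    }
}
```

With an up-front isUnusable-style check, the sender in main would have seen an 
unusable node and skipped the update entirely.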
bq. Did we have an overall test for testing AdminService to send the request 
and verify RMNode and schedulerNode are changed accordingly?
There is no system test with this patch yet, only some unit tests. However, I 
ran some integration tests on the previous YARN-291 patches together with a raw 
patch for YARN-313 (the admin CLI patch) and found that it works well. More 
integration tests will come with YARN-313 (the next and last patch under 
YARN-291 that targets the 2.4 branch). Does that make sense?
bq.  [REBOOT -> RUNNING] not sure about this. A restart node seems only trigger 
the RECONNECT event on register and RMNode stays on RUNNING when receiving this 
event.
The interesting thing here is that DeactivateNodeTransition is triggered on 
RUNNING -> REBOOT, so the node is removed from RMContext.nodes and put into 
RMContext.inactiveNodes. At the next registration, the event is therefore sent 
as START instead of RECONNECT, and nothing happens because the state machine 
has no transition from REBOOT on a START event. We should fix that, shouldn't 
we?
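
A rough sketch of the gap, using a toy transition table rather than the real 
RMNode state machine: RebootTransitionSketch, its table, and the fire helper 
are hypothetical, and the state and event names simply follow the wording of 
this comment. An event with no registered transition leaves the state 
unchanged here, which stands in for "nothing happens". Adding a REBOOT -> 
RUNNING transition on START is shown only as one possible fix.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy transition table; names follow the comment, not the real YARN enums.
class RebootTransitionSketch {
    enum State { NEW, RUNNING, REBOOT }
    enum Event { START, RECONNECT, REBOOTING }

    static void add(Map<State, Map<Event, State>> table,
                    State from, Event on, State to) {
        table.computeIfAbsent(from, s -> new EnumMap<>(Event.class))
             .put(on, to);
    }

    static Map<State, Map<Event, State>> table(boolean withFix) {
        Map<State, Map<Event, State>> t = new EnumMap<>(State.class);
        add(t, State.NEW, Event.START, State.RUNNING);
        add(t, State.RUNNING, Event.RECONNECT, State.RUNNING);
        add(t, State.RUNNING, Event.REBOOTING, State.REBOOT);
        if (withFix) {
            // Assumed fix: let a rebooted node come back on START.
            add(t, State.REBOOT, Event.START, State.RUNNING);
        }
        return t;
    }

    // Returns the next state, or the current one if no transition exists.
    static State fire(Map<State, Map<Event, State>> table,
                      State current, Event event) {
        Map<Event, State> row = table.get(current);
        State next = (row == null) ? null : row.get(event);
        return (next == null) ? current : next;
    }

    public static void main(String[] args) {
        // Without the extra transition the node stays in REBOOT.
        System.out.println(fire(table(false), State.REBOOT, Event.START));
        // With it, re-registration brings the node back to RUNNING.
        System.out.println(fire(table(true), State.REBOOT, Event.START));
    }
}
```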
bq. [DECOMMISSIONED -> RUNNING] simply because we are not supporting 
recommission?
Yes. IMO, recommission is a *must* have if we claim that YARN supports 
decommission.
bq. [LOST -> NEW/UNHEALTHY/DECOMMISSIONED] from the code, I can see the node is 
actually gone from RM's point of view once the node expires
The node just goes to RMContext.inactiveNodes. But it is possible for the node 
to heartbeat with a status update again after it has been put into LOST (for 
example, a network outage that recovers, a node VM that was suspended or 
frozen, unsynchronized clocks, etc.), and we don't have any code to handle 
this. We should fix that too, shouldn't we?
It seems to me that many state transitions are missing in the cases discussed 
above; we can file a separate JIRA to address this. Thoughts?

> Replace set resource change on RMNode/SchedulerNode directly with event 
> notification.
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-1506
>                 URL: https://issues.apache.org/jira/browse/YARN-1506
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager, scheduler
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Blocker
>         Attachments: YARN-1506-v1.patch, YARN-1506-v2.patch, 
> YARN-1506-v3.patch, YARN-1506-v4.patch, YARN-1506-v5.patch
>
>
> According to Vinod's comments on YARN-312 
> (https://issues.apache.org/jira/browse/YARN-312?focusedCommentId=13846087&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13846087),
>  we should replace RMNode.setResourceOption() with some resource change event.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
