[
https://issues.apache.org/jira/browse/YARN-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864989#comment-13864989
]
Junping Du commented on YARN-1506:
----------------------------------
Hi [~jianhe], thanks again for your review and comments:
bq. Instead of the check here, I think we can send the event and make
RMNodeTransition to ignore this event. This can prevent the case that
isUnusable return true right before the node is about to become usable, since
the events will be processed sequentially.
Good point. We should just let the event mechanism handle this concurrency
issue.
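To illustrate the idea, here is a minimal sketch, not the actual patch. All
class, state, and event names below are hypothetical stand-ins rather than the
real RMNode code. It assumes events are handled one at a time in the order they
were sent, so the node itself decides whether a resource update applies to its
current state and the sender needs no check:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Hypothetical, simplified stand-ins; not the real RMNodeImpl or dispatcher.
class NodeEventSketch {
  enum State { NEW, RUNNING, DECOMMISSIONED }
  enum EventType { STARTED, RESOURCE_UPDATE, DECOMMISSION }

  static final class Event {
    final EventType type;
    final int memoryMb; // only meaningful for RESOURCE_UPDATE

    Event(EventType type, int memoryMb) {
      this.type = type;
      this.memoryMb = memoryMb;
    }
  }

  static final class Node {
    State state = State.NEW;
    int memoryMb = 4096;

    // Called from a single thread, so state cannot change mid-handling.
    void handle(Event e) {
      switch (e.type) {
        case STARTED:
          if (state == State.NEW) {
            state = State.RUNNING;
          }
          break;
        case DECOMMISSION:
          state = State.DECOMMISSIONED;
          break;
        case RESOURCE_UPDATE:
          // Apply only while the node is usable; otherwise ignore the event.
          if (state == State.RUNNING) {
            memoryMb = e.memoryMb;
          }
          break;
      }
    }
  }

  // Drains the queue in FIFO order, mimicking sequential event processing.
  static Node process(List<Event> events) {
    Queue<Event> queue = new ArrayDeque<>(events);
    Node node = new Node();
    while (!queue.isEmpty()) {
      node.handle(queue.poll());
    }
    return node;
  }
}
```

In this sketch, an update sent before the node is started is simply dropped,
while an update sent after the start event is applied, whatever the sender
believed about the node's state at send time.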
bq. Did we have an overall test for testing AdminService to send the request
and verify RMNode and schedulerNode are changed accordingly?
There is no system test in this patch yet, only some unit tests. However, I ran
some integration tests on the previous patches in YARN-291 with a raw patch of
YARN-313 (the patch with the admin CLI) and found that it works well. More
integration tests will come with YARN-313 (the next and last patch under
YARN-291 targeted for the 2.4 branch). Does that make sense?
bq. [REBOOT -> RUNNING] not sure about this. A restart node seems only trigger
the RECONNECT event on register and RMNode stays on RUNNING when receiving this
event.
The interesting thing here is that DeactivateNodeTransition is triggered on
RUNNING -> REBOOT, so the node is removed from RMContext.nodes and put into
RMContext.inactiveNodes. So at the next registration, the event is sent as
START instead of RECONNECT, and nothing happens because we don't have a state
machine transition from REBOOT on the START event. We should fix it, shouldn't
we?
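A rough sketch of the scenario as I describe it above. This is only a toy
model, not the ResourceManager code: the class, maps, transition table, and
event names are hypothetical, and it assumes the START event is delivered to
the node object that is still in the REBOOT state:

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of the described scenario; not the real RM classes.
class RebootRegistrationSketch {
  enum State { NEW, RUNNING, REBOOT }
  enum EventType { START, RECONNECT, REBOOTING }

  static final class Node {
    State state = State.NEW;
  }

  // Stand-ins for RMContext.nodes and RMContext.inactiveNodes.
  final Map<String, Node> nodes = new HashMap<>();
  final Map<String, Node> inactiveNodes = new HashMap<>();

  // Only the (state, event) pairs registered here cause a transition.
  static final Map<State, Map<EventType, State>> TRANSITIONS =
      new EnumMap<State, Map<EventType, State>>(State.class);

  static {
    for (State s : State.values()) {
      TRANSITIONS.put(s, new EnumMap<EventType, State>(EventType.class));
    }
    TRANSITIONS.get(State.NEW).put(EventType.START, State.RUNNING);
    TRANSITIONS.get(State.RUNNING).put(EventType.RECONNECT, State.RUNNING);
    TRANSITIONS.get(State.RUNNING).put(EventType.REBOOTING, State.REBOOT);
    // Deliberately no entry for (REBOOT, START): the gap under discussion.
  }

  // Returns true if a transition was registered for the node's current state.
  static boolean fire(Node node, EventType event) {
    State next = TRANSITIONS.get(node.state).get(event);
    if (next == null) {
      return false; // no transition registered, so nothing happens
    }
    node.state = next;
    return true;
  }

  // RUNNING -> REBOOT also deactivates the node (moves it between the maps).
  void reboot(String host) {
    Node node = nodes.get(host);
    if (node != null && fire(node, EventType.REBOOTING)) {
      nodes.remove(host);
      inactiveNodes.put(host, node);
    }
  }

  // Returns the event type chosen for this registration.
  EventType register(String host) {
    if (nodes.containsKey(host)) {
      fire(nodes.get(host), EventType.RECONNECT);
      return EventType.RECONNECT;
    }
    // The host is not in the active map, so START is sent instead.
    Node node = inactiveNodes.getOrDefault(host, new Node());
    if (fire(node, EventType.START)) {
      inactiveNodes.remove(host);
      nodes.put(host, node);
    }
    return EventType.START;
  }
}
```

In this model, registering a host, rebooting it, and registering it again
leaves the node in REBOOT inside the inactive map, because the second
registration sends START and no transition is registered for that pair.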
bq. [DECOMMISSIONED -> RUNNING] simply because we are not supporting
recommission?
Yes. IMO, recommission is a *must*-have if we claim that YARN supports
decommission.
bq. [LOST -> NEW/UNHEALTHY/DECOMMISSIONED] from the code, I can see the node is
actually gone from RM's point of view once the node expires
The node just goes to RMContext.inactiveNodes. But it is possible for the node
to heartbeat with a status update again (in cases such as a network outage that
recovers, a node VM that is suspended or frozen, unsynchronized clocks, etc.)
after its status has been set to LOST, and we don't have any code to handle
this. We should fix it, shouldn't we?
It seems to me that many state transitions are missing in the cases discussed
above; we can file a separate JIRA to address this. Thoughts?
> Replace set resource change on RMNode/SchedulerNode directly with event
> notification.
> -------------------------------------------------------------------------------------
>
> Key: YARN-1506
> URL: https://issues.apache.org/jira/browse/YARN-1506
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager, scheduler
> Reporter: Junping Du
> Assignee: Junping Du
> Priority: Blocker
> Attachments: YARN-1506-v1.patch, YARN-1506-v2.patch,
> YARN-1506-v3.patch, YARN-1506-v4.patch, YARN-1506-v5.patch
>
>
> According to Vinod's comments on YARN-312
> (https://issues.apache.org/jira/browse/YARN-312?focusedCommentId=13846087&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13846087),
> we should replace RMNode.setResourceOption() with some resource change event.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)