[
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113307#comment-15113307
]
Brook Zhou commented on YARN-3223:
----------------------------------
I spoke offline with Junping; we will move forward with the async approach in
general. I will move any remaining to-dos to a separate JIRA.
Going back to my previous point,
[YARN-4344|https://issues.apache.org/jira/browse/YARN-4344] seems to have
removed the scheduler's dependency on the RMNode.getTotalCapability() call.
Instead, the scheduler now uses SchedulerNode.getTotalResource() directly when
updating clusterResource on add/removeNode. In that case, we can reduce the
scheduler's nodeUpdate change to the following:
{code:title=CapacityScheduler.java|borderStyle=solid}
private synchronized void nodeUpdate(RMNode nm) {
  ...
+  if (nm.getState() == NodeState.DECOMMISSIONING) {
+    // Set the node's total resource to what is currently used, so that
+    // nothing is left available for new allocations.
+    this.updateNodeAndQueueResource(nm, ResourceOption.newInstance(
+        getSchedulerNode(nm.getNodeID()).getUsedResource(), 0));
+  }
  ...
}
{code}
At this point RMNodeImpl has already saved the originalTotalCapability of the
node. The change also updates the SchedulerNode resources immediately, which
keeps scheduling consistent. The cost of locking should be minimal, since the
function performs only a few updates. This should resolve the issues you have
brought up. Do you agree?
Otherwise, I will just keep what I have in v3.patch and upload another patch
with the same nodeUpdate code for Fifo and Fair schedulers, then create another
JIRA to track the possible scheduler inconsistencies.
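To illustrate the effect of the nodeUpdate change above, here is a minimal, self-contained sketch. It is not YARN code: the classes Node and Scheduler, the memory-only resource model, and the recommission() rollback are all hypothetical stand-ins, chosen only to show how setting a decommissioning node's total resource to its used resource keeps the available resource at 0 and shrinks the cluster resource as containers finish.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model, not the real YARN classes. Resources are memory-only (MB).
class DecommissionSketch {
    enum NodeState { RUNNING, DECOMMISSIONING }

    /** Stand-in for a scheduler-side node record. */
    static final class Node {
        final String id;
        final int originalTotalMb; // saved so the node can be rolled back later
        NodeState state = NodeState.RUNNING;
        int totalMb;
        int usedMb;

        Node(String id, int totalMb) {
            this.id = id;
            this.originalTotalMb = totalMb;
            this.totalMb = totalMb;
        }

        int availableMb() {
            return totalMb - usedMb;
        }
    }

    /** Stand-in for a scheduler that tracks a cluster-wide total. */
    static final class Scheduler {
        private final Map<String, Node> nodes = new HashMap<>();
        private int clusterMb;

        synchronized void addNode(Node n) {
            nodes.put(n.id, n);
            clusterMb += n.totalMb;
        }

        /** On each heartbeat, clamp a decommissioning node's total to its usage. */
        synchronized void nodeUpdate(String id) {
            Node n = nodes.get(id);
            if (n != null && n.state == NodeState.DECOMMISSIONING) {
                clusterMb += n.usedMb - n.totalMb; // adjust cluster total by the delta
                n.totalMb = n.usedMb;              // available becomes 0
            }
        }

        synchronized void containerFinished(String id, int mb) {
            nodes.get(id).usedMb -= mb;
        }

        /** Hypothetical rollback: restore the saved original capability. */
        synchronized void recommission(String id) {
            Node n = nodes.get(id);
            clusterMb += n.originalTotalMb - n.totalMb;
            n.totalMb = n.originalTotalMb;
            n.state = NodeState.RUNNING;
        }

        synchronized int clusterMb() {
            return clusterMb;
        }
    }

    public static void main(String[] args) {
        Scheduler scheduler = new Scheduler();
        Node node = new Node("n1", 8192);
        scheduler.addNode(node);
        node.usedMb = 3072;

        node.state = NodeState.DECOMMISSIONING;
        scheduler.nodeUpdate("n1");
        System.out.println("available=" + node.availableMb()
            + " cluster=" + scheduler.clusterMb());

        scheduler.containerFinished("n1", 1024);
        scheduler.nodeUpdate("n1");
        System.out.println("available=" + node.availableMb()
            + " cluster=" + scheduler.clusterMb());
    }
}
```

In this model the available resource is only stale between a container finishing and the next nodeUpdate call, which is the kind of short-lived inconsistency the follow-up JIRA would track.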
> Resource update during NM graceful decommission
> -----------------------------------------------
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: graceful, nodemanager, resourcemanager
> Affects Versions: 2.7.1
> Reporter: Junping Du
> Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch,
> YARN-3223-v2.patch, YARN-3223-v3.patch
>
>
> During NM graceful decommission, we should handle resource updates properly.
> This includes: making RMNode keep track of the old resource for possible
> rollback, keeping the available resource at 0, and updating the used resource
> when containers finish.