[
https://issues.apache.org/jira/browse/YARN-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14972233#comment-14972233
]
Wangda Tan commented on YARN-4287:
----------------------------------
[~nroberts],
Thanks for the update. Some thoughts on your comments:
bq. This is a behavior change, but I can't think of any good cases where
someone would prefer the old behavior to the new. Let me know if you can think
of some.
I agree with you: most of your changes are good, and I'd prefer to enable them
to get better performance. But I can still think of some edge cases where I'd
prefer to keep the old behavior to avoid surprising results :). Let me explain.
There are several behavior changes in your patch:
1. rack-delay = min (computed-offswitch-delay, configured-rack-delay)
When a large configured-rack-delay is specified, the old behavior is used, so
this looks safe to me. And I think what you mentioned before:
bq. I didn't separate them in this version of the patch because I still want to
be able to specify rack-locality-delay BUT have the computed delay take effect
when an application is not asking for locality OR is really small.
makes sense to me. I just feel the current way of computing the offswitch delay
needs to be improved; I will add an example below.
2. node-delay = min(rack-delay, node-delay).
Suppose a cluster has 40 nodes and a user requests 3 containers on node1:
{code}
Assume configured-rack-delay=50 and configured node-delay=40,
rack-delay = min(3 (#requested-container) * 1 (#requested-resource-name) / 40,
50) = 0.
So:
node-delay = min(rack-delay, 40) = 0
{code}
In the above example, no matter how the rack-delay is specified or computed, if
we keep the node-delay at 40, we have a better chance of getting node-local
containers allocated.
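The arithmetic in this example can be reproduced with a small sketch. This is only an illustration of the two min() rules as described in this comment, not code from the patch: the class and method names are hypothetical, and the computed delay formula (requested containers * requested resource names / cluster nodes, with integer division) is taken from the example above.

```java
// Hypothetical sketch of the delay arithmetic in this comment.
// None of these names come from the YARN-4287 patch.
final class LocalityDelaySketch {

    // Computed delay as used in the example:
    // #requested-container * #requested-resource-name / #cluster-nodes (integer division).
    static int computedDelay(int requestedContainers, int requestedResourceNames, int clusterNodes) {
        return requestedContainers * requestedResourceNames / clusterNodes;
    }

    // Change 1: rack-delay = min(computed-offswitch-delay, configured-rack-delay).
    static int rackDelay(int computedOffSwitchDelay, int configuredRackDelay) {
        return Math.min(computedOffSwitchDelay, configuredRackDelay);
    }

    // Change 2: node-delay = min(rack-delay, configured node-delay).
    static int cappedNodeDelay(int rackDelay, int configuredNodeDelay) {
        return Math.min(rackDelay, configuredNodeDelay);
    }

    public static void main(String[] args) {
        int computed = computedDelay(3, 1, 40);   // 3 * 1 / 40 == 0
        int rack = rackDelay(computed, 50);       // min(0, 50) == 0
        int node = cappedNodeDelay(rack, 40);     // min(0, 40) == 0
        System.out.println("rack-delay=" + rack + ", node-delay=" + node);
    }
}
```

With the inputs from the example, both delays collapse to 0, so the request never waits for a node-local offer. Leaving the node-delay at its configured value of 40 would avoid that.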
3. Don't reset missed-opportunity when a rack-local container is allocated.
The benefit of this change is obvious: we get faster rack-local container
allocation. But I feel this can also hurt node-local container allocation (if
the application asks for only a small subset of nodes in a rack), which may
lead to a performance regression for applications that are sensitive to I/O
locality.
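To make the trade-off concrete, here is a toy model of the third change. It is a sketch under strong assumptions, not the scheduler's actual logic: it assumes the change means the missed-opportunity counter is no longer set back to zero after a rack-local allocation, that every scheduling offer comes from a rack-local (never node-local) node, and that a rack-local allocation is allowed once the counter reaches the node-delay. All names are hypothetical.

```java
// Toy model of missed-opportunity handling; names are hypothetical
// and the model is far simpler than the real capacity scheduler.
final class MissedOpportunitySketch {

    // Counts how many rack-local offers are needed to allocate `containers`
    // rack-local containers, given a node-delay threshold.
    static int offersNeeded(int containers, int nodeDelay, boolean resetOnRackLocal) {
        int missed = 0;     // missed scheduling opportunities so far
        int offers = 0;     // offers seen
        int allocated = 0;  // rack-local containers allocated
        while (allocated < containers) {
            offers++;
            if (missed >= nodeDelay) {
                allocated++;
                if (resetOnRackLocal) {
                    missed = 0;  // assumed old behavior: wait out the delay again
                }
            } else {
                missed++;        // skip this offer, hoping for a node-local one
            }
        }
        return offers;
    }

    public static void main(String[] args) {
        System.out.println("with reset:    " + offersNeeded(3, 40, true));
        System.out.println("without reset: " + offersNeeded(3, 40, false));
    }
}
```

In this model, 3 containers with a node-delay of 40 need 123 offers when the counter is reset and 43 when it is not. The flip side is the concern raised above: without the reset, the second and third containers no longer wait for a node-local offer at all.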
> Capacity Scheduler: Rack Locality improvement
> ---------------------------------------------
>
> Key: YARN-4287
> URL: https://issues.apache.org/jira/browse/YARN-4287
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacityscheduler
> Affects Versions: 2.7.1
> Reporter: Nathan Roberts
> Assignee: Nathan Roberts
> Attachments: YARN-4287-v2.patch, YARN-4287-v3.patch, YARN-4287.patch
>
>
> YARN-4189 does an excellent job describing the issues with the current delay
> scheduling algorithms within the capacity scheduler. The design proposal also
> seems like a good direction.
> This jira proposes a simple interim solution to the key issue we've been
> experiencing on a regular basis:
> - rackLocal assignments trickle out due to nodeLocalityDelay. This can have
> significant impact on things like CombineFileInputFormat which targets very
> specific nodes in its split calculations.
> I'm not sure when YARN-4189 will become reality so I thought a simple interim
> patch might make sense. The basic idea is simple:
> 1) Separate delays for rackLocal, and OffSwitch (today there is only 1)
> 2) When we're getting rackLocal assignments, subsequent rackLocal assignments
> should not be delayed
> Patch will be uploaded shortly. No big deal if the consensus is to go
> straight to YARN-4189.