[ https://issues.apache.org/jira/browse/YARN-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14972233#comment-14972233 ]

Wangda Tan commented on YARN-4287:
----------------------------------

[~nroberts],
Thanks for updating. Some thoughts regarding your comments:
bq. This is a behavior change, but I can't think of any good cases where 
someone would prefer the old behavior to the new. Let me know if you can think 
of some.
I agree with you; most of your changes are good, and I prefer to enable this to get 
better performance. But I can still think of some edge cases where I'd prefer to 
keep the old behavior to avoid surprising results :). Let me explain more:

There are several behavior changes in your patch:
1. rack-delay = min(computed-offswitch-delay, configured-rack-delay)
When a large configured-rack-delay is specified, the old behavior is used, so this 
part is safe to me. And what you mentioned before:
bq. I didn't separate them in this version of the patch because I still want to 
be able to specify rack-locality-delay BUT have the computed delay take effect 
when an application is not asking for locality OR is really small.
Makes sense to me. I just feel the current way of computing the off-switch delay 
needs to be improved; I will add an example below (see point 2).
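
To make sure we're reading the capping the same way, here's a minimal sketch of how 
I understand it, using the same computed off-switch delay as in the example in 
point 2 (the names are mine, not from the patch):
{code}
// Minimal sketch of the capping in point 1 as I read it; not the actual patch code.
public class RackDelaySketch {
  static int computeRackDelay(int requestedContainers, int uniqueResourceNames,
      int clusterNodes, int configuredRackDelay) {
    // Computed off-switch delay scales with how much of the cluster the
    // application is actually asking for.
    int computedOffSwitchDelay =
        (requestedContainers * uniqueResourceNames) / clusterNodes;
    // Capping with the configured rack delay: a large configured value keeps the
    // old behavior, which is why this part looks safe to me.
    return Math.min(computedOffSwitchDelay, configuredRackDelay);
  }
}
{code}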

2. node-delay = min(rack-delay, node-delay).
If a cluster has 40 nodes and the user requests 3 containers on node1:
{code}
Assume configured-rack-delay = 50.
rack-delay = min(3 (#requested-containers) * 1 (#requested-resource-names) / 40, 50) = 0
So:
node-delay = min(rack-delay, 40) = 0
{code}
In the above example, no matter how rack-delay is specified or computed, if we can 
keep node-delay at 40 we have a better chance of getting node-local containers 
allocated.
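
For concreteness, here is the same arithmetic spelled out (hypothetical variable 
names; integer division is why the result collapses to 0):
{code}
// Walking through the example above with hypothetical names.
public class DelayExample {
  public static void main(String[] args) {
    int clusterNodes = 40;
    int requestedContainers = 3;
    int uniqueResourceNames = 1;   // only node1 is requested
    int configuredRackDelay = 50;
    int configuredNodeDelay = 40;  // node-locality-delay

    // Integer division: (3 * 1) / 40 = 0, so the computed delay wins over 50.
    int rackDelay = Math.min(
        (requestedContainers * uniqueResourceNames) / clusterNodes,
        configuredRackDelay);

    // With the patch: node-delay = min(rack-delay, node-delay) = min(0, 40) = 0,
    // so the scheduler stops waiting for node-locality on node1 almost immediately.
    int patchedNodeDelay = Math.min(rackDelay, configuredNodeDelay);

    // What I'd prefer: keep node-delay at the configured 40 so we still wait for
    // node-local allocation on node1 even when rack-delay computes to 0.
    int preferredNodeDelay = configuredNodeDelay;

    System.out.println("rack-delay = " + rackDelay);                   // 0
    System.out.println("patched node-delay = " + patchedNodeDelay);     // 0
    System.out.println("preferred node-delay = " + preferredNodeDelay); // 40
  }
}
{code}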

3. Don't reset missed-opportunities when a rack-local container is allocated.
The benefit of this change is obvious: we can get faster rack-local container 
allocation. But I feel it can also hurt node-local container allocation (if the 
application asks for only a small subset of nodes in a rack), which may lead to a 
performance regression for applications sensitive to I/O locality.
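
Roughly, the change I'm worried about looks like this (a self-contained sketch with 
my own names, not the actual CapacityScheduler code):
{code}
// Self-contained sketch of point 3; not the actual CapacityScheduler code.
enum Locality { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

class MissedOpportunityTracker {
  private int missedOpportunities = 0;

  void onMissedOpportunity() {
    missedOpportunities++;
  }

  void onAllocation(Locality allocated) {
    if (allocated == Locality.NODE_LOCAL) {
      // Node-local allocation: reset the counter, as before.
      missedOpportunities = 0;
    }
    // With the patch, a RACK_LOCAL allocation no longer resets the counter, so
    // subsequent rack-local assignments are not delayed again. My concern: for an
    // app that asks for only a few nodes in the rack, this also shortens the wait
    // for node-local assignments, which can hurt locality-sensitive I/O.
  }

  int missed() {
    return missedOpportunities;
  }
}
{code}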

> Capacity Scheduler: Rack Locality improvement
> ---------------------------------------------
>
>                 Key: YARN-4287
>                 URL: https://issues.apache.org/jira/browse/YARN-4287
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacityscheduler
>    Affects Versions: 2.7.1
>            Reporter: Nathan Roberts
>            Assignee: Nathan Roberts
>         Attachments: YARN-4287-v2.patch, YARN-4287-v3.patch, YARN-4287.patch
>
>
> YARN-4189 does an excellent job describing the issues with the current delay 
> scheduling algorithms within the capacity scheduler. The design proposal also 
> seems like a good direction.
> This jira proposes a simple interim solution to the key issue we've been 
> experiencing on a regular basis:
>  - rackLocal assignments trickle out due to nodeLocalityDelay. This can have 
> significant impact on things like CombineFileInputFormat which targets very 
> specific nodes in its split calculations.
> I'm not sure when YARN-4189 will become reality so I thought a simple interim 
> patch might make sense. The basic idea is simple: 
> 1) Separate delays for rackLocal, and OffSwitch (today there is only 1)
> 2) When we're getting rackLocal assignments, subsequent rackLocal assignments 
> should not be delayed
> Patch will be uploaded shortly. No big deal if the consensus is to go 
> straight to YARN-4189. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
