[ 
https://issues.apache.org/jira/browse/YARN-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974512#comment-14974512
 ] 

Nathan Roberts commented on YARN-4287:
--------------------------------------

Thanks [~leftnoteasy] for the comments. 

{quote}
2. node-delay = min(rack-delay, node-delay).
If a cluster has 40 nodes, user requests 3 containers on node1:

Assume the configured-rack-delay=50, 
rack-delay = min(3 (#requested-container) * 1 (#requested-resource-name)  / 40, 
50) = 0.
So:
node-delay = min(rack-delay, 40) = 0

In above example, no matter how rack-delay specified/computed, if we can keep 
the node-delay to 40, we have better chance to get node-local containers 
allocated.
{quote}
It is true that we won't get good locality in this example. IIUC, we didn't get 
good locality before the patch either, i.e. canAssign() would return true for 
NODE-LOCAL and OFF-SWITCH without delay. With the patch, canAssign() will 
return true for NODE-LOCAL, RACK-LOCAL, and OFF-SWITCH without delay. I believe 
the original intent of localityWaitFactor was to avoid delaying small resource 
asks (whether a small job or the tail of a large one). Unfortunately, the 
algorithm still delayed RACK-LOCAL assignments, which made no sense to me - 
accept OFF-SWITCH without delay, yet don't accept RACK-LOCAL? I agree that we 
could change things here to get better locality for small requests, but that 
could have a significant impact on small-job latency, so I'd be nervous doing 
it as part of this jira. 
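For reference, the delay arithmetic in the quoted example can be sketched as follows. This is an illustrative recreation only, not the actual CapacityScheduler code; the method and parameter names are hypothetical:

```java
public class LocalityWaitSketch {
    // Hypothetical sketch of the localityWaitFactor math from the quoted
    // example. Integer division makes a small ask (3 containers, 1 resource
    // name, 40-node cluster) collapse the delay to 0.
    static int rackDelay(int requestedContainers, int requestedResourceNames,
                         int clusterSize, int configuredRackDelay) {
        // localityWaitFactor scales the delay down for small resource asks
        int scaled = requestedContainers * requestedResourceNames / clusterSize;
        return Math.min(scaled, configuredRackDelay);
    }

    static int nodeDelay(int rackDelay, int configuredNodeDelay) {
        // node-delay = min(rack-delay, node-delay) per the quoted comment
        return Math.min(rackDelay, configuredNodeDelay);
    }

    public static void main(String[] args) {
        // 3 containers, 1 resource name, 40 nodes, configured-rack-delay=50
        int rack = rackDelay(3, 1, 40, 50); // 3 * 1 / 40 = 0 -> min(0, 50) = 0
        int node = nodeDelay(rack, 40);     // min(0, 40) = 0
        System.out.println(rack + " " + node);
    }
}
```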

{quote}
3. Don't restore missed-opportunity if rack-local container allocated.
The benefit of this change is obvious - we can get faster rack-local container 
allocation. But I feel this can also affect node-local container allocation (If 
the application asks only a small subset of nodes in a rack), may lead to some 
performance regression for locality I/O sensitive applications.
{quote}
You're correct that it can affect node-local container allocation, so I will 
make this behavior configurable. The reason I didn't in the first place was 
that the circumstances where we lose out seemed rare: the application isn't 
getting NODE-LOCAL assignments (because otherwise missedOpportunities resets), 
AND it isn't getting OFF-SWITCH assignments (missedOpportunities doesn't reset 
for OFF-SWITCH, so everything is quickly allocated OFF-SWITCH once that 
threshold is hit). On the other hand, the effects of not making this change are 
dramatic: we have seen cases where 5% of NMs are down for maintenance and some 
jobs take about an order of magnitude longer to run than normal. 

So, here are the changes I propose:
1) Change the way rackLocalityDelay is specified, because the current approach 
doesn't handle the case where the configured value is larger than the cluster 
size. I was thinking of simply scaling it. Say node-locality-delay=5000, 
rack-locality-delay=5100, and cluster_size=3000. In the existing code, 
node-locality-delay would automatically be lowered to 3000. Instead, 
rack-locality-delay will be lowered to 3000, and node-locality-delay will be 
proportionally adjusted: 5000 * 3000 / 5100 = 2941. 
2) Add a configurable boolean that controls whether a rack-local assignment 
resets missed_opportunities to 0 (old behavior) or to node-locality-delay (new 
behavior). The default will be the new behavior. 
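The two proposed changes can be sketched like this. Again a hypothetical illustration, not patch code; the names are made up for clarity:

```java
public class RackLocalityDelaySketch {
    // Sketch of proposed change (1): if the configured rack-locality-delay
    // exceeds the cluster size, clamp it to the cluster size and scale
    // node-locality-delay down proportionally, preserving their ratio.
    static int[] adjustDelays(int nodeDelay, int rackDelay, int clusterSize) {
        if (rackDelay > clusterSize) {
            // e.g. node=5000, rack=5100, cluster=3000:
            // node -> 5000 * 3000 / 5100 = 2941, rack -> 3000
            nodeDelay = nodeDelay * clusterSize / rackDelay;
            rackDelay = clusterSize;
        }
        return new int[] { nodeDelay, rackDelay };
    }

    // Sketch of proposed change (2): what a rack-local assignment resets
    // missed_opportunities to. Old behavior: 0 (rack-local allocations
    // trickle out); new behavior (proposed default): node-locality-delay,
    // so subsequent rack-local assignments are not delayed.
    static int missedOpportunitiesAfterRackLocal(boolean oldBehavior,
                                                 int nodeLocalityDelay) {
        return oldBehavior ? 0 : nodeLocalityDelay;
    }

    public static void main(String[] args) {
        int[] d = adjustDelays(5000, 5100, 3000);
        System.out.println(d[0] + " " + d[1]); // 2941 3000
        System.out.println(missedOpportunitiesAfterRackLocal(false, 2941));
    }
}
```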

Let me know what you think of that approach.


> Capacity Scheduler: Rack Locality improvement
> ---------------------------------------------
>
>                 Key: YARN-4287
>                 URL: https://issues.apache.org/jira/browse/YARN-4287
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacityscheduler
>    Affects Versions: 2.7.1
>            Reporter: Nathan Roberts
>            Assignee: Nathan Roberts
>         Attachments: YARN-4287-v2.patch, YARN-4287-v3.patch, YARN-4287.patch
>
>
> YARN-4189 does an excellent job describing the issues with the current delay 
> scheduling algorithms within the capacity scheduler. The design proposal also 
> seems like a good direction.
> This jira proposes a simple interim solution to the key issue we've been 
> experiencing on a regular basis:
>  - rackLocal assignments trickle out due to nodeLocalityDelay. This can have 
> significant impact on things like CombineFileInputFormat which targets very 
> specific nodes in its split calculations.
> I'm not sure when YARN-4189 will become reality so I thought a simple interim 
> patch might make sense. The basic idea is simple: 
> 1) Separate delays for rackLocal, and OffSwitch (today there is only 1)
> 2) When we're getting rackLocal assignments, subsequent rackLocal assignments 
> should not be delayed
> Patch will be uploaded shortly. No big deal if the consensus is to go 
> straight to YARN-4189. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)