[ https://issues.apache.org/jira/browse/YARN-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974512#comment-14974512 ]
Nathan Roberts commented on YARN-4287:
--------------------------------------

Thanks [~leftnoteasy] for the comments.

{quote}
2. node-delay = min(rack-delay, node-delay).
If a cluster has 40 nodes, user requests 3 containers on node1:
Assume the configured-rack-delay=50,
rack-delay = min(3 (#requested-container) * 1 (#requested-resource-name) / 40, 50) = 0.
So:
node-delay = min(rack-delay, 40) = 0
In above example, no matter how rack-delay specified/computed, if we can keep the node-delay to 40, we have better chance to get node-local containers allocated.
{quote}

It is true that we won't get good locality in this example. iiuc, we didn't get good locality before the patch either. i.e. canAssign() would return true for NODE-LOCAL and OFF-SWITCH without delay. With the patch, canAssign() will return true for NODE-LOCAL, RACK-LOCAL, and OFF-SWITCH without delay.

I believe the original intent of using localityWaitFactor was to avoid delaying small resource asks (could be a small job, or could be the tail of a large job). Unfortunately the algorithm still delayed RACK-LOCAL assignments. This made no sense to me - Accept OFF-SWITCH without delay, yet don't accept RACK-LOCAL??

I agree that we could change things here to get better locality for small requests, but to me this could have significant impact on small job latency so it would make me nervous to do so as part of this jira.

{quote}
3. Don't restore missed-opportunity if rack-local container allocated.
The benefit of this change is obvious - we can get faster rack-local container allocation. But I feel this can also affect node-local container allocation (If the application asks only a small subset of nodes in a rack), may lead to some performance regression for locality I/O sensitive applications.
{quote}

You're correct that it can affect node local container allocation. I will make this behavior configurable.
The reason I didn't in the first place was that I felt the circumstances where we lose out are rare: the application must not currently be getting NODE-LOCAL assignments (otherwise missedOpportunities resets), AND it must not be getting OFF-SWITCH assignments (missedOpportunities doesn't reset for OFF-SWITCH, so it will quickly allocate everything to OFF-SWITCH as soon as it hits that threshold). On the other hand, the effects of not doing it are dramatic. We have been having cases where 5% of NMs are down for maintenance and some jobs take about an order of magnitude longer to run than normal.

So, here are the changes I propose:

1) I need to change the way rackLocalityDelay is specified because it doesn't handle the case where the configuration value is larger than the cluster size. I was thinking of just scaling it. Let's say node-locality-delay=5000, rack-locality-delay=5100, cluster_size is 3000. In the existing code, node-locality-delay would automatically get lowered to 3000. Instead, the patch will lower rack-locality-delay to 3000, and node-locality-delay will be proportionally adjusted to (5000 * 3000 / 5100) = 2941.

2) Add a configurable boolean that controls whether a rack-local assignment resets missed_opportunities to 0 (old behavior) or to node-locality-delay (new behavior). The default would be the new behavior.

Let me know what you think of that approach.

> Capacity Scheduler: Rack Locality improvement
> ---------------------------------------------
>
> Key: YARN-4287
> URL: https://issues.apache.org/jira/browse/YARN-4287
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacityscheduler
> Affects Versions: 2.7.1
> Reporter: Nathan Roberts
> Assignee: Nathan Roberts
> Attachments: YARN-4287-v2.patch, YARN-4287-v3.patch, YARN-4287.patch
>
>
> YARN-4189 does an excellent job describing the issues with the current delay
> scheduling algorithms within the capacity scheduler. The design proposal also
> seems like a good direction.
> This jira proposes a simple interim solution to the key issue we've been
> experiencing on a regular basis:
> - rackLocal assignments trickle out due to nodeLocalityDelay. This can have
> significant impact on things like CombineFileInputFormat which targets very
> specific nodes in its split calculations.
> I'm not sure when YARN-4189 will become reality so I thought a simple interim
> patch might make sense. The basic idea is simple:
> 1) Separate delays for rackLocal, and OffSwitch (today there is only 1)
> 2) When we're getting rackLocal assignments, subsequent rackLocal assignments
> should not be delayed
> Patch will be uploaded shortly. No big deal if the consensus is to go
> straight to YARN-4189.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
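
For reference, the delay scaling proposed in item 1 of the comment can be sketched as below. This is only a minimal Python illustration of the arithmetic, not code from the patch (the CapacityScheduler itself is Java). The function name `scale_locality_delays`, its parameters, and the use of integer truncation are assumptions made for this example.

```python
from typing import Tuple


def scale_locality_delays(node_delay: int, rack_delay: int,
                          cluster_size: int) -> Tuple[int, int]:
    """Return (node_delay, rack_delay) after capping at the cluster size.

    Hypothetical helper: if the configured rack delay exceeds the cluster
    size, lower it to the cluster size and shrink the node delay by the
    same ratio. Integer truncation is an assumption of this sketch.
    """
    if rack_delay <= cluster_size:
        # Nothing to scale; both configured values are used as-is.
        return node_delay, rack_delay
    # rack_delay > cluster_size >= 0 here, so the division is safe.
    scaled_node_delay = node_delay * cluster_size // rack_delay
    return scaled_node_delay, cluster_size


if __name__ == "__main__":
    # Numbers from the comment: node=5000, rack=5100, cluster of 3000 nodes.
    print(scale_locality_delays(5000, 5100, 3000))
```

With the numbers from the comment, the call returns `(2941, 3000)`, matching the (5000 * 3000 / 5100) = 2941 figure quoted there. Values that already fit within the cluster size pass through unchanged.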