[jira] [Comment Edited] (YARN-9278) Shuffle nodes when selecting to be preempted nodes
[ https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780101#comment-16780101 ] Wilfred Spiegelenburg edited comment on YARN-9278 at 3/12/19 12:22 PM:
---
Two things:
* I still think limiting the number of nodes is something we need to approach with care.
* Randomising a 10,000-entry list each time we pre-empt will also become expensive. I was thinking more of something like this:
{code:java}
int preEmptionBatchSize = conf.getPreEmptionBatchSize();
List<FSSchedulerNode> potentialNodes =
    scheduler.getNodeTracker().getNodesByResourceName(rr.getResourceName());
int size = potentialNodes.size();
int stop = 0;
int current = 0;
// find a start point somewhere in the list if it is long
if (size > preEmptionBatchSize) {
  Random rand = new Random();
  current = rand.nextInt(size / preEmptionBatchSize) * preEmptionBatchSize;
  stop = current;
}
do {
  FSSchedulerNode mine = potentialNodes.get(current);
  // Identify the containers
  current++;
  // wrap at the end of the list
  if (current >= size) {
    current = 0;
  }
} while (current != stop);
{code}
Pre-emption runs in a loop and we could be considering different applications one after the other. Continually shuffling that node list is not good from a performance perspective. A simple cut-in like the one above gives the same kind of behaviour. We could then still limit the number of "batches" we process. With some more smarts, the stop condition could be based on having processed, for example, 10 * the batch size in nodes (a batch of nodes could be deemed equivalent to the number of nodes in a rack):
{code:java}
stop = ((10 * preEmptionBatchSize) > size) ? current
    : (((10 * preEmptionBatchSize) + current) % size);
{code}
That gives a lot of flexibility and still decent performance in a large cluster.

> Shuffle nodes when selecting to be preempted nodes
> --
>
> Key: YARN-9278
> URL: https://issues.apache.org/jira/browse/YARN-9278
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: fairscheduler
> Reporter: Zhaohui Xin
> Assignee: Zhaohui Xin
> Priority: Major
> Attachments: YARN-9278.001.patch
>
> We should *shuffle* the nodes to avoid some nodes being preempted frequently.
> Also, we should *limit* the number of nodes to make preemption more efficient.
> Just like this,
> {code:java}
> // we should not iterate all nodes, that would be very slow
> long maxTryNodeNum =
>     context.getPreemptionConfig().getToBePreemptedNodeMaxNumOnce();
> if (potentialNodes.size() > maxTryNodeNum) {
>   Collections.shuffle(potentialNodes);
>   List<FSSchedulerNode> newPotentialNodes = new ArrayList<>();
>   for (int i = 0; i < maxTryNodeNum; i++) {
>     newPotentialNodes.add(potentialNodes.get(i));
>   }
>   potentialNodes = newPotentialNodes;
> }
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
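The shuffle-and-limit snippet quoted in the issue description above can be sketched as a small standalone helper. This is a hedged illustration, not the actual patch: the class and method names are hypothetical, a generic element type stands in for FSSchedulerNode, and shuffling a copy is an added precaution so the scheduler's own list is not mutated.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the shuffle-and-limit approach: shuffle the
// candidate nodes, then keep only the first maxTryNodeNum of them.
public class ShuffleAndLimit {
    static <T> List<T> limitCandidates(List<T> potentialNodes,
                                       int maxTryNodeNum, Random rand) {
        if (potentialNodes.size() <= maxTryNodeNum) {
            return potentialNodes; // nothing to trim
        }
        // Shuffle a copy so the caller's list order is left untouched.
        List<T> copy = new ArrayList<>(potentialNodes);
        Collections.shuffle(copy, rand);
        return copy.subList(0, maxTryNodeNum);
    }

    public static void main(String[] args) {
        List<String> nodes = new ArrayList<>(
            List.of("n0", "n1", "n2", "n3", "n4", "n5"));
        // Keep a random subset of 3 of the 6 candidates.
        System.out.println(limitCandidates(nodes, 3, new Random(42)));
    }
}
```

The shuffle is O(n) per preemption round, which is exactly the cost the follow-up comments object to on a 10,000-node cluster; the rotating-cursor alternative discussed below avoids it.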
[jira] [Comment Edited] (YARN-9278) Shuffle nodes when selecting to be preempted nodes
[ https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780101#comment-16780101 ] Wilfred Spiegelenburg edited comment on YARN-9278 at 3/12/19 12:25 PM:
---
Two things:
* I still think limiting the number of nodes is something we need to approach with care.
* Randomising a 10,000-entry list each time we pre-empt will also become expensive. I was thinking more of something like this:
{code:java}
int preEmptionBatchSize = conf.getPreEmptionBatchSize();
List<FSSchedulerNode> potentialNodes =
    scheduler.getNodeTracker().getNodesByResourceName(rr.getResourceName());
int size = potentialNodes.size();
int stop = 0;
int current = 0;
// find a start point somewhere in the list if it is long
if (size > preEmptionBatchSize) {
  Random rand = new Random();
  current = rand.nextInt(size / preEmptionBatchSize) * preEmptionBatchSize;
  stop = (preEmptionBatchSize > size) ? current
      : ((current + preEmptionBatchSize) % size);
}
do {
  FSSchedulerNode mine = potentialNodes.get(current);
  // Identify the containers
  current++;
  // wrap at the end of the list
  if (current >= size) {
    current = 0;
  }
} while (current != stop);
{code}
Pre-emption runs in a loop and we could be considering different applications one after the other. Continually shuffling that node list is not good from a performance perspective. A simple cut-in like the one above gives the same kind of behaviour. We could then still limit the number of "batches" we process. With some more smarts, the stop condition could be based on having processed, for example, 10 * the batch size in nodes (a batch of nodes could be deemed equivalent to the number of nodes in a rack):
{code:java}
stop = ((10 * preEmptionBatchSize) > size) ? current
    : (((10 * preEmptionBatchSize) + current) % size);
{code}
That gives a lot of flexibility and still decent performance in a large cluster.
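The rotating-cursor scan from the comment above can be isolated into a runnable sketch. All names here are hypothetical and a generic type replaces FSSchedulerNode; the batch-aligned random start and the wrap-around follow the pseudo-code in the comment, with the window capped at the list size so each node is visited at most once per round.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the cut-in scan: start at a random batch-aligned
// position and walk at most `window` nodes, wrapping at the end of the
// list, instead of shuffling the whole list on every preemption round.
public class RotatingScan {
    static int randomBatchStart(int size, int batchSize, Random rand) {
        if (size <= batchSize) {
            return 0; // short list: just scan from the beginning
        }
        // Batch-aligned start point, as in the comment above.
        return rand.nextInt(size / batchSize) * batchSize;
    }

    static <T> List<T> scanWindow(List<T> nodes, int start, int window) {
        int size = nodes.size();
        int current = start % size;
        int remaining = Math.min(window, size); // visit each node at most once
        List<T> visited = new ArrayList<>();
        while (remaining-- > 0) {
            visited.add(nodes.get(current)); // containers would be identified here
            current++;
            if (current >= size) {
                current = 0; // wrap at the end of the list
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("n0", "n1", "n2", "n3", "n4");
        // A window of 4 starting at index 3 wraps past the end of the list.
        System.out.println(scanWindow(nodes, 3, 4)); // prints [n3, n4, n0, n1]
    }
}
```

Because each round starts at a different batch boundary, the load spreads across the cluster over time without the O(n) shuffle, which is the behaviour the comment argues for.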
[jira] [Comment Edited] (YARN-9278) Shuffle nodes when selecting to be preempted nodes
[ https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776207#comment-16776207 ] Zhaohui Xin edited comment on YARN-9278 at 2/24/19 10:14 AM:
-
Thanks for your reply, [~yufeigu]. I think another solution is to stop looking for nodes once we find a suitable one.
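The early-exit idea in the comment above, stopping as soon as one node fits, can be sketched as a simple first-fit search. This is a hypothetical illustration, not scheduler code; the predicate stands in for whatever suitability check preemption would apply to a node.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical first-fit search: return the first suitable node instead of
// scanning the whole cluster for the best one.
public class FirstFitSearch {
    static <T> T findFirstSuitable(List<T> nodes, Predicate<T> suitable) {
        for (T node : nodes) {
            if (suitable.test(node)) {
                return node; // stop looking once a suitable node is found
            }
        }
        return null; // no suitable node this round
    }

    public static void main(String[] args) {
        // Toy stand-in: free resource units per node.
        List<Integer> freeResources = List.of(1, 2, 8, 9);
        // First node with at least 5 units free.
        System.out.println(findFirstSuitable(freeResources, n -> n >= 5)); // prints 8
    }
}
```

First-fit trades placement quality for speed: it bounds the scan without any shuffle, but nodes near the start of the list are still checked more often unless it is combined with a rotating start point like the one sketched earlier in the thread.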
[jira] [Comment Edited] (YARN-9278) Shuffle nodes when selecting to be preempted nodes
[ https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773299#comment-16773299 ] Yufei Gu edited comment on YARN-9278 at 2/20/19 7:47 PM:
-
Hi [~uranus], this seems to be a performance issue in a busy large cluster due to the preemption implementation, which iterates over nodes and checks each one. The idea of setting a node-count threshold doesn't look elegant, but it is reasonable if we can't change the iterate-and-check way of identifying preemptable containers. It may not be the only idea, though. Without introducing more complexity into FS preemption (it is already very complicated), there are some workarounds you can try: increase the FairShare Preemption Timeout and FairShare Preemption Threshold to reduce the chance of preemption. This is especially useful for a large cluster, since there is more chance of getting resources just by waiting.

was (Author: yufeigu):
Hi [~uranus], this seems to be a performance issue in a busy large cluster due to the preemption implementation, which iterates over nodes and checks each one. I would suggest lowering {{yarn.scheduler.fair.preemption.cluster-utilization-threshold}} to let preemption kick in earlier in a large cluster. The default value is 80%, which means preemption won't kick in until 80% of the whole cluster's resources have been used. Please be aware that a low utilization threshold may cause unnecessary container churn, so you don't want it to be too low.
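The tuning knobs mentioned in the comment above and its earlier revision live in different files: the utilization threshold is a yarn-site.xml property, while the preemption timeout and threshold are per-queue elements of the FairScheduler allocation file. The sketch below is illustrative only; the property and element names follow the FairScheduler configuration format, but the queue name and values are hypothetical, not recommendations.

```xml
<!-- yarn-site.xml: preemption does not kick in until cluster utilization
     reaches this fraction (default 0.8, i.e. 80%) -->
<property>
  <name>yarn.scheduler.fair.preemption.cluster-utilization-threshold</name>
  <value>0.8</value>
</property>

<!-- fair-scheduler.xml: raise the per-queue timeout (seconds) and fair-share
     threshold to reduce the chance of preemption; "example" is a
     hypothetical queue name -->
<allocations>
  <queue name="example">
    <fairSharePreemptionTimeout>120</fairSharePreemptionTimeout>
    <fairSharePreemptionThreshold>0.5</fairSharePreemptionThreshold>
  </queue>
</allocations>
```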
[jira] [Comment Edited] (YARN-9278) Shuffle nodes when selecting to be preempted nodes
[ https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772638#comment-16772638 ] Zhaohui Xin edited comment on YARN-9278 at 2/20/19 5:32 AM:
Hi, [~yufeigu]. When the preemption thread satisfies a starved container with ANY as the resource name, it searches all nodes of the cluster for the best node. This will be costly when the cluster has more than 10k nodes. I think we should limit the number of nodes considered in such a situation. What do you think? :D