[
https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780101#comment-16780101
]
Wilfred Spiegelenburg edited comment on YARN-9278 at 3/12/19 12:22 PM:
-----------------------------------------------------------------------
Two things:
* I still think limiting the number of nodes is something we need to approach
with care.
* Randomising a 10,000-entry list each time we pre-empt will also become
expensive.
I was thinking of something more like this:
{code:java}
int preEmptionBatchSize = conf.getPreEmptionBatchSize();
List<FSSchedulerNode> potentialNodes =
    scheduler.getNodeTracker().getNodesByResourceName(rr.getResourceName());
// assumes potentialNodes is not empty, get(0) below would fail otherwise
int size = potentialNodes.size();
int stop = 0;
int current = 0;
// find a start point somewhere in the list if it is long
if (size > preEmptionBatchSize) {
  Random rand = new Random();
  // start on a randomly chosen batch boundary
  current = rand.nextInt(size / preEmptionBatchSize) * preEmptionBatchSize;
  stop = current;
}
do {
  FSSchedulerNode mine = potentialNodes.get(current);
  // Identify the containers
  ....
  current++;
  // flip at the end of the list: valid indexes are 0 .. size - 1
  if (current >= size) {
    current = 0;
  }
} while (current != stop);
{code}
Pre-emption runs in a loop and we could be considering different applications
one after the other. Shuffling that node list continually is not good from a
performance perspective. Cutting into the list at a random offset, as above,
spreads the pre-emption load over the nodes in much the same way without the
cost of a shuffle.
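To make the wrap-around behaviour easier to check, here is a minimal standalone sketch of the same cut-in traversal working on plain indexes instead of FSSchedulerNode objects. The class and method names (CutInTraversal, visitOrder) are made up for illustration and are not part of the scheduler:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class CutInTraversal {
  /**
   * Returns the indexes 0 .. size - 1 in the order they would be visited:
   * start on a random batch boundary (only when the list is longer than one
   * batch), walk forward, wrap at the end and stop when back at the start.
   */
  static List<Integer> visitOrder(int size, int batchSize, Random rand) {
    List<Integer> order = new ArrayList<>();
    if (size <= 0 || batchSize <= 0) {
      return order; // nothing to visit, or no usable batch size
    }
    int current = 0;
    if (size > batchSize) {
      current = rand.nextInt(size / batchSize) * batchSize;
    }
    int stop = current;
    do {
      order.add(current);
      current++;
      // flip at the end of the list
      if (current >= size) {
        current = 0;
      }
    } while (current != stop);
    return order;
  }
}
```

Every index is visited exactly once per call. Only the start point changes between calls, so no copy or shuffle of the list is needed.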
We could then still limit the number of "batches" we process. With some more
smarts the stop condition could be set so that we stop after processing, for
example, 10 * the batch size in nodes (a batch of nodes could be deemed
equivalent to the number of nodes in a rack):
{code:java}
stop = ((10 * preEmptionBatchSize) > size)
    ? current
    : (((10 * preEmptionBatchSize) + current) % size);
{code}
That gives a lot of flexibility and still decent performance in a large
cluster.
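As a rough check of that stop condition, the sketch below computes the stop index and counts how many nodes the loop would visit. The names (BoundedTraversal, stopIndex, countVisited) and the maxBatches parameter are illustrative assumptions; it also assumes positive batch sizes and a start index inside the list:

```java
public class BoundedTraversal {
  /**
   * Stop index for a traversal that starts at 'start' and should cover at
   * most maxBatches * batchSize nodes. If that budget covers the whole list,
   * the stop index equals the start index, which means one full circle.
   * Assumes batchSize > 0, maxBatches > 0 and 0 <= start < size.
   */
  static int stopIndex(int start, int size, int batchSize, int maxBatches) {
    long budget = (long) maxBatches * batchSize;
    if (budget >= size) {
      return start;
    }
    return (int) ((start + budget) % size);
  }

  /** Counts the nodes visited by the wrap-around loop for the given start. */
  static int countVisited(int start, int size, int batchSize, int maxBatches) {
    if (size <= 0) {
      return 0;
    }
    int stop = stopIndex(start, size, batchSize, maxBatches);
    int current = start;
    int visited = 0;
    do {
      visited++;
      current++;
      // flip at the end of the list
      if (current >= size) {
        current = 0;
      }
    } while (current != stop);
    return visited;
  }
}
```

With 1,000 nodes, a batch size of 20 and a start at index 960, the loop wraps around and stops at index 160 after 200 nodes. With only 100 nodes it walks the full list once.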
> Shuffle nodes when selecting to be preempted nodes
> --------------------------------------------------
>
> Key: YARN-9278
> URL: https://issues.apache.org/jira/browse/YARN-9278
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: fairscheduler
> Reporter: Zhaohui Xin
> Assignee: Zhaohui Xin
> Priority: Major
> Attachments: YARN-9278.001.patch
>
>
> We should *shuffle* the nodes to avoid some nodes being preempted frequently.
> Also, we should *limit* the number of nodes to make preemption more efficient.
> Something like this:
> {code:java}
> // we should not iterate all nodes, that will be very slow
> long maxTryNodeNum =
>     context.getPreemptionConfig().getToBePreemptedNodeMaxNumOnce();
> if (potentialNodes.size() > maxTryNodeNum) {
>   Collections.shuffle(potentialNodes);
>   List<FSSchedulerNode> newPotentialNodes = new ArrayList<FSSchedulerNode>();
>   for (int i = 0; i < maxTryNodeNum; i++) {
>     newPotentialNodes.add(potentialNodes.get(i));
>   }
>   potentialNodes = newPotentialNodes;
> }
> {code}
>