[jira] [Comment Edited] (YARN-9278) Shuffle nodes when selecting to be preempted nodes

2019-03-12 Thread Wilfred Spiegelenburg (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780101#comment-16780101
 ] 

Wilfred Spiegelenburg edited comment on YARN-9278 at 3/12/19 12:22 PM:
---

Two things:
* I still think limiting the number of nodes is something we need to approach 
with care.
* randomising a 10,000-entry list each time we pre-empt will also become 
expensive.
 
I was thinking more of something like this:
{code:java}
  int preEmptionBatchSize = conf.getPreEmptionBatchSize();
  List<FSSchedulerNode> potentialNodes =
      scheduler.getNodeTracker().getNodesByResourceName(rr.getResourceName());
  int size = potentialNodes.size();
  int stop = 0;
  int current = 0;
  // find a start point somewhere in the list if it is long
  if (size > preEmptionBatchSize) {
    Random rand = new Random();
    current = rand.nextInt(size / preEmptionBatchSize) * preEmptionBatchSize;
    stop = current;
  }
  do {
    FSSchedulerNode mine = potentialNodes.get(current);
    // Identify the containers to preempt on this node

    current++;
    // wrap around at the end of the list
    if (current >= size) {
      current = 0;
    }
  } while (current != stop);
{code}

Pre-emption runs in a loop and we could be considering different applications 
one after the other. Shuffling that node list continually is not good from a 
performance perspective. A simple cut-in like the above gives the same kind of 
behaviour.
We could then still limit the number of "batches" we process. With some more 
smarts, the stop condition could be based on having processed, for example, 
10 * the batch size in nodes (a batch of nodes could be deemed equivalent to 
the number of nodes in a rack):
{code:java}
  stop = ((10 * preEmptionBatchSize) > size)
      ? current : (((10 * preEmptionBatchSize) + current) % size);
{code}

That gives a lot of flexibility and still decent performance in a large 
cluster.
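As a self-contained illustration of the cut-in idea, the loop above can be 
sketched with plain strings standing in for FSSchedulerNode objects (the node 
names, batch size, and seed below are made up; the real code walks the 
scheduler's node tracker):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Sketch of the "cut in at a random batch boundary" iteration.
// Every node is still visited exactly once per pass; only the start
// point changes between preemption rounds.
public class BatchCutIn {
  public static int countVisited(List<String> nodes, int batchSize, long seed) {
    int size = nodes.size();
    int current = 0;
    int stop = 0;
    // pick a start point aligned to a batch boundary when the list is long
    if (size > batchSize) {
      Random rand = new Random(seed);
      current = rand.nextInt(size / batchSize) * batchSize;
      stop = current;
    }
    int visited = 0;
    do {
      String node = nodes.get(current);  // identify containers on this node
      visited++;
      current++;
      // wrap around at the end of the list
      if (current >= size) {
        current = 0;
      }
    } while (current != stop);
    return visited;
  }

  public static void main(String[] args) {
    List<String> nodes = Arrays.asList("n0", "n1", "n2", "n3", "n4", "n5");
    // regardless of where the cut-in lands, all six nodes are visited once
    System.out.println(countVisited(nodes, 2, 42L));  // prints 6
  }
}
```

Note the `current >= size` wrap test: with `current > size` the loop would 
read one element past the end of the list.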


was (Author: wilfreds):
Two things:
* I still think limiting the number of nodes is something we need to approach 
with care.
* randomising a 10,000-entry list each time we pre-empt will also become 
expensive.
 
I was thinking more of something like this:
{code:java}
  int preEmptionBatchSize = conf.getPreEmptionBatchSize();
  List<FSSchedulerNode> potentialNodes =
      scheduler.getNodeTracker().getNodesByResourceName(rr.getResourceName());
  int size = potentialNodes.size();
  int stop = 0;
  int current = 0;
  // find a start point somewhere in the list if it is long
  if (size > preEmptionBatchSize) {
    Random rand = new Random();
    current = rand.nextInt(size / preEmptionBatchSize) * preEmptionBatchSize;
  }
  do {
    FSSchedulerNode mine = potentialNodes.get(current);
    // Identify the containers to preempt on this node

    current++;
    // wrap around at the end of the list
    if (current >= size) {
      current = 0;
    }
  } while (current != stop);
{code}

Pre-emption runs in a loop and we could be considering different applications 
one after the other. Shuffling that node list continually is not good from a 
performance perspective. A simple cut-in like the above gives the same kind of 
behaviour.
We could then still limit the number of "batches" we process. With some more 
smarts, the stop condition could be based on having processed, for example, 
10 * the batch size in nodes (a batch of nodes could be deemed equivalent to 
the number of nodes in a rack):
{code:java}
  stop = ((10 * preEmptionBatchSize) > size)
      ? current : (((10 * preEmptionBatchSize) + current) % size);
{code}

That gives a lot of flexibility and still decent performance in a large 
cluster.

> Shuffle nodes when selecting to be preempted nodes
> --
>
> Key: YARN-9278
> URL: https://issues.apache.org/jira/browse/YARN-9278
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Reporter: Zhaohui Xin
>Assignee: Zhaohui Xin
>Priority: Major
> Attachments: YARN-9278.001.patch
>
>
> We should *shuffle* the nodes to avoid some nodes being preempted frequently. 
> Also, we should *limit* the number of nodes to make preemption more efficient.
> Just like this,
> {code:java}
> // we should not iterate all nodes, that will be very slow
> long maxTryNodeNum =
>     context.getPreemptionConfig().getToBePreemptedNodeMaxNumOnce();
> if (potentialNodes.size() > maxTryNodeNum) {
>   Collections.shuffle(potentialNodes);
>   List<FSSchedulerNode> newPotentialNodes = new ArrayList<>();
>   for (int i = 0; i < maxTryNodeNum; i++) {
>     newPotentialNodes.add(potentialNodes.get(i));
>   }
>   potentialNodes = newPotentialNodes;
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: 

[jira] [Comment Edited] (YARN-9278) Shuffle nodes when selecting to be preempted nodes

2019-03-12 Thread Wilfred Spiegelenburg (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780101#comment-16780101
 ] 

Wilfred Spiegelenburg edited comment on YARN-9278 at 3/12/19 12:25 PM:
---

Two things:
* I still think limiting the number of nodes is something we need to approach 
with care.
* randomising a 10,000-entry list each time we pre-empt will also become 
expensive.
 
I was thinking more of something like this:
{code:java}
  int preEmptionBatchSize = conf.getPreEmptionBatchSize();
  List<FSSchedulerNode> potentialNodes =
      scheduler.getNodeTracker().getNodesByResourceName(rr.getResourceName());
  int size = potentialNodes.size();
  int stop = 0;
  int current = 0;
  // find a start point somewhere in the list if it is long
  if (size > preEmptionBatchSize) {
    Random rand = new Random();
    current = rand.nextInt(size / preEmptionBatchSize) * preEmptionBatchSize;
    stop = (preEmptionBatchSize > size) ? current
        : ((current + preEmptionBatchSize) % size);
  }
  do {
    FSSchedulerNode mine = potentialNodes.get(current);
    // Identify the containers to preempt on this node

    current++;
    // wrap around at the end of the list
    if (current >= size) {
      current = 0;
    }
  } while (current != stop);
{code}

Pre-emption runs in a loop and we could be considering different applications 
one after the other. Shuffling that node list continually is not good from a 
performance perspective. A simple cut-in like the above gives the same kind of 
behaviour.
We could then still limit the number of "batches" we process. With some more 
smarts, the stop condition could be based on having processed, for example, 
10 * the batch size in nodes (a batch of nodes could be deemed equivalent to 
the number of nodes in a rack):
{code:java}
  stop = ((10 * preEmptionBatchSize) > size)
      ? current : (((10 * preEmptionBatchSize) + current) % size);
{code}

That gives a lot of flexibility and still decent performance in a large 
cluster.


was (Author: wilfreds):
Two things:
* I still think limiting the number of nodes is something we need to approach 
with care.
* randomising a 10,000-entry list each time we pre-empt will also become 
expensive.
 
I was thinking more of something like this:
{code:java}
  int preEmptionBatchSize = conf.getPreEmptionBatchSize();
  List<FSSchedulerNode> potentialNodes =
      scheduler.getNodeTracker().getNodesByResourceName(rr.getResourceName());
  int size = potentialNodes.size();
  int stop = 0;
  int current = 0;
  // find a start point somewhere in the list if it is long
  if (size > preEmptionBatchSize) {
    Random rand = new Random();
    current = rand.nextInt(size / preEmptionBatchSize) * preEmptionBatchSize;
    stop = current;
  }
  do {
    FSSchedulerNode mine = potentialNodes.get(current);
    // Identify the containers to preempt on this node

    current++;
    // wrap around at the end of the list
    if (current >= size) {
      current = 0;
    }
  } while (current != stop);
{code}

Pre-emption runs in a loop and we could be considering different applications 
one after the other. Shuffling that node list continually is not good from a 
performance perspective. A simple cut-in like the above gives the same kind of 
behaviour.
We could then still limit the number of "batches" we process. With some more 
smarts, the stop condition could be based on having processed, for example, 
10 * the batch size in nodes (a batch of nodes could be deemed equivalent to 
the number of nodes in a rack):
{code:java}
  stop = ((10 * preEmptionBatchSize) > size)
      ? current : (((10 * preEmptionBatchSize) + current) % size);
{code}

That gives a lot of flexibility and still decent performance in a large 
cluster.


[jira] [Comment Edited] (YARN-9278) Shuffle nodes when selecting to be preempted nodes

2019-02-24 Thread Zhaohui Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776207#comment-16776207
 ] 

Zhaohui Xin edited comment on YARN-9278 at 2/24/19 10:14 AM:
-

Thanks for your reply, [~yufeigu]. I think another solution is to stop looking 
for nodes when we find a suitable one. 
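That early-exit idea could be sketched roughly as follows; the node type and 
the suitability check below are stand-ins, since the real scheduler would test 
whether preempting containers on the node satisfies the starved request:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Sketch of "stop looking once a suitable node is found":
// instead of scoring every node, return the first acceptable one.
public class FirstSuitable {
  public static String findFirst(List<String> nodes, Predicate<String> suitable) {
    for (String node : nodes) {
      if (suitable.test(node)) {
        return node;  // stop iterating as soon as one node qualifies
      }
    }
    return null;      // no suitable node found in this pass
  }

  public static void main(String[] args) {
    List<String> nodes = Arrays.asList("small-1", "small-2", "big-1", "big-2");
    System.out.println(findFirst(nodes, n -> n.startsWith("big")));  // prints big-1
  }
}
```

The trade-off is that the first suitable node may not be the best one, so this 
trades preemption quality for iteration cost.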


was (Author: uranus):
Thanks for your reply, [~yufeigu]. Another solution is to stop looking for 
nodes when we find a suitable one. 




[jira] [Comment Edited] (YARN-9278) Shuffle nodes when selecting to be preempted nodes

2019-02-20 Thread Yufei Gu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773299#comment-16773299
 ] 

Yufei Gu edited comment on YARN-9278 at 2/20/19 7:47 PM:
-

Hi [~uranus], this seems to be a performance issue for a busy large cluster, 
caused by the preemption implementation, which is iterate-and-check. 
The idea of setting a node-count threshold doesn't look elegant, but it is 
reasonable if we can't change the iterate-and-check way of identifying 
preemptable containers. It may not be the only idea though.

Without introducing more complexity into FS preemption (it is already very 
complicated), there are some workarounds you can try: increase the FairShare 
Preemption Timeout and the FairShare Preemption Threshold to reduce the chance 
of preemption. This is especially useful for a large cluster, since there is a 
better chance of getting resources just by waiting.



was (Author: yufeigu):
Hi [~uranus], this seems to be a performance issue for a busy large cluster, 
caused by the preemption implementation, which is iterate-and-check. 

I would suggest lowering 
{{yarn.scheduler.fair.preemption.cluster-utilization-threshold}} to let 
preemption kick in earlier for a large cluster. The default value is 80%, which 
means preemption won't kick in until 80% of the whole cluster's resources have 
been used. Please be aware that a low utilization threshold may cause 
unnecessary container churn, so you don't want it to be too low.




[jira] [Comment Edited] (YARN-9278) Shuffle nodes when selecting to be preempted nodes

2019-02-19 Thread Zhaohui Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772638#comment-16772638
 ] 

Zhaohui Xin edited comment on YARN-9278 at 2/20/19 5:32 AM:


Hi, [~yufeigu]. When the preemption thread satisfies a starved container with 
ANY as the resource name, it searches all nodes of the cluster for the best 
node. This will be costly when the cluster has more than 10k nodes.

I think we should limit the number of nodes in such a situation. What do you 
think? :D
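A rough sketch of the limiting step proposed in the issue description, with 
plain strings for nodes and an illustrative cap (the real patch wires the cap 
to a preemption config value):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of shuffling and capping the candidate-node list for an ANY request.
public class LimitCandidates {
  public static List<String> limit(List<String> potentialNodes, int maxTryNodeNum) {
    if (potentialNodes.size() <= maxTryNodeNum) {
      return potentialNodes;
    }
    // Shuffle a copy so repeated preemption rounds don't always hit the
    // same nodes, then keep only the first maxTryNodeNum entries.
    List<String> shuffled = new ArrayList<>(potentialNodes);
    Collections.shuffle(shuffled);
    return shuffled.subList(0, maxTryNodeNum);
  }

  public static void main(String[] args) {
    List<String> nodes = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      nodes.add("node-" + i);
    }
    System.out.println(limit(nodes, 3).size());  // prints 3
  }
}
```

Shuffling the copy is O(n) per round, which is the cost the later comments try 
to avoid on 10k-node lists.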


was (Author: uranus):
Hi, [~yufeigu]. When the preemption thread satisfies a starved container with 
ANY as the resource name, it searches all nodes of the cluster for the best 
node. This will be costly when the cluster has more than 10k nodes.

I think we should limit the number of nodes in such a situation. What do you 
think? :D

 
