[jira] [Updated] (YARN-10660) YARN Web UI have problem when show node partitions resource

2021-03-25 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10660:
--
Fix Version/s: (was: 3.2.2)
   (was: 3.2.1)
   (was: 3.1.1)
   (was: 3.1.0)

[~tuyu], I'm removing the entries in the Fix Version field. Values are only 
entered in that field by the committer when the JIRA is resolved.

> YARN Web UI have problem when show node partitions resource
> ---
>
> Key: YARN-10660
> URL: https://issues.apache.org/jira/browse/YARN-10660
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 3.1.0, 3.1.1, 3.2.1, 3.2.2
>Reporter: tuyu
>Priority: Minor
> Attachments: 2021-03-01 19-56-02 的屏幕截图.png, YARN-10660.patch
>
>
> When the YARN node label feature is enabled, the YARN UI shows queue resources
> broken down by partition, but there is a problem when clicking the expand
> button: the URL grows very long, like this
> {code:java}
> 127.0.0.1:20701/cluster/scheduler?openQueues=Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20#Partition:%20DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96DEFAULT_PARTITION%20memory:491520,%20vCores:96
> {code}
> The root cause is:
> {code:java}
> origin url is:
>   Partition: <DEFAULT_PARTITION memory:491520, vCores:96>
> htmlencode is:
>   Partition: &lt;DEFAULT_PARTITION memory:491520, vCores:96&gt;
> SchedulerPageUtil has some javascript code in storeExpandedQueue:
>   tmpCurrentParam = tmpCurrentParam.split('&');
> {code}
> The encoded parameter "Partition: &lt;DEFAULT_PARTITION memory:491520,
> vCores:96&gt;" itself contains '&', so the split produces an array with
> len > 1. The problem logic is here: when the expand button is clicked closed,
> the function should clear the params, but the split array does not match the
> original url.
> When the expand button is clicked closed, &lt;DEFAULT_PARTITION memory:491520,
> vCores:96&gt; is appended again; clicking expand multiple times makes the URL
> grow ever longer.
>   
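
To make the failure mode concrete, here is a minimal sketch of what splitting an
HTML-encoded partition parameter on '&' does (illustrative only; the string is
shortened and the class name is invented, but String#split behaves as shown):
{code:java}
public class SplitDemo {
  public static void main(String[] args) {
    // HTML encoding turns "<" and ">" into "&lt;" and "&gt;", so the stored
    // parameter itself now contains '&'.
    String param = "Partition: &lt;DEFAULT_PARTITION memory:491520, vCores:96&gt;";
    String[] parts = param.split("&");
    System.out.println(parts.length); // 3, not 1
    // ["Partition: ", "lt;DEFAULT_PARTITION memory:491520, vCores:96", "gt;"]
    // storeExpandedQueue therefore never matches the stored parameter when
    // collapsing, so the fragment is appended again on every click.
  }
}
{code}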






[jira] [Commented] (YARN-10517) QueueMetrics has incorrect Allocated Resource when labelled partitions updated

2021-03-25 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308849#comment-17308849
 ] 

Eric Payne commented on YARN-10517:
---

[~sibyl.lv] / [~zhuqi], I can't seem to reproduce this issue. Can you please 
provide your config property values for the following?
{{yarn.scheduler.capacity.root.accessible-node-labels.tpcds.capacity}}
{{yarn.scheduler.capacity.root.accessible-node-labels.tpcds.maximum-capacity}}
{{yarn.scheduler.capacity.root.<queue-path>.accessible-node-labels}}
{{yarn.scheduler.capacity.root.<queue-path>.accessible-node-labels.tpcds.capacity}}
{{yarn.scheduler.capacity.root.<queue-path>.accessible-node-labels.tpcds.maximum-capacity}}
{{yarn.scheduler.capacity.root.<queue-path>.default-node-label-expression}}

bq. 3. Add label "tpcds" to cluster and replace label on node1 and node2 to be 
"tpcds" when the above application is running
Also, in step 3, can you provide the exact commands that you ran? I assume they 
are as follows, but I want to make sure we are on the same page:
{code:bash}
$ yarn rmadmin -addToClusterNodeLabels "tpcds"
$ yarn rmadmin -replaceLabelsOnNode "Node1:Port1=tpcds Node2:Port2=tpcds"
{code}

> QueueMetrics has incorrect Allocated Resource when labelled partitions updated
> --
>
> Key: YARN-10517
> URL: https://issues.apache.org/jira/browse/YARN-10517
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0, 3.3.0
>Reporter: sibyl.lv
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10517-branch-3.2.001.patch, YARN-10517.001.patch, 
> wrong metrics.png
>
>
> After https://issues.apache.org/jira/browse/YARN-9596, QueueMetrics still
> reports incorrect allocated values over JMX, such as allocatedMB,
> allocatedVCores, and allocatedContainers, when the node partition is updated
> from "DEFAULT" to another label while applications are running.
> Steps to reproduce
> ==
>  # Configure capacity-scheduler.xml with label configuration
>  # Submit one application to default partition and run
>  # Add label "tpcds" to cluster and replace label on node1 and node2 to be 
> "tpcds" when the above application is running
>  # Note down "VCores Used" in the Web UI
>  # When the application is finished, the metrics are wrong (screenshots
> attached).
> ==
>  
> FiCaSchedulerApp doesn't update queue metrics when CapacityScheduler handles
> the NODE_LABELS_UPDATE event.
> So we should release the container's resource from the old partition and add
> its used resource to the new partition, just as queueUsage is updated.
> {code:java}
> public void nodePartitionUpdated(RMContainer rmContainer, String oldPartition,
>     String newPartition) {
>   Resource containerResource = rmContainer.getAllocatedResource();
>   this.attemptResourceUsage.decUsed(oldPartition, containerResource);
>   this.attemptResourceUsage.incUsed(newPartition, containerResource);
>   getCSLeafQueue().decUsedResource(oldPartition, containerResource, this);
>   getCSLeafQueue().incUsedResource(newPartition, containerResource, this);
>   // Update new partition name if container is AM and also update AM resource
>   if (rmContainer.isAMContainer()) {
>     setAppAMNodePartitionName(newPartition);
>     this.attemptResourceUsage.decAMUsed(oldPartition, containerResource);
>     this.attemptResourceUsage.incAMUsed(newPartition, containerResource);
>     getCSLeafQueue().decAMUsedResource(oldPartition, containerResource, this);
>     getCSLeafQueue().incAMUsedResource(newPartition, containerResource, this);
>   }
> }
> {code}






[jira] [Commented] (YARN-10517) QueueMetrics has incorrect Allocated Resource when labelled partitions updated

2021-03-24 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308158#comment-17308158
 ] 

Eric Payne commented on YARN-10517:
---

I'll try to look this afternoon.







[jira] [Commented] (YARN-6538) Inter Queue preemption is not happening when DRF is configured

2021-03-19 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305144#comment-17305144
 ] 

Eric Payne commented on YARN-6538:
--

[~novaboy], please provide a specific use case to reproduce this issue. For 
example, please provide cluster size and applicable queue configuration 
parameters:
number of queues, queue capacities, queue max capacities, queue user limit 
factors, queue minimum user limit percents, queue ordering policies, preemption 
parameters for each queue, etc.

> Inter Queue preemption is not happening when DRF is configured
> --
>
> Key: YARN-6538
> URL: https://issues.apache.org/jira/browse/YARN-6538
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.8.0
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Major
>
> Consider a cluster capacity where memory is plentiful but vcores are scarce.
> If applications have more demand, vcores might be exhausted.
> Inter-queue preemption ideally has to kick in once vcores are over-utilized.
> However, preemption is not happening.
> Analysis:
> In {{AbstractPreemptableResourceCalculator.computeFixpointAllocation}}, 
> {code}
> // assign all cluster resources until no more demand, or no resources are
> // left
> while (!orderedByNeed.isEmpty() && Resources.greaterThan(rc, totGuarant,
> unassigned, Resources.none())) {
> {code}
> this will loop even when vcores are 0 (because memory is still positive).
> Hence idealAssigned accumulates more vcores than actually exist, which causes
> the no-preemption cases.
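
A hedged illustration of why that loop guard stays true (the numbers are
invented for the example; only the Resource/Resources/DominantResourceCalculator
APIs are real):
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DominantResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class FixpointGuardDemo {
  public static void main(String[] args) {
    DominantResourceCalculator rc = new DominantResourceCalculator();
    // Invented example: plenty of memory, few vcores.
    Resource totGuarant = Resource.newInstance(100 * 1024, 100);
    // vcores are fully assigned, but memory is not...
    Resource unassigned = Resource.newInstance(10 * 1024, 0);
    // ...so the dominant share of "unassigned" is still positive, the guard
    // stays true, and the fix-point loop keeps assigning vcores it lacks.
    System.out.println(
        Resources.greaterThan(rc, totGuarant, unassigned, Resources.none()));
    // prints: true
  }
}
{code}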






[jira] [Commented] (YARN-10503) Support queue capacity in terms of absolute resources with gpu resourceType.

2021-03-19 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304984#comment-17304984
 ] 

Eric Payne commented on YARN-10503:
---

[~leftnoteasy] and [~sunilg], is there a reason custom resources were not 
included when the absolute resource feature was added?

[~zhuqi], I would prefer that custom resources be treated in a generic way for 
calculating absolute queue resources. I would rather not treat GPU as a special 
case. However, I think YARN-9936 goes beyond this requirement. Can we use 
this JIRA (YARN-10503) to extend the absolute queue resource feature in a 
general way for custom resources?

> Support queue capacity in terms of absolute resources with gpu resourceType.
> 
>
> Key: YARN-10503
> URL: https://issues.apache.org/jira/browse/YARN-10503
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10503.001.patch, YARN-10503.002.patch
>
>
> Now the absolute resources are memory and cores.
> {code:java}
> /**
>  * Different resource types supported.
>  */
> public enum AbsoluteResourceType {
>   MEMORY, VCORES;
> }
> {code}
> But in our GPU production clusters, we need to support more resourceTypes.
> It's very important for cluster scaling when queues have absolute demands on
> different resourceTypes.
>  
> This Jira will handle GPU first.
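
For context, a hedged sketch of what such a queue capacity might look like once
custom resources are supported: the bracketed form is the existing
absolute-resource syntax, while the queue name ("gpuqueue") and the values are
invented for illustration.
{code}
yarn.scheduler.capacity.root.gpuqueue.capacity = [memory=10240,vcores=12,yarn.io/gpu=4]
{code}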






[jira] [Updated] (YARN-10588) Percentage of queue and cluster is zero in WebUI

2021-03-15 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10588:
--
Fix Version/s: 3.4.0

> Percentage of queue and cluster is zero in WebUI 
> -
>
> Key: YARN-10588
> URL: https://issues.apache.org/jira/browse/YARN-10588
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: YARN-10588.001.patch, YARN-10588.002.patch, 
> YARN-10588.003.patch, YARN-10588.004.patch
>
>
> Steps to reproduce:
> Configure the below property in resource-types.xml:
> {code:xml}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {code}
> Submit a job.
> In the UI you can see that % Of Queue and % Of Cluster are zero for the
> submitted application.
>  
> This is because SchedulerApplicationAttempt has the below check for
> calculating queueUsagePerc and clusterUsagePerc:
> {code:java}
> if (!calc.isInvalidDivisor(cluster)) {
>   float queueCapacityPerc = queue.getQueueInfo(false, false)
>       .getCapacity();
>   queueUsagePerc = calc.divide(cluster, usedResourceClone,
>       Resources.multiply(cluster, queueCapacityPerc)) * 100;
>   if (Float.isNaN(queueUsagePerc) || Float.isInfinite(queueUsagePerc)) {
>     queueUsagePerc = 0.0f;
>   }
>   clusterUsagePerc =
>       calc.divide(cluster, usedResourceClone, cluster) * 100;
> }
> {code}
> calc.isInvalidDivisor(cluster) always returns true because the gpu resource is 0.






[jira] [Commented] (YARN-10588) Percentage of queue and cluster is zero in WebUI

2021-03-15 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17301831#comment-17301831
 ] 

Eric Payne commented on YARN-10588:
---

OK. Thanks [~BilwaST] and [~Jim_Brennan]!
+1
I will commit this afternoon.







[jira] [Commented] (YARN-10588) Percentage of queue and cluster is zero in WebUI

2021-03-13 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300876#comment-17300876
 ] 

Eric Payne commented on YARN-10588:
---

bq.  I wonder if we should be looping over the first 
ResourceUtils.getNumberOfCountableResourceTypes() resource types instead of all 
of them.
[~Jim_Brennan], good catch, thanks for pointing that out. I think that would be 
a good change to make to the methods. To me, that sounds like a separate JIRA. 
Do you agree?







[jira] [Commented] (YARN-10588) Percentage of queue and cluster is zero in WebUI

2021-03-09 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298332#comment-17298332
 ] 

Eric Payne commented on YARN-10588:
---

Thanks [~BilwaST] for reporting the issue and the fixes! The changes LGTM. 
[~Jim_Brennan], do you want to weigh in?







[jira] [Commented] (YARN-10588) Percentage of queue and cluster is zero in WebUI

2021-03-09 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298159#comment-17298159
 ] 

Eric Payne commented on YARN-10588:
---

[~BilwaST], sorry for the delay. I will look at this today.







[jira] [Commented] (YARN-10559) Fair sharing intra-queue preemption support in Capacity Scheduler

2021-03-04 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295609#comment-17295609
 ] 

Eric Payne commented on YARN-10559:
---

[~ananyo_rao], I am still walking through the changes of the latest patch. 
However, I would like to let you know that when I installed the patch and ran 
it on a 6-node cluster, intra-queue preemption did not work at all, not even 
for different users. I verified that it was not my configuration by installing 
a build of trunk and doing the same manual test. I have not been able to try 
and debug the patch yet.

> Fair sharing intra-queue preemption support in Capacity Scheduler
> -
>
> Key: YARN-10559
> URL: https://issues.apache.org/jira/browse/YARN-10559
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 3.1.4
>Reporter: VADAGA ANANYO RAO
>Assignee: VADAGA ANANYO RAO
>Priority: Major
> Attachments: FairOP_preemption-design_doc_v1.pdf, 
> FairOP_preemption-design_doc_v2.pdf, YARN-10559.0001.patch, 
> YARN-10559.0002.patch, YARN-10559.0003.patch, YARN-10559.0004.patch, 
> YARN-10559.0005.patch, YARN-10559.0006.patch, YARN-10559.0007.patch, 
> YARN-10559.0008.patch, YARN-10559.0009.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Usecase:
> Due to the way Capacity Scheduler preemption works, if a single user submits
> a large application to a queue (using 100% of its resources), that job will
> not be preempted by future applications from the same user within the same
> queue. This means the later applications are forced to wait for the
> long-running application to complete, which prevents multiple long-running,
> large applications from running concurrently.
> Support fair sharing among apps while preempting applications from the same queue.






[jira] [Commented] (YARN-10559) Fair sharing intra-queue preemption support in Capacity Scheduler

2021-03-03 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17294871#comment-17294871
 ] 

Eric Payne commented on YARN-10559:
---

[~ananyo_rao], sorry for the delay. I'm trying to get my head around what I 
think the proper solution should be for this problem. I think that the crux of 
it is that 
{{FifoIntraQueuePreemptionPlugin#validateOutSameAppPriorityFromDemand}} does 
not allow preemption from the same user, and 
{{FifoIntraQueuePreemptionPlugin#skipContainerBasedOnIntraQueuePolicy}} doesn't 
allow the user to go below its user limit.

In this case, we don't care if the preemption causes the user to go below its 
user limit, because we expect the container to go back to the same user, just 
in a different app. However, since the state of the queue and cluster is 
always in flux, there is no guarantee that the preempted container will go to 
the app we expect it to.

Simply skipping these 2 checks is not sufficient either, since that would 
cause over-preemption, with containers being preempted and then assigned back 
to the same app they were preempted from.







[jira] [Commented] (YARN-10588) Percentage of queue and cluster is zero in WebUI

2021-03-01 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17293221#comment-17293221
 ] 

Eric Payne commented on YARN-10588:
---

[~BilwaST], sorry for the delay in replying.

bq. Changing {{DominantResourceCalculator#isInvalidDivisor}} to 
{{DominantResourceCalculator#isAllInvalidDivisor}} would solve problem. What do 
you think?

I verified that {{isAllInvalidDivisor}} is only true if all resources are 0. 
So, I agree that you could just replace it with that.

[~Jim_Brennan], do you agree?
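
A quick hedged sketch of the difference (assuming yarn.io/gpu has been
registered via resource-types.xml, and assuming the {{isAllInvalidDivisor}}
helper discussed above is in place):
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DominantResourceCalculator;

public class DivisorDemo {
  public static void main(String[] args) {
    // Cluster with memory and vcores but zero GPUs.
    Resource cluster = Resource.newInstance(100 * 1024, 100);
    cluster.setResourceValue("yarn.io/gpu", 0);

    DominantResourceCalculator calc = new DominantResourceCalculator();
    System.out.println(calc.isInvalidDivisor(cluster));    // true: gpu is 0
    System.out.println(calc.isAllInvalidDivisor(cluster)); // false: others non-zero
  }
}
{code}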







[jira] [Comment Edited] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-25 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291154#comment-17291154
 ] 

Eric Payne edited comment on YARN-10613 at 2/25/21, 7:15 PM:
-

Thanks [~Jim_Brennan]. I have uploaded the branch-3.2 patch.
It backports cleanly, compiles, and the preemption tests pass.


was (Author: eepayne):
Thanks [~Jim_Brennan]. I have uploaded the branch-3.2 patch.

> Config to allow Intra- and Inter-queue preemption to  enable/disable 
> conservativeDRF
> 
>
> Key: YARN-10613
> URL: https://issues.apache.org/jira/browse/YARN-10613
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.3.0, 3.2.2, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN-10613.branch-2.10.002.patch, 
> YARN-10613.branch-3.2.002.patch, YARN-10613.trunk.001.patch, 
> YARN-10613.trunk.002.patch
>
>
> YARN-8292 added code that prevents CS intra-queue preemption from preempting 
> containers from an app unless all of the major resources used by the app are 
> greater than the user limit for that user.
> Ex:
> | Used | User Limit |
> | <58GB, 58> | <30GB, 300> |
> In this example, only used memory is above the user limit, not used vcores. 
> So, intra-queue preemption will not occur.
> YARN-8292 added the {{conservativeDRF}} flag to 
> {{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
> If {{conservativeDRF}} is false, containers will be preempted from apps in 
> the example state. If true, containers will not be preempted.
> This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
> true for intra-queue (in-queue) preemption.
> I propose that in some cases, we want intra-queue preemption to be more 
> aggressive and preempt in the example case. To accommodate that, I propose 
> the addition of a config property.
> Also, we may want inter-queue (cross-queue) preemption to be more
> conservative, so I propose also making that a configuration property.
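
To make the conservative-DRF example concrete, a small hedged sketch of the
check described above (plain Java illustrating the rule, not the actual
preemption code):
{code:java}
public class ConservativeDrfDemo {
  public static void main(String[] args) {
    // From the example above: used <58GB, 58 vcores>, limit <30GB, 300 vcores>.
    long[] used      = {58L * 1024, 58};
    long[] userLimit = {30L * 1024, 300};
    boolean allAboveLimit = true;
    for (int i = 0; i < used.length; i++) {
      allAboveLimit &= used[i] > userLimit[i];
    }
    // false: vcores are under the limit, so conservative DRF skips preemption.
    System.out.println(allAboveLimit);
  }
}
{code}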






[jira] [Commented] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-25 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291154#comment-17291154
 ] 

Eric Payne commented on YARN-10613:
---

Thanks [~Jim_Brennan]. I have uploaded the branch-3.2 patch.







[jira] [Updated] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-25 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10613:
--
Attachment: YARN-10613.branch-3.2.002.patch







[jira] [Commented] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-25 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291022#comment-17291022
 ] 

Eric Payne commented on YARN-10613:
---

I don't think the unit test failures were related. It looks like a build 
environment issue. This is from the UT log:

{panel:title=https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/671/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt}
[INFO] Results:
[INFO] 
[WARNING] Tests run: 2161, Failures: 0, Errors: 0, Skipped: 8
...
[ERROR] Error occurred in starting fork, check output in log
[ERROR] Process Exit Code: 1
[ERROR] Crashed tests:
[ERROR] 
org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA
...
[ERROR] Error occurred in starting fork, check output in log
[ERROR] Process Exit Code: 1
[ERROR] Crashed tests:
[ERROR] 
org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA
...
[ERROR] Error occurred in starting fork, check output in log
[ERROR] Process Exit Code: 1
[ERROR] Crashed tests:
[ERROR] 
org.apache.hadoop.yarn.server.resourcemanager.TestReservationSystemWithRMHA
...
[ERROR] Error occurred in starting fork, check output in log
[ERROR] Process Exit Code: 1
[ERROR] Crashed tests:
[ERROR] org.apache.hadoop.yarn.server.resourcemanager.TestRMStoreCommands
{panel}







[jira] [Commented] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-24 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290587#comment-17290587
 ] 

Eric Payne commented on YARN-10613:
---

The {{TestRMRestart}} failure is unrelated to this patch.

I attached the branch-2.10 patch.







[jira] [Updated] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-24 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10613:
--
Attachment: YARN-10613.branch-2.10.002.patch







[jira] [Commented] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-24 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290483#comment-17290483
 ] 

Eric Payne commented on YARN-10613:
---

Thanks a lot, [~Jim_Brennan], for the review!

I have attached version 002 of the patch. This patch backports fairly cleanly 
(with minor import conflicts) back to branch-3.1. The patch has quite a few 
conflicts with branch-2.10, so I will need to put up a separate patch for that.







[jira] [Updated] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-24 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10613:
--
Attachment: YARN-10613.trunk.002.patch







[jira] [Updated] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-23 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10613:
--
Attachment: YARN-10613.trunk.001.patch







[jira] [Commented] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-23 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289407#comment-17289407
 ] 

Eric Payne commented on YARN-10613:
---

I was wrong in my original assessment. To be consistent with the existing 
property names, one would have the {{intra-queue-preemption}} prefix and the 
other would not.
So, it would look like this:
{code}
yarn.resourcemanager.monitor.capacity.preemption.conservative-drf
yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.conservative-drf
{code}
We should probably be consistent, even though it's ugly.







[jira] [Commented] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-23 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289393#comment-17289393
 ] 

Eric Payne commented on YARN-10613:
---

Actually, I want to change my suggestions to the following, since there is no 
need to have the word "preemption" twice:
{code}
yarn.resourcemanager.monitor.capacity.preemption.in-queue.conservative-drf
yarn.resourcemanager.monitor.capacity.preemption.cross-queue.conservative-drf
{code}

bq. For the sake of readability, I suggest the following instead:
However, adding these property names has its own set of messiness.
With my changes, we will have both of the following:
{code}
  private static final String INTRA_QUEUE_PREEMPTION_CONFIG_PREFIX =
  "intra-queue-preemption.";
  private static final String IN_QUEUE_PREEMPTION_CONFIG_PREFIX =
  "in-queue.";
{code}





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-23 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289323#comment-17289323
 ] 

Eric Payne commented on YARN-10613:
---

As far as property names go, the logical thing to do would be to add the 
following two properties (one that affects inter-queue and one that affects 
intra-queue preemption):
{code}
yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.conservative-drf
yarn.resourcemanager.monitor.capacity.preemption.inter-queue-preemption.conservative-drf
{code}
However, those very long names are exactly the same except for 1 character. For 
the sake of readability, I suggest the following instead:
{code}
yarn.resourcemanager.monitor.capacity.preemption.in-queue-preemption.conservative-drf
yarn.resourcemanager.monitor.capacity.preemption.cross-queue-preemption.conservative-drf
{code}


> Config to allow Intra- and Inter-queue preemption to  enable/disable 
> conservativeDRF
> 
>
> Key: YARN-10613
> URL: https://issues.apache.org/jira/browse/YARN-10613
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.3.0, 3.2.2, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
>
> YARN-8292 added code that prevents CS intra-queue preemption from preempting 
> containers from an app unless all of the major resources used by the app are 
> greater than the user limit for that user.
> Ex:
> | Used | User Limit |
> | <58GB, 58> | <30GB, 300> |
> In this example, only used memory is above the user limit, not used vcores. 
> So, intra-queue preemption will not occur.
> YARN-8292 added the {{conservativeDRF}} flag to 
> {{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
> If {{conservativeDRF}} is false, containers will be preempted from apps in 
> the example state. If true, containers will not be preempted.
> This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
> true for intra-queue (in-queue) preemption.
> I propose that in some cases, we want intra-queue preemption to be more 
> aggressive and preempt in the example case. To accommodate that, I propose 
> the addition of a config property.
> Also, we may want inter-queue (cross-queue) preemption to be more 
> conservative, so I propose also making that a configuration property:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-23 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289319#comment-17289319
 ] 

Eric Payne commented on YARN-10613:
---

Thanks for the suggestion, [~Jim_Brennan]! I have updated the summary and 
description to reflect that conservative DRF should be configurable for both 
inter- and intra-queue preemption.

> Config to allow Intra- and Inter-queue preemption to  enable/disable 
> conservativeDRF
> 
>
> Key: YARN-10613
> URL: https://issues.apache.org/jira/browse/YARN-10613
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.3.0, 3.2.2, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
>
> YARN-8292 added code that prevents CS intra-queue preemption from preempting 
> containers from an app unless all of the major resources used by the app are 
> greater than the user limit for that user.
> Ex:
> | Used | User Limit |
> | <58GB, 58> | <30GB, 300> |
> In this example, only used memory is above the user limit, not used vcores. 
> So, intra-queue preemption will not occur.
> YARN-8292 added the {{conservativeDRF}} flag to 
> {{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
> If {{conservativeDRF}} is false, containers will be preempted from apps in 
> the example state. If true, containers will not be preempted.
> This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
> true for intra-queue (in-queue) preemption.
> I propose that in some cases, we want intra-queue preemption to be more 
> aggressive and preempt in the example case. To accommodate that, I propose 
> the addition of a config property.
> Also, we may want inter-queue (cross-queue) preemption to be more 
> conservative, so I propose also making that a configuration property:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-23 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10613:
--
Description: 
YARN-8292 added code that prevents CS intra-queue preemption from preempting 
containers from an app unless all of the major resources used by the app are 
greater than the user limit for that user.

Ex:
| Used | User Limit |
| <58GB, 58> | <30GB, 300> |

In this example, only used memory is above the user limit, not used vcores. So, 
intra-queue preemption will not occur.

YARN-8292 added the {{conservativeDRF}} flag to 
{{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
If {{conservativeDRF}} is false, containers will be preempted from apps in the 
example state. If true, containers will not be preempted.

This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
true for intra-queue (in-queue) preemption.

I propose that in some cases, we want intra-queue preemption to be more 
aggressive and preempt in the example case. To accommodate that, I propose the 
addition of a config property.

Also, we may want inter-queue (cross-queue) preemption to be more conservative, 
so I propose also making that a configuration property:

  was:
YARN-8292 added code that prevents CS intra-queue preemption from preempting 
containers from an app unless all of the major resources used by the app are 
greater than the user limit for that user.

Ex:
| Used | User Limit |
| <58GB, 58> | <30GB, 300> |

In this example, only used memory is above the user limit, not used vcores. So, 
intra-queue preemption will not occur.

YARN-8292 added the {{conservativeDRF}} flag to 
{{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
If {{conservativeDRF}} is false, containers will be preempted from apps in the 
example state. If true, containers will not be preempted.

This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
true for intra-queue (in-queue) preemption.

I propose that in some cases, we want intra-queue preemption to be more 
aggressive and preempt in the example case. To accommodate that, I propose the 
addition of the following config property:
{code:xml}
  <property>
    <name>yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.conservative-drf</name>
    <value>true</value>
  </property>
{code}


> Config to allow Intra- and Inter-queue preemption to  enable/disable 
> conservativeDRF
> 
>
> Key: YARN-10613
> URL: https://issues.apache.org/jira/browse/YARN-10613
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.3.0, 3.2.2, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
>
> YARN-8292 added code that prevents CS intra-queue preemption from preempting 
> containers from an app unless all of the major resources used by the app are 
> greater than the user limit for that user.
> Ex:
> | Used | User Limit |
> | <58GB, 58> | <30GB, 300> |
> In this example, only used memory is above the user limit, not used vcores. 
> So, intra-queue preemption will not occur.
> YARN-8292 added the {{conservativeDRF}} flag to 
> {{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
> If {{conservativeDRF}} is false, containers will be preempted from apps in 
> the example state. If true, containers will not be preempted.
> This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
> true for intra-queue (in-queue) preemption.
> I propose that in some cases, we want intra-queue preemption to be more 
> aggressive and preempt in the example case. To accommodate that, I propose 
> the addition of a config property.
> Also, we may want inter-queue (cross-queue) preemption to be more 
> conservative, so I propose also making that a configuration property:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10613) Config to allow Intra- and Inter-queue preemption to enable/disable conservativeDRF

2021-02-23 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10613:
--
Summary: Config to allow Intra- and Inter-queue preemption to  
enable/disable conservativeDRF  (was: Config to allow Intra-queue preemption to 
 enable/disable conservativeDRF)

> Config to allow Intra- and Inter-queue preemption to  enable/disable 
> conservativeDRF
> 
>
> Key: YARN-10613
> URL: https://issues.apache.org/jira/browse/YARN-10613
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.3.0, 3.2.2, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
>
> YARN-8292 added code that prevents CS intra-queue preemption from preempting 
> containers from an app unless all of the major resources used by the app are 
> greater than the user limit for that user.
> Ex:
> | Used | User Limit |
> | <58GB, 58> | <30GB, 300> |
> In this example, only used memory is above the user limit, not used vcores. 
> So, intra-queue preemption will not occur.
> YARN-8292 added the {{conservativeDRF}} flag to 
> {{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
> If {{conservativeDRF}} is false, containers will be preempted from apps in 
> the example state. If true, containers will not be preempted.
> This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
> true for intra-queue (in-queue) preemption.
> I propose that in some cases, we want intra-queue preemption to be more 
> aggressive and preempt in the example case. To accommodate that, I propose 
> the addition of the following config property:
> {code:xml}
>   <property>
>     <name>yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.conservative-drf</name>
>     <value>true</value>
>   </property>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10588) Percentage of queue and cluster is zero in WebUI

2021-02-10 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282737#comment-17282737
 ] 

Eric Payne commented on YARN-10588:
---

I see. Thanks [~BilwaST] for the explanation. After looking at the code and 
talking it over with [~Jim_Brennan], it does look like a better solution would 
be to modify {{DominantResourceCalculator#isInvalidDivisor}} so that its 
behavior matches the logic of {{DominantResourceCalculator#divide}}.
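
To make that concrete, a minimal sketch of the direction I mean (my 
illustration only, not a patch): {{isInvalidDivisor}} would report an invalid 
divisor only when every resource is zero, since {{divide}} already skips 
resource types whose cluster total is zero.
{code:java}
// Sketch only, not the committed fix. Matches divide(), which skips
// resource types with a zero cluster total (e.g. gpu == 0).
@Override
public boolean isInvalidDivisor(Resource r) {
  for (ResourceInformation res : r.getResources()) {
    if (res.getValue() > 0L) {
      return false; // at least one dimension is usable by divide()
    }
  }
  return true; // invalid only when every resource type is zero
}
{code}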

> Percentage of queue and cluster is zero in WebUI 
> -
>
> Key: YARN-10588
> URL: https://issues.apache.org/jira/browse/YARN-10588
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10588.001.patch, YARN-10588.002.patch, 
> YARN-10588.003.patch
>
>
> Steps to reproduce:
> Configure the below property in resource-types.xml:
> {code:xml}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {code}
> Submit a job
> In the UI you can see that % Of Queue and % Of Cluster are zero for the 
> submitted application.
>  
> This is because SchedulerApplicationAttempt has the below check for 
> calculating queueUsagePerc and clusterUsagePerc:
> {code:java}
> if (!calc.isInvalidDivisor(cluster)) {
> float queueCapacityPerc = queue.getQueueInfo(false, false)
> .getCapacity();
> queueUsagePerc = calc.divide(cluster, usedResourceClone,
> Resources.multiply(cluster, queueCapacityPerc)) * 100;
> if (Float.isNaN(queueUsagePerc) || Float.isInfinite(queueUsagePerc)) {
>   queueUsagePerc = 0.0f;
> }
> clusterUsagePerc =
> calc.divide(cluster, usedResourceClone, cluster) * 100;
>   }
> {code}
> calc.isInvalidDivisor(cluster) always returns true because the gpu resource is 0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10588) Percentage of queue and cluster is zero in WebUI

2021-02-09 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282097#comment-17282097
 ] 

Eric Payne commented on YARN-10588:
---

[~Jim_Brennan], it does seem like DominantResourceCalculator.isInvalidDivisor() 
should match the logic of  DominantResourceCalculator.divide(). However, I 
wouldn't do that as part of this JIRA. I think just taking out the 
isInvalidDivisor check is fine in this JIRA.

[~BilwaST], I am curious why the change was necessary in FiCaSchedulerApp.java? 
I'm nervous about making any change to FiCaSchedulerApp.java, even one for 
getResourceUsageReport.

> Percentage of queue and cluster is zero in WebUI 
> -
>
> Key: YARN-10588
> URL: https://issues.apache.org/jira/browse/YARN-10588
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10588.001.patch, YARN-10588.002.patch, 
> YARN-10588.003.patch
>
>
> Steps to reproduce:
> Configure the below property in resource-types.xml:
> {code:xml}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {code}
> Submit a job
> In the UI you can see that % Of Queue and % Of Cluster are zero for the 
> submitted application.
>  
> This is because SchedulerApplicationAttempt has the below check for 
> calculating queueUsagePerc and clusterUsagePerc:
> {code:java}
> if (!calc.isInvalidDivisor(cluster)) {
> float queueCapacityPerc = queue.getQueueInfo(false, false)
> .getCapacity();
> queueUsagePerc = calc.divide(cluster, usedResourceClone,
> Resources.multiply(cluster, queueCapacityPerc)) * 100;
> if (Float.isNaN(queueUsagePerc) || Float.isInfinite(queueUsagePerc)) {
>   queueUsagePerc = 0.0f;
> }
> clusterUsagePerc =
> calc.divide(cluster, usedResourceClone, cluster) * 100;
>   }
> {code}
> calc.isInvalidDivisor(cluster) always returns true because the gpu resource is 0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10613) Config to allow Intra-queue preemption to enable/disable conservativeDRF

2021-02-03 Thread Eric Payne (Jira)
Eric Payne created YARN-10613:
-

 Summary: Config to allow Intra-queue preemption to  enable/disable 
conservativeDRF
 Key: YARN-10613
 URL: https://issues.apache.org/jira/browse/YARN-10613
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler, scheduler preemption
Affects Versions: 2.10.1, 3.1.4, 3.2.2, 3.3.0
Reporter: Eric Payne
Assignee: Eric Payne


YARN-8292 added code that prevents CS intra-queue preemption from preempting 
containers from an app unless all of the major resources used by the app are 
greater than the user limit for that user.

Ex:
| Used | User Limit |
| <58GB, 58> | <30GB, 300> |

In this example, only used memory is above the user limit, not used vcores. So, 
intra-queue preemption will not occur.

YARN-8292 added the {{conservativeDRF}} flag to 
{{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
If {{conservativeDRF}} is false, containers will be preempted from apps in the 
example state. If true, containers will not be preempted.

This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
true for intra-queue (in-queue) preemption.

I propose that in some cases, we want intra-queue preemption to be more 
aggressive and preempt in the example case. To accommodate that, I propose the 
addition of the following config property:
{code:xml}
  <property>
    <name>yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.conservative-drf</name>
    <value>true</value>
  </property>
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10559) Fair sharing intra-queue preemption support in Capacity Scheduler

2021-02-01 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276404#comment-17276404
 ] 

Eric Payne commented on YARN-10559:
---

[~ananyo_rao], sorry for the delay. I'm still reviewing the changes, but I have 
one concern with the requirements.

{code:title=FifoIntraQueuePreemptionPlugin#setFairShareForApps}
+   * we firstly ensure all the apps in the queue get equal resources.
{code}
I don't think this is exactly correct. When a queue in the Capacity Scheduler 
has FairOrderingPolicy set, it will grow each user's share of the resources at 
a fair pace. If user1 has app1 and user2 has app2 and app3, and if all 3 apps 
are requesting resources, app1 will receive resources faster than app2. app2 
and app3 together will receive resources at roughly the same rate as app1.
The total of resources assigned to user1 and user2 will grow at roughly the 
same rate, but the individual apps will not receive resources at the same 
rate.

So, when we preempt, we want to mimic that same behavior. I'm still trying to 
fully understand the code, so it may be that the code actually does what I 
said, but at the very least, the statement is misleading and I want to make 
sure we are on the same page regarding the requirements.
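
To put rough numbers on that distinction (made-up arithmetic, not output from 
the code):
{code:java}
// Illustrative only -- my reading of FairOrderingPolicy, not code from
// the patch. Shares grow per user; a user's apps then split that slice,
// so app1 grows about twice as fast as app2 or app3 individually.
int toAssign = 60;          // containers becoming available
int perUser = toAssign / 2; // user1 ~30, user2 ~30
int app1 = perUser;         // user1's only app takes the whole slice
int app2 = perUser / 2;     // user2's two apps split their slice: 15
int app3 = perUser / 2;     // and 15
{code}
Ideally, preemption would unwind allocations along those same per-user lines.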



> Fair sharing intra-queue preemption support in Capacity Scheduler
> -
>
> Key: YARN-10559
> URL: https://issues.apache.org/jira/browse/YARN-10559
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 3.1.4
>Reporter: VADAGA ANANYO RAO
>Assignee: VADAGA ANANYO RAO
>Priority: Major
> Attachments: FairOP_preemption-design_doc_v1.pdf, 
> FairOP_preemption-design_doc_v2.pdf, YARN-10559.0001.patch, 
> YARN-10559.0002.patch, YARN-10559.0003.patch, YARN-10559.0004.patch, 
> YARN-10559.0005.patch, YARN-10559.0006.patch, YARN-10559.0007.patch, 
> YARN-10559.0008.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Usecase:
> Due to the way Capacity Scheduler preemption works, if a single user submits 
> a large application to a queue (using 100% of resources), that job will not 
> be preempted by future applications from the same user within the same queue. 
> This implies that the later applications will be forced to wait for 
> completion of the long-running application. This prevents multiple 
> long-running, large applications from running concurrently.
> Support fair sharing among apps while preempting applications from the same 
> queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10164) Allow NM to start even when custom resource type not defined

2021-01-25 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne resolved YARN-10164.
---
Resolution: Won't Do

> Allow NM to start even when custom resource type not defined
> 
>
> Key: YARN-10164
> URL: https://issues.apache.org/jira/browse/YARN-10164
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
>
> In the [custom resource 
> documentation|https://hadoop.apache.org/docs/r3.2.1/hadoop-yarn/hadoop-yarn-site/ResourceModel.html],
>  it tells you to add the number of custom resources to a property called 
> {{yarn.nodemanager.resource-type.<resource>}} in a file called 
> {{node-resources.xml}}.
> For GPU resources, this would look something like
> {code:xml}
>   <property>
>     <name>yarn.nodemanager.resource-type.gpu</name>
>     <value>16</value>
>   </property>
> {code}
> A corresponding config property must also be in {{resource-types.xml}} called 
> yarn.resource-types:
> {code:xml}
>   <property>
>     <name>yarn.resource-types</name>
>     <value>gpu</value>
>     <description>Custom resources to be used for scheduling.</description>
>   </property>
> {code}
> If the yarn.nodemanager.resource-type.gpu property exists without the 
> corresponding yarn.resource-types property, the nodemanager fails to start.
> I would like the option to automatically create the node-resources.xml on all 
> new nodes regardless of whether or not the cluster supports GPU resources so 
> that if I deploy a GPU node into an existing cluster that does not (yet) 
> support GPU resources, the nodemanager will at least start. Even though it 
> doesn't support the GPU resource, the other supported resources will still be 
> available to be used by the apps in the cluster.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10559) Fair sharing intra-queue preemption support in Capacity Scheduler

2021-01-21 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269420#comment-17269420
 ] 

Eric Payne commented on YARN-10559:
---

Thanks [~ananyo_rao] for bringing up this issue and for providing a patch. 
Sorry for my delay in responding.

I think the overall requirements seem reasonable. Preemption is supposed to 
mimic what the capacity scheduler would do when a queue has FairOrderingPolicy 
set if all jobs were launched simultaneously. So preempting from a user's first 
app to give to the same user's second app makes sense.

The only caveat I would add is that (as is the case with all preempted 
containers) just because the preemption monitor decides to preempt a container 
because of a particular request, that doesn't mean that the capacity scheduler 
will then assign the container as expected. The states of the queue and cluster 
are constantly in flux and so a container preempted for one app could easily go 
to a different app in the same queue or a different queue.

I will try to look at the changes next week.

> Fair sharing intra-queue preemption support in Capacity Scheduler
> -
>
> Key: YARN-10559
> URL: https://issues.apache.org/jira/browse/YARN-10559
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 3.1.4
>Reporter: VADAGA ANANYO RAO
>Assignee: VADAGA ANANYO RAO
>Priority: Major
> Attachments: FairOP_preemption-design_doc_v1.pdf, 
> FairOP_preemption-design_doc_v2.pdf, YARN-10559.0001.patch, 
> YARN-10559.0002.patch, YARN-10559.0003.patch, YARN-10559.0004.patch, 
> YARN-10559.0005.patch, YARN-10559.0006.patch, YARN-10559.0007.patch, 
> YARN-10559.0008.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Usecase:
> Due to the way Capacity Scheduler preemption works, if a single user submits 
> a large application to a queue (using 100% of resources), that job will not 
> be preempted by future applications from the same user within the same queue. 
> This implies that the later applications will be forced to wait for 
> completion of the long-running application. This prevents multiple 
> long-running, large applications from running concurrently.
> Support fair sharing among apps while preempting applications from the same 
> queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-13 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264302#comment-17264302
 ] 

Eric Payne commented on YARN-4589:
--

[~Jim_Brennan], the 005 patch doesn't backport cleanly to 3.2. Can you please 
take a look?

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589.004.patch, YARN-4589.005.patch, 
> YARN-4589.2.patch, YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-13 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264255#comment-17264255
 ] 

Eric Payne commented on YARN-4589:
--

bq. I don't think I need to add a unit test for this, as it is only adding a 
log message.
Agreed. The changes LGTM.
+1
I will commit today.

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589.004.patch, YARN-4589.005.patch, 
> YARN-4589.2.patch, YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-12 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263731#comment-17263731
 ] 

Eric Payne commented on YARN-4589:
--

I verified that the unit tests are either not failing or failing in the same 
way with and without the patch.

The ASF license error was probably due to the stray file in patch 004.

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589.004.patch, YARN-4589.005.patch, 
> YARN-4589.2.patch, YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-11 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262966#comment-17262966
 ] 

Eric Payne commented on YARN-4589:
--

[~Jim_Brennan], the patch LGTM. I will run a few manual tests, wait for the 
precommit build, and hopefully commit tomorrow.

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589.004.patch, YARN-4589.2.patch, 
> YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2020-12-10 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247504#comment-17247504
 ] 

Eric Payne commented on YARN-9785:
--

I backported this to 2.10.

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Fix For: 3.3.0, 3.2.1, 3.1.3, 2.10.2
>
> Attachments: YARN-9785-001.patch, YARN-9785-branch-3.1.001.patch, 
> YARN-9785.002.patch, YARN-9785.003.patch, YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2020-12-10 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-9785:
-
Fix Version/s: 2.10.2

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Fix For: 3.3.0, 3.2.1, 3.1.3, 2.10.2
>
> Attachments: YARN-9785-001.patch, YARN-9785-branch-3.1.001.patch, 
> YARN-9785.002.patch, YARN-9785.003.patch, YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2020-12-09 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246750#comment-17246750
 ] 

Eric Payne commented on YARN-9785:
--

I would like to backport this to 2.10. It backports cleanly.

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9785-001.patch, YARN-9785-branch-3.1.001.patch, 
> YARN-9785.002.patch, YARN-9785.003.patch, YARN-9785.wip.patch
>
>
> Configure the below property in resource-types.xml:
> {quote}
> <property>
>   <name>yarn.resource-types</name>
>   <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after the AM limit for a queue is reached. 
> Applications get activated even after the limit is reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10496) [Umbrella] Support Flexible Auto Queue Creation in Capacity Scheduler

2020-12-02 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242686#comment-17242686
 ] 

Eric Payne commented on YARN-10496:
---

Thanks [~wangda] for putting this proposal together. I have a couple of 
comments.

First, I think option #1 would be the way to go. With option #1, it's clear 
whether you want percentages or weights, but with option #2, you lose the 
ability to check whether or not the percentages add up to 100%. For people 
coming from a FS perspective, this may not seem like a loss, but for admins 
used to CS, it is important that CS bringup can check whether you 
misconfigured the properties.
Also, with option #1, my guess is that the code will be more straightforward 
because once the weights are mapped to relative percentages, the calculations 
for user headroom, AM limit, etc. should remain the same.
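
As a sketch of the normalization I have in mind (my illustration, not from 
any patch or design doc):
{code:java}
// Illustrative only: fold sibling weights into the relative percentages
// that the existing CS calculations already understand.
float[] weights = {3f, 1f, 1f}; // e.g. root.a=3w, root.b=1w, root.c=1w
float sum = 0f;
for (float w : weights) {
  sum += w;
}
float[] capacities = new float[weights.length];
for (int i = 0; i < weights.length; i++) {
  capacities[i] = 100f * weights[i] / sum; // 60%, 20%, 20%
}
{code}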

For design option #1, I have a couple of concerns:
- From the design doc, one proposal is to define max capacity for weighted 
queues in terms of percentage of the cluster rather than percentage of the 
immediate parent. I would oppose this since max capacity in CS has always 
been relative to the immediate parent.
- Proposal #1 recommends supporting a different percentage/weight/value for 
each resource type (memory/vcores/GPUs/etc.). That is a major change and 
could affect the way that the DRC works in the CS, so if we decide to 
implement that feature, we should separate it out into its own design, and 
possibly even decouple it from this effort.


> [Umbrella] Support Flexible Auto Queue Creation in Capacity Scheduler
> -
>
> Key: YARN-10496
> URL: https://issues.apache.org/jira/browse/YARN-10496
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Priority: Major
>
> CapacityScheduler today doesn't support auto queue creation that is flexible 
> enough. The current constraints: 
>  * Only leaf queues can be auto-created
>  * A parent can only have either static queues or dynamic ones. This causes 
> multiple constraints. For example:
>  ** It isn't possible to have a VIP user like Alice with a static queue 
> root.user.alice with 50% capacity while the other user queues (under 
> root.user) are created dynamically and they share the remaining 50% of 
> resources.
>  ** This implies that there is no possibility to have both dynamically 
> created and static queues at the same time under root.
>  * In comparison, FairScheduler allows the following scenarios, which 
> Capacity Scheduler doesn't:
>  ** A new queue needs to be created under an existing parent, while the 
> parent already has static queues
>  ** Nested queue mapping policy, where two levels of queues may need to be 
> created: if an application belongs to user _alice_ (who has the 
> primary_group of _engineering_), the scheduler checks whether 
> _root.engineering_ exists; if it doesn't, it'll be created. Then the 
> scheduler checks whether _root.engineering.alice_ exists, and creates it if 
> it doesn't.
>  
> When we try to move users from FairScheduler to CapacityScheduler, we face 
> feature gaps which block users from migrating from FS to CS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10504) Implement weight mode in Capacity Scheduler

2020-12-02 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242684#comment-17242684
 ] 

Eric Payne commented on YARN-10504:
---

OK. After reviewing the design in the umbrella JIRA, it is clear what the 
requirements are for this JIRA.

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
>
> To allow the possibility to flexibly create queues in Capacity Scheduler, a 
> weight mode should be introduced. The existing {{capacity}} property should 
> be used with a different syntax, e.g.:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10278) CapacityScheduler test framework ProportionalCapacityPreemptionPolicyMockFramework need some review

2020-12-02 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10278:
--
Fix Version/s: 3.2.3
   3.1.5
   3.3.1
   3.4.0

> CapacityScheduler test framework 
> ProportionalCapacityPreemptionPolicyMockFramework need some review
> ---
>
> Key: YARN-10278
> URL: https://issues.apache.org/jira/browse/YARN-10278
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Gergely Pollak
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10278.001.patch, YARN-10278.002.patch, 
> YARN-10278.002.patch, YARN-10278.002.patch, YARN-10278.branch-3.1.001.patch, 
> YARN-10278.branch-3.1.002.patch, YARN-10278.branch-3.1.003.patch, 
> YARN-10278.branch-3.2.001.patch, YARN-10278.branch-3.2.002.patch, 
> YARN-10278.branch-3.2.002.patch, YARN-10278.branch-3.3.001.patch
>
>
> This test framework class mocks a bit too heavily, and simulates CS internal 
> behaviour with the mock methods beyond the point where it is reasonably 
> maintainable; any internal change in CS is a major headscratch.
> A lot of tests depend on this class, so we should approach it carefully, but 
> I think it's worth examining whether this class can be made a bit more 
> resilient to changes, and easier to maintain. Or at least document it better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10278) CapacityScheduler test framework ProportionalCapacityPreemptionPolicyMockFramework need some review

2020-12-01 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241919#comment-17241919
 ] 

Eric Payne commented on YARN-10278:
---

Thanks [~snemeth] for the work done on this issue.

The unit tests that failed as part of the pre-commit trunk build for patch 002 
were not related to this patch. I have committed patch 002 to trunk. I will 
commit the other branches later today or tomorrow.

> CapacityScheduler test framework 
> ProportionalCapacityPreemptionPolicyMockFramework need some review
> ---
>
> Key: YARN-10278
> URL: https://issues.apache.org/jira/browse/YARN-10278
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Gergely Pollak
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10278.001.patch, YARN-10278.002.patch, 
> YARN-10278.002.patch, YARN-10278.002.patch, YARN-10278.branch-3.1.001.patch, 
> YARN-10278.branch-3.1.002.patch, YARN-10278.branch-3.1.003.patch, 
> YARN-10278.branch-3.2.001.patch, YARN-10278.branch-3.2.002.patch, 
> YARN-10278.branch-3.2.002.patch, YARN-10278.branch-3.3.001.patch
>
>
> This test framework class mocks a bit too heavily, and simulates CS internal 
> behaviour with the mock methods beyond the point where it is reasonably 
> maintainable; any internal change in CS is a major headscratch.
> A lot of tests depend on this class, so we should approach it carefully, but 
> I think it's worth examining whether this class can be made a bit more 
> resilient to changes, and easier to maintain. Or at least document it better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10504) Implement weight mode in Capacity Scheduler

2020-11-30 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241118#comment-17241118
 ] 

Eric Payne commented on YARN-10504:
---

[~bteke], does the Queue Priorities feature (YARN-5864) meet your requirements?

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
>
> To allow the possibility to flexibly create queues in Capacity Scheduler, a 
> weight mode should be introduced. The existing {{capacity}} property should 
> be used with a different syntax, e.g.:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10503) Support queue capacity in terms of absolute resources with more resourceTypes.

2020-11-30 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10503:
--
   Fix Version/s: (was: 3.4.1)
  (was: 3.3.1)
Target Version/s: 3.3.1, 3.4.1

[~zhuqi], please use the "Target Version" field for the desired fix version. 
The "Fix Version" field is used after the ticket is resolved to indicate which 
releases have the fix.

> Support queue capacity in terms of absolute resources with more resourceTypes.
> --
>
> Key: YARN-10503
> URL: https://issues.apache.org/jira/browse/YARN-10503
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
>
> Currently, the absolute resource types are memory and vcores.
> {code:java}
> /**
>  * Different resource types supported.
>  */
> public enum AbsoluteResourceType {
>   MEMORY, VCORES;
> }{code}
> But in our GPU production clusters, we need to support more resourceTypes.
> It's very important for cluster scaling when different resource types have 
> different absolute-resource demands.
>  
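
One possible direction (my assumption, not from a posted patch) would be to 
derive the absolute-capacity types from the registered resource types rather 
than a fixed enum:
{code:java}
// Illustrative only: walk every registered resource type instead of the
// hard-coded MEMORY/VCORES pair when parsing absolute capacities.
import org.apache.hadoop.yarn.api.records.ResourceInformation;
import org.apache.hadoop.yarn.util.resource.ResourceUtils;

class AbsoluteResourceTypesSketch {
  static void printTypes() {
    for (ResourceInformation ri : ResourceUtils.getResourceTypesArray()) {
      // e.g. "memory-mb", "vcores", "yarn.io/gpu"
      System.out.println(ri.getName());
    }
  }
}
{code}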



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10431) [Umbrella] Job group management

2020-11-30 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240906#comment-17240906
 ] 

Eric Payne commented on YARN-10431:
---

[~wjlei], thank you for your reply.
bq. this design is a inner yarn level to organize different jobs. And it can 
unit different platform submitting jobs to yarn.
Can you please be more specific about uniting different platforms submitting 
jobs to yarn? I see the following from the above design document:
"one batch job may trigger several sub-jobs to running at the same time, like 
one job to process the data and another one monitor job metrics. And when we 
want to cancel these jobs, we must kill them one by one in current design. I 
proposal a job group concept to handle such parent-child jobs as one unit."

I believe that Oozie (oozie.apache.org) can provide this type of support. It 
can launch many different types of jobs such as pig, tez, hive, spark, python, 
shell actions, and probably many others. Oozie is a well-established 
coordinator that launches regularly recurring jobs. Oozie can group jobs 
together with upstream and downstream dependencies in a directed graph. Oozie 
has a community of support in Apache.

If I have misunderstood your requirements, please help me to understand how 
Oozie does not meet them.

I suggest that you take a look at the Oozie documentation and reach out to the 
us...@oozie.apache.org or d...@oozie.apache.org mailing lists.

> [Umbrella] Job group management
> ---
>
> Key: YARN-10431
> URL: https://issues.apache.org/jira/browse/YARN-10431
> Project: Hadoop YARN
>  Issue Type: New Feature
>Affects Versions: 2.9.2
>Reporter: jialei weng
>Priority: Major
> Attachments: YarnJobGroupImpl design.pdf
>
>
> In current yarn job management, we don't have an efficient mechanism to 
> manage several jobs together. For example, one batch job may trigger several 
> sub-jobs running at the same time, like one job to process the data and 
> another one to monitor job metrics. And when we want to cancel these jobs, 
> we have to kill them one by one in the current design. I propose a job group 
> concept to handle such parent-child jobs as one unit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10479) RMProxy should retry on SocketTimeout Exceptions

2020-11-05 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227063#comment-17227063
 ] 

Eric Payne commented on YARN-10479:
---

Thanks [~Jim_Brennan]. I committed this to 3.1 through trunk.

> RMProxy should retry on SocketTimeout Exceptions
> 
>
> Key: YARN-10479
> URL: https://issues.apache.org/jira/browse/YARN-10479
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10479.001.patch, YARN-10479.002.patch, 
> YARN-10479.003.patch
>
>
> During an incident involving a DNS outage, a large number of nodemanagers 
> failed to come back into service because they hit a socket timeout when 
> trying to re-register with the RM.
> SocketTimeoutException is not currently one of the exceptions that the 
> RMProxy will retry.  Based on this incident, it seems like it should be.  We 
> made this change internally about a year ago and it has been running in 
> production since.
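
A minimal sketch of the shape of the change (assumed shape only; see the 
attached patches for the real wiring inside RMProxy#createRetryPolicy):
{code:java}
// Sketch only: RMProxy maps exception classes to retry policies; the idea
// is simply to register SocketTimeoutException alongside the others.
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.retry.RetryPolicy;

class RetrySketch {
  static Map<Class<? extends Exception>, RetryPolicy> exceptionMap(
      RetryPolicy basePolicy) {
    Map<Class<? extends Exception>, RetryPolicy> map = new HashMap<>();
    map.put(ConnectException.class, basePolicy);
    map.put(SocketTimeoutException.class, basePolicy); // the new addition
    return map;
  }
}
{code}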



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10479) RMProxy should retry on SocketTimeout Exceptions

2020-11-05 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226968#comment-17226968
 ] 

Eric Payne commented on YARN-10479:
---

Thanks for raising this issue and providing the patch, [~Jim_Brennan].

The changes LGTM

+1

> RMProxy should retry on SocketTimeout Exceptions
> 
>
> Key: YARN-10479
> URL: https://issues.apache.org/jira/browse/YARN-10479
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10479.001.patch, YARN-10479.002.patch, 
> YARN-10479.003.patch
>
>
> During an incident involving a DNS outage, a large number of nodemanagers 
> failed to come back into service because they hit a socket timeout when 
> trying to re-register with the RM.
> SocketTimeoutException is not currently one of the exceptions that the 
> RMProxy will retry.  Based on this incident, it seems like it should be.  We 
> made this change internally about a year ago and it has been running in 
> production since.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10278) CapacityScheduler test framework ProportionalCapacityPreemptionPolicyMockFramework need some review

2020-11-05 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226857#comment-17226857
 ] 

Eric Payne commented on YARN-10278:
---

{quote}
bq. It seems like branch-2.10 would involve more work as it has some conflicts. 
Do you want to stick to the 2.10 patch?
How much work is it? I would really like this to be consistent back to 2.10.
{quote}
[~snemeth], I don't think we need to pull this back to 2.10. I am fine with 
backporting only as far as 3.1

Since it has been several months, the patches have gone stale. We will need to 
upmerge them to the HEAD of the branches.

> CapacityScheduler test framework 
> ProportionalCapacityPreemptionPolicyMockFramework need some review
> ---
>
> Key: YARN-10278
> URL: https://issues.apache.org/jira/browse/YARN-10278
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Gergely Pollak
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10278.001.patch, YARN-10278.002.patch, 
> YARN-10278.002.patch, YARN-10278.branch-3.1.001.patch, 
> YARN-10278.branch-3.1.002.patch, YARN-10278.branch-3.1.003.patch, 
> YARN-10278.branch-3.2.001.patch, YARN-10278.branch-3.2.002.patch, 
> YARN-10278.branch-3.3.001.patch
>
>
> This test framework class mocks a bit too heavily, and simulates CS internal 
> behaviour with the mock methods beyond the point where it is reasonably 
> maintainable; any internal change in CS is a major headscratch.
> A lot of tests depend on this class, so we should approach it carefully, but 
> I think it's worth examining whether this class can be made a bit more 
> resilient to changes, and easier to maintain. Or at least document it better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10479) RMProxy should retry on SocketTimeout Exceptions

2020-11-05 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226818#comment-17226818
 ] 

Eric Payne commented on YARN-10479:
---

Sure thing

> RMProxy should retry on SocketTimeout Exceptions
> 
>
> Key: YARN-10479
> URL: https://issues.apache.org/jira/browse/YARN-10479
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10479.001.patch, YARN-10479.002.patch, 
> YARN-10479.003.patch
>
>
> During an incident involving a DNS outage, a large number of nodemanagers 
> failed to come back into service because they hit a socket timeout when 
> trying to re-register with the RM.
> SocketTimeoutException is not currently one of the exceptions that the 
> RMProxy will retry.  Based on this incident, it seems like it should be.  We 
> made this change internally about a year ago and it has been running in 
> production since.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-10-30 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223818#comment-17223818
 ] 

Eric Payne commented on YARN-10475:
---

Thanks [~Jim_Brennan] for providing resolutions for this issue, and thanks 
[~bibinchundatt] for your reviews.
The changes LGTM.

+1

I am in favor of committing this patch as-is and creating a separate JIRA for 
adding a pluggable architecture for adjusting the heartbeat based on other 
factors.

[~bibinchundatt], I await your opinion.
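
For anyone skimming, my rough mental model of the scaling (illustrative names 
and formula, not code from the patch):
{code:java}
// Illustrative only: the interval stretches when the node is hotter than
// the cluster average and shrinks when it is cooler, clamped to assumed
// min/max bounds. The patch's actual formula and names may differ.
long scaleInterval(long baseMs, float nodeCpu, float clusterCpu,
    long minMs, long maxMs) {
  float ratio = clusterCpu > 0f ? nodeCpu / clusterCpu : 1.0f;
  long scaled = (long) (baseMs * ratio); // over-utilized => slower beats
  return Math.max(minMs, Math.min(maxMs, scaled));
}
{code}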

> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10475-branch-3.2.003.patch, 
> YARN-10475-branch-3.3.003.patch, YARN-10475.001.patch, YARN-10475.002.patch, 
> YARN-10475.003.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, its heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> its heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-29 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223200#comment-17223200
 ] 

Eric Payne commented on YARN-10471:
---

Thanks a lot, [~Jim_Brennan]!

I don't think it's necessary to port this back to 3.1 or 2.10. What do you 
think?

> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Fix For: 3.3.1, 3.4.1, 3.2.3
>
> Attachments: YARN.10471.001.patch, YARN.10471.002.patch, 
> YARN.10471.003.patch, YARN.10471.004.patch, YARN.10471.005.patch, 
> YARN.10471.branch-3.2.003.patch, YARN.10471.branch-3.2.005.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-10-29 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223196#comment-17223196
 ] 

Eric Payne commented on YARN-10475:
---

[~Jim_Brennan], Thanks for working on this feature and providing the patch.

The code patch looks good to me. Once you provide the documentation of the new 
properties, I am ready to provide my +1.

> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10475.001.patch, YARN-10475.002.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, its heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> its heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-29 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10471:
--
Attachment: YARN.10471.branch-3.2.005.patch

> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN.10471.001.patch, YARN.10471.002.patch, 
> YARN.10471.003.patch, YARN.10471.004.patch, YARN.10471.005.patch, 
> YARN.10471.branch-3.2.003.patch, YARN.10471.branch-3.2.005.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-10-28 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17222481#comment-17222481
 ] 

Eric Payne commented on YARN-10475:
---

[~Jim_Brennan], please add documentation for the new config properties.

> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10475.001.patch, YARN-10475.002.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, its heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> its heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-28 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17222471#comment-17222471
 ] 

Eric Payne commented on YARN-10471:
---

Thanks [~Jim_Brennan]. I have uploaded a new patch.

> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN.10471.001.patch, YARN.10471.002.patch, 
> YARN.10471.003.patch, YARN.10471.004.patch, YARN.10471.005.patch, 
> YARN.10471.branch-3.2.003.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-28 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10471:
--
Attachment: YARN.10471.005.patch

> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN.10471.001.patch, YARN.10471.002.patch, 
> YARN.10471.003.patch, YARN.10471.004.patch, YARN.10471.005.patch, 
> YARN.10471.branch-3.2.003.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-28 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17222396#comment-17222396
 ] 

Eric Payne commented on YARN-10471:
---

Thanks a lot, [~Jim_Brennan], for reviewing these patches.

I uploaded version 004. There is no difference from version 003 except that I 
added some documentation describing the new config properties in NodeManager.md.

> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN.10471.001.patch, YARN.10471.002.patch, 
> YARN.10471.003.patch, YARN.10471.004.patch, YARN.10471.branch-3.2.003.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-28 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10471:
--
Attachment: YARN.10471.004.patch

> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN.10471.001.patch, YARN.10471.002.patch, 
> YARN.10471.003.patch, YARN.10471.004.patch, YARN.10471.branch-3.2.003.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-28 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1715#comment-1715
 ] 

Eric Payne commented on YARN-10471:
---

The unit tests that failed in the branch 2 pre-commit build also fail without 
this patch in the same way.

> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN.10471.001.patch, YARN.10471.002.patch, 
> YARN.10471.003.patch, YARN.10471.branch-3.2.003.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-27 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221802#comment-17221802
 ] 

Eric Payne commented on YARN-10471:
---

Needed a branch-3.2 patch to resolve conflicts.

> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN.10471.001.patch, YARN.10471.002.patch, 
> YARN.10471.003.patch, YARN.10471.branch-3.2.003.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-27 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10471:
--
Attachment: YARN.10471.branch-3.2.003.patch

> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN.10471.001.patch, YARN.10471.002.patch, 
> YARN.10471.003.patch, YARN.10471.branch-3.2.003.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-27 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221621#comment-17221621
 ] 

Eric Payne commented on YARN-10471:
---

I attached version 003 of the patch.

> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN.10471.001.patch, YARN.10471.002.patch, 
> YARN.10471.003.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-27 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10471:
--
Attachment: YARN.10471.003.patch

> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN.10471.001.patch, YARN.10471.002.patch, 
> YARN.10471.003.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-27 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221515#comment-17221515
 ] 

Eric Payne commented on YARN-10471:
---

I just noticed the javac warnings, so I will have to create and upload a 
version 003.

> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN.10471.001.patch, YARN.10471.002.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-26 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220982#comment-17220982
 ] 

Eric Payne commented on YARN-10471:
---

I attached version 002 of the patch. This patch addresses the checkstyle and 
findbugs warnings.

The unit test failures are not caused by this patch:
- Although one test from {{TestContainersMonitor}}  failed, it was not one of 
the new ones which were added in this patch. 
{{testContainerKillOnExcessLogDirectory}} and 
{{testContainerKillOnExcessTotalLogs}} were the new ones, and they passed. 
{{testContainerKillOnMemoryOverflow}} fails without this patch as well.
- {{TestDeletionService}} is not failing for me.
- {{TestNodeManagerReboot}} fails intermittently even in trunk without this 
patch.
- The rest are also failing in trunk without the patch: 
{{TestContainerLaunch}}, {{TestContainerManager}}, {{TestNodeManagerResync}}, 
{{TestNodeManagerShutdown}}

> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN.10471.001.patch, YARN.10471.002.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-26 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10471:
--
Attachment: YARN.10471.002.patch

> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN.10471.001.patch, YARN.10471.002.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10431) [Umbrella] Job group management

2020-10-26 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220868#comment-17220868
 ] 

Eric Payne commented on YARN-10431:
---

[~wjlei], does [Oozie|http://oozie.apache.org] meet the same requirements as 
this Umbrella JIRA? If not, can you please provide input on what Oozie is 
lacking?

> [Umbrella] Job group management
> ---
>
> Key: YARN-10431
> URL: https://issues.apache.org/jira/browse/YARN-10431
> Project: Hadoop YARN
>  Issue Type: New Feature
>Affects Versions: 2.9.2
>Reporter: jialei weng
>Priority: Major
> Attachments: YarnJobGroupImpl design.pdf
>
>
> In current YARN job management, we don't have an efficient mechanism to 
> manage several jobs together. For example, one batch job may trigger several 
> sub-jobs that run at the same time, such as one job to process the data and 
> another to monitor job metrics. And when we want to cancel these jobs, we 
> have to kill them one by one in the current design. I propose a job group 
> concept to handle such parent-child jobs as one unit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-23 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10471:
--
Attachment: YARN.10471.001.patch

> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN.10471.001.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-23 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17219901#comment-17219901
 ] 

Eric Payne commented on YARN-10471:
---

I propose a solution that includes the option to limit the disk space for a 
single container log directory or for all of a container's logs. We have been 
running with this solution internally for the past 3 years with positive 
results.
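
For illustration, the configuration for such limits might look roughly like 
the following. The property names here are hypothetical placeholders showing 
the shape of the idea, not necessarily the names added by the patch:
{code:xml}
<!-- Hypothetical property names, for illustration only. -->
<property>
  <name>yarn.nodemanager.container-log-monitor.enable</name>
  <value>true</value>
</property>
<property>
  <!-- Kill the task attempt if any single container log directory
       exceeds this many bytes. -->
  <name>yarn.nodemanager.container-log-monitor.dir-size-limit-bytes</name>
  <value>1073741824</value>
</property>
<property>
  <!-- Kill the task attempt if the total size of all of a container's
       logs exceeds this many bytes. -->
  <name>yarn.nodemanager.container-log-monitor.total-size-limit-bytes</name>
  <value>10737418240</value>
</property>
{code}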

> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-950) Ability to limit or avoid aggregating logs beyond a certain size

2020-10-23 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17219874#comment-17219874
 ] 

Eric Payne commented on YARN-950:
-

{quote}1. Prevent logs for any container from becoming larger than a 
configurable size.
{quote}
I opened YARN-10471 to cover item 1.

 
{quote}2. Truncate everything from the middle of large log files, leaving a 
configurable amount of the head and the tail.
{quote}
Let's use this ticket (YARN-950) to cover item 2:

> Ability to limit or avoid aggregating logs beyond a certain size
> 
>
> Key: YARN-950
> URL: https://issues.apache.org/jira/browse/YARN-950
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager
>Affects Versions: 0.23.9, 2.6.0
>Reporter: Jason Darrell Lowe
>Priority: Major
>
> It would be nice if ops could configure a cluster such that any container log 
> beyond a configured size would either only have a portion of the log 
> aggregated or not aggregated at all.  This would help speed up the recovery 
> path for cases where a container creates an enormous log and fills a disk, as 
> currently it tries to aggregate the entire, enormous log rather than only 
> aggregating a small portion or simply deleting it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-23 Thread Eric Payne (Jira)
Eric Payne created YARN-10471:
-

 Summary: Prevent logs for any container from becoming larger than 
a configurable size.
 Key: YARN-10471
 URL: https://issues.apache.org/jira/browse/YARN-10471
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 3.1.4, 3.2.1
Reporter: Eric Payne
Assignee: Eric Payne


Configure a cluster such that a task attempt will be killed if any container 
log exceeds a configured size. This would help prevent logs from filling disks 
and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10278) CapacityScheduler test framework ProportionalCapacityPreemptionPolicyMockFramework need some review

2020-10-23 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17219786#comment-17219786
 ] 

Eric Payne commented on YARN-10278:
---

bq.  Will check once again how much work it would be to backport the patch to 
2.10 soon.
[~snemeth], have you had a chance to investigate the backport to 2.10?

> CapacityScheduler test framework 
> ProportionalCapacityPreemptionPolicyMockFramework need some review
> ---
>
> Key: YARN-10278
> URL: https://issues.apache.org/jira/browse/YARN-10278
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Gergely Pollak
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10278.001.patch, YARN-10278.002.patch, 
> YARN-10278.002.patch, YARN-10278.branch-3.1.001.patch, 
> YARN-10278.branch-3.1.002.patch, YARN-10278.branch-3.1.003.patch, 
> YARN-10278.branch-3.2.001.patch, YARN-10278.branch-3.2.002.patch, 
> YARN-10278.branch-3.3.001.patch
>
>
> This test framework class mocks a bit too heavily, and simulates CS internal 
> behaviour with the mock methods beyond the point where it is reasonably 
> maintainable; any internal change in CS is a major headscratch.
> A lot of tests depend on this class, so we should approach it carefully, but 
> I think it's worth examining whether this class can be made a bit more 
> resilient to changes, and easier to maintain. Or at least document it better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10452) YARN scheduler response returns invalid values for capacity, maxCapacity and absoluteMaxCapacity

2020-10-20 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217943#comment-17217943
 ] 

Eric Payne commented on YARN-10452:
---

Actually, I think this is expected.
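
One plausible mechanism, assuming the capacities are computed by dividing by 
the partition's resource (which is zero when the partition has no nodes): Java 
float division by zero yields Infinity and NaN rather than throwing, and those 
values then flow into the JSON response. A minimal demonstration:
{code:java}
public class PartitionCapacitySketch {
  public static void main(String[] args) {
    float partitionResource = 0f; // no nodes in the default partition
    // Float division by zero does not throw in Java; it produces the
    // special values seen in the scheduler response.
    float capacity = 100f / partitionResource;  // Infinity -> "INF"
    float maxCapacity = 0f / partitionResource; // NaN
    System.out.println(capacity + " " + maxCapacity);
  }
}
{code}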

> YARN scheduler response returns invalid values for capacity, maxCapacity and 
> absoluteMaxCapacity
> 
>
> Key: YARN-10452
> URL: https://issues.apache.org/jira/browse/YARN-10452
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Akhil PB
>Priority: Major
> Attachments: yarn_scheduler_response_incorrect_partition_capacity.json
>
>
> When there are no nodes in the default partition, the YARN scheduler response 
> returns invalid values for capacities, as listed below.
> - capacity is INF
> - maxCapacity is NaN
> - absoluteMaxCapacity is NaN
> Attached the YARN scheduler response json.
> cc: [~sunilg] [~wangda]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10456) RM PartitionQueueMetrics records are named QueueMetrics in Simon metrics registry

2020-10-16 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215416#comment-17215416
 ] 

Eric Payne commented on YARN-10456:
---

Both {{CSQueueMetrics}} and {{FSQueueMetrics}} are also children of 
{{QueueMetrics}}, and when those classes output {{QueueMetrics}} records, their 
RECORDNAME is {{QueueMetrics}}. These two child classes are mutually exclusive 
since Capacity Scheduler and Fair Scheduler will never be running at the same 
time. However, what makes {{PartitionQueueMetrics}} different is that it will 
be writing out records intermingled with {{QueueMetrics}}. If they both have 
the same RECORDNAME, it confuses the Simon reader.

> RM PartitionQueueMetrics records are named QueueMetrics in Simon metrics 
> registry
> -
>
> Key: YARN-10456
> URL: https://issues.apache.org/jira/browse/YARN-10456
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.3.0, 3.2.1, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
>
> Several queue metrics (such as AppsRunning, PendingContainers, etc.) stopped 
> working after we upgraded to 2.10.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10456) RM PartitionQueueMetrics records are named QueueMetrics in Simon metrics registry

2020-10-09 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211439#comment-17211439
 ] 

Eric Payne commented on YARN-10456:
---

We use hadoop-metrics2.properties to set up Simon metrics aggregation.

The format of the output of the RM aggregation metrics begins with:
{noformat}
EPOCH CONTEXT.RECORDNAME ...
{noformat}
Here, {{CONTEXT=yarn}} and {{RECORDNAME=QueueMetrics}} for both 
{{QueueMetrics}} and {{PartitionQueueMetrics}}. This is incorrect: it confuses 
the Simon aggregator and causes the numbers for several metrics to be wrong.

The {{RECORDNAME}} is coming from the {{MetricsInfo}} object in the 
{{MetricsRegistry}} in each {{*Metrics}} class. In this case, 
{{PartitionQueueMetrics}} is a child of the {{QueueMetrics}} class, and when 
{{PartitionQueueMetrics}} is constructed, the {{MetricsInfo}} name for 
{{PartitionQueueMetrics}} is assigned "{{QueueMetrics}}" instead of 
"{{PartitionQueueMetrics}}".

> RM PartitionQueueMetrics records are named QueueMetrics in Simon metrics 
> registry
> -
>
> Key: YARN-10456
> URL: https://issues.apache.org/jira/browse/YARN-10456
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.3.0, 3.2.1, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
>
> Several queue metrics (such as AppsRunning, PendingContainers, etc.) stopped 
> working after we upgraded to 2.10.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10456) RM PartitionQueueMetrics records are named QueueMetrics in Simon metrics registry

2020-10-09 Thread Eric Payne (Jira)
Eric Payne created YARN-10456:
-

 Summary: RM PartitionQueueMetrics records are named QueueMetrics 
in Simon metrics registry
 Key: YARN-10456
 URL: https://issues.apache.org/jira/browse/YARN-10456
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.10.1, 3.1.4, 3.2.1, 3.3.0
Reporter: Eric Payne
Assignee: Eric Payne


Several queue metrics (such as AppsRunning, PendingContainers, etc.) stopped 
working after we upgraded to 2.10.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10451) RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.

2020-10-06 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209004#comment-17209004
 ] 

Eric Payne commented on YARN-10451:
---

I attached the branch-3.2 patch. In trunk and branch-3, 
{{CustomResourceTypesConfigurationProvider#initResourceTypes}} allows code to 
add a resource type. Prior to this, that functionality was in 
{{TestResourceUtils#addNewTypesToResources}}.

> RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.
> ---
>
> Key: YARN-10451
> URL: https://issues.apache.org/jira/browse/YARN-10451
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: YARN-10451.001.patch, YARN-10451.002.patch, 
> YARN-10451.003.patch, YARN-10451.branch-3.2.003.patch
>
>
> The NodesPage in the RM (v1) UI will NPE when the {{yarn.resource-types}} 
> property defines {{yarn.io}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10451) RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.

2020-10-06 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10451:
--
Attachment: YARN-10451.branch-3.2.003.patch

> RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.
> ---
>
> Key: YARN-10451
> URL: https://issues.apache.org/jira/browse/YARN-10451
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: YARN-10451.001.patch, YARN-10451.002.patch, 
> YARN-10451.003.patch, YARN-10451.branch-3.2.003.patch
>
>
> The NodesPage in the RM (v1) UI will NPE when the {{yarn.resource-types}} 
> property defines {{yarn.io}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10451) RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.

2020-10-06 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208794#comment-17208794
 ] 

Eric Payne commented on YARN-10451:
---

Thanks for the review, [~Jim_Brennan].

bq. Why did you need the test to use dominant resource calculator?  Was the 
test not failing in trunk unless that was defined?

Yes. YARN-7789 added a check in the CapacityScheduler#initScheduler that 
prevents the addition of a third resource (GPU, in this case) if 
DefaultResourceCalculator is used. This was in 3.x but not 2.x.

Also, I think I will need a branch-2.10 patch because the test utils are 
different.

> RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.
> ---
>
> Key: YARN-10451
> URL: https://issues.apache.org/jira/browse/YARN-10451
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: YARN-10451.001.patch, YARN-10451.002.patch, 
> YARN-10451.003.patch
>
>
> The NodesPage in the RM (v1) UI will NPE when the {{yarn.resource-types}} 
> property defines {{yarn.io}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10451) RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.

2020-10-05 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208308#comment-17208308
 ] 

Eric Payne commented on YARN-10451:
---

Thanks [~Jim_Brennan] and [~sunilg] for the reviews.

I have attached version 003 of the patch. I had to add some more interfaces for 
the mock RM/CS/etc. utilities in order to cause the mocked scheduler to use the 
dominant resource calculator. I tried several different ways to reach into the 
mockRM after it was created and change the calculator, but I couldn't make that 
work.

> RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.
> ---
>
> Key: YARN-10451
> URL: https://issues.apache.org/jira/browse/YARN-10451
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: YARN-10451.001.patch, YARN-10451.002.patch, 
> YARN-10451.003.patch
>
>
> The NodesPage in the RM (v1) UI will NPE when the {{yarn.resource-types}} 
> property defines {{yarn.io}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10451) RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.

2020-10-05 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10451:
--
Attachment: YARN-10451.003.patch

> RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.
> ---
>
> Key: YARN-10451
> URL: https://issues.apache.org/jira/browse/YARN-10451
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: YARN-10451.001.patch, YARN-10451.002.patch, 
> YARN-10451.003.patch
>
>
> The NodesPage in the RM (v1) UI will NPE when the {{yarn.resource-types}} 
> property defines {{yarn.io}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10451) RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.

2020-10-02 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10451:
--
Attachment: YARN-10451.002.patch

> RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.
> ---
>
> Key: YARN-10451
> URL: https://issues.apache.org/jira/browse/YARN-10451
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: YARN-10451.001.patch, YARN-10451.002.patch
>
>
> The NodesPage in the RM (v1) UI will NPE when the {{yarn.resource-types}} 
> property defines {{yarn.io}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10451) RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.

2020-10-02 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10451:
--
Attachment: YARN-10451.001.patch

> RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.
> ---
>
> Key: YARN-10451
> URL: https://issues.apache.org/jira/browse/YARN-10451
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: YARN-10451.001.patch
>
>
> The NodesPage in the RM (v1) UI will NPE when the {{yarn.resource-types}} 
> property defines {{yarn.io}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10451) RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.

2020-10-02 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17206398#comment-17206398
 ] 

Eric Payne commented on YARN-10451:
---

{code:title=NodePages#NodesBlock}
if (gpuIndex != null) {
  usedGPUs = info.getUsedResource().getResource()
  .getResourceValue(ResourceInformation.GPU_URI);
  availableGPUs = info.getAvailableResource().getResource()
  .getResourceValue(ResourceInformation.GPU_URI);
}
{code}
If yarn.io/gpu is defined and a node is either decommissioned, lost, unhealthy, 
or shut down, {{NodeInfo#getUsedResource}} and 
{{NodeInfo#getAvailableResource}} can return null.
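
A guard of roughly this shape would avoid the NPE (a sketch only, not 
necessarily the committed fix):
{code:java}
// Skip the GPU columns when the node reports no resource snapshot, as
// decommissioned, lost, unhealthy, or shut-down nodes do.
if (gpuIndex != null && info.getUsedResource() != null
    && info.getAvailableResource() != null) {
  usedGPUs = info.getUsedResource().getResource()
      .getResourceValue(ResourceInformation.GPU_URI);
  availableGPUs = info.getAvailableResource().getResource()
      .getResourceValue(ResourceInformation.GPU_URI);
}
{code}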

> RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.
> ---
>
> Key: YARN-10451
> URL: https://issues.apache.org/jira/browse/YARN-10451
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
>
> The NodesPage in the RM (v1) UI will NPE when the {{yarn.resource-types}} 
> property defines {{yarn.io}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10451) RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.

2020-10-02 Thread Eric Payne (Jira)
Eric Payne created YARN-10451:
-

 Summary: RM (v1) UI NodesPage can NPE when yarn.io/gpu resource 
type is defined.
 Key: YARN-10451
 URL: https://issues.apache.org/jira/browse/YARN-10451
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Eric Payne


The NodesPage in the RM (v1) UI will NPE when the {{yarn.resource-types}} 
property defines {{yarn.io}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10451) RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.

2020-10-02 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne reassigned YARN-10451:
-

Assignee: Eric Payne

> RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.
> ---
>
> Key: YARN-10451
> URL: https://issues.apache.org/jira/browse/YARN-10451
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
>
> The NodesPage in the RM (v1) UI will NPE when the {{yarn.resource-types}} 
> property defines {{yarn.io}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-28 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203473#comment-17203473
 ] 

Eric Payne commented on YARN-9809:
--

I have committed this to branch-3.3 and branch-3.2. It looks like there is some 
additional work necessary if we want this to be backported to 3.1. I, for one, 
don't think that is necessary, but please comment if you disagree.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809-branch-3.2.007.patch, 
> YARN-9809-branch-3.2.008.patch, YARN-9809-branch-3.2.009.patch, 
> YARN-9809.001.patch, YARN-9809.002.patch, YARN-9809.003.patch, 
> YARN-9809.004.patch, YARN-9809.005.patch, YARN-9809.006.patch, 
> YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-28 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203346#comment-17203346
 ] 

Eric Payne commented on YARN-9809:
--

The latest branch-3.2 precommit build looks fine. The unit test failures are 
the same ones that are failing on branch-3.2 without the patch _except_ 
{{TestRaceWhenRelogin}}, which is not failing for me in my local build with or 
without the patch.

+1. I will commit this today.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809-branch-3.2.007.patch, 
> YARN-9809-branch-3.2.008.patch, YARN-9809-branch-3.2.009.patch, 
> YARN-9809.001.patch, YARN-9809.002.patch, YARN-9809.003.patch, 
> YARN-9809.004.patch, YARN-9809.005.patch, YARN-9809.006.patch, 
> YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-25 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202390#comment-17202390
 ] 

Eric Payne commented on YARN-9809:
--

Version 009 LGTM. +1

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809-branch-3.2.007.patch, 
> YARN-9809-branch-3.2.008.patch, YARN-9809-branch-3.2.009.patch, 
> YARN-9809.001.patch, YARN-9809.002.patch, YARN-9809.003.patch, 
> YARN-9809.004.patch, YARN-9809.005.patch, YARN-9809.006.patch, 
> YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-25 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202388#comment-17202388
 ] 

Eric Payne commented on YARN-9809:
--

Thanks a lot, [~ebadger] for the backport, and thank you [~Jim_Brennan] for the 
great reviews.

I have verified that the following unit tests are also failing in branch-3.2:
{noformat}
TestYarnConfigurationFields
TestZKConfigurationStore
TestSystemMetricsPublisherForV2
TestFSSchedulerConfigurationStore
TestCombinedSystemMetricsPublisher
{noformat}

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809-branch-3.2.007.patch, 
> YARN-9809-branch-3.2.008.patch, YARN-9809-branch-3.2.009.patch, 
> YARN-9809.001.patch, YARN-9809.002.patch, YARN-9809.003.patch, 
> YARN-9809.004.patch, YARN-9809.005.patch, YARN-9809.006.patch, 
> YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-23 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201061#comment-17201061
 ] 

Eric Payne commented on YARN-9809:
--

Thanks a lot [~ebadger] for putting up the 3.2 backport patch. I'm still going 
through it, but I had one question after my first pass:
{code:java|title=RMNodeImpl#AddNodeTransition#transition}
RMNodeStatusEvent rmNodeStatusEvent =
new RMNodeStatusEvent(nodeId, nodeStatus);

NodeHealthStatus nodeHealthStatus =
updateRMNodeFromStatusEvents(rmNode, rmNodeStatusEvent);

if (nodeHealthStatus.getIsNodeHealthy()) {
{code}

Do we run the risk of {{nodeHealthStatus}} being null?
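
For reference, a defensive variant might look like the following sketch; 
whether it is needed depends on whether a status event can arrive without a 
health status:
{code:java}
NodeHealthStatus nodeHealthStatus =
    updateRMNodeFromStatusEvents(rmNode, rmNodeStatusEvent);

// Null here would mean the NM registered without a health report; treating
// that as healthy preserves the pre-registration behavior.
if (nodeHealthStatus == null || nodeHealthStatus.getIsNodeHealthy()) {
  // ... proceed as before
}
{code}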

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809-branch-3.2.007.patch, YARN-9809.001.patch, 
> YARN-9809.002.patch, YARN-9809.003.patch, YARN-9809.004.patch, 
> YARN-9809.005.patch, YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-950) Ability to limit or avoid aggregating logs beyond a certain size

2020-09-11 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194507#comment-17194507
 ] 

Eric Payne commented on YARN-950:
-

[~adam.antal], are you still planning on working on this JIRA?

bq. Ran into another case where a user filled a disk with a large 
stdout/stderr, and the NM took forever to recover the disk
bq. +1 on the idea of limiting the size of log aggregation. We should have 
someway to truncate (front & tail) the log contents if it is really large.

It seems that there are a couple of requirements for this feature.
# Prevent logs for any container from becoming larger than a configurable size.
# Truncate everything from the middle of large log files, leaving a 
configurable amount of the head and the tail.

Internally, we have implemented the former by allowing system admins to specify 
a configurable max log size per container.


> Ability to limit or avoid aggregating logs beyond a certain size
> 
>
> Key: YARN-950
> URL: https://issues.apache.org/jira/browse/YARN-950
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager
>Affects Versions: 0.23.9, 2.6.0
>Reporter: Jason Darrell Lowe
>Assignee: Adam Antal
>Priority: Major
>
> It would be nice if ops could configure a cluster such that any container log 
> beyond a configured size would either only have a portion of the log 
> aggregated or not aggregated at all.  This would help speed up the recovery 
> path for cases where a container creates an enormous log and fills a disk, as 
> currently it tries to aggregate the entire, enormous log rather than only 
> aggregating a small portion or simply deleting it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10390) LeafQueue: retain user limits cache across assignContainers() calls

2020-09-11 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194279#comment-17194279
 ] 

Eric Payne commented on YARN-10390:
---

Thanks [~samkhan] for this important performance enhancement.
+1. LGTM. I'll be committing this shortly.

> LeafQueue: retain user limits cache across assignContainers() calls
> ---
>
> Key: YARN-10390
> URL: https://issues.apache.org/jira/browse/YARN-10390
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: YARN-10390-branch-3.1.002.patch, 
> YARN-10390-branch-3.2.002.patch, YARN-10390.002.patch, user limit caching 
> profile.pdf
>
>
> Currently, user limits are cached locally in the leafQueue.assignContainers 
> call to avoid repeating some steps. This cache can be retained across the calls.
> Will put up a PR soon. Profiling was done using the proposed changes in 
> TestCapacitySchedulerPerf.
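
As a conceptual sketch of the caching idea (names and structure are 
illustrative, not the actual LeafQueue code), the computed limit can be kept 
keyed by user and stamped with a state version, and recomputed only when that 
version has moved:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

class UserLimitCacheSketch {
  private static final class CachedLimit {
    final long version; // state version when the limit was computed
    final long limit;   // cached user limit (one dimension, simplified)
    CachedLimit(long version, long limit) {
      this.version = version;
      this.limit = limit;
    }
  }

  private final Map<String, CachedLimit> byUser = new ConcurrentHashMap<>();
  private final AtomicLong stateVersion = new AtomicLong();

  /** Bump on events that change limits: app added/removed, node added, etc. */
  void invalidate() {
    stateVersion.incrementAndGet();
  }

  /** Return the cached limit for a user, recomputing only when stale. */
  long userLimit(String user, LongSupplier compute) {
    long v = stateVersion.get();
    CachedLimit c = byUser.get(user);
    if (c == null || c.version != v) {
      c = new CachedLimit(v, compute.getAsLong());
      byUser.put(user, c);
    }
    return c.limit;
  }
}
{code}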



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10390) LeafQueue: retain user limits cache across assignContainers() calls

2020-09-10 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10390:
--
Attachment: YARN-10390-branch-3.1.002.patch

> LeafQueue: retain user limits cache across assignContainers() calls
> ---
>
> Key: YARN-10390
> URL: https://issues.apache.org/jira/browse/YARN-10390
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: YARN-10390-branch-3.1.002.patch, 
> YARN-10390-branch-3.2.002.patch, YARN-10390.002.patch, user limit caching 
> profile.pdf
>
>
> Currently, user limits are cached locally in the leafQueue.assignContainers() 
> call to avoid repeating some steps. This cache can be retained across calls.
> Will put up a PR soon. Profiling was done using the proposed changes in 
> TestCapacitySchedulerPerf.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10390) LeafQueue: retain user limits cache across assignContainers() calls

2020-09-10 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17193841#comment-17193841
 ] 

Eric Payne commented on YARN-10390:
---

Uploaded branch-3.2 patch because of merge conflicts in unit tests.

> LeafQueue: retain user limits cache across assignContainers() calls
> ---
>
> Key: YARN-10390
> URL: https://issues.apache.org/jira/browse/YARN-10390
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: YARN-10390-branch-3.2.002.patch, YARN-10390.002.patch, 
> user limit caching profile.pdf
>
>
> Currently, user limits are cached locally in the leafQueue.assignContainers() 
> call to avoid repeating some steps. This cache can be retained across calls.
> Will put up a PR soon. Profiling was done using the proposed changes in 
> TestCapacitySchedulerPerf.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10390) LeafQueue: retain user limits cache across assignContainers() calls

2020-09-10 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10390:
--
Attachment: YARN-10390-branch-3.2.002.patch

> LeafQueue: retain user limits cache across assignContainers() calls
> ---
>
> Key: YARN-10390
> URL: https://issues.apache.org/jira/browse/YARN-10390
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: YARN-10390-branch-3.2.002.patch, YARN-10390.002.patch, 
> user limit caching profile.pdf
>
>
> Currently, user limits are cached locally in the leafQueue.assignContainers() 
> call to avoid repeating some steps. This cache can be retained across calls.
> Will put up a PR soon. Profiling was done using the proposed changes in 
> TestCapacitySchedulerPerf.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10390) LeafQueue: retain user limits cache across assignContainers() calls

2020-09-10 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17193833#comment-17193833
 ] 

Eric Payne commented on YARN-10390:
---

Unit test failure for {{TestFairSchedulerPreemption}} is the same as YARN-9333.

> LeafQueue: retain user limits cache across assignContainers() calls
> ---
>
> Key: YARN-10390
> URL: https://issues.apache.org/jira/browse/YARN-10390
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: YARN-10390.002.patch, user limit caching profile.pdf
>
>
> Currently, user limits are cached locally in the leafQueue.assignContainers() 
> call to avoid repeating some steps. This cache can be retained across calls.
> Will put up a PR soon. Profiling was done using the proposed changes in 
> TestCapacitySchedulerPerf.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


