[jira] [Commented] (YARN-9019) Ratio calculation of ResourceCalculator implementations could return NaN

2022-09-06 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600907#comment-17600907
 ] 

Eric Payne commented on YARN-9019:
--

I backported this to branch-3.2 and branch-2.10.

> Ratio calculation of ResourceCalculator implementations could return NaN
> 
>
> Key: YARN-9019
> URL: https://issues.apache.org/jira/browse/YARN-9019
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.3.0, 2.10.3, 3.2.5
>
> Attachments: YARN-9019.001.patch
>
>
> I found that ResourceCalculator.ratio (with implementations 
> DefaultResourceCalculator and DominantResourceCalculator) can produce NaN 
> (Not-A-Number) as a result.
> This is because [IEEE 754|http://grouper.ieee.org/groups/754/] defines {{1.0 
> / 0.0}} as Infinity, {{-1.0 / 0.0}} as -Infinity, and {{0.0 / 0.0}} as NaN; 
> see here: [https://stackoverflow.com/a/14138032/1106893] 
> I think it is very dangerous to rely on NaN being returned from ratio 
> calculations, and this could have side effects.
> When computing the ratio, if both the numerator and the denominator are 
> zero, I think we should return 0 instead.
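The guarded behavior proposed above can be sketched as follows (an illustrative standalone helper, not the actual YARN-9019 patch; the class and method names are ours):

```java
public class SafeRatio {
    // Ratio that treats 0/0 as 0 instead of letting IEEE 754 produce NaN,
    // which would otherwise propagate silently through scheduler arithmetic.
    public static float ratio(float numerator, float denominator) {
        if (numerator == 0.0f && denominator == 0.0f) {
            return 0.0f;
        }
        return numerator / denominator;
    }

    public static void main(String[] args) {
        assert Float.isNaN(0.0f / 0.0f);   // the raw division yields NaN
        assert ratio(0.0f, 0.0f) == 0.0f;  // the guarded version returns 0
        assert ratio(1.0f, 2.0f) == 0.5f;  // normal cases are unchanged
        System.out.println("ok");
    }
}
```

Note that only the 0/0 case is guarded here; a nonzero numerator over a zero denominator still yields Infinity, matching the behavior described in the issue.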



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9019) Ratio calculation of ResourceCalculator implementations could return NaN

2022-09-06 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-9019:
-
Fix Version/s: 2.10.3
   3.2.5

> Ratio calculation of ResourceCalculator implementations could return NaN
> 
>
> Key: YARN-9019
> URL: https://issues.apache.org/jira/browse/YARN-9019
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.3.0, 2.10.3, 3.2.5
>
> Attachments: YARN-9019.001.patch
>
>
> I found that ResourceCalculator.ratio (with implementations 
> DefaultResourceCalculator and DominantResourceCalculator) can produce NaN 
> (Not-A-Number) as a result.
> This is because [IEEE 754|http://grouper.ieee.org/groups/754/] defines {{1.0 
> / 0.0}} as Infinity, {{-1.0 / 0.0}} as -Infinity, and {{0.0 / 0.0}} as NaN; 
> see here: [https://stackoverflow.com/a/14138032/1106893] 
> I think it is very dangerous to rely on NaN being returned from ratio 
> calculations, and this could have side effects.
> When computing the ratio, if both the numerator and the denominator are 
> zero, I think we should return 0 instead.






[jira] [Commented] (YARN-9019) Ratio calculation of ResourceCalculator implementations could return NaN

2022-09-01 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599133#comment-17599133
 ] 

Eric Payne commented on YARN-9019:
--

I'd like to backport this to 3.2 and 2.10.

> Ratio calculation of ResourceCalculator implementations could return NaN
> 
>
> Key: YARN-9019
> URL: https://issues.apache.org/jira/browse/YARN-9019
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9019.001.patch
>
>
> I found that ResourceCalculator.ratio (with implementations 
> DefaultResourceCalculator and DominantResourceCalculator) can produce NaN 
> (Not-A-Number) as a result.
> This is because [IEEE 754|http://grouper.ieee.org/groups/754/] defines {{1.0 
> / 0.0}} as Infinity, {{-1.0 / 0.0}} as -Infinity, and {{0.0 / 0.0}} as NaN; 
> see here: [https://stackoverflow.com/a/14138032/1106893] 
> I think it is very dangerous to rely on NaN being returned from ratio 
> calculations, and this could have side effects.
> When computing the ratio, if both the numerator and the denominator are 
> zero, I think we should return 0 instead.






[jira] [Updated] (YARN-10997) Revisit allocation and reservation logging

2022-07-08 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10997:
--
Fix Version/s: 3.2.4
   3.3.9
   2.10.3

> Revisit allocation and reservation logging
> --
>
> Key: YARN-10997
> URL: https://issues.apache.org/jira/browse/YARN-10997
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.9, 2.10.3
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Accepted allocation proposal and reserved container logs are two exceedingly 
> frequent events. Numerous users reported that these log entries quickly 
> filled the logs on a busy cluster and that they saw these entries only as 
> noise.
> It would be worthwhile to reduce the log level of these entries to DEBUG.
> Examples:
> {noformat}
> 2021-10-30 02:28:57,409 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Allocation proposal accepted
> 2021-10-30 02:28:57,439 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_1635478503131_0069_01_78, on node=host: 
> node:8041 #containers=1 available= used= vCores:1> with resource=
> {noformat}
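The proposed demotion can be sketched like this (illustrative only; YARN itself logs through SLF4J, but java.util.logging keeps the example self-contained, with FINE standing in for DEBUG):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class AllocationLogging {
    private static final Logger LOG =
        Logger.getLogger(AllocationLogging.class.getName());

    // Returns true only if the message was actually emitted. The guard also
    // avoids building the message string when debug logging is off.
    static boolean onProposalAccepted(String containerId) {
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("Allocation proposal accepted for " + containerId);
            return true;
        }
        return false; // at the default INFO level, nothing is logged
    }

    public static void main(String[] args) {
        // This event fires for every container on a busy cluster, so at the
        // default level the per-container line is now suppressed entirely.
        boolean logged =
            onProposalAccepted("container_1635478503131_0069_01_78");
        System.out.println("logged=" + logged);
    }
}
```

The design point is that demoting to DEBUG keeps the information available for troubleshooting (by raising the logger's verbosity) without paying the log-volume cost in normal operation.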






[jira] [Commented] (YARN-10997) Revisit allocation and reservation logging

2022-07-07 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563977#comment-17563977
 ] 

Eric Payne commented on YARN-10997:
---

If it's okay with everyone, I'm going to backport this through previous 
branches to 2.10.

> Revisit allocation and reservation logging
> --
>
> Key: YARN-10997
> URL: https://issues.apache.org/jira/browse/YARN-10997
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Accepted allocation proposal and reserved container logs are two exceedingly 
> frequent events. Numerous users reported that these log entries quickly 
> filled the logs on a busy cluster and that they saw these entries only as 
> noise.
> It would be worthwhile to reduce the log level of these entries to DEBUG.
> Examples:
> {noformat}
> 2021-10-30 02:28:57,409 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Allocation proposal accepted
> 2021-10-30 02:28:57,439 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_1635478503131_0069_01_78, on node=host: 
> node:8041 #containers=1 available= used= vCores:1> with resource=
> {noformat}






[jira] [Updated] (YARN-11082) Use node label resource as denominator to decide which resource is dominant

2022-03-18 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-11082:
--
Fix Version/s: (was: 3.1.1)

I removed the Fix Version. That field should only be filled in by the committer 
when they resolve the ticket.

> Use node label resource as denominator to decide which resource is dominant
> -
>
> Key: YARN-11082
> URL: https://issues.apache.org/jira/browse/YARN-11082
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.1
>Reporter: Bo Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-11082.001.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We used the cluster resource as the denominator to decide which resource is 
> dominant in AbstractCSQueue#canAssignToThisQueue. However, the nodes in our 
> cluster are configured differently.
> {quote}2021-12-09 10:24:37,069 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application 
> attempt=appattempt_1637412555366_1588993_01 container=null 
> queue=root.a.a1.a2 clusterResource= 
> type=RACK_LOCAL requestedPartition=x
> 2021-12-09 10:24:37,069 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
>  Used resource= exceeded maxResourceLimit of the 
> queue =
> 2021-12-09 10:24:37,069 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
> {quote}
> We can see that even though root.a.a1.a2 used 687/687 vcores, the following 
> check in AbstractCSQueue#canAssignToThisQueue still returns false:
> {quote}Resources.greaterThanOrEqual(resourceCalculator, clusterResource,
> usedExceptKillable, currentLimitResource)
> {quote}
> clusterResource = 
> usedExceptKillable =  
> currentLimitResource = 
> currentLimitResource:
> memory : 3381248/175117312 = 0.01930847362
> vCores : 687/40222 = 0.01708020486
> usedExceptKillable:
> memory : 3384320/175117312 = 0.01932601615
> vCores : 688/40222 = 0.01710506687
> DRF will consider memory the dominant resource and return false in this 
> scenario.
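The dominance arithmetic above can be checked directly with a standalone sketch reproducing the reporter's numbers (the constants are taken from the log excerpt in the description; the method name is ours):

```java
public class DominantShare {
    // Cluster-wide totals and queue usage from the log excerpt above.
    static final double CLUSTER_MEM = 175117312.0, CLUSTER_VCORES = 40222.0;
    static final double USED_MEM = 3384320.0, USED_VCORES = 688.0;

    // DRF's dominant share: the larger of the per-resource usage ratios.
    static double dominantShare(double mem, double vcores) {
        return Math.max(mem / CLUSTER_MEM, vcores / CLUSTER_VCORES);
    }

    public static void main(String[] args) {
        double memShare = USED_MEM / CLUSTER_MEM;         // ~0.01933
        double vcoreShare = USED_VCORES / CLUSTER_VCORES; // ~0.01711
        // Against the whole cluster the memory share is larger, so DRF
        // treats memory as the dominant resource even though the queue's
        // vcores on the label are fully used (687/687). That is exactly why
        // the reporter proposes using the node label resource, not the
        // cluster resource, as the denominator.
        assert memShare > vcoreShare;
        System.out.printf("mem=%.5f vcores=%.5f%n", memShare, vcoreShare);
    }
}
```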






[jira] [Commented] (YARN-8509) Total pending resource calculation in preemption should use user-limit factor instead of minimum-user-limit-percent

2022-02-10 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490464#comment-17490464
 ] 

Eric Payne commented on YARN-8509:
--

I feel that the current implementation is flawed, and my opinion is that this 
JIRA is not required. Can this ticket be closed?

> Total pending resource calculation in preemption should use user-limit factor 
> instead of minimum-user-limit-percent
> ---
>
> Key: YARN-8509
> URL: https://issues.apache.org/jira/browse/YARN-8509
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
>  Labels: capacityscheduler
> Attachments: YARN-8509.001.patch, YARN-8509.002.patch, 
> YARN-8509.003.patch, YARN-8509.004.patch, YARN-8509.005.patch
>
>
> In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate the 
> total pending resource based on user-limit percent and user-limit factor, 
> which caps the pending resource for each user to the minimum of the 
> user-limit pending and the actual pending. This prevents the queue from 
> taking on more pending resource to achieve queue balance after every queue 
> is satisfied with its ideal allocation.
>   
>  We need to change the logic so that a queue's pending resource can go 
> beyond the user limit.






[jira] [Commented] (YARN-10821) User limit is not calculated as per definition for preemption

2022-02-10 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490455#comment-17490455
 ] 

Eric Payne commented on YARN-10821:
---

[~gandras], do you still believe we need this JIRA? My opinion is that user 
limit calculations _are_ correct for preemption.

> User limit is not calculated as per definition for preemption
> -
>
> Key: YARN-10821
> URL: https://issues.apache.org/jira/browse/YARN-10821
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10821.001.patch
>
>
> Minimum user limit percent (MULP) is a soft limit by definition. Preemption 
> uses pending resources to determine the resources needed by a queue, which is 
> calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This 
> method involves headroom calculated by UsersManager#computeUserLimit. 
> However, the pending resources for preemption are limited in an unexpected 
> fashion.
>  * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is 
> calculated first:
> {code:java}
>  float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f,
>  1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1));
> {code}
>  * In UsersManager#computeUserLimit the userLimit is calculated as is 
> (currentCapacity * userLimit)
> {code:java}
>  Resource userLimitResource = Resources.max(resourceCalculator,
>  partitionResource,
>  Resources.divideAndCeil(resourceCalculator, resourceUsed,
>  usersSummedByWeight),
>  Resources.divideAndCeil(resourceCalculator,
>  Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
>  100));
> {code}
> The fewer users occupy the queue, the more pronounced this effect will be 
> in preemption.
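The interaction described above can be illustrated with plain arithmetic, using the formula from the first quoted snippet (a simplified sketch; the class and method names are ours, not the actual UsersManager code):

```java
public class UserLimitSketch {
    // Effective user limit as in the quoted snippet: at least an equal
    // share per active user, and never below the configured percent.
    static float effectiveUserLimit(float userLimitPercent, int activeUsers) {
        return Math.max(userLimitPercent / 100.0f,
                        1.0f / Math.max(activeUsers, 1));
    }

    public static void main(String[] args) {
        // With MULP = 25 and a single active user, the effective limit is
        // the whole queue (1.0), not 0.25. The fewer active users there
        // are, the larger the per-user share, which is why the description
        // says the effect is most pronounced with few users in the queue.
        assert effectiveUserLimit(25f, 1) == 1.0f;
        assert effectiveUserLimit(25f, 4) == 0.25f;
        assert effectiveUserLimit(25f, 10) == 0.25f; // never below MULP
    }
}
```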






[jira] [Updated] (YARN-8222) Fix potential NPE when gets RMApp from RM context

2022-02-09 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-8222:
-
Fix Version/s: 2.10.2

> Fix potential NPE when gets RMApp from RM context
> -
>
> Key: YARN-8222
> URL: https://issues.apache.org/jira/browse/YARN-8222
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.2.0, 3.1.1, 3.0.3, 2.10.2
>
> Attachments: YARN-8222.001.patch
>
>
> Recently we did some performance tests and found two NPE problems when 
> calling rmContext.getRMApps().get(appId).get...
> These NPEs occasionally happened when running performance tests with a large 
> number of fast-finishing applications. We have checked the other places that 
> call rmContext.getRMApps().get(...); most of them have a null check, and 
> some do not need one (the process guarantees that the result will not be 
> null).
> To fix these problems, we can add a null check for the application before 
> getting the attempt from it.
> (1) NPE in RMContainerImpl$FinishedTransition#updateAttemptMetrics
> {noformat}
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.updateAttemptMetrics(RMContainerImpl.java:742)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:715)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:699)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:482)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:64)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.containerCompleted(FiCaSchedulerApp.java:195)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1793)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:2624)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:663)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1514)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:2396)
>         at 
> org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler.handle(SLSCapacityScheduler.java:205)
>         at 
> org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler.handle(SLSCapacityScheduler.java:60)
>         at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>         at java.lang.Thread.run(Thread.java:834)
> {noformat}
> This NPE appears to happen when a node heartbeat is delayed and tries to 
> update attempt metrics for an app that no longer exists.
> Reference code of RMContainerImpl$FinishedTransition#updateAttemptMetrics:
> {code:java}
> private static void updateAttemptMetrics(RMContainerImpl container) {
>   Resource resource = container.getContainer().getResource();
>   RMAppAttempt rmAttempt = container.rmContext.getRMApps()
>   .get(container.getApplicationAttemptId().getApplicationId())
>   .getCurrentAppAttempt();
>   if (rmAttempt != null) {
>  //
>   }
> }
> {code}
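The fix shape described above ("add a null check before getting the attempt") can be demonstrated standalone. The types below are illustrative stand-ins, not the actual RMContext/RMApp classes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NullCheckDemo {
    // Stand-in for rmContext.getRMApps(): app id -> current attempt id.
    static final Map<String, String> apps = new ConcurrentHashMap<>();

    // Buggy shape: chaining a call onto Map.get() NPEs once the app entry
    // has been removed (e.g. a fast-finished app already cleaned up).
    static String attemptUnsafe(String appId) {
        return apps.get(appId).toUpperCase();
    }

    // Fixed shape: check for null before dereferencing.
    static String attemptSafe(String appId) {
        String attempt = apps.get(appId);
        return (attempt == null) ? null : attempt.toUpperCase();
    }

    public static void main(String[] args) {
        apps.put("app_1", "attempt_1");
        assert "ATTEMPT_1".equals(attemptSafe("app_1"));

        apps.remove("app_1");                // app finished and was removed
        assert attemptSafe("app_1") == null; // no NPE after the fix

        boolean npe = false;
        try {
            attemptUnsafe("app_1");
        } catch (NullPointerException e) {
            npe = true;                      // the original code path throws
        }
        assert npe;
        System.out.println("ok");
    }
}
```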
> (2) NPE in SchedulerApplicationAttempt#incNumAllocatedContainers
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.incNumAllocatedContainers(SchedulerApplicationAttempt.java:1268)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:638)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:3589)
> at 
> org.apache.hadoop.yarn.sls.scheduler.SLSCap

[jira] [Updated] (YARN-10824) Title not set for JHS and NM webpages

2021-12-22 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10824:
--
Fix Version/s: 3.4.0
   2.10.2
   3.2.3
   3.3.2

> Title not set for JHS and NM webpages
> -
>
> Key: YARN-10824
> URL: https://issues.apache.org/jira/browse/YARN-10824
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rajshree Mishra
>Assignee: Bilwa S T
>Priority: Major
> Fix For: 3.4.0, 2.10.2, 3.2.3, 3.3.2
>
> Attachments: JHS URL.jpg, NM URL.jpg, YARN-10824.001.patch, 
> YARN-10824.002.patch
>
>
> The following issue was reported by one of our internal web security check 
> tools: 
> Passing a title to the jobHistoryServer(jhs) or Nodemanager(nm) pages using a 
> url similar to:
> [https://[hostname]:[jhs_port]/jobhistory/about?title=12345%27%22]
> or 
> [https://[hostname]:[nm_port]/node?title=12345]
> sets the page title to the value provided.
> [Image attached]






[jira] [Commented] (YARN-11021) Define Hadoop YARN term "vcore"

2021-12-21 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463460#comment-17463460
 ] 

Eric Payne commented on YARN-11021:
---

https://hadoop.apache.org/docs/r3.3.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
This defines the property {{yarn.nodemanager.resource.cpu-vcores}}. This 
property must be placed in the configuration of each node manager. It 
advertises to the resource manager the number of virtual cores that can be used 
on the node. This value is going to vary depending on your use cases, but 
usually it is set to (#CPUs X 10).

> Define Hadoop YARN term "vcore"
> ---
>
> Key: YARN-11021
> URL: https://issues.apache.org/jira/browse/YARN-11021
> Project: Hadoop YARN
>  Issue Type: Wish
>  Components: docs, documentation
>Affects Versions: 3.3.1
>Reporter: Aleksey Tsalolikhin
>Priority: Major
>
> Hello,
> This is a request to define the Hadoop YARN term "vCore". It's clearly 
> different from a vCPU, i.e. the number of virtual CPUs (or CPU cores) a 
> system has as per /proc/cpuinfo. What is a YARN vcore, please?
> {*}Background{*}: I am running Hadoop YARN on 24 AWS EC2 instances from the 
> R5 family (memory-intensive) with the instance size of 24 XLarge (96 vCPUs 
> and 768 GB RAM each), plus the cluster master.
> I've launched a Spark application with the following spark-submit parameters:
> {{    --executor-memory 224G}}
> {{    --conf spark.executor.memoryOverhead=23901M}}
> {{    --executor-cores 32}}
> That sets a ratio of about 250 GB of RAM (combined) to 32 vCPUs per executor; 
> I have Spark dynamic resource allocation enabled, so I expect to see three 
> executors per instance, and that's how it turns out.
> 24 nodes x 3 executors per node = 72 executors
> Plus the Application Master running on the Master node makes 73 executors.
> This matches the "73 allocated" I see in "yarn top" output in the 
> "Containers" line:
> {{    YARN top - 11:03:57, up 0d, 18:9, 1 active users, queue(s): root}}
> {{    NodeManager(s): 24 total, 24 active, 0 unhealthy, 44 decommissioned, 0 
> lost, 0 rebooted}}
> {{    Queue(s) Applications: 1 running, 1 submitted, 0 pending, 0 completed, 
> 0 killed, 0 failed}}
> {{    Queue(s) Mem(GB): 183 available, 17809 allocated, 69008 pending, 247 
> reserved}}
> {{    Queue(s) VCores: 2230 available, 73 allocated, 279 pending, 1 reserved}}
> {{    Queue(s) Containers: 73 allocated, 279 pending, 1 reserved}}
> Most of the memory is allocated, which is as expected.
> But why does the "Queue(s) VCores" line say "73 allocated"?
> Looks like 1 VCore = 32 vCPUs?
> I looked in /etc/hadoop/conf/yarn-site.xml on one of the 24XL task
> instances with 96 vCPUs to double check how many virtual CPUs YARN thinks
> the node has, and it is 96 as expected:
> {{  }}
> {{    yarn.nodemanager.resource.cpu-vcores}}
> {{    96}}
> {{  }}
> I looked through all the Hadoop YARN documentation linked from 
> https://hadoop.apache.org/docs/stable/index.html looking for a definition of 
> a Hadoop YARN vCore and I couldn't find one.
> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
>  uses "virtual cores" and "computation based resource" when talking about 
> vCores.
> What is a Hadoop YARN vCore?  How does it relate to virtual CPUs I see in 
> e.g., /proc/cpuinfo on Linux?
> There are many mentions of "vcore" in Hadoop YARN documentation; could we 
> please add a definition of this term?
> Thanks,
> Aleksey
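The figures quoted in the report can be cross-checked with simple arithmetic. The inference drawn in the comments below the assertion (that each container was charged one vcore) is ours, not stated in the thread:

```java
public class VcoreAccounting {
    // Figures quoted in the description above.
    static final int NODES = 24, VCORES_PER_NODE = 96;
    static final int AVAILABLE = 2230, ALLOCATED = 73, RESERVED = 1;

    static int clusterTotal() {
        return NODES * VCORES_PER_NODE;
    }

    public static void main(String[] args) {
        // 24 nodes x 96 vcores = 2304, and 2230 + 73 + 1 = 2304 as well:
        // the totals balance exactly if each of the 73 containers was
        // charged a single vcore. YARN accounts the vcores *requested* for
        // each container, which need not match the CPUs the process
        // actually uses; that would explain why 73 executors with 32 cores
        // each still show as only 73 allocated vcores.
        assert clusterTotal() == AVAILABLE + ALLOCATED + RESERVED;
        System.out.println("total vcores = " + clusterTotal());
    }
}
```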






[jira] [Assigned] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract

2021-12-21 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne reassigned YARN-10178:
-

Assignee: Andras Gyori  (was: Qi Zhu)

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract
> --
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch, 
> YARN-10178.006.patch, YARN-10178.branch-2.10.001.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> Java 8's Arrays.sort uses the TimSort algorithm by default, and TimSort 
> requires the comparator to satisfy:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y && y > z  implies  x > z
> 3. x.compareTo(y) == 0  implies  sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the comparator does not satisfy these requirements, TimSort will throw 
> 'java.lang.IllegalArgumentException'.
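A minimal standalone reproduction of that failure mode (illustrative only: random comparison results stand in for the queue usage fields being mutated concurrently by ResourceCommitterService while the sort runs):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class ContractDemo {
    // Sort a large list with a comparator whose answers are not
    // self-consistent; TimSort detects the contract violation during its
    // merge phase and throws IllegalArgumentException.
    static boolean sortWithUnstableComparator() {
        List<Integer> xs = new ArrayList<>();
        for (int i = 0; i < 50_000; i++) {
            xs.add(i);
        }
        Comparator<Integer> unstable =
            (a, b) -> ThreadLocalRandom.current().nextInt(3) - 1;
        try {
            xs.sort(unstable);
            return false; // in practice TimSort almost always catches it
        } catch (IllegalArgumentException expected) {
            // "Comparison method violates its general contract!"
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("violation detected: "
            + sortWithUnstableComparator());
        // The usual remedy is to compare on an immutable snapshot of the
        // queue usages taken before the sort, so the ordering cannot change
        // while TimSort is running.
    }
}
```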
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we 
> can see that the Capacity Scheduler compares queues using these resource 
> usage fields:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In the Capacity Scheduler, the global scheduler's AsyncThread uses 
> PriorityUtilizationQueueOrderingPolicy to choose a queue to assign a 
> container to, constructs a CSAssignment, and uses the 
> submitResourceCommitRequest function to add the CSAssignment to the backlog.
> ResourceCommitterService will then tryCommit this CSAssignment; looking at 
> the tryCommit function, we can see that it updates the queue resource usage:
> {code:java}
> public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
> boolean updatePending) {
>   long commitStart = System.nanoTime();
>   ResourceCommitRequest request =
>   (ResourceCommitRequest) r;
>  
>   ...
>   boolean isSuccess = false;
>   if (attemptId != null) {
> FiCaSchedulerApp app = getApplicationAttempt(attemptId);
> // Required sanity check for attemptId - when async-scheduling enabled,
> // proposal might be outdated if AM failover just finished
> // and proposal queue was not be consumed in time
> if (app != null && attemptId.equals(app.getApplicationAttemptId())) {
>   if (app.accept(cluster, request, updatePending)
>   && app

[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract

2021-12-21 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463335#comment-17463335
 ] 

Eric Payne commented on YARN-10178:
---

[~gandras], I will commit it today. Thanks!

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract
> --
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch, 
> YARN-10178.006.patch, YARN-10178.branch-2.10.001.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> In Java 8, Arrays.sort uses the TimSort algorithm by default, and TimSort 
> requires the elements' comparison method to obey the Comparable contract:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y and y > z  -->  x > z
> 3. x.compareTo(y) == 0  -->  sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the array elements do not satisfy this contract, TimSort will throw 
> 'java.lang.IllegalArgumentException'.
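The contract above breaks down when the comparison keys mutate between calls. A minimal hypothetical sketch (names invented for illustration, not from the patch): a comparator over a mutable "usage" field, like the queue capacities this policy compares. If another thread updates the field mid-sort, requirement 1 (antisymmetry) no longer holds, which is what allows TimSort to throw.

```java
import java.util.Comparator;

// Hypothetical sketch: a comparator reading a mutable field. If the field
// changes between two comparisons, sgn(compare(a,b)) == -sgn(compare(b,a))
// is violated, which is what TimSort's contract check can detect.
class MutableUsage {
    double usage; // e.g. a queue's usedCapacity, updated by a committer thread

    MutableUsage(double usage) {
        this.usage = usage;
    }

    static final Comparator<MutableUsage> BY_USAGE =
        (a, b) -> Double.compare(a.usage, b.usage);

    public static void main(String[] args) {
        MutableUsage q1 = new MutableUsage(0.2);
        MutableUsage q2 = new MutableUsage(0.5);
        int first = BY_USAGE.compare(q1, q2);   // negative: q1 sorts before q2
        q2.usage = 0.1;                         // concurrent update mid-sort
        int second = BY_USAGE.compare(q2, q1);  // also negative: q2 sorts before q1
        // Antisymmetry requires sgn(first) == -sgn(second); here both are negative.
        System.out.println(first + " " + second);
    }
}
```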
> look at PriorityUtilizationQueueOrderingPolicy.compare function,we will know 
> Capacity Scheduler use this these queue resource usage to compare
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In the Capacity Scheduler, the global scheduler's async thread uses 
> PriorityUtilizationQueueOrderingPolicy to choose which queue to assign a 
> container to, constructs a CSAssignment, and adds it to the backlog via the 
> submitResourceCommitRequest function. ResourceCommitterService will then try to 
> commit this CSAssignment; looking at the tryCommit function, that is where the 
> queue resource usage gets updated:
> {code:java}
> public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
> boolean updatePending) {
>   long commitStart = System.nanoTime();
>   ResourceCommitRequest request =
>   (ResourceCommitRequest) r;
>  
>   ...
>   boolean isSuccess = false;
>   if (attemptId != null) {
> FiCaSchedulerApp app = getApplicationAttempt(attemptId);
> // Required sanity check for attemptId - when async-scheduling enabled,
> // proposal might be outdated if AM failover just finished
> // and proposal queue was not be consumed in time
> if (app != null && attemptId.equals(app.getApplicationAttemptId())) {
>   if (app.accept(clus

[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract

2021-12-20 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462883#comment-17462883
 ] 

Eric Payne commented on YARN-10178:
---

There was no significant difference in most of the performance test suites in 
TestCapacitySchedulerPerf when compared with and without this patch. One suite, 
however, showed _improvement_ with the patch:

With 10 threads enabled, 200 apps on 100% of 50 queues, this test showed an 
average 2x improvement after applying this patch.

Since there don't seem to be any other noticeable differences among the test 
suites, I give my
+1.

Thank you [~tuyu], [~zhuqi], [~gandras], and others for your good work on this 
issue. This is a vital fix for a serious flaw.

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract
> --
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch, 
> YARN-10178.006.patch, YARN-10178.branch-2.10.001.patch
>
>

[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract

2021-12-20 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462835#comment-17462835
 ] 

Eric Payne commented on YARN-10178:
---

[~gandras], I have reviewed the trunk and branch-2.10 patches. The code changes 
look good to me. I have manually tested the branch-2.10 patch multiple times in 
a 100-node cluster and found it to be stable with no thread crashes.

I am currently running the TestCapacitySchedulerPerf performance tests with and 
without the patch, with and without threading enabled. I will post the results 
later today.



> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract
> --
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch, 
> YARN-10178.006.patch, YARN-10178.branch-2.10.001.patch
>
>

[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-12-16 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461023#comment-17461023
 ] 

Eric Payne commented on YARN-10178:
---

[~gandras], Thank you very much for the updated patch. The changes LGTM. 
Testing this manually is difficult, but it seems to be working fine.

It backports cleanly and builds in branch-3.2 and branch-3.3.

However, it is not a clean backport for branch-2.10. Do you want to work on 
that patch as well?

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch, 
> YARN-10178.006.patch
>
>

[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-12-15 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460145#comment-17460145
 ] 

Eric Payne commented on YARN-10178:
---

[~gandras], if you do decide to pick this up, please post a new patch with the 
following changes:
- Remove the unit test {{TestPriorityUtilizationQueueOrderingPolicy.java}}
- Remove the changes supporting the unit test in 
{{PriorityUtilizationQueueOrderingPolicy.java}}:

{code:java}
+  // Just for test performance side effect/regression
+  public Iterator<CSQueue> getOldAssignmentIterator(String partition) {
+// Since partitionToLookAt is a thread local variable, and every time we
+// copy and sort queues, so it's safe for multi-threading environment.
+PriorityUtilizationQueueOrderingPolicy.partitionToLookAt.set(partition);
 List<CSQueue> sortedQueue = new ArrayList<>(queues);
-Collections.sort(sortedQueue, new PriorityQueueComparator());
+Collections.sort(sortedQueue, new PriorityQueueComparatorOld());
 return sortedQueue.iterator();
   }

+  // Just for test performance side effect/regression
+  private class PriorityQueueComparatorOld implements Comparator<CSQueue> {
+
+@Override
+public int compare(CSQueue q1, CSQueue q2) {
+  String p = partitionToLookAt.get();
... (etc)
{code}

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch
>
>

[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-12-14 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459423#comment-17459423
 ] 

Eric Payne commented on YARN-10178:
---

[~gandras], would you like to take this over? I think we are close. Here are a 
couple of action items
- The trunk patch looks good to me except for the unit test.
- I want to spend another day or so looking at the unit tests, but I don't 
think there is an easy way to create a unit test since it involves injecting a 
fault in between {{(IF q1 > q2 AND q2 > q3 THEN)}} and ({{q1 > q3}}).
- I made a "first-pass" at backporting the attached patch to branch-2.10. It 
was not quite straight-forward, but I tested it in our sandbox cluster and it 
seems to be working. I'll do a little more adjusting and testing and then post 
it here for review.

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch
>
>

[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-12-09 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456774#comment-17456774
 ] 

Eric Payne commented on YARN-10178:
---

bq. I believe it is not a problem, as we are not making a copy, but creating 
new objects out of queues,
[~gandras], after looking more closely, I see that you are correct. The 
following copies each object from {{queues}} into a new object of 
{{PriorityQueueResourcesForSorting}} and puts it in the new list.
{code:java}
List<PriorityQueueResourcesForSorting> sortedQueueResources =
queues.stream().
map(queue -> new PriorityQueueResourcesForSorting(queue)).
collect(Collectors.toList());
{code}
And as you pointed out, {{absoluteUsedCapacity}}, {{usedCapacity}}, and 
{{absoluteCapacity}} are all floats so their value is copied in the constructor 
of {{PriorityQueueResourcesForSorting}}. The only remaining value used for 
comparison is {{configuredMinResource}}, but I don't expect that to change 
unless manually refreshed.
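The copy-the-floats approach discussed above can be sketched as follows. This is a simplified, hypothetical illustration of the PriorityQueueResourcesForSorting idea (class and field names assumed from this discussion, not the actual patch): the float sort keys are frozen into an immutable snapshot per queue, so concurrent updates to the live queue cannot reorder keys while the sort is running.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// A queue whose usage is mutated concurrently by the committer thread.
class LiveQueue {
    volatile float usedCapacity;

    LiveQueue(float usedCapacity) {
        this.usedCapacity = usedCapacity;
    }
}

// Immutable snapshot: the sort key is copied once at construction time.
class QueueSnapshot {
    final LiveQueue queue;
    final float usedCapacity; // value frozen at snapshot time

    QueueSnapshot(LiveQueue queue) {
        this.queue = queue;
        this.usedCapacity = queue.usedCapacity;
    }

    static List<QueueSnapshot> sortedSnapshot(List<LiveQueue> queues) {
        List<QueueSnapshot> copies = new ArrayList<>();
        for (LiveQueue q : queues) {
            copies.add(new QueueSnapshot(q));
        }
        // The comparator reads only the frozen copies, so it stays
        // self-consistent even if usedCapacity changes mid-sort.
        Collections.sort(copies, Comparator.comparingDouble(s -> s.usedCapacity));
        return copies;
    }
}
```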

I would feel better if there were a way to test this with a unit test. I'll 
keep thinking about it.

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch
>
>

[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-12-08 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456058#comment-17456058
 ] 

Eric Payne commented on YARN-10178:
---

This is a complicated problem, and I'm still trying to get my brain around what 
exactly is happening and what would fix it. So, if I get some of the details 
wrong here, please correct me.
[~gandras], 
bq. I was wondering whether we could avoid creating the snapshot altogether, by 
modifying the original comparator to acquire the necessary values immediately
I think the problem within TimSort.sort() is that while the queue list is being 
sorted, the resources of the elements that have already been sorted are 
changing. So when TimSort.sort() tries to find the correct location for the next 
element, the sort order is wrong. That is why I think the copy is needed, so 
that a static snapshot of the queues is what gets sorted.

[~zhuqi]/[~wangda]/others, I read online that even the stream method of List is 
not a deep copy. Is that true? If we are only copying references to the queue 
objects, then the resource usages of each queue can still change and cause the 
sorted list to be wrong during sorting.
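On the question of stream copies: collecting a stream does produce a new list object, but it is a shallow copy; the elements are the same references, so their mutable fields can still change under the sort. A quick self-contained check:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Demonstrates that List.stream().collect(toList()) copies the list
// structure but not the elements: both lists share the same objects.
class ShallowCopyCheck {
    public static void main(String[] args) {
        List<int[]> original = Arrays.asList(new int[] {1, 2});
        List<int[]> copy = original.stream().collect(Collectors.toList());

        System.out.println(copy != original);               // new list object
        System.out.println(copy.get(0) == original.get(0)); // same element reference

        copy.get(0)[0] = 99;                    // mutating through the "copy"...
        System.out.println(original.get(0)[0]); // ...is visible in the original
    }
}
```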

bq. We should not use the Stream API because of older branches. I suggest 
rewriting getAssignmentIterator: 
I believe that the Stream API was introduced in JDK 8. If we choose to use it, 
we would not be able to backport this fix to anything prior to Hadoop 2.10. I 
am fine with that, but I am interested in others' opinions.

{quote}
Measuring performance is a delicate procedure. Including it in a unit test is 
incredibly volatile (On my local machine I have not been able to pass the test 
for example) especially when naive time measurement is involved. Not sure if we 
can easily reproduce it, but I think in this case no test is better than a 
potentially intermittent test.
{quote}
I agree with [~gandras]. I have been trying to determine a way to write a unit 
test that can reproduce this, but so far I have had no luck. But I think a unit 
test that doesn't reproduce the error _and_ could fail intermittently is not 
ideal.


> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.s

[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-12-03 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453248#comment-17453248
 ] 

Eric Payne commented on YARN-10178:
---

[~tuyu], [~zhuqi], [~wangda], [~bteke], [~pbacsko], [~Tao Yang],
Thank you all for the great work you have done investigating this issue. We 
have encountered this same problem when enabling the async capacity scheduler 
in our sandbox cluster, so I would appreciate getting it resolved soon. I also 
hope that a branch-2.10 patch can be made available. I will investigate 
further in the next few days.

Also, I notice that this same issue has been raised in the following JIRAs. All 
of them are in the Patch Available state, and all have very different code 
changes from this one. I believe that the most thorough solution is the one 
defined in this JIRA (YARN-10178). My recommendation would be for the following 
JIRAs to be closed as duplicates of this one:
YARN-8737
YARN-8764
YARN-10058

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> Java 8's Arrays.sort uses the TimSort algorithm by default, and TimSort 
> requires comparators to satisfy its general contract:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y && y > z  -->  x > z
> 3. x == y  -->  sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the comparator does not satisfy these requirements, TimSort throws 
> 'java.lang.IllegalArgumentException'.
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we can 
> see that the Capacity Scheduler compares queues on these resource usage values:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In the Capacity Scheduler, the global scheduler's async thread uses 
> PriorityUtilizationQueueOrderingPolicy to choose a queue to assign a 
> container, constructs a CSAssignment struct, and uses the 
> submitResourceCommitRequest function to add the CSAssignment to the backlog.
> ResourceCommitterService will then tryCommit this CSAssignment; looking at the 
> tryCommit function, ther

[jira] [Commented] (YARN-10939) The task status of the same application on the yarn jobhistory is inconsistent with that on the yarn Web UI

2021-11-11 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442458#comment-17442458
 ] 

Eric Payne commented on YARN-10939:
---

It's hard to debug this issue with just the pictures. However, I think the time 
difference can be explained by the fact that the first screenshot shows CST 
(which is GMT -5 in September) and the second shows GMT +8.
Also, a job can complete successfully and then be killed later, in which case 
it will show up as KILLED.

> The task status of the same application on the yarn jobhistory is 
> inconsistent with that on the yarn Web UI
> ---
>
> Key: YARN-10939
> URL: https://issues.apache.org/jira/browse/YARN-10939
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: hao
>Priority: Major
> Attachments: 企业微信截图_16311769849737.png, 企业微信截图_1631177020581.png
>
>
> The task status of yarn on jobhistory is inconsistent with that on yarn UI



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator

2021-11-11 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne resolved YARN-10848.
---
Resolution: Not A Problem

I am closing this JIRA based on the above discussion.

> Vcore allocation problem with DefaultResourceCalculator
> ---
>
> Key: YARN-10848
> URL: https://issues.apache.org/jira/browse/YARN-10848
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler
>Reporter: Peter Bacsko
>Assignee: Minni Mittal
>Priority: Major
>  Labels: pull-request-available
> Attachments: TestTooManyContainers.java
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> If we use DefaultResourceCalculator, then Capacity Scheduler keeps allocating 
> containers even if we run out of vcores.
> CS checks the available resources in two places. The first check is in 
> {{CapacityScheduler.allocateContainerOnSingleNode()}}:
> {noformat}
> if (calculator.computeAvailableContainers(Resources
> .add(node.getUnallocatedResource(), 
> node.getTotalKillableResources()),
> minimumAllocation) <= 0) {
>   LOG.debug("This node " + node.getNodeID() + " doesn't have sufficient "
>   + "available or preemptible resource for minimum allocation");
> {noformat}
> The second, which is more important, is located in 
> {{RegularContainerAllocator.assignContainer()}}:
> {noformat}
> if (!Resources.fitsIn(rc, capability, totalResource)) {
>   LOG.warn("Node : " + node.getNodeID()
>   + " does not have sufficient resource for ask : " + pendingAsk
>   + " node total capability : " + node.getTotalResource());
>   // Skip this locality request
>   ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
>   activitiesManager, node, application, schedulerKey,
>   ActivityDiagnosticConstant.
>   NODE_TOTAL_RESOURCE_INSUFFICIENT_FOR_REQUEST
>   + getResourceDiagnostics(capability, totalResource),
>   ActivityLevel.NODE);
>   return ContainerAllocation.LOCALITY_SKIPPED;
> }
> {noformat}
> Here, {{rc}} is the resource calculator instance, the other two values are:
> {noformat}
> Resource capability = pendingAsk.getPerAllocationResource();
> Resource available = node.getUnallocatedResource();
> {noformat}
> There is a repro unit test attached to this case, which demonstrates the 
> problem. The root cause is that we pass the resource calculator to 
> {{Resource.fitsIn()}}. Instead, we should use an overridden version, just 
> like in {{FSAppAttempt.assignContainer()}}:
> {noformat}
>// Can we allocate a container on this node?
> if (Resources.fitsIn(capability, available)) {
>   // Inform the application of the new container for this request
>   RMContainer allocatedContainer =
>   allocate(type, node, schedulerKey, pendingAsk,
>   reservedContainer);
> {noformat}
> In CS, if we switch to DominantResourceCalculator OR use 
> {{Resources.fitsIn()}} without the calculator in 
> {{RegularContainerAllocator.assignContainer()}}, that fixes the failing unit 
> test (see {{testTooManyContainers()}} in {{TestTooManyContainers.java}}).
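The distinction the description draws can be sketched as follows (a simplified illustration; the real Hadoop Resources/ResourceCalculator classes carry more dimensions and logic). A DefaultResourceCalculator-style check considers only memory, so a node with exhausted vcores still "fits" an ask, while a per-dimension check rejects it.

```java
// Simplified two-dimension resource, standing in for Hadoop's Resource.
class Res {
    final long memory, vcores;
    Res(long memory, long vcores) { this.memory = memory; this.vcores = vcores; }
}

public class FitsInDemo {
    // DefaultResourceCalculator-style check: only memory is considered.
    static boolean fitsInMemoryOnly(Res ask, Res avail) {
        return ask.memory <= avail.memory;
    }

    // Calculator-free check: every dimension must fit.
    static boolean fitsInAllDimensions(Res ask, Res avail) {
        return ask.memory <= avail.memory && ask.vcores <= avail.vcores;
    }

    public static void main(String[] args) {
        Res ask = new Res(1024, 1);
        Res node = new Res(8192, 0);  // memory left, but no vcores left

        // The memory-only check lets allocation proceed past vcore
        // exhaustion, while the per-dimension check correctly rejects it.
        System.out.println(fitsInMemoryOnly(ask, node));     // true
        System.out.println(fitsInAllDimensions(ask, node));  // false
    }
}
```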



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-10-21 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-1115:
-
Attachment: YARN-1115.branch-2.10.004.patch

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
> Attachments: YARN-1115.001.patch, YARN-1115.002.patch, 
> YARN-1115.003.patch, YARN-1115.004.patch, YARN-1115.branch-2.10.004.patch, 
> YARN-1115.branch-3.2.004.patch, YARN-1115.branch-3.3.004.patch
>
>
> In the framework for secure implementation using UserGroupInformation.doAs 
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html),
>  a trusted superuser can submit jobs on behalf of another user in a secure 
> way. In this framework, the superuser is referred to as the real user and the 
> proxied user is referred to as the effective user.
> Currently when a job is submitted as an effective user, the ACLs for the 
> effective user are checked against the queue on which the job is to be run. 
> Depending on an optional configuration, the scheduler should also check the 
> ACLs of the real user if the configuration to do so is set.
> For example, suppose my superuser name is super, and super is configured to 
> securely proxy as joe. Also suppose there is a Hadoop queue named ops which 
> only allows ACLs for super, not for joe.
> When super proxies to joe in order to submit a job to the ops queue, it will 
> fail because joe, as the effective user, does not have ACLs on the ops queue.
> In many cases this is what you want, in order to protect queues that joe 
> should not be using.
> However, there are times when super may need to proxy to many users, and the 
> client running as super just wants to use the ops queue because the ops queue 
> is already dedicated to the client's purpose, and, to keep the ops queue 
> dedicated to that purpose, super doesn't want to open up ACLs to joe in 
> general on the ops queue. Without this functionality, in this case, the 
> client running as super needs to figure out which queue each user has ACLs 
> opened up for, and then coordinate with other tasks using those queues.
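A hypothetical sketch of the optional real-user fallback described above (method, field, and flag names here are illustrative, not the actual YARN scheduler API): check the effective user's ACLs first, and only when the configuration enables it, fall back to the real (proxying) user's ACLs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class AclCheckDemo {
    // Hypothetical queue -> allowed-users map standing in for queue ACLs.
    static final Map<String, Set<String>> queueAcls = new HashMap<>();

    static boolean hasAcl(String user, String queue) {
        return queueAcls.getOrDefault(queue, Set.of()).contains(user);
    }

    // With the proposed option enabled, fall back to the real (proxying)
    // user's ACLs when the effective user is not authorized.
    static boolean checkAccess(String effectiveUser, String realUser,
                               String queue, boolean alsoCheckRealUser) {
        if (hasAcl(effectiveUser, queue)) {
            return true;
        }
        return alsoCheckRealUser && realUser != null && hasAcl(realUser, queue);
    }

    public static void main(String[] args) {
        queueAcls.put("ops", Set.of("super"));  // only super has ACLs on ops

        // super proxies as joe: rejected today, accepted with the new option.
        System.out.println(checkAccess("joe", "super", "ops", false));  // false
        System.out.println(checkAccess("joe", "super", "ops", true));   // true
    }
}
```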



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-10-21 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432729#comment-17432729
 ] 

Eric Payne commented on YARN-1115:
--

I attached patch 004 for branch-3.2. The backport from 3.3 was clean but the 
unit test failed, so I created a branch-3.2 patch.

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
> Attachments: YARN-1115.001.patch, YARN-1115.002.patch, 
> YARN-1115.003.patch, YARN-1115.004.patch, YARN-1115.branch-3.2.004.patch, 
> YARN-1115.branch-3.3.004.patch
>
>
> In the framework for secure implementation using UserGroupInformation.doAs 
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html),
>  a trusted superuser can submit jobs on behalf of another user in a secure 
> way. In this framework, the superuser is referred to as the real user and the 
> proxied user is referred to as the effective user.
> Currently when a job is submitted as an effective user, the ACLs for the 
> effective user are checked against the queue on which the job is to be run. 
> Depending on an optional configuration, the scheduler should also check the 
> ACLs of the real user if the configuration to do so is set.
> For example, suppose my superuser name is super, and super is configured to 
> securely proxy as joe. Also suppose there is a Hadoop queue named ops which 
> only allows ACLs for super, not for joe.
> When super proxies to joe in order to submit a job to the ops queue, it will 
> fail because joe, as the effective user, does not have ACLs on the ops queue.
> In many cases this is what you want, in order to protect queues that joe 
> should not be using.
> However, there are times when super may need to proxy to many users, and the 
> client running as super just wants to use the ops queue because the ops queue 
> is already dedicated to the client's purpose, and, to keep the ops queue 
> dedicated to that purpose, super doesn't want to open up ACLs to joe in 
> general on the ops queue. Without this functionality, in this case, the 
> client running as super needs to figure out which queue each user has ACLs 
> opened up for, and then coordinate with other tasks using those queues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-10-21 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-1115:
-
Attachment: YARN-1115.branch-3.2.004.patch

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
> Attachments: YARN-1115.001.patch, YARN-1115.002.patch, 
> YARN-1115.003.patch, YARN-1115.004.patch, YARN-1115.branch-3.2.004.patch, 
> YARN-1115.branch-3.3.004.patch
>
>
> In the framework for secure implementation using UserGroupInformation.doAs 
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html),
>  a trusted superuser can submit jobs on behalf of another user in a secure 
> way. In this framework, the superuser is referred to as the real user and the 
> proxied user is referred to as the effective user.
> Currently when a job is submitted as an effective user, the ACLs for the 
> effective user are checked against the queue on which the job is to be run. 
> Depending on an optional configuration, the scheduler should also check the 
> ACLs of the real user if the configuration to do so is set.
> For example, suppose my superuser name is super, and super is configured to 
> securely proxy as joe. Also suppose there is a Hadoop queue named ops which 
> only allows ACLs for super, not for joe.
> When super proxies to joe in order to submit a job to the ops queue, it will 
> fail because joe, as the effective user, does not have ACLs on the ops queue.
> In many cases this is what you want, in order to protect queues that joe 
> should not be using.
> However, there are times when super may need to proxy to many users, and the 
> client running as super just wants to use the ops queue because the ops queue 
> is already dedicated to the client's purpose, and, to keep the ops queue 
> dedicated to that purpose, super doesn't want to open up ACLs to joe in 
> general on the ops queue. Without this functionality, in this case, the 
> client running as super needs to figure out which queue each user has ACLs 
> opened up for, and then coordinate with other tasks using those queues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-10-21 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432600#comment-17432600
 ] 

Eric Payne commented on YARN-1115:
--

{quote}
|versions|git=2.17.1 maven=3.6.0 spotbugs=4.2.2|
{quote}
[~ahussein], it looks like the pre-commit build is using Maven 3.6.0.

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-1115.001.patch, YARN-1115.002.patch, 
> YARN-1115.003.patch, YARN-1115.004.patch, YARN-1115.branch-3.3.004.patch
>
>
> In the framework for secure implementation using UserGroupInformation.doAs 
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html),
>  a trusted superuser can submit jobs on behalf of another user in a secure 
> way. In this framework, the superuser is referred to as the real user and the 
> proxied user is referred to as the effective user.
> Currently when a job is submitted as an effective user, the ACLs for the 
> effective user are checked against the queue on which the job is to be run. 
> Depending on an optional configuration, the scheduler should also check the 
> ACLs of the real user if the configuration to do so is set.
> For example, suppose my superuser name is super, and super is configured to 
> securely proxy as joe. Also suppose there is a Hadoop queue named ops which 
> only allows ACLs for super, not for joe.
> When super proxies to joe in order to submit a job to the ops queue, it will 
> fail because joe, as the effective user, does not have ACLs on the ops queue.
> In many cases this is what you want, in order to protect queues that joe 
> should not be using.
> However, there are times when super may need to proxy to many users, and the 
> client running as super just wants to use the ops queue because the ops queue 
> is already dedicated to the client's purpose, and, to keep the ops queue 
> dedicated to that purpose, super doesn't want to open up ACLs to joe in 
> general on the ops queue. Without this functionality, in this case, the 
> client running as super needs to figure out which queue each user has ACLs 
> opened up for, and then coordinate with other tasks using those queues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-10-21 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432594#comment-17432594
 ] 

Eric Payne commented on YARN-1115:
--

bq. Eric Payne were you able to build branch-3.3 locally?
Yes, but I'm using Maven 3.6.3

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-1115.001.patch, YARN-1115.002.patch, 
> YARN-1115.003.patch, YARN-1115.004.patch, YARN-1115.branch-3.3.004.patch
>
>
> In the framework for secure implementation using UserGroupInformation.doAs 
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html),
>  a trusted superuser can submit jobs on behalf of another user in a secure 
> way. In this framework, the superuser is referred to as the real user and the 
> proxied user is referred to as the effective user.
> Currently when a job is submitted as an effective user, the ACLs for the 
> effective user are checked against the queue on which the job is to be run. 
> Depending on an optional configuration, the scheduler should also check the 
> ACLs of the real user if the configuration to do so is set.
> For example, suppose my superuser name is super, and super is configured to 
> securely proxy as joe. Also suppose there is a Hadoop queue named ops which 
> only allows ACLs for super, not for joe.
> When super proxies to joe in order to submit a job to the ops queue, it will 
> fail because joe, as the effective user, does not have ACLs on the ops queue.
> In many cases this is what you want, in order to protect queues that joe 
> should not be using.
> However, there are times when super may need to proxy to many users, and the 
> client running as super just wants to use the ops queue because the ops queue 
> is already dedicated to the client's purpose, and, to keep the ops queue 
> dedicated to that purpose, super doesn't want to open up ACLs to joe in 
> general on the ops queue. Without this functionality, in this case, the 
> client running as super needs to figure out which queue each user has ACLs 
> opened up for, and then coordinate with other tasks using those queues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-10-20 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-1115:
-
Attachment: YARN-1115.branch-3.3.004.patch

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Attachments: YARN-1115.001.patch, YARN-1115.002.patch, 
> YARN-1115.003.patch, YARN-1115.004.patch, YARN-1115.branch-3.3.004.patch
>
>
> In the framework for secure implementation using UserGroupInformation.doAs 
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html),
>  a trusted superuser can submit jobs on behalf of another user in a secure 
> way. In this framework, the superuser is referred to as the real user and the 
> proxied user is referred to as the effective user.
> Currently when a job is submitted as an effective user, the ACLs for the 
> effective user are checked against the queue on which the job is to be run. 
> Depending on an optional configuration, the scheduler should also check the 
> ACLs of the real user if the configuration to do so is set.
> For example, suppose my superuser name is super, and super is configured to 
> securely proxy as joe. Also suppose there is a Hadoop queue named ops which 
> only allows ACLs for super, not for joe.
> When super proxies to joe in order to submit a job to the ops queue, it will 
> fail because joe, as the effective user, does not have ACLs on the ops queue.
> In many cases this is what you want, in order to protect queues that joe 
> should not be using.
> However, there are times when super may need to proxy to many users, and the 
> client running as super just wants to use the ops queue because the ops queue 
> is already dedicated to the client's purpose, and, to keep the ops queue 
> dedicated to that purpose, super doesn't want to open up ACLs to joe in 
> general on the ops queue. Without this functionality, in this case, the 
> client running as super needs to figure out which queue each user has ACLs 
> opened up for, and then coordinate with other tasks using those queues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-10-20 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432096#comment-17432096
 ] 

Eric Payne commented on YARN-1115:
--

[~ahussein], the precommit build for patch 004 looks good to me. Can you please 
review? Thanks!

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Attachments: YARN-1115.001.patch, YARN-1115.002.patch, 
> YARN-1115.003.patch, YARN-1115.004.patch
>
>
> In the framework for secure implementation using UserGroupInformation.doAs 
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html),
>  a trusted superuser can submit jobs on behalf of another user in a secure 
> way. In this framework, the superuser is referred to as the real user and the 
> proxied user is referred to as the effective user.
> Currently when a job is submitted as an effective user, the ACLs for the 
> effective user are checked against the queue on which the job is to be run. 
> Depending on an optional configuration, the scheduler should also check the 
> ACLs of the real user if the configuration to do so is set.
> For example, suppose my superuser name is super, and super is configured to 
> securely proxy as joe. Also suppose there is a Hadoop queue named ops which 
> only allows ACLs for super, not for joe.
> When super proxies to joe in order to submit a job to the ops queue, it will 
> fail because joe, as the effective user, does not have ACLs on the ops queue.
> In many cases this is what you want, in order to protect queues that joe 
> should not be using.
> However, there are times when super may need to proxy to many users, and the 
> client running as super just wants to use the ops queue because the ops queue 
> is already dedicated to the client's purpose, and, to keep the ops queue 
> dedicated to that purpose, super doesn't want to open up ACLs to joe in 
> general on the ops queue. Without this functionality, in this case, the 
> client running as super needs to figure out which queue each user has ACLs 
> opened up for, and then coordinate with other tasks using those queues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-10-20 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17431371#comment-17431371
 ] 

Eric Payne commented on YARN-1115:
--

I have now attached version 004 of this patch. Patch 003 fixed most of the 
checkstyle issues, but I needed patch 004 to remove the javac warning about 
deprecation of the old {{submitApplication}} method.

I will not be fixing the checkstyle warnings about too many parameters to 
methods. That would require restructuring the code.

The failed unit tests listed above from the 003 precommit run do not seem to be 
related. They succeed in my dev environment.

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Attachments: YARN-1115.001.patch, YARN-1115.002.patch, 
> YARN-1115.003.patch, YARN-1115.004.patch
>
>
> In the framework for secure implementation using UserGroupInformation.doAs 
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html),
>  a trusted superuser can submit jobs on behalf of another user in a secure 
> way. In this framework, the superuser is referred to as the real user and the 
> proxied user is referred to as the effective user.






[jira] [Updated] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-10-20 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-1115:
-
Attachment: YARN-1115.004.patch

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Attachments: YARN-1115.001.patch, YARN-1115.002.patch, 
> YARN-1115.003.patch, YARN-1115.004.patch






[jira] [Updated] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-10-19 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-1115:
-
Attachment: YARN-1115.003.patch

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Attachments: YARN-1115.001.patch, YARN-1115.002.patch, 
> YARN-1115.003.patch






[jira] [Updated] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-10-14 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-1115:
-
Attachment: (was: YARN-1115.branch-3.3.002.patch)

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Attachments: YARN-1115.001.patch, YARN-1115.002.patch






[jira] [Commented] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-10-14 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428951#comment-17428951
 ] 

Eric Payne commented on YARN-1115:
--

{quote}
|-1|mvninstall|0m 42s|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1228/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt|hadoop-yarn-server-resourcemanager in the patch failed.|
|-1|compile|2m 46s|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1228/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn.txt|hadoop-yarn in the patch failed.|
{quote}
I submitted the branch-3.3 patch without compiling it first. And this, kids, is 
why you don't do that.

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Attachments: YARN-1115.001.patch, YARN-1115.002.patch, 
> YARN-1115.branch-3.3.002.patch






[jira] [Updated] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-10-14 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-1115:
-
Attachment: YARN-1115.branch-3.3.002.patch

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Attachments: YARN-1115.001.patch, YARN-1115.002.patch, 
> YARN-1115.branch-3.3.002.patch






[jira] [Updated] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-10-13 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-1115:
-
Attachment: YARN-1115.002.patch

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Attachments: YARN-1115.001.patch, YARN-1115.002.patch






[jira] [Commented] (YARN-8222) Fix potential NPE when gets RMApp from RM context

2021-10-12 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427832#comment-17427832
 ] 

Eric Payne commented on YARN-8222:
--

This bug caused our RM to crash on a production cluster.

I will backport this to branch-2.10 (it backports cleanly).

> Fix potential NPE when gets RMApp from RM context
> -
>
> Key: YARN-8222
> URL: https://issues.apache.org/jira/browse/YARN-8222
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.2.0, 3.1.1, 3.0.3
>
> Attachments: YARN-8222.001.patch
>
>
> Recently we did some performance tests and found two NPE problems when 
> calling rmContext.getRMApps().get(appId).get...
> These NPE problems occasionally happened when doing performance tests with a 
> large number of fast-finished applications. We have checked other places 
> which call rmContext.getRMApps().get(...); most of them have a null check, 
> and some do not need one (the process can guarantee that the return result 
> will not be null).
> To fix these problems, we can add a null check for the application before 
> getting the attempt from it.
> (1) NPE in RMContainerImpl$FinishedTransition#updateAttemptMetrics
> {noformat}
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.updateAttemptMetrics(RMContainerImpl.java:742)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:715)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:699)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:482)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:64)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.containerCompleted(FiCaSchedulerApp.java:195)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1793)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:2624)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:663)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1514)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:2396)
>         at 
> org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler.handle(SLSCapacityScheduler.java:205)
>         at 
> org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler.handle(SLSCapacityScheduler.java:60)
>         at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>         at java.lang.Thread.run(Thread.java:834)
> {noformat}
> This NPE looks like it happens when a node heartbeat is delayed and tries to 
> update attempt metrics for a non-existent app. 
> Reference code of RMContainerImpl$FinishedTransition#updateAttemptMetrics:
> {code:java}
> private static void updateAttemptMetrics(RMContainerImpl container) {
>   Resource resource = container.getContainer().getResource();
>   RMAppAttempt rmAttempt = container.rmContext.getRMApps()
>   .get(container.getApplicationAttemptId().getApplicationId())
>   .getCurrentAppAttempt();
>   if (rmAttempt != null) {
>  //
>   }
> }
> {code}
> (2) NPE in SchedulerApplicationAttempt#incNumAllocatedContainers
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.incNumAllocatedContainers(SchedulerApplicationAttempt.java:1268)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:638)
> at 
> org.apache.hadoop.yarn.server.resourcem
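The null-check fix described in the issue can be sketched with stand-in types. This is an illustration of the pattern (look up the app first, bail out if it has already been removed) rather than the committed patch; the RMApp/RMAppAttempt interfaces below are minimal substitutes for the real YARN classes.

```java
import java.util.Map;

public class NullCheckSketch {
  interface RMAppAttempt { }
  interface RMApp { RMAppAttempt getCurrentAppAttempt(); }

  /** Mirrors the fixed updateAttemptMetrics flow: check the app before use. */
  static RMAppAttempt getAttemptSafely(Map<String, RMApp> rmApps, String appId) {
    RMApp app = rmApps.get(appId);   // may be null for a just-removed app
    if (app == null) {
      return null;                   // skip the metrics update instead of NPE
    }
    return app.getCurrentAppAttempt();
  }

  public static void main(String[] args) {
    // A delayed node heartbeat referencing a finished, removed application
    // now yields null instead of a NullPointerException:
    System.out.println(getAttemptSafely(Map.of(), "app_1")); // prints "null"
  }
}
```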

[jira] [Commented] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling

2021-10-08 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426215#comment-17426215
 ] 

Eric Payne commented on YARN-8546:
--

[~Tao Yang], [~cheersyang], if there are no objections, I'll go ahead and 
backport this to branch-2.10 with the changes to the unit test.

> Resource leak caused by a reserved container being released more than once 
> under async scheduling
> -
>
> Key: YARN-8546
> URL: https://issues.apache.org/jira/browse/YARN-8546
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: global-scheduling
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8546.001.patch, YARN-8546.branch-2.10.001.patch
>
>
> I was able to reproduce this issue by starting a job that keeps requesting 
> containers until it uses up the cluster's available resources. My cluster 
> has 70200 vcores, and each task applies for 100 vcores, so I was expecting a 
> total of 702 containers to be allocated, but eventually there were only 701. 
> The last container could not get allocated because the queue's used resource 
> was updated to more than 100%.






[jira] [Resolved] (YARN-9975) Support proxy ACL user for CapacityScheduler

2021-10-07 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne resolved YARN-9975.
--
Resolution: Duplicate

I'm closing this as a dup of YARN-1115. Please reopen if you disagree.

> Support proxy ACL user for CapacityScheduler
> 
>
> Key: YARN-9975
> URL: https://issues.apache.org/jira/browse/YARN-9975
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
>
> As commented in YARN-9698.
> I will open a new jira for the proxy user feature. 
> The background is that we have a long-running SQL thriftserver for many users:
> {quote}{{user->sql proxy-> sql thriftserver}}{quote}
> But we do not have keytabs for all users on the 'sql proxy'. We just use a 
> superuser like 'sql_prc' to submit the 'sql thriftserver' application. To 
> support this, we should change the scheduler to support a proxy-user ACL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling

2021-10-06 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425185#comment-17425185
 ] 

Eric Payne commented on YARN-8546:
--

I uploaded the branch-2.10 patch.
The {{FiCaSchedulerApp}} changes are the same.
The {{TestCapacitySchedulerAsyncScheduling}} changes are essentially the same, 
except I had to add/modify assertions, since {{CapacityScheduler#tryCommit}} 
is a void function in 2.10. The new unit test in branch-3 expected 
{{tryCommit}} to return a boolean.

[~Tao Yang] / [~cheersyang], would you mind taking a look?

> Resource leak caused by a reserved container being released more than once 
> under async scheduling
> -
>
> Key: YARN-8546
> URL: https://issues.apache.org/jira/browse/YARN-8546
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: global-scheduling
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8546.001.patch, YARN-8546.branch-2.10.001.patch






[jira] [Updated] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling

2021-10-06 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-8546:
-
Attachment: YARN-8546.branch-2.10.001.patch

> Resource leak caused by a reserved container being released more than once 
> under async scheduling
> -
>
> Key: YARN-8546
> URL: https://issues.apache.org/jira/browse/YARN-8546
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: global-scheduling
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8546.001.patch, YARN-8546.branch-2.10.001.patch






[jira] [Reopened] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling

2021-10-06 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne reopened YARN-8546:
--

> Resource leak caused by a reserved container being released more than once 
> under async scheduling
> -
>
> Key: YARN-8546
> URL: https://issues.apache.org/jira/browse/YARN-8546
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: global-scheduling
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8546.001.patch, YARN-8546.branch-2.10.001.patch






[jira] [Commented] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling

2021-10-05 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424713#comment-17424713
 ] 

Eric Payne commented on YARN-8546:
--

In conjunction with YARN-8127, I'd also like to backport this JIRA to 
branch-2.10.

> Resource leak caused by a reserved container being released more than once 
> under async scheduling
> -
>
> Key: YARN-8546
> URL: https://issues.apache.org/jira/browse/YARN-8546
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: global-scheduling
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8546.001.patch
>
>
> I was able to reproduce this issue by starting a job that keeps requesting 
> containers until it uses up the cluster's available resources. My cluster 
> has 70200 vcores, and each task requests 100 vcores, so I was expecting a 
> total of 702 containers to be allocated, but eventually there were only 701. 
> The last container could not be allocated because the queue's used resource 
> was updated to be more than 100%.






[jira] [Commented] (YARN-8127) Resource leak when async scheduling is enabled

2021-10-05 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424650#comment-17424650
 ] 

Eric Payne commented on YARN-8127:
--

Also, FWIW, all tests that ran passed:
{noformat}
[INFO] Results:
[INFO] 
[WARNING] Tests run: 2173, Failures: 0, Errors: 0, Skipped: 8
[INFO] 
{noformat}
I'll backport this afternoon.

> Resource leak when async scheduling is enabled
> --
>
> Key: YARN-8127
> URL: https://issues.apache.org/jira/browse/YARN-8127
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8127.001.patch, YARN-8127.002.patch, 
> YARN-8127.003.patch, YARN-8127.004.patch, YARN-8127.branch-2.10.004.patch
>
>
> Brief steps to reproduce
>  # Enable async scheduling, 5 threads
>  # Submit a lot of jobs trying to exhaust cluster resource
>  # After a while, observed NM allocated resource is more than resource 
> requested by allocated containers
> Looks like the commit phase is not synchronized when handling reserved 
> containers, causing some proposals to be incorrectly accepted; subsequently, 
> resources were deducted multiple times for a single container.






[jira] [Commented] (YARN-8127) Resource leak when async scheduling is enabled

2021-10-04 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424217#comment-17424217
 ] 

Eric Payne commented on YARN-8127:
--

It looks like the unit tests didn't run because of a VM error:
{noformat}
[ERROR] ExecutionException The forked VM terminated without properly saying 
goodbye. VM crash or System.exit called?
{noformat}

> Resource leak when async scheduling is enabled
> --
>
> Key: YARN-8127
> URL: https://issues.apache.org/jira/browse/YARN-8127
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8127.001.patch, YARN-8127.002.patch, 
> YARN-8127.003.patch, YARN-8127.004.patch, YARN-8127.branch-2.10.004.patch
>
>
> Brief steps to reproduce
>  # Enable async scheduling, 5 threads
>  # Submit a lot of jobs trying to exhaust cluster resource
>  # After a while, observed NM allocated resource is more than resource 
> requested by allocated containers
> Looks like the commit phase is not synchronized when handling reserved 
> containers, causing some proposals to be incorrectly accepted; subsequently, 
> resources were deducted multiple times for a single container.






[jira] [Reopened] (YARN-8127) Resource leak when async scheduling is enabled

2021-10-04 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne reopened YARN-8127:
--

Attached the branch-2.10 patch. Reopening and putting it in the PATCH AVAILABLE 
state to kick off the pre-commit build.
The backport is clean except for a couple of minor unit test changes.
If all goes well, I will cherry-pick back to 2.10 and fix up the unit tests there.

> Resource leak when async scheduling is enabled
> --
>
> Key: YARN-8127
> URL: https://issues.apache.org/jira/browse/YARN-8127
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8127.001.patch, YARN-8127.002.patch, 
> YARN-8127.003.patch, YARN-8127.004.patch, YARN-8127.branch-2.10.004.patch
>
>
> Brief steps to reproduce
>  # Enable async scheduling, 5 threads
>  # Submit a lot of jobs trying to exhaust cluster resource
>  # After a while, observed NM allocated resource is more than resource 
> requested by allocated containers
> Looks like the commit phase is not synchronized when handling reserved 
> containers, causing some proposals to be incorrectly accepted; subsequently, 
> resources were deducted multiple times for a single container.






[jira] [Updated] (YARN-8127) Resource leak when async scheduling is enabled

2021-10-04 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-8127:
-
Attachment: YARN-8127.branch-2.10.004.patch

> Resource leak when async scheduling is enabled
> --
>
> Key: YARN-8127
> URL: https://issues.apache.org/jira/browse/YARN-8127
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8127.001.patch, YARN-8127.002.patch, 
> YARN-8127.003.patch, YARN-8127.004.patch, YARN-8127.branch-2.10.004.patch
>
>
> Brief steps to reproduce
>  # Enable async scheduling, 5 threads
>  # Submit a lot of jobs trying to exhaust cluster resource
>  # After a while, observed NM allocated resource is more than resource 
> requested by allocated containers
> Looks like the commit phase is not synchronized when handling reserved 
> containers, causing some proposals to be incorrectly accepted; subsequently, 
> resources were deducted multiple times for a single container.






[jira] [Commented] (YARN-8127) Resource leak when async scheduling is enabled

2021-10-04 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424069#comment-17424069
 ] 

Eric Payne commented on YARN-8127:
--

I'd like to backport this to 2.10. If no objections, I'll put up a patch.

> Resource leak when async scheduling is enabled
> --
>
> Key: YARN-8127
> URL: https://issues.apache.org/jira/browse/YARN-8127
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8127.001.patch, YARN-8127.002.patch, 
> YARN-8127.003.patch, YARN-8127.004.patch
>
>
> Brief steps to reproduce
>  # Enable async scheduling, 5 threads
>  # Submit a lot of jobs trying to exhaust cluster resource
>  # After a while, observed NM allocated resource is more than resource 
> requested by allocated containers
> Looks like the commit phase is not synchronized when handling reserved 
> containers, causing some proposals to be incorrectly accepted; subsequently, 
> resources were deducted multiple times for a single container.






[jira] [Commented] (YARN-10938) Support reservation scheduling enabled switch for capacity scheduler

2021-09-30 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423018#comment-17423018
 ] 

Eric Payne commented on YARN-10938:
---

[~Weihao Zheng]
Reservations are somewhat integral to the behavior of the Capacity Scheduler. 
The Capacity Scheduler is based on FIFO order, and reservations preserve the 
submission order of applications.

If you turn off reservation scheduling, an app that needs larger containers 
will be starved by those with smaller containers. For example, 
- Cluster has 3 nodes of 10GB each.
- Cluster has 1 Queue.
- App1 requests 18 1.5GB containers (6 per node), so each node has 9GB used.
- App2 launches and requests a 3GB AM container to start running.
- App3 needs 500 1GB containers.

Without reservations, App2 is never scheduled until App3 is finished, even 
though App2 got there first. This is because as soon as any 1.5GB container is 
released from App1, App3 jumps in and takes at least a 1GB container on that node.

With reservations enabled, App2 can reserve space on one of the nodes even though 
3GB is not free on any single node. The 3GB gets charged to the queue, and when 
any of the 3 nodes has 3GB or more free, re-reservation kicks in and App2 gets 
the container. This preserves the order of submission.
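The difference the reservation makes in this example can be sketched as a toy placement function. This is an illustration only, with assumed per-node free-memory numbers; it is not CapacityScheduler code:

```python
# Toy model of the example above: 3 nodes of 10GB, App1 holds 9GB per node,
# App2 (first in FIFO order) needs a 3GB AM container, App3 floods the queue
# with 1GB asks. "nodes" is the free GB on each node.
def schedule(nodes, reservations_enabled):
    """Return which app gets the next free space on each node."""
    placements = []
    for free in nodes:
        if reservations_enabled:
            # App2 reserved the node earlier; smaller asks must wait until
            # 3GB accumulates, so App2 is eventually placed first.
            placements.append("App2" if free >= 3 else "reserved:App2")
        else:
            # Without a reservation, any 1GB ask (App3) grabs freed space
            # immediately, starving App2's 3GB request.
            placements.append("App3" if free >= 1 else "wait")
    return placements

# One 1.5GB App1 container finishes on node 0: free = [2.5, 1, 1]
print(schedule([2.5, 1, 1], reservations_enabled=False))  # App3 takes node 0
print(schedule([2.5, 1, 1], reservations_enabled=True))   # held for App2
# Later, more space frees on node 0: free = [4, 1, 1]
print(schedule([4, 1, 1], reservations_enabled=True))     # App2 finally placed
```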

> Support reservation scheduling enabled switch for capacity scheduler
> 
>
> Key: YARN-10938
> URL: https://issues.apache.org/jira/browse/YARN-10938
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Weihao Zheng
>Assignee: Weihao Zheng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently the CapacityScheduler uses reservations in order to handle requests 
> for large containers when there might not currently be enough space 
> available on a single host. But this algorithm is not suitable for small 
> clusters, which only have very limited resources. So we can add a switch 
> property in the capacity scheduler's configuration to avoid reservation 
> scheduling in these use cases.
> CHANGES:
> Add {{"yarn.scheduler.capacity.reservation.enabled"}} in capacity scheduler's 
> configuration.
>  
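As a sketch of how the proposed switch might look in capacity-scheduler.xml — the property name comes from the issue text above, while the value shown and the description wording are assumptions:

```
<!-- Sketch only: property name taken from the JIRA description; the
     value and description text here are illustrative assumptions. -->
<property>
  <name>yarn.scheduler.capacity.reservation.enabled</name>
  <value>false</value>
  <description>When false, the CapacityScheduler skips container
    reservations, e.g. on small clusters where holding a node for a
    large request wastes scarce resources.</description>
</property>
```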






[jira] [Commented] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-09-20 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17417744#comment-17417744
 ] 

Eric Payne commented on YARN-1115:
--

bq.  I re-defined the user variable from String to be UserGroupInformation in 
order to reduce the number of code changes. Maybe that wasn't the best approach.
Actually, that statement is inaccurate. I didn't change String to 
UserGroupInformation. The String variable is still {{user}} and contains the 
user's short name. The {{userUgi}} variable is of UserGroupInformation type, 
and contains the whole set of UGI information that is available when the job 
was submitted. I then modified the signatures of the methods in the call chain 
to pass the whole UGI so that when the user UGI gets to the Capacity Scheduler, 
it can check for either the real user or the proxied user.
bq. Yes, you are right. It is confusing, especially when you say it that way 
While I do understand and agree that it is confusing, I think it is still set 
up correctly. {{user}} is always the user for whom YARN will be running the 
application. It's just that sometimes that will be the real user and sometimes 
that will be the proxied user. Unfortunately, although this explanation is 
accurate, it is still confusing.
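The optional fallback being discussed in this thread can be sketched as a toy check. The function name and plain-string arguments here are hypothetical; the actual patch passes UserGroupInformation objects through the scheduler call chain:

```python
# Toy model of the proposed ACL check: "user" is always the user YARN runs
# the app as (sometimes the real user, sometimes the proxied user); the
# real user is consulted only when the optional config is enabled.
def has_queue_access(queue_acls, user, real_user, check_real_user_acl):
    """queue_acls: set of user names allowed to submit to the queue."""
    if user in queue_acls:                   # effective (possibly proxied) user
        return True
    if check_real_user_acl and real_user is not None:
        return real_user in queue_acls       # optional fallback to real user
    return False

ops_acls = {"super"}                          # ops queue only allows super
# super proxies as joe: user="joe", real_user="super"
print(has_queue_access(ops_acls, "joe", "super", False))  # today: rejected
print(has_queue_access(ops_acls, "joe", "super", True))   # proposed: allowed
```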

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Attachments: YARN-1115.001.patch
>
>
> In the framework for secure implementation using UserGroupInformation.doAs 
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html),
>  a trusted superuser can submit jobs on behalf of another user in a secure 
> way. In this framework, the superuser is referred to as the real user and the 
> proxied user is referred to as the effective user.
> Currently when a job is submitted as an effective user, the ACLs for the 
> effective user are checked against the queue on which the job is to be run. 
> Depending on an optional configuration, the scheduler should also check the 
> ACLs of the real user if the configuration to do so is set.
> For example, suppose my superuser name is super, and super is configured to 
> securely proxy as joe. Also suppose there is a Hadoop queue named ops which 
> only allows ACLs for super, not for joe.
> When super proxies to joe in order to submit a job to the ops queue, it will 
> fail because joe, as the effective user, does not have ACLs on the ops queue.
> In many cases this is what you want, in order to protect queues that joe 
> should not be using.
> However, there are times when super may need to proxy to many users, and the 
> client running as super just wants to use the ops queue because the ops queue 
> is already dedicated to the client's purpose, and, to keep the ops queue 
> dedicated to that purpose, super doesn't want to open up ACLs to joe in 
> general on the ops queue. Without this functionality, in this case, the 
> client running as super needs to figure out which queue each user has ACLs 
> opened up for, and then coordinate with other tasks using those queues.






[jira] [Comment Edited] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-09-20 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416853#comment-17416853
 ] 

Eric Payne edited comment on YARN-1115 at 9/20/21, 4:44 PM:


[~gandras], thank you a lot for the review!
{quote}
* Submitting an app without a proxy user:
  ** user is the real user
  ** realUser is null
* Submitting an app with a proxy user:
  ** user is the proxy user
  ** realUser is the real user
{quote}
Yes, you are right. It is confusing, especially when you say it that way ;-)
-I re-defined the {{user}} variable from {{String}} to be 
{{UserGroupInformation}} in order to reduce the number of code changes. Maybe 
that wasn't the best approach.-


was (Author: eepayne):
[~gandras], thank you a lot for the review!
{quote}
* Submitting an app without a proxy user:
  ** user is the real user
  ** realUser is null
* Submitting an app with a proxy user:
  ** user is the proxy user
  ** realUser is the real user
{quote}
Yes, you are right. It is confusing, especially when you say it that way ;-)
I re-defined the {{user}} variable from {{String}} to be 
{{UserGroupInformation}} in order to reduce the number of code changes. Maybe 
that wasn't the best approach.

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Attachments: YARN-1115.001.patch
>
>
> In the framework for secure implementation using UserGroupInformation.doAs 
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html),
>  a trusted superuser can submit jobs on behalf of another user in a secure 
> way. In this framework, the superuser is referred to as the real user and the 
> proxied user is referred to as the effective user.
> Currently when a job is submitted as an effective user, the ACLs for the 
> effective user are checked against the queue on which the job is to be run. 
> Depending on an optional configuration, the scheduler should also check the 
> ACLs of the real user if the configuration to do so is set.
> For example, suppose my superuser name is super, and super is configured to 
> securely proxy as joe. Also suppose there is a Hadoop queue named ops which 
> only allows ACLs for super, not for joe.
> When super proxies to joe in order to submit a job to the ops queue, it will 
> fail because joe, as the effective user, does not have ACLs on the ops queue.
> In many cases this is what you want, in order to protect queues that joe 
> should not be using.
> However, there are times when super may need to proxy to many users, and the 
> client running as super just wants to use the ops queue because the ops queue 
> is already dedicated to the client's purpose, and, to keep the ops queue 
> dedicated to that purpose, super doesn't want to open up ACLs to joe in 
> general on the ops queue. Without this functionality, in this case, the 
> client running as super needs to figure out which queue each user has ACLs 
> opened up for, and then coordinate with other tasks using those queues.






[jira] [Comment Edited] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-09-17 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416853#comment-17416853
 ] 

Eric Payne edited comment on YARN-1115 at 9/17/21, 6:43 PM:


[~gandras], thank you a lot for the review!
{quote}
* Submitting an app without a proxy user:
  ** user is the real user
  ** realUser is null
* Submitting an app with a proxy user:
  ** user is the proxy user
  ** realUser is the real user
{quote}
Yes, you are right. It is confusing, especially when you say it that way ;-)
I re-defined the {{user}} variable from {{String}} to be 
{{UserGroupInformation}} in order to reduce the number of code changes. Maybe 
that wasn't the best approach.


was (Author: eepayne):
[~gandras], thank you a lot for the review!
{quote}
* Submitting an app without a proxy user:
   * user is the real user
   * realUser is null
* Submitting an app with a proxy user:
   * user is the proxy user
   * realUser is the real user
{quote}
Yes, you are right. It is confusing, especially when you say it that way ;-)
I re-defined the {{user}} variable from {{String}} to be 
{{UserGroupInformation}} in order to reduce the number of code changes. Maybe 
that wasn't the best approach.

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Attachments: YARN-1115.001.patch
>
>
> In the framework for secure implementation using UserGroupInformation.doAs 
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html),
>  a trusted superuser can submit jobs on behalf of another user in a secure 
> way. In this framework, the superuser is referred to as the real user and the 
> proxied user is referred to as the effective user.
> Currently when a job is submitted as an effective user, the ACLs for the 
> effective user are checked against the queue on which the job is to be run. 
> Depending on an optional configuration, the scheduler should also check the 
> ACLs of the real user if the configuration to do so is set.
> For example, suppose my superuser name is super, and super is configured to 
> securely proxy as joe. Also suppose there is a Hadoop queue named ops which 
> only allows ACLs for super, not for joe.
> When super proxies to joe in order to submit a job to the ops queue, it will 
> fail because joe, as the effective user, does not have ACLs on the ops queue.
> In many cases this is what you want, in order to protect queues that joe 
> should not be using.
> However, there are times when super may need to proxy to many users, and the 
> client running as super just wants to use the ops queue because the ops queue 
> is already dedicated to the client's purpose, and, to keep the ops queue 
> dedicated to that purpose, super doesn't want to open up ACLs to joe in 
> general on the ops queue. Without this functionality, in this case, the 
> client running as super needs to figure out which queue each user has ACLs 
> opened up for, and then coordinate with other tasks using those queues.






[jira] [Commented] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-09-17 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416853#comment-17416853
 ] 

Eric Payne commented on YARN-1115:
--

[~gandras], thank you a lot for the review!
{quote}
* Submitting an app without a proxy user:
   * user is the real user
   * realUser is null
* Submitting an app with a proxy user:
   * user is the proxy user
   * realUser is the real user
{quote}
Yes, you are right. It is confusing, especially when you say it that way ;-)
I re-defined the {{user}} variable from {{String}} to be 
{{UserGroupInformation}} in order to reduce the number of code changes. Maybe 
that wasn't the best approach.

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Attachments: YARN-1115.001.patch
>
>
> In the framework for secure implementation using UserGroupInformation.doAs 
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html),
>  a trusted superuser can submit jobs on behalf of another user in a secure 
> way. In this framework, the superuser is referred to as the real user and the 
> proxied user is referred to as the effective user.
> Currently when a job is submitted as an effective user, the ACLs for the 
> effective user are checked against the queue on which the job is to be run. 
> Depending on an optional configuration, the scheduler should also check the 
> ACLs of the real user if the configuration to do so is set.
> For example, suppose my superuser name is super, and super is configured to 
> securely proxy as joe. Also suppose there is a Hadoop queue named ops which 
> only allows ACLs for super, not for joe.
> When super proxies to joe in order to submit a job to the ops queue, it will 
> fail because joe, as the effective user, does not have ACLs on the ops queue.
> In many cases this is what you want, in order to protect queues that joe 
> should not be using.
> However, there are times when super may need to proxy to many users, and the 
> client running as super just wants to use the ops queue because the ops queue 
> is already dedicated to the client's purpose, and, to keep the ops queue 
> dedicated to that purpose, super doesn't want to open up ACLs to joe in 
> general on the ops queue. Without this functionality, in this case, the 
> client running as super needs to figure out which queue each user has ACLs 
> opened up for, and then coordinate with other tasks using those queues.






[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-16 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416314#comment-17416314
 ] 

Eric Payne commented on YARN-10935:
---

OK, attached branch-3.2 patch.
Thanks [~ebadger].

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, 
> YARN-10935.003.patch, YARN-10935.branch-2.10.003.patch, 
> YARN-10935.branch-3.2.003.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the 
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max 
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all 
> resources in the parent queue's hierarchy. This app will consume all of the 
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per 
> User}}* limit is much more than the *{{Max Application Master Resources}}* 
> limit.






[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-16 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10935:
--
Attachment: YARN-10935.branch-3.2.003.patch

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, 
> YARN-10935.003.patch, YARN-10935.branch-2.10.003.patch, 
> YARN-10935.branch-3.2.003.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the 
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max 
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all 
> resources in the parent queue's hierarchy. This app will consume all of the 
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per 
> User}}* limit is much more than the *{{Max Application Master Resources}}* 
> limit.






[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-16 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416276#comment-17416276
 ] 

Eric Payne commented on YARN-10935:
---

Attaching the branch-2.10 patch. Will look into the branch-3.2 patch.

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, 
> YARN-10935.003.patch, YARN-10935.branch-2.10.003.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the 
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max 
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all 
> resources in the parent queue's hierarchy. This app will consume all of the 
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per 
> User}}* limit is much more than the *{{Max Application Master Resources}}* 
> limit.






[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-16 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10935:
--
Attachment: YARN-10935.branch-2.10.003.patch

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, 
> YARN-10935.003.patch, YARN-10935.branch-2.10.003.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the 
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max 
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all 
> resources in the parent queue's hierarchy. This app will consume all of the 
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per 
> User}}* limit is much more than the *{{Max Application Master Resources}}* 
> limit.






[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-15 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415771#comment-17415771
 ] 

Eric Payne commented on YARN-10935:
---

Thanks for the reviews, [~ahussein] and [~ebadger]. Branch-2.10 needs its own 
patch. I'm working on that now.

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, 
> YARN-10935.003.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all
> resources in the parent queue's hierarchy. This app will consume all of the
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per
> User}}* limit is much higher than the *{{Max Application Master Resources}}*
> limit.






[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-13 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414291#comment-17414291
 ] 

Eric Payne commented on YARN-10935:
---

[~prabhujoseph], [~sunilg], [~ahussein], [~ebadger], [~snemeth], [~zhuqi], can 
I ask for a review please?

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, 
> YARN-10935.003.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all
> resources in the parent queue's hierarchy. This app will consume all of the
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per
> User}}* limit is much higher than the *{{Max Application Master Resources}}*
> limit.






[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-10 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10935:
--
Attachment: YARN-10935.003.patch

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, 
> YARN-10935.003.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all
> resources in the parent queue's hierarchy. This app will consume all of the
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per
> User}}* limit is much higher than the *{{Max Application Master Resources}}*
> limit.






[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-10 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413415#comment-17413415
 ] 

Eric Payne commented on YARN-10935:
---

I attached v3 of the patch.

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, 
> YARN-10935.003.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all
> resources in the parent queue's hierarchy. This app will consume all of the
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per
> User}}* limit is much higher than the *{{Max Application Master Resources}}*
> limit.






[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-10 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413374#comment-17413374
 ] 

Eric Payne commented on YARN-10935:
---

I suspect I could use {{fitsIn}} instead of 
{{isAnyMajorResourceZeroOrNegative}}. I'll check it out.

Also, I'll address the whitespace and checkstyle warnings.
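For context, the difference between the two checks named above can be sketched roughly as follows. These are simplified stand-ins, not the actual methods from Hadoop's Resources/ResourceCalculator classes, which operate on Resource objects:

```java
// Simplified stand-ins for the two checks named above; the real methods live
// in Hadoop's resource utility classes and operate on Resource objects.
public class ResourceCheckSketch {

    // fitsIn-style check: true only if every component of 'a' fits inside 'b'.
    public static boolean fitsIn(long memA, long vcA, long memB, long vcB) {
        return memA <= memB && vcA <= vcB;
    }

    // isAnyMajorResourceZeroOrNegative-style check: true if any major
    // resource component has been exhausted.
    public static boolean isAnyMajorResourceZeroOrNegative(long mem, long vc) {
        return mem <= 0 || vc <= 0;
    }

    public static void main(String[] args) {
        // A 2048 MB / 1 vcore request against 256 MB / 729 vcores of headroom:
        // nothing is zero yet, but the request still does not fit, which is
        // why the fitsIn-style check is the stricter of the two.
        System.out.println(isAnyMajorResourceZeroOrNegative(256, 729)); // false
        System.out.println(fitsIn(2048, 1, 256, 729));                  // false
    }
}
```

The example numbers are illustrative; the point is that {{fitsIn}} rejects a request as soon as any single component exceeds what is available, while the zero-or-negative check only fires once a resource is fully exhausted.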

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all
> resources in the parent queue's hierarchy. This app will consume all of the
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per
> User}}* limit is much higher than the *{{Max Application Master Resources}}*
> limit.






[jira] [Commented] (YARN-9975) Support proxy ACL user for CapacityScheduler

2021-09-10 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413370#comment-17413370
 ] 

Eric Payne commented on YARN-9975:
--

AFAICT, the requirements outlined in [zhoukang's comment in 
YARN-9698|https://issues.apache.org/jira/browse/YARN-9698?focusedCommentId=16893744&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16893744]
 and in this JIRA are covered by HADOOP-17857. I am fine with closing this one 
as a dup of HADOOP-17857.
[~cane], Thoughts?

> Support proxy ACL user for CapacityScheduler
> 
>
> Key: YARN-9975
> URL: https://issues.apache.org/jira/browse/YARN-9975
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
>
> As commented in YARN-9698, I am opening a new JIRA for the proxy user feature.
> The background is that we have a long-running sql thriftserver for many users:
> {quote}{{user -> sql proxy -> sql thriftserver}}{quote}
> But we do not have keytabs for all users on 'sql proxy'. We just use a super 
> user like 'sql_prc' to submit the 'sql thriftserver' application. To support 
> this, we should change the scheduler to support a proxy user ACL.






[jira] [Commented] (YARN-10903) Too many "Failed to accept allocation proposal" because of wrong Headroom check for DRF

2021-09-10 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413346#comment-17413346
 ] 

Eric Payne commented on YARN-10903:
---

Thanks [~jackwangcs] for your patience and thanks for fixing this bug. The 
changes LGTM.
+1.

> Too many "Failed to accept allocation proposal" because of wrong Headroom 
> check for DRF
> ---
>
> Key: YARN-10903
> URL: https://issues.apache.org/jira/browse/YARN-10903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: jackwangcs
>Assignee: jackwangcs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The headroom check in  `ParentQueue.canAssign` and 
> `RegularContainerAllocator#checkHeadroom` does not consider the DRF cases.
> This will cause a lot of "Failed to accept allocation proposal" when a queue 
> is near-fully used. 
> In the log:
> Headroom: memory:256, vCores:729
> Request: memory:56320, vCores:5
> clusterResource: memory:673966080, vCores:110494
> If we use DRF, then 
> {code:java}
> Resources.greaterThanOrEqual(rc, clusterResource, Resources.add(
> currentResourceLimits.getHeadroom(), resourceCouldBeUnReserved),
> required); {code}
> will be true, but in fact we cannot allocate resources to the request due to 
> the max limit (not enough memory).
> {code:java}
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1626747977559_95859 
> headRoom= currentConsumption=0
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalityAppPlacementAllocator:
>   Request={AllocationRequestId: -1, Priority: 1, Capability: <memory:56320, vCores:5>, # Containers: 19, Location: *, Relax Locality: true, Execution 
> Type Request: null, Node Label Expression: prod-best-effort-node}
> .
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Try to commit allocation proposal=New 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
>  ALLOCATED=[(Application=appattempt_1626747977559_95859_01; 
> Node=:8041; Resource=)]
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager:
>  userLimit is fetched. userLimit=, 
> userSpecificUserLimit=, 
> schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> Headroom calculation for user x:  userLimit= 
> queueMaxAvailRes= consumed= 
> partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
>  Used resource= exceeded maxResourceLimit of the 
> queue =
> 2021-07-21 23:49:39,013 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
>  {code}
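The numbers in the log above show why the check passes: under DRF the headroom comparison is made on dominant shares, so the plentiful vcores mask the exhausted memory. A small illustrative sketch, using the values from the log (this is not Hadoop's actual DominantResourceCalculator implementation):

```java
// Illustrative sketch of the DRF comparison described above, using the values
// from the issue's log; not Hadoop's actual DominantResourceCalculator.
public class DrfHeadroomSketch {

    // Dominant share: the largest ratio of any resource to the cluster total.
    public static double dominantShare(long mem, long vc,
                                       long clusterMem, long clusterVc) {
        return Math.max((double) mem / clusterMem, (double) vc / clusterVc);
    }

    // Component-wise check: does the request actually fit in the headroom?
    public static boolean fitsIn(long reqMem, long reqVc,
                                 long availMem, long availVc) {
        return reqMem <= availMem && reqVc <= availVc;
    }

    public static void main(String[] args) {
        long clusterMem = 673966080L, clusterVc = 110494L;  // cluster total
        long headroomMem = 256L, headroomVc = 729L;         // headroom in the log
        long reqMem = 56320L, reqVc = 5L;                   // request in the log

        // DRF-style check: the headroom's dominant share (driven by the many
        // free vcores) exceeds the request's, so a proposal is generated...
        boolean drfPasses =
            dominantShare(headroomMem, headroomVc, clusterMem, clusterVc)
                >= dominantShare(reqMem, reqVc, clusterMem, clusterVc);

        // ...but the request does not fit on memory, so the commit fails.
        boolean actuallyFits = fitsIn(reqMem, reqVc, headroomMem, headroomVc);

        System.out.println("DRF check passes: " + drfPasses);  // true
        System.out.println("request fits: " + actuallyFits);   // false
    }
}
```

This is the gap the patch targets: the proposal stage compares dominant shares while the commit stage enforces per-resource limits, producing repeated "Failed to accept allocation proposal" cycles when a queue is nearly full on one resource.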






[jira] [Commented] (YARN-10903) Too many "Failed to accept allocation proposal" because of wrong Headroom check for DRF

2021-09-10 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413222#comment-17413222
 ] 

Eric Payne commented on YARN-10903:
---

Thanks [~jackwangcs] and [~Tao Yang] for raising the issue and for reviewing 
and commenting. Headroom calculations are very sensitive, and any changes could 
have unforeseen side effects. I will take some time today and review.

> Too many "Failed to accept allocation proposal" because of wrong Headroom 
> check for DRF
> ---
>
> Key: YARN-10903
> URL: https://issues.apache.org/jira/browse/YARN-10903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: jackwangcs
>Assignee: jackwangcs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The headroom check in  `ParentQueue.canAssign` and 
> `RegularContainerAllocator#checkHeadroom` does not consider the DRF cases.
> This will cause a lot of "Failed to accept allocation proposal" when a queue 
> is near-fully used. 
> In the log:
> Headroom: memory:256, vCores:729
> Request: memory:56320, vCores:5
> clusterResource: memory:673966080, vCores:110494
> If we use DRF, then 
> {code:java}
> Resources.greaterThanOrEqual(rc, clusterResource, Resources.add(
> currentResourceLimits.getHeadroom(), resourceCouldBeUnReserved),
> required); {code}
> will be true, but in fact we cannot allocate resources to the request due to 
> the max limit (not enough memory).
> {code:java}
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1626747977559_95859 
> headRoom= currentConsumption=0
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalityAppPlacementAllocator:
>   Request={AllocationRequestId: -1, Priority: 1, Capability: <memory:56320, vCores:5>, # Containers: 19, Location: *, Relax Locality: true, Execution 
> Type Request: null, Node Label Expression: prod-best-effort-node}
> .
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Try to commit allocation proposal=New 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
>  ALLOCATED=[(Application=appattempt_1626747977559_95859_01; 
> Node=:8041; Resource=)]
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager:
>  userLimit is fetched. userLimit=, 
> userSpecificUserLimit=, 
> schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> Headroom calculation for user x:  userLimit= 
> queueMaxAvailRes= consumed= 
> partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
>  Used resource= exceeded maxResourceLimit of the 
> queue =
> 2021-07-21 23:49:39,013 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
>  {code}






[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-10 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10935:
--
Attachment: YARN-10935.002.patch

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all
> resources in the parent queue's hierarchy. This app will consume all of the
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per
> User}}* limit is much higher than the *{{Max Application Master Resources}}*
> limit.






[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-07 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10935:
--
Attachment: YARN-10935.001.patch

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all
> resources in the parent queue's hierarchy. This app will consume all of the
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per
> User}}* limit is much higher than the *{{Max Application Master Resources}}*
> limit.






[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-07 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411398#comment-17411398
 ] 

Eric Payne commented on YARN-10935:
---

For example, In the following screenshot, the advertising queue is a child of 
root and a parent of 3 sub-queues. One of the sub-queues has consumed all of 
the advertising parent queue's resources. The second sub-queue has submitted 
two apps. One of them is schedulable and one is non-schedulable. The second app 
is non-schedulable because starting the app would put the queue above the 
queue's AM limit:

 !Screen Shot 2021-09-07 at 12.49.52 PM.png! 

See that the second app can't start because of the following:

 !Screen Shot 2021-09-07 at 12.55.37 PM.png! 
Note that, in this example, the max queue AM limit should never go below 2GB 
memory and 16 vCores.
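One plausible way to express the invariant described here is to floor the queue-wide AM limit at the per-user AM limit. This is a hypothetical sketch, not the actual patch; names and units are illustrative:

```java
// Hypothetical sketch of the invariant discussed above: the queue-wide AM
// resource limit should never drop below the per-user AM limit, even when
// the parent hierarchy has exhausted one resource under DRF. Not the actual
// patch; names and units are illustrative.
public class AmLimitFloorSketch {

    public static long effectiveQueueAmLimitMb(long computedQueueAmLimitMb,
                                               long perUserAmLimitMb) {
        // When the parent queue is full on one resource, the computed queue
        // AM limit can collapse toward zero; never let it go below the
        // per-user limit.
        return Math.max(computedQueueAmLimitMb, perUserAmLimitMb);
    }

    public static void main(String[] args) {
        // Parent queue full on memory: the naive queue AM limit collapses
        // to 0 MB, but the 2 GB per-user limit from the example still holds.
        System.out.println(effectiveQueueAmLimitMb(0L, 2048L)); // 2048
    }
}
```

With this floor in place, an app whose AM fits under the per-user limit stays schedulable even when the computed queue limit would otherwise shrink below it.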


> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png
>
>
> This happens when DRF is enabled and all of one resource is consumed but the
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all
> resources in the parent queue's hierarchy. This app will consume all of the
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per
> User}}* limit is much higher than the *{{Max Application Master Resources}}*
> limit.






[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-07 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10935:
--
Attachment: Screen Shot 2021-09-07 at 12.55.37 PM.png

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png
>
>
> This happens when DRF is enabled and all of one resource is consumed but the
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all
> resources in the parent queue's hierarchy. This app will consume all of the
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per
> User}}* limit is much higher than the *{{Max Application Master Resources}}*
> limit.






[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-07 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10935:
--
Attachment: Screen Shot 2021-09-07 at 12.49.52 PM.png

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png
>
>
> This happens when DRF is enabled and all of one resource is consumed but the
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all
> resources in the parent queue's hierarchy. This app will consume all of the
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per
> User}}* limit is much higher than the *{{Max Application Master Resources}}*
> limit.






[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-07 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10935:
--
Summary: AM Total Queue Limit goes below per-user AM Limit if parent is 
full.  (was: AM Total Queue Limit goes below per-uwer AM Limit if parent is 
full.)

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
>
> This happens when DRF is enabled and all of one resource is consumed but the
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all
> resources in the parent queue's hierarchy. This app will consume all of the
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per
> User}}* limit is much higher than the *{{Max Application Master Resources}}*
> limit.






[jira] [Created] (YARN-10935) AM Total Queue Limit goes below per-uwer AM Limit if parent is full.

2021-09-07 Thread Eric Payne (Jira)
Eric Payne created YARN-10935:
-

 Summary: AM Total Queue Limit goes below per-uwer AM Limit if 
parent is full.
 Key: YARN-10935
 URL: https://issues.apache.org/jira/browse/YARN-10935
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler, capacityscheduler
Reporter: Eric Payne


This happens when DRF is enabled and all of one resource is consumed but the 
second resource still has plenty available.

This is reproducible by setting up a parent queue where the capacity and max 
capacity are the same, with 2 or more sub-queues whose max capacity is 100%.

In one of the sub-queues, start a long-running app that consumes all resources 
in the parent queue's hierarchy. This app will consume all of the memory but 
not very many vcores (for example).

In a second queue, submit an app. The *{{Max Application Master Resources Per 
User}}* limit is much higher than the *{{Max Application Master Resources}}* 
limit.








[jira] [Assigned] (YARN-10935) AM Total Queue Limit goes below per-uwer AM Limit if parent is full.

2021-09-07 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne reassigned YARN-10935:
-

Assignee: Eric Payne

> AM Total Queue Limit goes below per-uwer AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
>
> This happens when DRF is enabled and all of one resource is consumed but the
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all
> resources in the parent queue's hierarchy. This app will consume all of the
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per
> User}}* limit is much higher than the *{{Max Application Master Resources}}*
> limit.






[jira] [Commented] (YARN-10899) control whether non-exclusive allocation is used

2021-08-31 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407699#comment-17407699
 ] 

Eric Payne commented on YARN-10899:
---

{quote}
 When the node label expression is an empty string, only the default partition 
is used, without non-exclusive allocation;
when the node label expression is null, non-exclusive allocation is allowed.
{quote}
I am concerned that this change does not conform to the intent of the 
partition label feature.
Exclusivity and non-exclusivity are associated with a partition when it is 
created. Queues can be associated with these partitions and the default 
behavior in those queues is defined by the queue's partition labels and their 
attributes. I feel that it is improper to change the behavior based on whether 
or not the user has specified an empty string or a null.

> control whether non-exclusive allocation is used
> 
>
> Key: YARN-10899
> URL: https://issues.apache.org/jira/browse/YARN-10899
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: KWON BYUNGCHANG
>Priority: Major
> Attachments: YARN-10899.001.patch, YARN-10899.002.patch, 
> YARN-10899.003.patch, YARN-10899.004.patch
>
>
> A non-exclusive partition has the advantage of increasing resource 
> utilization.
>  But it is not useful in all use cases.
> In the case of a long-running container,
>  if it is allocated via non-exclusive allocation, it is more likely to be 
> preempted, incurring the overhead of repeating the same work.
> A way for users to control whether non-exclusive allocation is used is 
> required.
> I suggest:
>  When node label expression is an empty string, only the default partition is 
> used, without non-exclusive allocation;
>  when node label expression is null, non-exclusive allocation is allowed.
> I will attach patch.






[jira] [Commented] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator

2021-08-31 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407661#comment-17407661
 ] 

Eric Payne commented on YARN-10848:
---

bq. IMO this is breaking the existing behavior of DefaultResourceCalculator
Agreed.
Just to add my 2 cents...
IMO, the DefaultResourceCalculator should only consider the memory portion of 
the resource. This is my understanding of "correct" behavior for 
DefaultResourceCalculator.
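To illustrate the difference (a simplified, hypothetical sketch, not the real Hadoop `Resources`/`ResourceCalculator` API): a memory-only comparison, mirroring what DefaultResourceCalculator is understood to do, lets an allocation through even when vcores are exhausted, while a per-dimension comparison does not.

```java
// Simplified, hypothetical sketch (not the real Hadoop classes) of why a
// memory-only fits-in check keeps allocating when vcores run out.
public class FitsInSketch {
    // Mirrors DefaultResourceCalculator's memory-only view of a resource.
    static boolean fitsInMemoryOnly(long askMem, int askVcores,
                                    long availMem, int availVcores) {
        return askMem <= availMem; // vcores are ignored entirely
    }

    // Mirrors a per-dimension Resources.fitsIn-style check.
    static boolean fitsInAllDimensions(long askMem, int askVcores,
                                       long availMem, int availVcores) {
        return askMem <= availMem && askVcores <= availVcores;
    }

    public static void main(String[] args) {
        // Node with free memory but zero free vcores:
        System.out.println(fitsInMemoryOnly(1024, 1, 8192, 0));    // true  -> container allocated anyway
        System.out.println(fitsInAllDimensions(1024, 1, 8192, 0)); // false -> allocation skipped
    }
}
```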

> Vcore allocation problem with DefaultResourceCalculator
> ---
>
> Key: YARN-10848
> URL: https://issues.apache.org/jira/browse/YARN-10848
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler
>Reporter: Peter Bacsko
>Assignee: Minni Mittal
>Priority: Major
>  Labels: pull-request-available
> Attachments: TestTooManyContainers.java
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> If we use DefaultResourceCalculator, then Capacity Scheduler keeps allocating 
> containers even if we run out of vcores.
> CS checks the available resources in two places. The first check is 
> {{CapacityScheduler.allocateContainerOnSingleNode()}}:
> {noformat}
> if (calculator.computeAvailableContainers(Resources
> .add(node.getUnallocatedResource(), 
> node.getTotalKillableResources()),
> minimumAllocation) <= 0) {
>   LOG.debug("This node " + node.getNodeID() + " doesn't have sufficient "
>   + "available or preemptible resource for minimum allocation");
> {noformat}
> The second, which is more important, is located in 
> {{RegularContainerAllocator.assignContainer()}}:
> {noformat}
> if (!Resources.fitsIn(rc, capability, totalResource)) {
>   LOG.warn("Node : " + node.getNodeID()
>   + " does not have sufficient resource for ask : " + pendingAsk
>   + " node total capability : " + node.getTotalResource());
>   // Skip this locality request
>   ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
>   activitiesManager, node, application, schedulerKey,
>   ActivityDiagnosticConstant.
>   NODE_TOTAL_RESOURCE_INSUFFICIENT_FOR_REQUEST
>   + getResourceDiagnostics(capability, totalResource),
>   ActivityLevel.NODE);
>   return ContainerAllocation.LOCALITY_SKIPPED;
> }
> {noformat}
> Here, {{rc}} is the resource calculator instance, the other two values are:
> {noformat}
> Resource capability = pendingAsk.getPerAllocationResource();
> Resource available = node.getUnallocatedResource();
> {noformat}
> There is a repro unit test attached to this case, which can demonstrate the 
> problem. The root cause is that we pass the resource calculator to 
> {{Resource.fitsIn()}}. Instead, we should use an overridden version, just 
> like in {{FSAppAttempt.assignContainer()}}:
> {noformat}
>// Can we allocate a container on this node?
> if (Resources.fitsIn(capability, available)) {
>   // Inform the application of the new container for this request
>   RMContainer allocatedContainer =
>   allocate(type, node, schedulerKey, pendingAsk,
>   reservedContainer);
> {noformat}
> In CS, if we switch to DominantResourceCalculator OR use 
> {{Resources.fitsIn()}} without the calculator in 
> {{RegularContainerAllocator.assignContainer()}}, that fixes the failing unit 
> test (see {{testTooManyContainers()}} in {{TestTooManyContainers.java}}).






[jira] [Commented] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-08-26 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405298#comment-17405298
 ] 

Eric Payne commented on YARN-1115:
--

The pre-commit build failed because of the dependency on HADOOP-17857. I should 
not have moved the JIRA to the "submit patch" stage, but I wanted to put the 
patch up because I'm hoping I can get some eyeballs on it and some input as to 
whether or not it meets the requirements of YARN-9975.
[~snemeth], thoughts?
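The general shape of the check is roughly the following. This is an illustrative sketch only; the real implementation is in YARN-1115.001.patch, and the method and parameter names here are hypothetical.

```java
import java.util.Set;

// Illustrative sketch of an optional real-user ACL fallback; names are
// hypothetical and not taken from the actual patch.
public class RealUserAclSketch {
    static boolean hasQueueAccess(String effectiveUser, String realUser,
                                  Set<String> queueAcl, boolean checkRealUserAcl) {
        if (queueAcl.contains(effectiveUser)) {
            return true; // normal effective-user ACL check
        }
        // Optional fallback: accept if the proxying (real) user holds the ACL.
        return checkRealUserAcl && realUser != null && queueAcl.contains(realUser);
    }

    public static void main(String[] args) {
        Set<String> opsAcl = Set.of("super"); // queue ACL lists only the superuser
        System.out.println(hasQueueAccess("joe", "super", opsAcl, false)); // false: joe rejected
        System.out.println(hasQueueAccess("joe", "super", opsAcl, true));  // true: real-user check passes
    }
}
```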

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Attachments: YARN-1115.001.patch
>
>
> In the framework for secure implementation using UserGroupInformation.doAs 
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html),
>  a trusted superuser can submit jobs on behalf of another user in a secure 
> way. In this framework, the superuser is referred to as the real user and the 
> proxied user is referred to as the effective user.
> Currently when a job is submitted as an effective user, the ACLs for the 
> effective user are checked against the queue on which the job is to be run. 
> Depending on an optional configuration, the scheduler should also check the 
> ACLs of the real user if the configuration to do so is set.
> For example, suppose my superuser name is super, and super is configured to 
> securely proxy as joe. Also suppose there is a Hadoop queue named ops which 
> only allows ACLs for super, not for joe.
> When super proxies to joe in order to submit a job to the ops queue, it will 
> fail because joe, as the effective user, does not have ACLs on the ops queue.
> In many cases this is what you want, in order to protect queues that joe 
> should not be using.
> However, there are times when super may need to proxy to many users, and the 
> client running as super just wants to use the ops queue because the ops queue 
> is already dedicated to the client's purpose, and, to keep the ops queue 
> dedicated to that purpose, super doesn't want to open up ACLs to joe in 
> general on the ops queue. Without this functionality, in this case, the 
> client running as super needs to figure out which queue each user has ACLs 
> opened up for, and then coordinate with other tasks using those queues.






[jira] [Updated] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-08-24 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-1115:
-
Description: 
In the framework for secure implementation using UserGroupInformation.doAs 
(https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html),
 a trusted superuser can submit jobs on behalf of another user in a secure way. 
In this framework, the superuser is referred to as the real user and the 
proxied user is referred to as the effective user.

Currently when a job is submitted as an effective user, the ACLs for the 
effective user are checked against the queue on which the job is to be run. 
Depending on an optional configuration, the scheduler should also check the 
ACLs of the real user if the configuration to do so is set.

For example, suppose my superuser name is super, and super is configured to 
securely proxy as joe. Also suppose there is a Hadoop queue named ops which 
only allows ACLs for super, not for joe.

When super proxies to joe in order to submit a job to the ops queue, it will 
fail because joe, as the effective user, does not have ACLs on the ops queue.

In many cases this is what you want, in order to protect queues that joe should 
not be using.

However, there are times when super may need to proxy to many users, and the 
client running as super just wants to use the ops queue because the ops queue 
is already dedicated to the client's purpose, and, to keep the ops queue 
dedicated to that purpose, super doesn't want to open up ACLs to joe in general 
on the ops queue. Without this functionality, in this case, the client running 
as super needs to figure out which queue each user has ACLs opened up for, and 
then coordinate with other tasks using those queues.


  was:
In the framework for secure implementation using UserGroupInformation.doAs 
(http://hadoop.apache.org/docs/stable/Secure_Impersonation.html), a trusted 
superuser can submit jobs on behalf of another user in a secure way. In this 
framework, the superuser is referred to as the real user and the proxied user 
is referred to as the effective user.

Currently when a job is submitted as an effective user, the ACLs for the 
effective user are checked against the queue on which the job is to be run. 
Depending on an optional configuration, the scheduler should also check the 
ACLs of the real user if the configuration to do so is set.

For example, suppose my superuser name is super, and super is configured to 
securely proxy as joe. Also suppose there is a Hadoop queue named ops which 
only allows ACLs for super, not for joe.

When super proxies to joe in order to submit a job to the ops queue, it will 
fail because joe, as the effective user, does not have ACLs on the ops queue.

In many cases this is what you want, in order to protect queues that joe should 
not be using.

However, there are times when super may need to proxy to many users, and the 
client running as super just wants to use the ops queue because the ops queue 
is already dedicated to the client's purpose, and, to keep the ops queue 
dedicated to that purpose, super doesn't want to open up ACLs to joe in general 
on the ops queue. Without this functionality, in this case, the client running 
as super needs to figure out which queue each user has ACLs opened up for, and 
then coordinate with other tasks using those queues.



> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Attachments: YARN-1115.001.patch
>
>
> In the framework for secure implementation using UserGroupInformation.doAs 
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html),
>  a trusted superuser can submit jobs on behalf of another user in a secure 
> way. In this framework, the superuser is referred to as the real user and the 
> proxied user is referred to as the effective user.
> Currently when a job is submitted as an effective user, the ACLs for the 
> effective user are checked against the queue on which the job is to be run. 
> Depending on an optional configuration, the scheduler should also check the 
> ACLs of the real user if the configuration to do so is set.
> For example, suppose my superuser name is super, and super is configured to 
> securely proxy as joe. Also suppose there is a Hadoop queue named ops which 
> only allows ACLs for super, not for joe.
> When super proxies to joe in order to submit a job to the ops queue, it will 
> fail because joe, as the effective user, does not have ACLs on the ops

[jira] [Updated] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-08-20 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-1115:
-
Attachment: YARN-1115.001.patch

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Attachments: YARN-1115.001.patch
>
>
> In the framework for secure implementation using UserGroupInformation.doAs 
> (http://hadoop.apache.org/docs/stable/Secure_Impersonation.html), a trusted 
> superuser can submit jobs on behalf of another user in a secure way. In this 
> framework, the superuser is referred to as the real user and the proxied user 
> is referred to as the effective user.
> Currently when a job is submitted as an effective user, the ACLs for the 
> effective user are checked against the queue on which the job is to be run. 
> Depending on an optional configuration, the scheduler should also check the 
> ACLs of the real user if the configuration to do so is set.
> For example, suppose my superuser name is super, and super is configured to 
> securely proxy as joe. Also suppose there is a Hadoop queue named ops which 
> only allows ACLs for super, not for joe.
> When super proxies to joe in order to submit a job to the ops queue, it will 
> fail because joe, as the effective user, does not have ACLs on the ops queue.
> In many cases this is what you want, in order to protect queues that joe 
> should not be using.
> However, there are times when super may need to proxy to many users, and the 
> client running as super just wants to use the ops queue because the ops queue 
> is already dedicated to the client's purpose, and, to keep the ops queue 
> dedicated to that purpose, super doesn't want to open up ACLs to joe in 
> general on the ops queue. Without this functionality, in this case, the 
> client running as super needs to figure out which queue each user has ACLs 
> opened up for, and then coordinate with other tasks using those queues.






[jira] [Commented] (YARN-9975) Support proxy ACL user for CapacityScheduler

2021-08-19 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17401825#comment-17401825
 ] 

Eric Payne commented on YARN-9975:
--

[~cane] / [~snemeth], for about a year, we have been running code in production 
that I think covers this requirement. I have opened HADOOP-17857 and outlined 
what our requirements were, as well as necessary changes to AccessControlList 
class. Please let me know if this fulfills your requirements.

We have also been using this in YARN to submit apps as a proxied user but 
restricting the CS queue ACLs to the real user. I can submit the patch for the 
YARN pieces to YARN-1115 if that would be of interest.

> Support proxy ACL user for CapacityScheduler
> 
>
> Key: YARN-9975
> URL: https://issues.apache.org/jira/browse/YARN-9975
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
>
> As commented in YARN-9698.
> I will open a new jira for the proxy user feature. 
> The background is that we have a long-running SQL thriftserver for many users:
> {quote}{{user->sql proxy-> sql thriftserver}}{quote}
> But we do not have keytabs for all users on 'sql proxy'. We just use a super 
> user like 'sql_prc' to submit the 'sql thriftserver' application. To support 
> this, we should change the scheduler to support a proxy-user ACL.






[jira] [Commented] (YARN-9975) Support proxy ACL user for CapacityScheduler

2021-08-18 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17401092#comment-17401092
 ] 

Eric Payne commented on YARN-9975:
--

bq. zhoukang, does this JIRA have the same requirements outlined in YARN-1115?
[~snemeth], The requirements in YARN-1115 sound similar to this one.

> Support proxy ACL user for CapacityScheduler
> 
>
> Key: YARN-9975
> URL: https://issues.apache.org/jira/browse/YARN-9975
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
>
> As commented in YARN-9698.
> I will open a new jira for the proxy user feature. 
> The background is that we have a long-running SQL thriftserver for many users:
> {quote}{{user->sql proxy-> sql thriftserver}}{quote}
> But we do not have keytabs for all users on 'sql proxy'. We just use a super 
> user like 'sql_prc' to submit the 'sql thriftserver' application. To support 
> this, we should change the scheduler to support a proxy-user ACL.






[jira] [Commented] (YARN-10505) Extend the maximum-capacity property to support Fair Scheduler migration

2021-08-11 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397407#comment-17397407
 ] 

Eric Payne commented on YARN-10505:
---

Thanks for clearing that up!

> Extend the maximum-capacity property to support Fair Scheduler migration
> 
>
> Key: YARN-10505
> URL: https://issues.apache.org/jira/browse/YARN-10505
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
>
> Currently Fair Scheduler supports the following 3 kinds of settings:
>  * Single percentage (relative to parent) i.e. "X%"
>  * A set of percentages (relative to parent) i.e. "X% cpu, Y% memory"
>  * Absolute resources i.e. "X mb, Y vcores"
> Please note that the new, recommended format does not support the single 
> percentage mode, only the last 2, like: “vcores=X, memory-mb=Y” or 
> “vcores=X%, memory-mb=Y%” respectively.
> Tasks to accomplish:
>  #  It is recommended that all three formats are supported for 
> maximum-capacity in CS after introducing weight mode.
>  # Also we want to introduce the percentage modes relative to the cluster, 
> not the parent, i.e The property root.users.maximum-capacity will mean one of 
> the following things: 
>  ## Either Parent Percentage: maximum capacity relative to its parent. If 
> it’s set to 50, then it means that the capacity is capped with respect to the 
> parent. This can be covered by the current format, no change there.
>  ## Or Cluster Percentage: maximum capacity expressed as a percentage of the 
> overall cluster capacity. This case is the new scenario, for example:
> {{yarn.scheduler.capacity.root.users.max-capacity = c:50%}}
> {{yarn.scheduler.capacity.root.users.max-capacity = c:50%, c:30%}}






[jira] [Commented] (YARN-10505) Extend the maximum-capacity property to support Fair Scheduler migration

2021-08-11 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397371#comment-17397371
 ] 

Eric Payne commented on YARN-10505:
---

bq. Please note that the new, recommended format does not support the single 
percentage mode, only the last 2, like: “vcores=X, memory-mb=Y” or “vcores=X%, 
memory-mb=Y%” respectively.
[~bteke] / [~rreti] / [~gandras],
If I understand correctly, this change would no longer support the legacy 
format of specifying a single percentage for the capacity and maximum-capacity 
properties. Is that correct? If so, I would not be able to support that change 
because it would require users who are upgrading to make changes to their queue 
configs. Logistically, from the user's perspective, this would be cumbersome. I 
think it is important to keep the single percentage to support backwards 
compatibility.
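For reference, the three value formats under discussion would look roughly like this. The property name and value syntax below follow the notation used in this discussion and are illustrative only, not necessarily the final patch's syntax.

```properties
# Legacy single percentage (relative to parent) -- the format whose removal
# would force queue-config changes on upgrade:
yarn.scheduler.capacity.root.users.maximum-capacity=50

# New recommended per-resource percentages:
yarn.scheduler.capacity.root.users.maximum-capacity=[vcores=50%, memory-mb=30%]

# New recommended absolute resources:
yarn.scheduler.capacity.root.users.maximum-capacity=[vcores=8, memory-mb=16384]
```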

> Extend the maximum-capacity property to support Fair Scheduler migration
> 
>
> Key: YARN-10505
> URL: https://issues.apache.org/jira/browse/YARN-10505
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
>
> Currently Fair Scheduler supports the following 3 kinds of settings:
>  * Single percentage (relative to parent) i.e. "X%"
>  * A set of percentages (relative to parent) i.e. "X% cpu, Y% memory"
>  * Absolute resources i.e. "X mb, Y vcores"
> Please note that the new, recommended format does not support the single 
> percentage mode, only the last 2, like: “vcores=X, memory-mb=Y” or 
> “vcores=X%, memory-mb=Y%” respectively.
> Tasks to accomplish:
>  #  It is recommended that all three formats are supported for 
> maximum-capacity in CS after introducing weight mode.
>  # Also we want to introduce the percentage modes relative to the cluster, 
> not the parent, i.e The property root.users.maximum-capacity will mean one of 
> the following things: 
>  ## Either Parent Percentage: maximum capacity relative to its parent. If 
> it’s set to 50, then it means that the capacity is capped with respect to the 
> parent. This can be covered by the current format, no change there.
>  ## Or Cluster Percentage: maximum capacity expressed as a percentage of the 
> overall cluster capacity. This case is the new scenario, for example:
> {{yarn.scheduler.capacity.root.users.max-capacity = c:50%}}
> {{yarn.scheduler.capacity.root.users.max-capacity = c:50%, c:30%}}






[jira] [Commented] (YARN-10456) RM PartitionQueueMetrics records are named QueueMetrics in Simon metrics registry

2021-07-14 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380826#comment-17380826
 ] 

Eric Payne commented on YARN-10456:
---

[~Jim_Brennan], [~ebadger], [~edfi202], [~prabhujoseph], [~BilwaST], [~snemeth] 
:
Would someone be willing to review this? Thanks!

> RM PartitionQueueMetrics records are named QueueMetrics in Simon metrics 
> registry
> -
>
> Key: YARN-10456
> URL: https://issues.apache.org/jira/browse/YARN-10456
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.3.0, 3.2.1, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: YARN-10456.001.patch
>
>
> Several queue metrics (such as AppsRunning, PendingContainers, etc.) stopped 
> working after we upgraded to 2.10.






[jira] [Updated] (YARN-10456) RM PartitionQueueMetrics records are named QueueMetrics in Simon metrics registry

2021-07-13 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10456:
--
Attachment: YARN-10456.001.patch

> RM PartitionQueueMetrics records are named QueueMetrics in Simon metrics 
> registry
> -
>
> Key: YARN-10456
> URL: https://issues.apache.org/jira/browse/YARN-10456
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.3.0, 3.2.1, 3.1.4, 2.10.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: YARN-10456.001.patch
>
>
> Several queue metrics (such as AppsRunning, PendingContainers, etc.) stopped 
> working after we upgraded to 2.10.






[jira] [Comment Edited] (YARN-10821) User limit is not calculated as per definition for preemption

2021-07-06 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376054#comment-17376054
 ] 

Eric Payne edited comment on YARN-10821 at 7/6/21, 10:00 PM:
-

{quote}
{noformat}
max ((total resources consumed by active users) / (number of active users)),
((totalQueueResourcesAvailableToUsers) * min-user-limit-pct))
{noformat}
bq. UserA is not using any resource yet, and no one else is using QueueA. Is 
not "resource usage by active users" expected to be low in this case?
{quote}
So, you know how I said that the above was a simplification? Here is another 
wrinkle in that algorithm:
{noformat}
IF queueUsage <= queueCapacity
THEN
  (total resources consumed by active users) = 
yarn.scheduler.capacity.root.QueueA.capacity
ENDIF
{noformat}
So, basically, in the use case where QueueA is empty, the algorithm becomes:
{noformat}
max ((yarn.scheduler.capacity.root.QueueA.capacity) / (number of active users)),
((totalQueueResourcesAvailableToUsers) * min-user-limit-pct))
{noformat}


was (Author: eepayne):
{quote}
{noformat}
max ((total resources consumed by active users) / (number of active users)),
((totalQueueResourcesAvailableToUsers) * min-user-limit-pct))
{noformat}
bq. UserA is not using any resource yet, and no one else is using QueueA. Is 
not "resource usage by active users" expected to be low in this case?
{quote}
So, you know how I said that the above was a simplification? Here is another 
wrinkle in that algorithm:
{noformat}
IF queueUsage <= queueCapacity
THEN
  (total resources consumed by active users) = 
yarn.scheduler.capacity.root.QueueA.capacity
ENDIF
So, basically, in the use case where QueueA is empty, the algorithm becomes:
{noformat}
max ((yarn.scheduler.capacity.root.QueueA.capacity) / (number of active users)),
((totalQueueResourcesAvailableToUsers) * min-user-limit-pct))
{noformat}

> User limit is not calculated as per definition for preemption
> -
>
> Key: YARN-10821
> URL: https://issues.apache.org/jira/browse/YARN-10821
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10821.001.patch
>
>
> Minimum user limit percent (MULP) is a soft limit by definition. Preemption 
> uses pending resources to determine the resources needed by a queue, which is 
> calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This 
> method involves headroom calculated by UsersManager#computeUserLimit. 
> However, the pending resources for preemption are limited in an unexpected 
> fashion.
>  * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is 
> calculated first:
> {code:java}
>  float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f,
>  1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1));
> {code}
>  * In UsersManager#computeUserLimit the userLimit is calculated as is 
> (currentCapacity * userLimit)
> {code:java}
>  Resource userLimitResource = Resources.max(resourceCalculator,
>  partitionResource,
>  Resources.divideAndCeil(resourceCalculator, resourceUsed,
>  usersSummedByWeight),
>  Resources.divideAndCeil(resourceCalculator,
>  Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
>  100));
> {code}
> The fewer users occupying the queue, the more prevalent and outstanding this 
> effect will be in preemption.






[jira] [Commented] (YARN-10821) User limit is not calculated as per definition for preemption

2021-07-06 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376054#comment-17376054
 ] 

Eric Payne commented on YARN-10821:
---

{quote}
{noformat}
max ((total resources consumed by active users) / (number of active users)),
((totalQueueResourcesAvailableToUsers) * min-user-limit-pct))
{noformat}
bq. UserA is not using any resource yet, and no one else is using QueueA. Is 
not "resource usage by active users" expected to be low in this case?
{quote}
So, you know how I said that the above was a simplification? Here is another 
wrinkle in that algorithm:
{noformat}
IF queueUsage <= queueCapacity
THEN
  (total resources consumed by active users) = 
yarn.scheduler.capacity.root.QueueA.capacity
ENDIF
{noformat}
So, basically, in the use case where QueueA is empty, the algorithm becomes:
{noformat}
max ((yarn.scheduler.capacity.root.QueueA.capacity) / (number of active users)),
((totalQueueResourcesAvailableToUsers) * min-user-limit-pct))
{noformat}
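Numerically, the empty-queue behavior described above can be sketched like this. This is a simplification with hypothetical names, not the real UsersManager code.

```java
// Simplified numeric sketch of the user-limit formula discussed above;
// names are illustrative, not the actual UsersManager implementation.
public class UserLimitSketch {
    static double userLimit(double consumedByActiveUsers, int numActiveUsers,
                            double queueResourcesAvailable, double minUserLimitPct) {
        double byUsage = consumedByActiveUsers / Math.max(numActiveUsers, 1);
        double byConfig = queueResourcesAvailable * minUserLimitPct;
        return Math.max(byUsage, byConfig);
    }

    public static void main(String[] args) {
        double queueCapacity = 100.0;
        // Empty queue (queueUsage <= queueCapacity): "consumed" is taken to be
        // the queue's configured capacity, so one active user gets the whole queue.
        System.out.println(userLimit(queueCapacity, 1, 100.0, 0.25)); // 100.0
        // With four active users the usage term drops to 25.0, tying the MULP term.
        System.out.println(userLimit(queueCapacity, 4, 100.0, 0.25)); // 25.0
    }
}
```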

> User limit is not calculated as per definition for preemption
> -
>
> Key: YARN-10821
> URL: https://issues.apache.org/jira/browse/YARN-10821
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10821.001.patch
>
>
> Minimum user limit percent (MULP) is a soft limit by definition. Preemption 
> uses pending resources to determine the resources needed by a queue, which is 
> calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This 
> method involves headroom calculated by UsersManager#computeUserLimit. 
> However, the pending resources for preemption are limited in an unexpected 
> fashion.
>  * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is 
> calculated first:
> {code:java}
>  float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f,
>  1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1));
> {code}
>  * In UsersManager#computeUserLimit the userLimit is calculated as is 
> (currentCapacity * userLimit)
> {code:java}
>  Resource userLimitResource = Resources.max(resourceCalculator,
>  partitionResource,
>  Resources.divideAndCeil(resourceCalculator, resourceUsed,
>  usersSummedByWeight),
>  Resources.divideAndCeil(resourceCalculator,
>  Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
>  100));
> {code}
> The fewer users occupying the queue, the more prevalent and outstanding this 
> effect will be in preemption.






[jira] [Updated] (YARN-10834) Intra-queue preemption: apps that don't use defined custom resource won't be preempted.

2021-06-28 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10834:
--
Attachment: YARN-10834.branch-2.10.001.patch

> Intra-queue preemption: apps that don't use defined custom resource won't be 
> preempted.
> ---
>
> Key: YARN-10834
> URL: https://issues.apache.org/jira/browse/YARN-10834
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Fix For: 3.4.0, 3.2.3, 3.3.2
>
> Attachments: YARN-10834.001.patch, YARN-10834.branch-2.10.001.patch
>
>
> YARN-8292 added handling of negative resources during the preemption 
> calculation phase. That JIRA hard-coded it so that for inter-(cross-)queue 
> preemption, a single resource in the vector could go negative while 
> calculating ideal assignments and preemptions. It also hard-coded it so that 
> during intra-(in-)queue preemption calculations, no resource could go 
> negative. YARN-10613 made these options configurable.
> However, in clusters where custom resources are defined, apps that don't use 
> the extended resource won't be preempted.






[jira] [Commented] (YARN-10834) Intra-queue preemption: apps that don't use defined custom resource won't be preempted.

2021-06-28 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17370890#comment-17370890
 ] 

Eric Payne commented on YARN-10834:
---

Thanks very much, [~Jim_Brennan]. I have uploaded a branch-2.10 patch.

> Intra-queue preemption: apps that don't use defined custom resource won't be 
> preempted.
> ---
>
> Key: YARN-10834
> URL: https://issues.apache.org/jira/browse/YARN-10834
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Fix For: 3.4.0, 3.2.3, 3.3.2
>
> Attachments: YARN-10834.001.patch, YARN-10834.branch-2.10.001.patch
>
>
> YARN-8292 added handling of negative resources during the preemption 
> calculation phase. That JIRA hard-coded it so that for inter-(cross-)queue 
> preemption, a single resource in the vector could go negative while 
> calculating ideal assignments and preemptions. It also hard-coded it so that 
> during intra-(in-)queue preemption calculations, no resource could go 
> negative. YARN-10613 made these options configurable.
> However, in clusters where custom resources are defined, apps that don't use 
> the extended resource won't be preempted.






[jira] [Updated] (YARN-10834) Intra-queue preemption: apps that don't use defined custom resource won't be preempted.

2021-06-25 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10834:
--
Attachment: YARN-10834.001.patch

> Intra-queue preemption: apps that don't use defined custom resource won't be 
> preempted.
> ---
>
> Key: YARN-10834
> URL: https://issues.apache.org/jira/browse/YARN-10834
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: YARN-10834.001.patch
>
>
> YARN-8292 added handling of negative resources during the preemption 
> calculation phase. That JIRA hard-coded it so that for inter-(cross-)queue 
> preemption, a single resource in the vector could go negative while 
> calculating ideal assignments and preemptions. It also hard-coded it so that 
> during intra-(in-)queue preemption calculations, no resource could go 
> negative. YARN-10613 made these options configurable.
> However, in clusters where custom resources are defined, apps that don't use 
> the extended resource won't be preempted.






[jira] [Created] (YARN-10834) Intra-queue preemption: apps that don't use defined custom resource won't be preempted.

2021-06-25 Thread Eric Payne (Jira)
Eric Payne created YARN-10834:
-

 Summary: Intra-queue preemption: apps that don't use defined 
custom resource won't be preempted.
 Key: YARN-10834
 URL: https://issues.apache.org/jira/browse/YARN-10834
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Eric Payne
Assignee: Eric Payne


YARN-8292 added handling of negative resources during the preemption 
calculation phase. That JIRA hard-coded it so that for inter-(cross-)queue 
preemption, a single resource in the vector could go negative while 
calculating ideal assignments and preemptions. It also hard-coded it so that 
during intra-(in-)queue preemption calculations, no resource could go 
negative. YARN-10613 made these options configurable.

However, in clusters where custom resources are defined, apps that don't use 
the extended resource won't be preempted.






[jira] [Commented] (YARN-10824) Title not set for JHS and NM webpages

2021-06-21 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366828#comment-17366828
 ] 

Eric Payne commented on YARN-10824:
---

bq.  I'm not sure "About the Node" is a good title for the node page.  Maybe 
"Node Info"?
If we change the title, I would make it "Node Information" so that it matches 
the name in the "NodeManager" pulldown.

> Title not set for JHS and NM webpages
> -
>
> Key: YARN-10824
> URL: https://issues.apache.org/jira/browse/YARN-10824
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rajshree Mishra
>Assignee: Bilwa S T
>Priority: Major
> Attachments: JHS URL.jpg, NM URL.jpg, YARN-10824.001.patch
>
>
> The following issue was reported by one of our internal web security check 
> tools: 
> Passing a title to the jobHistoryServer(jhs) or Nodemanager(nm) pages using a 
> url similar to:
> [https://[hostname]:[jhs_port]/jobhistory/about?title=12345%27%22]
> or 
> [https://[hostname]:[nm_port]/node?title=12345]
> sets the page title to the value supplied.
> [Image attached]






[jira] [Comment Edited] (YARN-10821) User limit is not calculated as per definition for preemption

2021-06-21 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366823#comment-17366823
 ] 

Eric Payne edited comment on YARN-10821 at 6/21/21, 8:08 PM:
-

[~gandras], can you please provide a step-by-step use case to reproduce the 
problem you are encountering?

There were a lot of factors involved in the design of the user limit 
calculations in {{UsersManager#computeUserLimit}}, and I am reluctant to change 
them because they affect resource allocation as well as preemption. Some of the 
background for the user-limit calculations can be found in YARN-5889.

It may be appropriate to modify 
{{LeafQueue#getTotalPendingResourcesConsideringUserLimit}}, but I need to have 
a better understanding of the use case. I am a little confused about what you 
are seeing:
bq. What we have also observed, is that by turning off 
minimum-user-limit-percent (setting it to 100), there was no issue. But when we 
set MULP to say 50 percent, the queue have only been granted half of its 
effective capacity, even though the other queue was using 600% of its effective 
capacity and preemption should have kicked in.
If I understand the use case, this should not have happened. The user limit is 
calculated as follows (actually, the following is a simplified calculation, but 
will do for our purpose):
{noformat}
max ((total resources consumed by active users) / (number of active users),
     (totalQueueResourcesAvailableToUsers) * min-user-limit-pct)
{noformat}
So, if there are 100 resources used by 1 active user, and the 
min-user-limit-pct is 50%, the calculation should have been as follows:
{noformat}
max(100/1, 100 * .5) = max(100, 50) = 100
{noformat}
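The worked example above can be checked with a short sketch (plain doubles rather than YARN's Resource types; the method name is illustrative, not from the YARN code base):

```java
// Sketch of the simplified user-limit formula from the comment above.
public class SimplifiedUserLimit {

    // max(used-by-active-users / active-users,
    //     queue-resources * min-user-limit-pct / 100)
    static double userLimit(double usedByActiveUsers, int numActiveUsers,
                            double queueResources, double minUserLimitPct) {
        return Math.max(usedByActiveUsers / numActiveUsers,
                queueResources * minUserLimitPct / 100.0);
    }

    public static void main(String[] args) {
        // 100 resources used by 1 active user, MULP = 50%:
        // max(100/1, 100 * 0.5) = max(100, 50) = 100
        System.out.println(userLimit(100, 1, 100, 50)); // 100.0
    }
}
```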



was (Author: eepayne):
[~gandras], can you please provide a step-by-step use case to reproduce the 
problem you are encountering?

There were a lot of factors involved in the design of the user limit 
calculations in {{UsersManager#computeUserLimit}}, and I am reluctant to change 
them because they affect resource allocation as well as preemption. Some of the 
background for the user-limit calculations can be found in YARN-5889.

It may be appropriate to modify 
{{LeafQueue#getTotalPendingResourcesConsideringUserLimit}}, but I need to have 
a better understanding of the use case. I am a little confused about what you 
are seeing:
bq. What we have also observed, is that by turning off 
minimum-user-limit-percent (setting it to 100), there was no issue. But when we 
set MULP to say 50 percent, the queue have only been granted half of its 
effective capacity, even though the other queue was using 600% of its effective 
capacity and preemption should have kicked in.
If I understand the use case, this should not have happened. The user limit is 
calculated as follows (actually, the following is a simplified calculation, but 
will do for our purpose):
{noformat}
max ((total resources consumed by active users) / (number of active users),
     (totalQueueResourcesAvailableToUsers) * min-user-limit-pct)
{noformat}
So, if there are 100 resources used by 1 active user, and the 
min-user-limit-pct is 50%, the calculation should have been as follows:

> User limit is not calculated as per definition for preemption
> -
>
> Key: YARN-10821
> URL: https://issues.apache.org/jira/browse/YARN-10821
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10821.001.patch
>
>
> Minimum user limit percent (MULP) is a soft limit by definition. Preemption 
> uses pending resources to determine the resources needed by a queue, which is 
> calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This 
> method involves headroom calculated by UsersManager#computeUserLimit. 
> However, the pending resources for preemption are limited in an unexpected 
> fashion.
>  * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is 
> calculated first:
> {code:java}
>  float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f,
>  1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1));
> {code}
>  * In UsersManager#computeUserLimit the userLimit is calculated as is 
> (currentCapacity * userLimit)
> {code:java}
>  Resource userLimitResource = Resources.max(resourceCalculator,
>  partitionResource,
>  Resources.divideAndCeil(resourceCalculator, resourceUsed,
>  usersSummedByWeight),
>  Resources.divideAndCeil(resourceCalculator,
>  Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
>  100));
> {code}
> The fewer users occupying the queue, the more prevalent and outstanding this 
> effect will be in preemption.




[jira] [Comment Edited] (YARN-10821) User limit is not calculated as per definition for preemption

2021-06-21 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366823#comment-17366823
 ] 

Eric Payne edited comment on YARN-10821 at 6/21/21, 8:07 PM:
-

[~gandras], can you please provide a step-by-step use case to reproduce the 
problem you are encountering?

There were a lot of factors involved in the design of the user limit 
calculations in {{UsersManager#computeUserLimit}}, and I am reluctant to change 
them because they affect resource allocation as well as preemption. Some of the 
background for the user-limit calculations can be found in YARN-5889.

It may be appropriate to modify 
{{LeafQueue#getTotalPendingResourcesConsideringUserLimit}}, but I need to have 
a better understanding of the use case. I am a little confused about what you 
are seeing:
bq. What we have also observed, is that by turning off 
minimum-user-limit-percent (setting it to 100), there was no issue. But when we 
set MULP to say 50 percent, the queue have only been granted half of its 
effective capacity, even though the other queue was using 600% of its effective 
capacity and preemption should have kicked in.
If I understand the use case, this should not have happened. The user limit is 
calculated as follows (actually, the following is a simplified calculation, but 
will do for our purpose):
{noformat}
max ((total resources consumed by active users) / (number of active users),
     (totalQueueResourcesAvailableToUsers) * min-user-limit-pct)
{noformat}
So, if there are 100 resources used by 1 active user, and the 
min-user-limit-pct is 50%, the calculation should have been as follows:


was (Author: eepayne):
[~gandras], can you please provide a step-by-step use case to reproduce the 
problem you are encountering?

There were a lot of factors involved in the design of the user limit 
calculations in {{UsersManager#computeUserLimit}}, and I am reluctant to change 
them because they affect resource allocation as well as preemption. Some of the 
background for the user-limit calculations can be found in YARN-5889.

It may be appropriate to modify 
{{LeafQueue#getTotalPendingResourcesConsideringUserLimit}}, but I need to have 
a better understanding of the use case. I am a little confused about what you 
are seeing:
bq. What we have also observed, is that by turning off 
minimum-user-limit-percent (setting it to 100), there was no issue. But when we 
set MULP to say 50 percent, the queue have only been granted half of its 
effective capacity, even though the other queue was using 600% of its effective 
capacity and preemption should have kicked in.
If I understand the use case, this should not have happened. The user limit is 
calculated as follows:

> User limit is not calculated as per definition for preemption
> -
>
> Key: YARN-10821
> URL: https://issues.apache.org/jira/browse/YARN-10821
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10821.001.patch
>
>
> Minimum user limit percent (MULP) is a soft limit by definition. Preemption 
> uses pending resources to determine the resources needed by a queue, which is 
> calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This 
> method involves headroom calculated by UsersManager#computeUserLimit. 
> However, the pending resources for preemption are limited in an unexpected 
> fashion.
>  * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is 
> calculated first:
> {code:java}
>  float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f,
>  1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1));
> {code}
>  * In UsersManager#computeUserLimit the userLimit is calculated as is 
> (currentCapacity * userLimit)
> {code:java}
>  Resource userLimitResource = Resources.max(resourceCalculator,
>  partitionResource,
>  Resources.divideAndCeil(resourceCalculator, resourceUsed,
>  usersSummedByWeight),
>  Resources.divideAndCeil(resourceCalculator,
>  Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
>  100));
> {code}
> The fewer users occupying the queue, the more prevalent and outstanding this 
> effect will be in preemption.






[jira] [Commented] (YARN-10821) User limit is not calculated as per definition for preemption

2021-06-21 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366823#comment-17366823
 ] 

Eric Payne commented on YARN-10821:
---

[~gandras], can you please provide a step-by-step use case to reproduce the 
problem you are encountering?

There were a lot of factors involved in the design of the user limit 
calculations in {{UsersManager#computeUserLimit}}, and I am reluctant to change 
them because they affect resource allocation as well as preemption. Some of the 
background for the user-limit calculations can be found in YARN-5889.

It may be appropriate to modify 
{{LeafQueue#getTotalPendingResourcesConsideringUserLimit}}, but I need to have 
a better understanding of the use case. I am a little confused about what you 
are seeing:
bq. What we have also observed, is that by turning off 
minimum-user-limit-percent (setting it to 100), there was no issue. But when we 
set MULP to say 50 percent, the queue have only been granted half of its 
effective capacity, even though the other queue was using 600% of its effective 
capacity and preemption should have kicked in.
If I understand the use case, this should not have happened. The user limit is 
calculated as follows:

> User limit is not calculated as per definition for preemption
> -
>
> Key: YARN-10821
> URL: https://issues.apache.org/jira/browse/YARN-10821
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10821.001.patch
>
>
> Minimum user limit percent (MULP) is a soft limit by definition. Preemption 
> uses pending resources to determine the resources needed by a queue, which is 
> calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This 
> method involves headroom calculated by UsersManager#computeUserLimit. 
> However, the pending resources for preemption are limited in an unexpected 
> fashion.
>  * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is 
> calculated first:
> {code:java}
>  float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f,
>  1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1));
> {code}
>  * In UsersManager#computeUserLimit the userLimit is calculated as is 
> (currentCapacity * userLimit)
> {code:java}
>  Resource userLimitResource = Resources.max(resourceCalculator,
>  partitionResource,
>  Resources.divideAndCeil(resourceCalculator, resourceUsed,
>  usersSummedByWeight),
>  Resources.divideAndCeil(resourceCalculator,
>  Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
>  100));
> {code}
> The fewer users occupying the queue, the more prevalent and outstanding this 
> effect will be in preemption.






[jira] [Commented] (YARN-10821) User limit is not calculated as per definition for preemption

2021-06-17 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17364966#comment-17364966
 ] 

Eric Payne commented on YARN-10821:
---

[~gandras], I am running behind. I will try to get back to this early next week.

> User limit is not calculated as per definition for preemption
> -
>
> Key: YARN-10821
> URL: https://issues.apache.org/jira/browse/YARN-10821
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10821.001.patch
>
>
> Minimum user limit percent (MULP) is a soft limit by definition. Preemption 
> uses pending resources to determine the resources needed by a queue, which is 
> calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This 
> method involves headroom calculated by UsersManager#computeUserLimit. 
> However, the pending resources for preemption are limited in an unexpected 
> fashion.
>  * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is 
> calculated first:
> {code:java}
>  float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f,
>  1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1));
> {code}
>  * In UsersManager#computeUserLimit the userLimit is calculated as is 
> (currentCapacity * userLimit)
> {code:java}
>  Resource userLimitResource = Resources.max(resourceCalculator,
>  partitionResource,
>  Resources.divideAndCeil(resourceCalculator, resourceUsed,
>  usersSummedByWeight),
>  Resources.divideAndCeil(resourceCalculator,
>  Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
>  100));
> {code}
> The fewer users occupying the queue, the more prevalent and outstanding this 
> effect will be in preemption.






[jira] [Commented] (YARN-10802) Change Capacity Scheduler minimum-user-limit-percent to accept decimal values

2021-06-14 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363215#comment-17363215
 ] 

Eric Payne commented on YARN-10802:
---

[~bteke], Thanks for raising this issue and for working on it. I have a 
question and an observation.
{quote}Capacity Scheduler's minimum-user-limit-percent only accepts integers, 
which means at most 100 users can use a single queue fairly
{quote}
This isn't exactly accurate.

Minimum user limit percent is only enforced when a queue's max capacity is 
reached _AND_ (100 / {{min-user-limit-pct}}) users are both using resources and 
asking for more resources. As long as the queue's max capacity is not reached 
_AND_ there are more resources available in the system, the 101st, 102nd, 
103rd, etc., users will be assigned resources.
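The threshold described above, the point at which min-user-limit-pct actually starts to bite, can be sketched roughly as follows (illustrative only; the real enforcement lives in the CapacityScheduler, not in this arithmetic):

```java
// Rough sketch: how many simultaneously active users it takes before each
// one is squeezed down to the min-user-limit-pct floor in a full queue.
public class MulpThreshold {

    static int usersAtFloor(int minUserLimitPct) {
        return 100 / minUserLimitPct;
    }

    public static void main(String[] args) {
        System.out.println(usersAtFloor(1));  // 100 active users share fairly
        System.out.println(usersAtFloor(25)); // 4 active users
    }
}
```

Below that count of *active* users, or below the queue's max capacity, the limit is not the binding constraint, which is why a queue with 100+ mostly-idle users behaves fine.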

So, my question is, do you have a use case where
 1. 100 users are using up the max capacity in the queue
 2. All 100 users are active (that is, requesting more resources)
 3. The 101st user comes in and is starved because, as containers are released, 
they are assigned to one of the first 100 (again, because they are all asking 
for resources)?

We have several very-heavily-used multi-tenant queues that often have 100 or 
more users running, but only a subset of them are actively requesting resources.

My observation is that when we have set the min-user-limit-pct to be 1 in a 
very highly used multi-tenant queue, the user limit grows way too slowly. The 
min-user-limit-pct is used in calculating the user limit (seen as "Max 
Resources" in the queue's pull-down menu in the RM GUI). When the queue grows 
above its capacity but is still below its max capacity, the calculations for 
user limit in {{UsersManager#computeUserLimit}} use the min-user-limit-pct to 
limit how fast the user limit can grow. The smaller the min-user-limit-pct is, 
the slower it grows. What ends up happening is that a few users want to grow 
larger, but several smaller users come in, request resources, and leave without 
ever reaching the current user limit. This process repeats because there are 
several new active users all the time, so the longer-running, larger users 
can't grow beyond a certain limit even though there are still available queue 
and cluster resources.

> Change Capacity Scheduler minimum-user-limit-percent to accept decimal values
> -
>
> Key: YARN-10802
> URL: https://issues.apache.org/jira/browse/YARN-10802
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10802.001.patch, YARN-10802.002.patch, 
> YARN-10802.003.patch, YARN-10802.004.patch
>
>
> Capacity Scheduler's minimum-user-limit-percent only accepts integers, which 
> means at most 100 users can use a single queue fairly. Using decimal values 
> could solve this problem.






[jira] [Commented] (YARN-10821) User limit is not calculated as per definition for preemption

2021-06-14 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363171#comment-17363171
 ] 

Eric Payne commented on YARN-10821:
---

{quote}

- In UsersManager#computeUserLimit the userLimit is calculated as is 
(currentCapacity * userLimit)
{code}
 Resource userLimitResource = Resources.max(resourceCalculator,
 partitionResource,
 Resources.divideAndCeil(resourceCalculator, resourceUsed,
 usersSummedByWeight),
 Resources.divideAndCeil(resourceCalculator,
 Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
 100));
{code}
{quote}
One more thing to note: another difference between the preemption and 
allocation calculations is that in the preemption path, {{resourceUsed}} in the 
above algorithm is resources used by all users whereas in the allocation path, 
it is only resources used by active users (that is, users currently asking for 
resources).

> User limit is not calculated as per definition for preemption
> -
>
> Key: YARN-10821
> URL: https://issues.apache.org/jira/browse/YARN-10821
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10821.001.patch
>
>
> Minimum user limit percent (MULP) is a soft limit by definition. Preemption 
> uses pending resources to determine the resources needed by a queue, which is 
> calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This 
> method involves headroom calculated by UsersManager#computeUserLimit. 
> However, the pending resources for preemption are limited in an unexpected 
> fashion.
>  * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is 
> calculated first:
> {code:java}
>  float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f,
>  1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1));
> {code}
>  * In UsersManager#computeUserLimit the userLimit is calculated as is 
> (currentCapacity * userLimit)
> {code:java}
>  Resource userLimitResource = Resources.max(resourceCalculator,
>  partitionResource,
>  Resources.divideAndCeil(resourceCalculator, resourceUsed,
>  usersSummedByWeight),
>  Resources.divideAndCeil(resourceCalculator,
>  Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
>  100));
> {code}
> The fewer users occupying the queue, the more prevalent and outstanding this 
> effect will be in preemption.






[jira] [Commented] (YARN-10821) User limit is not calculated as per definition for preemption

2021-06-14 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363169#comment-17363169
 ] 

Eric Payne commented on YARN-10821:
---

{quote}
In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is 
calculated first:
In UsersManager#computeUserLimit the userLimit is calculated as is 
(currentCapacity * userLimit) 
{quote}
[~gandras], thanks for raising this issue.

{{LeafQueue#getUserAMResourceLimitPerPartition}} and 
{{UsersManager#computeUserLimit}} are used to calculate different things.
{{getUserAMResourceLimitPerPartition}} is used to calculate the maximum 
resources that can be used for AMs by all apps from a single user in the 
{{LeafQueue}}
{{computeUserLimit}} is used to calculate the maximum total resources that can 
be used by all apps from a single user in the {{LeafQueue}}
{{computeUserLimit}} is used not only during calculations by the preemption 
monitor, but it is also used to calculate headroom during container allocation 
and assignment to a queue. In this way, the preemption monitor and the Capacity 
Scheduler allocations are using the same computations for each users' user 
limit.

The calculations in {{getUserAMResourceLimitPerPartition}} are more lenient 
than those in {{computeUserLimit}}. But they are calculating different limits. 
This difference is not between preemption vs. allocation, but between AM 
resources limit vs. total resources limit per user.

> User limit is not calculated as per definition for preemption
> -
>
> Key: YARN-10821
> URL: https://issues.apache.org/jira/browse/YARN-10821
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10821.001.patch
>
>
> Minimum user limit percent (MULP) is a soft limit by definition. Preemption 
> uses pending resources to determine the resources needed by a queue, which is 
> calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This 
> method involves headroom calculated by UsersManager#computeUserLimit. 
> However, the pending resources for preemption are limited in an unexpected 
> fashion.
>  * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is 
> calculated first:
> {code:java}
>  float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f,
>  1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1));
> {code}
>  * In UsersManager#computeUserLimit the userLimit is calculated as is 
> (currentCapacity * userLimit)
> {code:java}
>  Resource userLimitResource = Resources.max(resourceCalculator,
>  partitionResource,
>  Resources.divideAndCeil(resourceCalculator, resourceUsed,
>  usersSummedByWeight),
>  Resources.divideAndCeil(resourceCalculator,
>  Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
>  100));
> {code}
> The fewer users occupying the queue, the more prevalent and outstanding this 
> effect will be in preemption.






[jira] [Commented] (YARN-10821) User limit is not calculated as per definition for preemption

2021-06-14 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363004#comment-17363004
 ] 

Eric Payne commented on YARN-10821:
---

Thanks [~gandras] for bringing this up. I will take a look.

> User limit is not calculated as per definition for preemption
> -
>
> Key: YARN-10821
> URL: https://issues.apache.org/jira/browse/YARN-10821
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>
> Minimum user limit percent (MULP) is a soft limit by definition. Preemption 
> uses pending resources to determine the resources needed by a queue, which is 
> calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This 
> method involves headroom calculated by UsersManager#computeUserLimit. 
> However, the pending resources for preemption are limited in an unexpected 
> fashion.
>  * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is 
> calculated first:
> {code:java}
>  float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f,
>  1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1));
> {code}
>  * In UsersManager#computeUserLimit the userLimit is calculated as is 
> (currentCapacity * userLimit)
> {code:java}
>  Resource userLimitResource = Resources.max(resourceCalculator,
>  partitionResource,
>  Resources.divideAndCeil(resourceCalculator, resourceUsed,
>  usersSummedByWeight),
>  Resources.divideAndCeil(resourceCalculator,
>  Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
>  100));
> {code}
> The fewer users occupying the queue, the more prevalent and outstanding this 
> effect will be in preemption.





