[jira] [Commented] (YARN-9869) Create scheduling policy to auto-adjust queue elasticity based on cluster demand

2021-04-18 Thread Min Shen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324628#comment-17324628
 ] 

Min Shen commented on YARN-9869:


[~jhung], are we still planning to contribute this policy upstream?

I think the elasticity tuner has its merit, and it would be nice if we could make it 
accessible to the broader industry.

> Create scheduling policy to auto-adjust queue elasticity based on cluster 
> demand
> 
>
> Key: YARN-9869
> URL: https://issues.apache.org/jira/browse/YARN-9869
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jonathan Hung
>Priority: Major
>
> Currently LinkedIn has a policy to auto-adjust queue elasticity based on 
> real-time queue demand. We've been running this policy in production for a 
> long time and it has helped improve overall cluster utilization.
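Purely as an illustration of the idea (the actual LinkedIn policy is not included in this ticket, and every name below is hypothetical), a demand-driven tuner could periodically recompute a queue's maximum capacity from its pending demand, e.g. from a SchedulingMonitor-style thread:
{code}
// Hypothetical sketch: grow a queue's elasticity when it has pending demand and
// shrink it back toward its guaranteed capacity when demand subsides.
public class QueueElasticityTuner {
  private static final float FLOOR = 0.10f;   // never shrink max-capacity below 10%
  private static final float CEILING = 1.0f;  // never grow max-capacity above 100%

  /** Returns the new max-capacity for a queue, as a fraction of the cluster. */
  public float computeMaxCapacity(float guaranteedCapacity, float pendingDemand) {
    float target = guaranteedCapacity + pendingDemand;
    return Math.max(FLOOR, Math.min(CEILING, target));
  }
}
{code}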






[jira] [Commented] (YARN-7084) TestSchedulingMonitor#testRMStarts fails sporadically

2017-09-28 Thread Min Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185012#comment-16185012
 ] 

Min Shen commented on YARN-7084:


[~jlowe]

I think your change in the patch makes more sense.
The semantics of the original unit test were to verify that the monitor policy 
gets invoked after the service starts.
Verification with a timeout is a better approach for this test's purpose.
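For reference, a minimal sketch of that style of verification, assuming Mockito's {{timeout()}} verification mode and the mock policy used by the test (static imports from org.mockito.Mockito assumed):
{code}
// Poll the mock until the policy has been invoked at least once, instead of
// asserting immediately after service start and racing the monitor thread.
SchedulingEditPolicy mPolicy = mock(SchedulingEditPolicy.class);
// ... initialize and start the ResourceManager / SchedulingMonitor with mPolicy ...
verify(mPolicy, timeout(10000).atLeastOnce()).editSchedule();
{code}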

> TestSchedulingMonitor#testRMStarts fails sporadically
> -
>
> Key: YARN-7084
> URL: https://issues.apache.org/jira/browse/YARN-7084
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-7084.001.patch
>
>
> TestSchedulingMonitor has been failing sporadically in precommit builds.  
> Failures look like this:
> {noformat}
> Running 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.802 sec <<< 
> FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor
> testRMStarts(org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor)
>   Time elapsed: 1.728 sec  <<< FAILURE!
> org.mockito.exceptions.verification.WantedButNotInvoked: 
> Wanted but not invoked:
> schedulingEditPolicy.editSchedule();
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor.testRMStarts(TestSchedulingMonitor.java:58)
> However, there were other interactions with this mock:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.<init>(SchedulingMonitor.java:50)
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.serviceInit(SchedulingMonitor.java:61)
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.serviceInit(SchedulingMonitor.java:62)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor.testRMStarts(TestSchedulingMonitor.java:58)
> {noformat}






[jira] [Commented] (YARN-5543) ResourceManager SchedulingMonitor could potentially terminate the preemption checker thread

2016-10-28 Thread Min Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15616178#comment-15616178
 ] 

Min Shen commented on YARN-5543:


[~leftnoteasy],

Do you have more comments on this ticket?

> ResourceManager SchedulingMonitor could potentially terminate the preemption 
> checker thread
> ---
>
> Key: YARN-5543
> URL: https://issues.apache.org/jira/browse/YARN-5543
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 2.7.0, 2.6.1
>Reporter: Min Shen
>Assignee: Min Shen
>  Labels: oct16-medium
> Attachments: YARN-5543.001.patch, YARN-5543.002.patch, 
> YARN-5543.003.patch
>
>
> In SchedulingMonitor.java, when the service starts, it starts a checker 
> thread to perform Capacity Scheduler's preemption. However, the 
> implementation of this checker thread has the following issue:
> {code}
> while (!stopped && !Thread.currentThread().isInterrupted()) {
> 
> try {
>   Thread.sleep(monitorInterval);
> } catch (InterruptedException e) {
>   
>   break;
> }
> }
> {code}
> The above code snippet will terminate the checker thread whenever it is 
> interrupted. 
> We noticed in our cluster that this could lead to CapacityScheduler's 
> preemption disabled unexpectedly due to the checker thread getting terminated.
> We propose to use ScheduledExecutorService to improve the robustness of this 
> part of the code to ensure the liveness of CapacityScheduler's preemption 
> functionality.
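A minimal sketch of the proposed ScheduledExecutorService approach (field and method names here are illustrative, not the committed patch):
{code}
// java.util.concurrent imports assumed. The executor keeps invoking the policy
// at a fixed rate; an exception in one run is logged and does not permanently
// kill the preemption checker.
ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
ScheduledFuture<?> handler = ses.scheduleAtFixedRate(new Runnable() {
  @Override
  public void run() {
    try {
      invokePolicy();   // e.g. calls SchedulingEditPolicy#editSchedule()
    } catch (Throwable t) {
      LOG.error("Exception raised while executing preemption checker", t);
    }
  }
}, 0, monitorInterval, TimeUnit.MILLISECONDS);

// In serviceStop(): handler.cancel(true); ses.shutdown();
{code}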






[jira] [Commented] (YARN-4161) Capacity Scheduler : Assign single or multiple containers per heart beat driven by configuration

2016-10-27 Thread Min Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613484#comment-15613484
 ] 

Min Shen commented on YARN-4161:


The patch does not apply to trunk.
[~mayank_bansal], could you please rebase your patch?

> Capacity Scheduler : Assign single or multiple containers per heart beat 
> driven by configuration
> 
>
> Key: YARN-4161
> URL: https://issues.apache.org/jira/browse/YARN-4161
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: capacity scheduler
>Reporter: Mayank Bansal
>Assignee: Mayank Bansal
>  Labels: oct16-medium
> Attachments: YARN-4161.patch
>
>
> Capacity Scheduler right now schedules multiple containers per heartbeat if 
> there are more resources available on the node.
> This approach works fine; however, in some cases it does not distribute the load 
> across the cluster, so cluster throughput suffers. I am adding a 
> configuration-driven feature so that we can control the number of containers 
> assigned per heartbeat.






[jira] [Commented] (YARN-4899) Queue metrics of SLS capacity scheduler only activated after app submit to the queue

2016-10-27 Thread Min Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613458#comment-15613458
 ] 

Min Shen commented on YARN-4899:


The checkstyle output is inaccessible now.
We need to re-kick Jenkins to check whether these issues are relevant.

Also, it seems redundant to keep repeating the {{"counter.queue."}} prefix when 
defining these counter names:
{noformat}
  "counter.queue." + queueName + ".pending.memory",
  "counter.queue." + queueName + ".pending.cores",
  "counter.queue." + queueName + ".allocated.memory",
  "counter.queue." + queueName + ".allocated.cores" };
{noformat}
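For illustration, the repeated prefix could be factored out roughly like this (variable names are just a sketch):
{code}
// Build the per-queue counter names from one shared prefix instead of
// repeating the "counter.queue." + queueName concatenation for every entry.
String prefix = "counter.queue." + queueName + ".";
String[] counterNames = {
    prefix + "pending.memory",
    prefix + "pending.cores",
    prefix + "allocated.memory",
    prefix + "allocated.cores" };
{code}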

> Queue metrics of SLS capacity scheduler only activated after app submit to 
> the queue
> 
>
> Key: YARN-4899
> URL: https://issues.apache.org/jira/browse/YARN-4899
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>  Labels: oct16-medium
> Attachments: YARN-4899.1.patch
>
>
> We should start recording queue metrics from cluster start.






[jira] [Commented] (YARN-4896) ProportionalPreemptionPolicy needs to handle AMResourcePercentage per partition

2016-10-27 Thread Min Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613306#comment-15613306
 ] 

Min Shen commented on YARN-4896:


The patch looks OK to me.
The unit test failure does not seem to be related to this change.
Not sure whether the checkstyle warnings are relevant, as the link to the output 
no longer works.
Pinging [~leftnoteasy] to see if he has any additional comments.

> ProportionalPreemptionPolicy needs to handle AMResourcePercentage per 
> partition
> ---
>
> Key: YARN-4896
> URL: https://issues.apache.org/jira/browse/YARN-4896
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.2
>Reporter: Sunil G
>Assignee: Sunil G
>  Labels: oct16-easy
> Attachments: 0001-YARN-4896.patch, 0002-YARN-4896.patch, 
> YARN-4896.0003.patch
>
>
> In PCPP, currently we are using {{getMaxAMResourcePerQueuePercent()}} to get 
> the max AM capacity for queue to save AM Containers from preemption. As we 
> are now supporting MaxAMResourcePerQueuePercent per partition, PCPP also need 
> to handle the same.






[jira] [Updated] (YARN-5215) Scheduling containers based on external load in the servers

2016-10-27 Thread Min Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated YARN-5215:
---
Labels: oct16-hard  (was: )

> Scheduling containers based on external load in the servers
> ---
>
> Key: YARN-5215
> URL: https://issues.apache.org/jira/browse/YARN-5215
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api
>Reporter: Inigo Goiri
>  Labels: oct16-hard
> Attachments: YARN-5215.000.patch, YARN-5215.001.patch
>
>
> Currently YARN runs containers in the servers assuming that they own all the 
> resources. The proposal is to use the utilization information in the node and 
> the containers to estimate how much is consumed by external processes and 
> schedule based on this estimation.






[jira] [Updated] (YARN-5256) Add REST endpoint to support detailed NodeLabel Informations

2016-10-27 Thread Min Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated YARN-5256:
---
Labels: oct16-medium  (was: )

> Add REST endpoint to support detailed NodeLabel Informations
> 
>
> Key: YARN-5256
> URL: https://issues.apache.org/jira/browse/YARN-5256
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Reporter: Sunil G
>Assignee: Sunil G
>  Labels: oct16-medium
> Attachments: YARN-5256-YARN-3368.1.patch, 
> YARN-5256-YARN-3368.2.patch, YARN-5256.0001.patch, YARN-5256.0002.patch, 
> YARN-5256.0003.patch
>
>
> Add a new REST endpoint to fetch more detailed information about node 
> labels, such as resources, the list of nodes, etc.






[jira] [Updated] (YARN-5215) Scheduling containers based on external load in the servers

2016-10-27 Thread Min Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated YARN-5215:
---
Component/s: api

> Scheduling containers based on external load in the servers
> ---
>
> Key: YARN-5215
> URL: https://issues.apache.org/jira/browse/YARN-5215
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api
>Reporter: Inigo Goiri
> Attachments: YARN-5215.000.patch, YARN-5215.001.patch
>
>
> Currently YARN runs containers in the servers assuming that they own all the 
> resources. The proposal is to use the utilization information in the node and 
> the containers to estimate how much is consumed by external processes and 
> schedule based on this estimation.






[jira] [Updated] (YARN-5216) Expose configurable preemption policy for OPPORTUNISTIC containers running on the NM

2016-10-27 Thread Min Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated YARN-5216:
---
Labels: oct16-hard  (was: )

> Expose configurable preemption policy for OPPORTUNISTIC containers running on 
> the NM
> 
>
> Key: YARN-5216
> URL: https://issues.apache.org/jira/browse/YARN-5216
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-scheduling
>Reporter: Arun Suresh
>Assignee: Hitesh Sharma
>  Labels: oct16-hard
> Attachments: YARN5216.001.patch, yarn5216.002.patch
>
>
> Currently, the default action taken by the QueuingContainerManager, 
> introduced in YARN-2883, when a GUARANTEED Container is scheduled on an NM 
> with OPPORTUNISTIC containers using up resources, is to KILL the running 
> OPPORTUNISTIC containers.
> This JIRA proposes to expose a configurable hook to allow the NM to take a 
> different action.






[jira] [Updated] (YARN-5216) Expose configurable preemption policy for OPPORTUNISTIC containers running on the NM

2016-10-27 Thread Min Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated YARN-5216:
---
Component/s: distributed-scheduling

> Expose configurable preemption policy for OPPORTUNISTIC containers running on 
> the NM
> 
>
> Key: YARN-5216
> URL: https://issues.apache.org/jira/browse/YARN-5216
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-scheduling
>Reporter: Arun Suresh
>Assignee: Hitesh Sharma
>  Labels: oct16-hard
> Attachments: YARN5216.001.patch, yarn5216.002.patch
>
>
> Currently, the default action taken by the QueuingContainerManager, 
> introduced in YARN-2883, when a GUARANTEED Container is scheduled on an NM 
> with OPPORTUNISTIC containers using up resources, is to KILL the running 
> OPPORTUNISTIC containers.
> This JIRA proposes to expose a configurable hook to allow the NM to take a 
> different action.






[jira] [Updated] (YARN-5241) FairScheduler repeat container completed

2016-10-27 Thread Min Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated YARN-5241:
---
Labels: oct16-easy  (was: )

> FairScheduler repeat container completed
> 
>
> Key: YARN-5241
> URL: https://issues.apache.org/jira/browse/YARN-5241
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.5.0, 2.6.1, 2.8.0, 2.7.2
>Reporter: ChenFolin
>  Labels: oct16-easy
> Attachments: YARN-5241-001.patch, YARN-5241-002.patch, 
> YARN-5241-003.patch, repeatContainerCompleted.log
>
>
> The NodeManager heartbeat event NODE_UPDATE and the ApplicationMaster allocate 
> operation may cause a repeated container-completed event, which can lead to 
> incorrect behavior.
> The node's releaseContainer method can prevent a repeated release operation, 
> like:
> public synchronized void releaseContainer(Container container) {
>   if (!isValidContainer(container.getId())) {
>     LOG.error("Invalid container released " + container);
>     return;
>   }
> FSAppAttempt#containerCompleted did not prevent a repeated container-completed 
> operation.
> Detailed logs are in the attached file.
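A hypothetical guard for FSAppAttempt#containerCompleted, analogous to the isValidContainer check quoted above (the liveContainers lookup is an assumption for illustration, not the actual patch):
{code}
// Ignore a completion event for a container that has already been removed, so
// the NODE_UPDATE / allocate race does not process the same completion twice.
ContainerId containerId = rmContainer.getContainerId();
if (!liveContainers.containsKey(containerId)) {
  LOG.info("Container " + containerId + " already completed, ignoring duplicate event");
  return;
}
// ... existing completion handling (update metrics, release resources, etc.) ...
{code}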






[jira] [Commented] (YARN-5188) FairScheduler performance bug

2016-10-27 Thread Min Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612821#comment-15612821
 ] 

Min Shen commented on YARN-5188:


Cancelling the patch as it no longer applies to trunk. [~chenfolin], could you 
please rebase your patch against trunk?

> FairScheduler performance bug
> -
>
> Key: YARN-5188
> URL: https://issues.apache.org/jira/browse/YARN-5188
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.5.0
>Reporter: ChenFolin
> Attachments: YARN-5188-1.patch
>
>
> My Hadoop cluster has recently encountered a performance problem. Details 
> follow.
> There are two points which can cause this performance issue:
> 1: Application sort before assigning a container in FSLeafQueue. TreeSet is not 
> the best choice. Why not keep the collection ordered, and use binary search to 
> restore the order when an application's resource usage changes?
> 2: Queue sort and assignContainerPreCheck lead to computing the resource usage 
> of all leaf queues. Why not store the leaf queue usage in memory and update it 
> when an assign-container or release-container operation happens?
>
> The efficiency of assigning containers in the ResourceManager may fall 
> when the number of running and pending applications grows. In fact, the 
> cluster has a lot of pending MB and pending vcores, while the cluster's 
> current utilization rate may be below 20%.
> I checked the ResourceManager logs and found that every container 
> assignment may cost 5 ~ 10 ms, versus just 0 ~ 1 ms at usual times.
> I used TestFairScheduler to reproduce the scene:
> Just one queue: root.default, 10240 apps.
> Assign container avg time: 6753.9 us ( 6.7539 ms )
> Apps sort time ( FSLeafQueue : Collections.sort(runnableApps, comparator); ): 4657.01 us ( 4.657 ms )
> Compute LeafQueue resource usage: 905.171 us ( 0.905171 ms )
> With just root.default, one assign-container operation contains: ( one apps 
> sort op ) + 2 * ( compute leafqueue usage op )
> According to the above, I think the assign-container operation has 
> a performance problem.
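To illustrate point 1 above, one way to keep the runnable applications ordered without re-sorting the whole collection on every assignment is to re-insert only the application whose resource usage changed (a simplified sketch, not the actual patch):
{code}
// Keep runnableApps sorted by the scheduling comparator. When one app's usage
// changes, remove it and re-insert it at the position found by binary search,
// instead of calling Collections.sort() for every container assignment.
void updateAppPosition(List<FSAppAttempt> runnableApps, FSAppAttempt app,
    Comparator<Schedulable> comparator) {
  runnableApps.remove(app);
  int idx = Collections.binarySearch(runnableApps, app, comparator);
  if (idx < 0) {
    idx = -(idx + 1);   // binarySearch returns (-(insertion point) - 1) on a miss
  }
  runnableApps.add(idx, app);
}
{code}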






[jira] [Updated] (YARN-4835) [YARN-3368] REST API related changes for new Web UI

2016-10-27 Thread Min Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated YARN-4835:
---
Labels: oct16-hard  (was: )

> [YARN-3368] REST API related changes for new Web UI
> ---
>
> Key: YARN-4835
> URL: https://issues.apache.org/jira/browse/YARN-4835
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: webapp
>Affects Versions: YARN-3368
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>  Labels: oct16-hard
> Attachments: YARN-4835-YARN-3368.01.patch, 
> YARN-4835-YARN-3368.02.patch
>
>
> The following things need to be added for the AM-related web pages.
> 1. Support task state query param in REST URL for fetching tasks.
> 2. Support task attempt state query param in REST URL for fetching task 
> attempts.
> 3. A new REST endpoint to fetch counters for each task belonging to a job. 
> Also have a query param for counter name.
>i.e. something like :
>   {{/jobs/\{jobid\}/taskCounters}}
> 4. A REST endpoint in the NM for fetching all log files associated with a 
> container. Useful if logs are served by the NM.






[jira] [Updated] (YARN-4575) ApplicationResourceUsageReport should return ALL reserved resource

2016-10-27 Thread Min Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated YARN-4575:
---
Labels: oct16-easy  (was: )

> ApplicationResourceUsageReport should return ALL  reserved resource
> ---
>
> Key: YARN-4575
> URL: https://issues.apache.org/jira/browse/YARN-4575
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>  Labels: oct16-easy
> Attachments: 0001-YARN-4575.patch, 0002-YARN-4575.patch
>
>
> The reserved resource report in ApplicationResourceUsageReport covers only the 
> default partition; it should cover all partitions.






[jira] [Updated] (YARN-4572) TestCapacityScheduler#testHeadRoomCalculationWithDRC failing

2016-10-27 Thread Min Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated YARN-4572:
---
Labels: oct16-easy  (was: )

> TestCapacityScheduler#testHeadRoomCalculationWithDRC failing
> 
>
> Key: YARN-4572
> URL: https://issues.apache.org/jira/browse/YARN-4572
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: test, yarn
>Reporter: Bibin A Chundatt
>  Labels: oct16-easy
> Attachments: YARN-4572.1.patch
>
>
> {noformat}
> Tests run: 46, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 127.996 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler
> testHeadRoomCalculationWithDRC(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler)
>   Time elapsed: 0.189 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<6144> but was:<16384>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler.testHeadRoomCalculationWithDRC(TestCapacityScheduler.java:3041)
> {noformat}
> https://builds.apache.org/job/PreCommit-YARN-Build/10204/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_66.txt
> https://builds.apache.org/job/PreCommit-YARN-Build/10204/testReport/
> Failed with JDK 8; locally the same test passes.






[jira] [Updated] (YARN-4533) Killing applications by user in Yarn RMAdmin CLI

2016-10-27 Thread Min Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated YARN-4533:
---
Labels: oct16-medium  (was: )

> Killing applications by user in Yarn RMAdmin CLI
> 
>
> Key: YARN-4533
> URL: https://issues.apache.org/jira/browse/YARN-4533
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: applications, client
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>  Labels: oct16-medium
> Attachments: YARN-4533.001.patch
>
>
> The command looks like:
> {code}
> [-killApplicationsForUser [username]] Kill the applications of specific user.
> {code}






[jira] [Updated] (YARN-4532) Killing applications by appStates and queue in Yarn Application CLI

2016-10-27 Thread Min Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated YARN-4532:
---
Labels: oct16-medium  (was: )

> Killing applications by appStates and queue in Yarn Application CLI
> ---
>
> Key: YARN-4532
> URL: https://issues.apache.org/jira/browse/YARN-4532
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: applications, client
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>  Labels: oct16-medium
> Attachments: YARN-4532.001.patch
>
>
> The command looks like:
> {code}
> -killByAppStates  The states of application that will be killed.
> -killOfQueue  Kill the applications of specific queue.
> {code}






[jira] [Updated] (YARN-4529) Yarn CLI killing applications in batch

2016-10-27 Thread Min Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated YARN-4529:
---
Labels: oct16-easy  (was: )

> Yarn CLI killing applications in batch
> --
>
> Key: YARN-4529
> URL: https://issues.apache.org/jira/browse/YARN-4529
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: applications, client
>Affects Versions: 2.7.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>  Labels: oct16-easy
> Attachments: YARN-4529.001.patch
>
>
> We do not have a good way to kill applications conveniently when some apps are 
> started unexpectedly. At present, we have to kill them one by one. We can add 
> kill commands that can kill apps in batch, like these:
> {code}
> -killByAppStates  The states of application that will be killed.
> -killByUser  Kill running-state applications of specific 
> user.
> {code}






[jira] [Commented] (YARN-5734) OrgQueue for easy CapacityScheduler queue configuration management

2016-10-13 Thread Min Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573674#comment-15573674
 ] 

Min Shen commented on YARN-5734:


[~curino], [~subru],

As discussed offline, could you please provide feedback on the design docs we 
currently have?

> OrgQueue for easy CapacityScheduler queue configuration management
> --
>
> Key: YARN-5734
> URL: https://issues.apache.org/jira/browse/YARN-5734
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Min Shen
>Assignee: Min Shen
> Attachments: OrgQueue_Design_v0.pdf
>
>
> The current XML-based configuration mechanism in CapacityScheduler makes it 
> very inconvenient to apply any changes to the queue configurations. We saw 2 
> main drawbacks in the file-based configuration mechanism:
> # It makes it very inconvenient to automate queue configuration updates. 
> For example, in our cluster setup, we leverage the queue mapping feature from 
> YARN-2411 to route users to their dedicated organization queues. It could be 
> extremely cumbersome to keep updating the config file to manage the very 
> dynamic mapping between users and organizations.
> # Even if a user has admin permission on one specific queue, that user is 
> unable to make any queue configuration changes to resize the subqueues, 
> change queue ACLs, or create new queues. All these operations need to be 
> performed in a centralized manner by the cluster administrators.
> With these limitations, we realized the need for a more flexible 
> configuration mechanism that allows queue configurations to be stored and 
> managed more dynamically. We developed a feature internally at LinkedIn 
> which introduces the concept of a MutableConfigurationProvider. What it 
> essentially does is provide a set of configuration mutation APIs that 
> allow queue configurations to be updated externally through a set of REST APIs. 
> When performing queue configuration changes, the queue ACLs are 
> honored, which means only queue administrators can make configuration changes 
> to a given queue. MutableConfigurationProvider is implemented as a pluggable 
> interface, and we have one implementation of this interface based on the 
> Derby embedded database.
> This feature has been deployed on LinkedIn's Hadoop cluster for a year now, 
> and it has gone through several iterations of gathering feedback from users 
> and improving accordingly. With this feature, cluster administrators are able 
> to automate a lot of the queue configuration management tasks, such as setting 
> queue capacities to adjust cluster resources between queues based on 
> established resource consumption patterns, or managing and updating the 
> user-to-queue mappings. We have attached our design documentation to this 
> ticket and would like to receive feedback from the community regarding how 
> to best integrate it with the latest version of YARN.






[jira] [Created] (YARN-5734) OrgQueue for easy CapacityScheduler queue configuration management

2016-10-13 Thread Min Shen (JIRA)
Min Shen created YARN-5734:
--

 Summary: OrgQueue for easy CapacityScheduler queue configuration 
management
 Key: YARN-5734
 URL: https://issues.apache.org/jira/browse/YARN-5734
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Min Shen
Assignee: Min Shen


The current XML-based configuration mechanism in CapacityScheduler makes it 
very inconvenient to apply any changes to the queue configurations. We saw 2 
main drawbacks in the file-based configuration mechanism:
# It makes it very inconvenient to automate queue configuration updates. For 
example, in our cluster setup, we leverage the queue mapping feature from 
YARN-2411 to route users to their dedicated organization queues. It could be 
extremely cumbersome to keep updating the config file to manage the very 
dynamic mapping between users and organizations.
# Even if a user has admin permission on one specific queue, that user is 
unable to make any queue configuration changes to resize the subqueues, 
change queue ACLs, or create new queues. All these operations need to be 
performed in a centralized manner by the cluster administrators.

With these limitations, we realized the need for a more flexible 
configuration mechanism that allows queue configurations to be stored and 
managed more dynamically. We developed a feature internally at LinkedIn which 
introduces the concept of a MutableConfigurationProvider. What it essentially 
does is provide a set of configuration mutation APIs that allow queue 
configurations to be updated externally through a set of REST APIs. When 
performing queue configuration changes, the queue ACLs are honored, 
which means only queue administrators can make configuration changes to a given 
queue. MutableConfigurationProvider is implemented as a pluggable interface, 
and we have one implementation of this interface based on the Derby 
embedded database.

This feature has been deployed on LinkedIn's Hadoop cluster for a year now, and 
it has gone through several iterations of gathering feedback from users and 
improving accordingly. With this feature, cluster administrators are able to 
automate a lot of the queue configuration management tasks, such as setting 
queue capacities to adjust cluster resources between queues based on 
established resource consumption patterns, or managing and updating the 
user-to-queue mappings. We have attached our design documentation to this 
ticket and would like to receive feedback from the community regarding how to 
best integrate it with the latest version of YARN.
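The description above does not spell out the API itself; purely as a hypothetical sketch (method names and signatures are assumptions), the pluggable interface could look roughly like this:
{code}
// Hypothetical shape of the pluggable provider: the REST layer would call
// mutateConfiguration() after checking the caller against the queue ACLs, and a
// Derby-backed implementation would persist the resulting configuration.
public interface MutableConfigurationProvider {
  /** Load the persisted scheduler configuration at ResourceManager start. */
  Configuration loadConfiguration(Configuration bootstrapConf) throws IOException;

  /** Apply a set of key/value updates on behalf of the given user. */
  void mutateConfiguration(String user, Map<String, String> updates)
      throws IOException, AccessControlException;
}
{code}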






[jira] [Updated] (YARN-5543) ResourceManager SchedulingMonitor could potentially terminate the preemption checker thread

2016-08-29 Thread Min Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated YARN-5543:
---
Attachment: YARN-5543.003.patch

> ResourceManager SchedulingMonitor could potentially terminate the preemption 
> checker thread
> ---
>
> Key: YARN-5543
> URL: https://issues.apache.org/jira/browse/YARN-5543
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 2.7.0, 2.6.1
>Reporter: Min Shen
>Assignee: Min Shen
> Attachments: YARN-5543.001.patch, YARN-5543.002.patch, 
> YARN-5543.003.patch
>
>
> In SchedulingMonitor.java, when the service starts, it starts a checker 
> thread to perform Capacity Scheduler's preemption. However, the 
> implementation of this checker thread has the following issue:
> {code}
> while (!stopped && !Thread.currentThread().isInterrupted()) {
> 
> try {
>   Thread.sleep(monitorInterval);
> } catch (InterruptedException e) {
>   
>   break;
> }
> }
> {code}
> The above code snippet will terminate the checker thread whenever it is 
> interrupted. 
> We noticed in our cluster that this could lead to CapacityScheduler's 
> preemption disabled unexpectedly due to the checker thread getting terminated.
> We propose to use ScheduledExecutorService to improve the robustness of this 
> part of the code to ensure the liveness of CapacityScheduler's preemption 
> functionality.






[jira] [Commented] (YARN-5543) ResourceManager SchedulingMonitor could potentially terminate the preemption checker thread

2016-08-29 Thread Min Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446963#comment-15446963
 ] 

Min Shen commented on YARN-5543:


[~wangda],

Revised patch attached. Could you please take a look?

> ResourceManager SchedulingMonitor could potentially terminate the preemption 
> checker thread
> ---
>
> Key: YARN-5543
> URL: https://issues.apache.org/jira/browse/YARN-5543
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 2.7.0, 2.6.1
>Reporter: Min Shen
>Assignee: Min Shen
> Attachments: YARN-5543.001.patch, YARN-5543.002.patch
>
>
> In SchedulingMonitor.java, when the service starts, it starts a checker 
> thread to perform Capacity Scheduler's preemption. However, the 
> implementation of this checker thread has the following issue:
> {code}
> while (!stopped && !Thread.currentThread().isInterrupted()) {
> 
> try {
>   Thread.sleep(monitorInterval);
> } catch (InterruptedException e) {
>   
>   break;
> }
> }
> {code}
> The above code snippet will terminate the checker thread whenever it is 
> interrupted. 
> We noticed in our cluster that this could lead to CapacityScheduler's 
> preemption disabled unexpectedly due to the checker thread getting terminated.
> We propose to use ScheduledExecutorService to improve the robustness of this 
> part of the code to ensure the liveness of CapacityScheduler's preemption 
> functionality.






[jira] [Updated] (YARN-5543) ResourceManager SchedulingMonitor could potentially terminate the preemption checker thread

2016-08-29 Thread Min Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated YARN-5543:
---
Attachment: YARN-5543.002.patch

> ResourceManager SchedulingMonitor could potentially terminate the preemption 
> checker thread
> ---
>
> Key: YARN-5543
> URL: https://issues.apache.org/jira/browse/YARN-5543
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 2.7.0, 2.6.1
>Reporter: Min Shen
>Assignee: Min Shen
> Attachments: YARN-5543.001.patch, YARN-5543.002.patch
>
>
> In SchedulingMonitor.java, when the service starts, it starts a checker 
> thread to perform Capacity Scheduler's preemption. However, the 
> implementation of this checker thread has the following issue:
> {code}
> while (!stopped && !Thread.currentThread().isInterrupted()) {
> 
> try {
>   Thread.sleep(monitorInterval);
> } catch (InterruptedException e) {
>   
>   break;
> }
> }
> {code}
> The above code snippet will terminate the checker thread whenever it is 
> interrupted. 
> We noticed in our cluster that this could lead to CapacityScheduler's 
> preemption disabled unexpectedly due to the checker thread getting terminated.
> We propose to use ScheduledExecutorService to improve the robustness of this 
> part of the code to ensure the liveness of CapacityScheduler's preemption 
> functionality.






[jira] [Commented] (YARN-5543) ResourceManager SchedulingMonitor could potentially terminate the preemption checker thread

2016-08-22 Thread Min Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431395#comment-15431395
 ] 

Min Shen commented on YARN-5543:


[~leftnoteasy],

The existing test case for SchedulingMonitor verifies that it can be successfully 
initialized and started.
Do you think adding another unit test is necessary with this patch?

Also, the test failure in 
TestNodeBlacklistingOnAMFailures.testNodeBlacklistingOnAMFailure seems 
unrelated to this change.
Is this test case a known flaky one?

> ResourceManager SchedulingMonitor could potentially terminate the preemption 
> checker thread
> ---
>
> Key: YARN-5543
> URL: https://issues.apache.org/jira/browse/YARN-5543
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 2.7.0, 2.6.1
>Reporter: Min Shen
> Attachments: YARN-5543.001.patch
>
>
> In SchedulingMonitor.java, when the service starts, it starts a checker 
> thread to perform Capacity Scheduler's preemption. However, the 
> implementation of this checker thread has the following issue:
> {code}
> while (!stopped && !Thread.currentThread().isInterrupted()) {
> 
> try {
>   Thread.sleep(monitorInterval);
> } catch (InterruptedException e) {
>   
>   break;
> }
> }
> {code}
> The above code snippet will terminate the checker thread whenever it is 
> interrupted. 
> We noticed in our cluster that this could lead to CapacityScheduler's 
> preemption disabled unexpectedly due to the checker thread getting terminated.
> We propose to use ScheduledExecutorService to improve the robustness of this 
> part of the code to ensure the liveness of CapacityScheduler's preemption 
> functionality.






[jira] [Updated] (YARN-5543) ResourceManager SchedulingMonitor could potentially terminate the preemption checker thread

2016-08-21 Thread Min Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated YARN-5543:
---
Attachment: YARN-5543.001.patch

Attaching the patch with the proposed changes.

> ResourceManager SchedulingMonitor could potentially terminate the preemption 
> checker thread
> ---
>
> Key: YARN-5543
> URL: https://issues.apache.org/jira/browse/YARN-5543
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 2.7.0, 2.6.1
>Reporter: Min Shen
> Attachments: YARN-5543.001.patch
>
>
> In SchedulingMonitor.java, when the service starts, it starts a checker 
> thread to perform Capacity Scheduler's preemption. However, the 
> implementation of this checker thread has the following issue:
> {code}
> while (!stopped && !Thread.currentThread().isInterrupted()) {
> 
> try {
>   Thread.sleep(monitorInterval);
> } catch (InterruptedException e) {
>   
>   break;
> }
> }
> {code}
> The above code snippet will terminate the checker thread whenever it is 
> interrupted. 
> We noticed in our cluster that this could lead to CapacityScheduler's 
> preemption disabled unexpectedly due to the checker thread getting terminated.
> We propose to use ScheduledExecutorService to improve the robustness of this 
> part of the code to ensure the liveness of CapacityScheduler's preemption 
> functionality.






[jira] [Created] (YARN-5543) ResourceManager SchedulingMonitor could potentially terminate the preemption checker thread

2016-08-19 Thread Min Shen (JIRA)
Min Shen created YARN-5543:
--

 Summary: ResourceManager SchedulingMonitor could potentially 
terminate the preemption checker thread
 Key: YARN-5543
 URL: https://issues.apache.org/jira/browse/YARN-5543
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Affects Versions: 2.6.1, 2.7.0
Reporter: Min Shen


In SchedulingMonitor.java, when the service starts, it starts a checker thread 
to perform Capacity Scheduler's preemption. However, the implementation of this 
checker thread has the following issue:
{code}
while (!stopped && !Thread.currentThread().isInterrupted()) {

try {
  Thread.sleep(monitorInterval);
} catch (InterruptedException e) {
  
  break;
}
}
{code}
The above code snippet will terminate the checker thread whenever it is 
interrupted. 
We noticed in our cluster that this could lead to CapacityScheduler's 
preemption disabled unexpectedly due to the checker thread getting terminated.

We propose to use ScheduledExecutorService to improve the robustness of this 
part of the code to ensure the liveness of CapacityScheduler's preemption 
functionality.






[jira] [Commented] (YARN-5520) [Capacity Scheduler] Change the logic for when to trigger user/group mappings to queues

2016-08-18 Thread Min Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426845#comment-15426845
 ] 

Min Shen commented on YARN-5520:


[~Ying Zhang],

Your proposal does add more flexibility to queue mappings.
However, my only concern is related to the added complexity for admins to 
configure these mapping rules.
If the secondary queues for most users/groups are the same, it seems reasonable 
to just use {{yarn.scheduler.capacity.queue-mappings.disabled.queues}}.
If the secondary queues vary a lot between users/groups, it might be difficult 
for admins to configure these rules in the first place.

> [Capacity Scheduler] Change the logic for when to trigger user/group mappings 
> to queues
> ---
>
> Key: YARN-5520
> URL: https://issues.apache.org/jira/browse/YARN-5520
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.6.0, 2.7.0, 2.6.1
>Reporter: Min Shen
>
> YARN-2411 introduced the Capacity Scheduler feature that supports user/group-based 
> mappings to queues.
> In the original implementation, the configuration key 
> {{yarn.scheduler.capacity.queue-mappings-override.enable}} was added to 
> control when to enable overriding user-requested queues.
> However, even if this configuration is set to false, queue overriding can 
> still happen if the user didn't request any specific queue or chose to 
> simply submit the job to the "default" queue, according to the following if 
> condition which triggers queue overriding:
> {code}
> if (queueName.equals(YarnConfiguration.DEFAULT_QUEUE_NAME)
>   || overrideWithQueueMappings)
> {code}
> This logic does not seem very reasonable, as there's no way to fully disable 
> queue overriding when mappings are configured inside capacity-scheduler.xml.
> In addition, in our environment, we have set up a few organization-dedicated 
> queues as well as some "adhoc" queues. The organization-dedicated queues have 
> better resource guarantees, and we want to be able to route users to the 
> corresponding organization queues. On the other hand, the "adhoc" queues have 
> weaker resource guarantees, but everyone can use them to get some opportunistic 
> resources when the cluster is free.
> The current logic also prevents this type of use case: once queue overriding 
> is enabled, users cannot use these "adhoc" queues any more. They will 
> always be routed to the dedicated organization queues.
> To address the above 2 issues, I propose to change the implementation so that:
> * Admins can fully disable queue overriding even if mappings are already 
> configured.
> * Admins have finer-grained control to reconcile queue overriding with the 
> organization/adhoc queue setups mentioned above.
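A rough sketch of how the proposed behavior could change the triggering condition (the disabledQueues set and lookup helper below are illustrative only, corresponding to the finer-grained control described above):
{code}
// Current trigger: overrides even when the feature is disabled, as long as the
// user submitted to "default":
//   if (queueName.equals(YarnConfiguration.DEFAULT_QUEUE_NAME)
//       || overrideWithQueueMappings) { ... }

// Proposed trigger: never override when the feature is off, and skip queues the
// admin has explicitly excluded from overriding (e.g. "adhoc" queues).
if (overrideWithQueueMappings && !disabledQueues.contains(queueName)) {
  queueName = lookupQueueMapping(user, groups);   // hypothetical helper
}
{code}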






[jira] [Comment Edited] (YARN-5520) [Capacity Scheduler] Change the logic for when to trigger user/group mappings to queues

2016-08-15 Thread Min Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421682#comment-15421682
 ] 

Min Shen edited comment on YARN-5520 at 8/15/16 9:15 PM:
-

[~venkateshrin],

Thanks for providing feedbacks on this ticket.

To answer your questions, I'd like to use the following example to make the 
explanation more clear:
Assume we have 4 queues configured, i.e. root.orgA, root.orgB, root.default, 
and root.public
root.orgA's capacity and max capacity are configured as 45% and 45%, 
respectively. Same for root.orgB.
root.default is configured as 0%, 0% while root.public is configured as 5% and 
30%.
Preemption is also enabled. Thus, root.orgA and root.orgB each has 45% of 
guaranteed resources, while root.public has access to certain elastic resources 
w/o too much guarantee.

Also, assume we have 3 users, userA which belongs to orgA, userB which belongs 
to orgB, and userC which also belongs to orgB.
Admins want to route users to their corresponding organization queue, so they 
have configured the following in capacity-scheduler.xml:
{noformat}
u:userA:orgA, u:userB:orgB, u:userC:orgB
{noformat}
In addition, YarnConfiguration.DEFAULT_QUEUE_NAME is set to "default".

In my proposed change, when 
{{yarn.scheduler.capacity.queue-mappings-override.enable}} is set to false, 
user's application will always get submitted to whichever queue the user 
requests, or root.default if user does not specify a queue.

When {{yarn.scheduler.capacity.queue-mappings-override.enable}} is set to true, 
we have the following possible scenarios:
# userA/userB/userC submits jobs which do not specify a queue. My proposed 
change will override the application's queue with the one specified in the 
queue mappings configuration.
# userA/userB/userC submits jobs to root.default, root.orgA, or root.orgB. The 
application's queue will still be overridden with the one specified in the 
queue mappings configuration.
# userA/userB/userC submits jobs to root.public. The application will be 
submitted to root.public. This could happen in the following case: userB 
consumed all available resources in root.orgB but userA is not using resources 
in root.orgA. In the mean time, userC wants to launch his job. If we enforce 
queue overriding for all queues, then userC has to wait for userB's job to 
release resources. However, if we disable queue overriding for root.public, 
userC can use root.public to get resources much more quickly. 

In this way, the admin can override queues for applications submitted to a 
certain subset of queues in the cluster, while still allowing users to use the 
"adhoc" queues. It also distinguishes well between the cases when queue 
overriding is enabled vs. when it's not, since users have to explicitly specify 
to use the "adhoc" queues in order to disable queue overriding. As a result, 
the users should understand disabling queue overriding comes with the cost of 
less resource guarantees (because of preemption).

We can introduce an additional parameter 
{{yarn.scheduler.capacity.queue-mappings.disabled.queues}} to control the list 
of queues where queue overriding is disabled when 
{{yarn.scheduler.capacity.queue-mappings-override.enable}} is set to true.
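Under this proposal, the relevant capacity-scheduler.xml settings for the example above might look roughly as follows (the queue-mappings.disabled.queues entry is the proposed, not yet existing, property):
{noformat}
yarn.scheduler.capacity.queue-mappings = u:userA:orgA,u:userB:orgB,u:userC:orgB
yarn.scheduler.capacity.queue-mappings-override.enable = true
yarn.scheduler.capacity.queue-mappings.disabled.queues = root.public
{noformat}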


was (Author: mshen):
[~venkateshrin],

Thanks for providing feedbacks on this ticket.

To answer your questions, I'd like to use the following example to make the 
explanation more clear:
Assume we have 4 queues configured, i.e. root.orgA, root.orgB, root.default, 
and root.public
root.orgA's capacity and max capacity are configured as 45% and 45%, 
respectively. Same for root.orgB.
root.default is configured as 0%, 0% while root.public is configured as 5% and 
30%.
Preemption is also enabled. Thus, root.orgA and root.orgB each has 45% of 
guaranteed resources, while root.public has access to certain elastic resources 
w/o too much guarantee.

Also, assume we have 3 users, userA which belongs to orgA, userB which belongs 
to orgB, and userC which also belongs to orgB.
Admins want to route users to their corresponding organization queue, so they 
have configured the following in capacity-scheduler.xml:
{noformat}
u:userA:orgA, u:userB:orgB, u:userC:orgB
{noformat}
In addition, YarnConfiguration.DEFAULT_QUEUE_NAME is set to "default".

In my proposed change, when 
{{yarn.scheduler.capacity.queue-mappings-override.enable}} is set to false, 
user's application will always get submitted to whichever queue the user 
requests, or root.default if user does not specify a queue.

When {{yarn.scheduler.capacity.queue-mappings-override.enable}} is set to true, 
we have the following possible scenarios:
# userA/userB/userC submits jobs which do not specify a queue. My proposed 
change will override the application's queue with the one specified in the 
queue mappings configuration.
# userA/userB/userC submits jobs to root.default, root.orgA, or root.orgB. 

[jira] [Comment Edited] (YARN-5520) [Capacity Scheduler] Change the logic for when to trigger user/group mappings to queues

2016-08-15 Thread Min Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421682#comment-15421682
 ] 

Min Shen edited comment on YARN-5520 at 8/15/16 9:14 PM:
-

[~venkateshrin],

Thanks for providing feedbacks on this ticket.

To answer your questions, I'd like to use the following example to make the 
explanation more clear:
Assume we have 4 queues configured, i.e. root.orgA, root.orgB, root.default, 
and root.public
root.orgA's capacity and max capacity are configured as 45% and 45%, 
respectively. Same for root.orgB.
root.default is configured as 0%, 0% while root.public is configured as 5% and 
30%.
Preemption is also enabled. Thus, root.orgA and root.orgB each has 45% of 
guaranteed resources, while root.public has access to certain elastic resources 
w/o too much guarantee.

Also, assume we have 3 users, userA which belongs to orgA, userB which belongs 
to orgB, and userC which also belongs to orgB.
Admins want to route users to their corresponding organization queue, so they 
have configured the following in capacity-scheduler.xml:
{noformat}
u:userA:orgA, u:userB:orgB, u:userC:orgB
{noformat}
In addition, YarnConfiguration.DEFAULT_QUEUE_NAME is set to "default".

In my proposed change, when 
{{yarn.scheduler.capacity.queue-mappings-override.enable}} is set to false, 
user's application will always get submitted to whichever queue the user 
requests, or root.default if user does not specify a queue.

When {{yarn.scheduler.capacity.queue-mappings-override.enable}} is set to true, 
we have the following possible scenarios:
# userA/userB/userC submits jobs which do not specify a queue. My proposed 
change will override the application's queue with the one specified in the 
queue mappings configuration.
# userA/userB/userC submits jobs to root.default, root.orgA, or root.orgB. The 
application's queue will still be overridden with the one specified in the 
queue mappings configuration.
# userA/userB/userC submits jobs to root.public. The application will be 
submitted to root.public. This could happen in the following case: userB 
consumed all available resources in root.orgB but userA is not using resources 
in root.orgA. In the mean time, userC wants to launch his job. If we enforce 
queue overriding for all queues, then userC has to wait for userB's job to 
release resources. However, if we disable queue overriding for root.public, 
userC can use root.public to get resources much more quickly. 

In this way, the admin can override queues for applications submitted to a 
certain subset of queues in the cluster, while still allowing users to use the 
"adhoc" queues. It also distinguishes well between the cases when queue 
overriding is enabled vs. when it's not, since users have to explicitly specify 
to use the "adhoc" queues in order to disable queue overriding. As a result, 
the users should understand disabling queue overriding comes with the cost of 
less resource guarantees (because of preemption).

We can introduce an additional parameter 
{{yarn.scheduler.capacity.queue-mappings.disabled.queues}} to control the list 
of queues where queue overriding is disabled when 
{{yarn.scheduler.capacity.queue-mappings-override.enable}} is set to true.


was (Author: mshen):
[~venkateshrin],

Thanks for providing feedbacks on this ticket.

To answer your questions, I'd like to use the following example to make the 
explanation more clear:
Assume we have 4 queues configured, i.e. root.orgA, root.orgB, root.default, 
and root.public
root.orgA's capacity and max capacity are configured as 45% and 45%, 
respectively. Same for root.orgB.
root.default is configured as 0%, 0% while root.public is configured as 5% and 
30%.
Preemption is also enabled. Thus, root.orgA and root.orgB each has 45% of 
guaranteed resources, while root.public has access to certain elastic resources 
w/o too much guarantee.

Also, assume we have 3 users, userA which belongs to orgA, userB which belongs 
to orgB, and userC which also belongs to orgB.
Admins want to route users to their corresponding organization queue, so they 
have configured the following in capacity-scheduler.xml:
{noformat}
u:userA:orgA, u:userB:orgB, u:userC:orgB
{noformat}
In addition, YarnConfiguration.DEFAULT_QUEUE_NAME is set to "default".

In my proposed change, when 
{{yarn.scheduler.capacity.queue-mappings-override.enable}} is set to false, 
user's application will always get submitted to whichever queue the user 
requests, or root.default if user does not specify a queue.

When {{yarn.scheduler.capacity.queue-mappings-override.enable}} is set to true, 
we have the following possible scenarios:
# userA/userB/userC submits jobs which do not specify a queue. My proposed 
change will override the application's queue with the one specified in the 
queue mappings configuration.
# userA/userB/userC submits jobs to root.default, root.orgA, or root.orgB. The 

[jira] [Commented] (YARN-5520) [Capacity Scheduler] Change the logic for when to trigger user/group mappings to queues

2016-08-15 Thread Min Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421682#comment-15421682
 ] 

Min Shen commented on YARN-5520:


[~venkateshrin],

Thanks for providing feedback on this ticket.

To answer your questions, I'd like to use the following example to make the 
explanation clearer:
Assume we have 4 queues configured, i.e. root.orgA, root.orgB, root.default, 
and root.public.
root.orgA's capacity and max capacity are configured as 45% and 45%, 
respectively. Same for root.orgB.
root.default is configured as 0%, 0% while root.public is configured as 5% and 
30%.
Preemption is also enabled. Thus, root.orgA and root.orgB each has 45% of 
guaranteed resources, while root.public has access to certain elastic resources 
without much guarantee.

Also, assume we have 3 users: userA, who belongs to orgA; userB, who belongs 
to orgB; and userC, who also belongs to orgB.
Admins want to route users to their corresponding organization queue, so they 
have configured the following in capacity-scheduler.xml:
{noformat}
u:userA:orgA, u:userB:orgB, u:userC:orgB
{noformat}
In addition, YarnConfiguration.DEFAULT_QUEUE_NAME is set to "default".
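
For reference, the example setup above would roughly correspond to the 
following capacity-scheduler.xml entries (shown here as property name / value 
pairs for brevity; note that in an actual configuration the child capacities 
under root must sum to 100, so the remaining 5% in this example would still 
have to be assigned somewhere):
{noformat}
yarn.scheduler.capacity.root.queues = orgA,orgB,default,public
yarn.scheduler.capacity.root.orgA.capacity = 45
yarn.scheduler.capacity.root.orgA.maximum-capacity = 45
yarn.scheduler.capacity.root.orgB.capacity = 45
yarn.scheduler.capacity.root.orgB.maximum-capacity = 45
yarn.scheduler.capacity.root.default.capacity = 0
yarn.scheduler.capacity.root.default.maximum-capacity = 0
yarn.scheduler.capacity.root.public.capacity = 5
yarn.scheduler.capacity.root.public.maximum-capacity = 30
yarn.scheduler.capacity.queue-mappings = u:userA:orgA,u:userB:orgB,u:userC:orgB
{noformat}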

In my proposed change, when 
{{yarn.scheduler.capacity.queue-mappings-override.enable}} is set to false, 
the user's application will always be submitted to whichever queue the user 
requests, or to root.default if the user does not specify a queue.

When {{yarn.scheduler.capacity.queue-mappings-override.enable}} is set to true, 
we have the following possible scenarios:
# userA/userB/userC submits jobs which do not specify a queue. My proposed 
change will override the application's queue with the one specified in the 
queue mappings configuration.
# userA/userB/userC submits jobs to root.default, root.orgA, or root.orgB. The 
application's queue will still be overridden with the one specified in the 
queue mappings configuration.
# userA/userB/userC submits jobs to root.public. The application will be 
submitted to root.public. This could happen in the following case: userB has 
consumed all available resources in root.orgB, but userA is not using the 
resources in root.orgA. In the meantime, userC wants to launch a job. If we 
enforce queue overriding for all queues, then userC has to wait for userB's job 
to release resources. However, if we disable queue overriding for root.public, 
userC can use root.public to get resources much more quickly.

In this way, the admin can override queues for applications submitted to a 
certain subset of queues in the cluster, while still allowing users to use the 
"adhoc" queues. It also draws a clear distinction between the cases where queue 
overriding is enabled and where it is not, since users have to explicitly 
request the "adhoc" queues in order to disable queue overriding. As a result, 
users should understand that disabling queue overriding comes at the cost of 
weaker resource guarantees (due to preemption).

> [Capacity Scheduler] Change the logic for when to trigger user/group mappings 
> to queues
> ---
>
> Key: YARN-5520
> URL: https://issues.apache.org/jira/browse/YARN-5520
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.6.0, 2.7.0, 2.6.1
>Reporter: Min Shen
>
> In YARN-2411, the feature in Capacity Scheduler to support user/group based 
> mappings to queues was introduced.
> In the original implementation, the configuration key 
> {{yarn.scheduler.capacity.queue-mappings-override.enable}} was added to 
> control when to enable overriding user requested queues.
> However, even if this configuration is set to false, queue overriding could 
> still happen if the user didn't request any specific queue or chose to simply 
> submit his job to the "default" queue, according to the following if 
> condition which triggers queue overriding:
> {code}
> if (queueName.equals(YarnConfiguration.DEFAULT_QUEUE_NAME)
>   || overrideWithQueueMappings)
> {code}
> This logic does not seem very reasonable, as there's no way to fully disable 
> queue overriding when mappings are configured inside capacity-scheduler.xml.
> In addition, in our environment, we have set up a few organization-dedicated 
> queues as well as some "adhoc" queues. The organization-dedicated queues have 
> better resource guarantees, and we want to be able to route users to the 
> corresponding organization queues. On the other hand, the "adhoc" queues have 
> weaker resource guarantees, but everyone can use them to get some 
> opportunistic resources when the cluster is free.
> The current logic also prevents this type of use case: once queue overriding 
> is enabled, users cannot use these "adhoc" queues anymore. They will always 
> be routed to the dedicated organization queues.
> To address the 

[jira] [Commented] (YARN-5520) [Capacity Scheduler] Change the logic for when to trigger user/group mappings to queues

2016-08-12 Thread Min Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419734#comment-15419734
 ] 

Min Shen commented on YARN-5520:


Will attach patch soon.

> [Capacity Scheduler] Change the logic for when to trigger user/group mappings 
> to queues
> ---
>
> Key: YARN-5520
> URL: https://issues.apache.org/jira/browse/YARN-5520
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.6.0, 2.7.0, 2.6.1
>Reporter: Min Shen
>
> In YARN-2411, the feature in Capacity Scheduler to support user/group based 
> mappings to queues was introduced.
> In the original implementation, the configuration key 
> {{yarn.scheduler.capacity.queue-mappings-override.enable}} was added to 
> control when to enable overriding user requested queues.
> However, even if this configuration is set to false, queue overriding could 
> still happen if the user didn't request any specific queue or chose to simply 
> submit his job to the "default" queue, according to the following if 
> condition which triggers queue overriding:
> {code}
> if (queueName.equals(YarnConfiguration.DEFAULT_QUEUE_NAME)
>   || overrideWithQueueMappings)
> {code}
> This logic does not seem very reasonable, as there's no way to fully disable 
> queue overriding when mappings are configured inside capacity-scheduler.xml.
> In addition, in our environment, we have set up a few organization-dedicated 
> queues as well as some "adhoc" queues. The organization-dedicated queues have 
> better resource guarantees, and we want to be able to route users to the 
> corresponding organization queues. On the other hand, the "adhoc" queues have 
> weaker resource guarantees, but everyone can use them to get some 
> opportunistic resources when the cluster is free.
> The current logic also prevents this type of use case: once queue overriding 
> is enabled, users cannot use these "adhoc" queues anymore. They will always 
> be routed to the dedicated organization queues.
> To address the above 2 issues, I propose to change the implementation so that:
> * Admins can fully disable queue overriding even if mappings are already 
> configured.
> * Admins have finer-grained control to combine queue overriding with the 
> organization/adhoc queue setups mentioned above.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-5520) [Capacity Scheduler] Change the logic for when to trigger user/group mappings to queues

2016-08-12 Thread Min Shen (JIRA)
Min Shen created YARN-5520:
--

 Summary: [Capacity Scheduler] Change the logic for when to trigger 
user/group mappings to queues
 Key: YARN-5520
 URL: https://issues.apache.org/jira/browse/YARN-5520
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.6.1, 2.7.0, 2.6.0
Reporter: Min Shen


In YARN-2411, the feature in Capacity Scheduler to support user/group based 
mappings to queues was introduced.
In the original implementation, the configuration key 
{{yarn.scheduler.capacity.queue-mappings-override.enable}} was added to control 
when to enable overriding user requested queues.
However, even if this configuration is set to false, queue overriding could 
still happen if the user didn't request any specific queue or chose to simply 
submit his job to the "default" queue, according to the following if condition 
which triggers queue overriding:
{code}
if (queueName.equals(YarnConfiguration.DEFAULT_QUEUE_NAME)
  || overrideWithQueueMappings)
{code}

This logic does not seem very reasonable, as there's no way to fully disable 
queue overriding when mappings are configured inside capacity-scheduler.xml.

In addition, in our environment, we have set up a few organization-dedicated 
queues as well as some "adhoc" queues. The organization-dedicated queues have 
better resource guarantees, and we want to be able to route users to the 
corresponding organization queues. On the other hand, the "adhoc" queues have 
weaker resource guarantees, but everyone can use them to get some opportunistic 
resources when the cluster is free.
The current logic also prevents this type of use case: once queue overriding 
is enabled, users cannot use these "adhoc" queues anymore. They will always be 
routed to the dedicated organization queues.

To address the above 2 issues, I propose to change the implementation so that:
* Admins can fully disable queue overriding even if mappings are already 
configured.
* Admins have finer-grained control to combine queue overriding with the 
organization/adhoc queue setups mentioned above.
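
As a rough illustration of the intended direction (this is a sketch only, not 
an actual patch; the helper class, its parameters, and the notion of a set of 
"override-disabled" queues are hypothetical names used for the example), the 
triggering logic could move away from keying off the "default" queue name to 
something along these lines:
{code}
import java.util.Set;

// Illustrative sketch only -- not the actual CapacityScheduler code.
public final class QueueOverridePolicy {
  /**
   * Decide whether the user-requested queue should be overridden by the
   * configured user/group-to-queue mappings.
   */
  public static boolean shouldOverride(
      String requestedQueue,               // queue the user submitted to
      boolean mappingsConfigured,          // mappings exist in capacity-scheduler.xml
      boolean overrideWithQueueMappings,   // queue-mappings-override.enable
      Set<String> overrideDisabledQueues)  // queues exempted from overriding
  {
    // Proposal 1: if overriding is not enabled, never override -- not even
    // when the job is submitted to the "default" queue.
    if (!mappingsConfigured || !overrideWithQueueMappings) {
      return false;
    }
    // Proposal 2: let admins exempt specific "adhoc" queues (e.g. root.public)
    // so that explicit submissions to them are honored as-is.
    return !overrideDisabledQueues.contains(requestedQueue);
  }
}
{code}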



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org