[jira] [Resolved] (YARN-8849) DynoYARN: A simulation and testing infrastructure for YARN clusters

2021-09-20 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung resolved YARN-8849.
-
Resolution: Fixed

FYI we have open source DynoYARN on Github: https://github.com/linkedin/dynoyarn

> DynoYARN: A simulation and testing infrastructure for YARN clusters
> ---
>
> Key: YARN-8849
> URL: https://issues.apache.org/jira/browse/YARN-8849
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Arun Suresh
>Assignee: Jonathan Hung
>Priority: Major
>
> Traditionally, YARN workload simulation is performed using SLS (Scheduler 
> Load Simulator), which is packaged with YARN. Essentially, it starts a 
> full-fledged *ResourceManager* but runs simulators for the *NodeManager* and 
> the *ApplicationMaster* containers. These simulators are lightweight and run 
> in a thread pool. The NM simulators do not open any external ports and send 
> (in-process) heartbeats to the ResourceManager.
> There are a few drawbacks to using the SLS:
>  * It can be difficult to simulate very large clusters without access to a 
> very beefy box, since the NMs are launched as tasks in a thread pool and 
> each NM has to send periodic heartbeats to the RM.
>  * Certain features (like YARN-1011) require changes to the NodeManager - 
> aspects such as queuing and selectively killing containers have to be 
> incorporated into the existing NM simulator, which makes the simulator 
> heavier weight and introduces a need for locking and synchronization.
>  * Since the NM and AM are simulations, only the Scheduler is faithfully 
> tested - it is not really an end-to-end test of a cluster.
> Therefore, drawing inspiration from 
> [Dynamometer|https://github.com/linkedin/dynamometer], we propose a testing 
> framework - *DynoYARN* - that deploys a simulated YARN cluster on YARN, with 
> the following features:
>  * The NM already has hooks to plug in a custom *ContainerExecutor* and 
> *NodeResourceMonitor*. If we can also plug in a custom *ContainersMonitorImpl* 
> monitoring thread (and other modules like the LocalizationService), we can 
> inject an executor that does not actually launch containers, and node and 
> container resource monitors that report synthetic, pre-specified utilization 
> metrics back to the RM (see the sketch at the end of this description).
>  * Since we are launching fake containers, we cannot run normal AM 
> containers. We can therefore use *Unmanaged AMs* to launch synthetic jobs.
> Essentially, a test workflow would look like this:
>  * Launch a DynoYARN cluster.
>  * Use the Unmanaged AM feature to directly negotiate with the DynoYARN 
> Resource Manager for container tokens.
>  * Use the container tokens from the RM to directly ask the DynoYARN Node 
> Managers to start fake containers.
>  * The DynoYARN NodeManagers will start the fake containers and report to the 
> DynoYARN Resource Manager synthetically generated resource utilization for 
> the containers (which will be injected via the *ContainerLaunchContext* and 
> parsed by the plugged-in Container Executor).
>  * The Scheduler will use the utilization report to schedule containers - we 
> will be able to test allocation of *Opportunistic* containers based on 
> resource utilization.
>  * Since the DynoYARN Node Managers run the actual code paths, all preemption 
> and queuing logic will be faithfully executed.
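A minimal sketch of the pluggable no-op executor idea described above, assuming 
a hypothetical FakeContainerExecutor class and SIMULATED_UTILIZATION environment 
key, with approximate package paths and method signatures (the actual 
ContainerExecutor API varies across Hadoop releases):

{code:java}
import java.io.IOException;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor;
import org.apache.hadoop.yarn.server.nodemanager.executor.ContainerStartContext;

// Hypothetical sketch only; names and signatures are approximate. A real
// implementation would also need to override the other abstract methods of
// ContainerExecutor.
public class FakeContainerExecutor extends ContainerExecutor {

  // Environment key used by the client to inject synthetic utilization into
  // the ContainerLaunchContext (hypothetical).
  public static final String SIMULATED_UTILIZATION = "SIMULATED_UTILIZATION";

  @Override
  public int launchContainer(ContainerStartContext ctx) throws IOException {
    // Read the synthetic utilization spec instead of forking a real process.
    String spec = ctx.getContainer().getLaunchContext()
        .getEnvironment().get(SIMULATED_UTILIZATION);
    // Stash it so the plugged-in node/container resource monitors can report
    // it back to the RM in NM heartbeats.
    recordSyntheticUtilization(ctx.getContainer().getContainerId(), spec);
    return 0; // report success without launching anything
  }

  private void recordSyntheticUtilization(ContainerId id, String spec) {
    // Bookkeeping shared with the fake ContainersMonitor thread.
  }
}
{code}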



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10697) Resources are displayed in bytes in UI for schedulers other than capacity

2021-03-18 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304467#comment-17304467
 ] 

Jonathan Hung commented on YARN-10697:
--

[~Jim_Brennan] [~BilwaST] I agree; I don't think we should make the 
Resource#toString change. IMO users expect this to be bytes, and making this 
change could have unintended consequences, e.g. breaking log-parsing tooling.

> Resources are displayed in bytes in UI for schedulers other than capacity
> -
>
> Key: YARN-10697
> URL: https://issues.apache.org/jira/browse/YARN-10697
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10697.001.patch, image-2021-03-17-11-30-57-216.png
>
>
> Resources.newInstance expects memory in MB, whereas MetricsOverviewTable 
> passes resources in bytes. Also, we should display memory in GB for better 
> readability.
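A small sketch of the unit mismatch described above; the clusterMetricsInfo 
variable and its getters are illustrative only, and the GB formatting is just 
one possible way to improve readability:

{code:java}
// Sketch only, not the actual patch. Resources.newInstance takes memory in
// MB, so a caller holding bytes must convert before constructing the Resource.
long totalMemoryBytes = clusterMetricsInfo.getTotalMemoryBytes(); // illustrative
Resource total = Resources.newInstance(
    totalMemoryBytes / (1024 * 1024),           // bytes -> MB
    clusterMetricsInfo.getTotalVirtualCores()); // illustrative

// For display, GB reads better than raw MB or bytes:
String displayMemory = String.format("%.2f GB", total.getMemorySize() / 1024.0);
{code}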



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10651) CapacityScheduler crashed with NPE in AbstractYarnScheduler.updateNodeResource()

2021-02-25 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291305#comment-17291305
 ] 

Jonathan Hung edited comment on YARN-10651 at 2/26/21, 12:04 AM:
-

+1 from me. I pushed this to trunk~branch-2.10. Thanks [~haibochen] for the 
contribution.


was (Author: jhung):
I pushed this to trunk~branch-2.10. Thanks [~haibochen] for the contribution.

> CapacityScheduler crashed with NPE in 
> AbstractYarnScheduler.updateNodeResource() 
> -
>
> Key: YARN-10651
> URL: https://issues.apache.org/jira/browse/YARN-10651
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0, 2.10.1
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: YARN-10651.00.patch, YARN-10651.01.patch, event_seq.jpg
>
>
> {code:java}
> 2021-02-24 17:07:39,798 FATAL org.apache.hadoop.yarn.event.EventDispatcher: 
> Error in handling event type NODE_RESOURCE_UPDATE to the Event Dispatcher
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:809)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeAndQueueResource(CapacityScheduler.java:1116)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1505)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:154)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:748)
> 2021-02-24 17:07:39,798 INFO org.apache.hadoop.yarn.event.EventDispatcher: 
> Exiting, bbye..{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10651) CapacityScheduler crashed with NPE in AbstractYarnScheduler.updateNodeResource()

2021-02-25 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291225#comment-17291225
 ] 

Jonathan Hung commented on YARN-10651:
--

Thanks [~haibochen] - should we add some logging in this case?

Also, any way to reproduce this issue in a test?

> CapacityScheduler crashed with NPE in 
> AbstractYarnScheduler.updateNodeResource() 
> -
>
> Key: YARN-10651
> URL: https://issues.apache.org/jira/browse/YARN-10651
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0, 2.10.1
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: YARN-10651.00.patch, event_seq.jpg
>
>
> {code:java}
> 2021-02-24 17:07:39,798 FATAL org.apache.hadoop.yarn.event.EventDispatcher: 
> Error in handling event type NODE_RESOURCE_UPDATE to the Event Dispatcher
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:809)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeAndQueueResource(CapacityScheduler.java:1116)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1505)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:154)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:748)
> 2021-02-24 17:07:39,798 INFO org.apache.hadoop.yarn.event.EventDispatcher: 
> Exiting, bbye..{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

2020-10-28 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17222309#comment-17222309
 ] 

Jonathan Hung edited comment on YARN-10467 at 10/28/20, 8:14 PM:
-

I committed this to trunk/branch-3.3/branch-3.2/branch-3.1/branch-2.10. Thanks 
[~haibochen] for the contribution and [~Jim_Brennan] for the review.


was (Author: jhung):
I committed this to trunk~branch-2.10. Thanks [~haibochen] for the contribution 
and [~Jim_Brennan] for the review.

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -
>
> Key: YARN-10467
> URL: https://issues.apache.org/jira/browse/YARN-10467
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.10.0, 3.0.3, 3.2.1, 3.1.4
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: YARN-10467.00.patch, YARN-10467.01.patch, 
> YARN-10467.02.patch, YARN-10467.branch-2.10.00.patch, 
> YARN-10467.branch-2.10.01.patch, YARN-10467.branch-2.10.02.patch, 
> YARN-10467.branch-2.10.03.patch
>
>
> In one of our recent heap analyses, we found that the majority of the heap is 
> occupied by {{RMNodeImpl.completedContainers}}, which accounts for 19 GB out 
> of 24.3 GB. There are over 86 million ContainerIdPBImpl objects; in contrast, 
> there are only 161,601 RMContainerImpl objects, which represent the number of 
> active containers the RM is still tracking. Inspecting some ContainerIdPBImpl 
> objects shows they belong to applications that finished long ago. This 
> indicates a memory leak of ContainerIdPBImpl objects in RMNodeImpl.
>  
> Right now, when a container is reported by an NM as completed, it is 
> immediately added to RMNodeImpl.completedContainers and later cleaned up 
> after the AM has been notified of its completion in the AM-RM heartbeat. The 
> cleanup can be broken into a few steps:
>  * Step 1: the completed container is first added to 
> RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being added 
> to {{RMNodeImpl.completedContainers}}).
>  * Step 2: during the AM-RM heartbeat, the container is removed from 
> RMAppAttemptImpl.justFinishedContainers and added to 
> RMAppAttemptImpl.finishedContainersSentToAM.
> Once a completed container is added to 
> RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned 
> up from {{RMNodeImpl.completedContainers}}.
>  
> However, if the AM exits (regardless of failure or success) before some 
> recently completed containers can be added to 
> RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there 
> won't be any future AM-RM heartbeat to perform the aforementioned step 2. 
> Hence, these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that an AM can decide to exit upon success of all its 
> tasks without waiting for notification of the completion of every container, 
> or the AM may just die suddenly (e.g. OOM). Spark and other frameworks may 
> behave similarly.
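A heavily simplified illustration of the two-step handoff described above and 
where it can leak; the AttemptModel type below is hypothetical and merely 
stands in for the real RMAppAttemptImpl/RMNodeImpl interaction:

{code:java}
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.ContainerId;

// Hypothetical simplified model, not the actual RM code.
class AttemptModel {
  Set<ContainerId> justFinished = new HashSet<>();     // step 1 target
  Set<ContainerId> finishedSentToAM = new HashSet<>(); // step 2 target

  // Step 1: NM reports completion; RMNodeImpl.completedContainers grows too.
  void onContainerFinished(ContainerId id) {
    justFinished.add(id);
  }

  // Step 2: only runs on an AM-RM heartbeat. If the AM has already exited,
  // anything still in justFinished never moves, so the matching entries in
  // RMNodeImpl.completedContainers are never cleaned up - the leak.
  void onAmHeartbeat() {
    finishedSentToAM.addAll(justFinished);
    justFinished.clear();
    // ...entries in finishedSentToAM are what eventually get removed from
    // RMNodeImpl.completedContainers.
  }
}
{code}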



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

2020-10-27 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221747#comment-17221747
 ] 

Jonathan Hung commented on YARN-10467:
--

[~haibochen], thank you for the patch. It looks good. However, it looks like 
you added some extra files under placement/schema in 
[^YARN-10467.branch-2.10.01.patch]; can we remove those?

Other than that, +1.

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -
>
> Key: YARN-10467
> URL: https://issues.apache.org/jira/browse/YARN-10467
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.10.0
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Fix For: 2.10.2
>
> Attachments: YARN-10467.00.patch, YARN-10467.01.patch, 
> YARN-10467.branch-2.10.00.patch, YARN-10467.branch-2.10.01.patch
>
>
> In one of our recent heap analyses, we found that the majority of the heap is 
> occupied by {{RMNodeImpl.completedContainers}}, which accounts for 19 GB out 
> of 24.3 GB. There are over 86 million ContainerIdPBImpl objects; in contrast, 
> there are only 161,601 RMContainerImpl objects, which represent the number of 
> active containers the RM is still tracking. Inspecting some ContainerIdPBImpl 
> objects shows they belong to applications that finished long ago. This 
> indicates a memory leak of ContainerIdPBImpl objects in RMNodeImpl.
>  
> Right now, when a container is reported by an NM as completed, it is 
> immediately added to RMNodeImpl.completedContainers and later cleaned up 
> after the AM has been notified of its completion in the AM-RM heartbeat. The 
> cleanup can be broken into a few steps:
>  * Step 1: the completed container is first added to 
> RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being added 
> to {{RMNodeImpl.completedContainers}}).
>  * Step 2: during the AM-RM heartbeat, the container is removed from 
> RMAppAttemptImpl.justFinishedContainers and added to 
> RMAppAttemptImpl.finishedContainersSentToAM.
> Once a completed container is added to 
> RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned 
> up from {{RMNodeImpl.completedContainers}}.
>  
> However, if the AM exits (regardless of failure or success) before some 
> recently completed containers can be added to 
> RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there 
> won't be any future AM-RM heartbeat to perform the aforementioned step 2. 
> Hence, these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that an AM can decide to exit upon success of all its 
> tasks without waiting for notification of the completion of every container, 
> or the AM may just die suddenly (e.g. OOM). Spark and other frameworks may 
> behave similarly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-10-15 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214939#comment-17214939
 ] 

Jonathan Hung commented on YARN-10450:
--

Thanks [~Jim_Brennan] that is fine with me.

> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 3.4.0, 3.3.1
>
> Attachments: NodesPage.png, YARN-10450-branch-2.10.003.patch, 
> YARN-10450-branch-3.1.003.patch, YARN-10450-branch-3.2.003.patch, 
> YARN-10450.001.patch, YARN-10450.002.patch, YARN-10450.003.patch
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster. This information is already passed 
> from the NM to the RM in the node status update.
> We have been running with this internally for quite a while and have found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster. It's especially useful if some form of overcommit is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-10-12 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212676#comment-17212676
 ] 

Jonathan Hung commented on YARN-10450:
--

[~Jim_Brennan], Physical Mem Used % makes sense to me. We also refer to this as 
"Memory Efficiency" internally. 

> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: NodesPage.png, YARN-10450.001.patch, YARN-10450.002.patch
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster. This information is already passed 
> from the NM to the RM in the node status update.
> We have been running with this internally for quite a while and have found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster. It's especially useful if some form of overcommit is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8210) AMRMClient logging on every heartbeat to track updation of AM RM token causes too many log lines to be generated in AM logs

2020-09-10 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-8210:

Fix Version/s: (was: 2.10.2)
   2.10.1

> AMRMClient logging on every heartbeat to track updation of AM RM token causes 
> too many log lines to be generated in AM logs
> ---
>
> Key: YARN-8210
> URL: https://issues.apache.org/jira/browse/YARN-8210
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.9.0, 3.0.0-alpha1
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
> Fix For: 3.2.0, 3.1.1, 3.0.3, 2.10.1
>
> Attachments: YARN-8210.1.patch
>
>
> YARN-4682 added logs to track when the AM RM token is updated, for 
> debuggability purposes. However, this is printed on every heartbeat and could 
> flood the AM logs whenever the RM's master key is rolled over, especially for 
> a long-running AM. Hence, proposing to remove this log line. 
> As explained in 
> https://issues.apache.org/jira/browse/YARN-3104?focusedCommentId=14298692&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14298692
> , the AM-RM connection is not re-established, so the updated token in the 
> client's UGI is never re-sent to the RPC server, and the RM continues to send 
> the token on each heartbeat since it cannot be sure whether the client really 
> has the new token. Hence, the log lines are printed on every heartbeat.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8210) AMRMClient logging on every heartbeat to track updation of AM RM token causes too many log lines to be generated in AM logs

2020-09-10 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-8210:

Fix Version/s: 2.10.2

Pushed to branch-2.10.

> AMRMClient logging on every heartbeat to track updation of AM RM token causes 
> too many log lines to be generated in AM logs
> ---
>
> Key: YARN-8210
> URL: https://issues.apache.org/jira/browse/YARN-8210
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.9.0, 3.0.0-alpha1
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
> Fix For: 3.2.0, 3.1.1, 3.0.3, 2.10.2
>
> Attachments: YARN-8210.1.patch
>
>
> YARN-4682 added logs to track when the AM RM token is updated, for 
> debuggability purposes. However, this is printed on every heartbeat and could 
> flood the AM logs whenever the RM's master key is rolled over, especially for 
> a long-running AM. Hence, proposing to remove this log line. 
> As explained in 
> https://issues.apache.org/jira/browse/YARN-3104?focusedCommentId=14298692&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14298692
> , the AM-RM connection is not re-established, so the updated token in the 
> client's UGI is never re-sent to the RPC server, and the RM continues to send 
> the token on each heartbeat since it cannot be sure whether the client really 
> has the new token. Hence, the log lines are printed on every heartbeat.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10251) Show extended resources on legacy RM UI.

2020-08-07 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173478#comment-17173478
 ] 

Jonathan Hung commented on YARN-10251:
--

Unit test failures aren't related.

[^YARN-10251.branch-3.2.007.patch], [^YARN-10251.branch-2.10.007.patch], 
[^YARN-10251.007.patch] lgtm. I'll commit by EOD.

> Show extended resources on legacy RM UI.
> 
>
> Key: YARN-10251
> URL: https://issues.apache.org/jira/browse/YARN-10251
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Legacy RM UI With Not All Resources Shown.png, Updated 
> NodesPage UI With GPU columns.png, Updated RM UI With All Resources 
> Shown.png.png, YARN-10251.003.patch, YARN-10251.004.patch, 
> YARN-10251.005.patch, YARN-10251.006.patch, YARN-10251.007.patch, 
> YARN-10251.branch-2.10.001.patch, YARN-10251.branch-2.10.002.patch, 
> YARN-10251.branch-2.10.003.patch, YARN-10251.branch-2.10.005.patch, 
> YARN-10251.branch-2.10.006.patch, YARN-10251.branch-2.10.007.patch, 
> YARN-10251.branch-3.2.004.patch, YARN-10251.branch-3.2.005.patch, 
> YARN-10251.branch-3.2.006.patch, YARN-10251.branch-3.2.007.patch
>
>
> It would be great to update the legacy RM UI to include GPU resources in the 
> overview and in the per-app sections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10343) Legacy RM UI should include labeled metrics for allocated, total, and reserved resources.

2020-07-28 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166685#comment-17166685
 ] 

Jonathan Hung commented on YARN-10343:
--

[^YARN-10343.branch-3.2.001.patch], [^YARN-10343.branch-2.10.001.patch] LGTM.

Timeline test failures related to YARN-9338. FSSchedulerConfigurationStore 
failures related to YARN-9875. TestZKConfigurationStore fails locally pre-patch 
for me.

Committed to trunk~branch-2.10. Thanks [~epayne] for the contribution and 
[~Jim_Brennan] for the review.

> Legacy RM UI should include labeled metrics for allocated, total, and 
> reserved resources.
> -
>
> Key: YARN-10343
> URL: https://issues.apache.org/jira/browse/YARN-10343
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.10.0, 3.2.1, 3.1.3
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2020-07-07 at 1.00.22 PM.png, Screen Shot 
> 2020-07-07 at 1.03.26 PM.png, YARN-10343.000.patch, YARN-10343.001.patch, 
> YARN-10343.branch-2.10.001.patch, YARN-10343.branch-3.2.001.patch
>
>
> The current legacy RM UI only includes resource metrics for the default 
> partition. If a cluster has labeled nodes, those are not included in the 
> resource metrics for allocated, total, and reserved resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10343) Legacy RM UI should include labeled metrics for allocated, total, and reserved resources.

2020-07-17 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17160242#comment-17160242
 ] 

Jonathan Hung commented on YARN-10343:
--

+1 for [^YARN-10343.001.patch]. Test failure looks related to YARN-9333.

> Legacy RM UI should include labeled metrics for allocated, total, and 
> reserved resources.
> -
>
> Key: YARN-10343
> URL: https://issues.apache.org/jira/browse/YARN-10343
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.10.0, 3.2.1, 3.1.3
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2020-07-07 at 1.00.22 PM.png, Screen Shot 
> 2020-07-07 at 1.03.26 PM.png, YARN-10343.000.patch, YARN-10343.001.patch
>
>
> The current legacy RM UI only includes resource metrics for the default 
> partition. If a cluster has labeled nodes, those are not included in the 
> resource metrics for allocated, total, and reserved resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10263) Application summary is logged multiple times due to RM recovery

2020-07-14 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157748#comment-17157748
 ] 

Jonathan Hung commented on YARN-10263:
--

My understanding is that the sequence looks like:

(1) app finishes -> (2) app saved to state store -> (3) app summary is logged

I think if we only check whether an app is recovered or not, we will miss some 
apps if the RM is restarted between steps 2 and 3. We would somehow need to 
tell the state store whether something has been logged, but it seems a bit 
overkill to add a new event for this.

Any other thoughts?

> Application summary is logged multiple times due to RM recovery
> ---
>
> Key: YARN-10263
> URL: https://issues.apache.org/jira/browse/YARN-10263
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Bilwa S T
>Priority: Major
>
> App finishes, and is logged to RM app summary. Restart RM. Then this app is 
> logged to RM app summary again.
> We would somehow need to know, across restarts, whether an app has already 
> been logged.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10343) Legacy RM UI should include labeled metrics for allocated, total, and reserved resources.

2020-07-13 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156970#comment-17156970
 ] 

Jonathan Hung commented on YARN-10343:
--

Thanks [~epayne], generally this looks fine. I think the running-containers 
case is a bit tricky; I couldn't find an API for that either.

For getAllUsed, where did you see that it includes reserved resources? I 
couldn't find that.

> Legacy RM UI should include labeled metrics for allocated, total, and 
> reserved resources.
> -
>
> Key: YARN-10343
> URL: https://issues.apache.org/jira/browse/YARN-10343
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.10.0, 3.2.1, 3.1.3
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2020-07-07 at 1.00.22 PM.png, Screen Shot 
> 2020-07-07 at 1.03.26 PM.png, YARN-10343.000.patch
>
>
> The current legacy RM UI only includes resource metrics for the default 
> partition. If a cluster has labeled nodes, those are not included in the 
> resource metrics for allocated, total, and reserved resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10251) Show extended resources on legacy RM UI.

2020-07-13 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156889#comment-17156889
 ] 

Jonathan Hung commented on YARN-10251:
--

Hey [~epayne], do you plan to post a follow-up patch regarding what we 
discussed above (making used/total/reserved report only the default partition)?

> Show extended resources on legacy RM UI.
> 
>
> Key: YARN-10251
> URL: https://issues.apache.org/jira/browse/YARN-10251
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Legacy RM UI With Not All Resources Shown.png, Updated 
> NodesPage UI With GPU columns.png, Updated RM UI With All Resources 
> Shown.png.png, YARN-10251.003.patch, YARN-10251.004.patch, 
> YARN-10251.005.patch, YARN-10251.006.patch, YARN-10251.branch-2.10.001.patch, 
> YARN-10251.branch-2.10.002.patch, YARN-10251.branch-2.10.003.patch, 
> YARN-10251.branch-2.10.005.patch, YARN-10251.branch-2.10.006.patch, 
> YARN-10251.branch-3.2.004.patch, YARN-10251.branch-3.2.005.patch, 
> YARN-10251.branch-3.2.006.patch
>
>
> It would be great to update the legacy RM UI to include GPU resources in the 
> overview and in the per-app sections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10251) Show extended resources on legacy RM UI.

2020-07-02 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150555#comment-17150555
 ] 

Jonathan Hung commented on YARN-10251:
--

I see. Can you file a follow-up jira for this? I think we should at least make 
it consistent. In this jira we can leave used, total, and reserved as 
default-partition only. The follow-up jira can then change used, total, and 
reserved to count all partitions.

> Show extended resources on legacy RM UI.
> 
>
> Key: YARN-10251
> URL: https://issues.apache.org/jira/browse/YARN-10251
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Legacy RM UI With Not All Resources Shown.png, Updated 
> NodesPage UI With GPU columns.png, Updated RM UI With All Resources 
> Shown.png.png, YARN-10251.003.patch, YARN-10251.004.patch, 
> YARN-10251.005.patch, YARN-10251.006.patch, YARN-10251.branch-2.10.001.patch, 
> YARN-10251.branch-2.10.002.patch, YARN-10251.branch-2.10.003.patch, 
> YARN-10251.branch-2.10.005.patch, YARN-10251.branch-2.10.006.patch, 
> YARN-10251.branch-3.2.004.patch, YARN-10251.branch-3.2.005.patch, 
> YARN-10251.branch-3.2.006.patch
>
>
> It would be great to update the legacy RM UI to include GPU resources in the 
> overview and in the per-app sections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10251) Show extended resources on legacy RM UI.

2020-07-01 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149697#comment-17149697
 ] 

Jonathan Hung commented on YARN-10251:
--

[~epayne] thanks, generally 006 looks good to me, but I have a question:
{noformat}totalReservedResourcesAcrossPartition = new ResourceInfo(
cs.getClusterResourceUsage().getReserved());{noformat}
This seems to fetch reserved resources for the default partition only. Should 
we change it to fetch across partitions, like we do for usedResources?
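For comparison, a rough sketch of what aggregating reserved resources across 
partitions might look like; this assumes getReserved(label) on the scheduler's 
ResourceUsage and getClusterNodeLabelNames() on the node labels manager behave 
as their names suggest, and it is not the actual follow-up patch:

{code:java}
// Sketch only. Sums reserved resources over the default partition plus every
// named partition, mirroring how usedResources is aggregated.
Resource reservedAcrossPartitions = Resources.createResource(0);
Resources.addTo(reservedAcrossPartitions,
    cs.getClusterResourceUsage().getReserved()); // default partition
for (String label : rmContext.getNodeLabelManager().getClusterNodeLabelNames()) {
  Resources.addTo(reservedAcrossPartitions,
      cs.getClusterResourceUsage().getReserved(label));
}
totalReservedResourcesAcrossPartition = new ResourceInfo(reservedAcrossPartitions);
{code}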

> Show extended resources on legacy RM UI.
> 
>
> Key: YARN-10251
> URL: https://issues.apache.org/jira/browse/YARN-10251
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Legacy RM UI With Not All Resources Shown.png, Updated 
> NodesPage UI With GPU columns.png, Updated RM UI With All Resources 
> Shown.png.png, YARN-10251.003.patch, YARN-10251.004.patch, 
> YARN-10251.005.patch, YARN-10251.006.patch, YARN-10251.branch-2.10.001.patch, 
> YARN-10251.branch-2.10.002.patch, YARN-10251.branch-2.10.003.patch, 
> YARN-10251.branch-2.10.005.patch, YARN-10251.branch-2.10.006.patch, 
> YARN-10251.branch-3.2.004.patch, YARN-10251.branch-3.2.005.patch, 
> YARN-10251.branch-3.2.006.patch
>
>
> It would be great to update the legacy RM UI to include GPU resources in the 
> overview and in the per-app sections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6492) Generate queue metrics for each partition

2020-06-01 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-6492:

Fix Version/s: 2.10.1
   2.9.3

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Fix For: 2.9.3, 3.2.2, 2.10.1, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.10.019.patch, 
> YARN-6492-branch-2.8.014.patch, YARN-6492-branch-2.9.015.patch, 
> YARN-6492-branch-3.1.018.patch, YARN-6492-branch-3.2.017.patch, 
> YARN-6492-junits.patch, YARN-6492.001.patch, YARN-6492.002.patch, 
> YARN-6492.003.patch, YARN-6492.004.patch, YARN-6492.005.WIP.patch, 
> YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, 
> YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, YARN-6492.011.WIP.patch, 
> YARN-6492.012.WIP.patch, YARN-6492.013.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object, which captures metrics either in the 
> default partition or across all partitions. (After YARN-6467 it will be in 
> the default partition.)
> But having per-partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10297) TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails intermittently

2020-05-30 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120351#comment-17120351
 ] 

Jonathan Hung commented on YARN-10297:
--

Thanks. TestContinuousScheduling stops the RM on teardown. I also tried 
stopping the RM in the setup method, and it still failed with the same 
exception somehow.
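One common way to avoid the "Metrics source ... already exists" failure between 
tests is to reset the metrics system before each test; a minimal sketch, 
assuming QueueMetrics.clearQueueMetrics() and DefaultMetricsSystem.shutdown() 
are usable here as in other RM tests (whether this actually fixes this 
particular failure is still open, given the note above):

{code:java}
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics;
import org.junit.Before;

@Before
public void setUp() {
  // Drop QueueMetrics/PartitionQueueMetrics sources registered by a previous
  // test in the same JVM, then shut down the shared metrics system so sources
  // can be re-registered without "already exists" errors.
  QueueMetrics.clearQueueMetrics();
  DefaultMetricsSystem.shutdown();
  // ... create the scheduler/RM for the test ...
}
{code}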

> TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails 
> intermittently
> ---
>
> Key: YARN-10297
> URL: https://issues.apache.org/jira/browse/YARN-10297
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Priority: Major
>
> After YARN-6492, testFairSchedulerContinuousSchedulingInitTime fails 
> intermittently when running {{mvn test -Dtest=TestContinuousScheduling}}
> {noformat}[INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling
> [ERROR] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.682 
> s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling
> [ERROR] 
> testFairSchedulerContinuousSchedulingInitTime(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling)
>   Time elapsed: 0.194 s  <<< ERROR!
> org.apache.hadoop.metrics2.MetricsException: Metrics source 
> PartitionQueueMetrics,partition= already exists!
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
>   at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionMetrics(QueueMetrics.java:362)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.incrPendingResources(QueueMetrics.java:601)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updatePendingResources(AppSchedulingInfo.java:388)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:320)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:347)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updateResourceRequests(AppSchedulingInfo.java:183)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.updateResourceRequests(SchedulerApplicationAttempt.java:456)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:898)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling.testFairSchedulerContinuousSchedulingInitTime(TestContinuousScheduling.java:375)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6492) Generate queue metrics for each partition

2020-05-30 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-6492:

Attachment: YARN-6492-branch-2.10.019.patch

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.10.019.patch, 
> YARN-6492-branch-2.8.014.patch, YARN-6492-branch-2.9.015.patch, 
> YARN-6492-branch-3.1.018.patch, YARN-6492-branch-3.2.017.patch, 
> YARN-6492-junits.patch, YARN-6492.001.patch, YARN-6492.002.patch, 
> YARN-6492.003.patch, YARN-6492.004.patch, YARN-6492.005.WIP.patch, 
> YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, 
> YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, YARN-6492.011.WIP.patch, 
> YARN-6492.012.WIP.patch, YARN-6492.013.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object, which captures metrics either in the 
> default partition or across all partitions. (After YARN-6467 it will be in 
> the default partition.)
> But having per-partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6492) Generate queue metrics for each partition

2020-05-30 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120349#comment-17120349
 ] 

Jonathan Hung commented on YARN-6492:
-

Thanks. I see this method is only used in tests in trunk too. I prefer to keep 
this method, remove the partition==null / partition == empty string check as in 
the trunk patch, and remove this method in another JIRA so that the branches 
are consistent. [~maniraj...@gmail.com] what do you think?

I attached [^YARN-6492-branch-2.10.019.patch] for this. Can you take a look?

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.10.019.patch, 
> YARN-6492-branch-2.8.014.patch, YARN-6492-branch-2.9.015.patch, 
> YARN-6492-branch-3.1.018.patch, YARN-6492-branch-3.2.017.patch, 
> YARN-6492-junits.patch, YARN-6492.001.patch, YARN-6492.002.patch, 
> YARN-6492.003.patch, YARN-6492.004.patch, YARN-6492.005.WIP.patch, 
> YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, 
> YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, YARN-6492.011.WIP.patch, 
> YARN-6492.012.WIP.patch, YARN-6492.013.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object, which captures metrics either in the 
> default partition or across all partitions. (After YARN-6467 it will be in 
> the default partition.)
> But having per-partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (YARN-10297) TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails intermittently

2020-05-29 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-10297:
-
Comment: was deleted

(was: | (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
44s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
1s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
52s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 49s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
34s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
40s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
39s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 48s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
40s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 87m 
44s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
38s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}147m 58s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-YARN-Build/26090/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-10297 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13004381/YARN-10297.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux 99aae03f4838 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 19f26a020e2 |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/26090/testReport/ |
| Max. process+thread count | 891 (vs. ulimit of 5500) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 U: 
hadoop-yarn-project/hadoo

[jira] [Updated] (YARN-6492) Generate queue metrics for each partition

2020-05-29 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-6492:

Attachment: YARN-6492-branch-3.2.017.patch
YARN-6492-branch-3.1.018.patch

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.8.014.patch, 
> YARN-6492-branch-2.9.015.patch, YARN-6492-branch-3.1.018.patch, 
> YARN-6492-branch-3.2.017.patch, YARN-6492-junits.patch, YARN-6492.001.patch, 
> YARN-6492.002.patch, YARN-6492.003.patch, YARN-6492.004.patch, 
> YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, 
> YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, 
> YARN-6492.011.WIP.patch, YARN-6492.012.WIP.patch, YARN-6492.013.patch, 
> partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object, which captures metrics either in the 
> default partition or across all partitions. (After YARN-6467 it will be in 
> the default partition.)
> But having per-partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10297) TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails intermittently

2020-05-29 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung reassigned YARN-10297:


Assignee: (was: Jonathan Hung)

> TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails 
> intermittently
> ---
>
> Key: YARN-10297
> URL: https://issues.apache.org/jira/browse/YARN-10297
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Priority: Major
>
> After YARN-6492, testFairSchedulerContinuousSchedulingInitTime fails 
> intermittently when running {{mvn test -Dtest=TestContinuousScheduling}}
> {noformat}[INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling
> [ERROR] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.682 
> s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling
> [ERROR] 
> testFairSchedulerContinuousSchedulingInitTime(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling)
>   Time elapsed: 0.194 s  <<< ERROR!
> org.apache.hadoop.metrics2.MetricsException: Metrics source 
> PartitionQueueMetrics,partition= already exists!
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
>   at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionMetrics(QueueMetrics.java:362)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.incrPendingResources(QueueMetrics.java:601)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updatePendingResources(AppSchedulingInfo.java:388)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:320)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:347)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updateResourceRequests(AppSchedulingInfo.java:183)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.updateResourceRequests(SchedulerApplicationAttempt.java:456)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:898)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling.testFairSchedulerContinuousSchedulingInitTime(TestContinuousScheduling.java:375)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10297) TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails intermittently

2020-05-29 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-10297:
-
Description: 
After YARN-6492, testFairSchedulerContinuousSchedulingInitTime fails 
intermittently when running {{mvn test -Dtest=TestContinuousScheduling}}
{noformat}[INFO] Running 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling
[ERROR] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.682 s 
<<< FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling
[ERROR] 
testFairSchedulerContinuousSchedulingInitTime(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling)
  Time elapsed: 0.194 s  <<< ERROR!
org.apache.hadoop.metrics2.MetricsException: Metrics source 
PartitionQueueMetrics,partition= already exists!
at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionMetrics(QueueMetrics.java:362)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.incrPendingResources(QueueMetrics.java:601)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updatePendingResources(AppSchedulingInfo.java:388)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:320)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:347)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updateResourceRequests(AppSchedulingInfo.java:183)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.updateResourceRequests(SchedulerApplicationAttempt.java:456)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:898)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling.testFairSchedulerContinuousSchedulingInitTime(TestContinuousScheduling.java:375)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
{noformat}

  was:
After YARN-6492, testFairSchedulerContinuousSchedulingInitTime fails 
intermittently.
{noformat}[INFO] Running 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling
[ERROR] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.682 s 
<<< FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling
[ERROR] 
testFairSchedulerContinuousSchedulingInitTime(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling)
  Time elapsed: 0.194 s  <<< ERROR!
org.apache.hadoop.metrics2.MetricsException: Metrics source 
PartitionQueueMetrics,partition= already exists!
at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionMetrics(QueueMetrics.java:362)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.incrPendingResources(QueueMetrics.java:601)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updatePendingResources(AppSchedulingInfo.java:388)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:320)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:347)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updateResourceRequests(AppSchedulingInfo.java:183)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.updateResourceRequests(SchedulerApplicationAttempt.java:456)

[jira] [Commented] (YARN-10297) TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails intermittently

2020-05-29 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119970#comment-17119970
 ] 

Jonathan Hung commented on YARN-10297:
--

[~maniraj...@gmail.com] while debugging this, I noticed getPartitionMetrics is 
not synchronized. Adding synchronization did not fix the issue in this JIRA, but 
it seems like we may still need it?
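A minimal, self-contained sketch of the check-then-register race in question 
(illustrative names only, not the actual QueueMetrics internals):
{noformat}
import java.util.HashMap;
import java.util.Map;

// Sketch only: class, field and method names are made up for illustration.
// The point is that without "synchronized", two threads can both miss the
// lookup and both try to register the same
// "PartitionQueueMetrics,partition=..." source, which raises the
// MetricsException seen in the stack trace above.
class PartitionMetricsRegistrySketch {
  private static final Map<String, Object> SOURCES = new HashMap<>();

  static synchronized Object getPartitionMetrics(String partition) {
    return SOURCES.computeIfAbsent(partition,
        p -> new Object() /* create and register the per-partition source once */);
  }
}
{noformat}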

> TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails 
> intermittently
> ---
>
> Key: YARN-10297
> URL: https://issues.apache.org/jira/browse/YARN-10297
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>
> After YARN-6492, testFairSchedulerContinuousSchedulingInitTime fails 
> intermittently.
> {noformat}[INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling
> [ERROR] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.682 
> s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling
> [ERROR] 
> testFairSchedulerContinuousSchedulingInitTime(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling)
>   Time elapsed: 0.194 s  <<< ERROR!
> org.apache.hadoop.metrics2.MetricsException: Metrics source 
> PartitionQueueMetrics,partition= already exists!
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
>   at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionMetrics(QueueMetrics.java:362)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.incrPendingResources(QueueMetrics.java:601)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updatePendingResources(AppSchedulingInfo.java:388)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:320)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:347)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updateResourceRequests(AppSchedulingInfo.java:183)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.updateResourceRequests(SchedulerApplicationAttempt.java:456)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:898)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling.testFairSchedulerContinuousSchedulingInitTime(TestContinuousScheduling.java:375)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Updated] (YARN-10297) TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails intermittently

2020-05-29 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-10297:
-
Attachment: (was: YARN-10297.001.patch)

> TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails 
> intermittently
> ---
>
> Key: YARN-10297
> URL: https://issues.apache.org/jira/browse/YARN-10297
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>
> After YARN-6492, testFairSchedulerContinuousSchedulingInitTime fails 
> intermittently.
> {noformat}[INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling
> [ERROR] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.682 
> s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling
> [ERROR] 
> testFairSchedulerContinuousSchedulingInitTime(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling)
>   Time elapsed: 0.194 s  <<< ERROR!
> org.apache.hadoop.metrics2.MetricsException: Metrics source 
> PartitionQueueMetrics,partition= already exists!
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
>   at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionMetrics(QueueMetrics.java:362)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.incrPendingResources(QueueMetrics.java:601)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updatePendingResources(AppSchedulingInfo.java:388)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:320)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:347)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updateResourceRequests(AppSchedulingInfo.java:183)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.updateResourceRequests(SchedulerApplicationAttempt.java:456)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:898)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling.testFairSchedulerContinuousSchedulingInitTime(TestContinuousScheduling.java:375)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Assigned] (YARN-10297) TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails intermittently

2020-05-29 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung reassigned YARN-10297:


Assignee: Jonathan Hung

> TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails 
> intermittently
> ---
>
> Key: YARN-10297
> URL: https://issues.apache.org/jira/browse/YARN-10297
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-10297.001.patch
>
>
> After YARN-6492, testFairSchedulerContinuousSchedulingInitTime fails 
> intermittently.
> {noformat}[INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling
> [ERROR] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.682 
> s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling
> [ERROR] 
> testFairSchedulerContinuousSchedulingInitTime(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling)
>   Time elapsed: 0.194 s  <<< ERROR!
> org.apache.hadoop.metrics2.MetricsException: Metrics source 
> PartitionQueueMetrics,partition= already exists!
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
>   at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionMetrics(QueueMetrics.java:362)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.incrPendingResources(QueueMetrics.java:601)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updatePendingResources(AppSchedulingInfo.java:388)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:320)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:347)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updateResourceRequests(AppSchedulingInfo.java:183)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.updateResourceRequests(SchedulerApplicationAttempt.java:456)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:898)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling.testFairSchedulerContinuousSchedulingInitTime(TestContinuousScheduling.java:375)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Updated] (YARN-10297) TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails intermittently

2020-05-29 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-10297:
-
Attachment: YARN-10297.001.patch

> TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails 
> intermittently
> ---
>
> Key: YARN-10297
> URL: https://issues.apache.org/jira/browse/YARN-10297
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Priority: Major
> Attachments: YARN-10297.001.patch
>
>
> After YARN-6492, testFairSchedulerContinuousSchedulingInitTime fails 
> intermittently.
> {noformat}[INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling
> [ERROR] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.682 
> s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling
> [ERROR] 
> testFairSchedulerContinuousSchedulingInitTime(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling)
>   Time elapsed: 0.194 s  <<< ERROR!
> org.apache.hadoop.metrics2.MetricsException: Metrics source 
> PartitionQueueMetrics,partition= already exists!
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
>   at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionMetrics(QueueMetrics.java:362)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.incrPendingResources(QueueMetrics.java:601)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updatePendingResources(AppSchedulingInfo.java:388)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:320)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:347)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updateResourceRequests(AppSchedulingInfo.java:183)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.updateResourceRequests(SchedulerApplicationAttempt.java:456)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:898)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling.testFairSchedulerContinuousSchedulingInitTime(TestContinuousScheduling.java:375)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition

2020-05-29 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119915#comment-17119915
 ] 

Jonathan Hung edited comment on YARN-6492 at 5/29/20, 9:26 PM:
---

Looks like TestContinuousScheduling is failing intermittently. I filed 
YARN-10297 for this issue.


was (Author: jhung):
Looks like TestContinuousScheduling is failing in branch-3.1 and below (it 
succeeds in branch-3.2). I'm able to trigger it by running:

mvn test 
-Dtest=TestContinuousScheduling#testBasic,TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime

None of the other tests which run before 
testFairSchedulerContinuousSchedulingInitTime seem to trigger the issue.

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.8.014.patch, 
> YARN-6492-branch-2.9.015.patch, YARN-6492-junits.patch, YARN-6492.001.patch, 
> YARN-6492.002.patch, YARN-6492.003.patch, YARN-6492.004.patch, 
> YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, 
> YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, 
> YARN-6492.011.WIP.patch, YARN-6492.012.WIP.patch, YARN-6492.013.patch, 
> partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Created] (YARN-10297) TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails intermittently

2020-05-29 Thread Jonathan Hung (Jira)
Jonathan Hung created YARN-10297:


 Summary: 
TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails 
intermittently
 Key: YARN-10297
 URL: https://issues.apache.org/jira/browse/YARN-10297
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jonathan Hung


After YARN-6492, testFairSchedulerContinuousSchedulingInitTime fails 
intermittently.
{noformat}[INFO] Running 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling
[ERROR] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.682 s 
<<< FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling
[ERROR] 
testFairSchedulerContinuousSchedulingInitTime(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling)
  Time elapsed: 0.194 s  <<< ERROR!
org.apache.hadoop.metrics2.MetricsException: Metrics source 
PartitionQueueMetrics,partition= already exists!
at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionMetrics(QueueMetrics.java:362)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.incrPendingResources(QueueMetrics.java:601)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updatePendingResources(AppSchedulingInfo.java:388)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:320)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:347)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updateResourceRequests(AppSchedulingInfo.java:183)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.updateResourceRequests(SchedulerApplicationAttempt.java:456)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:898)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling.testFairSchedulerContinuousSchedulingInitTime(TestContinuousScheduling.java:375)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition

2020-05-29 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119915#comment-17119915
 ] 

Jonathan Hung edited comment on YARN-6492 at 5/29/20, 8:43 PM:
---

Looks like TestContinuousScheduling is failing in branch-3.1 and below (it 
succeeds in branch-3.2). I'm able to trigger it by running:

mvn test 
-Dtest=TestContinuousScheduling#testBasic,TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime

None of the other tests which run before 
testFairSchedulerContinuousSchedulingInitTime seem to trigger the issue.


was (Author: jhung):
Looks like TestContinuousScheduling is failing in branch-3.1 and below (it 
succeeds in branch-3.2).

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.8.014.patch, 
> YARN-6492-branch-2.9.015.patch, YARN-6492-junits.patch, YARN-6492.001.patch, 
> YARN-6492.002.patch, YARN-6492.003.patch, YARN-6492.004.patch, 
> YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, 
> YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, 
> YARN-6492.011.WIP.patch, YARN-6492.012.WIP.patch, YARN-6492.013.patch, 
> partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (YARN-6492) Generate queue metrics for each partition

2020-05-29 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119915#comment-17119915
 ] 

Jonathan Hung commented on YARN-6492:
-

Looks like TestContinuousScheduling is failing in branch-3.1 and below (it 
succeeds in branch-3.2).

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.8.014.patch, 
> YARN-6492-branch-2.9.015.patch, YARN-6492-junits.patch, YARN-6492.001.patch, 
> YARN-6492.002.patch, YARN-6492.003.patch, YARN-6492.004.patch, 
> YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, 
> YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, 
> YARN-6492.011.WIP.patch, YARN-6492.012.WIP.patch, YARN-6492.013.patch, 
> partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Updated] (YARN-6492) Generate queue metrics for each partition

2020-05-29 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-6492:

Fix Version/s: 3.1.5
   3.3.1
   3.2.2

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.8.014.patch, 
> YARN-6492-branch-2.9.015.patch, YARN-6492-junits.patch, YARN-6492.001.patch, 
> YARN-6492.002.patch, YARN-6492.003.patch, YARN-6492.004.patch, 
> YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, 
> YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, 
> YARN-6492.011.WIP.patch, YARN-6492.012.WIP.patch, YARN-6492.013.patch, 
> partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (YARN-6492) Generate queue metrics for each partition

2020-05-29 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119823#comment-17119823
 ] 

Jonathan Hung commented on YARN-6492:
-

Thanks [~maniraj...@gmail.com]. For the branch-2.10 patch, do we need to remove 
the {noformat}if (partition == null || 
partition.equals(RMNodeLabelsManager.NO_LABEL)) {{noformat} check in 
{noformat}public void allocateResources(String partition, String user, Resource 
res) {{noformat} ?
Other than that, branch-2.10 and branch-2.9 patch LGTM. Since branch-2.8 is EOL 
we don't need to port it there.

I attached branch-3.2 and branch-3.1 patches containing trivial fixes. Pushed 
this to branch-3.3, branch-3.2, branch-3.1.
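
A rough, self-contained sketch of the guard in question (this is not the actual 
branch-2.10 QueueMetrics code; NO_LABEL here stands in for 
RMNodeLabelsManager.NO_LABEL):
{noformat}
// Illustrative only. With the guard kept, only the default (unlabeled)
// partition reaches the per-partition update; with the guard removed,
// labeled partitions such as "x" or "y" take that path as well.
class AllocateResourcesSketch {
  static final String NO_LABEL = "";

  long allocatedMbAllPartitions = 0;
  long allocatedMbDefaultPartition = 0;

  void allocateResources(String partition, String user, long memMb) {
    allocatedMbAllPartitions += memMb;
    if (partition == null || partition.equals(NO_LABEL)) {
      allocatedMbDefaultPartition += memMb;   // the branch guarded by the check
    }
  }
}
{noformat}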



> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.8.014.patch, 
> YARN-6492-branch-2.9.015.patch, YARN-6492-junits.patch, YARN-6492.001.patch, 
> YARN-6492.002.patch, YARN-6492.003.patch, YARN-6492.004.patch, 
> YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, 
> YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, 
> YARN-6492.011.WIP.patch, YARN-6492.012.WIP.patch, YARN-6492.013.patch, 
> partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Updated] (YARN-6492) Generate queue metrics for each partition

2020-05-29 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-6492:

Attachment: (was: YARN-6492-branch-3.2.017.patch)

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.8.014.patch, 
> YARN-6492-branch-2.9.015.patch, YARN-6492-junits.patch, YARN-6492.001.patch, 
> YARN-6492.002.patch, YARN-6492.003.patch, YARN-6492.004.patch, 
> YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, 
> YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, 
> YARN-6492.011.WIP.patch, YARN-6492.012.WIP.patch, YARN-6492.013.patch, 
> partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Updated] (YARN-6492) Generate queue metrics for each partition

2020-05-29 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-6492:

Attachment: (was: YARN-6492-branch-3.1.018.patch)

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.8.014.patch, 
> YARN-6492-branch-2.9.015.patch, YARN-6492-junits.patch, YARN-6492.001.patch, 
> YARN-6492.002.patch, YARN-6492.003.patch, YARN-6492.004.patch, 
> YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, 
> YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, 
> YARN-6492.011.WIP.patch, YARN-6492.012.WIP.patch, YARN-6492.013.patch, 
> partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Updated] (YARN-6492) Generate queue metrics for each partition

2020-05-29 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-6492:

Attachment: YARN-6492-branch-3.1.018.patch

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.8.014.patch, 
> YARN-6492-branch-2.9.015.patch, YARN-6492-branch-3.1.018.patch, 
> YARN-6492-branch-3.2.017.patch, YARN-6492-junits.patch, YARN-6492.001.patch, 
> YARN-6492.002.patch, YARN-6492.003.patch, YARN-6492.004.patch, 
> YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, 
> YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, 
> YARN-6492.011.WIP.patch, YARN-6492.012.WIP.patch, YARN-6492.013.patch, 
> partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Updated] (YARN-6492) Generate queue metrics for each partition

2020-05-29 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-6492:

Attachment: YARN-6492-branch-3.2.017.patch

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.8.014.patch, 
> YARN-6492-branch-2.9.015.patch, YARN-6492-branch-3.2.017.patch, 
> YARN-6492-junits.patch, YARN-6492.001.patch, YARN-6492.002.patch, 
> YARN-6492.003.patch, YARN-6492.004.patch, YARN-6492.005.WIP.patch, 
> YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, 
> YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, YARN-6492.011.WIP.patch, 
> YARN-6492.012.WIP.patch, YARN-6492.013.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Updated] (YARN-6492) Generate queue metrics for each partition

2020-05-26 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-6492:

Fix Version/s: 3.4.0

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492-junits.patch, YARN-6492.001.patch, YARN-6492.002.patch, 
> YARN-6492.003.patch, YARN-6492.004.patch, YARN-6492.005.WIP.patch, 
> YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, 
> YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, YARN-6492.011.WIP.patch, 
> YARN-6492.012.WIP.patch, YARN-6492.013.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Updated] (YARN-6492) Generate queue metrics for each partition

2020-05-26 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-6492:

Attachment: YARN-6492.013.patch

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492-junits.patch, YARN-6492.001.patch, YARN-6492.002.patch, 
> YARN-6492.003.patch, YARN-6492.004.patch, YARN-6492.005.WIP.patch, 
> YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, 
> YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, YARN-6492.011.WIP.patch, 
> YARN-6492.012.WIP.patch, YARN-6492.013.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (YARN-6492) Generate queue metrics for each partition

2020-05-26 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117171#comment-17117171
 ] 

Jonathan Hung commented on YARN-6492:
-

Attached [^YARN-6492.013.patch] which fixes the whitespace issues and pushed to 
trunk.

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492-junits.patch, YARN-6492.001.patch, YARN-6492.002.patch, 
> YARN-6492.003.patch, YARN-6492.004.patch, YARN-6492.005.WIP.patch, 
> YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, 
> YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, YARN-6492.011.WIP.patch, 
> YARN-6492.012.WIP.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (YARN-6492) Generate queue metrics for each partition

2020-05-26 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17116955#comment-17116955
 ] 

Jonathan Hung commented on YARN-6492:
-

Thanks [~maniraj...@gmail.com]. [^YARN-6492.012.WIP.patch] LGTM. 
TestCapacitySchedulerAutoQueueCreation passes locally. I will commit EOD 
pending jenkins if no objections. I can review the branch specific patches once 
those are uploaded.

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492-junits.patch, YARN-6492.001.patch, YARN-6492.002.patch, 
> YARN-6492.003.patch, YARN-6492.004.patch, YARN-6492.005.WIP.patch, 
> YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, 
> YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, YARN-6492.011.WIP.patch, 
> YARN-6492.012.WIP.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (YARN-6492) Generate queue metrics for each partition

2020-05-25 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17116274#comment-17116274
 ] 

Jonathan Hung commented on YARN-6492:
-

Ok, I see. On line 2542, can we remove the nm1 heartbeats and change the 
asserts accordingly? This appears to test that requesting default partition 
containers will get allocated to nm2, but if we heartbeat to nm1 before nm2, 
then they will get allocated to nm1 and we lose this test case.

Can we fix the two whitespace issues too?

For the TestCapacitySchedulerAutoQueueCreation test failures, the issue seems to 
be specific to PartitionQueueMetrics/PartitionMetrics somehow. I ran these tests 
before the patch and they succeed, meaning the metrics system is getting reset 
properly.

Also, once we resolve these issues, will you upload patches for branches up to 
branch-2.10?

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492-junits.patch, YARN-6492.001.patch, YARN-6492.002.patch, 
> YARN-6492.003.patch, YARN-6492.004.patch, YARN-6492.005.WIP.patch, 
> YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, 
> YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, YARN-6492.011.WIP.patch, 
> partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition

2020-05-22 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114350#comment-17114350
 ] 

Jonathan Hung edited comment on YARN-6492 at 5/22/20, 11:34 PM:


Thank you [~maniraj...@gmail.com]. Some more comments:
* Delete printlns in PartitionQueueMetrics
* Delete VisibleForTesting import in CSQueueMetrics
* Can we address the checkstyle/whitespace/javadoc/findbugs/asflicense issues?
* TestCapacitySchedulerAutoQueueCreation failure looks related to this patch. 
TestFairSchedulerPreemption passes locally for me.

Ran some tests on a live cluster, everything looks good. I noticed we don't 
have CSQueueMetrics for partitioned metrics, it would be good to have those, 
but we can address this in another jira.


was (Author: jhung):
Thank you [~maniraj...@gmail.com]. Some more comments:
* Delete printlns in PartitionQueueMetrics
* Delete VisibleForTesting import in CSQueueMetrics
* Can we address the checkstyle/whitespace/javadoc/findbugs/asflicense issues?
* Not sure if unit test failures are related. Let's see the next jenkins run.

Ran some tests on a live cluster, everything looks good. I noticed we don't 
have CSQueueMetrics for partitioned metrics, it would be good to have those, 
but we can address this in another jira.

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, 
> YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, 
> YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, 
> YARN-6492.010.WIP.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition

2020-05-22 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114350#comment-17114350
 ] 

Jonathan Hung edited comment on YARN-6492 at 5/22/20, 10:50 PM:


Thank you [~maniraj...@gmail.com]. Some more comments:
* Delete printlns in PartitionQueueMetrics
* Delete VisibleForTesting import in CSQueueMetrics
* Can we address the checkstyle/whitespace/javadoc/findbugs/asflicense issues?
* Not sure if unit test failures are related. Let's see the next jenkins run.

Ran some tests on a live cluster, everything looks good. I noticed we don't 
have CSQueueMetrics for partitioned metrics, it would be good to have those, 
but we can address this in another jira.


was (Author: jhung):
Thank you [~maniraj...@gmail.com]. Some more comments:
* Delete printlns in PartitionQueueMetrics
* Delete VisibleForTesting import in CSQueueMetrics
* Can we address the checkstyle/whitespace/javadoc/findbugs/asflicense issues?
* Not sure if unit test failures are related. Let's see the next jenkins run.
I'll run some tests on a live cluster in the meantime.

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, 
> YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, 
> YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, 
> YARN-6492.010.WIP.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (YARN-6492) Generate queue metrics for each partition

2020-05-22 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114350#comment-17114350
 ] 

Jonathan Hung commented on YARN-6492:
-

Thank you [~maniraj...@gmail.com]. Some more comments:
* Delete printlns in PartitionQueueMetrics
* Delete VisibleForTesting import in CSQueueMetrics
* Can we address the checkstyle/whitespace/javadoc/findbugs/asflicense issues?
* Not sure if unit test failures are related. Let's see the next jenkins run.
I'll run some tests on a live cluster in the meantime.

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, 
> YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, 
> YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, 
> YARN-6492.010.WIP.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition

2020-05-20 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112648#comment-17112648
 ] 

Jonathan Hung edited comment on YARN-6492 at 5/20/20, 10:59 PM:


Thank you [~maniraj...@gmail.com]. Looks fine at a high level. A few comments:
* We can change parentQueue in QueueMetrics.java to be Queue instead of 
AbstractCSQueue (to fix test cases)
* Right now we're concatenating QUEUE_METRICS keys as "partition + queuePath + 
userName"; can we change this to "partition + '.' + userName + '.' + queuePath"? 
In particular, the queuePath + userName part could cause conflicts (e.g. a queue 
named "root.auser" could collide with user metrics under queue "root.a" and 
username "user"). Putting the user before the queue and adding the delimiter 
should prevent the user from being interpreted as part of the queue path (see 
the sketch after this list). I see a few places for this:
# PartitionQueueMetrics#constructor#parentMetricName
# PartitionQueueMetrics#getUserMetrics#metricName
# QueueMetrics#getUserMetrics#metricName
# QueueMetrics#getPartitionQueueMetrics#metricName
# Key for QueueMetrics#getPartitionMetrics could collide if the partition name 
is "root"
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add the metrics object to QUEUE_METRICS, since we're 
accessing user metrics via the {{users}} map (and not the QUEUE_METRICS map)
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add queue path to the key, since the {{users}} map is 
not static
* QueueMetrics#queueSource method does not seem to be used anywhere, can we 
delete it?
* How come we need a CSQueueMetrics#forQueue implementation? It looks the same 
as QueueMetrics#forQueue
* We shouldn't add capacity scheduler specific things in QueueInfo, are these 
changes needed?
* For partition metrics, I don't think setAvailableResourcesToQueue is handled 
correctly. It appears to update partition metrics no matter which queue this 
method is invoked for. Thus for example on line 87 of TestPartitionQueueMetrics:
{noformat}checkResources(partitionSource, 0, 0, 0, 100 * GB, 100, 2 * GB, 2, 
2);{noformat}
should be
{noformat}checkResources(partitionSource, 0, 0, 0, 200 * GB, 200, 2 * GB, 2, 
2);{noformat}
Perhaps we should only update partition metrics in setAvailableResourcesToQueue 
if the queue is root?
* Delete {noformat}System.out.println(" final is " + 
parentQueueSource_X.toString());{noformat}
* Same in TestQueueMetrics, there should not be capacity scheduler specific 
logic here, can we remove these changes?
* On line 2539 of TestNodeLabelContainerAllocation, should
{noformat}assertEquals(2 * GB, queueAUserMetrics.getAvailableMB(), 
delta);{noformat}
be 
{noformat}assertEquals(1.5 * GB, queueAUserMetrics.getAvailableMB(), 
delta);{noformat}
?
* Do we need the tests after line 2551 on TestNodeLabelContainerAllocation? The 
stuff removed seems to be non-exclusive node label functionality (default 
partition node heartbeating, and checking queue metrics are correct), so we 
probably want to keep these tests.
* On line 2566, how is node1 getting 8 containers if queue A's max capacity is 
only 50% of 10GB = 5GB?


was (Author: jhung):
Thank you [~maniraj...@gmail.com]. Looks fine at a high level. A few comments:
* We can change parentQueue in QueueMetrics.java to be Queue instead of 
AbstractCSQueue (to fix test cases)
* Right now we're concatenating QUEUE_METRICS keys as "partition + queuePath + 
userName", can we change this to "partition + '.' + userName + '.' + queuePath" 
? In particular the queuePath + userName part could cause conflicts (e.g. queue 
named "root.auser" could conflict with user metrics under queue "root.a" and 
username "user"). Putting the user before the queue and adding the delimiter 
should prevent the user from being interpreted as part of the queue path. I see 
a few places for this:
# PartitionQueueMetrics#constructor#parentMetricName
# PartitionQueueMetrics#getUserMetrics#metricName
# QueueMetrics#getUserMetrics#metricName
# QueueMetrics#getPartitionQueueMetrics#metricName
# Key for QueueMetrics#getPartitionMetrics could collide if the partition name 
is "root"
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add the metrics object to QUEUE_METRICS, since we're 
accessing user metrics via the user map (and not the QUEUE_METRICS map)
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add queue path to the key, since the users map is not 
static
* QueueMetrics#queueSource method does not seem to be used anywhere, can we 
delete it?
* How come we need a CSQueueMetrics#forQueue implementation? It looks the same 
as QueueMetrics#forQueue
* We shouldn't add capacity scheduler specific things in QueueInfo, are these 
changes ne

[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition

2020-05-20 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112648#comment-17112648
 ] 

Jonathan Hung edited comment on YARN-6492 at 5/20/20, 10:58 PM:


Thank you [~maniraj...@gmail.com]. Looks fine at a high level. A few comments:
* We can change parentQueue in QueueMetrics.java to be Queue instead of 
AbstractCSQueue (to fix test cases)
* Right now we're concatenating QUEUE_METRICS keys as "partition + queuePath + 
userName", can we change this to "partition + '.' + userName + '.' + queuePath" 
? In particular the queuePath + userName part could cause conflicts (e.g. queue 
named "root.auser" could conflict with user metrics under queue "root.a" and 
username "user"). Putting the user before the queue and adding the delimiter 
should prevent the user from being interpreted as part of the queue path. I see 
a few places for this:
# PartitionQueueMetrics#constructor#parentMetricName
# PartitionQueueMetrics#getUserMetrics#metricName
# QueueMetrics#getUserMetrics#metricName
# QueueMetrics#getPartitionQueueMetrics#metricName
# Key for QueueMetrics#getPartitionMetrics could collide if the partition name 
is "root"
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add the metrics object to QUEUE_METRICS, since we're 
accessing user metrics via the user map (and not the QUEUE_METRICS map)
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add queue path to the key, since the users map is not 
static
* QueueMetrics#queueSource method does not seem to be used anywhere, can we 
delete it?
* How come we need a CSQueueMetrics#forQueue implementation? It looks the same 
as QueueMetrics#forQueue
* We shouldn't add capacity scheduler specific things in QueueInfo, are these 
changes needed?
* For partition metrics, I don't think setAvailableResourcesToQueue is handled 
correctly. It appears to update partition metrics no matter which queue this 
method is invoked for. Thus for example on line 87 of TestPartitionQueueMetrics:
{noformat}checkResources(partitionSource, 0, 0, 0, 100 * GB, 100, 2 * GB, 2, 
2);{noformat}
should be
{noformat}checkResources(partitionSource, 0, 0, 0, 200 * GB, 200, 2 * GB, 2, 
2);{noformat}
Perhaps we should only update partition metrics in setAvailableResourcesToQueue 
if the queue is root?
* Delete {noformat}System.out.println(" final is " + 
parentQueueSource_X.toString());{noformat}
* Same in TestQueueMetrics, there should not be capacity scheduler specific 
logic here, can we remove these changes?
* On line 2539 of TestNodeLabelContainerAllocation, should
{noformat}assertEquals(2 * GB, queueAUserMetrics.getAvailableMB(), 
delta);{noformat}
be 
{noformat}assertEquals(1.5 * GB, queueAUserMetrics.getAvailableMB(), 
delta);{noformat}
?
* Do we need the tests after line 2551 on TestNodeLabelContainerAllocation? The 
stuff removed seems to be non-exclusive node label functionality (default 
partition node heartbeating, and checking queue metrics are correct), so we 
probably want to keep these tests.
* On line 2566, how is node1 getting 8 containers if queue A's max capacity is 
only 50% of 10GB = 5GB?


was (Author: jhung):
Thank you [~maniraj...@gmail.com]. Looks fine at a high level. A few comments:
* We can change parentQueue in QueueMetrics.java to be Queue instead of 
AbstractCSQueue (to fix test cases)
* Right now we're concatenating QUEUE_METRICS keys as "partition + queuePath + 
userName", can we change this to "partition + '.' + userName + '.' + queuePath" 
? In particular the queuePath + userName part could cause conflicts (e.g. queue 
named "root.auser" could conflict with user metrics under queue "root.a" and 
username "user"). I see a few places for this:
# PartitionQueueMetrics#constructor#parentMetricName
# PartitionQueueMetrics#getUserMetrics#metricName
# QueueMetrics#getUserMetrics#metricName
# QueueMetrics#getPartitionQueueMetrics#metricName
# Key for QueueMetrics#getPartitionMetrics could collide if the partition name 
is "root"
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add the metrics object to QUEUE_METRICS, since we're 
accessing user metrics via the user map (and not the QUEUE_METRICS map)
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add queue path to the key, since the users map is not 
static
* QueueMetrics#queueSource method does not seem to be used anywhere, can we 
delete it?
* How come we need a CSQueueMetrics#forQueue implementation? It looks the same 
as QueueMetrics#forQueue
* We shouldn't add capacity scheduler specific things in QueueInfo, are these 
changes needed?
* For partition metrics, I don't think setAvailableResourcesToQueue is handled 
correctly. It appears to update partition metrics no matte

[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition

2020-05-20 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112648#comment-17112648
 ] 

Jonathan Hung edited comment on YARN-6492 at 5/20/20, 10:57 PM:


Thank you [~maniraj...@gmail.com]. Looks fine at a high level. A few comments:
* We can change parentQueue in QueueMetrics.java to be Queue instead of 
AbstractCSQueue (to fix test cases)
* Right now we're concatenating QUEUE_METRICS keys as "partition + queuePath + 
userName", can we change this to "partition + '.' + userName + '.' + queuePath" 
? In particular the queuePath + userName part could cause conflicts (e.g. queue 
named "root.auser" could conflict with user metrics under queue "root.a" and 
username "user"). I see a few places for this:
# PartitionQueueMetrics#constructor#parentMetricName
# PartitionQueueMetrics#getUserMetrics#metricName
# QueueMetrics#getUserMetrics#metricName
# QueueMetrics#getPartitionQueueMetrics#metricName
# Key for QueueMetrics#getPartitionMetrics could collide if the partition name 
is "root"
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add the metrics object to QUEUE_METRICS, since we're 
accessing user metrics via the user map (and not the QUEUE_METRICS map)
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add queue path to the key, since the users map is not 
static
* QueueMetrics#queueSource method does not seem to be used anywhere, can we 
delete it?
* How come we need a CSQueueMetrics#forQueue implementation? It looks the same 
as QueueMetrics#forQueue
* We shouldn't add capacity scheduler specific things in QueueInfo, are these 
changes needed?
* For partition metrics, I don't think setAvailableResourcesToQueue is handled 
correctly. It appears to update partition metrics no matter which queue this 
method is invoked for. Thus for example on line 87 of TestPartitionQueueMetrics:
{noformat}checkResources(partitionSource, 0, 0, 0, 100 * GB, 100, 2 * GB, 2, 
2);{noformat}
should be
{noformat}checkResources(partitionSource, 0, 0, 0, 200 * GB, 200, 2 * GB, 2, 
2);{noformat}
Perhaps we should only update partition metrics in setAvailableResourcesToQueue 
if the queue is root?
* Delete {noformat}System.out.println(" final is " + 
parentQueueSource_X.toString());{noformat}
* Same in TestQueueMetrics, there should not be capacity scheduler specific 
logic here, can we remove these changes?
* On line 2539 of TestNodeLabelContainerAllocation, should
{noformat}assertEquals(2 * GB, queueAUserMetrics.getAvailableMB(), 
delta);{noformat}
be 
{noformat}assertEquals(1.5 * GB, queueAUserMetrics.getAvailableMB(), 
delta);{noformat}
?
* Do we need the tests after line 2551 on TestNodeLabelContainerAllocation? The 
stuff removed seems to be non-exclusive node label functionality (default 
partition node heartbeating, and checking queue metrics are correct), so we 
probably want to keep these tests.
* On line 2566, how is node1 getting 8 containers if queue A's max capacity is 
only 50% of 10GB = 5GB?


was (Author: jhung):
Thank you [~maniraj...@gmail.com]. Looks fine at a high level. A few comments:
* We can change parentQueue in QueueMetrics.java to be Queue instead of 
AbstractCSQueue (to fix test cases)
* Right now we're concatenating QUEUE_METRICS keys as "partition + queuePath + 
userName", can we change this to "partition + '.' + userName + '.' + queuePath" 
? In particular the queuePath + userName part could cause conflicts (e.g. queue 
named "root.auser" could conflict with user metrics under queue "root.a" and 
username "user"). I see a few places for this:
# PartitionQueueMetrics#constructor#parentMetricName
# PartitionQueueMetrics#getUserMetrics#metricName
# QueueMetrics#getUserMetrics#metricName
# QueueMetrics#getPartitionQueueMetrics#metricName
# Key for QueueMetrics#getPartitionMetrics could collide if the partition name 
is "root"
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add the metrics object to QUEUE_METRICS, since we're 
accessing user metrics via the user map (and not the QUEUE_METRICS map)
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add queue path to the key, since the users map is not 
static
* QueueMetrics#queueSource method does not seem to be used anywhere, can we 
delete it?
* How come we need a CSQueueMetrics#forQueue implementation? It looks the same 
as QueueMetrics#forQueue
* We shouldn't add capacity scheduler specific things in QueueInfo, are these 
changes needed?
* For partition metrics, I don't think setAvailableResourcesToQueue is handled 
correctly. It appears to update partition metrics no matter which queue this 
method is invoked for. Thus for example on line 87 of TestPartitionQueueMetrics:
{noformat}checkResources(partition

[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition

2020-05-20 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112648#comment-17112648
 ] 

Jonathan Hung edited comment on YARN-6492 at 5/20/20, 10:57 PM:


Thank you [~maniraj...@gmail.com]. Looks fine at a high level. A few comments:
* We can change parentQueue in QueueMetrics.java to be Queue instead of 
AbstractCSQueue (to fix test cases)
* Right now we're concatenating QUEUE_METRICS keys as "partition + queuePath + 
userName", can we change this to "partition + '.' + userName + '.' + queuePath" 
? In particular the queuePath + userName part could cause conflicts (e.g. queue 
named "root.auser" could conflict with user metrics under queue "root.a" and 
username "user"). I see a few places for this:
# PartitionQueueMetrics#constructor#parentMetricName
# PartitionQueueMetrics#getUserMetrics#metricName
# QueueMetrics#getUserMetrics#metricName
# QueueMetrics#getPartitionQueueMetrics#metricName
# Key for QueueMetrics#getPartitionMetrics could collide if the partition name 
is "root"
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add the metrics object to QUEUE_METRICS, since we're 
accessing user metrics via the user map (and not the QUEUE_METRICS map)
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add queue path to the key, since the users map is not 
static
* QueueMetrics#queueSource method does not seem to be used anywhere, can we 
delete it?
* How come we need a CSQueueMetrics#forQueue implementation? It looks the same 
as QueueMetrics#forQueue
* We shouldn't add capacity scheduler specific things in QueueInfo, are these 
changes needed?
* For partition metrics, I don't think setAvailableResourcesToQueue is handled 
correctly. It appears to update partition metrics no matter which queue this 
method is invoked for. Thus for example on line 87 of TestPartitionQueueMetrics:
{noformat}checkResources(partitionSource, 0, 0, 0, 100 * GB, 100, 2 * GB, 2, 
2);{noformat}
should be
{noformat}checkResources(partitionSource, 0, 0, 0, 200 * GB, 200, 2 * GB, 2, 
2);{noformat}
Perhaps we should only update partition metrics in setAvailableResourcesToQueue 
if the queue is root?
* Delete {noformat}System.out.println(" final is " + 
parentQueueSource_X.toString());{noformat}
* Same in TestQueueMetrics, there should not be capacity scheduler specific 
logic here, can we remove these changes?
* On line 2539 of TestNodeLabelContainerAllocation, should
{noformat}assertEquals(2 * GB, queueAUserMetrics.getAvailableMB(), 
delta);{noformat}
be 
{noformat}assertEquals(1.5 * GB, queueAUserMetrics.getAvailableMB(), 
delta);{noformat}
?
* Do we need the tests after line 2551 on TestNodeLabelContainerAllocation? The 
stuff removed seems to be non-exclusive node label functionality (default 
partition node heartbeating, and checking queue metrics are correct), so we 
probably want to keep these tests.
* On line 2566, how is node1 getting 8 containers if queue A's max capacity is 
only 50% of 10GB = 5GB?


was (Author: jhung):
Thank you [~maniraj...@gmail.com]. Looks fine at a high level. A few comments:
* We can change parentQueue in QueueMetrics.java to be Queue instead of 
AbstractCSQueue (to fix test cases)
* Right now we're concatenating QUEUE_METRICS keys as "partition + queuePath + 
userName", can we change this to "partition + '.' + userName + '.' + queuePath" 
? In particular the queuePath + userName part could cause conflicts (e.g. queue 
named "root.auser" could conflict with user metrics under queue "root.a" and 
username "user"). I see a few places for this:
# PartitionQueueMetrics#constructor#parentMetricName
# PartitionQueueMetrics#getUserMetrics#metricName
# QueueMetrics#getUserMetrics#metricName
# QueueMetrics#getPartitionQueueMetrics#metricName
# Key for QueueMetrics#getPartitionMetrics could collide if the partition name 
is "root"
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add the metrics object to QUEUE_METRICS, since we're 
accessing user metrics via the user map (and not the QUEUE_METRICS map)
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add queue path to the key, since the users map is not 
static
* QueueMetrics#queueSource method does not seem to be used anywhere, can we 
delete it?
* How come we need a CSQueueMetrics#forQueue implementation? It looks the same 
as QueueMetrics#forQueue
* We shouldn't add capacity scheduler specific things in QueueInfo, are these 
changes needed?
* I don't think setAvailableResourcesToQueue is handled correctly. It appears 
to update partition metrics no matter which queue this method is invoked for. 
Thus for example on line 87 of TestPartitionQueueMetrics:
{noformat}checkResources(partitionSource, 0, 0, 0

[jira] [Commented] (YARN-6492) Generate queue metrics for each partition

2020-05-20 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112648#comment-17112648
 ] 

Jonathan Hung commented on YARN-6492:
-

Thank you [~maniraj...@gmail.com]. Looks fine at a high level. A few comments:
* We can change parentQueue in QueueMetrics.java to be Queue instead of 
AbstractCSQueue (to fix test cases)
* Right now we're concatenating QUEUE_METRICS keys as "partition + queuePath + 
userName"; can we change this to "partition + '.' + userName + '.' + queuePath"? 
In particular, the queuePath + userName part could cause conflicts (e.g. a queue 
named "root.auser" could collide with user metrics under queue "root.a" and 
username "user"; see the sketch after this list). I see a few places for this:
# PartitionQueueMetrics#constructor#parentMetricName
# PartitionQueueMetrics#getUserMetrics#metricName
# QueueMetrics#getUserMetrics#metricName
# QueueMetrics#getPartitionQueueMetrics#metricName
# Key for QueueMetrics#getPartitionMetrics could collide if the partition name 
is "root"
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add the metrics object to QUEUE_METRICS, since we're 
accessing user metrics via the user map (and not the QUEUE_METRICS map)
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add queue path to the key, since the users map is not 
static
* QueueMetrics#queueSource method does not seem to be used anywhere, can we 
delete it?
* How come we need a CSQueueMetrics#forQueue implementation? It looks the same 
as QueueMetrics#forQueue
* We shouldn't add capacity scheduler specific things in QueueInfo, are these 
changes needed?
* I don't think setAvailableResourcesToQueue is handled correctly. It appears 
to update partition metrics no matter which queue this method is invoked for. 
Thus for example on line 87 of TestPartitionQueueMetrics:
{noformat}checkResources(partitionSource, 0, 0, 0, 100 * GB, 100, 2 * GB, 2, 
2);{noformat}
should be
{noformat}checkResources(partitionSource, 0, 0, 0, 200 * GB, 200, 2 * GB, 2, 
2);{noformat}
Perhaps we should only update partition metrics in setAvailableResourcesToQueue 
if the queue is root?
* Delete {noformat}System.out.println(" final is " + 
parentQueueSource_X.toString());{noformat}
* Same in TestQueueMetrics, there should not be capacity scheduler specific 
logic here, can we remove these changes?
* On line 2539 of TestNodeLabelContainerAllocation, should
{noformat}assertEquals(2 * GB, queueAUserMetrics.getAvailableMB(), 
delta);{noformat}
be 
{noformat}assertEquals(1.5 * GB, queueAUserMetrics.getAvailableMB(), 
delta);{noformat}
?
* Do we need the tests after line 2551 on TestNodeLabelContainerAllocation? The 
stuff removed seems to be non-exclusive node label functionality (default 
partition node heartbeating, and checking queue metrics are correct), so we 
probably want to keep these tests.
* On line 2566, how is node1 getting 8 containers if queue A's max capacity is 
only 50% of 10GB = 5GB?
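
To make the key-collision point above concrete, here is a minimal, self-contained 
sketch (plain Java, not the actual QueueMetrics code; the helper names are made 
up) comparing the current concatenation with the proposed delimited ordering:
{noformat}
// Illustrative only -- not the real QueueMetrics key-building code.
public class MetricKeySketch {
  // queue-level key for comparison: partition + queuePath
  static String queueKey(String partition, String queuePath) {
    return partition + queuePath;
  }
  // current user-metrics key: partition + queuePath + userName
  static String oldUserKey(String partition, String queuePath, String user) {
    return partition + queuePath + user;
  }
  // proposed user-metrics key: partition + '.' + userName + '.' + queuePath
  static String newUserKey(String partition, String queuePath, String user) {
    return partition + "." + user + "." + queuePath;
  }
  public static void main(String[] args) {
    System.out.println(queueKey("x", "root.auser"));        // xroot.auser
    System.out.println(oldUserKey("x", "root.a", "user"));  // xroot.auser   <- collides
    System.out.println(newUserKey("x", "root.a", "user"));  // x.user.root.a <- distinct
  }
}
{noformat}
With the user name placed between delimiters and ahead of the queue path, it can 
no longer be read as a suffix of the queue path, so the two keys stay distinct.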

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, 
> YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, 
> YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, 
> partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10260) Allow transitioning queue from DRAINING to RUNNING state

2020-05-11 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17104940#comment-17104940
 ] 

Jonathan Hung commented on YARN-10260:
--

+1 looks fine to me. I'll commit this tomorrow if no objections.

> Allow transitioning queue from DRAINING to RUNNING state
> 
>
> Key: YARN-10260
> URL: https://issues.apache.org/jira/browse/YARN-10260
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10260.001.patch
>
>
> We found that in our cluster, a queue was erroneously stopped. The queue then 
> sits internally in the DRAINING state and cannot be moved back to the RUNNING 
> state until it has finished draining. For queues with large workloads, this can 
> block other apps from submitting to this queue for a long time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10263) Application summary is logged multiple times due to RM recovery

2020-05-11 Thread Jonathan Hung (Jira)
Jonathan Hung created YARN-10263:


 Summary: Application summary is logged multiple times due to RM 
recovery
 Key: YARN-10263
 URL: https://issues.apache.org/jira/browse/YARN-10263
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jonathan Hung


An app finishes and is logged to the RM app summary. After an RM restart, the 
recovered app is logged to the RM app summary again.

We would need some way of knowing, across restarts, whether an app has already 
been logged.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6492) Generate queue metrics for each partition

2020-05-11 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17104672#comment-17104672
 ] 

Jonathan Hung commented on YARN-6492:
-

IMO we should still have {noformat} "name" : 
"Hadoop:service=ResourceManager,name=QueueMetrics,q0=root,q1=a" ...{noformat} 
report queue metrics for default partition only. Users could also use 
{noformat}name=PartitionQueueMetrics,partition=default,q0=root{noformat} (or, 
{noformat}name=PartitionQueueMetrics,partition=,q0=root{noformat}) for default 
queue metrics, but if people are already using {noformat} "name" : 
"Hadoop:service=ResourceManager,name=QueueMetrics,q0=root,q1=a" ...{noformat}  
for default queue metrics (since this has already gone into many releases) I 
don't think we can justify breaking this behavior.

If we want to change this behavior so {noformat} "name" : 
"Hadoop:service=ResourceManager,name=QueueMetrics,q0=root,q1=a" ...{noformat} 
reports metrics for all partitions, as it was before YARN-6467, we can revisit 
that in a later JIRA. But I don't think we should do it here.
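
For anyone wanting to check which of these beans a running RM actually exposes, 
here is a minimal sketch using the standard JMX API. It assumes it runs inside 
the ResourceManager JVM (run anywhere else it simply prints nothing), and the 
PartitionQueueMetrics name pattern is the one proposed in this discussion, not an 
existing released bean:
{noformat}
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ListQueueMetricsBeans {
  public static void main(String[] args) throws Exception {
    MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
    // Existing per-queue metrics (default partition only, after YARN-6467):
    for (ObjectName name : mbs.queryNames(
        new ObjectName("Hadoop:service=ResourceManager,name=QueueMetrics,*"), null)) {
      System.out.println(name);
    }
    // Proposed per-partition metrics (naming pattern discussed above):
    for (ObjectName name : mbs.queryNames(
        new ObjectName("Hadoop:service=ResourceManager,name=PartitionQueueMetrics,*"), null)) {
      System.out.println(name);
    }
  }
}
{noformat}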

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, 
> YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, 
> YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6492) Generate queue metrics for each partition

2020-05-09 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103520#comment-17103520
 ] 

Jonathan Hung commented on YARN-6492:
-

Ok I see. This was not my original understanding. I assumed YARN-6467 was filed 
standalone, and I filed this ticket because I saw YARN-6467 would remove 
partitioned metrics. IMO, if multiple JIRAs are required for a feature to work 
properly, they shouldn't be committed separately.

In any case, YARN-6467 has already made its way into releases, so we have 
already broken compatibility. Hence, I think we should treat "original 
queuemetrics computation" as behavior *after* YARN-6467 (I don't want this JIRA 
to reverse the behavior from YARN-6467, thus breaking compatibility again).

[~maniraj...@gmail.com] [~epayne] let me know if this makes sense.

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, 
> YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, 
> YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition

2020-05-07 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101990#comment-17101990
 ] 

Jonathan Hung edited comment on YARN-6492 at 5/7/20, 7:28 PM:
--

[~maniraj...@gmail.com], thanks. Seems you missed uploading 
PartitionQueueMetrics class.

I definitely think we should address #2, #3, and #4 in this JIRA. I don't think 
#3 is addressed by YARN-9767. For example it edits the tests in the same way, 
i.e.  {noformat}assertEquals(10 * GB, 
leafQueueA.getMetrics().getAvailableMB());{noformat} is changed to 
{noformat}assertEquals(22 * GB, 
leafQueueA.getMetrics().getAvailableMB());{noformat}, but this assert should 
still be 0 GB, since the default partition has no resources. IMO the bottom 
line is that after this JIRA is committed, the existing QueueMetrics should 
still only contain metrics for default partition, and partitioned queue metrics 
should only be in the newly added metrics. It will get very confusing if we 
break this behavior in this JIRA and then patch it in another. What do you 
think?

Also, regarding your first point in YARN-9767 about non exclusive node labels, 
this issue seems to exist even before YARN-6492, so I think we can address this 
issue in YARN-9767.


was (Author: jhung):
[~maniraj...@gmail.com], thanks. Seems you missed uploading 
PartitionQueueMetrics class.

I definitely think we should address #2, #3, and #4 in this JIRA. I don't think 
#3 is addressed by YARN-9767. For example it edits the tests in the same way, 
i.e.  {noformat}assertEquals(10 * GB, 
leafQueueA.getMetrics().getAvailableMB());{noformat} is changed to 
{noformat}assertEquals(22 * GB, 
leafQueueA.getMetrics().getAvailableMB());{noformat}, but this assert should 
still be 0 GB, since the default partition has no resources. IMO the bottom 
line is that after this JIRA is committed, the existing QueueMetrics should 
still only contain metrics for default partition, and partitioned queue metrics 
should only be in the newly added metrics. What do you think?

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, 
> YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, 
> YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition

2020-05-07 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101990#comment-17101990
 ] 

Jonathan Hung edited comment on YARN-6492 at 5/7/20, 7:28 PM:
--

[~maniraj...@gmail.com], thanks. Seems you missed uploading 
PartitionQueueMetrics class.

I definitely think we should address #2, #3, and #4 in this JIRA. Also, I don't 
think #3 is addressed by YARN-9767. For example it edits the tests in the same 
way, i.e.  {noformat}assertEquals(10 * GB, 
leafQueueA.getMetrics().getAvailableMB());{noformat} is changed to 
{noformat}assertEquals(22 * GB, 
leafQueueA.getMetrics().getAvailableMB());{noformat}, but this assert should 
still be 0 GB, since the default partition has no resources. IMO the bottom 
line is that after this JIRA is committed, the existing QueueMetrics should 
still only contain metrics for default partition, and partitioned queue metrics 
should only be in the newly added metrics. It will get very confusing if we 
break this behavior in this JIRA and then patch it in another. What do you 
think?

Also, regarding your first point in YARN-9767 about non exclusive node labels, 
this issue seems to exist even before YARN-6492, so I think we can address this 
issue in YARN-9767.


was (Author: jhung):
[~maniraj...@gmail.com], thanks. Seems you missed uploading 
PartitionQueueMetrics class.

I definitely think we should address #2, #3, and #4 in this JIRA. I don't think 
#3 is addressed by YARN-9767. For example it edits the tests in the same way, 
i.e.  {noformat}assertEquals(10 * GB, 
leafQueueA.getMetrics().getAvailableMB());{noformat} is changed to 
{noformat}assertEquals(22 * GB, 
leafQueueA.getMetrics().getAvailableMB());{noformat}, but this assert should 
still be 0 GB, since the default partition has no resources. IMO the bottom 
line is that after this JIRA is committed, the existing QueueMetrics should 
still only contain metrics for default partition, and partitioned queue metrics 
should only be in the newly added metrics. It will get very confusing if we 
break this behavior in this JIRA and then patch it in another. What do you 
think?

Also, regarding your first point in YARN-9767 about non exclusive node labels, 
this issue seems to exist even before YARN-6492, so I think we can address this 
issue in YARN-9767.

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, 
> YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, 
> YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6492) Generate queue metrics for each partition

2020-05-07 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101990#comment-17101990
 ] 

Jonathan Hung commented on YARN-6492:
-

[~maniraj...@gmail.com], thanks. Seems you missed uploading 
PartitionQueueMetrics class.

I definitely think we should address #2, #3, and #4 in this JIRA. I don't think 
#3 is addressed by YARN-9767. For example it edits the tests in the same way, 
i.e.  {noformat}assertEquals(10 * GB, 
leafQueueA.getMetrics().getAvailableMB());{noformat} is changed to 
{noformat}assertEquals(22 * GB, 
leafQueueA.getMetrics().getAvailableMB());{noformat}, but this assert should 
still be 0 GB, since the default partition has no resources. IMO the bottom 
line is that after this JIRA is committed, the existing QueueMetrics should 
still only contain metrics for default partition, and partitioned queue metrics 
should only be in the newly added metrics. What do you think?

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, 
> YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, 
> YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10260) Allow transitioning queue from DRAINING to RUNNING state

2020-05-06 Thread Jonathan Hung (Jira)
Jonathan Hung created YARN-10260:


 Summary: Allow transitioning queue from DRAINING to RUNNING state
 Key: YARN-10260
 URL: https://issues.apache.org/jira/browse/YARN-10260
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jonathan Hung


We found that in our cluster, a queue was erroneously stopped. The queue then 
sits internally in the DRAINING state and cannot be moved back to the RUNNING 
state until it has finished draining. For queues with large workloads, this can 
block other apps from submitting to this queue for a long time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition

2020-05-05 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100369#comment-17100369
 ] 

Jonathan Hung edited comment on YARN-6492 at 5/6/20, 1:15 AM:
--

OK thanks [~maniraj...@gmail.com] for the explanation. Sorry for the long 
delay, took some time to grok the latest 007 patch.
* Can we rename getPartitionQueueMetrics to something different? My initial 
confusion was that getPartitionQueueMetrics for QueueMetrics and 
PartitionQueueMetrics serve different purposes...the former for queue*partition 
and the latter for partition only. It's especially confusing in the case of 
PartitionQueueMetrics#getPartitionQueueMetrics, since this has nothing to do 
with queues. We can update the comment for 
PartitionQueueMetrics#getPartitionQueueMetrics as well, it also says Partition 
* Queue.
* Mentioned this earlier, can we remove the {noformat}   if (parent != null) {
  parent.setAvailableResourcesToUser(partition, user, limit);
}{noformat}
check in QueueMetrics#setAvailableResourcesToUser?  I think it should be 
addressed here rather than YARN-9767.
* I don't think the asserts in TestNodeLabelContainerAllocation should change. 
leafQueue.getMetrics should return metrics for default partition. I think we 
still need to check in QueueMetrics#setAvailableResourcesToUser and 
QueueMetrics#setAvailableResourcesToQueue whether partition is null or empty 
string. (This will break updating partition queue metrics, so we need to find a 
way to distinguish whether we're updating default partition queue metrics or 
partitioned queue metrics within the 
setAvailableResourcesToUser/setAvailableResourcesToQueue function.)
* Mentioned before, can we update everywhere we're creating a new metricName 
for partition/user/queue metrics to use a delimiter? e.g. {noformat}String 
metricName = partition + this.queueName + userName;{noformat}. Otherwise 
there's a chance that these metric names could collide.


was (Author: jhung):
OK thanks [~maniraj...@gmail.com] for the explanation. Sorry for the long 
delay, took some time to grok the latest 007 patch.
* Can we rename getPartitionQueueMetrics to something different? My initial 
confusion was that getPartitionQueueMetrics for QueueMetrics and 
PartitionQueueMetrics serve different purposes...the former for queue*partition 
and the latter for partition only. It's especially confusing in the case of 
PartitionQueueMetrics#getPartitionQueueMetrics, since this has nothing to do 
with queues. We can update the comment for 
PartitionQueueMetrics#getPartitionQueueMetrics as well, it also says Partition 
* Queue.
* Mentioned this earlier, can we remove the {noformat}   if (parent != null) {
  parent.setAvailableResourcesToUser(partition, user, limit);
}{noformat}
check in QueueMetrics#setAvailableResourcesToUser?  I think it should be 
addressed here rather than YARN-9767.
* I don't think the asserts in TestNodeLabelContainerAllocation should change. 
leafQueue.getMetrics should return metrics for default partition. I think we 
still need to check in QueueMetrics#setAvailableResourcesToUser and 
QueueMetrics#setAvailableResourcesToQueue whether partition is null or empty 
string. (This will break updating partition queue metrics, so we need to find a 
way to distinguish whether we're updating default partition queue metrics or 
partitioned queue metrics.)
* Mentioned before, can we update everywhere we're creating a new metricName 
for partition/user/queue metrics to use a delimiter? e.g. {noformat}String 
metricName = partition + this.queueName + userName;{noformat}. Otherwise 
there's a chance that these metric names could collide.

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, 
> YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, 
> YARN-6492.007.WIP.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition

2020-05-05 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100369#comment-17100369
 ] 

Jonathan Hung edited comment on YARN-6492 at 5/6/20, 1:14 AM:
--

OK thanks [~maniraj...@gmail.com] for the explanation. Sorry for the long 
delay, took some time to grok the latest 007 patch.
* Can we rename getPartitionQueueMetrics to something different? My initial 
confusion was that getPartitionQueueMetrics for QueueMetrics and 
PartitionQueueMetrics serve different purposes...the former for queue*partition 
and the latter for partition only. It's especially confusing in the case of 
PartitionQueueMetrics#getPartitionQueueMetrics, since this has nothing to do 
with queues. We can update the comment for 
PartitionQueueMetrics#getPartitionQueueMetrics as well, it also says Partition 
* Queue.
* Mentioned this earlier, can we remove the {noformat}   if (parent != null) {
  parent.setAvailableResourcesToUser(partition, user, limit);
}{noformat}
check in QueueMetrics#setAvailableResourcesToUser?  I think it should be 
addressed here rather than YARN-9767.
* I don't think the asserts in TestNodeLabelContainerAllocation should change. 
leafQueue.getMetrics should return metrics for default partition. I think we 
still need to check in QueueMetrics#setAvailableResourcesToUser and 
QueueMetrics#setAvailableResourcesToQueue whether partition is null or empty 
string. (This will break updating partition queue metrics, so we need to find a 
way to distinguish whether we're updating default partition queue metrics or 
partitioned queue metrics.)
* Mentioned before, can we update everywhere we're creating a new metricName 
for partition/user/queue metrics to use a delimiter? e.g. {noformat}String 
metricName = partition + this.queueName + userName;{noformat}. Otherwise 
there's a chance that these metric names could collide.


was (Author: jhung):
OK thanks [~maniraj...@gmail.com] for the explanation. Sorry for the long 
delay, took some time to grok the latest 007 patch.
* Can we rename getPartitionQueueMetrics to something different? My initial 
confusion was that getPartitionQueueMetrics for QueueMetrics and 
PartitionQueueMetrics serve different purposes...the former for queue*partition 
and the latter for partition only. It's especially confusing in the case of 
PartitionQueueMetrics#getPartitionQueueMetrics, since this has nothing to do 
with queues. We can update the comment for 
PartitionQueueMetrics#getPartitionQueueMetrics as well, it also says Partition 
* Queue.
* Mentioned this earlier, can we remove the {noformat}   if (parent != null) {
  parent.setAvailableResourcesToUser(partition, user, limit);
}{noformat}
check in QueueMetrics#setAvailableResourcesToUser?  I think it should be 
addressed here rather than YARN-9767.
* I don't think the asserts in TestNodeLabelContainerAllocation should change. 
leafQueue.getMetrics should return metrics for default partition. I think we 
still need to check in QueueMetrics#setAvailableResourcesToUser and 
QueueMetrics#setAvailableResourcesToQueue whether partition is null or empty 
string. (This will break updating partition queue metrics, so we need to find a 
way to distinguish whether we're updating default partition queue metrics or 
partitioned queue metrics.)
* Mentioned before, can we update everywhere we're creating a new metricName 
for partition/user/queue metrics to use a delimiter? e.g. {noformat}String 
metricName = partition + this.queueName + userName;{noformat}. Otherwise 
there's a chance that these metric names could collide.

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, 
> YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, 
> YARN-6492.007.WIP.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For addit

[jira] [Commented] (YARN-6492) Generate queue metrics for each partition

2020-05-05 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100369#comment-17100369
 ] 

Jonathan Hung commented on YARN-6492:
-

OK thanks [~maniraj...@gmail.com] for the explanation. Sorry for the long 
delay, took some time to grok the latest 007 patch.
* Can we rename getPartitionQueueMetrics to something different? My initial 
confusion was that getPartitionQueueMetrics for QueueMetrics and 
PartitionQueueMetrics serve different purposes...the former for queue*partition 
and the latter for partition only. It's especially confusing in the case of 
PartitionQueueMetrics#getPartitionQueueMetrics, since this has nothing to do 
with queues. We can update the comment for 
PartitionQueueMetrics#getPartitionQueueMetrics as well, it also says Partition 
* Queue.
* Mentioned this earlier, can we remove the {noformat}   if (parent != null) {
  parent.setAvailableResourcesToUser(partition, user, limit);
}{noformat}
check in QueueMetrics#setAvailableResourcesToUser?  I think it should be 
addressed here rather than YARN-9767.
* I don't think the asserts in TestNodeLabelContainerAllocation should change. 
leafQueue.getMetrics should return metrics for the default partition. I think we 
still need to check in QueueMetrics#setAvailableResourcesToUser and 
QueueMetrics#setAvailableResourcesToQueue whether partition is null or the empty 
string. (This will break updating partition queue metrics, so we need to find a 
way to distinguish whether we're updating default partition queue metrics or 
partitioned queue metrics; see the sketch after this list.)
* Mentioned before, can we update everywhere we're creating a new metricName 
for partition/user/queue metrics to use a delimiter? e.g. {noformat}String 
metricName = partition + this.queueName + userName;{noformat}. Otherwise 
there's a chance that these metric names could collide.
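
A rough, self-contained sketch of the kind of guard discussed above; everything 
here (class and field names, the long-based resource values) is made up for 
illustration and is not the actual QueueMetrics/PartitionQueueMetrics code:
{noformat}
import java.util.HashMap;
import java.util.Map;

// Illustrative only: shows how default-partition updates could stay on the
// legacy (unpartitioned) metrics while labeled partitions get their own entry.
public class PartitionGuardSketch {
  private long defaultPartitionAvailableMB;
  private final Map<String, Long> partitionAvailableMB = new HashMap<>();

  void setAvailableResourcesToQueue(String partition, long availableMB) {
    if (partition == null || partition.isEmpty()) {
      // Default partition: update the existing queue metrics so
      // leafQueue.getMetrics() keeps reflecting the default partition only.
      defaultPartitionAvailableMB = availableMB;
    } else {
      // Labeled partition: update only that partition's metrics.
      partitionAvailableMB.put(partition, availableMB);
    }
  }

  public static void main(String[] args) {
    PartitionGuardSketch sketch = new PartitionGuardSketch();
    sketch.setAvailableResourcesToQueue("", 10 * 1024);   // default partition
    sketch.setAvailableResourcesToQueue("x", 22 * 1024);  // partition "x"
    System.out.println(sketch.defaultPartitionAvailableMB);    // 10240
    System.out.println(sketch.partitionAvailableMB.get("x"));  // 22528
  }
}
{noformat}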

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, 
> YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, 
> YARN-6492.007.WIP.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-6492) Generate queue metrics for each partition

2020-05-05 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung reassigned YARN-6492:
---

Assignee: Manikandan R  (was: Jonathan Hung)

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, 
> YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, 
> YARN-6492.007.WIP.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-6492) Generate queue metrics for each partition

2020-05-05 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung reassigned YARN-6492:
---

Assignee: Jonathan Hung  (was: Manikandan R)

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, 
> YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, 
> YARN-6492.007.WIP.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2020-04-30 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096901#comment-17096901
 ] 

Jonathan Hung commented on YARN-8193:
-

javadoc complains about AbstractYarnScheduler which this patch doesn't touch. 
Seems unrelated. I pushed [^YARN-8193-branch-2.10-001.patch] to branch-2.10

> YARN RM hangs abruptly (stops allocating resources) when running successive 
> applications.
> -
>
> Key: YARN-8193
> URL: https://issues.apache.org/jira/browse/YARN-8193
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Blocker
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8193-branch-2-001.patch, 
> YARN-8193-branch-2.10-001.patch, YARN-8193-branch-2.9.0-001.patch, 
> YARN-8193.001.patch, YARN-8193.002.patch
>
>
> When running massive queries successively, at some point the RM just hangs and 
> stops allocating resources. At the point the RM hangs, YARN throws a 
> NullPointerException at RegularContainerAllocator.getLocalityWaitFactor.
> There's sufficient space given to yarn.nodemanager.local-dirs (not a node 
> health issue, RM didn't report any node being unhealthy). There is no fixed 
> trigger for this (query or operation).
> This problem goes away on restarting ResourceManager. No NM restart is 
> required. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2020-04-30 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096807#comment-17096807
 ] 

Jonathan Hung commented on YARN-8193:
-

Hit this issue on 2.10.0 cluster. Reuploading patch

> YARN RM hangs abruptly (stops allocating resources) when running successive 
> applications.
> -
>
> Key: YARN-8193
> URL: https://issues.apache.org/jira/browse/YARN-8193
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Blocker
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8193-branch-2-001.patch, 
> YARN-8193-branch-2.10-001.patch, YARN-8193-branch-2.9.0-001.patch, 
> YARN-8193.001.patch, YARN-8193.002.patch
>
>
> When running massive queries successively, at some point the RM just hangs and 
> stops allocating resources. At the point the RM hangs, YARN throws a 
> NullPointerException at RegularContainerAllocator.getLocalityWaitFactor.
> There's sufficient space given to yarn.nodemanager.local-dirs (not a node 
> health issue, RM didn't report any node being unhealthy). There is no fixed 
> trigger for this (query or operation).
> This problem goes away on restarting ResourceManager. No NM restart is 
> required. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2020-04-30 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-8193:

Attachment: YARN-8193-branch-2.10-001.patch

> YARN RM hangs abruptly (stops allocating resources) when running successive 
> applications.
> -
>
> Key: YARN-8193
> URL: https://issues.apache.org/jira/browse/YARN-8193
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Blocker
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8193-branch-2-001.patch, 
> YARN-8193-branch-2.10-001.patch, YARN-8193-branch-2.9.0-001.patch, 
> YARN-8193.001.patch, YARN-8193.002.patch
>
>
> When running massive queries successively, at some point the RM just hangs and 
> stops allocating resources. At the point the RM hangs, YARN throws a 
> NullPointerException at RegularContainerAllocator.getLocalityWaitFactor.
> There's sufficient space given to yarn.nodemanager.local-dirs (not a node 
> health issue, RM didn't report any node being unhealthy). There is no fixed 
> trigger for this (query or operation).
> This problem goes away on restarting ResourceManager. No NM restart is 
> required. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2020-04-30 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096807#comment-17096807
 ] 

Jonathan Hung edited comment on YARN-8193 at 4/30/20, 5:32 PM:
---

Hit this issue on 2.10.0 cluster. Reuploading patch to trigger jenkins


was (Author: jhung):
Hit this issue on 2.10.0 cluster. Reuploading patch

> YARN RM hangs abruptly (stops allocating resources) when running successive 
> applications.
> -
>
> Key: YARN-8193
> URL: https://issues.apache.org/jira/browse/YARN-8193
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Blocker
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8193-branch-2-001.patch, 
> YARN-8193-branch-2.10-001.patch, YARN-8193-branch-2.9.0-001.patch, 
> YARN-8193.001.patch, YARN-8193.002.patch
>
>
> When running massive queries successively, at some point the RM just hangs and 
> stops allocating resources. At the point the RM hangs, YARN throws a 
> NullPointerException at RegularContainerAllocator.getLocalityWaitFactor.
> There's sufficient space given to yarn.nodemanager.local-dirs (not a node 
> health issue, RM didn't report any node being unhealthy). There is no fixed 
> trigger for this (query or operation).
> This problem goes away on restarting ResourceManager. No NM restart is 
> required. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8382) cgroup file leak in NM

2020-04-27 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-8382:

Fix Version/s: 2.10.1

> cgroup file leak in NM
> --
>
> Key: YARN-8382
> URL: https://issues.apache.org/jira/browse/YARN-8382
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
> Environment: we wrote a container with a shutdownHook that runs a piece of 
> code like "while(true) sleep(100)". When 
> *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* < 
> *yarn.nodemanager.sleep-delay-before-sigkill.ms*, the cgroup file leak 
> happens; when 
> *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* > 
> *yarn.nodemanager.sleep-delay-before-sigkill.ms*, the cgroup file is deleted 
> successfully.
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Major
> Fix For: 3.2.0, 3.1.1, 3.0.4, 2.10.1
>
> Attachments: YARN-8382-branch-2.8.3.001.patch, 
> YARN-8382-branch-2.8.3.002.patch, YARN-8382.001.patch, YARN-8382.002.patch
>
>
> As Jiandan said in YARN-6562, the NM may time out deleting a container's 
> cgroup files, with logs like below:
> org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: 
> Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_xxx, tried to 
> delete for 1000ms
>  
> We found one situation in which this happens: when we set 
> *yarn.nodemanager.sleep-delay-before-sigkill.ms* bigger than 
> *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms*, the 
> cgroup file leak happens.
>  
> One container process tree looks like the following graph:
> bash(16097)───java(16099)─┬─\{java}(16100) 
>                           ├─\{java}(16101) 
>                           ├─\{java}(16102)
>  
> When the NM kills a container, it sends kill -15 -pid to kill the container 
> process group. The bash process exits when it receives SIGTERM, but the java 
> process may do some work (shutdownHook etc.) and does not exit until it 
> receives SIGKILL. When the bash process exits, CgroupsLCEResourcesHandler 
> begins trying to delete the cgroup files. So when 
> *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* is 
> reached, the java processes may still be running and cgroup/tasks may still 
> not be empty, causing a cgroup file leak.
>  
> We add a condition that 
> *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* must be 
> bigger than *yarn.nodemanager.sleep-delay-before-sigkill.ms* to solve this 
> problem.
>  
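
The relationship between the two intervals can be sketched as below. The property
keys are the real YARN settings; the helper method and the default values shown
are assumptions for illustration, not the committed patch:
{noformat}
// Minimal sketch: make sure the cgroup delete timeout always outlives the
// SIGTERM-to-SIGKILL grace period, so the NM does not give up deleting the
// cgroup directory while the container's java processes are still exiting.
import org.apache.hadoop.conf.Configuration;

class CgroupTimeoutCheck {
  static long effectiveDeleteTimeoutMs(Configuration conf) {
    long deleteTimeoutMs = conf.getLong(
        "yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms", 1000L);
    long sigkillDelayMs = conf.getLong(
        "yarn.nodemanager.sleep-delay-before-sigkill.ms", 250L);
    // If the delete timeout would expire before SIGKILL is even delivered,
    // stretch it past the kill delay instead of leaking the cgroup files.
    return Math.max(deleteTimeoutMs, sigkillDelayMs + 1000L);
  }
}
{noformat}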



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8382) cgroup file leak in NM

2020-04-27 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17093935#comment-17093935
 ] 

Jonathan Hung commented on YARN-8382:
-

Pushed to branch-2.10

> cgroup file leak in NM
> --
>
> Key: YARN-8382
> URL: https://issues.apache.org/jira/browse/YARN-8382
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
> Environment: we wrote a container with a shutdownHook that runs a piece of 
> code like "while(true) sleep(100)". When 
> *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* < 
> *yarn.nodemanager.sleep-delay-before-sigkill.ms*, the cgroup file leak 
> happens; when 
> *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* > 
> *yarn.nodemanager.sleep-delay-before-sigkill.ms*, the cgroup file is deleted 
> successfully.
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Major
> Fix For: 3.2.0, 3.1.1, 3.0.4, 2.10.1
>
> Attachments: YARN-8382-branch-2.8.3.001.patch, 
> YARN-8382-branch-2.8.3.002.patch, YARN-8382.001.patch, YARN-8382.002.patch
>
>
> As Jiandan said in YARN-6562, the NM may time out deleting a container's 
> cgroup files, with logs like below:
> org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: 
> Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_xxx, tried to 
> delete for 1000ms
>  
> We found one situation in which this happens: when we set 
> *yarn.nodemanager.sleep-delay-before-sigkill.ms* bigger than 
> *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms*, the 
> cgroup file leak happens.
>  
> One container process tree looks like the following graph:
> bash(16097)───java(16099)─┬─\{java}(16100) 
>                           ├─\{java}(16101) 
>                           ├─\{java}(16102)
>  
> When the NM kills a container, it sends kill -15 -pid to kill the container 
> process group. The bash process exits when it receives SIGTERM, but the java 
> process may do some work (shutdownHook etc.) and does not exit until it 
> receives SIGKILL. When the bash process exits, CgroupsLCEResourcesHandler 
> begins trying to delete the cgroup files. So when 
> *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* is 
> reached, the java processes may still be running and cgroup/tasks may still 
> not be empty, causing a cgroup file leak.
>  
> We add a condition that 
> *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* must be 
> bigger than *yarn.nodemanager.sleep-delay-before-sigkill.ms* to solve this 
> problem.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9954) Configurable max application tags and max tag length

2020-04-16 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085196#comment-17085196
 ] 

Jonathan Hung commented on YARN-9954:
-

Thanks [~BilwaST], can you add some tests verifying that app submission fails 
if tags too long/too many tags/tags not ASCII?

Also seems like we need two patches, a trunk patch with these Evolving fields 
removed and a branch-3.3 patch with the fields deprecated?

> Configurable max application tags and max tag length
> 
>
> Key: YARN-9954
> URL: https://issues.apache.org/jira/browse/YARN-9954
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-9954-branch-3.3.patch, YARN-9954.001.patch
>
>
> Currently the max number of tags and the max tag length are hardcoded; they should be configurable
> {noformat}
> @Evolving
> public static final int APPLICATION_MAX_TAGS = 10;
> @Evolving
> public static final int APPLICATION_MAX_TAG_LENGTH = 100; {noformat}
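
A rough sketch of the configurable form. The property names are assumptions for
illustration (not necessarily the keys the patch introduces), and the fallbacks
keep today's hardcoded limits:
{noformat}
// Hypothetical property names; defaults preserve the current hardcoded limits.
// "conf" is the resource manager's Configuration instance.
int maxApplicationTags = conf.getInt(
    "yarn.resourcemanager.application.max-tags", 10);
int maxApplicationTagLength = conf.getInt(
    "yarn.resourcemanager.application.max-tag.length", 100);
{noformat}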



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9954) Configurable max application tags and max tag length

2020-04-14 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083602#comment-17083602
 ] 

Jonathan Hung edited comment on YARN-9954 at 4/14/20, 8:54 PM:
---

Thanks [~BilwaST], a few comments
* Let's change {noformat}/**Max size of application tags.*/{noformat} -> 
{noformat}/** Max number of application tags.*/{noformat}
* Also in yarn-default.xml, let's change {noformat}Max size of 
application tags {noformat} -> {noformat}Max number of application 
tags {noformat}
* Agree with Adam, let's wrap the IllegalArgumentExceptions as YarnException 
via RPCUtil.getRemoteException (rough sketch below)
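
A rough sketch of that wrapping. RPCUtil.getRemoteException is the existing YARN
helper; the validation method, parameters and messages are hypothetical:
{noformat}
import java.util.Set;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.ipc.RPCUtil;

void validateApplicationTags(Set<String> tags, int maxTags, int maxTagLength)
    throws YarnException {
  try {
    if (tags.size() > maxTags) {
      throw new IllegalArgumentException(
          "Too many application tags: " + tags.size() + ", max allowed: " + maxTags);
    }
    for (String tag : tags) {
      if (tag.length() > maxTagLength) {
        throw new IllegalArgumentException(
            "Tag is longer than " + maxTagLength + " characters: " + tag);
      }
    }
  } catch (IllegalArgumentException e) {
    // Surface the failure to the client as a YarnException rather than letting
    // the raw IllegalArgumentException escape the RPC layer.
    throw RPCUtil.getRemoteException(e);
  }
}
{noformat}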

Also, this jira will be useful to have in older minor versions. But we cannot 
remove @Evolving fields within a minor version. Shall we open a separate jira 
to remove these fields, and in this jira set DEFAULT_RM_APPLICATION_MAX_TAGS to 
APPLICATION_MAX_TAGS and set DEFAULT_RM_APPLICATION_MAX_TAG_LENGTH to 
APPLICATION_MAX_TAG_LENGTH ?


was (Author: jhung):
Thanks [~BilwaST], a few comments
* Let's change {noformat}/**Max size of application tags.*/{noformat} -> 
{noformat}/** Max number of application tags.*/{noformat}
* {{Max size of application tags }} -> {{Max number 
of application tags }}
* Agree with Adam, let's wrap the IllegalArgumentExceptions as YarnException 
via RPCUtil.getRemoteException

Also, this jira will be useful to have in older minor versions. But we cannot 
remove @Evolving fields within a minor version. Shall we open a separate jira 
to remove these fields, and in this jira set DEFAULT_RM_APPLICATION_MAX_TAGS to 
APPLICATION_MAX_TAGS and set DEFAULT_RM_APPLICATION_MAX_TAG_LENGTH to 
APPLICATION_MAX_TAG_LENGTH ?

> Configurable max application tags and max tag length
> 
>
> Key: YARN-9954
> URL: https://issues.apache.org/jira/browse/YARN-9954
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-9954.001.patch
>
>
> Currently the max number of tags and the max tag length are hardcoded; they should be configurable
> {noformat}
> @Evolving
> public static final int APPLICATION_MAX_TAGS = 10;
> @Evolving
> public static final int APPLICATION_MAX_TAG_LENGTH = 100; {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9954) Configurable max application tags and max tag length

2020-04-14 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083602#comment-17083602
 ] 

Jonathan Hung edited comment on YARN-9954 at 4/14/20, 8:53 PM:
---

Thanks [~BilwaST], a few comments
* Let's change {noformat}/**Max size of application tags.*/{noformat} -> 
{noformat}/** Max number of application tags.*/{noformat}
* {{Max size of application tags }} -> {{Max number 
of application tags }}
* Agree with Adam, let's wrap the IllegalArgumentExceptions as YarnException 
via RPCUtil.getRemoteException

Also, this jira will be useful to have in older minor versions. But we cannot 
remove @Evolving fields within a minor version. Shall we open a separate jira 
to remove these fields, and in this jira set DEFAULT_RM_APPLICATION_MAX_TAGS to 
APPLICATION_MAX_TAGS and set DEFAULT_RM_APPLICATION_MAX_TAG_LENGTH to 
APPLICATION_MAX_TAG_LENGTH ?


was (Author: jhung):
Thanks [~BilwaST], a few comments
* Let's change {{/** Max size of application tags.*/}} -> {{/** Max number of 
application tags.*/}}
* {{Max size of application tags }} -> {{Max number 
of application tags }}
* Agree with Adam, let's wrap the IllegalArgumentExceptions as YarnException 
via RPCUtil.getRemoteException

Also, this jira will be useful to have in older minor versions. But we cannot 
remove @Evolving fields within a minor version. Shall we open a separate jira 
to remove these fields, and in this jira set DEFAULT_RM_APPLICATION_MAX_TAGS to 
APPLICATION_MAX_TAGS and set DEFAULT_RM_APPLICATION_MAX_TAG_LENGTH to 
APPLICATION_MAX_TAG_LENGTH ?

> Configurable max application tags and max tag length
> 
>
> Key: YARN-9954
> URL: https://issues.apache.org/jira/browse/YARN-9954
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-9954.001.patch
>
>
> Currently the max number of tags and the max tag length are hardcoded; they should be configurable
> {noformat}
> @Evolving
> public static final int APPLICATION_MAX_TAGS = 10;
> @Evolving
> public static final int APPLICATION_MAX_TAG_LENGTH = 100; {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9954) Configurable max application tags and max tag length

2020-04-14 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083602#comment-17083602
 ] 

Jonathan Hung commented on YARN-9954:
-

Thanks [~BilwaST], a few comments
* Let's change {{/**Max size of application tags.*/}} -> {{/** Max number of 
application tags.*/}}
* {{Max size of application tags }} -> {{Max number 
of application tags }}
* Agree with Adam, let's wrap the IllegalArgumentExceptions as YarnException 
via RPCUtil.getRemoteException

Also, this jira will be useful to have in older minor versions. But we cannot 
remove @Evolving fields within a minor version. Shall we open a separate jira 
to remove these fields, and in this jira set DEFAULT_RM_APPLICATION_MAX_TAGS to 
APPLICATION_MAX_TAGS and set DEFAULT_RM_APPLICATION_MAX_TAG_LENGTH to 
APPLICATION_MAX_TAG_LENGTH ?

> Configurable max application tags and max tag length
> 
>
> Key: YARN-9954
> URL: https://issues.apache.org/jira/browse/YARN-9954
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-9954.001.patch
>
>
> Currently the max number of tags and the max tag length are hardcoded; they should be configurable
> {noformat}
> @Evolving
> public static final int APPLICATION_MAX_TAGS = 10;
> @Evolving
> public static final int APPLICATION_MAX_TAG_LENGTH = 100; {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9954) Configurable max application tags and max tag length

2020-04-14 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083602#comment-17083602
 ] 

Jonathan Hung edited comment on YARN-9954 at 4/14/20, 8:52 PM:
---

Thanks [~BilwaST], a few comments
* Let's change {{/** Max size of application tags.*/}} -> {{/** Max number of 
application tags.*/}}
* {{Max size of application tags }} -> {{Max number 
of application tags }}
* Agree with Adam, let's wrap the IllegalArgumentExceptions as YarnException 
via RPCUtil.getRemoteException

Also, this jira will be useful to have in older minor versions. But we cannot 
remove @Evolving fields within a minor version. Shall we open a separate jira 
to remove these fields, and in this jira set DEFAULT_RM_APPLICATION_MAX_TAGS to 
APPLICATION_MAX_TAGS and set DEFAULT_RM_APPLICATION_MAX_TAG_LENGTH to 
APPLICATION_MAX_TAG_LENGTH ?


was (Author: jhung):
Thanks [~BilwaST], a few comments
* Let's change {{/**Max size of application tags.*/}} -> {{/** Max number of 
application tags.*/}}
* {{Max size of application tags }} -> {{Max number 
of application tags }}
* Agree with Adam, let's wrap the IllegalArgumentExceptions as YarnException 
via RPCUtil.getRemoteException

Also, this jira will be useful to have in older minor versions. But we cannot 
remove @Evolving fields within a minor version. Shall we open a separate jira 
to remove these fields, and in this jira set DEFAULT_RM_APPLICATION_MAX_TAGS to 
APPLICATION_MAX_TAGS and set DEFAULT_RM_APPLICATION_MAX_TAG_LENGTH to 
APPLICATION_MAX_TAG_LENGTH ?

> Configurable max application tags and max tag length
> 
>
> Key: YARN-9954
> URL: https://issues.apache.org/jira/browse/YARN-9954
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-9954.001.patch
>
>
> Currently the max number of tags and the max tag length are hardcoded; they should be configurable
> {noformat}
> @Evolving
> public static final int APPLICATION_MAX_TAGS = 10;
> @Evolving
> public static final int APPLICATION_MAX_TAG_LENGTH = 100; {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10227) Pull YARN-8242 back to branch-2.10

2020-04-09 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079656#comment-17079656
 ] 

Jonathan Hung commented on YARN-10227:
--

Thanks Jim for fixing this. Belated +1 from me.

> Pull YARN-8242 back to branch-2.10
> --
>
> Key: YARN-10227
> URL: https://issues.apache.org/jira/browse/YARN-10227
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.10.0, 2.10.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 2.10.1
>
> Attachments: YARN-10227-branch-2.10.001.patch
>
>
> We have recently seen the nodemanager OOM issue reported in YARN-8242 during 
> a rolling upgrade.  Our code is currently based on branch-2.8, but we are in 
> the process of moving to 2.10.  I checked and YARN-8242 pulls back to 
> branch-2.10 pretty cleanly.  The only conflict was a minor one in 
> TestNMLeveldbStateStoreService.java.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10212) Create separate configuration for max global AM attempts

2020-04-08 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078598#comment-17078598
 ] 

Jonathan Hung edited comment on YARN-10212 at 4/8/20, 6:56 PM:
---

Thanks [~BilwaST], in general looks good, some minor style issues:
 * In TestResourceManager.java, can we change {{fail("Exception is expected 
because the global max attempts" +}} to {{fail("Exception is expected because 
AM max attempts" +}}
 * In YarnConfiguration.java: {{* an application,if unset by user.}} -> can we 
add a space after the comma
 * In yarn-default.xml, for the  comment for 
yarn.resourcemanager.am.max-attempts:
 * 
{noformat}
The maximum number of application attempts. Each application 
master can specify
its individual maximum number of application attempts via the API, but the
individual number cannot be more than the global upper bound.This value is 
being set
only if global max attempts is unset. The default number is set to 2, to
{noformat}
can we change this to

 * 
{noformat}
The default maximum number of application attempts, if unset by
the user. Each application master can specify its individual maximum number of 
application
attempts via the API, but the individual number cannot be more than the global 
upper bound in
yarn.resourcemanager.am.global.max-attempts. The default number is set to 2, 
to{noformat}


was (Author: jhung):
Thanks [~BilwaST], in general looks good, some minor style issues:
 * In TestResourceManager.java, can we change {{fail("Exception is expected 
because the global max attempts" +}} to {{fail("Exception is expected because 
AM max attempts" +}}
 * In YarnConfiguration.java: {{* an application,if unset by user.}} -> can we 
add a space after the comma
 * In yarn-default.xml, for the  comment for 
yarn.resourcemanager.am.max-attempts:
 * 
{noformat}
The maximum number of application attempts. Each application 
master can specify
its individual maximum number of application attempts via the API, but the
individual number cannot be more than the global upper bound.This value is 
being set
only if global max attempts is unset. The default number is set to 2, to
{noformat}
can we change this to
 * 
{noformat}
The default maximum number of application attempts, if unset by 
the user. Each application master can specify its individual maximum number of 
application attempts via the API, but the individual number cannot be more than 
the global upper bound in yarn.resourcemanager.am.global.max-attempts. The 
default number is set to 2, to{noformat}

> Create separate configuration for max global AM attempts
> 
>
> Key: YARN-10212
> URL: https://issues.apache.org/jira/browse/YARN-10212
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10212.001.patch, YARN-10212.002.patch, 
> YARN-10212.003.patch
>
>
> Right now user's default max AM attempts is set to the same as global max AM 
> attempts:
> {noformat}
> int globalMaxAppAttempts = conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS,
> YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS); {noformat}
> If we want to increase global max AM attempts, it will also increase the 
> default. So we should create a separate global AM max attempts config to 
> separate the two.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10212) Create separate configuration for max global AM attempts

2020-04-08 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078598#comment-17078598
 ] 

Jonathan Hung commented on YARN-10212:
--

Thanks [~BilwaST], in general looks good, some minor style issues:
 * In TestResourceManager.java, can we change {{fail("Exception is expected 
because the global max attempts" +}} to {{fail("Exception is expected because 
AM max attempts" +}}
 * In YarnConfiguration.java: {{* an application,if unset by user.}} -> can we 
add a space after the comma
 * In yarn-default.xml, for the  comment for 
yarn.resourcemanager.am.max-attempts:
 * 
{noformat}
The maximum number of application attempts. Each application 
master can specify
its individual maximum number of application attempts via the API, but the
individual number cannot be more than the global upper bound.This value is 
being set
only if global max attempts is unset. The default number is set to 2, to
{noformat}
can we change this to
 * 
{noformat}
The default maximum number of application attempts, if unset by 
the user. Each application master can specify its individual maximum number of 
application attempts via the API, but the individual number cannot be more than 
the global upper bound in yarn.resourcemanager.am.global.max-attempts. The 
default number is set to 2, to{noformat}

> Create separate configuration for max global AM attempts
> 
>
> Key: YARN-10212
> URL: https://issues.apache.org/jira/browse/YARN-10212
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10212.001.patch, YARN-10212.002.patch, 
> YARN-10212.003.patch
>
>
> Right now user's default max AM attempts is set to the same as global max AM 
> attempts:
> {noformat}
> int globalMaxAppAttempts = conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS,
> YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS); {noformat}
> If we want to increase global max AM attempts, it will also increase the 
> default. So we should create a separate global AM max attempts config to 
> separate the two.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10212) Create separate configuration for max global AM attempts

2020-04-07 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077515#comment-17077515
 ] 

Jonathan Hung commented on YARN-10212:
--

Thanks [~BilwaST]. A few comments:
 * The javadoc for RM_AM_MAX_ATTEMPTS, can we change it to something like "The 
maximum number of application attempts for an application, if unset by the user"
 * Can we change the new config to at least have "am"? e.g.  
yarn.resourcemanager.global.max-attempts to 
yarn.resourcemanager.am.global-max-attempts? 
 * In ResourceManager.java I think we should validate both RM_AM_MAX_ATTEMPTS 
and GLOBAL_RM_AM_MAX_ATTEMPTS (and change the message in the RuntimeExceptions 
accordingly)

 * In RMAppImpl, we need to split this case into two:
{noformat}
if (individualMaxAppAttempts <= 0 ||
individualMaxAppAttempts > globalMaxAppAttempts) {
  this.maxAppAttempts = globalMaxAppAttempts; {noformat}
If individualMaxAppAttempts <= 0, set this.maxAppAttempts to 
RM_AM_MAX_ATTEMPTS. If individualMaxAppAttempts > globalMaxAppAttempts, set 
this.maxAppAttempts to globalMaxAppAttempts (a rough sketch follows at the end 
of this list).

 * In the test case:
{noformat}
​ int[] rmAmMaxAttempts = new int[] { 8, 0 };{noformat}
I don't think 0 is a valid config for RM_AM_MAX_ATTEMPTS, can we set this to \{ 
8, 1 }?
 * Based on the above changes we will need to change the expected values in the 
test case from
{noformat}
int[][] expectedNums = new int[][]{
new int[]{ 9, 10, 10, 10 }, {noformat}
to 

 * 
{noformat}
int[][] expectedNums = new int[][]{
new int[]{ 9, 10, 10, 8 }, {noformat}
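
Rough sketch of the split suggested above (names are illustrative, not the
committed change):
{noformat}
// Fall back to the default limit when the client did not set a valid value,
// and clamp to the global upper bound when it asked for more.
static int resolveMaxAppAttempts(int individual, int defaultMaxAttempts,
    int globalMaxAttempts) {
  if (individual <= 0) {
    return defaultMaxAttempts;   // unset/invalid on the client side
  }
  if (individual > globalMaxAttempts) {
    return globalMaxAttempts;    // cap at the global upper bound
  }
  return individual;
}
{noformat}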

> Create separate configuration for max global AM attempts
> 
>
> Key: YARN-10212
> URL: https://issues.apache.org/jira/browse/YARN-10212
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10212.001.patch
>
>
> Right now user's default max AM attempts is set to the same as global max AM 
> attempts:
> {noformat}
> int globalMaxAppAttempts = conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS,
> YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS); {noformat}
> If we want to increase global max AM attempts, it will also increase the 
> default. So we should create a separate global AM max attempts config to 
> separate the two.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10212) Create separate configuration for max global AM attempts

2020-04-03 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17075044#comment-17075044
 ] 

Jonathan Hung commented on YARN-10212:
--

[~BilwaST] it's used in RMAppImpl when validating user's desired max app 
attempts:
{noformat}
    if (individualMaxAppAttempts <= 0 || 
        individualMaxAppAttempts > globalMaxAppAttempts) {
      this.maxAppAttempts = globalMaxAppAttempts;
      LOG.warn("The specific max attempts: " + individualMaxAppAttempts
          + " for application: " + applicationId.getId()
          + " is invalid, because it is out of the range [1, "
          + globalMaxAppAttempts + "]. Use the global max attempts instead.");
    } else {
      this.maxAppAttempts = individualMaxAppAttempts;
    }     {noformat}

> Create separate configuration for max global AM attempts
> 
>
> Key: YARN-10212
> URL: https://issues.apache.org/jira/browse/YARN-10212
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Bilwa S T
>Priority: Major
>
> Right now user's default max AM attempts is set to the same as global max AM 
> attempts:
> {noformat}
> int globalMaxAppAttempts = conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS,
> YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS); {noformat}
> If we want to increase global max AM attempts, it will also increase the 
> default. So we should create a separate global AM max attempts config to 
> separate the two.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10212) Create separate configuration for max global AM attempts

2020-04-03 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074847#comment-17074847
 ] 

Jonathan Hung edited comment on YARN-10212 at 4/3/20, 7:35 PM:
---

[~BilwaST] user's default max AM attempts is how many AM attempts they get if 
they don't set individualMaxAppAttempts on client side.

I am proposing adding a new configuration like GLOBAL_RM_AM_MAX_ATTEMPTS, and 
changing the code snippet above to something like: 
{noformat}
int globalMaxAppAttempts = 
conf.getInt(YarnConfiguration.GLOBAL_RM_AM_MAX_ATTEMPTS, 
conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, 
YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS));{noformat}
If GLOBAL_RM_AM_MAX_ATTEMPTS is unset, it will fall back to current behavior. 
But if GLOBAL_RM_AM_MAX_ATTEMPTS is set to something higher than 
RM_AM_MAX_ATTEMPTS, then if user does not set individualMaxAppAttempts on 
client side, their app's number of attempts will still use RM_AM_MAX_ATTEMPTS, 
but user can set individualMaxAppAttempts to RM_AM_MAX_ATTEMPTS < 
individualMaxAppAttempts <= GLOBAL_RM_AM_MAX_ATTEMPTS if they like.


was (Author: jhung):
[~BilwaST] user's default max AM attempts is how many AM attempts they get if 
they don't set individualMaxAppAttempts on client side.

I am proposing adding a new configuration like GLOBAL_RM_AM_MAX_ATTEMPTS, and 
changing the code snippet above to something like: 
{noformat}
int globalMaxAppAttempts = 
conf.getInt(YarnConfiguration.GLOBAL_RM_AM_MAX_ATTEMPTS, 
conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, 
YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS));{noformat}
If GLOBAL_RM_AM_MAX_ATTEMPTS is unset, it will fall back to current behavior. 
But if GLOBAL_RM_AM_MAX_ATTEMPTS is set to something higher (e.g. 4), then if 
user does not set individualMaxAppAttempts on client side, their app's number 
of attempts will still use RM_AM_MAX_ATTEMPTS, but user can set 
individualMaxAppAttempts to RM_AM_MAX_ATTEMPTS < individualMaxAppAttempts <= 
GLOBAL_RM_AM_MAX_ATTEMPTS if they like.

> Create separate configuration for max global AM attempts
> 
>
> Key: YARN-10212
> URL: https://issues.apache.org/jira/browse/YARN-10212
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Bilwa S T
>Priority: Major
>
> Right now user's default max AM attempts is set to the same as global max AM 
> attempts:
> {noformat}
> int globalMaxAppAttempts = conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS,
> YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS); {noformat}
> If we want to increase global max AM attempts, it will also increase the 
> default. So we should create a separate global AM max attempts config to 
> separate the two.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10212) Create separate configuration for max global AM attempts

2020-04-03 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074847#comment-17074847
 ] 

Jonathan Hung edited comment on YARN-10212 at 4/3/20, 7:34 PM:
---

[~BilwaST] user's default max AM attempts is how many AM attempts they get if 
they don't set individualMaxAppAttempts on client side.

I am proposing adding a new configuration like GLOBAL_RM_AM_MAX_ATTEMPTS, and 
changing the code snippet above to something like: 
{noformat}
int globalMaxAppAttempts = 
conf.getInt(YarnConfiguration.GLOBAL_RM_AM_MAX_ATTEMPTS, 
conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, 
YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS));{noformat}
If GLOBAL_RM_AM_MAX_ATTEMPTS is unset, it will fall back to current behavior. 
But if GLOBAL_RM_AM_MAX_ATTEMPTS is set to something higher (e.g. 4), then if 
user does not set individualMaxAppAttempts on client side, their app's number 
of attempts will still use RM_AM_MAX_ATTEMPTS, but user can set 
individualMaxAppAttempts to RM_AM_MAX_ATTEMPTS < individualMaxAppAttempts <= 
GLOBAL_RM_AM_MAX_ATTEMPTS if they like.


was (Author: jhung):
[~BilwaST] user's default max AM attempts is how many AM attempts they get if 
they don't set individualMaxAppAttempts on client side.

I am proposing changing the code snippet above to something like: 
{noformat}
int globalMaxAppAttempts = 
conf.getInt(YarnConfiguration.GLOBAL_RM_AM_MAX_ATTEMPTS, 
conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, 
YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS));{noformat}
If GLOBAL_RM_AM_MAX_ATTEMPTS is unset, it will fall back to current behavior. 
But if GLOBAL_RM_AM_MAX_ATTEMPTS is set to something higher (e.g. 4), then if 
user does not set individualMaxAppAttempts on client side, their app's number 
of attempts will still use RM_AM_MAX_ATTEMPTS, but user can set 
individualMaxAppAttempts to RM_AM_MAX_ATTEMPTS < individualMaxAppAttempts <= 
GLOBAL_RM_AM_MAX_ATTEMPTS if they like.

> Create separate configuration for max global AM attempts
> 
>
> Key: YARN-10212
> URL: https://issues.apache.org/jira/browse/YARN-10212
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Bilwa S T
>Priority: Major
>
> Right now user's default max AM attempts is set to the same as global max AM 
> attempts:
> {noformat}
> int globalMaxAppAttempts = conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS,
> YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS); {noformat}
> If we want to increase global max AM attempts, it will also increase the 
> default. So we should create a separate global AM max attempts config to 
> separate the two.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10212) Create separate configuration for max global AM attempts

2020-04-03 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074847#comment-17074847
 ] 

Jonathan Hung commented on YARN-10212:
--

[~BilwaST] user's default max AM attempts is how many AM attempts they get if 
they don't set individualMaxAppAttempts on client side.

I am proposing changing the code snippet above to something like: 
{noformat}
int globalMaxAppAttempts = 
conf.getInt(YarnConfiguration.GLOBAL_RM_AM_MAX_ATTEMPTS, 
conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, 
YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS));{noformat}
If GLOBAL_RM_AM_MAX_ATTEMPTS is unset, it will fall back to current behavior. 
But if GLOBAL_RM_AM_MAX_ATTEMPTS is set to something higher (e.g. 4), then if 
user does not set individualMaxAppAttempts on client side, their app's number 
of attempts will still use RM_AM_MAX_ATTEMPTS, but user can set 
individualMaxAppAttempts to RM_AM_MAX_ATTEMPTS < individualMaxAppAttempts <= 
GLOBAL_RM_AM_MAX_ATTEMPTS if they like.

> Create separate configuration for max global AM attempts
> 
>
> Key: YARN-10212
> URL: https://issues.apache.org/jira/browse/YARN-10212
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Bilwa S T
>Priority: Major
>
> Right now user's default max AM attempts is set to the same as global max AM 
> attempts:
> {noformat}
> int globalMaxAppAttempts = conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS,
> YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS); {noformat}
> If we want to increase global max AM attempts, it will also increase the 
> default. So we should create a separate global AM max attempts config to 
> separate the two.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10212) Create separate configuration for max global AM attempts

2020-03-30 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071249#comment-17071249
 ] 

Jonathan Hung commented on YARN-10212:
--

Hey [~BilwaST], do you plan to take on this task?

> Create separate configuration for max global AM attempts
> 
>
> Key: YARN-10212
> URL: https://issues.apache.org/jira/browse/YARN-10212
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Bilwa S T
>Priority: Major
>
> Right now user's default max AM attempts is set to the same as global max AM 
> attempts:
> {noformat}
> int globalMaxAppAttempts = conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS,
> YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS); {noformat}
> If we want to increase global max AM attempts, it will also increase the 
> default. So we should create a separate global AM max attempts config to 
> separate the two.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8213) Add Capacity Scheduler performance metrics

2020-03-27 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069084#comment-17069084
 ] 

Jonathan Hung commented on YARN-8213:
-

I ran the failed TestAbstractYarnScheduler#testContainerRecoveredByNode test 
locally and it succeeded.

> Add Capacity Scheduler performance metrics
> --
>
> Key: YARN-8213
> URL: https://issues.apache.org/jira/browse/YARN-8213
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, metrics
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8213-branch-2.10.001.patch, YARN-8213.001.patch, 
> YARN-8213.002.patch, YARN-8213.003.patch, YARN-8213.004.patch, 
> YARN-8213.005.patch
>
>
> Currently, tuning CS performance is not that straightforward because of a 
> lack of metrics. Right now we only have \{{QueueMetrics}}, which mostly 
> tracks queue-level resource counters. We propose to add CS metrics to 
> collect and display more fine-grained perf metrics.
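
A hedged sketch of the kind of fine-grained timing metric this proposes, using the
Hadoop metrics2 library. The class, registry and metric names are illustrative,
not the committed implementation:
{noformat}
import org.apache.hadoop.metrics2.lib.MetricsRegistry;
import org.apache.hadoop.metrics2.lib.MutableRate;

class CapacitySchedulerPerfMetricsSketch {
  private final MetricsRegistry registry =
      new MetricsRegistry("CapacitySchedulerPerf");
  // Tracks the average duration of one allocation attempt, in milliseconds.
  private final MutableRate allocateLatency = registry.newRate("AllocateLatencyMs");

  void recordAllocate(long startTimeNanos) {
    // Convert the elapsed time to milliseconds and feed the rate metric.
    allocateLatency.add((System.nanoTime() - startTimeNanos) / 1_000_000);
  }
}
{noformat}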



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8213) Add Capacity Scheduler performance metrics

2020-03-27 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069030#comment-17069030
 ] 

Jonathan Hung edited comment on YARN-8213 at 3/27/20, 8:33 PM:
---

Attached [^YARN-8213-branch-2.10.001.patch]. Diffs from trunk patch:
 * Set some variables as final in TestCapacitySchedulerMetrics.java
 * Replace lambdas with anonymous inner classes in 
TestCapacitySchedulerMetrics.java


was (Author: jhung):
Attached [^YARN-8213-branch-2.10.001.patch]. Diffs from trunk patch:
 * Set some variables as final in *TestCapacitySchedulerMetrics.java*
 * **Replace lambdas with anonymous inner classes in 
*TestCapacitySchedulerMetrics.java*

> Add Capacity Scheduler performance metrics
> --
>
> Key: YARN-8213
> URL: https://issues.apache.org/jira/browse/YARN-8213
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, metrics
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8213-branch-2.10.001.patch, YARN-8213.001.patch, 
> YARN-8213.002.patch, YARN-8213.003.patch, YARN-8213.004.patch, 
> YARN-8213.005.patch
>
>
> Currently, tuning CS performance is not that straightforward because of a 
> lack of metrics. Right now we only have \{{QueueMetrics}}, which mostly 
> tracks queue-level resource counters. We propose to add CS metrics to 
> collect and display more fine-grained perf metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-8213) Add Capacity Scheduler performance metrics

2020-03-27 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung reopened YARN-8213:
-

> Add Capacity Scheduler performance metrics
> --
>
> Key: YARN-8213
> URL: https://issues.apache.org/jira/browse/YARN-8213
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, metrics
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8213-branch-2.10.001.patch, YARN-8213.001.patch, 
> YARN-8213.002.patch, YARN-8213.003.patch, YARN-8213.004.patch, 
> YARN-8213.005.patch
>
>
> Currently, tuning CS performance is not that straightforward because of a 
> lack of metrics. Right now we only have \{{QueueMetrics}}, which mostly 
> tracks queue-level resource counters. We propose to add CS metrics to 
> collect and display more fine-grained perf metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8213) Add Capacity Scheduler performance metrics

2020-03-27 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-8213:

Attachment: YARN-8213-branch-2.10.001.patch

> Add Capacity Scheduler performance metrics
> --
>
> Key: YARN-8213
> URL: https://issues.apache.org/jira/browse/YARN-8213
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, metrics
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8213-branch-2.10.001.patch, YARN-8213.001.patch, 
> YARN-8213.002.patch, YARN-8213.003.patch, YARN-8213.004.patch, 
> YARN-8213.005.patch
>
>
> Currently, tuning CS performance is not that straightforward because of a 
> lack of metrics. Right now we only have \{{QueueMetrics}}, which mostly 
> tracks queue-level resource counters. We propose to add CS metrics to 
> collect and display more fine-grained perf metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10212) Create separate configuration for max global AM attempts

2020-03-27 Thread Jonathan Hung (Jira)
Jonathan Hung created YARN-10212:


 Summary: Create separate configuration for max global AM attempts
 Key: YARN-10212
 URL: https://issues.apache.org/jira/browse/YARN-10212
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jonathan Hung


Right now user's default max AM attempts is set to the same as global max AM 
attempts:
{noformat}
int globalMaxAppAttempts = conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS,
YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS); {noformat}
If we want to increase global max AM attempts, it will also increase the 
default. So we should create a separate global AM max attempts config to 
separate the two.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10200) Add number of containers to RMAppManager summary

2020-03-24 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066178#comment-17066178
 ] 

Jonathan Hung commented on YARN-10200:
--

Jenkins looks good, [~tangzhankun] mind having another look? Thanks!

> Add number of containers to RMAppManager summary
> 
>
> Key: YARN-10200
> URL: https://issues.apache.org/jira/browse/YARN-10200
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-10200.001.patch, YARN-10200.002.patch, 
> YARN-10200.003.patch
>
>
> It would be useful to persist this so we can track containers processed by RM.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10200) Add number of containers to RMAppManager summary

2020-03-24 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066051#comment-17066051
 ] 

Jonathan Hung commented on YARN-10200:
--

Thanks [~tangzhankun] for looking. Seems reasonable. Attached 
[^YARN-10200.003.patch] to address this.

> Add number of containers to RMAppManager summary
> 
>
> Key: YARN-10200
> URL: https://issues.apache.org/jira/browse/YARN-10200
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-10200.001.patch, YARN-10200.002.patch, 
> YARN-10200.003.patch
>
>
> It would be useful to persist this so we can track containers processed by RM.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10200) Add number of containers to RMAppManager summary

2020-03-24 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-10200:
-
Attachment: YARN-10200.003.patch

> Add number of containers to RMAppManager summary
> 
>
> Key: YARN-10200
> URL: https://issues.apache.org/jira/browse/YARN-10200
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-10200.001.patch, YARN-10200.002.patch, 
> YARN-10200.003.patch
>
>
> It would be useful to persist this so we can track containers processed by RM.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10200) Add number of containers to RMAppManager summary

2020-03-23 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065126#comment-17065126
 ] 

Jonathan Hung commented on YARN-10200:
--

Thanks Haibo, attached [^YARN-10200.002.patch] to fix checkstyle

> Add number of containers to RMAppManager summary
> 
>
> Key: YARN-10200
> URL: https://issues.apache.org/jira/browse/YARN-10200
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-10200.001.patch, YARN-10200.002.patch
>
>
> It would be useful to persist this so we can track containers processed by RM.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10200) Add number of containers to RMAppManager summary

2020-03-23 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-10200:
-
Attachment: YARN-10200.002.patch

> Add number of containers to RMAppManager summary
> 
>
> Key: YARN-10200
> URL: https://issues.apache.org/jira/browse/YARN-10200
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-10200.001.patch, YARN-10200.002.patch
>
>
> It would be useful to persist this so we can track containers processed by RM.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10192) CapacityScheduler stuck in loop rejecting allocation proposals

2020-03-18 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061976#comment-17061976
 ] 

Jonathan Hung commented on YARN-10192:
--

Thanks. Yeah Tao, agreed, we plan on turning DEBUG on for this class when we 
encounter this again.

[~epayne], we have patches on top of 2.10.0, but not YARN-10009. Looking at 
YARN-10009, seems it could be related. Thanks for the reference.

> CapacityScheduler stuck in loop rejecting allocation proposals
> --
>
> Key: YARN-10192
> URL: https://issues.apache.org/jira/browse/YARN-10192
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0
>Reporter: Jonathan Hung
>Priority: Major
>
> On a 2.10.0 cluster, we observed containers were being scheduled very slowly. 
> Based on logs, it seems to reject a bunch of allocation proposals, then 
> accept a bunch of reserved containers, but very few containers are actually 
> getting allocated:
> {noformat}
> 2020-03-10 06:31:48,965 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.30113637 absoluteUsedCapacity=0.30113637 used=<memory:..., vCores:..., yarn.io/gpu: 265> cluster=<...>
> 2020-03-10 06:31:48,965 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Failed to accept allocation proposal
> 2020-03-10 06:31:48,965 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: assignedContainer application attempt=appattempt_1582403122262_15460_01 container=null queue=misc_default clusterResource=<memory:..., vCores:34413, yarn.io/gpu: 1241> type=OFF_SWITCH requestedPartition=cpu
> 2020-03-10 06:31:48,965 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=misc usedCapacity=0.0031771248 absoluteUsedCapacity=3.1771246E-4 used=<...> cluster=<...>
> 2020-03-10 06:31:48,965 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.30113637 absoluteUsedCapacity=0.30113637 used=<memory:..., vCores:..., yarn.io/gpu: 265> cluster=<...>
> 2020-03-10 06:31:48,965 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Failed to accept allocation proposal
> 2020-03-10 06:31:48,968 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: assignedContainer application attempt=appattempt_1582403122262_15460_01 container=null queue=misc_default clusterResource=<memory:..., vCores:34413, yarn.io/gpu: 1241> type=OFF_SWITCH requestedPartition=cpu
> 2020-03-10 06:31:48,968 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=misc usedCapacity=0.0031771248 absoluteUsedCapacity=3.1771246E-4 used=<...> cluster=<...>
> 2020-03-10 06:31:48,968 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.30113637 absoluteUsedCapacity=0.30113637 used=<memory:..., vCores:..., yarn.io/gpu: 265> cluster=<...>
> 2020-03-10 06:31:48,968 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Failed to accept allocation proposal
> 2020-03-10 06:31:48,977 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: assignedContainer application attempt=appattempt_1582403122262_15460_01 container=null queue=misc_default clusterResource=<memory:..., vCores:34413, yarn.io/gpu: 1241> type=OFF_SWITCH requestedPartition=cpu
> 2020-03-10 06:31:48,977 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=misc usedCapacity=0.0031771248 absoluteUsedCapacity=3.1771246E-4 used=<...> cluster=<...>
> 2020-03-10 06:31:48,977 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.30113637 absoluteUsedCapacity=0.30113637 used=<memory:..., vCores:..., yarn.io/gpu: 265> cluster=<...>
> 2020-03-10 06:31:48,977 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Failed to accept allocation proposal
> 2020-03-10 06:31:48,981 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: assignedContainer application attempt=appattempt_1582403122262_15460_01 container=null queue=misc_default clusterResource=<memory:..., vCores:34413, yarn.io/gpu: 1241> type=OFF_SWITCH requestedPartition=cpu
> 2020-03-10 06:31:48,982 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=misc usedCapacity=0.0031771248 absoluteUsedCapacity=3.1771246E-4 used=

[jira] [Updated] (YARN-10200) Add number of containers to RMAppManager summary

2020-03-17 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-10200:
-
Attachment: YARN-10200.001.patch

> Add number of containers to RMAppManager summary
> 
>
> Key: YARN-10200
> URL: https://issues.apache.org/jira/browse/YARN-10200
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-10200.001.patch
>
>
> It would be useful to persist this so we can track the number of containers processed by the RM.
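
Aside from the attached patch (not reproduced here), a minimal, hypothetical sketch of 
what adding a container count to the key=value application-summary line could look 
like; the SummaryBuilder stand-in and the totalAllocatedContainers field name are 
assumptions, not necessarily what YARN-10200.001.patch does:
{noformat}
// Hypothetical sketch only: illustrates extending an RMAppManager-style
// key=value app-summary line with a per-app container count, so the total
// number of containers processed by the RM can be aggregated from the log.
public class AppSummarySketch {

  // Minimal stand-in for the summary-builder pattern used for app summaries.
  static final class SummaryBuilder {
    private final StringBuilder buffer = new StringBuilder();

    SummaryBuilder add(String key, Object value) {
      if (buffer.length() > 0) {
        buffer.append(',');
      }
      buffer.append(key).append('=').append(value);
      return this;
    }

    @Override
    public String toString() {
      return buffer.toString();
    }
  }

  public static void main(String[] args) {
    // "totalAllocatedContainers" is a hypothetical field name; the real patch
    // may name it differently and would read the value from the app's metrics.
    String summary = new SummaryBuilder()
        .add("appId", "application_1234567890123_0001")  // placeholder values
        .add("user", "someuser")
        .add("queue", "default")
        .add("totalAllocatedContainers", 12345)
        .toString();
    System.out.println(summary);
  }
}
{noformat}
Persisting the count in the summary line keeps it available after the application 
finishes, which matches the tracking goal in the description.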



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10200) Add number of containers to RMAppManager summary

2020-03-17 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung reassigned YARN-10200:


Assignee: Jonathan Hung

> Add number of containers to RMAppManager summary
> 
>
> Key: YARN-10200
> URL: https://issues.apache.org/jira/browse/YARN-10200
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>
> It would be useful to persist this so we can track the number of containers processed by the RM.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10200) Add number of containers to RMAppManager summary

2020-03-17 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061059#comment-17061059
 ] 

Jonathan Hung commented on YARN-10200:
--

Yeah, [~maniraj...@gmail.com], I think that makes sense.

> Add number of containers to RMAppManager summary
> 
>
> Key: YARN-10200
> URL: https://issues.apache.org/jira/browse/YARN-10200
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Priority: Major
>
> It would be useful to persist this so we can track the number of containers processed by the RM.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


