[jira] [Resolved] (YARN-8849) DynoYARN: A simulation and testing infrastructure for YARN clusters
[ https://issues.apache.org/jira/browse/YARN-8849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hung resolved YARN-8849.
---------------------------------
    Resolution: Fixed

FYI, we have open-sourced DynoYARN on GitHub: https://github.com/linkedin/dynoyarn

> DynoYARN: A simulation and testing infrastructure for YARN clusters
> -------------------------------------------------------------------
>
>         Key: YARN-8849
>         URL: https://issues.apache.org/jira/browse/YARN-8849
>     Project: Hadoop YARN
>  Issue Type: New Feature
>    Reporter: Arun Suresh
>    Assignee: Jonathan Hung
>    Priority: Major
>
> Traditionally, YARN workload simulation is performed using SLS (Scheduler Load Simulator), which is packaged with YARN. It essentially starts a full-fledged *ResourceManager*, but runs simulators for the *NodeManager* and the *ApplicationMaster* containers. These simulators are lightweight and run in a threadpool. The NM simulators do not open any external ports and send (in-process) heartbeats to the ResourceManager.
> There are a couple of drawbacks to using the SLS:
> * It can be difficult to simulate really large clusters without access to a very beefy box, since the NMs are launched as tasks in a threadpool and each NM has to send periodic heartbeats to the RM.
> * Certain features (like YARN-1011) require changes to the NodeManager. Aspects such as queuing and selectively killing containers have to be incorporated into the existing NM simulator, which might make the simulator heavyweight, since locking and synchronization become necessary.
> * Since the NM and AM are simulations, only the scheduler is faithfully tested; it is not really an end-to-end test of a cluster.
> Therefore, drawing inspiration from [Dynamometer|https://github.com/linkedin/dynamometer], we propose *DynoYARN*, a framework for a deployable YARN test cluster, with the following features:
> * The NM already has hooks to plug in a custom *ContainerExecutor* and *NodeResourceMonitor*. If we can plug in a custom *ContainersMonitorImpl* monitoring thread (and other modules like the LocalizationService), we can inject an executor that does not actually launch containers, along with node and container resource monitors that report synthetic, pre-specified utilization metrics back to the RM.
> * Since we are launching fake containers, we cannot run normal AM containers. We can therefore use *Unmanaged AM*s to launch synthetic jobs.
> Essentially, a test workflow would look like this:
> * Launch a DynoYARN cluster.
> * Use the Unmanaged AM feature to directly negotiate with the DynoYARN ResourceManager for container tokens.
> * Use the container tokens from the RM to directly ask the DynoYARN NodeManagers to start fake containers.
> * The DynoYARN NodeManagers will start the fake containers and report synthetically generated resource utilization for the containers to the DynoYARN ResourceManager (injected via the *ContainerLaunchContext* and parsed by the plugged-in container executor).
> * The scheduler will use the utilization report to schedule containers, so we will be able to test allocation of *Opportunistic* containers based on resource utilization.
> * Since the DynoYARN NodeManagers run the actual code paths, all preemption and queuing logic will be faithfully executed.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
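The fake-container idea in the description above can be illustrated with a toy model. This is a sketch only: `FakeContainerSim` and `SyntheticUtilization` are hypothetical stand-ins, not real YARN or DynoYARN classes, and in the real proposal the synthetic figures would be carried in the *ContainerLaunchContext* rather than passed directly.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the DynoYARN idea: an executor that
// "launches" a container without starting any process, and a monitor
// that reports pre-specified synthetic utilization instead of measuring it.
public class FakeContainerSim {
    static class SyntheticUtilization {
        final int memMb;
        final float cpuVcores;
        SyntheticUtilization(int memMb, float cpuVcores) {
            this.memMb = memMb;
            this.cpuVcores = cpuVcores;
        }
    }

    // containerId -> the utilization the fake container should pretend to use.
    private final Map<String, SyntheticUtilization> containers = new HashMap<>();

    // "Launch" = record the container; no process is actually started.
    public void launch(String containerId, int memMb, float cpuVcores) {
        containers.put(containerId, new SyntheticUtilization(memMb, cpuVcores));
    }

    // What the fake node monitor would report back to the RM on heartbeat.
    public int reportedMemMb() {
        return containers.values().stream().mapToInt(u -> u.memMb).sum();
    }
}
```

Because the NM-side heartbeat path and scheduler see these reports exactly as they would real ones, the scheduler's reaction to utilization can be tested without running real workloads.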
[jira] [Commented] (YARN-10697) Resources are displayed in bytes in UI for schedulers other than capacity
[ https://issues.apache.org/jira/browse/YARN-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304467#comment-17304467 ]

Jonathan Hung commented on YARN-10697:
--------------------------------------

[~Jim_Brennan] [~BilwaST] I agree, I don't think we should make the Resource#toString change. IMO users expect this to be bytes, and making this change could have some unintended consequences, e.g. breaking log-parsing tooling.

> Resources are displayed in bytes in UI for schedulers other than capacity
> -------------------------------------------------------------------------
>
>         Key: YARN-10697
>         URL: https://issues.apache.org/jira/browse/YARN-10697
>     Project: Hadoop YARN
>  Issue Type: Bug
>    Reporter: Bilwa S T
>    Assignee: Bilwa S T
>    Priority: Major
> Attachments: YARN-10697.001.patch, image-2021-03-17-11-30-57-216.png
>
> Resources.newInstance expects memory in MB, whereas MetricsOverviewTable passes resources in bytes. Also, we should display memory in GB for better readability.
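The display-side fix discussed above (keep the raw value in bytes, convert to GB only for the UI) can be sketched as follows. `MemoryDisplay` is a hypothetical helper for illustration, not actual YARN code.

```java
// Hypothetical helper illustrating the UI-side conversion: take a raw
// byte count (as reported in the node status update) and render it in GB.
public class MemoryDisplay {
    private static final long BYTES_PER_GB = 1024L * 1024L * 1024L;

    // Returns memory in GB, rounded to two decimal places for display.
    public static double toGb(long bytes) {
        return Math.round((double) bytes / BYTES_PER_GB * 100.0) / 100.0;
    }
}
```

Keeping the conversion at the rendering layer avoids the concern raised in the comment: the underlying Resource#toString (and anything parsing logs built on it) continues to see bytes.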
[jira] [Comment Edited] (YARN-10651) CapacityScheduler crashed with NPE in AbstractYarnScheduler.updateNodeResource()
[ https://issues.apache.org/jira/browse/YARN-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291305#comment-17291305 ]

Jonathan Hung edited comment on YARN-10651 at 2/26/21, 12:04 AM:
-----------------------------------------------------------------

+1 from me. I pushed this to trunk~branch-2.10. Thanks [~haibochen] for the contribution.

was (Author: jhung): I pushed this to trunk~branch-2.10. Thanks [~haibochen] for the contribution.

> CapacityScheduler crashed with NPE in AbstractYarnScheduler.updateNodeResource()
> --------------------------------------------------------------------------------
>
>              Key: YARN-10651
>              URL: https://issues.apache.org/jira/browse/YARN-10651
>          Project: Hadoop YARN
>       Issue Type: Bug
> Affects Versions: 2.10.0, 2.10.1
>         Reporter: Haibo Chen
>         Assignee: Haibo Chen
>         Priority: Major
>          Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>      Attachments: YARN-10651.00.patch, YARN-10651.01.patch, event_seq.jpg
>
> {code:java}
> 2021-02-24 17:07:39,798 FATAL org.apache.hadoop.yarn.event.EventDispatcher: Error in handling event type NODE_RESOURCE_UPDATE to the Event Dispatcher
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:809)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeAndQueueResource(CapacityScheduler.java:1116)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1505)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:154)
>     at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>     at java.lang.Thread.run(Thread.java:748)
> 2021-02-24 17:07:39,798 INFO org.apache.hadoop.yarn.event.EventDispatcher: Exiting, bbye..{code}
[jira] [Commented] (YARN-10651) CapacityScheduler crashed with NPE in AbstractYarnScheduler.updateNodeResource()
[ https://issues.apache.org/jira/browse/YARN-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291225#comment-17291225 ]

Jonathan Hung commented on YARN-10651:
--------------------------------------

Thanks [~haibochen] - should we add some logging in this case? Also, any way to reproduce this issue in a test?

> CapacityScheduler crashed with NPE in AbstractYarnScheduler.updateNodeResource()
> --------------------------------------------------------------------------------
>
>              Key: YARN-10651
>              URL: https://issues.apache.org/jira/browse/YARN-10651
>          Project: Hadoop YARN
>       Issue Type: Bug
> Affects Versions: 2.10.0, 2.10.1
>         Reporter: Haibo Chen
>         Assignee: Haibo Chen
>         Priority: Major
>      Attachments: YARN-10651.00.patch, event_seq.jpg
>
> {code:java}
> 2021-02-24 17:07:39,798 FATAL org.apache.hadoop.yarn.event.EventDispatcher: Error in handling event type NODE_RESOURCE_UPDATE to the Event Dispatcher
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:809)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeAndQueueResource(CapacityScheduler.java:1116)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1505)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:154)
>     at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>     at java.lang.Thread.run(Thread.java:748)
> 2021-02-24 17:07:39,798 INFO org.apache.hadoop.yarn.event.EventDispatcher: Exiting, bbye..{code}
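The stack trace above points at a node lookup that can return null when a NODE_RESOURCE_UPDATE event races with node removal. The defensive pattern being discussed (look the node up once, then log and drop a stale update rather than dereference null) can be sketched with hypothetical stand-in types; this is not the actual patch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a resource update may arrive for a node that was
// concurrently removed from the scheduler's node map, so check for null
// and log-and-return instead of crashing the event dispatcher.
public class NodeResourceUpdater {
    static class SchedulerNode {
        long memoryMb;
    }

    private final Map<String, SchedulerNode> nodes = new ConcurrentHashMap<>();

    public void addNode(String nodeId) {
        nodes.put(nodeId, new SchedulerNode());
    }

    public void removeNode(String nodeId) {
        nodes.remove(nodeId);
    }

    // Returns true if the update was applied, false if the node was gone.
    public boolean updateNodeResource(String nodeId, long newMemoryMb) {
        SchedulerNode node = nodes.get(nodeId);
        if (node == null) {
            // Instead of an NPE: record what happened and drop the stale update.
            System.err.println("Node " + nodeId
                + " not found; skipping resource update");
            return false;
        }
        node.memoryMb = newMemoryMb;
        return true;
    }
}
```

A log line here also answers the "should we add some logging in this case?" question: the dropped update becomes visible without taking down the dispatcher thread.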
[jira] [Comment Edited] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
[ https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17222309#comment-17222309 ]

Jonathan Hung edited comment on YARN-10467 at 10/28/20, 8:14 PM:
-----------------------------------------------------------------

I committed this to trunk/branch-3.3/branch-3.2/branch-3.1/branch-2.10. Thanks [~haibochen] for the contribution and [~Jim_Brennan] for the review.

was (Author: jhung): I committed this to trunk~branch-2.10. Thanks [~haibochen] for the contribution and [~Jim_Brennan] for the review.

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -------------------------------------------------------------------------
>
>              Key: YARN-10467
>              URL: https://issues.apache.org/jira/browse/YARN-10467
>          Project: Hadoop YARN
>       Issue Type: Bug
>       Components: resourcemanager
> Affects Versions: 2.10.0, 3.0.3, 3.2.1, 3.1.4
>         Reporter: Haibo Chen
>         Assignee: Haibo Chen
>         Priority: Major
>          Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>      Attachments: YARN-10467.00.patch, YARN-10467.01.patch, YARN-10467.02.patch, YARN-10467.branch-2.10.00.patch, YARN-10467.branch-2.10.01.patch, YARN-10467.branch-2.10.02.patch, YARN-10467.branch-2.10.03.patch
>
> In one of our recent heap analyses, we found that the majority of the heap is occupied by {{RMNodeImpl.completedContainers}}, which accounts for 19 GB out of 24.3 GB. There are over 86 million ContainerIdPBImpl objects; in contrast, there are only 161,601 RMContainerImpl objects, which represent the number of active containers the RM is still tracking. Inspecting some ContainerIdPBImpl objects, they belong to applications that have long finished. This indicates some sort of memory leak of ContainerIdPBImpl objects in RMNodeImpl.
>
> Right now, when a container is reported by an NM as completed, it is immediately added to RMNodeImpl.completedContainers and later cleaned up after the AM has been notified of its completion in the AM-RM heartbeat. The cleanup can be broken into a few steps.
> * Step 1: the completed container is first added to RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being added to {{RMNodeImpl.completedContainers}}).
> * Step 2: during the AM-RM heartbeat, the container is removed from RMAppAttemptImpl.justFinishedContainers and added to RMAppAttemptImpl.finishedContainersSentToAM.
> Once a completed container gets added to RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned up from {{RMNodeImpl.completedContainers}}.
>
> However, if the AM exits (regardless of failure or success) before some recently completed containers can be added to RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there won’t be any future AM-RM heartbeat to perform the aforementioned step 2. Hence, these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that AMs can decide to exit upon success of all their tasks without waiting for notification of the completion of every container, or the AM may just die suddenly (e.g. OOM). Spark and other frameworks may be similar.
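The two-step cleanup and the leak described above can be modeled with a toy class. The collections here are plain stand-ins for the RMNodeImpl and RMAppAttemptImpl fields, not the real implementations.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the cleanup path: completed containers sit in the
// node-side list until an AM-RM heartbeat acknowledges them (step 2). If
// the AM exits before that heartbeat, step 2 never runs for the remaining
// containers and their entries are never removed -- the leak this issue
// describes.
public class CompletedContainerModel {
    private final List<String> nodeCompleted = new ArrayList<>(); // RMNodeImpl side
    private final List<String> justFinished = new ArrayList<>();  // attempt side
    private final Set<String> sentToAm = new HashSet<>();

    // Step 1: container completes on both sides.
    public void containerCompleted(String id) {
        nodeCompleted.add(id);
        justFinished.add(id);
    }

    // Step 2: the AM-RM heartbeat moves entries to sentToAm, which
    // guarantees their eventual removal from the node-side list.
    public void amHeartbeat() {
        sentToAm.addAll(justFinished);
        justFinished.clear();
        nodeCompleted.removeIf(sentToAm::contains);
    }

    // Entries that will never be cleaned up if no further heartbeat occurs.
    public int leakedCount() {
        return nodeCompleted.size();
    }
}
```

Running the model with a heartbeat clears everything; completing containers after the last heartbeat (an AM that exited early) leaves them stranded, matching the heap-dump observation.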
[jira] [Commented] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
[ https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221747#comment-17221747 ]

Jonathan Hung commented on YARN-10467:
--------------------------------------

[~haibochen], thank you for the patch. It looks good. It looks like you added some extra files in placement/schema in [^YARN-10467.branch-2.10.01.patch], can we remove those? Other than that, +1.

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -------------------------------------------------------------------------
>
>              Key: YARN-10467
>              URL: https://issues.apache.org/jira/browse/YARN-10467
>          Project: Hadoop YARN
>       Issue Type: Bug
>       Components: resourcemanager
> Affects Versions: 2.10.0
>         Reporter: Haibo Chen
>         Assignee: Haibo Chen
>         Priority: Major
>          Fix For: 2.10.2
>      Attachments: YARN-10467.00.patch, YARN-10467.01.patch, YARN-10467.branch-2.10.00.patch, YARN-10467.branch-2.10.01.patch
>
> In one of our recent heap analyses, we found that the majority of the heap is occupied by {{RMNodeImpl.completedContainers}}, which accounts for 19 GB out of 24.3 GB. There are over 86 million ContainerIdPBImpl objects; in contrast, there are only 161,601 RMContainerImpl objects, which represent the number of active containers the RM is still tracking. Inspecting some ContainerIdPBImpl objects, they belong to applications that have long finished. This indicates some sort of memory leak of ContainerIdPBImpl objects in RMNodeImpl.
>
> Right now, when a container is reported by an NM as completed, it is immediately added to RMNodeImpl.completedContainers and later cleaned up after the AM has been notified of its completion in the AM-RM heartbeat. The cleanup can be broken into a few steps.
> * Step 1: the completed container is first added to RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being added to {{RMNodeImpl.completedContainers}}).
> * Step 2: during the AM-RM heartbeat, the container is removed from RMAppAttemptImpl.justFinishedContainers and added to RMAppAttemptImpl.finishedContainersSentToAM.
> Once a completed container gets added to RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned up from {{RMNodeImpl.completedContainers}}.
>
> However, if the AM exits (regardless of failure or success) before some recently completed containers can be added to RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there won’t be any future AM-RM heartbeat to perform the aforementioned step 2. Hence, these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that AMs can decide to exit upon success of all their tasks without waiting for notification of the completion of every container, or the AM may just die suddenly (e.g. OOM). Spark and other frameworks may be similar.
[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics
[ https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214939#comment-17214939 ]

Jonathan Hung commented on YARN-10450:
--------------------------------------

Thanks [~Jim_Brennan], that is fine with me.

> Add cpu and memory utilization per node and cluster-wide metrics
> ----------------------------------------------------------------
>
>              Key: YARN-10450
>              URL: https://issues.apache.org/jira/browse/YARN-10450
>          Project: Hadoop YARN
>       Issue Type: Improvement
>       Components: yarn
> Affects Versions: 3.3.1
>         Reporter: Jim Brennan
>         Assignee: Jim Brennan
>         Priority: Minor
>          Fix For: 3.4.0, 3.3.1
>      Attachments: NodesPage.png, YARN-10450-branch-2.10.003.patch, YARN-10450-branch-3.1.003.patch, YARN-10450-branch-3.2.003.patch, YARN-10450.001.patch, YARN-10450.002.patch, YARN-10450.003.patch
>
> Add metrics to show actual cpu and memory utilization for each node and aggregated for the entire cluster. This information is already passed from the NM to the RM in the node status update.
> We have been running with this internally for quite a while and found it useful to be able to quickly see the actual cpu/memory utilization on the node/cluster. It's especially useful if some form of overcommit is used.
[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics
[ https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212676#comment-17212676 ]

Jonathan Hung commented on YARN-10450:
--------------------------------------

[~Jim_Brennan], Physical Mem Used % makes sense to me. We also refer to this as "Memory Efficiency" internally.

> Add cpu and memory utilization per node and cluster-wide metrics
> ----------------------------------------------------------------
>
>              Key: YARN-10450
>              URL: https://issues.apache.org/jira/browse/YARN-10450
>          Project: Hadoop YARN
>       Issue Type: Improvement
>       Components: yarn
> Affects Versions: 3.3.1
>         Reporter: Jim Brennan
>         Assignee: Jim Brennan
>         Priority: Minor
>      Attachments: NodesPage.png, YARN-10450.001.patch, YARN-10450.002.patch
>
> Add metrics to show actual cpu and memory utilization for each node and aggregated for the entire cluster. This information is already passed from the NM to the RM in the node status update.
> We have been running with this internally for quite a while and found it useful to be able to quickly see the actual cpu/memory utilization on the node/cluster. It's especially useful if some form of overcommit is used.
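The "Physical Mem Used %" figure discussed above is just used-over-capacity, computed from the utilization the NM already reports. A minimal hedged sketch (hypothetical helper, not the actual YARN-10450 patch):

```java
// Hypothetical computation behind a "Physical Mem Used %" column: the NM
// status update already carries node utilization, so the UI only needs to
// express used memory as a percentage of the node's physical capacity.
public class UtilizationMetric {
    // Returns physical memory utilization as a whole-number percentage.
    public static int physicalMemUsedPercent(long usedMb, long capacityMb) {
        if (capacityMb <= 0) {
            return 0; // guard against unreported or zero capacity
        }
        return (int) (100.0 * usedMb / capacityMb);
    }
}
```

The same ratio, summed across nodes before dividing, gives the cluster-wide aggregate; that difference from allocated-resource metrics is what makes it useful under overcommit.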
[jira] [Updated] (YARN-8210) AMRMClient logging on every heartbeat to track updation of AM RM token causes too many log lines to be generated in AM logs
[ https://issues.apache.org/jira/browse/YARN-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hung updated YARN-8210:
--------------------------------
    Fix Version/s: (was: 2.10.2)
                   2.10.1

> AMRMClient logging on every heartbeat to track updation of AM RM token causes too many log lines to be generated in AM logs
> ---------------------------------------------------------------------------------------------------------------------------
>
>              Key: YARN-8210
>              URL: https://issues.apache.org/jira/browse/YARN-8210
>          Project: Hadoop YARN
>       Issue Type: Bug
>       Components: yarn
> Affects Versions: 2.9.0, 3.0.0-alpha1
>         Reporter: Suma Shivaprasad
>         Assignee: Suma Shivaprasad
>         Priority: Major
>          Fix For: 3.2.0, 3.1.1, 3.0.3, 2.10.1
>      Attachments: YARN-8210.1.patch
>
> YARN-4682 added logs to track when the AM RM token is updated, for debuggability purposes. However, this is printed on every heartbeat and could flood the AM logs whenever the RM's master key is rolled over, especially for a long-running AM. Hence proposing to remove this log line.
> As explained in https://issues.apache.org/jira/browse/YARN-3104?focusedCommentId=14298692&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14298692 , the AM-RM connection is not re-established, so the updated token in the client's UGI is never re-sent to the RPC server, and the RM continues to send the token each heartbeat since it cannot be sure whether the client really has the new token. Hence the log lines are printed on every heartbeat.
[jira] [Updated] (YARN-8210) AMRMClient logging on every heartbeat to track updation of AM RM token causes too many log lines to be generated in AM logs
[ https://issues.apache.org/jira/browse/YARN-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hung updated YARN-8210:
--------------------------------
    Fix Version/s: 2.10.2

Pushed to branch-2.10.

> AMRMClient logging on every heartbeat to track updation of AM RM token causes too many log lines to be generated in AM logs
> ---------------------------------------------------------------------------------------------------------------------------
>
>              Key: YARN-8210
>              URL: https://issues.apache.org/jira/browse/YARN-8210
>          Project: Hadoop YARN
>       Issue Type: Bug
>       Components: yarn
> Affects Versions: 2.9.0, 3.0.0-alpha1
>         Reporter: Suma Shivaprasad
>         Assignee: Suma Shivaprasad
>         Priority: Major
>          Fix For: 3.2.0, 3.1.1, 3.0.3, 2.10.2
>      Attachments: YARN-8210.1.patch
>
> YARN-4682 added logs to track when the AM RM token is updated, for debuggability purposes. However, this is printed on every heartbeat and could flood the AM logs whenever the RM's master key is rolled over, especially for a long-running AM. Hence proposing to remove this log line.
> As explained in https://issues.apache.org/jira/browse/YARN-3104?focusedCommentId=14298692&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14298692 , the AM-RM connection is not re-established, so the updated token in the client's UGI is never re-sent to the RPC server, and the RM continues to send the token each heartbeat since it cannot be sure whether the client really has the new token. Hence the log lines are printed on every heartbeat.
[jira] [Commented] (YARN-10251) Show extended resources on legacy RM UI.
[ https://issues.apache.org/jira/browse/YARN-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173478#comment-17173478 ]

Jonathan Hung commented on YARN-10251:
--------------------------------------

Unit test failures aren't related. [^YARN-10251.branch-3.2.007.patch], [^YARN-10251.branch-2.10.007.patch], [^YARN-10251.007.patch] lgtm. I'll commit by EOD.

> Show extended resources on legacy RM UI.
> ----------------------------------------
>
>              Key: YARN-10251
>              URL: https://issues.apache.org/jira/browse/YARN-10251
>          Project: Hadoop YARN
>       Issue Type: Improvement
>         Reporter: Eric Payne
>         Assignee: Eric Payne
>         Priority: Major
>      Attachments: Legacy RM UI With Not All Resources Shown.png, Updated NodesPage UI With GPU columns.png, Updated RM UI With All Resources Shown.png.png, YARN-10251.003.patch, YARN-10251.004.patch, YARN-10251.005.patch, YARN-10251.006.patch, YARN-10251.007.patch, YARN-10251.branch-2.10.001.patch, YARN-10251.branch-2.10.002.patch, YARN-10251.branch-2.10.003.patch, YARN-10251.branch-2.10.005.patch, YARN-10251.branch-2.10.006.patch, YARN-10251.branch-2.10.007.patch, YARN-10251.branch-3.2.004.patch, YARN-10251.branch-3.2.005.patch, YARN-10251.branch-3.2.006.patch, YARN-10251.branch-3.2.007.patch
>
> It would be great to update the legacy RM UI to include GPU resources in the overview and in the per-app sections.
[jira] [Commented] (YARN-10343) Legacy RM UI should include labeled metrics for allocated, total, and reserved resources.
[ https://issues.apache.org/jira/browse/YARN-10343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166685#comment-17166685 ]

Jonathan Hung commented on YARN-10343:
--------------------------------------

[^YARN-10343.branch-3.2.001.patch], [^YARN-10343.branch-2.10.001.patch] LGTM. Timeline test failures related to YARN-9338. FSSchedulerConfigurationStore failures related to YARN-9875. TestZKConfigurationStore fails locally pre-patch for me.

Committed to trunk~branch-2.10. Thanks [~epayne] for the contribution and [~Jim_Brennan] for the review.

> Legacy RM UI should include labeled metrics for allocated, total, and reserved resources.
> -----------------------------------------------------------------------------------------
>
>              Key: YARN-10343
>              URL: https://issues.apache.org/jira/browse/YARN-10343
>          Project: Hadoop YARN
>       Issue Type: Improvement
> Affects Versions: 2.10.0, 3.2.1, 3.1.3
>         Reporter: Eric Payne
>         Assignee: Eric Payne
>         Priority: Major
>      Attachments: Screen Shot 2020-07-07 at 1.00.22 PM.png, Screen Shot 2020-07-07 at 1.03.26 PM.png, YARN-10343.000.patch, YARN-10343.001.patch, YARN-10343.branch-2.10.001.patch, YARN-10343.branch-3.2.001.patch
>
> The current legacy RM UI only includes resource metrics for the default partition. If a cluster has labeled nodes, those are not included in the resource metrics for allocated, total, and reserved resources.
[jira] [Commented] (YARN-10343) Legacy RM UI should include labeled metrics for allocated, total, and reserved resources.
[ https://issues.apache.org/jira/browse/YARN-10343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17160242#comment-17160242 ]

Jonathan Hung commented on YARN-10343:
--------------------------------------

+1 for [^YARN-10343.001.patch]. Test failure looks related to YARN-9333.

> Legacy RM UI should include labeled metrics for allocated, total, and reserved resources.
> -----------------------------------------------------------------------------------------
>
>              Key: YARN-10343
>              URL: https://issues.apache.org/jira/browse/YARN-10343
>          Project: Hadoop YARN
>       Issue Type: Improvement
> Affects Versions: 2.10.0, 3.2.1, 3.1.3
>         Reporter: Eric Payne
>         Assignee: Eric Payne
>         Priority: Major
>      Attachments: Screen Shot 2020-07-07 at 1.00.22 PM.png, Screen Shot 2020-07-07 at 1.03.26 PM.png, YARN-10343.000.patch, YARN-10343.001.patch
>
> The current legacy RM UI only includes resource metrics for the default partition. If a cluster has labeled nodes, those are not included in the resource metrics for allocated, total, and reserved resources.
[jira] [Commented] (YARN-10263) Application summary is logged multiple times due to RM recovery
[ https://issues.apache.org/jira/browse/YARN-10263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157748#comment-17157748 ]

Jonathan Hung commented on YARN-10263:
--------------------------------------

My understanding is that the sequence looks like: (1) app finishes -> (2) app saved to state store -> (3) app summary is logged

I think if we only check that an app is recovered or not, we will miss some apps if the RM is restarted between states 2 and 3. We would somehow need to tell the state store whether something has been logged, but it seems a bit overkill to add a new event for this. Any other thoughts?

> Application summary is logged multiple times due to RM recovery
> ---------------------------------------------------------------
>
>              Key: YARN-10263
>              URL: https://issues.apache.org/jira/browse/YARN-10263
>          Project: Hadoop YARN
>       Issue Type: Improvement
>         Reporter: Jonathan Hung
>         Assignee: Bilwa S T
>         Priority: Major
>
> App finishes, and is logged to RM app summary. Restart RM. Then this app is logged to RM app summary again.
> We would somehow need to know cross-restart whether an app has been logged or not.
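The option weighed in the comment above (persist a "summary already logged" flag with the app so recovery neither re-logs nor skips) can be sketched with stand-in collections. This is purely illustrative; the comment itself notes a new state-store event may be overkill, and none of these names are real YARN APIs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: the state store records per-app whether the summary
// was logged (step 3), so a restart between step 2 (saved) and step 3
// logs the app exactly once, and an already-logged app is not repeated.
public class AppSummaryLogger {
    private final Map<String, Boolean> stateStore = new HashMap<>(); // appId -> logged?
    private final List<String> summaryLog = new ArrayList<>();       // stand-in log output

    // Normal path: step 2 (save) then step 3 (log).
    public void appFinished(String appId) {
        stateStore.put(appId, false);
        logSummary(appId);
    }

    // Models an RM crash between step 2 and step 3.
    public void appSavedButNotLogged(String appId) {
        stateStore.put(appId, false);
    }

    // On recovery, re-drive step 3 only for apps never logged.
    public void recover() {
        for (String appId : new ArrayList<>(stateStore.keySet())) {
            if (!stateStore.get(appId)) {
                logSummary(appId);
            }
        }
    }

    private void logSummary(String appId) {
        summaryLog.add(appId);
        stateStore.put(appId, true); // persist the flag alongside the app
    }

    public int timesLogged(String appId) {
        return Collections.frequency(summaryLog, appId);
    }
}
```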
[jira] [Commented] (YARN-10343) Legacy RM UI should include labeled metrics for allocated, total, and reserved resources.
[ https://issues.apache.org/jira/browse/YARN-10343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156970#comment-17156970 ]

Jonathan Hung commented on YARN-10343:
--------------------------------------

Thanks [~epayne], generally this looks fine. I think the running-containers case is a bit tricky; I couldn't find an API for that either. For getAllUsed, where did you see that it includes reserved resources? I couldn't find that.

> Legacy RM UI should include labeled metrics for allocated, total, and reserved resources.
> -----------------------------------------------------------------------------------------
>
>              Key: YARN-10343
>              URL: https://issues.apache.org/jira/browse/YARN-10343
>          Project: Hadoop YARN
>       Issue Type: Improvement
> Affects Versions: 2.10.0, 3.2.1, 3.1.3
>         Reporter: Eric Payne
>         Assignee: Eric Payne
>         Priority: Major
>      Attachments: Screen Shot 2020-07-07 at 1.00.22 PM.png, Screen Shot 2020-07-07 at 1.03.26 PM.png, YARN-10343.000.patch
>
> The current legacy RM UI only includes resource metrics for the default partition. If a cluster has labeled nodes, those are not included in the resource metrics for allocated, total, and reserved resources.
[jira] [Commented] (YARN-10251) Show extended resources on legacy RM UI.
[ https://issues.apache.org/jira/browse/YARN-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156889#comment-17156889 ]

Jonathan Hung commented on YARN-10251:
--------------------------------------

Hey [~epayne], do you plan to post a follow-up patch regarding what we discussed above (making used/total/reserved report only the default partition)?

> Show extended resources on legacy RM UI.
> ----------------------------------------
>
>              Key: YARN-10251
>              URL: https://issues.apache.org/jira/browse/YARN-10251
>          Project: Hadoop YARN
>       Issue Type: Improvement
>         Reporter: Eric Payne
>         Assignee: Eric Payne
>         Priority: Major
>      Attachments: Legacy RM UI With Not All Resources Shown.png, Updated NodesPage UI With GPU columns.png, Updated RM UI With All Resources Shown.png.png, YARN-10251.003.patch, YARN-10251.004.patch, YARN-10251.005.patch, YARN-10251.006.patch, YARN-10251.branch-2.10.001.patch, YARN-10251.branch-2.10.002.patch, YARN-10251.branch-2.10.003.patch, YARN-10251.branch-2.10.005.patch, YARN-10251.branch-2.10.006.patch, YARN-10251.branch-3.2.004.patch, YARN-10251.branch-3.2.005.patch, YARN-10251.branch-3.2.006.patch
>
> It would be great to update the legacy RM UI to include GPU resources in the overview and in the per-app sections.
[jira] [Commented] (YARN-10251) Show extended resources on legacy RM UI.
[ https://issues.apache.org/jira/browse/YARN-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150555#comment-17150555 ]

Jonathan Hung commented on YARN-10251:
--------------------------------------

I see. Can you file a follow-up jira for this? I think we should at least have it consistent. In this jira we can leave it as the default partition for used, total, and reserved. The follow-up jira can change used, total, and reserved to count all partitions.

> Show extended resources on legacy RM UI.
> ----------------------------------------
>
>              Key: YARN-10251
>              URL: https://issues.apache.org/jira/browse/YARN-10251
>          Project: Hadoop YARN
>       Issue Type: Improvement
>         Reporter: Eric Payne
>         Assignee: Eric Payne
>         Priority: Major
>      Attachments: Legacy RM UI With Not All Resources Shown.png, Updated NodesPage UI With GPU columns.png, Updated RM UI With All Resources Shown.png.png, YARN-10251.003.patch, YARN-10251.004.patch, YARN-10251.005.patch, YARN-10251.006.patch, YARN-10251.branch-2.10.001.patch, YARN-10251.branch-2.10.002.patch, YARN-10251.branch-2.10.003.patch, YARN-10251.branch-2.10.005.patch, YARN-10251.branch-2.10.006.patch, YARN-10251.branch-3.2.004.patch, YARN-10251.branch-3.2.005.patch, YARN-10251.branch-3.2.006.patch
>
> It would be great to update the legacy RM UI to include GPU resources in the overview and in the per-app sections.
[jira] [Commented] (YARN-10251) Show extended resources on legacy RM UI.
[ https://issues.apache.org/jira/browse/YARN-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149697#comment-17149697 ] Jonathan Hung commented on YARN-10251: -- [~epayne] thanks, generally 006 looks good to me, but have a question: {noformat}totalReservedResourcesAcrossPartition = new ResourceInfo( cs.getClusterResourceUsage().getReserved());{noformat} This seems to fetch reserved resources for the default partition only? Should we change it to fetch across partitions, like we do for usedResources?
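For illustration, the asymmetry discussed above (used resources aggregated across all partitions, reserved resources read from the default partition only) can be sketched with a toy model. All names here (ReservedAcrossPartitions, reservedByPartition, and the two methods) are hypothetical stand-ins, not YARN's actual CapacityScheduler/ResourceUsage API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of per-partition reserved-resource accounting (memory in MB).
// The empty string "" denotes the default partition, as with YARN node labels.
public class ReservedAcrossPartitions {
    static final Map<String, Long> reservedByPartition = new LinkedHashMap<>();

    // Analogue of cs.getClusterResourceUsage().getReserved():
    // reads the default partition only.
    static long reservedDefaultOnly() {
        return reservedByPartition.getOrDefault("", 0L);
    }

    // What the comment proposes: sum over every partition,
    // mirroring how usedResources is already computed.
    static long reservedAllPartitions() {
        long total = 0L;
        for (long r : reservedByPartition.values()) {
            total += r;
        }
        return total;
    }

    public static void main(String[] args) {
        reservedByPartition.put("", 4096L);    // default partition
        reservedByPartition.put("gpu", 8192L); // labeled partition
        System.out.println(reservedDefaultOnly());   // prints 4096
        System.out.println(reservedAllPartitions()); // prints 12288
    }
}
```

The two numbers diverge exactly when containers are reserved on non-default partitions, which is why reading only the default partition would under-report reserved resources on a cluster with labeled nodes.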
[jira] [Updated] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-6492: Fix Version/s: 2.10.1 2.9.3 > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Fix For: 2.9.3, 3.2.2, 2.10.1, 3.4.0, 3.3.1, 3.1.5 > > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.10.019.patch, > YARN-6492-branch-2.8.014.patch, YARN-6492-branch-2.9.015.patch, > YARN-6492-branch-3.1.018.patch, YARN-6492-branch-3.2.017.patch, > YARN-6492-junits.patch, YARN-6492.001.patch, YARN-6492.002.patch, > YARN-6492.003.patch, YARN-6492.004.patch, YARN-6492.005.WIP.patch, > YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, > YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, YARN-6492.011.WIP.patch, > YARN-6492.012.WIP.patch, YARN-6492.013.patch, partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10297) TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails intermittently
[ https://issues.apache.org/jira/browse/YARN-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120351#comment-17120351 ] Jonathan Hung commented on YARN-10297: -- Thanks. TestContinuousScheduling stops rm on teardown. Also I tried stopping rm in the setup method, and it still failed with the same exception somehow. > TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails > intermittently > --- > > Key: YARN-10297 > URL: https://issues.apache.org/jira/browse/YARN-10297 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jonathan Hung >Priority: Major > > After YARN-6492, testFairSchedulerContinuousSchedulingInitTime fails > intermittently when running {{mvn test -Dtest=TestContinuousScheduling}} > {noformat}[INFO] Running > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling > [ERROR] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.682 > s <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling > [ERROR] > testFairSchedulerContinuousSchedulingInitTime(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling) > Time elapsed: 0.194 s <<< ERROR! > org.apache.hadoop.metrics2.MetricsException: Metrics source > PartitionQueueMetrics,partition= already exists! 
> at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) > at > org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionMetrics(QueueMetrics.java:362) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.incrPendingResources(QueueMetrics.java:601) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updatePendingResources(AppSchedulingInfo.java:388) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:320) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:347) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updateResourceRequests(AppSchedulingInfo.java:183) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.updateResourceRequests(SchedulerApplicationAttempt.java:456) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:898) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling.testFairSchedulerContinuousSchedulingInitTime(TestContinuousScheduling.java:375) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 
{noformat}
[jira] [Updated] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-6492: Attachment: YARN-6492-branch-2.10.019.patch
[jira] [Commented] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120349#comment-17120349 ] Jonathan Hung commented on YARN-6492: - Thanks. I see this method is only used in tests in trunk too. I prefer to keep this method, remove the partition==null / partition == empty string check as in the trunk patch, and remove this method in another JIRA so that the branches are consistent. [~maniraj...@gmail.com] what do you think? I attached [^YARN-6492-branch-2.10.019.patch] for this. Can you take a look?
[jira] [Issue Comment Deleted] (YARN-10297) TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails intermittently
[ https://issues.apache.org/jira/browse/YARN-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-10297: - Comment: was deleted (was: a Hadoop QA precommit report for YARN-10297.001.patch: +1 overall, all checks passed, including the hadoop-yarn-server-resourcemanager unit tests in 87m 44s)
[jira] [Updated] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-6492: Attachment: YARN-6492-branch-3.2.017.patch YARN-6492-branch-3.1.018.patch
[jira] [Assigned] (YARN-10297) TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails intermittently
[ https://issues.apache.org/jira/browse/YARN-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung reassigned YARN-10297: Assignee: (was: Jonathan Hung)
[jira] [Updated] (YARN-10297) TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails intermittently
[ https://issues.apache.org/jira/browse/YARN-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-10297: - Description: updated to note that the failure reproduces when running {{mvn test -Dtest=TestContinuousScheduling}}; the quoted MetricsException stack trace is otherwise unchanged.
[jira] [Commented] (YARN-10297) TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails intermittently
[ https://issues.apache.org/jira/browse/YARN-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119970#comment-17119970 ] Jonathan Hung commented on YARN-10297: -- [~maniraj...@gmail.com] while debugging this, I noticed getPartitionMetrics is not synchronized. I added this and it did not fix the issue in this JIRA, but it seems like we still may need to add this?
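The synchronization point raised here is a classic check-then-act hazard: the lookup and the registration are separate steps, so two threads can both observe a missing metrics source and the loser's registration throws the "already exists" MetricsException. A minimal, self-contained sketch of that shape (class and method names are illustrative stand-ins, not the real QueueMetrics/DefaultMetricsSystem code):

```java
import java.util.HashMap;
import java.util.Map;

// Toy registry demonstrating the unsynchronized lookup-then-register shape
// described for getPartitionMetrics. Names are illustrative, not YARN's code.
public class PartitionMetricsRegistry {
    private static final Map<String, Object> SOURCES = new HashMap<>();

    // Racy shape: between get() and register(), another thread may register
    // the same partition, and register() then throws "already exists".
    static Object getRacy(String partition) {
        Object m = SOURCES.get(partition);
        if (m == null) {
            m = register(partition);
        }
        return m;
    }

    // Making lookup-and-register one atomic step closes the window; this is
    // the effect a synchronized getPartitionMetrics would have.
    static synchronized Object getSynchronized(String partition) {
        Object m = SOURCES.get(partition);
        if (m == null) {
            m = register(partition);
        }
        return m;
    }

    private static Object register(String partition) {
        if (SOURCES.containsKey(partition)) {
            throw new IllegalStateException("Metrics source PartitionQueueMetrics,partition="
                + partition + " already exists!");
        }
        Object source = new Object();
        SOURCES.put(partition, source);
        return source;
    }

    public static void main(String[] args) {
        Object a = getSynchronized("");
        Object b = getSynchronized("");
        System.out.println(a == b); // prints true: one source per partition
    }
}
```

Note that synchronization alone is consistent with the comment above: it closes the thread race but would not fix this JIRA's failure, where the duplicate source likely comes from state left registered in the static metrics system between test methods rather than from concurrent registration.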
[jira] [Updated] (YARN-10297) TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails intermittently
[ https://issues.apache.org/jira/browse/YARN-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-10297: - Attachment: (was: YARN-10297.001.patch)
[jira] [Assigned] (YARN-10297) TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails intermittently
[ https://issues.apache.org/jira/browse/YARN-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung reassigned YARN-10297: Assignee: Jonathan Hung > TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails > intermittently > --- > > Key: YARN-10297 > URL: https://issues.apache.org/jira/browse/YARN-10297 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Major > Attachments: YARN-10297.001.patch > > > After YARN-6492, testFairSchedulerContinuousSchedulingInitTime fails > intermittently. > {noformat}[INFO] Running > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling > [ERROR] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.682 > s <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling > [ERROR] > testFairSchedulerContinuousSchedulingInitTime(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling) > Time elapsed: 0.194 s <<< ERROR! > org.apache.hadoop.metrics2.MetricsException: Metrics source > PartitionQueueMetrics,partition= already exists! 
{noformat}
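A minimal sketch of why this exception fires, for readers not familiar with the metrics system: the metrics registry is JVM-wide and refuses duplicate source names, so state surviving from an earlier test in the same JVM breaks the next registration of "PartitionQueueMetrics,partition=". The class and method below are illustrative stand-ins, not the actual Hadoop metrics2 code.

```java
import java.util.HashMap;
import java.util.Map;

public class DuplicateSourceDemo {
  // Stand-in for the static, JVM-wide registry held by the metrics system.
  static final Map<String, Object> SOURCES = new HashMap<>();

  static void register(String name, Object source) {
    // Duplicate names are rejected, mirroring the behavior behind
    // "Metrics source ... already exists!" in the stack trace above.
    if (SOURCES.containsKey(name)) {
      throw new IllegalStateException(
          "Metrics source " + name + " already exists!");
    }
    SOURCES.put(name, source);
  }

  public static void main(String[] args) {
    // First "test" registers the partition metrics source: succeeds.
    register("PartitionQueueMetrics,partition=", new Object());
    // A later "test" in the same JVM registers the same name: throws,
    // unless the registry was cleared in between.
    try {
      register("PartitionQueueMetrics,partition=", new Object());
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Clearing the shared registry between tests (analogous to a metrics-system reset in teardown) makes the second registration succeed.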
[jira] [Updated] (YARN-10297) TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails intermittently
[ https://issues.apache.org/jira/browse/YARN-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-10297: - Attachment: YARN-10297.001.patch > TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails > intermittently > --- > > Key: YARN-10297 > URL: https://issues.apache.org/jira/browse/YARN-10297 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jonathan Hung >Priority: Major > Attachments: YARN-10297.001.patch > > > After YARN-6492, testFairSchedulerContinuousSchedulingInitTime fails > intermittently. > {noformat}[INFO] Running > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling > [ERROR] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.682 > s <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling > [ERROR] > testFairSchedulerContinuousSchedulingInitTime(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling) > Time elapsed: 0.194 s <<< ERROR! > org.apache.hadoop.metrics2.MetricsException: Metrics source > PartitionQueueMetrics,partition= already exists! 
{noformat}
[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119915#comment-17119915 ] Jonathan Hung edited comment on YARN-6492 at 5/29/20, 9:26 PM: --- Looks like TestContinuousScheduling is failing intermittently. I filed YARN-10297 for this issue. was (Author: jhung): Looks like TestContinuousScheduling is failing in branch-3.1 and below (it succeeds in branch-3.2). I'm able to trigger it by running: mvn test -Dtest=TestContinuousScheduling#testBasic,TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime None of the other tests which run before testFairSchedulerContinuousSchedulingInitTime seem to trigger the issue. > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5 > > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.8.014.patch, > YARN-6492-branch-2.9.015.patch, YARN-6492-junits.patch, YARN-6492.001.patch, > YARN-6492.002.patch, YARN-6492.003.patch, YARN-6492.004.patch, > YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, > YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, > YARN-6492.011.WIP.patch, YARN-6492.012.WIP.patch, YARN-6492.013.patch, > partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. 
[jira] [Created] (YARN-10297) TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails intermittently
Jonathan Hung created YARN-10297: Summary: TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime fails intermittently Key: YARN-10297 URL: https://issues.apache.org/jira/browse/YARN-10297 Project: Hadoop YARN Issue Type: Improvement Reporter: Jonathan Hung After YARN-6492, testFairSchedulerContinuousSchedulingInitTime fails intermittently. {noformat}[INFO] Running org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling [ERROR] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.682 s <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling [ERROR] testFairSchedulerContinuousSchedulingInitTime(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling) Time elapsed: 0.194 s <<< ERROR! org.apache.hadoop.metrics2.MetricsException: Metrics source PartitionQueueMetrics,partition= already exists!
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionMetrics(QueueMetrics.java:362)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.incrPendingResources(QueueMetrics.java:601)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updatePendingResources(AppSchedulingInfo.java:388)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:320)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.internalAddResourceRequests(AppSchedulingInfo.java:347)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updateResourceRequests(AppSchedulingInfo.java:183)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.updateResourceRequests(SchedulerApplicationAttempt.java:456)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:898)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling.testFairSchedulerContinuousSchedulingInitTime(TestContinuousScheduling.java:375)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) {noformat}
[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119915#comment-17119915 ] Jonathan Hung edited comment on YARN-6492 at 5/29/20, 8:43 PM: --- Looks like TestContinuousScheduling is failing in branch-3.1 and below (it succeeds in branch-3.2). I'm able to trigger it by running: mvn test -Dtest=TestContinuousScheduling#testBasic,TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime None of the other tests which run before testFairSchedulerContinuousSchedulingInitTime seem to trigger the issue. was (Author: jhung): Looks like TestContinuousScheduling is failing in branch-3.1 and below (it succeeds in branch-3.2). > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5 > > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.8.014.patch, > YARN-6492-branch-2.9.015.patch, YARN-6492-junits.patch, YARN-6492.001.patch, > YARN-6492.002.patch, YARN-6492.003.patch, YARN-6492.004.patch, > YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, > YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, > YARN-6492.011.WIP.patch, YARN-6492.012.WIP.patch, YARN-6492.013.patch, > partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. 
[jira] [Commented] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119915#comment-17119915 ] Jonathan Hung commented on YARN-6492: - Looks like TestContinuousScheduling is failing in branch-3.1 and below (it succeeds in branch-3.2). > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5 > > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.8.014.patch, > YARN-6492-branch-2.9.015.patch, YARN-6492-junits.patch, YARN-6492.001.patch, > YARN-6492.002.patch, YARN-6492.003.patch, YARN-6492.004.patch, > YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, > YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, > YARN-6492.011.WIP.patch, YARN-6492.012.WIP.patch, YARN-6492.013.patch, > partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-6492: Fix Version/s: 3.1.5 3.3.1 3.2.2 > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5 > > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.8.014.patch, > YARN-6492-branch-2.9.015.patch, YARN-6492-junits.patch, YARN-6492.001.patch, > YARN-6492.002.patch, YARN-6492.003.patch, YARN-6492.004.patch, > YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, > YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, > YARN-6492.011.WIP.patch, YARN-6492.012.WIP.patch, YARN-6492.013.patch, > partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119823#comment-17119823 ] Jonathan Hung commented on YARN-6492: - Thanks [~maniraj...@gmail.com]. For the branch-2.10 patch, do we need to remove the {noformat}if (partition == null || partition.equals(RMNodeLabelsManager.NO_LABEL)) {{noformat} check in {noformat}public void allocateResources(String partition, String user, Resource res) {{noformat} ? Other than that, branch-2.10 and branch-2.9 patch LGTM. Since branch-2.8 is EOL we don't need to port it there. I attached branch-3.2 and branch-3.1 patches containing trivial fixes. Pushed this to branch-3.3, branch-3.2, branch-3.1. > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Fix For: 3.4.0 > > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.8.014.patch, > YARN-6492-branch-2.9.015.patch, YARN-6492-junits.patch, YARN-6492.001.patch, > YARN-6492.002.patch, YARN-6492.003.patch, YARN-6492.004.patch, > YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, > YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, > YARN-6492.011.WIP.patch, YARN-6492.012.WIP.patch, YARN-6492.013.patch, > partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. 
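The guard quoted in the review comment above can be sketched in isolation to show what its removal would change. Only the condition itself is quoted from the patch discussion; the class, method, and return strings below are illustrative assumptions, and RMNodeLabelsManager.NO_LABEL is modeled as the empty string.

```java
public class PartitionGuardSketch {
  // Assumption: the "no label" sentinel is the empty string.
  static final String NO_LABEL = "";

  // With the guard in place, per-partition bookkeeping is skipped for the
  // default (unlabeled) partition; removing it, as suggested for the
  // branch-2.10 patch, would route the default partition through the same
  // per-partition path as explicitly labeled ones.
  static String classify(String partition) {
    if (partition == null || partition.equals(NO_LABEL)) {
      return "default-partition path";
    }
    return "labeled-partition path";
  }

  public static void main(String[] args) {
    System.out.println(classify(null));     // default-partition path
    System.out.println(classify(""));       // default-partition path
    System.out.println(classify("labelX")); // labeled-partition path
  }
}
```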
[jira] [Updated] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-6492: Attachment: (was: YARN-6492-branch-3.2.017.patch) > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Fix For: 3.4.0 > > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.8.014.patch, > YARN-6492-branch-2.9.015.patch, YARN-6492-junits.patch, YARN-6492.001.patch, > YARN-6492.002.patch, YARN-6492.003.patch, YARN-6492.004.patch, > YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, > YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, > YARN-6492.011.WIP.patch, YARN-6492.012.WIP.patch, YARN-6492.013.patch, > partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-6492: Attachment: (was: YARN-6492-branch-3.1.018.patch) > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Fix For: 3.4.0 > > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.8.014.patch, > YARN-6492-branch-2.9.015.patch, YARN-6492-junits.patch, YARN-6492.001.patch, > YARN-6492.002.patch, YARN-6492.003.patch, YARN-6492.004.patch, > YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, > YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, > YARN-6492.011.WIP.patch, YARN-6492.012.WIP.patch, YARN-6492.013.patch, > partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-6492: Attachment: YARN-6492-branch-3.1.018.patch > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Fix For: 3.4.0 > > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.8.014.patch, > YARN-6492-branch-2.9.015.patch, YARN-6492-branch-3.1.018.patch, > YARN-6492-branch-3.2.017.patch, YARN-6492-junits.patch, YARN-6492.001.patch, > YARN-6492.002.patch, YARN-6492.003.patch, YARN-6492.004.patch, > YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, > YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, > YARN-6492.011.WIP.patch, YARN-6492.012.WIP.patch, YARN-6492.013.patch, > partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-6492: Attachment: YARN-6492-branch-3.2.017.patch > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Fix For: 3.4.0 > > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.8.014.patch, > YARN-6492-branch-2.9.015.patch, YARN-6492-branch-3.2.017.patch, > YARN-6492-junits.patch, YARN-6492.001.patch, YARN-6492.002.patch, > YARN-6492.003.patch, YARN-6492.004.patch, YARN-6492.005.WIP.patch, > YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, > YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, YARN-6492.011.WIP.patch, > YARN-6492.012.WIP.patch, YARN-6492.013.patch, partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-6492: Fix Version/s: 3.4.0 > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Fix For: 3.4.0 > > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492-junits.patch, YARN-6492.001.patch, YARN-6492.002.patch, > YARN-6492.003.patch, YARN-6492.004.patch, YARN-6492.005.WIP.patch, > YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, > YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, YARN-6492.011.WIP.patch, > YARN-6492.012.WIP.patch, YARN-6492.013.patch, partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-6492: Attachment: YARN-6492.013.patch > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492-junits.patch, YARN-6492.001.patch, YARN-6492.002.patch, > YARN-6492.003.patch, YARN-6492.004.patch, YARN-6492.005.WIP.patch, > YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, > YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, YARN-6492.011.WIP.patch, > YARN-6492.012.WIP.patch, YARN-6492.013.patch, partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117171#comment-17117171 ] Jonathan Hung commented on YARN-6492: - Attached [^YARN-6492.013.patch] which fixes the whitespace issues and pushed to trunk. > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492-junits.patch, YARN-6492.001.patch, YARN-6492.002.patch, > YARN-6492.003.patch, YARN-6492.004.patch, YARN-6492.005.WIP.patch, > YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, > YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, YARN-6492.011.WIP.patch, > YARN-6492.012.WIP.patch, partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17116955#comment-17116955 ] Jonathan Hung commented on YARN-6492: - Thanks [~maniraj...@gmail.com]. [^YARN-6492.012.WIP.patch] LGTM. TestCapacitySchedulerAutoQueueCreation passes locally. I will commit EOD pending jenkins if no objections. I can review the branch specific patches once those are uploaded.
[jira] [Commented] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17116274#comment-17116274 ] Jonathan Hung commented on YARN-6492: - Ok, I see. On line 2542, can we remove the nm1 heartbeats and change the asserts accordingly? This appears to test that requesting default partition containers will get allocated to nm2, but if we heartbeat to nm1 before nm2, then they will get allocated to nm1 and we lose this test case. Can we fix the two whitespace issues too? The TestCapacitySchedulerAutoQueueCreation failures seem to be specific to PartitionQueueMetrics/PartitionMetrics somehow; I ran these tests before the patch and they succeed, meaning the metrics system is getting reset properly. Also, once we resolve these issues, will you upload patches for branches up to branch-2.10?
[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114350#comment-17114350 ] Jonathan Hung edited comment on YARN-6492 at 5/22/20, 11:34 PM:

Thank you [~maniraj...@gmail.com]. Some more comments:
* Delete printlns in PartitionQueueMetrics
* Delete VisibleForTesting import in CSQueueMetrics
* Can we address the checkstyle/whitespace/javadoc/findbugs/asflicense issues?
* TestCapacitySchedulerAutoQueueCreation failure looks related to this patch. TestFairSchedulerPreemption passes locally for me.

Ran some tests on a live cluster, everything looks good. I noticed we don't have CSQueueMetrics for partitioned metrics, it would be good to have those, but we can address this in another jira.
[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112648#comment-17112648 ] Jonathan Hung edited comment on YARN-6492 at 5/20/20, 10:59 PM:

Thank you [~maniraj...@gmail.com]. Looks fine at a high level. A few comments:
* We can change parentQueue in QueueMetrics.java to be Queue instead of AbstractCSQueue (to fix test cases)
* Right now we're concatenating QUEUE_METRICS keys as "partition + queuePath + userName", can we change this to "partition + '.' + userName + '.' + queuePath"? In particular the queuePath + userName part could cause conflicts (e.g. a queue named "root.auser" could conflict with user metrics under queue "root.a" and username "user"). Putting the user before the queue and adding the delimiter should prevent the user from being interpreted as part of the queue path. I see a few places for this:
# PartitionQueueMetrics#constructor#parentMetricName
# PartitionQueueMetrics#getUserMetrics#metricName
# QueueMetrics#getUserMetrics#metricName
# QueueMetrics#getPartitionQueueMetrics#metricName
# The key for QueueMetrics#getPartitionMetrics could also collide if the partition name is "root"
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I don't think we need to add the metrics object to QUEUE_METRICS, since we're accessing user metrics via the {{users}} map (and not the QUEUE_METRICS map)
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I don't think we need to add the queue path to the key, since the {{users}} map is not static
* The QueueMetrics#queueSource method does not seem to be used anywhere, can we delete it?
* How come we need a CSQueueMetrics#forQueue implementation? It looks the same as QueueMetrics#forQueue
* We shouldn't add capacity scheduler specific things in QueueInfo, are these changes needed?
* For partition metrics, I don't think setAvailableResourcesToQueue is handled correctly. It appears to update partition metrics no matter which queue this method is invoked for. Thus for example on line 87 of TestPartitionQueueMetrics: {noformat}checkResources(partitionSource, 0, 0, 0, 100 * GB, 100, 2 * GB, 2, 2);{noformat} should be {noformat}checkResources(partitionSource, 0, 0, 0, 200 * GB, 200, 2 * GB, 2, 2);{noformat} Perhaps we should only update partition metrics in setAvailableResourcesToQueue if the queue is root?
* Delete {noformat}System.out.println(" final is " + parentQueueSource_X.toString());{noformat}
* Same in TestQueueMetrics, there should not be capacity scheduler specific logic here, can we remove these changes?
* On line 2539 of TestNodeLabelContainerAllocation, should {noformat}assertEquals(2 * GB, queueAUserMetrics.getAvailableMB(), delta);{noformat} be {noformat}assertEquals(1.5 * GB, queueAUserMetrics.getAvailableMB(), delta);{noformat}?
* Do we need the tests after line 2551 of TestNodeLabelContainerAllocation? The removed code seems to cover non-exclusive node label functionality (default partition node heartbeating, and checking that queue metrics are correct), so we probably want to keep these tests.
* On line 2566, how is node1 getting 8 containers if queue A's max capacity is only 50% of 10GB = 5GB?
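The delimiter concern raised in the comment above can be made concrete. The sketch below (hypothetical helper names, not YARN's actual code) shows how the flat "partition + queuePath + userName" concatenation lets two distinct (queue, user) pairs produce the same key, while the delimited "partition + '.' + userName + '.' + queuePath" form keeps them apart:

```java
// Hypothetical demo of the QUEUE_METRICS key ambiguity discussed above.
// flatKey mimics the "partition + queuePath + userName" concatenation;
// delimitedKey mimics the proposed "partition + '.' + userName + '.' + queuePath".
public class MetricKeyDemo {
    public static String flatKey(String partition, String queuePath, String userName) {
        return partition + queuePath + userName;
    }

    public static String delimitedKey(String partition, String userName, String queuePath) {
        return partition + "." + userName + "." + queuePath;
    }

    public static void main(String[] args) {
        // Queue "root.a" with user "userfoo" vs. queue "root.auser" with user "foo":
        // both flatten to "xroot.auserfoo", so their metrics objects would collide.
        System.out.println(flatKey("x", "root.a", "userfoo")
            .equals(flatKey("x", "root.auser", "foo")));      // collision
        // With the delimiter and user-before-queue ordering the keys stay distinct,
        // because the user name can no longer be absorbed into the queue path.
        System.out.println(delimitedKey("x", "userfoo", "root.a")
            .equals(delimitedKey("x", "foo", "root.auser"))); // no collision
    }
}
```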
[jira] [Commented] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112648#comment-17112648 ] Jonathan Hung commented on YARN-6492: - Thank you [~maniraj...@gmail.com]. Looks fine at a high level. A few comments: * We can change parentQueue in QueueMetrics.java to be Queue instead of AbstractCSQueue (to fix test cases) * Right now we're concatenating QUEUE_METRICS keys as "partition + queuePath + userName", can we change this to "partition + '.' + userName + '.' + queuePath" ? In particular the queuePath + userName part could cause conflicts (e.g. queue named "root.auser" could conflict with user metrics under queue "root.a" and username "user"). I see a few places for this: # PartitionQueueMetrics#constructor#parentMetricName # PartitionQueueMetrics#getUserMetrics#metricName # QueueMetrics#getUserMetrics#metricName # QueueMetrics#getPartitionQueueMetrics#metricName # Key for QueueMetrics#getPartitionMetrics could collide if the partition name is "root" * In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I don't think we need to add the metrics object to QUEUE_METRICS, since we're accessing user metrics via the user map (and not the QUEUE_METRICS map) * In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I don't think we need to add queue path to the key, since the users map is not static * QueueMetrics#queueSource method does not seem to be used anywhere, can we delete it? * How come we need a CSQueueMetrics#forQueue implementation? It looks the same as QueueMetrics#forQueue * We shouldn't add capacity scheduler specific things in QueueInfo, are these changes needed? * I don't think setAvailableResourcesToQueue is handled correctly. It appears to update partition metrics no matter which queue this method is invoked for. 
Thus for example on line 87 of TestPartitionQueueMetrics: {noformat}checkResources(partitionSource, 0, 0, 0, 100 * GB, 100, 2 * GB, 2, 2);{noformat} should be {noformat}checkResources(partitionSource, 0, 0, 0, 200 * GB, 200, 2 * GB, 2, 2);{noformat} Perhaps we should only update partition metrics in setAvailableResourcesToQueue if the queue is root? * Delete {noformat}println System.out.println(" final is " + parentQueueSource_X.toString());{noformat} * Same in TestQueueMetrics, there should not be capacity scheduler specific logic here, can we remove these changes? * On line 2539 of TestNodeLabelContainerAllocation, should {noformat}assertEquals(2 * GB, queueAUserMetrics.getAvailableMB(), delta);{noformat} be {noformat}assertEquals(1.5 * GB, queueAUserMetrics.getAvailableMB(), delta);{noformat} ? * Do we need the tests after line 2551 on TestNodeLabelContainerAllocation? The stuff removed seems to be non-exclusive node label functionality (default partition node heartbeating, and checking queue metrics are correct), so we probably want to keep these tests. * On line 2566, how is node1 getting 8 containers if queue A's max capacity is only 50% of 10GB = 5GB? > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, > YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, > YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, > partition_metrics.txt > > > We are interested in having queue metrics for all partitions. 
Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
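A note on the key-collision point raised in the comment above: it can be demonstrated with a small standalone sketch. The class and method names below are illustrative only, not the actual QueueMetrics code; the queue/user values are the ones from the comment ("root.auser" vs. queue "root.a" with user "user").

```java
// Illustrates the QUEUE_METRICS key collision discussed in the review comment.
// Names here are hypothetical, not the real QueueMetrics API.
public class MetricKeyCollision {

    // Naive concatenation, as in the current patch: partition + queuePath + userName
    static String naiveKey(String partition, String queuePath, String user) {
        return partition + queuePath + user;
    }

    // Delimited form proposed in the review: partition + "." + userName + "." + queuePath
    static String delimitedKey(String partition, String queuePath, String user) {
        return partition + "." + user + "." + queuePath;
    }

    public static void main(String[] args) {
        // Queue "root.auser" (no user) vs. queue "root.a" with user "user":
        String a = naiveKey("x", "root.auser", "");
        String b = naiveKey("x", "root.a", "user");
        System.out.println(a.equals(b));   // true -> the two keys collide

        String c = delimitedKey("x", "root.auser", "");
        String d = delimitedKey("x", "root.a", "user");
        System.out.println(c.equals(d));   // false -> delimiter keeps them distinct
    }
}
```

With the delimiter, "x..root.auser" and "x.user.root.a" can no longer alias each other, which is the point of the proposed key ordering.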
[jira] [Commented] (YARN-10260) Allow transitioning queue from DRAINING to RUNNING state
[ https://issues.apache.org/jira/browse/YARN-10260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17104940#comment-17104940 ] Jonathan Hung commented on YARN-10260: -- +1 looks fine to me. I'll commit this tomorrow if no objections. > Allow transitioning queue from DRAINING to RUNNING state > > > Key: YARN-10260 > URL: https://issues.apache.org/jira/browse/YARN-10260 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jonathan Hung >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-10260.001.patch > > > We found in our cluster that a queue was erroneously stopped. The queue is then > internally in DRAINING state. It cannot be moved back to RUNNING state until > the queue is finished draining. For queues with large workloads, this can > block other apps from submitting to this queue for a long time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10263) Application summary is logged multiple times due to RM recovery
Jonathan Hung created YARN-10263: Summary: Application summary is logged multiple times due to RM recovery Key: YARN-10263 URL: https://issues.apache.org/jira/browse/YARN-10263 Project: Hadoop YARN Issue Type: Improvement Reporter: Jonathan Hung An app finishes and is logged to the RM app summary. The RM restarts, and the app is logged to the RM app summary again. We would need some way to know, across restarts, whether an app has already been logged or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17104672#comment-17104672 ] Jonathan Hung commented on YARN-6492: - IMO we should still have {noformat} "name" : "Hadoop:service=ResourceManager,name=QueueMetrics,q0=root,q1=a" ...{noformat} report queue metrics for default partition only. Users could also use {noformat}name=PartitionQueueMetrics,partition=default,q0=root{noformat} (or, {noformat}name=PartitionQueueMetrics,partition=,q0=root{noformat}) for default queue metrics, but if people are already using {noformat} "name" : "Hadoop:service=ResourceManager,name=QueueMetrics,q0=root,q1=a" ...{noformat} for default queue metrics (since this has already gone into many releases) I don't think we can justify breaking this behavior. If we want to change this behavior so {noformat} "name" : "Hadoop:service=ResourceManager,name=QueueMetrics,q0=root,q1=a" ...{noformat} reports metrics for all partitions, as it was before YARN-6467, we can revisit that in a later JIRA. But I don't think we should do it here. > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, > YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, > YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. 
(After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
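For monitoring clients, the compatibility argument above shows up in the JMX object names. A minimal sketch of telling the legacy bean apart from the new partition bean, using only the standard javax.management API (the bean names are the ones quoted in the comment; the helper method is hypothetical):

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class MetricBeanNames {

    // Returns true for the legacy per-queue metrics bean, which per the
    // comment above should keep reporting default-partition metrics only.
    static boolean isLegacyQueueMetrics(String beanName) {
        try {
            ObjectName on = new ObjectName(beanName);
            return "QueueMetrics".equals(on.getKeyProperty("name"));
        } catch (MalformedObjectNameException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isLegacyQueueMetrics(
            "Hadoop:service=ResourceManager,name=QueueMetrics,q0=root,q1=a"));
        System.out.println(isLegacyQueueMetrics(
            "Hadoop:service=ResourceManager,name=PartitionQueueMetrics,partition=default,q0=root"));
    }
}
```

Existing dashboards keyed on name=QueueMetrics would keep working unchanged, while partition-aware consumers opt in to the name=PartitionQueueMetrics beans.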
[jira] [Commented] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103520#comment-17103520 ] Jonathan Hung commented on YARN-6492: - Ok I see. This was not my original understanding. I assumed YARN-6467 was filed standalone, then I filed this ticket because I saw YARN-6467 would remove partitioned metrics. IMO if there's multiple JIRAs that require a feature to work properly, they shouldn't be committed separately. In any case, YARN-6467 has already made its way into releases, so we have already broken compatibility. Hence, I think we should treat "original queuemetrics computation" as behavior *after* YARN-6467 (I don't want this JIRA to reverse the behavior from YARN-6467, thus breaking compatibility again). [~maniraj...@gmail.com] [~epayne] let me know if this makes sense. > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, > YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, > YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101990#comment-17101990 ] Jonathan Hung edited comment on YARN-6492 at 5/7/20, 7:28 PM: -- [~maniraj...@gmail.com], thanks. Seems you missed uploading PartitionQueueMetrics class. I definitely think we should address #2, #3, and #4 in this JIRA. I don't think #3 is addressed by YARN-9767. For example it edits the tests in the same way, i.e. {noformat}assertEquals(10 * GB, leafQueueA.getMetrics().getAvailableMB());{noformat} is changed to {noformat}assertEquals(22 * GB, leafQueueA.getMetrics().getAvailableMB());{noformat}, but this assert should still be 0 GB, since the default partition has no resources. IMO the bottom line is that after this JIRA is committed, the existing QueueMetrics should still only contain metrics for default partition, and partitioned queue metrics should only be in the newly added metrics. It will get very confusing if we break this behavior in this JIRA and then patch it in another. What do you think? Also, regarding your first point in YARN-9767 about non exclusive node labels, this issue seems to exist even before YARN-6492, so I think we can address this issue in YARN-9767. was (Author: jhung): [~maniraj...@gmail.com], thanks. Seems you missed uploading PartitionQueueMetrics class. I definitely think we should address #2, #3, and #4 in this JIRA. I don't think #3 is addressed by YARN-9767. For example it edits the tests in the same way, i.e. {noformat}assertEquals(10 * GB, leafQueueA.getMetrics().getAvailableMB());{noformat} is changed to {noformat}assertEquals(22 * GB, leafQueueA.getMetrics().getAvailableMB());{noformat}, but this assert should still be 0 GB, since the default partition has no resources. 
IMO the bottom line is that after this JIRA is committed, the existing QueueMetrics should still only contain metrics for default partition, and partitioned queue metrics should only be in the newly added metrics. What do you think? > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, > YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, > YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101990#comment-17101990 ] Jonathan Hung edited comment on YARN-6492 at 5/7/20, 7:28 PM: -- [~maniraj...@gmail.com], thanks. Seems you missed uploading PartitionQueueMetrics class. I definitely think we should address #2, #3, and #4 in this JIRA. Also, I don't think #3 is addressed by YARN-9767. For example it edits the tests in the same way, i.e. {noformat}assertEquals(10 * GB, leafQueueA.getMetrics().getAvailableMB());{noformat} is changed to {noformat}assertEquals(22 * GB, leafQueueA.getMetrics().getAvailableMB());{noformat}, but this assert should still be 0 GB, since the default partition has no resources. IMO the bottom line is that after this JIRA is committed, the existing QueueMetrics should still only contain metrics for default partition, and partitioned queue metrics should only be in the newly added metrics. It will get very confusing if we break this behavior in this JIRA and then patch it in another. What do you think? Also, regarding your first point in YARN-9767 about non exclusive node labels, this issue seems to exist even before YARN-6492, so I think we can address this issue in YARN-9767. was (Author: jhung): [~maniraj...@gmail.com], thanks. Seems you missed uploading PartitionQueueMetrics class. I definitely think we should address #2, #3, and #4 in this JIRA. I don't think #3 is addressed by YARN-9767. For example it edits the tests in the same way, i.e. {noformat}assertEquals(10 * GB, leafQueueA.getMetrics().getAvailableMB());{noformat} is changed to {noformat}assertEquals(22 * GB, leafQueueA.getMetrics().getAvailableMB());{noformat}, but this assert should still be 0 GB, since the default partition has no resources. 
IMO the bottom line is that after this JIRA is committed, the existing QueueMetrics should still only contain metrics for default partition, and partitioned queue metrics should only be in the newly added metrics. It will get very confusing if we break this behavior in this JIRA and then patch it in another. What do you think? Also, regarding your first point in YARN-9767 about non exclusive node labels, this issue seems to exist even before YARN-6492, so I think we can address this issue in YARN-9767. > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, > YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, > YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101990#comment-17101990 ] Jonathan Hung commented on YARN-6492: - [~maniraj...@gmail.com], thanks. Seems you missed uploading PartitionQueueMetrics class. I definitely think we should address #2, #3, and #4 in this JIRA. I don't think #3 is addressed by YARN-9767. For example it edits the tests in the same way, i.e. {noformat}assertEquals(10 * GB, leafQueueA.getMetrics().getAvailableMB());{noformat} is changed to {noformat}assertEquals(22 * GB, leafQueueA.getMetrics().getAvailableMB());{noformat}, but this assert should still be 0 GB, since the default partition has no resources. IMO the bottom line is that after this JIRA is committed, the existing QueueMetrics should still only contain metrics for default partition, and partitioned queue metrics should only be in the newly added metrics. What do you think? > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, > YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, > YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10260) Allow transitioning queue from DRAINING to RUNNING state
Jonathan Hung created YARN-10260: Summary: Allow transitioning queue from DRAINING to RUNNING state Key: YARN-10260 URL: https://issues.apache.org/jira/browse/YARN-10260 Project: Hadoop YARN Issue Type: Improvement Reporter: Jonathan Hung We found in our cluster that a queue was erroneously stopped. The queue is then internally in DRAINING state. It cannot be moved back to RUNNING state until the queue is finished draining. For queues with large workloads, this can block other apps from submitting to this queue for a long time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100369#comment-17100369 ] Jonathan Hung edited comment on YARN-6492 at 5/6/20, 1:15 AM: -- OK thanks [~maniraj...@gmail.com] for the explanation. Sorry for the long delay, took some time to grok the latest 007 patch. * Can we rename getPartitionQueueMetrics to something different? My initial confusion was that getPartitionQueueMetrics for QueueMetrics and PartitionQueueMetrics serve different purposes...the former for queue*partition and the latter for partition only. It's especially confusing in the case of PartitionQueueMetrics#getPartitionQueueMetrics, since this has nothing to do with queues. We can update the comment for PartitionQueueMetrics#getPartitionQueueMetrics as well, it also says Partition * Queue. * Mentioned this earlier, can we remove the {noformat} if (parent != null) { parent.setAvailableResourcesToUser(partition, user, limit); }{noformat} check in QueueMetrics#setAvailableResourcesToUser? I think it should be addressed here rather than YARN-9767. * I don't think the asserts in TestNodeLabelContainerAllocation should change. leafQueue.getMetrics should return metrics for default partition. I think we still need to check in QueueMetrics#setAvailableResourcesToUser and QueueMetrics#setAvailableResourcesToQueue whether partition is null or empty string. (This will break updating partition queue metrics, so we need to find a way to distinguish whether we're updating default partition queue metrics or partitioned queue metrics within the setAvailableResourcesToUser/setAvailableResourcesToQueue function.) * Mentioned before, can we update everywhere we're creating a new metricName for partition/user/queue metrics to use a delimiter? e.g. {noformat}String metricName = partition + this.queueName + userName;{noformat}. Otherwise there's a chance that these metric names could collide. 
was (Author: jhung): OK thanks [~maniraj...@gmail.com] for the explanation. Sorry for the long delay, took some time to grok the latest 007 patch. * Can we rename getPartitionQueueMetrics to something different? My initial confusion was that getPartitionQueueMetrics for QueueMetrics and PartitionQueueMetrics serve different purposes...the former for queue*partition and the latter for partition only. It's especially confusing in the case of PartitionQueueMetrics#getPartitionQueueMetrics, since this has nothing to do with queues. We can update the comment for PartitionQueueMetrics#getPartitionQueueMetrics as well, it also says Partition * Queue. * Mentioned this earlier, can we remove the {noformat} if (parent != null) { parent.setAvailableResourcesToUser(partition, user, limit); }{noformat} check in QueueMetrics#setAvailableResourcesToUser? I think it should be addressed here rather than YARN-9767. * I don't think the asserts in TestNodeLabelContainerAllocation should change. leafQueue.getMetrics should return metrics for default partition. I think we still need to check in QueueMetrics#setAvailableResourcesToUser and QueueMetrics#setAvailableResourcesToQueue whether partition is null or empty string. (This will break updating partition queue metrics, so we need to find a way to distinguish whether we're updating default partition queue metrics or partitioned queue metrics.) * Mentioned before, can we update everywhere we're creating a new metricName for partition/user/queue metrics to use a delimiter? e.g. {noformat}String metricName = partition + this.queueName + userName;{noformat}. Otherwise there's a chance that these metric names could collide. 
> Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, > YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, > YARN-6492.007.WIP.patch, partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100369#comment-17100369 ] Jonathan Hung edited comment on YARN-6492 at 5/6/20, 1:14 AM: -- OK thanks [~maniraj...@gmail.com] for the explanation. Sorry for the long delay, took some time to grok the latest 007 patch. * Can we rename getPartitionQueueMetrics to something different? My initial confusion was that getPartitionQueueMetrics for QueueMetrics and PartitionQueueMetrics serve different purposes...the former for queue*partition and the latter for partition only. It's especially confusing in the case of PartitionQueueMetrics#getPartitionQueueMetrics, since this has nothing to do with queues. We can update the comment for PartitionQueueMetrics#getPartitionQueueMetrics as well, it also says Partition * Queue. * Mentioned this earlier, can we remove the {noformat} if (parent != null) { parent.setAvailableResourcesToUser(partition, user, limit); }{noformat} check in QueueMetrics#setAvailableResourcesToUser? I think it should be addressed here rather than YARN-9767. * I don't think the asserts in TestNodeLabelContainerAllocation should change. leafQueue.getMetrics should return metrics for default partition. I think we still need to check in QueueMetrics#setAvailableResourcesToUser and QueueMetrics#setAvailableResourcesToQueue whether partition is null or empty string. (This will break updating partition queue metrics, so we need to find a way to distinguish whether we're updating default partition queue metrics or partitioned queue metrics.) * Mentioned before, can we update everywhere we're creating a new metricName for partition/user/queue metrics to use a delimiter? e.g. {noformat}String metricName = partition + this.queueName + userName;{noformat}. Otherwise there's a chance that these metric names could collide. was (Author: jhung): OK thanks [~maniraj...@gmail.com] for the explanation. 
Sorry for the long delay, took some time to grok the latest 007 patch. * Can we rename getPartitionQueueMetrics to something different? My initial confusion was that getPartitionQueueMetrics for QueueMetrics and PartitionQueueMetrics serve different purposes...the former for queue*partition and the latter for partition only. It's especially confusing in the case of PartitionQueueMetrics#getPartitionQueueMetrics, since this has nothing to do with queues. We can update the comment for PartitionQueueMetrics#getPartitionQueueMetrics as well, it also says Partition * Queue. * Mentioned this earlier, can we remove the {noformat} if (parent != null) { parent.setAvailableResourcesToUser(partition, user, limit); }{noformat} check in QueueMetrics#setAvailableResourcesToUser? I think it should be addressed here rather than YARN-9767. * I don't think the asserts in TestNodeLabelContainerAllocation should change. leafQueue.getMetrics should return metrics for default partition. I think we still need to check in QueueMetrics#setAvailableResourcesToUser and QueueMetrics#setAvailableResourcesToQueue whether partition is null or empty string. (This will break updating partition queue metrics, so we need to find a way to distinguish whether we're updating default partition queue metrics or partitioned queue metrics.) * Mentioned before, can we update everywhere we're creating a new metricName for partition/user/queue metrics to use a delimiter? e.g. {noformat}String metricName = partition + this.queueName + userName;{noformat}. Otherwise there's a chance that these metric names could collide. 
> Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, > YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, > YARN-6492.007.WIP.patch, partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For addit
[jira] [Commented] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100369#comment-17100369 ] Jonathan Hung commented on YARN-6492: - OK thanks [~maniraj...@gmail.com] for the explanation. Sorry for the long delay, took some time to grok the latest 007 patch. * Can we rename getPartitionQueueMetrics to something different? My initial confusion was that getPartitionQueueMetrics for QueueMetrics and PartitionQueueMetrics serve different purposes...the former for queue*partition and the latter for partition only. It's especially confusing in the case of PartitionQueueMetrics#getPartitionQueueMetrics, since this has nothing to do with queues. We can update the comment for PartitionQueueMetrics#getPartitionQueueMetrics as well, it also says Partition * Queue. * Mentioned this earlier, can we remove the {noformat} if (parent != null) { parent.setAvailableResourcesToUser(partition, user, limit); }{noformat} check in QueueMetrics#setAvailableResourcesToUser? I think it should be addressed here rather than YARN-9767. * I don't think the asserts in TestNodeLabelContainerAllocation should change. leafQueue.getMetrics should return metrics for default partition. I think we still need to check in QueueMetrics#setAvailableResourcesToUser and QueueMetrics#setAvailableResourcesToQueue whether partition is null or empty string. (This will break updating partition queue metrics, so we need to find a way to distinguish whether we're updating default partition queue metrics or partitioned queue metrics.) * Mentioned before, can we update everywhere we're creating a new metricName for partition/user/queue metrics to use a delimiter? e.g. {noformat}String metricName = partition + this.queueName + userName;{noformat}. Otherwise there's a chance that these metric names could collide. 
> Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, > YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, > YARN-6492.007.WIP.patch, partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
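The null-or-empty partition check suggested in the comment above could look roughly like the following sketch. All names here are hypothetical, not the real QueueMetrics API; the real code would route to actual metrics objects rather than strings.

```java
// Sketch of routing a resource update to either the legacy default-partition
// metrics or per-partition metrics, per the review comment above.
// Class and method names are illustrative, not the actual QueueMetrics code.
public class PartitionRouting {

    // YARN represents the default partition as null or the empty string.
    static boolean isDefaultPartition(String partition) {
        return partition == null || partition.isEmpty();
    }

    // Decide which metrics sink a setAvailableResources* update should hit.
    static String targetFor(String partition) {
        return isDefaultPartition(partition)
            ? "QueueMetrics"
            : "PartitionQueueMetrics:" + partition;
    }

    public static void main(String[] args) {
        System.out.println(targetFor(null));  // QueueMetrics
        System.out.println(targetFor(""));    // QueueMetrics
        System.out.println(targetFor("x"));   // PartitionQueueMetrics:x
    }
}
```

As the comment notes, a bare check like this would stop partitioned updates from reaching the legacy metrics, so the caller still needs some signal distinguishing "update default-partition view" from "update partition x's view".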
[jira] [Assigned] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung reassigned YARN-6492: --- Assignee: Manikandan R (was: Jonathan Hung) > Generate queue metrics for each partition > - > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Jonathan Hung >Assignee: Manikandan R >Priority: Major > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, > YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, > YARN-6492.007.WIP.patch, partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-6492) Generate queue metrics for each partition
[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung reassigned YARN-6492: --- Assignee: Jonathan Hung (was: Manikandan R)
[jira] [Commented] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.
[ https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096901#comment-17096901 ] Jonathan Hung commented on YARN-8193: - javadoc complains about AbstractYarnScheduler which this patch doesn't touch. Seems unrelated. I pushed [^YARN-8193-branch-2.10-001.patch] to branch-2.10 > YARN RM hangs abruptly (stops allocating resources) when running successive > applications. > - > > Key: YARN-8193 > URL: https://issues.apache.org/jira/browse/YARN-8193 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Blocker > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8193-branch-2-001.patch, > YARN-8193-branch-2.10-001.patch, YARN-8193-branch-2.9.0-001.patch, > YARN-8193.001.patch, YARN-8193.002.patch > > > When running massive queries successively, at some point the RM just hangs and > stops allocating resources. At the point the RM hangs, YARN throws a > NullPointerException at RegularContainerAllocator.getLocalityWaitFactor. > There's sufficient space given to yarn.nodemanager.local-dirs (not a node > health issue; the RM didn't report any node being unhealthy). There is no fixed > trigger for this (query or operation). > This problem goes away on restarting the ResourceManager. No NM restart is > required.
[jira] [Commented] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.
[ https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096807#comment-17096807 ] Jonathan Hung commented on YARN-8193: - Hit this issue on 2.10.0 cluster. Reuploading patch
[jira] [Updated] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.
[ https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-8193: Attachment: YARN-8193-branch-2.10-001.patch
[jira] [Comment Edited] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.
[ https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096807#comment-17096807 ] Jonathan Hung edited comment on YARN-8193 at 4/30/20, 5:32 PM: --- Hit this issue on 2.10.0 cluster. Reuploading patch to trigger jenkins was (Author: jhung): Hit this issue on 2.10.0 cluster. Reuploading patch
[jira] [Updated] (YARN-8382) cgroup file leak in NM
[ https://issues.apache.org/jira/browse/YARN-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-8382: Fix Version/s: 2.10.1 > cgroup file leak in NM > -- > > Key: YARN-8382 > URL: https://issues.apache.org/jira/browse/YARN-8382 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager > Environment: we write a container with a shutdownHook which has a > piece of code like "while(true) sleep(100)". When > *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* < > *yarn.nodemanager.sleep-delay-before-sigkill.ms*, the cgroup file leak happens; > when *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* > > *yarn.nodemanager.sleep-delay-before-sigkill.ms*, the cgroup file is deleted > successfully. >Reporter: Hu Ziqian >Assignee: Hu Ziqian >Priority: Major > Fix For: 3.2.0, 3.1.1, 3.0.4, 2.10.1 > > Attachments: YARN-8382-branch-2.8.3.001.patch, > YARN-8382-branch-2.8.3.002.patch, YARN-8382.001.patch, YARN-8382.002.patch > > > As Jiandan said in YARN-6562, the NM may time out deleting a container's cgroup files, > with logs like below: > org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: > Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_xxx, tried to > delete for 1000ms > > We found one situation in which this happens: when we set > *yarn.nodemanager.sleep-delay-before-sigkill.ms* bigger than > *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms*, the > cgroup file leak occurs. > > One container process tree looks like the following graph: > bash(16097)───java(16099)─┬─{java}(16100) > ├─{java}(16101) > ├─{java}(16102) > > When the NM kills a container, it sends kill -15 -pid to kill the container's process > group. The bash process exits when it receives SIGTERM, but the java process may > do some work (shutdownHook etc.) and doesn't exit until it receives SIGKILL. And > when the bash process exits, CgroupsLCEResourcesHandler begins trying to delete the > cgroup files. So when > *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* > expires, the java processes may still be running, cgroup/tasks may still not be > empty, and the cgroup files leak. > > We add a condition that > *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* must be > bigger than *yarn.nodemanager.sleep-delay-before-sigkill.ms* to solve this > problem.
[jira] [Commented] (YARN-8382) cgroup file leak in NM
[ https://issues.apache.org/jira/browse/YARN-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17093935#comment-17093935 ] Jonathan Hung commented on YARN-8382: - Pushed to branch-2.10
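The YARN-8382 fix described above is a configuration-ordering constraint: the cgroup delete timeout must be longer than the SIGKILL delay, so container JVMs are dead before the NM gives up deleting their cgroup files. A minimal standalone sketch of that validation, under the assumption that it is enforced at NM startup (the class and method names here are illustrative, not the actual NodeManager code):

```java
// Sketch: reject configs where the cgroup delete timeout can expire while
// container JVMs (which only die on SIGKILL) are still listed in cgroup/tasks,
// leaving /cgroup/cpu/hadoop-yarn/container_xxx undeletable.
public class CgroupTimeoutCheck {
    static void validate(long deleteTimeoutMs, long sigkillDelayMs) {
        if (deleteTimeoutMs <= sigkillDelayMs) {
            throw new IllegalArgumentException(
                "cgroups.delete-timeout-ms (" + deleteTimeoutMs + ") must be "
                + "bigger than sleep-delay-before-sigkill.ms (" + sigkillDelayMs
                + "), or cgroup files may leak");
        }
    }

    public static void main(String[] args) {
        validate(2000, 250);      // ok: SIGKILL lands well before the delete deadline
        try {
            validate(1000, 5000); // the leak scenario reported above
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```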
[jira] [Commented] (YARN-9954) Configurable max application tags and max tag length
[ https://issues.apache.org/jira/browse/YARN-9954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085196#comment-17085196 ] Jonathan Hung commented on YARN-9954: - Thanks [~BilwaST], can you add some tests verifying that app submission fails if tags too long/too many tags/tags not ASCII? Also seems like we need two patches, a trunk patch with these Evolving fields removed and a branch-3.3 patch with the fields deprecated? > Configurable max application tags and max tag length > > > Key: YARN-9954 > URL: https://issues.apache.org/jira/browse/YARN-9954 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jonathan Hung >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-9954-branch-3.3.patch, YARN-9954.001.patch > > > Currently max tags and max tag length is hardcoded, it should be configurable > {noformat} > @Evolving > public static final int APPLICATION_MAX_TAGS = 10; > @Evolving > public static final int APPLICATION_MAX_TAG_LENGTH = 100; {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9954) Configurable max application tags and max tag length
[ https://issues.apache.org/jira/browse/YARN-9954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083602#comment-17083602 ] Jonathan Hung edited comment on YARN-9954 at 4/14/20, 8:54 PM: --- Thanks [~BilwaST], a few comments * Let's change {noformat}/**Max size of application tags.*/{noformat} -> {noformat}/** Max number of application tags.*/{noformat} * Also in yarn-default.xml, let's change {noformat}Max size of application tags {noformat} -> {noformat}Max number of application tags {noformat} * Agree with Adam, let's wrap the IllegalArgumentExceptions as YarnException via RPCUtil.getRemoteException Also, this jira will be useful to have in older minor versions. But we cannot remove @Evolving fields within a minor version. Shall we open a separate jira to remove these fields, and in this jira set DEFAULT_RM_APPLICATION_MAX_TAGS to APPLICATION_MAX_TAGS and set DEFAULT_RM_APPLICATION_MAX_TAG_LENGTH to APPLICATION_MAX_TAG_LENGTH ? was (Author: jhung): Thanks [~BilwaST], a few comments * Let's change {noformat}/**Max size of application tags.*/{noformat} -> {noformat}/** Max number of application tags.*/{noformat} * {{Max size of application tags }} -> {{Max number of application tags }} * Agree with Adam, let's wrap the IllegalArgumentExceptions as YarnException via RPCUtil.getRemoteException Also, this jira will be useful to have in older minor versions. But we cannot remove @Evolving fields within a minor version. Shall we open a separate jira to remove these fields, and in this jira set DEFAULT_RM_APPLICATION_MAX_TAGS to APPLICATION_MAX_TAGS and set DEFAULT_RM_APPLICATION_MAX_TAG_LENGTH to APPLICATION_MAX_TAG_LENGTH ? 
[jira] [Comment Edited] (YARN-9954) Configurable max application tags and max tag length
[ https://issues.apache.org/jira/browse/YARN-9954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083602#comment-17083602 ] Jonathan Hung edited comment on YARN-9954 at 4/14/20, 8:53 PM: --- Thanks [~BilwaST], a few comments * Let's change {noformat}/**Max size of application tags.*/{noformat} -> {noformat}/** Max number of application tags.*/{noformat} * {{Max size of application tags }} -> {{Max number of application tags }} * Agree with Adam, let's wrap the IllegalArgumentExceptions as YarnException via RPCUtil.getRemoteException Also, this jira will be useful to have in older minor versions. But we cannot remove @Evolving fields within a minor version. Shall we open a separate jira to remove these fields, and in this jira set DEFAULT_RM_APPLICATION_MAX_TAGS to APPLICATION_MAX_TAGS and set DEFAULT_RM_APPLICATION_MAX_TAG_LENGTH to APPLICATION_MAX_TAG_LENGTH ? was (Author: jhung): Thanks [~BilwaST], a few comments * Let's change {{/** Max size of application tags.*/}} -> {{/** Max number of application tags.*/}} * {{Max size of application tags }} -> {{Max number of application tags }} * Agree with Adam, let's wrap the IllegalArgumentExceptions as YarnException via RPCUtil.getRemoteException Also, this jira will be useful to have in older minor versions. But we cannot remove @Evolving fields within a minor version. Shall we open a separate jira to remove these fields, and in this jira set DEFAULT_RM_APPLICATION_MAX_TAGS to APPLICATION_MAX_TAGS and set DEFAULT_RM_APPLICATION_MAX_TAG_LENGTH to APPLICATION_MAX_TAG_LENGTH ? 
[jira] [Commented] (YARN-9954) Configurable max application tags and max tag length
[ https://issues.apache.org/jira/browse/YARN-9954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083602#comment-17083602 ] Jonathan Hung commented on YARN-9954: - Thanks [~BilwaST], a few comments * Let's change {{/**Max size of application tags.*/}} -> {{/** Max number of application tags.*/}} * {{Max size of application tags }} -> {{Max number of application tags }} * Agree with Adam, let's wrap the IllegalArgumentExceptions as YarnException via RPCUtil.getRemoteException Also, this jira will be useful to have in older minor versions. But we cannot remove @Evolving fields within a minor version. Shall we open a separate jira to remove these fields, and in this jira set DEFAULT_RM_APPLICATION_MAX_TAGS to APPLICATION_MAX_TAGS and set DEFAULT_RM_APPLICATION_MAX_TAG_LENGTH to APPLICATION_MAX_TAG_LENGTH ?
[jira] [Comment Edited] (YARN-9954) Configurable max application tags and max tag length
[ https://issues.apache.org/jira/browse/YARN-9954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083602#comment-17083602 ] Jonathan Hung edited comment on YARN-9954 at 4/14/20, 8:52 PM: --- Thanks [~BilwaST], a few comments * Let's change {{/** Max size of application tags.*/}} -> {{/** Max number of application tags.*/}} * {{Max size of application tags }} -> {{Max number of application tags }} * Agree with Adam, let's wrap the IllegalArgumentExceptions as YarnException via RPCUtil.getRemoteException Also, this jira will be useful to have in older minor versions. But we cannot remove @Evolving fields within a minor version. Shall we open a separate jira to remove these fields, and in this jira set DEFAULT_RM_APPLICATION_MAX_TAGS to APPLICATION_MAX_TAGS and set DEFAULT_RM_APPLICATION_MAX_TAG_LENGTH to APPLICATION_MAX_TAG_LENGTH ? was (Author: jhung): Thanks [~BilwaST], a few comments * Let's change {{/**Max size of application tags.*/}} -> {{/** Max number of application tags.*/}} * {{Max size of application tags }} -> {{Max number of application tags }} * Agree with Adam, let's wrap the IllegalArgumentExceptions as YarnException via RPCUtil.getRemoteException Also, this jira will be useful to have in older minor versions. But we cannot remove @Evolving fields within a minor version. Shall we open a separate jira to remove these fields, and in this jira set DEFAULT_RM_APPLICATION_MAX_TAGS to APPLICATION_MAX_TAGS and set DEFAULT_RM_APPLICATION_MAX_TAG_LENGTH to APPLICATION_MAX_TAG_LENGTH ? 
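The YARN-9954 review comments above outline the intended shape of the change: the new configs default to the previously hardcoded @Evolving constants, and submission-time validation rejects oversized or non-ASCII tag sets (with the real patch wrapping the failure as YarnException via RPCUtil.getRemoteException). A hedged standalone sketch of that validation; the class name and plain IllegalArgumentException are illustrative stand-ins, not the committed Hadoop API:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Sketch of submission-time tag validation with configurable limits that
// default to the old hardcoded @Evolving constants (10 tags, 100 chars each).
public class TagLimits {
    static final int DEFAULT_MAX_TAGS = 10;        // was APPLICATION_MAX_TAGS
    static final int DEFAULT_MAX_TAG_LENGTH = 100; // was APPLICATION_MAX_TAG_LENGTH

    final int maxTags;
    final int maxTagLength;

    TagLimits(int maxTags, int maxTagLength) {
        this.maxTags = maxTags;
        this.maxTagLength = maxTagLength;
    }

    /** Rejects too many tags, overlong tags, and non-ASCII tags. */
    void validate(List<String> tags) {
        if (tags.size() > maxTags) {
            throw new IllegalArgumentException(
                "Too many tags: " + tags.size() + " > " + maxTags);
        }
        for (String tag : tags) {
            if (tag.length() > maxTagLength) {
                throw new IllegalArgumentException(
                    "Tag longer than " + maxTagLength + ": " + tag);
            }
            if (!StandardCharsets.US_ASCII.newEncoder().canEncode(tag)) {
                throw new IllegalArgumentException("Tag is not ASCII: " + tag);
            }
        }
    }

    public static void main(String[] args) {
        TagLimits limits = new TagLimits(DEFAULT_MAX_TAGS, DEFAULT_MAX_TAG_LENGTH);
        limits.validate(List.of("spark", "ad-hoc")); // passes all three checks
        try {
            limits.validate(List.of("caf\u00e9"));   // rejected: non-ASCII
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```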
[jira] [Commented] (YARN-10227) Pull YARN-8242 back to branch-2.10
[ https://issues.apache.org/jira/browse/YARN-10227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079656#comment-17079656 ] Jonathan Hung commented on YARN-10227: -- Thanks Jim for fixing this. Belated +1 from me. > Pull YARN-8242 back to branch-2.10 > -- > > Key: YARN-10227 > URL: https://issues.apache.org/jira/browse/YARN-10227 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.10.0, 2.10.1 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Fix For: 2.10.1 > > Attachments: YARN-10227-branch-2.10.001.patch > > > We have recently seen the nodemanager OOM issue reported in YARN-8242 during > a rolling upgrade. Our code is currently based on branch-2.8, but we are in > the process of moving to 2.10. I checked and YARN-8242 pulls back to > branch-2.10 pretty cleanly. The only conflict was a minor one in > TestNMLeveldbStateStoreService.java. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10212) Create separate configuration for max global AM attempts
[ https://issues.apache.org/jira/browse/YARN-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078598#comment-17078598 ] Jonathan Hung edited comment on YARN-10212 at 4/8/20, 6:56 PM: --- Thanks [~BilwaST], in general looks good, some minor style issues: * In TestResourceManager.java, can we change {{fail("Exception is expected because the global max attempts" +}} to {{fail("Exception is expected because AM max attempts" +}} * In YarnConfiguration.java: {{* an application,if unset by user.}} -> can we add a space after the comma * In yarn-default.xml, for the comment for yarn.resourcemanager.am.max-attempts: * {noformat} The maximum number of application attempts. Each application master can specify its individual maximum number of application attempts via the API, but the individual number cannot be more than the global upper bound.This value is being set only if global max attempts is unset. The default number is set to 2, to {noformat} can we change this to * {noformat} The default maximum number of application attempts, if unset by the user. Each application master can specify its individual maximum number of application attempts via the API, but the individual number cannot be more than the global upper bound in yarn.resourcemanager.am.global.max-attempts. The default number is set to 2, to{noformat} was (Author: jhung): Thanks [~BilwaST], in general looks good, some minor style issues: * In TestResourceManager.java, can we change {{fail("Exception is expected because the global max attempts" +}} to {{fail("Exception is expected because AM max attempts" +}} * In YarnConfiguration.java: {{* an application,if unset by user.}} -> can we add a space after the comma * In yarn-default.xml, for the comment for yarn.resourcemanager.am.max-attempts: * {noformat} The maximum number of application attempts. 
Each application master can specify its individual maximum number of application attempts via the API, but the individual number cannot be more than the global upper bound.This value is being set only if global max attempts is unset. The default number is set to 2, to {noformat} can we change this to * {noformat} The default maximum number of application attempts, if unset by the user. Each application master can specify its individual maximum number of application attempts via the API, but the individual number cannot be more than the global upper bound in yarn.resourcemanager.am.global.max-attempts. The default number is set to 2, to{noformat} > Create separate configuration for max global AM attempts > > > Key: YARN-10212 > URL: https://issues.apache.org/jira/browse/YARN-10212 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jonathan Hung >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-10212.001.patch, YARN-10212.002.patch, > YARN-10212.003.patch > > > Right now user's default max AM attempts is set to the same as global max AM > attempts: > {noformat} > int globalMaxAppAttempts = conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, > YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS); {noformat} > If we want to increase global max AM attempts, it will also increase the > default. So we should create a separate global AM max attempts config to > separate the two. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10212) Create separate configuration for max global AM attempts
[ https://issues.apache.org/jira/browse/YARN-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078598#comment-17078598 ] Jonathan Hung commented on YARN-10212: -- Thanks [~BilwaST], in general looks good, some minor style issues: * In TestResourceManager.java, can we change {{fail("Exception is expected because the global max attempts" +}} to {{fail("Exception is expected because AM max attempts" +}} * In YarnConfiguration.java: {{* an application,if unset by user.}} -> can we add a space after the comma * In yarn-default.xml, for the comment for yarn.resourcemanager.am.max-attempts: * {noformat} The maximum number of application attempts. Each application master can specify its individual maximum number of application attempts via the API, but the individual number cannot be more than the global upper bound.This value is being set only if global max attempts is unset. The default number is set to 2, to {noformat} can we change this to * {noformat} The default maximum number of application attempts, if unset by the user. Each application master can specify its individual maximum number of application attempts via the API, but the individual number cannot be more than the global upper bound in yarn.resourcemanager.am.global.max-attempts. 
The default number is set to 2, to{noformat}
[jira] [Commented] (YARN-10212) Create separate configuration for max global AM attempts
[ https://issues.apache.org/jira/browse/YARN-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077515#comment-17077515 ] Jonathan Hung commented on YARN-10212: -- Thanks [~BilwaST]. A few comments: * The javadoc for RM_AM_MAX_ATTEMPTS, can we change it to something like "The maximum number of application attempts for an application, if unset by the user" * Can we change the new config to at least have "am"? e.g. yarn.resourcemanager.global.max-attempts to yarn.resourcemanager.am.global-max-attempts? * In ResourceManager.java I think we should validate both RM_AM_MAX_ATTEMPTS and GLOBAL_RM_AM_MAX_ATTEMPTS (and change the message in the RuntimeExceptions accordingly) * In RMAppImpl, we need to split this case into two: {noformat} if (individualMaxAppAttempts <= 0 || individualMaxAppAttempts > globalMaxAppAttempts) { this.maxAppAttempts = globalMaxAppAttempts; {noformat} If individualMaxAppAttempts <= 0, set this.maxAppAttempts to RM_AM_MAX_ATTEMPTS. If individualMaxAppAttempts > globalMaxAppAttempts, set this.maxAppAttempts to globalMaxAppAttempts * In the test case: {noformat} int[] rmAmMaxAttempts = new int[] { 8, 0 };{noformat} I don't think 0 is a valid config for RM_AM_MAX_ATTEMPTS, can we set this to \{ 8, 1 }? 
* Based on the above changes we will need to change the expected values in the test case from {noformat} int[][] expectedNums = new int[][]{ new int[]{ 9, 10, 10, 10 }, {noformat} to * {noformat} int[][] expectedNums = new int[][]{ new int[]{ 9, 10, 10, 8 }, {noformat}
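The RMAppImpl review comment above asks to split the single fallback branch into two cases: an unset (<= 0) individual value falls back to the per-app default (RM_AM_MAX_ATTEMPTS), while a value above the global cap is clamped to that cap. A standalone sketch of the proposed resolution logic (the method name `resolveMaxAttempts` is illustrative, not the actual RMAppImpl code):

```java
// Sketch of the two-case max-attempts resolution proposed in the review:
// unset -> per-app default; too large -> clamped to the global upper bound.
public class MaxAttemptsResolver {
    static int resolveMaxAttempts(int individual, int defaultMax, int globalMax) {
        if (individual <= 0) {
            return defaultMax;  // user didn't set it: use rm.am.max-attempts default
        }
        if (individual > globalMax) {
            return globalMax;   // out of range [1, globalMax]: clamp to global cap
        }
        return individual;      // valid user-supplied value wins
    }

    public static void main(String[] args) {
        // With defaultMax=8 and globalMax=10:
        System.out.println(resolveMaxAttempts(0, 8, 10));  // unset -> 8
        System.out.println(resolveMaxAttempts(9, 8, 10));  // in range -> 9
        System.out.println(resolveMaxAttempts(12, 8, 10)); // clamped -> 10
    }
}
```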
[jira] [Commented] (YARN-10212) Create separate configuration for max global AM attempts
[ https://issues.apache.org/jira/browse/YARN-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17075044#comment-17075044 ] Jonathan Hung commented on YARN-10212: -- [~BilwaST] it's used in RMAppImpl when validating user's desired max app attempts: {noformat} if (individualMaxAppAttempts <= 0 || individualMaxAppAttempts > globalMaxAppAttempts) { this.maxAppAttempts = globalMaxAppAttempts; LOG.warn("The specific max attempts: " + individualMaxAppAttempts + " for application: " + applicationId.getId() + " is invalid, because it is out of the range [1, " + globalMaxAppAttempts + "]. Use the global max attempts instead."); } else { this.maxAppAttempts = individualMaxAppAttempts; } {noformat}
[jira] [Comment Edited] (YARN-10212) Create separate configuration for max global AM attempts
[ https://issues.apache.org/jira/browse/YARN-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074847#comment-17074847 ] Jonathan Hung edited comment on YARN-10212 at 4/3/20, 7:35 PM: --- [~BilwaST] user's default max AM attempts is how many AM attempts they get if they don't set individualMaxAppAttempts on client side. I am proposing adding a new configuration like GLOBAL_RM_AM_MAX_ATTEMPTS, and changing the code snippet above to something like: {noformat} int globalMaxAppAttempts = conf.getInt(YarnConfiguration.GLOBAL_RM_AM_MAX_ATTEMPTS, conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS));{noformat} If GLOBAL_RM_AM_MAX_ATTEMPTS is unset, it will fall back to current behavior. But if GLOBAL_RM_AM_MAX_ATTEMPTS is set to something higher than RM_AM_MAX_ATTEMPTS, then if user does not set individualMaxAppAttempts on client side, their app's number of attempts will still use RM_AM_MAX_ATTEMPTS, but user can set individualMaxAppAttempts to RM_AM_MAX_ATTEMPTS < individualMaxAppAttempts <= GLOBAL_RM_AM_MAX_ATTEMPTS if they like.
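The nested-default fallback in the snippet above can be illustrated with a minimal stand-in for Hadoop's Configuration. The Conf class below is a hypothetical sketch, not Hadoop code; yarn.resourcemanager.am.max-attempts is the existing property behind RM_AM_MAX_ATTEMPTS, while the global property name is assumed here pending the naming discussion in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed fallback chain: prefer GLOBAL_RM_AM_MAX_ATTEMPTS,
// then RM_AM_MAX_ATTEMPTS, then the hard-coded default -- the same shape as
// the nested conf.getInt(...) calls in the proposal.
public class ConfigFallbackSketch {
    // Minimal stand-in for org.apache.hadoop.conf.Configuration.
    static class Conf {
        private final Map<String, Integer> props = new HashMap<>();
        void setInt(String key, int value) { props.put(key, value); }
        int getInt(String key, int defaultValue) {
            return props.getOrDefault(key, defaultValue);
        }
    }

    static final String RM_AM_MAX_ATTEMPTS =
        "yarn.resourcemanager.am.max-attempts";
    // Hypothetical name; the final property name was still under review.
    static final String GLOBAL_RM_AM_MAX_ATTEMPTS =
        "yarn.resourcemanager.am.global-max-attempts";
    static final int DEFAULT_RM_AM_MAX_ATTEMPTS = 2;

    static int globalMax(Conf conf) {
        // Nested defaults: if the global config is unset, this resolves to
        // exactly the pre-existing behavior.
        return conf.getInt(GLOBAL_RM_AM_MAX_ATTEMPTS,
            conf.getInt(RM_AM_MAX_ATTEMPTS, DEFAULT_RM_AM_MAX_ATTEMPTS));
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        System.out.println(globalMax(conf)); // neither set -> hard default 2
        conf.setInt(RM_AM_MAX_ATTEMPTS, 3);
        System.out.println(globalMax(conf)); // global unset -> falls back to 3
        conf.setInt(GLOBAL_RM_AM_MAX_ATTEMPTS, 5);
        System.out.println(globalMax(conf)); // global set -> 5
    }
}
```

The nested-default pattern is what makes the change backward compatible: clusters that never set the new property see no behavior change.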
[jira] [Comment Edited] (YARN-10212) Create separate configuration for max global AM attempts
[ https://issues.apache.org/jira/browse/YARN-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074847#comment-17074847 ] Jonathan Hung edited comment on YARN-10212 at 4/3/20, 7:34 PM: --- [~BilwaST] user's default max AM attempts is how many AM attempts they get if they don't set individualMaxAppAttempts on client side. I am proposing adding a new configuration like GLOBAL_RM_AM_MAX_ATTEMPTS, and changing the code snippet above to something like: {noformat} int globalMaxAppAttempts = conf.getInt(YarnConfiguration.GLOBAL_RM_AM_MAX_ATTEMPTS, conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS));{noformat} If GLOBAL_RM_AM_MAX_ATTEMPTS is unset, it will fall back to current behavior. But if GLOBAL_RM_AM_MAX_ATTEMPTS is set to something higher (e.g. 4), then if user does not set individualMaxAppAttempts on client side, their app's number of attempts will still use RM_AM_MAX_ATTEMPTS, but user can set individualMaxAppAttempts to RM_AM_MAX_ATTEMPTS < individualMaxAppAttempts <= GLOBAL_RM_AM_MAX_ATTEMPTS if they like.
[jira] [Commented] (YARN-10212) Create separate configuration for max global AM attempts
[ https://issues.apache.org/jira/browse/YARN-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074847#comment-17074847 ] Jonathan Hung commented on YARN-10212: -- [~BilwaST] user's default max AM attempts is how many AM attempts they get if they don't set individualMaxAppAttempts on client side. I am proposing changing the code snippet above to something like: {noformat} int globalMaxAppAttempts = conf.getInt(YarnConfiguration.GLOBAL_RM_AM_MAX_ATTEMPTS, conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS));{noformat} If GLOBAL_RM_AM_MAX_ATTEMPTS is unset, it will fall back to current behavior. But if GLOBAL_RM_AM_MAX_ATTEMPTS is set to something higher (e.g. 4), then if user does not set individualMaxAppAttempts on client side, their app's number of attempts will still use RM_AM_MAX_ATTEMPTS, but user can set individualMaxAppAttempts to RM_AM_MAX_ATTEMPTS < individualMaxAppAttempts <= GLOBAL_RM_AM_MAX_ATTEMPTS if they like.
[jira] [Commented] (YARN-10212) Create separate configuration for max global AM attempts
[ https://issues.apache.org/jira/browse/YARN-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071249#comment-17071249 ] Jonathan Hung commented on YARN-10212: -- Hey [~BilwaST], do you plan to take on this task?
[jira] [Commented] (YARN-8213) Add Capacity Scheduler performance metrics
[ https://issues.apache.org/jira/browse/YARN-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069084#comment-17069084 ] Jonathan Hung commented on YARN-8213: - I ran the failed TestAbstractYarnScheduler#testContainerRecoveredByNode test locally and it succeeded. > Add Capacity Scheduler performance metrics > -- > > Key: YARN-8213 > URL: https://issues.apache.org/jira/browse/YARN-8213 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, metrics >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Critical > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8213-branch-2.10.001.patch, YARN-8213.001.patch, > YARN-8213.002.patch, YARN-8213.003.patch, YARN-8213.004.patch, > YARN-8213.005.patch > > > Currently, when tuning CS performance, it is not straightforward because of the > lack of metrics. Right now we only have {{QueueMetrics}}, which mostly > tracks queue-level resource counters. Propose to add CS metrics to > collect and display more fine-grained perf metrics.
[jira] [Comment Edited] (YARN-8213) Add Capacity Scheduler performance metrics
[ https://issues.apache.org/jira/browse/YARN-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069030#comment-17069030 ] Jonathan Hung edited comment on YARN-8213 at 3/27/20, 8:33 PM: --- Attached [^YARN-8213-branch-2.10.001.patch]. Diffs from trunk patch: * Set some variables as final in TestCapacitySchedulerMetrics.java * Replace lambdas with anonymous inner classes in TestCapacitySchedulerMetrics.java
[jira] [Reopened] (YARN-8213) Add Capacity Scheduler performance metrics
[ https://issues.apache.org/jira/browse/YARN-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung reopened YARN-8213: -
[jira] [Updated] (YARN-8213) Add Capacity Scheduler performance metrics
[ https://issues.apache.org/jira/browse/YARN-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-8213: Attachment: YARN-8213-branch-2.10.001.patch
[jira] [Created] (YARN-10212) Create separate configuration for max global AM attempts
Jonathan Hung created YARN-10212: Summary: Create separate configuration for max global AM attempts Key: YARN-10212 URL: https://issues.apache.org/jira/browse/YARN-10212 Project: Hadoop YARN Issue Type: Improvement Reporter: Jonathan Hung Right now user's default max AM attempts is set to the same as global max AM attempts: {noformat} int globalMaxAppAttempts = conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS); {noformat} If we want to increase global max AM attempts, it will also increase the default. So we should create a separate global AM max attempts config to separate the two.
[jira] [Commented] (YARN-10200) Add number of containers to RMAppManager summary
[ https://issues.apache.org/jira/browse/YARN-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066178#comment-17066178 ] Jonathan Hung commented on YARN-10200: -- Jenkins looks good, [~tangzhankun] mind having another look? Thanks! > Add number of containers to RMAppManager summary > > > Key: YARN-10200 > URL: https://issues.apache.org/jira/browse/YARN-10200 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Major > Attachments: YARN-10200.001.patch, YARN-10200.002.patch, > YARN-10200.003.patch > > > It would be useful to persist this so we can track containers processed by RM.
[jira] [Commented] (YARN-10200) Add number of containers to RMAppManager summary
[ https://issues.apache.org/jira/browse/YARN-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066051#comment-17066051 ] Jonathan Hung commented on YARN-10200: -- Thanks [~tangzhankun] for looking. Seems reasonable. Attached [^YARN-10200.003.patch] to address this.
[jira] [Updated] (YARN-10200) Add number of containers to RMAppManager summary
[ https://issues.apache.org/jira/browse/YARN-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-10200: - Attachment: YARN-10200.003.patch
[jira] [Commented] (YARN-10200) Add number of containers to RMAppManager summary
[ https://issues.apache.org/jira/browse/YARN-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065126#comment-17065126 ] Jonathan Hung commented on YARN-10200: -- Thanks Haibo, attached [^YARN-10200.002.patch] to fix checkstyle
[jira] [Updated] (YARN-10200) Add number of containers to RMAppManager summary
[ https://issues.apache.org/jira/browse/YARN-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-10200: - Attachment: YARN-10200.002.patch
[jira] [Commented] (YARN-10192) CapacityScheduler stuck in loop rejecting allocation proposals
[ https://issues.apache.org/jira/browse/YARN-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061976#comment-17061976 ] Jonathan Hung commented on YARN-10192: -- Thanks. Yeah Tao, agreed, we plan on turning DEBUG on for this class when we encounter this again. [~epayne], we have patches on top of 2.10.0, but not YARN-10009. Looking at YARN-10009, seems it could be related. Thanks for the reference. > CapacityScheduler stuck in loop rejecting allocation proposals > -- > > Key: YARN-10192 > URL: https://issues.apache.org/jira/browse/YARN-10192 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.10.0 >Reporter: Jonathan Hung >Priority: Major > > On a 2.10.0 cluster, we observed containers were being scheduled very slowly. > Based on logs, it seems to reject a bunch of allocation proposals, then > accept a bunch of reserved containers, but very few containers are actually > getting allocated: > {noformat} > 2020-03-10 06:31:48,965 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=root usedCapacity=0.30113637 > absoluteUsedCapacity=0.30113637 used= yarn.io/gpu: 265> cluster= > 2020-03-10 06:31:48,965 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Failed to accept allocation proposal > 2020-03-10 06:31:48,965 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: > assignedContainer application attempt=appattempt_1582403122262_15460_01 > container=null queue=misc_default clusterResource= vCores:34413, yarn.io/gpu: 1241> type=OFF_SWITCH requestedPartition=cpu > 2020-03-10 06:31:48,965 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=misc usedCapacity=0.0031771248 > absoluteUsedCapacity=3.1771246E-4 used= > cluster= > 2020-03-10 06:31:48,965 INFO > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=root usedCapacity=0.30113637 > absoluteUsedCapacity=0.30113637 used= yarn.io/gpu: 265> cluster= > 2020-03-10 06:31:48,965 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Failed to accept allocation proposal > 2020-03-10 06:31:48,968 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: > assignedContainer application attempt=appattempt_1582403122262_15460_01 > container=null queue=misc_default clusterResource= vCores:34413, yarn.io/gpu: 1241> type=OFF_SWITCH requestedPartition=cpu > 2020-03-10 06:31:48,968 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=misc usedCapacity=0.0031771248 > absoluteUsedCapacity=3.1771246E-4 used= > cluster= > 2020-03-10 06:31:48,968 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=root usedCapacity=0.30113637 > absoluteUsedCapacity=0.30113637 used= yarn.io/gpu: 265> cluster= > 2020-03-10 06:31:48,968 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Failed to accept allocation proposal > 2020-03-10 06:31:48,977 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: > assignedContainer application attempt=appattempt_1582403122262_15460_01 > container=null queue=misc_default clusterResource= vCores:34413, yarn.io/gpu: 1241> type=OFF_SWITCH requestedPartition=cpu > 2020-03-10 06:31:48,977 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=misc usedCapacity=0.0031771248 > absoluteUsedCapacity=3.1771246E-4 used= > cluster= > 2020-03-10 06:31:48,977 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=root 
usedCapacity=0.30113637 > absoluteUsedCapacity=0.30113637 used= yarn.io/gpu: 265> cluster= > 2020-03-10 06:31:48,977 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Failed to accept allocation proposal > 2020-03-10 06:31:48,981 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: > assignedContainer application attempt=appattempt_1582403122262_15460_01 > container=null queue=misc_default clusterResource= vCores:34413, yarn.io/gpu: 1241> type=OFF_SWITCH requestedPartition=cpu > 2020-03-10 06:31:48,982 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=misc usedCapacity=0.0031771248 > absoluteUsedCapacity=3.1771246E-4 used=
[jira] [Updated] (YARN-10200) Add number of containers to RMAppManager summary
[ https://issues.apache.org/jira/browse/YARN-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung updated YARN-10200: - Attachment: YARN-10200.001.patch
[jira] [Assigned] (YARN-10200) Add number of containers to RMAppManager summary
[ https://issues.apache.org/jira/browse/YARN-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Hung reassigned YARN-10200: Assignee: Jonathan Hung
[jira] [Commented] (YARN-10200) Add number of containers to RMAppManager summary
[ https://issues.apache.org/jira/browse/YARN-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061059#comment-17061059 ] Jonathan Hung commented on YARN-10200: -- Yeah [~maniraj...@gmail.com] I think that makes sense.