[jira] [Commented] (YARN-6507) Add support in NodeManager to isolate FPGA devices with CGroups
[ https://issues.apache.org/jira/browse/YARN-6507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265938#comment-16265938 ] Zhankun Tang commented on YARN-6507: [~wangda], the end-to-end test report is attached to YARN-5983. Please check.
> Add support in NodeManager to isolate FPGA devices with CGroups
> Key: YARN-6507
> URL: https://issues.apache.org/jira/browse/YARN-6507
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: yarn
> Reporter: Zhankun Tang
> Assignee: Zhankun Tang
> Attachments: YARN-6507-branch-YARN-3926.001.patch, YARN-6507-branch-YARN-3926.002.patch, YARN-6507-trunk.001.patch, YARN-6507-trunk.002.patch, YARN-6507-trunk.003.patch, YARN-6507-trunk.004.patch, YARN-6507-trunk.005.patch, YARN-6507-trunk.006.patch, YARN-6507-trunk.007.patch, YARN-6507-trunk.008.patch, YARN-6507-trunk.009.patch
>
> Support a local FPGA resource scheduler to assign/isolate N FPGA slots to a container.
> Initially, support one vendor plugin with basic features to serve OpenCL applications.
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5983) [Umbrella] Support for FPGA as a Resource in YARN
[ https://issues.apache.org/jira/browse/YARN-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhankun Tang updated YARN-5983: --- Attachment: YARN-5983_end-to-end_test_report.pdf Add an end-to-end test report for your reference.
> [Umbrella] Support for FPGA as a Resource in YARN
> Key: YARN-5983
> URL: https://issues.apache.org/jira/browse/YARN-5983
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: yarn
> Reporter: Zhankun Tang
> Assignee: Zhankun Tang
> Attachments: YARN-5983-Support-FPGA-resource-on-NM-side_v1.pdf, YARN-5983-implementation-notes.pdf, YARN-5983_end-to-end_test_report.pdf
>
> As various big data workloads run on YARN, CPU alone will eventually stop scaling, and heterogeneous systems will become more important. ML/DL has been a rising star in recent years, and applications in these areas have to utilize GPUs or FPGAs to boost performance. Hardware vendors such as Intel are also investing in such hardware. It is likely that FPGAs will become as common in data centers as CPUs in the near future.
> So it would be great for YARN, as a resource management and scheduling system, to evolve to support this. This JIRA proposes making FPGA a first-class citizen. The changes roughly include:
> 1. FPGA resource detection and heartbeat
> 2. Scheduler changes
> 3. FPGA-related preparation and isolation before launching a container
> We know that YARN-3926 is trying to extend the current resource model, but we can still keep some FPGA-related discussion here.
[jira] [Assigned] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value
[ https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yufei Gu reassigned YARN-7560: -- Assignee: zhengchenyu
> Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value
> Key: YARN-7560
> URL: https://issues.apache.org/jira/browse/YARN-7560
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler, resourcemanager
> Affects Versions: 3.0.0
> Reporter: zhengchenyu
> Assignee: zhengchenyu
> Fix For: 3.0.0
> Attachments: YARN-7560.000.patch
>
> In our cluster, we changed the configuration and then ran refreshQueues, and the ResourceManager hung. The ResourceManager also can't restart successfully. The jstack output always shows this:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x7f98e8017000 nid=0x2f5 runnable [0x7f98eed9a000]
>    java.lang.Thread.State: RUNNABLE
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
>     - locked <0x7f8c4a8177a0> (a java.util.HashMap)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
>     - locked <0x7f8c4a7eb2e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     - locked <0x7f8c4a76ac48> (a java.lang.Object)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     - locked <0x7f8c49254268> (a java.lang.Object)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     - locked <0x7f8c467495e0> (a java.lang.Object)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
> {code}
> When we debugged the cluster, we found that resourceUsedWithWeightToResourceRatio returns a negative value, so the loop can't exit. In our cluster, the sum of all minRes exceeds Integer.MAX_VALUE, which makes resourceUsedWithWeightToResourceRatio return a negative value.
> Below is the loop. Because totalResource is a long, it is always positive, but resourceUsedWithWeightToResourceRatio returns an int. Our cluster is so big that resourceUsedWithWeightToResourceRatio overflows to a negative value, so the loop never breaks.
> {code}
> while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type) < totalResource) {
>   rMax *= 2.0;
> }
> {code}
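The failure mode described above can be reproduced in isolation: an int accumulator silently wraps once the summed demand exceeds Integer.MAX_VALUE, so the doubling loop's exit condition can never become true. The following is a minimal, self-contained Java sketch with illustrative names (it is not the actual ComputeFairShares code, and the fix shown, accumulating in a long, is one natural direction rather than necessarily the exact patch):

```java
// Minimal sketch of the YARN-7560 overflow; class and method names are
// illustrative, not the actual ComputeFairShares code.
public class OverflowDemo {

    // Mimics an int-returning resourceUsedWithWeightToResourceRatio: once the
    // summed per-queue usage exceeds Integer.MAX_VALUE, the int accumulator
    // silently wraps (here all the way back to 0).
    static int usedAsInt(long perQueue, int queues) {
        int sum = 0;
        for (int i = 0; i < queues; i++) {
            sum += perQueue; // compound assignment narrows long -> int silently
        }
        return sum;
    }

    // One natural fix: accumulate and return a long, which does not wrap here.
    static long usedAsLong(long perQueue, int queues) {
        long sum = 0;
        for (int i = 0; i < queues; i++) {
            sum += perQueue;
        }
        return sum;
    }

    public static void main(String[] args) {
        long perQueue = 1L << 30;         // hypothetical per-queue minRes
        int queues = 4;                   // 4 * 2^30 exceeds Integer.MAX_VALUE
        long totalResource = 1L << 31;    // hypothetical cluster total

        int wrapped = usedAsInt(perQueue, queues);   // wraps to 0
        long correct = usedAsLong(perQueue, queues); // 4294967296

        // With the int version, "wrapped < totalResource" stays permanently
        // true, which is exactly why the rMax-doubling loop never exits.
        System.out.println(wrapped < totalResource);  // loop condition stuck at true
        System.out.println(correct < totalResource);  // false: loop would terminate
    }
}
```

Note that Java's compound assignment (`sum += perQueue`) inserts an implicit narrowing cast, so the long-to-int truncation compiles without any warning, which is what makes this class of bug easy to miss.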
[jira] [Updated] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value
[ https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengchenyu updated YARN-7560: -- Attachment: YARN-7560.000.patch
[jira] [Updated] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value
[ https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengchenyu updated YARN-7560: -- Attachment: (was: YARN-7560.patch.00)
[jira] [Commented] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value
[ https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265922#comment-16265922 ] Yufei Gu commented on YARN-7560: [~zhengchenyu], I've added you as a contributor. You can assign this to yourself.
[jira] [Comment Edited] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value
[ https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265921#comment-16265921 ] Yufei Gu edited comment on YARN-7560 at 11/26/17 6:09 AM: -- Thanks for filing this issue and providing the patch, [~zhengchenyu]. Can you rename your patch to something like YARN-7560.xxx.patch, so that Hadoop QA can kick in? was (Author: yufeigu): Thanks for filing this issue and provide the patch, [~zhengchenyu]. Can you remove your patch to something like YARN-7560.xxx.patch, so that Hadoop QA can kick in.
[jira] [Commented] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value
[ https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265921#comment-16265921 ] Yufei Gu commented on YARN-7560: Thanks for filing this issue and provide the patch, [~zhengchenyu]. Can you remove your patch to something like YARN-7560.xxx.patch, so that Hadoop QA can kick in.
[jira] [Updated] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value
[ https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengchenyu updated YARN-7560: -- Attachment: YARN-7560.patch.00 > Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a > overflow value > -- > > Key: YARN-7560 > URL: https://issues.apache.org/jira/browse/YARN-7560 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, resourcemanager >Affects Versions: 3.0.0 >Reporter: zhengchenyu > Fix For: 3.0.0 > > Attachments: YARN-7560.patch.00 > > > In our cluster, we changed the configuration, then refreshQueues, we found > the resourcemanager hangs. And the Resourcemanager can't restart > successfully. We got jstack information, always show like this: > {code} > "main" #1 prio=5 os_prio=0 tid=0x7f98e8017000 nid=0x2f5 runnable > [0x7f98eed9a000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148) > - locked <0x7f8c4a8177a0> (a java.util.HashMap) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422) > - locked <0x7f8c4a7eb2e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > - locked <0x7f8c4a76ac48> (a java.lang.Object) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > - locked <0x7f8c49254268> (a java.lang.Object) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > - locked <0x7f8c467495e0> (a java.lang.Object) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220) > {code} > When we debugged the cluster, we found that resourceUsedWithWeightToResourceRatio returns a negative value, so the loop cannot return. In our cluster, the sum of all queues' minRes exceeds Integer.MAX_VALUE, which makes resourceUsedWithWeightToResourceRatio overflow and return a negative value. 
> Below is the loop. Because totalResource is a long, it is always positive, but resourceUsedWithWeightToResourceRatio returns an int. Our cluster is large enough that the int sum overflows to a negative value, so the loop never breaks. > {code} > while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type) > < totalResource) { > rMax *= 2.0; > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
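The overflow described in the report can be reproduced in isolation. The sketch below uses hypothetical names and made-up share values (`resourceUsed` only stands in for `ComputeFairShares.resourceUsedWithWeightToResourceRatio`); it illustrates the mechanism: an `int` accumulator wraps negative once the summed shares pass `Integer.MAX_VALUE`, so the doubling loop's guard stays true forever.

```java
public class OverflowDemo {
    // Stand-in for resourceUsedWithWeightToResourceRatio: like the real method,
    // it accumulates per-queue shares into an int, so a large enough cluster
    // overflows the accumulator. Names and values here are illustrative only.
    static int resourceUsed(double ratio, long[] minShares) {
        int total = 0;                       // int accumulator: the bug
        for (long share : minShares) {
            total += (int) (share * ratio);  // wraps past Integer.MAX_VALUE
        }
        return total;
    }

    public static void main(String[] args) {
        // Two queues whose summed minRes (3,000,000,000) exceeds Integer.MAX_VALUE.
        long[] minShares = {1_500_000_000L, 1_500_000_000L};
        long totalResource = 3_000_000_000L;

        int used = resourceUsed(1.0, minShares);
        System.out.println("used = " + used);  // prints used = -1294967296

        // The scheduler's loop guard: an int can never reach a long totalResource
        // above Integer.MAX_VALUE, and here it has even wrapped negative, so
        // `while (resourceUsed(...) < totalResource) rMax *= 2.0;` never exits.
        System.out.println("guard stays true: " + (used < totalResource));
    }
}
```

Note that the guard would stay true even without the wraparound: an `int` tops out at 2,147,483,647, which is below any `totalResource` larger than that, so the comparison against a `long` total is broken either way.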
[jira] [Commented] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value
[ https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265918#comment-16265918 ] zhengchenyu commented on YARN-7560: --- [~yufeigu] the sum of all queue's minRes is over int.max -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
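Since the reporter pins the cause on the summed minRes exceeding the int range, one sketch of a remedy (an illustration only, not the actual YARN patch; the method name and values are hypothetical) is to accumulate and return the usage as a `long`, so that the comparison against the `long` totalResource is meaningful and the doubling search can terminate:

```java
public class LongRatioSketch {
    // Hypothetical corrected variant of resourceUsedWithWeightToResourceRatio:
    // a long accumulator cannot wrap for realistic cluster sizes.
    static long resourceUsedWithRatio(double ratio, long[] minShares) {
        long total = 0L;
        for (long share : minShares) {
            total += (long) (share * ratio);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] minShares = {1_500_000_000L, 1_500_000_000L};
        long totalResource = 7_000_000_000L;

        // The doubling search from the description now terminates, because the
        // computed usage can actually grow past totalResource instead of
        // saturating or wrapping at int range.
        double rMax = 1.0;
        while (resourceUsedWithRatio(rMax, minShares) < totalResource) {
            rMax *= 2.0;
        }
        System.out.println("rMax = " + rMax);  // prints rMax = 4.0
    }
}
```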
[jira] [Updated] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value
[ https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengchenyu updated YARN-7560: -- Description: In our cluster, we changed the configuration and then ran refreshQueues; the ResourceManager hung, and it could not restart successfully. The jstack output always shows the following: {code} "main" #1 prio=5 os_prio=0 tid=0x7f98e8017000 nid=0x2f5 runnable [0x7f98eed9a000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148) - locked <0x7f8c4a8177a0> (a java.util.HashMap) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422) - locked <0x7f8c4a7eb2e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) - locked <0x7f8c4a76ac48> (a java.lang.Object) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) - locked <0x7f8c49254268> (a java.lang.Object) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) - locked <0x7f8c467495e0> (a java.lang.Object) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220) {code} When we debugged the cluster, we found that resourceUsedWithWeightToResourceRatio returns a negative value, so the loop cannot return. In our cluster, the sum of all queues' minRes exceeds Integer.MAX_VALUE, which makes resourceUsedWithWeightToResourceRatio overflow and return a negative value. Below is the loop. Because totalResource is a long, it is always positive, but resourceUsedWithWeightToResourceRatio returns an int, which in a cluster of our size overflows to a negative value, so the loop never breaks. {code} while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type) < totalResource) { rMax *= 2.0; } {code} was: In our cluster, we changed the configuration and then ran refreshQueues; the ResourceManager hung, and it could not restart successfully. 
We got jstack information, like this: {code} "main" #1 prio=5 os_prio=0 tid=0x7f98e8017000 nid=0x2f5 runnable [0x7f98eed9a000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.compute
[jira] [Updated] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value
[ https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengchenyu updated YARN-7560: -- Target Version/s: 3.0.0 (was: 2.7.5) > Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a > overflow value > -- > > Key: YARN-7560 > URL: https://issues.apache.org/jira/browse/YARN-7560 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, resourcemanager >Affects Versions: 3.0.0 >Reporter: zhengchenyu > Fix For: 3.0.0 > > > In our cluster, we changed the configuration and then ran refreshQueues; the ResourceManager hung, and it could not restart successfully. The jstack output looks like this: > {code} > "main" #1 prio=5 os_prio=0 tid=0x7f98e8017000 nid=0x2f5 runnable > [0x7f98eed9a000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148) > - locked <0x7f8c4a8177a0> (a java.util.HashMap) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422) > - locked <0x7f8c4a7eb2e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > - locked <0x7f8c4a76ac48> (a java.lang.Object) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > - locked <0x7f8c49254268> (a java.lang.Object) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > - locked <0x7f8c467495e0> (a java.lang.Object) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220) > {code} > When we debugged the cluster, we found that resourceUsedWithWeightToResourceRatio returns a negative value, so the loop cannot return. In our cluster, the sum of all minRes exceeds Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value
[ https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengchenyu updated YARN-7560: -- Affects Version/s: (was: 2.7.1) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value
[ https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengchenyu updated YARN-7560: -- Fix Version/s: (was: 2.7.5) 3.0.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value
[ https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265910#comment-16265910 ] Yufei Gu commented on YARN-7560: Which version is used? 2.7.1 or 3.0? What do you mean all minRes is over int.max? Do you intentionally make it so? 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7560) Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value
[ https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yufei Gu updated YARN-7560: --- Component/s: fairscheduler -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7218) ApiServer REST API naming convention /ws/v1 is already used in Hadoop v2
[ https://issues.apache.org/jira/browse/YARN-7218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Yang updated YARN-7218:
----------------------------
    Fix Version/s: 3.1.0

> ApiServer REST API naming convention /ws/v1 is already used in Hadoop v2
> ------------------------------------------------------------------------
>
>                 Key: YARN-7218
>                 URL: https://issues.apache.org/jira/browse/YARN-7218
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: api, applications
>            Reporter: Eric Yang
>            Assignee: Eric Yang
>             Fix For: 3.1.0
>
>         Attachments: YARN-7218.001.patch, YARN-7218.002.patch, YARN-7218.003.patch, YARN-7218.004.patch
>
> In YARN-6626, there is a desire to run the ApiServer REST API inside the Resource Manager; this would eliminate the need to deploy another daemon service for submitting Docker applications. In YARN-5698, a new UI was implemented as a separate web application. This arrangement has some problems that can cause conflicts in how Java sessions are managed.
> The root context of the Resource Manager web application is /ws. This is hard-coded in the startWebapp method in ResourceManager.java, which means all session management applies to web URLs under the /ws prefix. /ui2 is independent of the /ws context, so the session management code doesn't apply to /ui2. This could become a session management problem if servlet-based code is introduced into the /ui2 web application.
> The ApiServer code base is designed as a separate web application. There is no easy way to inject a separate web application into the same /ws context, because ResourceManager is already set up to bind to RMWebServices. Unless the ApiServer code is moved into RMWebServices, the two will not share the same session management.
> The alternative is to keep the ApiServer prefix URL independent of the /ws context. While this departs from the YARN web services naming convention, it can be loaded as a separate web application in the Resource Manager Jetty server. One possible proposal is /app/v1/services, which keeps the ApiServer code modular and independent of the Resource Manager.
[jira] [Created] (YARN-7560) ResourceManager hangs when resourceUsedWithWeightToResourceRatio returns an overflow value
zhengchenyu created YARN-7560:
---------------------------------
             Summary: ResourceManager hangs when resourceUsedWithWeightToResourceRatio returns an overflow value
                 Key: YARN-7560
                 URL: https://issues.apache.org/jira/browse/YARN-7560
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.7.1, 3.0.0
            Reporter: zhengchenyu
             Fix For: 2.7.5

In our cluster, we changed the configuration and then ran refreshQueues, after which the ResourceManager hung and could not be restarted successfully. The jstack output looks like this:
{code}
"main" #1 prio=5 os_prio=0 tid=0x7f98e8017000 nid=0x2f5 runnable [0x7f98eed9a000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
        - locked <0x7f8c4a8177a0> (a java.util.HashMap)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
        - locked <0x7f8c4a7eb2e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        - locked <0x7f8c4a76ac48> (a java.lang.Object)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        - locked <0x7f8c49254268> (a java.lang.Object)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        - locked <0x7f8c467495e0> (a java.lang.Object)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
{code}
When we debugged the cluster, we found that resourceUsedWithWeightToResourceRatio returns a negative value, so the loop never terminates. In our cluster, the minRes values together exceed Integer.MAX_VALUE.
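The failure mode above can be sketched in a few lines. This is illustrative only (the method and variable names are hypothetical, not the actual ComputeFairShares code): summing large minimum shares into an int wraps past Integer.MAX_VALUE, so the loop's comparison against the computed total sees a negative number and never exits; accumulating in a long avoids the wrap.

```java
public class FairShareOverflow {
    // Hypothetical sketch: summing per-queue minimum resources into an int
    // silently wraps once the total exceeds Integer.MAX_VALUE, producing a
    // negative "used" value -- the root cause of the non-terminating loop.
    static int usedWithInt(int[] minShares) {
        int total = 0;
        for (int s : minShares) {
            total += s; // wraps around to a negative value on overflow
        }
        return total;
    }

    // Accumulating in a long keeps the sum non-negative, so comparisons
    // against the cluster total behave as intended.
    static long usedWithLong(int[] minShares) {
        long total = 0;
        for (int s : minShares) {
            total += s;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] mins = {Integer.MAX_VALUE, Integer.MAX_VALUE};
        System.out.println(usedWithInt(mins));  // -2 (overflowed)
        System.out.println(usedWithLong(mins)); // 4294967294
    }
}
```

With the int accumulator the binary search in computeSharesInternal can never observe the true total, which matches the hang reported here.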
[jira] [Assigned] (YARN-7229) Add a metric for the size of event queue in AsyncDispatcher
[ https://issues.apache.org/jira/browse/YARN-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sandflee reassigned YARN-7229:
------------------------------
    Assignee: sandflee

> Add a metric for the size of event queue in AsyncDispatcher
> -----------------------------------------------------------
>
>                 Key: YARN-7229
>                 URL: https://issues.apache.org/jira/browse/YARN-7229
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: metrics, nodemanager, resourcemanager
>    Affects Versions: 3.1.0
>            Reporter: Yufei Gu
>            Assignee: sandflee
>
> The size of the event queue in AsyncDispatcher is a good signal for monitoring daemon performance. Let's make it an RM metric.
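The proposed metric amounts to exposing the dispatcher's queue depth as a sampled gauge. A minimal sketch, with hypothetical names (the real AsyncDispatcher uses its own event types and the Hadoop metrics framework rather than a plain getter):

```java
import java.util.concurrent.LinkedBlockingQueue;

public class DispatcherQueueGauge {
    // AsyncDispatcher drains events from a blocking queue on a single
    // thread; a steadily growing queue means the daemon is falling behind.
    private final LinkedBlockingQueue<Runnable> eventQueue =
            new LinkedBlockingQueue<>();

    // The value a metrics system would sample periodically as a gauge.
    int queueSize() {
        return eventQueue.size();
    }

    void dispatch(Runnable event) {
        eventQueue.offer(event);
    }

    public static void main(String[] args) {
        DispatcherQueueGauge d = new DispatcherQueueGauge();
        d.dispatch(() -> {});
        d.dispatch(() -> {});
        System.out.println(d.queueSize()); // 2
    }
}
```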
[jira] [Commented] (YARN-7496) CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations
[ https://issues.apache.org/jira/browse/YARN-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265837#comment-16265837 ]

Eric Payne commented on YARN-7496:
----------------------------------
Thank you very much, [~leftnoteasy]

> CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations
> --------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-7496
>                 URL: https://issues.apache.org/jira/browse/YARN-7496
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.8.2
>            Reporter: Eric Payne
>            Assignee: Eric Payne
>             Fix For: 2.8.3
>
>         Attachments: YARN-7496.001.branch-2.8.patch
>
> Only a problem in 2.8.
> Preemption can oscillate due to the difference in how user limit is calculated between 2.8 and later releases.
> Basically (ignoring ULF, MULP, and maybe others), the calculation for user limit on the Capacity Scheduler side in 2.8 is {{total used resources / number of active users}}, while the calculation in later releases is {{total active resources / number of active users}}. When intra-queue preemption was backported to 2.8, its user-limit calculations were aligned with the latter algorithm, which is in 2.9 and later releases.
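The two formulas quoted above diverge whenever used and active resources differ. A sketch of that divergence, with the simplifying assumption (mine, not from the issue) that "active resources" means used plus pending; the real LeafQueue math also involves ULF, MULP, and Resource arithmetic:

```java
public class UserLimitSketch {
    // 2.8 LeafQueue style: divide only what is currently used.
    static long limitByUsed(long usedMb, int activeUsers) {
        return usedMb / activeUsers;
    }

    // 2.9+ style (and the intra-queue preemption code backported to 2.8):
    // divide the "active" total, here illustrated as used + pending.
    static long limitByActive(long usedMb, long pendingMb, int activeUsers) {
        return (usedMb + pendingMb) / activeUsers;
    }

    public static void main(String[] args) {
        // 2 active users, 8 GB used, 4 GB still pending: the scheduler and
        // the preemption monitor compute different user limits, so one keeps
        // undoing the other's decisions -- the oscillation described above.
        System.out.println(limitByUsed(8192, 2));         // 4096
        System.out.println(limitByActive(8192, 4096, 2)); // 6144
    }
}
```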
[jira] [Commented] (YARN-7505) RM REST endpoints generate malformed JSON
[ https://issues.apache.org/jira/browse/YARN-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265756#comment-16265756 ]

Daniel Templeton commented on YARN-7505:
----------------------------------------
Now that I'm into the implementation, I can see that this change would be better targeted at 3.1. If we're going to introduce a v2, we should at least allow some time for any pent-up incompatible changes to surface and be handled. There are likely several places where resource types would fit more naturally into the API with an incompatible change. Let's hold off on this one until 3.1 and try to get the APIs cleaned up between now and then.

> RM REST endpoints generate malformed JSON
> -----------------------------------------
>
>                 Key: YARN-7505
>                 URL: https://issues.apache.org/jira/browse/YARN-7505
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: restapi
>    Affects Versions: 3.0.0
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
>         Attachments: YARN-7505.001.patch, YARN-7505.002.patch
>
> For all endpoints that return DAOs that contain maps, the generated JSON is malformed. For example:
> % curl 'http://localhost:8088/ws/v1/cluster/apps'
> {"apps":{"app":[{"id":"application_1510777276702_0001","user":"daniel","name":"QuasiMonteCarlo","queue":"root.daniel","state":"RUNNING","finalStatus":"UNDEFINED","progress":5.0,"trackingUI":"ApplicationMaster","trackingUrl":"http://dhcp-10-16-0-181.pa.cloudera.com:8088/proxy/application_1510777276702_0001/","diagnostics":"","clusterId":1510777276702,"applicationType":"MAPREDUCE","applicationTags":"","priority":0,"startedTime":1510777317853,"finishedTime":0,"elapsedTime":21623,"amContainerLogs":"http://dhcp-10-16-0-181.pa.cloudera.com:8042/node/containerlogs/container_1510777276702_0001_01_01/daniel","amHostHttpAddress":"dhcp-10-16-0-181.pa.cloudera.com:8042","amRPCAddress":"dhcp-10-16-0-181.pa.cloudera.com:63371","allocatedMB":5120,"allocatedVCores":4,"reservedMB":0,"reservedVCores":0,"runningContainers":4,"memorySeconds":49820,"vcoreSeconds":26,"queueUsagePercentage":62.5,"clusterUsagePercentage":62.5,"resourceSecondsMap":{"entry":{"key":"test2","value":"0"},"entry":{"key":"test","value":"0"},"entry":{"key":"memory-mb","value":"49820"},"entry":{"key":"vcores","value":"26"}},"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"preemptedMemorySeconds":0,"preemptedVcoreSeconds":0,"preemptedResourceSecondsMap":{},"resourceRequests":[{"priority":20,"resourceName":"dhcp-10-16-0-181.pa.cloudera.com","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false},{"priority":20,"resourceName":"/default-rack","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false},{"priority":20,"resourceName":"*","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false}],"logAggregationStatus":"DISABLED","unmanagedApplication":false,"amNodeLabelExpression":"","timeouts":{"timeout":[{"type":"LIFETIME","expiryTime":"UNLIMITED","remainingTimeInSeconds":-1}]}}]}}
[jira] [Updated] (YARN-7505) RM REST endpoints generate malformed JSON
[ https://issues.apache.org/jira/browse/YARN-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Templeton updated YARN-7505:
-----------------------------------
    Target Version/s: 3.1.0  (was: 3.0.0)

> RM REST endpoints generate malformed JSON
> -----------------------------------------
>
>                 Key: YARN-7505
>                 URL: https://issues.apache.org/jira/browse/YARN-7505
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: restapi
>    Affects Versions: 3.0.0
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
>         Attachments: YARN-7505.001.patch, YARN-7505.002.patch
>
> For all endpoints that return DAOs that contain maps, the generated JSON is malformed. For example:
> % curl 'http://localhost:8088/ws/v1/cluster/apps'
> {"apps":{"app":[{"id":"application_1510777276702_0001","user":"daniel","name":"QuasiMonteCarlo","queue":"root.daniel","state":"RUNNING","finalStatus":"UNDEFINED","progress":5.0,"trackingUI":"ApplicationMaster","trackingUrl":"http://dhcp-10-16-0-181.pa.cloudera.com:8088/proxy/application_1510777276702_0001/","diagnostics":"","clusterId":1510777276702,"applicationType":"MAPREDUCE","applicationTags":"","priority":0,"startedTime":1510777317853,"finishedTime":0,"elapsedTime":21623,"amContainerLogs":"http://dhcp-10-16-0-181.pa.cloudera.com:8042/node/containerlogs/container_1510777276702_0001_01_01/daniel","amHostHttpAddress":"dhcp-10-16-0-181.pa.cloudera.com:8042","amRPCAddress":"dhcp-10-16-0-181.pa.cloudera.com:63371","allocatedMB":5120,"allocatedVCores":4,"reservedMB":0,"reservedVCores":0,"runningContainers":4,"memorySeconds":49820,"vcoreSeconds":26,"queueUsagePercentage":62.5,"clusterUsagePercentage":62.5,"resourceSecondsMap":{"entry":{"key":"test2","value":"0"},"entry":{"key":"test","value":"0"},"entry":{"key":"memory-mb","value":"49820"},"entry":{"key":"vcores","value":"26"}},"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"preemptedMemorySeconds":0,"preemptedVcoreSeconds":0,"preemptedResourceSecondsMap":{},"resourceRequests":[{"priority":20,"resourceName":"dhcp-10-16-0-181.pa.cloudera.com","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false},{"priority":20,"resourceName":"/default-rack","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false},{"priority":20,"resourceName":"*","capability":{"memory":1024,"vCores":1},"numContainers":8,"relaxLocality":true,"nodeLabelExpression":"","executionTypeRequest":{"executionType":"GUARANTEED","enforceExecutionType":true},"enforceExecutionType":false}],"logAggregationStatus":"DISABLED","unmanagedApplication":false,"amNodeLabelExpression":"","timeouts":{"timeout":[{"type":"LIFETIME","expiryTime":"UNLIMITED","remainingTimeInSeconds":-1}]}}]}}
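The malformed part is the repeated "entry" key inside {{resourceSecondsMap}}: JAXB renders map fields as repeated entry elements, which in JSON becomes an object with duplicate member names. A sketch of how such a map could instead be rendered as a valid JSON object, keyed by the map keys themselves (the method name is hypothetical, not the patch's actual code):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class ResourceMapJson {
    // JAXB's default map serialization yields
    //   {"entry":{"key":"memory-mb","value":"49820"},"entry":{...},...}
    // where "entry" repeats -- most parsers keep only the last occurrence.
    // Using the map keys directly as member names avoids the duplication.
    static String toJson(Map<String, Long> m) {
        return m.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\":\"" + e.getValue() + "\"")
                .collect(Collectors.joining(",", "{", "}"));
    }

    public static void main(String[] args) {
        Map<String, Long> resourceSeconds = new LinkedHashMap<>();
        resourceSeconds.put("memory-mb", 49820L);
        resourceSeconds.put("vcores", 26L);
        System.out.println(toJson(resourceSeconds));
        // {"memory-mb":"49820","vcores":"26"}
    }
}
```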
[jira] [Updated] (YARN-6647) RM can crash during transitionToStandby due to InterruptedException
[ https://issues.apache.org/jira/browse/YARN-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bibin A Chundatt updated YARN-6647:
-----------------------------------
    Target Version/s: 3.0.0

Marking target as 3.0.0

> RM can crash during transitionToStandby due to InterruptedException
> -------------------------------------------------------------------
>
>                 Key: YARN-6647
>                 URL: https://issues.apache.org/jira/browse/YARN-6647
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.0.0-alpha4
>            Reporter: Jason Lowe
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>         Attachments: YARN-6647.001.patch, YARN-6647.002.patch, YARN-6647.003.patch, YARN-6647.004.patch, YARN-6647.005.patch
>
> Noticed some tests were failing because the JVM was shutting down early. I was able to reproduce this occasionally with TestKillApplicationWithRMHA. Stacktrace to follow.
[jira] [Commented] (YARN-7541) Node updates don't update the maximum cluster capability for resources other than CPU and memory
[ https://issues.apache.org/jira/browse/YARN-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265619#comment-16265619 ]

Yufei Gu commented on YARN-7541:
--------------------------------
Jenkins failed: https://builds.apache.org/job/PreCommit-YARN-Build/18660/console

> Node updates don't update the maximum cluster capability for resources other than CPU and memory
> -------------------------------------------------------------------------------------------------
>
>                 Key: YARN-7541
>                 URL: https://issues.apache.org/jira/browse/YARN-7541
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 3.0.0-beta1, 3.1.0
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
>         Attachments: YARN-7541.001.patch, YARN-7541.002.patch, YARN-7541.003.patch, YARN-7541.004.patch, YARN-7541.005.patch
>
> When I submit an MR job that asks for too much memory or CPU for the map or reduce, the AM fails because it recognizes that the request is too large. With any other resources, however, the resource requests are made and remain pending forever. It looks like we forgot to update the code that tracks the maximum container allocation in {{ClusterNodeTracker}}.
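The bookkeeping the issue describes can be sketched as a per-resource maximum that must be refreshed on every node update for all resource types, not just memory and vcores. Illustrative only; the class and method names are hypothetical simplifications of what ClusterNodeTracker does with Resource objects:

```java
import java.util.HashMap;
import java.util.Map;

public class MaxAllocationTracker {
    // Maximum allocatable amount seen so far, per resource name. If node
    // updates only refresh memory and vcores here, a request for a custom
    // resource (e.g. an FPGA count) is checked against a stale maximum and
    // pends forever -- the symptom reported above.
    private final Map<String, Long> maxAllocation = new HashMap<>();

    void onNodeUpdate(Map<String, Long> nodeCapability) {
        // Take the per-resource maximum across ALL resource types.
        nodeCapability.forEach((name, value) ->
                maxAllocation.merge(name, value, Math::max));
    }

    long maxFor(String resource) {
        return maxAllocation.getOrDefault(resource, 0L);
    }

    public static void main(String[] args) {
        MaxAllocationTracker t = new MaxAllocationTracker();
        t.onNodeUpdate(Map.of("memory-mb", 8192L, "vcores", 8L));
        t.onNodeUpdate(Map.of("memory-mb", 4096L, "fpga", 2L));
        System.out.println(t.maxFor("memory-mb")); // 8192
        System.out.println(t.maxFor("fpga"));      // 2
    }
}
```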