[
https://issues.apache.org/jira/browse/YARN-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bibin A Chundatt updated YARN-3894:
-----------------------------------
Attachment: 0001-YARN-3894.patch
Attached a patch as per the discussion.
Please review.
> RM startup should fail for wrong CS xml NodeLabel capacity configuration
> -------------------------------------------------------------------------
>
> Key: YARN-3894
> URL: https://issues.apache.org/jira/browse/YARN-3894
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Reporter: Bibin A Chundatt
> Assignee: Bibin A Chundatt
> Priority: Critical
> Attachments: 0001-YARN-3894.patch, capacity-scheduler.xml
>
>
> Currently in the Capacity Scheduler, when the capacity configuration is wrong
> the RM shuts down, but not in the case of a NodeLabels capacity mismatch.
> In {{CapacityScheduler#initializeQueues}}:
> {code}
> private void initializeQueues(CapacitySchedulerConfiguration conf)
> throws IOException {
> root =
> parseQueue(this, conf, null, CapacitySchedulerConfiguration.ROOT,
> queues, queues, noop);
> labelManager.reinitializeQueueLabels(getQueueToLabels());
> root =
> parseQueue(this, conf, null, CapacitySchedulerConfiguration.ROOT,
> queues, queues, noop);
> LOG.info("Initialized root queue " + root);
> initializeQueueMappings();
> setQueueAcls(authorizer, queues);
> }
> {code}
> {{labelManager}} is initialized from the queues, and the check for a label-level
> capacity mismatch happens in {{parseQueue}}. So during the {{parseQueue}} call
> at initialization the labels are empty, and the mismatch is not detected.
> *Steps to reproduce*
> # Configure the RM with the capacity scheduler
> # Add one or two node labels using rmadmin
> # Configure the capacity scheduler xml with node labels, but with a wrong
> capacity configuration for an already added label
> # Restart both RMs
> # Check that the node label list is populated on service init of the capacity scheduler
> *Expected*
> RM should not start
> *Current exception on reinitialize check*
> {code}
> 2015-07-07 19:18:25,655 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Initialized queue: default: capacity=0.5, absoluteCapacity=0.5,
> usedResources=<memory:0, vCores:0>, usedCapacity=0.0,
> absoluteUsedCapacity=0.0, numApps=0, numContainers=0
> 2015-07-07 19:18:25,656 WARN
> org.apache.hadoop.yarn.server.resourcemanager.AdminService: Exception refresh
> queues.
> java.io.IOException: Failed to re-init queues
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:383)
> at
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:376)
> at
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:605)
> at
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:314)
> at
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
> at
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
> at
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: java.lang.IllegalArgumentException: Illegal capacity of 0.5 for
> children of queue root for label=node2
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.setChildQueues(ParentQueue.java:159)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:639)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:503)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:379)
> ... 8 more
> 2015-07-07 19:18:25,656 WARN
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf
> OPERATION=refreshQueues TARGET=AdminService RESULT=FAILURE
> DESCRIPTION=Exception refresh queues. PERMISSIONS=
> 2015-07-07 19:18:25,656 WARN
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf
> OPERATION=transitionToActive TARGET=RMHAProtocolService
> RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=
> 2015-07-07 19:18:25,656 WARN org.apache.hadoop.ha.ActiveStandbyElector:
> Exception handling the winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
> at
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
> at
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when
> transitioning to Active mode
> at
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:321)
> at
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
> ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.IOException:
> Failed to re-init queues
> at
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:617)
> at
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:314)
> ... 5 more
> {code}
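
The ordering problem described in the issue can be illustrated with a small standalone sketch. This is not the YARN code: the class `LabelCapacityCheckSketch`, its methods, and the rule "child capacities must sum to 1.0 per known label" are simplified assumptions made for illustration. It only shows why a check that iterates over the known labels passes silently while the label set is still empty, and fails once the labels are populated.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of the ordering problem; not the YARN implementation. */
class LabelCapacityCheckSketch {

    /**
     * For every label in knownLabels, checks that the child capacities sum to 1.0.
     * childCapacities maps a child queue name to its (label -> capacity fraction) map.
     * Assumption: this stands in for the per-label check done while parsing queues.
     */
    public static void validateChildren(Map<String, Map<String, Double>> childCapacities,
                                        Set<String> knownLabels) {
        for (String label : knownLabels) {
            double sum = 0.0;
            for (Map<String, Double> perLabel : childCapacities.values()) {
                sum += perLabel.getOrDefault(label, 0.0);
            }
            if (Math.abs(sum - 1.0) > 1e-6) {
                throw new IllegalArgumentException("Illegal capacity of " + sum
                    + " for children of queue root for label=" + label);
            }
        }
    }

    /** A single child queue that gets only 0.5 of label node2, as in the reported log. */
    public static Map<String, Map<String, Double>> misconfiguredChildren() {
        Map<String, Map<String, Double>> children = new LinkedHashMap<>();
        children.put("default", Collections.singletonMap("node2", 0.5));
        return children;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> children = misconfiguredChildren();

        // Pass with an empty label set: the loop body never runs, so nothing is rejected.
        validateChildren(children, Collections.<String>emptySet());
        System.out.println("empty label set: no error raised");

        // Pass with the label known: the same configuration is now rejected.
        try {
            validateChildren(children, new HashSet<>(Arrays.asList("node2")));
            System.out.println("populated label set: no error raised");
        } catch (IllegalArgumentException e) {
            System.out.println("populated label set: " + e.getMessage());
        }
    }
}
```

In this model the misconfiguration is only caught when the label set is filled in before the check runs, which is the situation the issue describes for startup versus reinitialize.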
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)