[ 
https://issues.apache.org/jira/browse/YARN-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618776#comment-14618776
 ] 

Sunil G commented on YARN-3894:
-------------------------------

Thanks [~bibinchundatt] for reporting and providing analysis.

During the {{initScheduler}} call from *CapacityScheduler#serviceInit*, we 
initialize the queues. In the same call flow, {{ParentQueue#setChildQueues}} 
also validates each node label's capacity against the queue capacity.
{code}
    // check label capacities
    for (String nodeLabel : labelManager.getClusterNodeLabelNames()) {
      float capacityByLabel = queueCapacities.getCapacity(nodeLabel);
      // check children's labels
      float sum = 0;
      for (CSQueue queue : childQueues) {
        sum += queue.getQueueCapacities().getCapacity(nodeLabel);
      }
      if ((capacityByLabel > 0 && Math.abs(1.0f - sum) > PRECISION)
          || (capacityByLabel == 0) && (sum > 0)) {
        throw new IllegalArgumentException("Illegal" + " capacity of "
            + sum + " for children of queue " + queueName
            + " for label=" + nodeLabel);
      }
    }
{code}

As per this code, if the node label capacity does not match the queue 
capacity, an *IllegalArgumentException* should be thrown. But this does not 
happen when we configure a wrong capacity for a label in the CS xml and 
restart the RM.

*Issue:*
During {{CommonNodeLabelsManager#serviceStart}}, labels are re-populated from 
the old mirror file. But {{initScheduler}} and the call flow above run from 
*serviceInit*, not *serviceStart*. 
As a result, {{labelManager.getClusterNodeLabelNames()}} returns an empty set 
in the code above, the loop body never runs, and the desired exception is not 
thrown.
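
To illustrate the effect, here is a minimal standalone sketch (not the actual Hadoop code). The class name {{EmptyLabelCheckSketch}}, the helper {{rejects}}, the flattened {{float[]}} of child capacities, and the {{PRECISION}} value are all assumptions made for illustration; only the condition inside the loop mirrors the snippet above.

```java
import java.util.Collections;
import java.util.Set;

public class EmptyLabelCheckSketch {
  // Assumed tolerance for this sketch; the real constant lives in the scheduler code.
  static final float PRECISION = 0.0005f;

  // Simplified stand-in for the label loop quoted above.
  // childCapacities holds one capacity per child queue for the label being checked.
  static void validate(Set<String> clusterLabels, float capacityByLabel,
      float[] childCapacities) {
    for (String nodeLabel : clusterLabels) {
      float sum = 0;
      for (float c : childCapacities) {
        sum += c;
      }
      if ((capacityByLabel > 0 && Math.abs(1.0f - sum) > PRECISION)
          || (capacityByLabel == 0 && sum > 0)) {
        throw new IllegalArgumentException("Illegal capacity of " + sum
            + " for label=" + nodeLabel);
      }
    }
  }

  // Returns true if the validation throws for the given inputs.
  static boolean rejects(Set<String> clusterLabels, float capacityByLabel,
      float[] childCapacities) {
    try {
      validate(clusterLabels, capacityByLabel, childCapacities);
      return false;
    } catch (IllegalArgumentException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    float[] badChildren = {0.5f, 0.3f}; // sums to 0.8 instead of 1.0

    // Labels not yet recovered: the loop has nothing to iterate over.
    System.out.println(rejects(Collections.<String>emptySet(), 1.0f, badChildren));

    // Labels already recovered: the same bad configuration is caught.
    System.out.println(rejects(Collections.singleton("x"), 1.0f, badChildren));
  }
}
```

With an empty label set the misconfigured children pass silently; once a label is present, the same input is rejected.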

IMO we can move the node label init and recovery to serviceInit rather than 
serviceStart. [~leftnoteasy], could you please share your thoughts?
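
A rough sketch of why the phase matters, assuming all services are initialized in order before any of them is started, and that the label manager is initialized before the scheduler. Everything here ({{LifecycleOrderSketch}}, {{LabelStore}}, {{Scheduler}}, the {{recoverInInit}} flag, the label "x") is hypothetical and only models the ordering, not the real RM classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LifecycleOrderSketch {
  // Two-phase lifecycle: every service is init()-ed before any is start()-ed.
  interface Service {
    void init();
    void start();
  }

  // Hypothetical label manager; the flag selects the phase in which recovery runs.
  static class LabelStore implements Service {
    private final boolean recoverInInit;
    private final Set<String> labels = new LinkedHashSet<>();

    LabelStore(boolean recoverInInit) {
      this.recoverInInit = recoverInInit;
    }

    private void recover() {
      labels.add("x"); // pretend the mirror file contains one label
    }

    public void init() {
      if (recoverInInit) {
        recover();
      }
    }

    public void start() {
      if (!recoverInInit) {
        recover();
      }
    }

    Set<String> getClusterNodeLabelNames() {
      return labels;
    }
  }

  // Hypothetical scheduler that reads the label names while initializing.
  static class Scheduler implements Service {
    private final LabelStore store;
    int labelsSeenDuringInit = -1;

    Scheduler(LabelStore store) {
      this.store = store;
    }

    public void init() {
      labelsSeenDuringInit = store.getClusterNodeLabelNames().size();
    }

    public void start() {
    }
  }

  // Runs both phases and reports how many labels the scheduler saw during init.
  static int labelsVisibleAtSchedulerInit(boolean recoverInInit) {
    LabelStore store = new LabelStore(recoverInInit);
    Scheduler scheduler = new Scheduler(store);
    List<Service> services = new ArrayList<>();
    services.add(store);
    services.add(scheduler);
    for (Service s : services) {
      s.init();
    }
    for (Service s : services) {
      s.start();
    }
    return scheduler.labelsSeenDuringInit;
  }

  public static void main(String[] args) {
    System.out.println(labelsVisibleAtSchedulerInit(false)); // recovery in start
    System.out.println(labelsVisibleAtSchedulerInit(true));  // recovery in init
  }
}
```

In this model, recovery during start leaves the scheduler with no labels to validate at init time, while recovery during init makes them visible.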

> RM startup should fail for wrong CS xml NodeLabel capacity configuration 
> -------------------------------------------------------------------------
>
>                 Key: YARN-3894
>                 URL: https://issues.apache.org/jira/browse/YARN-3894
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>         Attachments: capacity-scheduler.xml
>
>
> Currently in the Capacity Scheduler, the RM shuts down when the capacity 
> configuration is wrong, but not in case of a NodeLabels capacity mismatch.
> In {{CapacityScheduler#initializeQueues}}:
> {code}
>   private void initializeQueues(CapacitySchedulerConfiguration conf)
>     throws IOException {   
>     root = 
>         parseQueue(this, conf, null, CapacitySchedulerConfiguration.ROOT, 
>             queues, queues, noop);
>     labelManager.reinitializeQueueLabels(getQueueToLabels());
>     root = 
>         parseQueue(this, conf, null, CapacitySchedulerConfiguration.ROOT, 
>             queues, queues, noop);
>     LOG.info("Initialized root queue " + root);
>     initializeQueueMappings();
>     setQueueAcls(authorizer, queues);
>   }
> {code}
> {{labelManager}} is initialized from the queues, and the calculation for 
> label-level capacity mismatch happens in {{parseQueue}}. So during 
> initialization, when {{parseQueue}} runs, the labels will be empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
