[ 
https://issues.apache.org/jira/browse/YARN-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387915#comment-15387915
 ] 

Sunil G commented on YARN-5333:
-------------------------------

[~hex108], thanks for the clarification. With YARN-3893, we were trying to 
make the RM fail fast if a wrong capacity-scheduler configuration is present. 
With the current patch, 
{code}
         try {
+          reinitializeActiveServices();
           startActiveServices();
           return null;
         } catch (Exception e) {
{code}
any exception during queue reinitialization will not make the RM fail fast. So 
I think you can put {{reinitializeActiveServices}} in a separate try block and 
invoke RM fail-fast in its exception handling block. 

However, one more thing worries me. With this patch, queue reinitialization is 
done before the active services are started, so many services, such as the 
node label manager, are not yet started (or their dispatcher threads are not 
started). If {{reinitialize}} has some event call flow, that could be a 
problem. As far as I checked, no such event handling is present in the 
{{reinitialize}} call flow. Still, I suggest confirming this once; I will also 
verify and will update if I find anything.
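
To make the separate-try-block suggestion concrete, here is a minimal, 
self-contained sketch. It is not the actual ResourceManager code: 
{{TransitionSketch}}, {{ActiveServices}} and {{FailFastHandler}} are 
hypothetical stand-ins, and what fail-fast really does in the RM (fatal event, 
exit, etc.) is left as an assumption behind the handler. The existing 
exception handling around {{startActiveServices}} is not reproduced here.

```java
// Hypothetical stand-ins throughout; this is not the real ResourceManager code.
class TransitionSketch {

    /** Stand-in for the two calls shown in the patch. */
    interface ActiveServices {
        void reinitializeActiveServices() throws Exception;
        void startActiveServices() throws Exception;
    }

    /** Hypothetical fail-fast hook (assumption: the RM would raise a fatal event or exit here). */
    interface FailFastHandler {
        void failFast(Exception cause);
    }

    /** Returns true only if the active services were started. */
    static boolean transitionToActive(ActiveServices services, FailFastHandler handler)
            throws Exception {
        // Separate try block: a failure while reinitializing queues goes
        // straight to fail-fast instead of the generic handling further down.
        try {
            services.reinitializeActiveServices();
        } catch (Exception e) {
            handler.failFast(e);
            return false;
        }
        // The existing try/catch around startActiveServices() would stay here;
        // in this sketch its exceptions simply propagate to the caller.
        services.startActiveServices();
        return true;
    }

    public static void main(String[] args) throws Exception {
        ActiveServices broken = new ActiveServices() {
            @Override public void reinitializeActiveServices() throws Exception {
                throw new Exception("invalid queue configuration");
            }
            @Override public void startActiveServices() {
                System.out.println("active services started");
            }
        };
        boolean activated = transitionToActive(broken,
                cause -> System.out.println("fail-fast: " + cause.getMessage()));
        System.out.println("activated=" + activated);
    }
}
```

With this shape, a bad queue configuration never reaches 
{{startActiveServices}}, and the two failure modes can be handled differently.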

> Some recovered apps are put into default queue when RM HA
> ---------------------------------------------------------
>
>                 Key: YARN-5333
>                 URL: https://issues.apache.org/jira/browse/YARN-5333
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-5333.01.patch, YARN-5333.02.patch, 
> YARN-5333.03.patch
>
>
> Enable RM HA and use FairScheduler, 
> {{yarn.scheduler.fair.allow-undeclared-pools}} is set to false, 
> {{yarn.scheduler.fair.user-as-default-queue}} is set to false.
> Reproduce steps:
> 1. Start two RMs.
> 2. After the RMs are running, change {{etc/hadoop/fair-scheduler.xml}} on 
> both RMs to add some queues.
> 3. Submit some apps to the newly added queues.
> 4. Stop the active RM; the standby RM will then transition to active and 
> recover apps.
> However, the new active RM will put recovered apps into the default queue 
> because it might not have loaded the new {{fair-scheduler.xml}}. We need to 
> call {{initScheduler}} before starting active services, or move 
> {{refreshAll()}} in front of {{rm.transitionToActive()}}. *It seems this is 
> also important for other schedulers*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
