[ 
https://issues.apache.org/jira/browse/YARN-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387533#comment-15387533
 ] 

Jun Gong commented on YARN-5333:
--------------------------------

{quote}Could you also please confirm whether you added the new queue 
manually in capacity-scheduler.xml on the Standby node, and test the same scenario?
{quote}
I copied capacity-scheduler.xml from the active RM to the standby RM, so the 
file is the same on both RMs. Yes, I tested the same scenario.

{quote}
The current approach in your patch will introduce a new problem. If 
capacity-scheduler.xml is corrupted, we will see a case where both RMs 
keep toggling to become active. We discussed this solution in another HA 
ticket and thought about not trying to do any refresh until active services 
are started.
{quote}
When capacity-scheduler.xml was corrupted, I saw the RM crash during the RM HA 
transition because validation failed ({{CapacityScheduler.validateConf}}). 
(Note: when capacity-scheduler.xml is corrupted, running {{refreshQueues}} will 
just fail and will not cause the RM to crash.) Is there something I missed?
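
To make the distinction above concrete, here is a minimal toy model of the two paths as described in this comment. It is not Hadoop code: {{ToyRM}}, {{validate_conf}} and {{ConfigError}} are hypothetical names, and the "crash" is only modelled as a flag.

```python
class ConfigError(Exception):
    """Raised when a scheduler configuration fails validation."""


def validate_conf(conf):
    # Hypothetical stand-in for a check like CapacityScheduler.validateConf.
    if not conf.get("queues"):
        raise ConfigError("no queues defined")


class ToyRM:
    """Toy ResourceManager; only models the behaviour reported in the comment."""

    def __init__(self, conf):
        validate_conf(conf)
        self.conf = conf
        self.alive = True

    def refresh_queues(self, new_conf):
        # Admin-triggered refresh: a bad file is rejected and the old conf stays.
        try:
            validate_conf(new_conf)
        except ConfigError:
            return False
        self.conf = new_conf
        return True

    def transition_to_active(self, conf_on_disk):
        # Failover path: in this toy model a bad file is fatal for the RM.
        try:
            validate_conf(conf_on_disk)
        except ConfigError:
            self.alive = False
            raise
        self.conf = conf_on_disk
```

In this sketch a corrupted file leaves the RM running with its old configuration when it arrives through {{refresh_queues}}, but takes the RM down when it is first validated during the transition to active.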

> Some recovered apps are put into default queue when RM HA
> ---------------------------------------------------------
>
>                 Key: YARN-5333
>                 URL: https://issues.apache.org/jira/browse/YARN-5333
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-5333.01.patch, YARN-5333.02.patch, 
> YARN-5333.03.patch
>
>
> Enable RM HA and use FairScheduler, with 
> {{yarn.scheduler.fair.allow-undeclared-pools}} set to false and 
> {{yarn.scheduler.fair.user-as-default-queue}} set to false.
> Steps to reproduce:
> 1. Start two RMs.
> 2. After the RMs are running, edit {{etc/hadoop/fair-scheduler.xml}} on both 
> RMs to add some queues.
> 3. Submit some apps to the newly added queues.
> 4. Stop the active RM; the standby RM will then transition to active and 
> recover apps.
> However, the new active RM will put recovered apps into the default queue 
> because it might not have loaded the new {{fair-scheduler.xml}}. We need to 
> call {{initScheduler}} before starting active services, or move 
> {{refreshAll()}} ahead of {{rm.transitionToActive()}}. *It seems this is also 
> important for other schedulers*.
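
The ordering problem in the description can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the FairScheduler implementation: {{ToyStandbyRM}}, {{place_app}} and the queue names are hypothetical, and the fallback to "default" simply mirrors what the issue reports.

```python
def place_app(requested_queue, loaded_queues):
    # With undeclared pools disallowed, an unknown queue falls back to "default"
    # (assumption taken from the behaviour described in the issue).
    return requested_queue if requested_queue in loaded_queues else "default"


class ToyStandbyRM:
    """Toy standby RM that remembers the queues it last loaded."""

    def __init__(self, loaded_queues):
        self.loaded_queues = set(loaded_queues)  # in-memory view, may be stale
        self.placements = {}

    def refresh(self, queues_on_disk):
        # Reload the queue definitions from the (hypothetical) file on disk.
        self.loaded_queues = set(queues_on_disk)

    def transition_to_active(self, saved_apps):
        # Recover every saved app into the queue it was submitted to, if known.
        for app_id, queue in saved_apps.items():
            self.placements[app_id] = place_app(queue, self.loaded_queues)


on_disk = {"default", "new_q"}   # file edited while the RM was standby
apps = {"app_1": "new_q"}        # app submitted to the newly added queue

stale = ToyStandbyRM({"default"})
stale.transition_to_active(apps)     # recovery runs before any refresh

fresh = ToyStandbyRM({"default"})
fresh.refresh(on_disk)               # refresh first, then transition
fresh.transition_to_active(apps)
```

Here {{stale.placements}} maps {{app_1}} to {{default}}, while {{fresh.placements}} keeps it in {{new_q}}, which is the effect the proposed reordering aims for.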



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

