[
https://issues.apache.org/jira/browse/YARN-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387552#comment-15387552
]
Sunil G commented on YARN-5333:
-------------------------------
Thanks [~hex108]
Yes, we recover apps first (by calling {{startActiveServices}}) and only then
try to do {{refreshQueues}} from {{AdminService#transitionToActive}}. So
apps on a newly added queue will fail during recovery.
bq. when capacity-scheduler.xml is corrupted, running {{refreshQueues}} will
just fail
If {{refreshQueues}} is not called, we can see that the RMs will toggle. YARN-3893
fixed this, and there I made a suggestion similar to the one in this patch (I
suggested {{refreshAll}}). Please refer to my
[comment|https://issues.apache.org/jira/browse/YARN-3893?focusedCommentId=14703329&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14703329].
[~rohithsharma] helped to point out a possible
[problem|https://issues.apache.org/jira/browse/YARN-3893?focusedCommentId=14708470&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14708470]
with this approach.
I agree that it's a problem in CapacityScheduler, given that we are using a
normal conf file. So if we could handle the exception from {{refreshQueues}},
which can be called prior to {{rm.transitionToActive()}}, and *fail fast
directly*, then we can manage both issues. [~rohithsharma], [~jianhe] Thoughts?
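To make the proposed ordering concrete, here is a minimal sketch, not the actual RM code. {{FailoverOrderSketch}}, {{QueueRefresher}} and {{ActiveServices}} are hypothetical stand-ins for the admin-service logic, and the sketch returns a log of steps instead of exiting the process so that the order is easy to see.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the failover path; not a real YARN class. */
class FailoverOrderSketch {

    /** Hypothetical hook that reloads the scheduler queue configuration. */
    interface QueueRefresher {
        void refreshQueues() throws Exception;
    }

    /** Hypothetical hook that starts active services (and so recovers apps). */
    interface ActiveServices {
        void start();
    }

    /**
     * Refreshes queues first, and starts active services only if that worked.
     * Returns the steps that actually ran, in order.
     */
    static List<String> transitionToActive(QueueRefresher refresher,
                                           ActiveServices services) {
        List<String> steps = new ArrayList<>();
        try {
            refresher.refreshQueues();
            steps.add("refreshQueues");
        } catch (Exception e) {
            // Fail fast: do not recover apps against a stale or broken config.
            // A real RM would presumably exit here; this sketch just records it.
            steps.add("failFast: " + e.getMessage());
            return steps;
        }
        services.start();
        steps.add("startActiveServices");
        return steps;
    }
}
```

With a refresher that succeeds, both steps run in the new order. With one that throws (for example on a corrupted conf file), active services are never started, so no app is recovered into the wrong queue.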
> Some recovered apps are put into default queue when RM HA
> ---------------------------------------------------------
>
> Key: YARN-5333
> URL: https://issues.apache.org/jira/browse/YARN-5333
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Jun Gong
> Assignee: Jun Gong
> Attachments: YARN-5333.01.patch, YARN-5333.02.patch,
> YARN-5333.03.patch
>
>
> Enable RM HA and use FairScheduler, with
> {{yarn.scheduler.fair.allow-undeclared-pools}} set to false and
> {{yarn.scheduler.fair.user-as-default-queue}} set to false.
> Reproduce steps:
> 1. Start two RMs.
> 2. After the RMs are running, change both RMs' file
> {{etc/hadoop/fair-scheduler.xml}} to add some queues.
> 3. Submit some apps to the newly added queues.
> 4. Stop the active RM; the standby RM will then transition to active and
> recover apps.
> However, the new active RM will put the recovered apps into the default queue
> because it might not have loaded the new {{fair-scheduler.xml}}. We need to
> call {{initScheduler}} before starting active services, or move
> {{refreshAll()}} in front of {{rm.transitionToActive()}}. *It seems this is
> also important for other schedulers*.
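The behaviour in the description can be illustrated with a small sketch. This is an assumption-laden model, not FairScheduler code: {{StaleQueueSketch}} and {{placeRecoveredApp}} are hypothetical names, and the fallback rule is simplified to "unknown queue goes to default", which is what the report describes when undeclared pools are not allowed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of queue placement during recovery; not a YARN class. */
class StaleQueueSketch {

    /**
     * Returns the queue a recovered app lands in: the queue it was submitted
     * to if the scheduler has loaded it, otherwise the default queue.
     */
    static String placeRecoveredApp(Set<String> loadedQueues, String submittedQueue) {
        return loadedQueues.contains(submittedQueue) ? submittedQueue : "default";
    }

    /** Queues known to an RM that has not reloaded the edited allocation file. */
    static Set<String> staleQueues() {
        return new HashSet<>(Arrays.asList("default"));
    }

    /** Queues known after the edited allocation file has been reloaded. */
    static Set<String> refreshedQueues() {
        return new HashSet<>(Arrays.asList("default", "newQueue"));
    }
}
```

An app submitted to the hypothetical {{newQueue}} is placed in {{default}} when recovery runs against {{staleQueues()}}, and stays in {{newQueue}} when the queues are refreshed before recovery.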