[
https://issues.apache.org/jira/browse/YARN-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373126#comment-15373126
]
Jun Gong commented on YARN-5333:
--------------------------------
Sorry for my mistakes. Two corrections:
1. We changed some code in our own code base so that apps are rejected if their
queue does not exist. In trunk, recovered apps are put into the 'default'
queue if their queue does not exist. I think we still need to fix this.
2. Running {{rmadmin -refreshQueues}} against both the active and standby RMs
refreshes only the active RM's queues.
I have attached a patch that fixes the problem mentioned above. I verified it on
my local cluster and it works.
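
To make the ordering issue concrete, here is a minimal toy sketch. It is not
the attached patch and does not use real YARN classes: ToyScheduler, recover,
failover and the queue name "newQueue" are all hypothetical names made up for
illustration. It only models a standby whose in-memory queue list is stale,
and compares refreshing the queues after recovery with refreshing them before
recovery.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of RM failover ordering; none of these names are real YARN APIs. */
class FailoverOrderSketch {

    /** Hypothetical stand-in for a scheduler that only knows the queues it has loaded. */
    static final class ToyScheduler {
        private final Set<String> loadedQueues = new HashSet<>();

        // Models re-reading the allocation file: replace the in-memory queue set.
        void refreshQueues(Set<String> queuesOnDisk) {
            loadedQueues.clear();
            loadedQueues.addAll(queuesOnDisk);
        }

        // Models the strict setup from the issue: unknown queues are rejected.
        boolean accepts(String queue) {
            return loadedQueues.contains(queue);
        }
    }

    /** Returns the queues of recovered apps that the scheduler rejects. */
    static List<String> recover(ToyScheduler scheduler, List<String> appQueues) {
        List<String> rejected = new ArrayList<>();
        for (String queue : appQueues) {
            if (!scheduler.accepts(queue)) {
                rejected.add(queue);
            }
        }
        return rejected;
    }

    /** Simulates a failover; refreshFirst selects the ordering proposed in the issue. */
    static List<String> failover(boolean refreshFirst) {
        Set<String> onDisk = Set.of("default", "newQueue"); // file edited after startup
        List<String> apps = List.of("newQueue", "default"); // queues of the stored apps

        ToyScheduler standby = new ToyScheduler();
        standby.refreshQueues(Set.of("default")); // queue list as loaded at startup

        if (refreshFirst) {
            standby.refreshQueues(onDisk); // reload before recovering apps
        }
        List<String> rejected = recover(standby, apps);
        if (!refreshFirst) {
            standby.refreshQueues(onDisk); // reload happens too late to help
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("refresh after recovery, rejected: " + failover(false));
        System.out.println("refresh before recovery, rejected: " + failover(true));
    }
}
```

In this toy model, the app in "newQueue" is rejected only when the refresh runs
after recovery. How the real ResourceManager orders these steps is described in
the issue below; the sketch makes no claim about the actual code paths.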
> Recovered apps are rejected when RM HA
> --------------------------------------
>
> Key: YARN-5333
> URL: https://issues.apache.org/jira/browse/YARN-5333
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Jun Gong
> Assignee: Jun Gong
> Attachments: YARN-5333.01.patch
>
>
> Enable RM HA and use FairScheduler, with
> {{yarn.scheduler.fair.allow-undeclared-pools}} set to false and
> {{yarn.scheduler.fair.user-as-default-queue}} set to false.
> Steps to reproduce:
> 1. Start two RMs.
> 2. After the RMs are running, edit {{etc/hadoop/fair-scheduler.xml}} on both
> RMs and add some queues.
> 3. Submit some apps to the newly added queues.
> 4. Stop the active RM; the standby RM will then transition to active and
> recover the apps.
> However, the new active RM will reject the recovered apps because it might
> not have loaded the new {{fair-scheduler.xml}}. We need to call
> {{initScheduler}} before starting active services, or move {{refreshAll()}}
> ahead of {{rm.transitionToActive()}}. *This also seems important for other
> schedulers*.
> Related logs are as follows:
> {quote}
> 2016-07-07 16:55:34,756 INFO
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Recover ended
> ...
> 2016-07-07 16:55:34,824 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService:
> Loading allocation file /gaia/hadoop/etc/hadoop/fair-scheduler.xml
> 2016-07-07 16:55:34,826 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
> Application rejected by queue placement policy
> 2016-07-07 16:55:34,828 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
> Application appattempt_1467803586002_0006_000001 is done. finalState=FAILED
> 2016-07-07 16:55:34,828 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
> Unknown application appattempt_1467803586002_0006_000001 has completed!
> 2016-07-07 16:55:34,828 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
> Application rejected by queue placement policy
> 2016-07-07 16:55:34,828 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
> Application appattempt_1467803586002_0004_000001 is done. finalState=FAILED
> 2016-07-07 16:55:34,828 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
> Unknown application appattempt_1467803586002_0004_000001 has completed!
> 2016-07-07 16:55:34,828 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Can't handle
> this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event:
> APP_REJECTED at ACCEPTED
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:697)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:88)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:718)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:702)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:191)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:124)
> at java.lang.Thread.run(Thread.java:745)
> {quote}