[ https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680210#comment-14680210 ]
Jason Lowe commented on YARN-4000:
----------------------------------
I don't believe the CapacityScheduler supports changing a leaf queue into a
parent queue, just as it doesn't support deleting a queue. Both can be
accomplished by restarting the RM, but at that point we're effectively doing
an unrelated queue setup and sidestepping the things that are "hard" to
accomplish. If they were easy, we'd just support them as refreshable options
rather than requiring a restart. Supporting these kinds of config changes
during work-preserving RM restart essentially requires us to tackle them as if
we were refreshing, because apps and containers aren't getting wiped off the
cluster between the changes. That means we need to hammer out exactly what the
semantics are if we don't declare it outright wrong to set up the configs like
that.

Killing an app when its queue disappears, either because the queue was deleted
or because it suddenly became a parent queue, is a bit severe, especially if
it was an accident (e.g., someone typo'd the queue name in the list of child
queues when adding an unrelated queue). However, I'm not sure we have a lot of
other great options. We could move the application to another queue so it can
survive, but then the question is which queue to use. There may not be a
default queue, and/or the user may not have permissions on any other queue. Or
all other queues could already be at their max app capacity, etc.
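
To make the difficulty with moving the app concrete, here is a minimal sketch of
what picking a fallback queue might look like. Everything in it is hypothetical
(the LeafQueue stand-in, the allowedUsers and maxApps fields, the pickFallback
helper); none of it is the real CapacityScheduler API. The point is only that
the lookup can legitimately come back empty.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: these types are illustrative, not CapacityScheduler classes.
final class FallbackQueueSketch {

    /** Minimal stand-in for a leaf queue (assumed fields, for illustration only). */
    static final class LeafQueue {
        final String name;
        final Set<String> allowedUsers; // users permitted to submit to this queue
        final int maxApps;              // maximum number of concurrent apps
        final int runningApps;          // apps currently in the queue

        LeafQueue(String name, Set<String> allowedUsers, int maxApps, int runningApps) {
            this.name = name;
            this.allowedUsers = allowedUsers;
            this.maxApps = maxApps;
            this.runningApps = runningApps;
        }

        /** True if the user has access and the queue still has room for one more app. */
        boolean accepts(String user) {
            return allowedUsers.contains(user) && runningApps < maxApps;
        }
    }

    /**
     * Prefers a usable queue named "default"; otherwise returns the first queue
     * that accepts the user. Returns empty when no queue qualifies.
     */
    static Optional<LeafQueue> pickFallback(String user, List<LeafQueue> leafQueues) {
        Optional<LeafQueue> preferred = leafQueues.stream()
                .filter(q -> q.name.equals("default") && q.accepts(user))
                .findFirst();
        if (preferred.isPresent()) {
            return preferred;
        }
        return leafQueues.stream().filter(q -> q.accepts(user)).findFirst();
    }
}
```

If the only queue the user can access is already at its app limit, pickFallback
returns an empty Optional, and we are back to deciding what to do with the app.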

Another option is to put the app in limbo and "pause" it, where it won't get
any more resources but we won't kill any outstanding containers. Basically,
we'd be waiting for the user to move it themselves so it can progress. But in
the interim the accounting is messed up because cluster resources are being
consumed by something that isn't in a queue.
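
The accounting problem with a paused app can be illustrated with a small
sketch. The names here (usedByQueueMb, clusterUsedMb, unaccountedMb) are
hypothetical and the numbers in the usage note are made up; this is not how the
scheduler actually tracks usage, just a way to show the mismatch.

```java
import java.util.Map;

// Hypothetical sketch: illustrates the mismatch, not the scheduler's real bookkeeping.
final class PausedAppAccountingSketch {

    /** Total memory (MB) charged to queues. */
    static long chargedToQueues(Map<String, Long> usedByQueueMb) {
        long total = 0;
        for (long used : usedByQueueMb.values()) {
            total += used;
        }
        return total;
    }

    /** Memory (MB) in use on the cluster that no queue accounts for. */
    static long unaccountedMb(long clusterUsedMb, Map<String, Long> usedByQueueMb) {
        return clusterUsedMb - chargedToQueues(usedByQueueMb);
    }
}
```

For example, if queues B and C are charged 4096 MB and 2048 MB while a paused
app still holds 1024 MB of containers, the cluster reports 7168 MB in use and
unaccountedMb returns 1024: resources that count against no queue's capacity.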

So for now, killing it seems to be the path of least resistance if the RM has
to survive. I agree with Karthik that the fail-fast config seems appropriate
for determining whether the user would like the RM to fail to come up with
that config or to kill apps to survive.
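
A rough sketch of that decision during app recovery might look like the
following. The types, the recoverApp helper, and the failFast flag are all
hypothetical stand-ins; this is not the actual patch or the real
CapacityScheduler code, only an illustration of the two outcomes being
discussed.

```java
import java.util.Map;

// Hypothetical sketch: not the actual YARN-4000 patch or CapacityScheduler API.
final class RecoverySketch {

    enum QueueKind { LEAF, PARENT }

    enum Outcome { RECOVERED, KILLED }

    /**
     * Decides what happens to a recovered app whose queue may have been deleted
     * or turned into a parent queue across the restart.
     *
     * @param failFast assumed flag: true means refuse to start with this config
     */
    static Outcome recoverApp(String appId, String queueName,
                              Map<String, QueueKind> queues, boolean failFast) {
        QueueKind kind = queues.get(queueName); // null means the queue was deleted
        if (kind == QueueKind.LEAF) {
            return Outcome.RECOVERED;           // normal case: queue is still a leaf
        }
        String reason = (kind == null) ? "no longer exists" : "is now a parent queue";
        String message = "Application " + appId + " was submitted to queue "
                + queueName + ", which " + reason;
        if (failFast) {
            // Surface the bad config instead of crashing later with an NPE.
            throw new IllegalStateException(message);
        }
        // Otherwise the RM survives and the app is killed.
        return Outcome.KILLED;
    }
}
```

Either way the missing leaf queue is checked explicitly, so the failure mode is
a clear error or a killed app rather than a NullPointerException.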
> RM crashes with NPE if leaf queue becomes parent queue during restart
> ---------------------------------------------------------------------
>
> Key: YARN-4000
> URL: https://issues.apache.org/jira/browse/YARN-4000
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler, resourcemanager
> Affects Versions: 2.6.0
> Reporter: Jason Lowe
> Assignee: Varun Saxena
>
> This is a similar situation to YARN-2308. If an application is active in
> queue A and the RM then restarts with a changed capacity scheduler
> configuration in which queue A has become a parent queue with subqueues,
> the RM will crash with a NullPointerException.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)