[jira] [Commented] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart

Jason Lowe (JIRA) Mon, 10 Aug 2015 07:44:58 -0700

    [ https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680210#comment-14680210 ]


Jason Lowe commented on YARN-4000:
----------------------------------

I don't believe changing a leaf queue into a parent queue is supported by the 
CapacityScheduler, just like it doesn't support deleting a queue.  These can be 
accomplished by restarting the RM, but at that point we're doing an unrelated 
queue setup and sidestepping the things that are "hard" to accomplish.  If they 
were easy, we'd just support them as refreshable options rather than requiring 
a restart.  Supporting these kinds of config changes during work-preserving RM 
restart essentially requires us to tackle them as if we were refreshing, 
because apps and containers aren't getting wiped off the cluster between the 
changes.  That means we need to hammer out exactly what the semantics are if we 
don't declare it to be outright wrong to set up the configs like that.

Killing an app when its queue disappears, either by being deleted or by having 
it suddenly become a parent queue, is a bit severe, especially if it was an 
accident (e.g., someone mistyped the queue name in the list of child queues when 
adding an unrelated queue).  However, I'm not sure we have a lot of other great 
options.  We could move the application to another queue so it can survive, but 
then the question is what queue to use.  There may not be a default queue 
and/or the user may not have permissions on any other queue.  Or all other 
queues could already be at max app capacity, etc.

Another option is to put the app in limbo and "pause" it, where it won't get 
any more resources but we won't kill any outstanding containers.  Basically 
we're waiting for the user to move it themselves so it can progress.  But in 
the interim, the accounting is wrong because cluster resources are being 
consumed by something that isn't in a queue.

So for now, killing it seems to be the path of least resistance if the RM has 
to survive.  I agree with Karthik that the fail-fast config seems appropriate 
for determining whether the RM should refuse to come up with that config or 
kill the affected apps and keep running.
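
As a rough illustration of the choice described above, here is a minimal Java 
sketch of the decision the RM would face for each recovered app.  Everything in 
it is hypothetical: the class, the enum, the decide() method and the failFast 
flag are stand-ins for illustration and are not taken from the 
CapacityScheduler source or from any patch on this issue.

```java
import java.util.Set;

/** Hypothetical sketch only; these names do not come from the YARN code base. */
public class QueueRecoverySketch {

    /** Possible outcomes when recovering one app after a work-preserving restart. */
    enum Outcome { RECOVER, KILL_APP, FAIL_RM_STARTUP }

    /**
     * Decide what to do with a recovered app.
     *
     * @param leafQueues leaf queue names in the newly loaded configuration
     * @param appQueue   queue the app was running in before the restart
     * @param failFast   assumed flag standing in for the fail-fast config
     */
    static Outcome decide(Set<String> leafQueues, String appQueue, boolean failFast) {
        if (leafQueues.contains(appQueue)) {
            // The queue still exists as a leaf, so recovery proceeds normally.
            return Outcome.RECOVER;
        }
        // The queue was deleted or turned into a parent queue.
        // Either refuse to start, or kill the app so the RM can survive.
        return failFast ? Outcome.FAIL_RM_STARTUP : Outcome.KILL_APP;
    }

    public static void main(String[] args) {
        // Queue "A" became a parent of A1 and A2, so it is no longer a leaf.
        Set<String> leaves = Set.of("A1", "A2", "B");
        System.out.println(decide(leaves, "B", false));
        System.out.println(decide(leaves, "A", false));
        System.out.println(decide(leaves, "A", true));
    }
}
```

The "move to another queue" and "pause in limbo" options are deliberately left 
out of the sketch, since their semantics are exactly what is unsettled here.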

> RM crashes with NPE if leaf queue becomes parent queue during restart
> ---------------------------------------------------------------------
>
>                 Key: YARN-4000
>                 URL: https://issues.apache.org/jira/browse/YARN-4000
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler, resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>
> This is a similar situation to YARN-2308.  If an application is active in 
> queue A and then the RM restarts with a changed capacity scheduler 
> configuration where queue A becomes a parent queue to other subqueues then 
> the RM will crash with a NullPointerException.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
