[ https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876958#comment-14876958 ]
Jian He commented on YARN-4000:
-------------------------------

One more issue: there may be a container leak. Depending on when the NM re-registers, some containers may be recovered even after the application has received the kill signal, in which case those containers are leaked.

One solution I can think of: given that CapacityScheduler#doneApplicationAttempt and recoverContainersOnNode are both synchronized, we can check inside recoverContainersOnNode whether the RMAppAttempt is in a final state (FINISHED/FAILED/KILLED) and skip recovering the container if it is.

It would be great if you could add a test case for this.

> RM crashes with NPE if leaf queue becomes parent queue during restart
> ---------------------------------------------------------------------
>
>                 Key: YARN-4000
>                 URL: https://issues.apache.org/jira/browse/YARN-4000
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler, resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-4000.01.patch, YARN-4000.02.patch,
>                      YARN-4000.03.patch, YARN-4000.04.patch, YARN-4000.05.patch
>
>
> This is a similar situation to YARN-2308. If an application is active in
> queue A and then the RM restarts with a changed capacity scheduler
> configuration where queue A becomes a parent queue to other subqueues then
> the RM will crash with a NullPointerException.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
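
For illustration only, the final-state check suggested in the comment could be sketched roughly as below. This is not the actual CapacityScheduler code or the attached patch: the class RecoverySketch, the AttemptState enum, the addAttempt helper, and the string-keyed maps are all hypothetical stand-ins. The sketch also assumes that containers of unknown attempts are skipped, which the comment does not specify. Only the idea is taken from the comment: both methods are synchronized on the same object, so recovery can safely test whether the attempt has already reached FINISHED/FAILED/KILLED.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the scheduler; NOT the real CapacityScheduler API.
class RecoverySketch {
    // Simplified attempt states; the three final ones are those named in the comment.
    enum AttemptState { RUNNING, FINISHED, FAILED, KILLED }

    private static final Set<AttemptState> FINAL_STATES =
        EnumSet.of(AttemptState.FINISHED, AttemptState.FAILED, AttemptState.KILLED);

    // attemptId -> current state (hypothetical bookkeeping for the sketch).
    private final Map<String, AttemptState> attempts = new HashMap<>();

    // Hypothetical helper: register a running attempt.
    synchronized void addAttempt(String attemptId) {
        attempts.put(attemptId, AttemptState.RUNNING);
    }

    // Synchronized on the same monitor as recoverContainersOnNode, so the two
    // cannot interleave: recovery sees the attempt either before or after it is done.
    synchronized void doneApplicationAttempt(String attemptId, AttemptState finalState) {
        attempts.put(attemptId, finalState);
    }

    // reported: containerId -> attemptId, as sent by a re-registering NM.
    // Returns the ids of the containers that were actually recovered.
    synchronized List<String> recoverContainersOnNode(Map<String, String> reported) {
        List<String> recovered = new ArrayList<>();
        for (Map.Entry<String, String> entry : reported.entrySet()) {
            AttemptState state = attempts.get(entry.getValue());
            // Skip containers whose attempt is unknown (an assumption of this
            // sketch) or already in a final state, so they are not leaked.
            if (state == null || FINAL_STATES.contains(state)) {
                continue;
            }
            recovered.add(entry.getKey());
        }
        return recovered;
    }
}
```

A test case along the lines requested could kill an attempt first and then have the NM report one of its containers, expecting that container not to be recovered.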