[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity (short term fix for 2.6.1)
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339479#comment-14339479 ] Wangda Tan commented on YARN-3251: -- Checking this into branch-2.6 > CapacityScheduler deadlock when computing absolute max avail capacity (short > term fix for 2.6.1) > > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Craig Welch >Priority: Blocker > Attachments: YARN-3251.1.patch, YARN-3251.2-6-0.2.patch, > YARN-3251.2-6-0.3.patch, YARN-3251.2-6-0.4.patch, YARN-3251.2.patch > > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity (short term fix for 2.6.1)
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339140#comment-14339140 ] Hadoop QA commented on YARN-3251: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12701150/YARN-3251.2.patch against trunk revision dce8b9c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6760//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6760//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6760//console This message is automatically generated. > CapacityScheduler deadlock when computing absolute max avail capacity (short > term fix for 2.6.1) > > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Craig Welch >Priority: Blocker > Attachments: YARN-3251.1.patch, YARN-3251.2-6-0.2.patch, > YARN-3251.2-6-0.3.patch, YARN-3251.2-6-0.4.patch, YARN-3251.2.patch > > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity (short term fix for 2.6.1)
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338860#comment-14338860 ] Wangda Tan commented on YARN-3251: -- if opposite opinions -> if no opposite opinions > CapacityScheduler deadlock when computing absolute max avail capacity (short > term fix for 2.6.1) > > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Craig Welch >Priority: Blocker > Attachments: YARN-3251.1.patch, YARN-3251.2-6-0.2.patch, > YARN-3251.2-6-0.3.patch > > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity (short term fix for 2.6.1)
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338858#comment-14338858 ] Wangda Tan commented on YARN-3251: -- LGTM +1, I will commit the patch to branch-2.6 this afternoon if opposite opinions. Thanks! > CapacityScheduler deadlock when computing absolute max avail capacity (short > term fix for 2.6.1) > > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Craig Welch >Priority: Blocker > Attachments: YARN-3251.1.patch, YARN-3251.2-6-0.2.patch, > YARN-3251.2-6-0.3.patch > > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity (short term fix for 2.6.1)
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338848#comment-14338848 ] Craig Welch commented on YARN-3251: --- Sorry if that wasn't clear, to reduce risk removed the minor changes in CSQueueUtils > CapacityScheduler deadlock when computing absolute max avail capacity (short > term fix for 2.6.1) > > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Craig Welch >Priority: Blocker > Attachments: YARN-3251.1.patch, YARN-3251.2-6-0.2.patch, > YARN-3251.2-6-0.3.patch > > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity (short term fix for 2.6.1)
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338782#comment-14338782 ] Hadoop QA commented on YARN-3251: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12700977/YARN-3251.2-6-0.2.patch against trunk revision dce8b9c. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6759//console This message is automatically generated. > CapacityScheduler deadlock when computing absolute max avail capacity (short > term fix for 2.6.1) > > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Craig Welch >Priority: Blocker > Attachments: YARN-3251.1.patch, YARN-3251.2-6-0.2.patch > > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity (short term fix for 2.6.1)
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337962#comment-14337962 ] Craig Welch commented on YARN-3251: --- bq. 1) Since the target of your patch is to make a quick fix for old version, it's better to create a patch in branch-2.6 done bq. And patch I'm working on now will remove the CSQueueUtils.computeMaxAvailResource, so it's no need to add a intermediate fix in branch-2. I suppose that depends on whether anyone needs a trunk version of the patch before the other changes are landed - if someone asks for it I could quickly update the original patch to provide it bq. 2) I think CSQueueUtils.getAbsoluteMaxAvailCapacity doesn't hold child/parent's lock together, maybe we don't need to change that, could you confirm? it doesn't, the change there was to insure consistency for multiple values used from the queue, as previously it was occurring inside a lock and that was guaranteed, now it isn't. However, there's no need to lock on the parent, so I removed that bq. 3) Maybe we don't need getter/setter of absoluteMaxAvailCapacity in queue, a volatile float is enough? Yes, that should be safe, done > CapacityScheduler deadlock when computing absolute max avail capacity (short > term fix for 2.6.1) > > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Craig Welch >Priority: Blocker > Attachments: YARN-3251.1.patch, YARN-3251.2-6-0.2.patch > > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity (short term fix for 2.6.1)
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337571#comment-14337571 ] Wangda Tan commented on YARN-3251: -- Removed patch for trunk and uploaded the same one to YARN-3265, reassigned this to [~cwelch]. > CapacityScheduler deadlock when computing absolute max avail capacity (short > term fix for 2.6.1) > > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Craig Welch >Priority: Blocker > Attachments: YARN-3251.1.patch > > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity (short term fix for 2.6.1)
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337567#comment-14337567 ] Wangda Tan commented on YARN-3251: -- +1 for the suggestion, moved it out of YARN-3243, changed target version and also updated title. > CapacityScheduler deadlock when computing absolute max avail capacity (short > term fix for 2.6.1) > > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Wangda Tan >Priority: Blocker > Attachments: YARN-3251.1.patch, YARN-3251.trunk.1.patch > > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity (short term fix for 2.6.1)
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337568#comment-14337568 ] Wangda Tan commented on YARN-3251: -- Linked to YARN-3265. > CapacityScheduler deadlock when computing absolute max avail capacity (short > term fix for 2.6.1) > > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Wangda Tan >Priority: Blocker > Attachments: YARN-3251.1.patch, YARN-3251.trunk.1.patch > > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337529#comment-14337529 ] Vinod Kumar Vavilapalli commented on YARN-3251: --- Actually let's use this JIRA for 2.6.1 and YARN-3243 for the trunk fix. > CapacityScheduler deadlock when computing absolute max avail capacity > - > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Wangda Tan >Priority: Blocker > Attachments: YARN-3251.1.patch, YARN-3251.trunk.1.patch > > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337465#comment-14337465 ] Hadoop QA commented on YARN-3251: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12700875/YARN-3251.trunk.1.patch against trunk revision caa42ad. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.TestResourceUsage Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6744//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6744//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6744//console This message is automatically generated. > CapacityScheduler deadlock when computing absolute max avail capacity > - > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Wangda Tan >Priority: Blocker > Attachments: YARN-3251.1.patch, YARN-3251.trunk.1.patch > > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337443#comment-14337443 ] Hadoop QA commented on YARN-3251: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12700874/YARN-3251.trunk.1.patch against trunk revision caa42ad. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 6 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.TestResourceUsage Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6743//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6743//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6743//console This message is automatically generated. > CapacityScheduler deadlock when computing absolute max avail capacity > - > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Wangda Tan >Priority: Blocker > Attachments: YARN-3251.1.patch, YARN-3251.trunk.1.patch > > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337348#comment-14337348 ] Vinod Kumar Vavilapalli commented on YARN-3251: --- bq. Since the target of your patch is to make a quick fix for old version, it's better to create a patch in branch-2.6. and the patch you created will be committed to branch-2.6 as well. I noticed some functionalities and interfaces being used in your patch are not part of 2.6. And patch I'm working on now will remove the CSQueueUtils.computeMaxAvailResource, so it's no need to add a intermediate fix in branch-2. How about we have a separate JIRA solely focused on 2.6.1 - as we have two separate patches and two different contributors? > CapacityScheduler deadlock when computing absolute max avail capacity > - > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Wangda Tan >Priority: Blocker > Attachments: YARN-3251.1.patch, YARN-3251.trunk.1.patch > > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337305#comment-14337305 ] Wangda Tan commented on YARN-3251: -- Found few typos, removed patch and will upload again. > CapacityScheduler deadlock when computing absolute max avail capacity > - > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Wangda Tan >Priority: Blocker > Attachments: YARN-3251.1.patch > > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335803#comment-14335803 ] Wangda Tan commented on YARN-3251: -- [~cwelch], Some comments, 1) Since the target of your patch is to make a quick fix for old version, it's better to create a patch in branch-2.6. and the patch you created will be committed to branch-2.6 as well. I noticed some functionalities and interfaces being used in your patch are not part of 2.6. And patch I'm working on now will remove the CSQueueUtils.computeMaxAvailResource, so it's no need to add a intermediate fix in branch-2. 2) I think CSQueueUtils.getAbsoluteMaxAvailCapacity doesn't hold child/parent's lock together, maybe we don't need to change that, could you confirm? 3) Maybe we don't need getter/setter of absoluteMaxAvailCapacity in queue, a volatile float is enough? Thanks, > CapacityScheduler deadlock when computing absolute max avail capacity > - > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Wangda Tan >Priority: Blocker > Attachments: YARN-3251.1.patch > > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335765#comment-14335765 ] Craig Welch commented on YARN-3251: --- It looks like this can occur when a call which walks down the queue tree (in this case, getQueueInfo()) happens at the same time as an assignContainers call which does not start from the root queue, which is specifically one for a reservedContainer where scheduleAsynchronously is false. Essentially, it isn't safe to hold a lock on a queue while locking on a parent queue (as I now see noted in other methods in LeafQueue :/). [YARN-3243] is potentially a long term fix, but it would be nice to fix this right away as it clearly is already problematic. Also, [YARN-3243] depends on a number of other sizable changes which have gone in recently, meaning it will be difficult to apply it as a fix to older codebases, for which it would be very nice to have a fix. I've attached a patch somewhat along the lines suggest by [~sunilg], it simply moves the acquisition of the absoluteMaxAvailCapacity outside the lock on the leaf queue - it will lock parent queues individually as it ascends, but it never holds a parent and child lock simultaneously, which is the unacceptable state. It follows the pattern for other methods in LeafQueue like recoverContainer which access parent queues - they all are careful to make sure the parent queue access occurs outside any lock on themselves. Unfortunately it's not possible to just do this in root.assignContainers because of the reservedContainer case which will not invoke assignContainers on the root queue at any point. Instead, absoluteMaxAvailCapacity is determined outside any lock on the leaf queue in assignContainers before entering the synchronized method which continues the logic as it is today. This looks to me to be the way to fix the issue with the smallest code change today pending other changes coming down the line. > CapacityScheduler deadlock when computing absolute max avail capacity > - > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Wangda Tan >Priority: Blocker > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335124#comment-14335124 ] Wangda Tan commented on YARN-3251: -- Thanks for reporting this, [~jlowe]! Since this is a blocker for 2.7, I will create a patch for this using method described in YARN-3243 first before working on other related refactorings, I added this as a sub task of YARN-3251. > CapacityScheduler deadlock when computing absolute max avail capacity > - > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Priority: Blocker > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335043#comment-14335043 ] Sunil G commented on YARN-3251: --- Its better to compute the available capacity during the call to root.assignContainers. In that scenario, a simpler get will retrieve the available capacity. > CapacityScheduler deadlock when computing absolute max avail capacity > - > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Priority: Blocker > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335044#comment-14335044 ] Jason Lowe commented on YARN-3251: -- YARN-3243 could remove the need to climb up the hierarchy to compute max avail capacity. > CapacityScheduler deadlock when computing absolute max avail capacity > - > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Priority: Blocker > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335021#comment-14335021 ] Sunil G commented on YARN-3251: --- [~jlowe] Recent getAbsoluteMaxAvailCapacity changes cause this. > CapacityScheduler deadlock when computing absolute max avail capacity > - > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Priority: Blocker > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335002#comment-14335002 ] Jason Lowe commented on YARN-3251: -- It looks like this is fallout from YARN-2008. CSQueueUtils.getAbsoluteMaxAvailCapacity is called with the lock held on the LeafQueue and walks up the tree, attempting to grab locks on parents as it goes. That's contrary to the conventional order of locking while walking down the tree, and thus we can deadlock. > CapacityScheduler deadlock when computing absolute max avail capacity > - > > Key: YARN-3251 > URL: https://issues.apache.org/jira/browse/YARN-3251 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Priority: Blocker > > The ResourceManager can deadlock in the CapacityScheduler when computing the > absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3251) CapacityScheduler deadlock when computing absolute max avail capacity
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334986#comment-14334986 ] Jason Lowe commented on YARN-3251: -- Sample stack trace: {noformat} Found one Java-level deadlock: = "IPC Server handler 71 on 8032": waiting to lock monitor 0x037f9120 (object 0x00023b060ad8, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue), which is held by "ResourceManager Event Processor" "ResourceManager Event Processor": waiting to lock monitor 0x02c4b7d0 (object 0x00023aecf620, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue), which is held by "IPC Server handler 71 on 8032" Java stack information for the threads listed above: === "IPC Server handler 71 on 8032": at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.getQueueInfo(LeafQueue.java:451) - waiting to lock <0x00023b060ad8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getQueueInfo(ParentQueue.java:214) - locked <0x00023aecf620> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getQueueInfo(ParentQueue.java:214) - locked <0x00023af36e70> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getQueueInfo(ParentQueue.java:214) - locked <0x00023b0d9478> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getQueueInfo(CapacityScheduler.java:910) at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueInfo(ClientRMService.java:832) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueInfo(ApplicationClientProtocolPBServiceImpl.java:259) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:413) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2079) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2075) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2073) "ResourceManager Event Processor": at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.getParent(AbstractCSQueue.java:185) - waiting to lock <0x00023aecf620> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.getAbsoluteMaxAvailCapacity(CSQueueUtils.java:177) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.getAbsoluteMaxAvailCapacity(CSQueueUtils.java:183) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.computeUserLimitAndSetHeadroom(LeafQueue.java:1033) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.checkLimitsToReserve(LeafQueue.java:1341) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1611) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1399) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1278) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignReservedContainer(LeafQueue.java:893) - locked <0x00023b060ad8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:758) - locked <0x00023ceb53e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp) - locked <0x00023b060ad8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue) at org.apache.hadoop.yarn