[ https://issues.apache.org/jira/browse/YARN-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15072993#comment-15072993 ]
Wangda Tan edited comment on YARN-4519 at 12/28/15 6:43 PM:
------------------------------------------------------------

Thanks [~jianhe] for finding this issue, and [~sandflee]/[~mding] for the analysis.

I think the simplest solution would be to move
{code}
// Decrease containers
decreaseContainers(normalizedDecreaseRequests, application);
{code}
out of the synchronized block on the application:
{code}
synchronized (application) {
  //...
}
// put it here.
{code}

Also, in {{AbstractYarnScheduler#decreaseContainers}}, it would be better to move
{code}
boolean hasIncreaseRequest =
    attempt.removeIncreaseRequest(request.getNodeId(),
        request.getPriority(), request.getContainerId());
{code}
into {{decreaseContainer}}.

After these changes, decreasing a container needs to acquire the CS (CapacityScheduler) lock first, and YARN-4138 can directly use {{decreaseContainer}} to roll back a container.

Thoughts?


was (Author: leftnoteasy):
Thanks [~jianhe] for finding this issue, and [~sandflee]/[~mding] for the analysis.

I think the simplest solution would be to move
{code}
// Decrease containers
decreaseContainers(normalizedDecreaseRequests, application);
{code}
out of the synchronized block on the application:
{code}
synchronized (application) {
  //...
}
// put it here.
{code}

Also, in {{AbstractYarnScheduler#decreaseContainers}}, it would be better to move
{code}
boolean hasIncreaseRequest =
    attempt.removeIncreaseRequest(request.getNodeId(),
        request.getPriority(), request.getContainerId());
{code}
into {{decreaseContainer}}.

After these changes, decreasing a container needs to acquire the CS (CapacityScheduler) lock first, and YARN-4136 can directly use {{decreaseContainer}} to roll back a container.

Thoughts?

> potential deadlock of CapacityScheduler between decrease container and assign containers
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-4519
>                 URL: https://issues.apache.org/jira/browse/YARN-4519
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>            Reporter: sandflee
>
> In CapacityScheduler.allocate(), the FiCaSchedulerApp sync lock is acquired first, and
> CapacityScheduler's sync lock may then be acquired in decreaseContainer().
> In the scheduler thread, CapacityScheduler's sync lock is acquired first in
> allocateContainersToNode(), and the FiCaSchedulerApp sync lock may then be acquired in
> FiCaSchedulerApp.assignContainers().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
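
The lock-order inversion described in the issue, and the reordering proposed in the comment, can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the YARN code: {{SCHEDULER_LOCK}} and {{APP_LOCK}} are hypothetical stand-ins for the CapacityScheduler and FiCaSchedulerApp monitors, the method bodies are placeholders, and it is only an assumption here that {{decreaseContainer}} re-takes the application lock after the scheduler lock. The point it shows is that once the decrease step runs after the application lock is released, both paths take the scheduler lock first, so the two threads can no longer wait on each other in a cycle.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LockOrderSketch {
    // Hypothetical stand-ins for the CapacityScheduler and FiCaSchedulerApp monitors.
    private static final Object SCHEDULER_LOCK = new Object();
    private static final Object APP_LOCK = new Object();

    static final AtomicInteger assigned = new AtomicInteger();
    static final AtomicInteger decreased = new AtomicInteger();

    // Scheduler thread path: scheduler lock first, then application lock.
    static void assignContainers() {
        synchronized (SCHEDULER_LOCK) {
            synchronized (APP_LOCK) {
                assigned.incrementAndGet();
            }
        }
    }

    // Allocate path after the proposed change: the application lock is
    // released before the decrease step asks for the scheduler lock.
    static void allocate() {
        synchronized (APP_LOCK) {
            // ... update requests while holding only the application lock ...
        }
        decreaseContainer(); // moved out of the synchronized block
    }

    // Decrease path: scheduler lock first, matching assignContainers().
    // Taking the application lock inside is an assumption of this sketch.
    static void decreaseContainer() {
        synchronized (SCHEDULER_LOCK) {
            synchronized (APP_LOCK) {
                decreased.incrementAndGet();
            }
        }
    }

    // Runs both paths concurrently; returns true if both threads finish.
    static boolean runBoth(int iterations) {
        assigned.set(0);
        decreased.set(0);
        Thread schedulerThread = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                assignContainers();
            }
        });
        Thread allocateThread = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                allocate();
            }
        });
        schedulerThread.setDaemon(true);
        allocateThread.setDaemon(true);
        schedulerThread.start();
        allocateThread.start();
        try {
            schedulerThread.join(10_000);
            allocateThread.join(10_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !schedulerThread.isAlive() && !allocateThread.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("finished without deadlock: " + runBoth(10_000));
    }
}
```

If {{decreaseContainer()}} were instead called inside the {{synchronized (APP_LOCK)}} block of {{allocate()}}, that path would hold the application lock while waiting for the scheduler lock, which is the inverse of the order used in {{assignContainers()}} and is the deadlock pattern reported in the issue.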