Tao Yang created YARN-11843:
-------------------------------

             Summary: Fix potential deadlock when auto-correction of container allocation is enabled
                 Key: YARN-11843
                 URL: https://issues.apache.org/jira/browse/YARN-11843
             Project: Hadoop YARN
          Issue Type: Bug
          Components: scheduler
    Affects Versions: 3.5.0
            Reporter: Tao Yang
            Assignee: Tao Yang

The container-allocation auto-correction feature introduced in YARN-11702 has a potential deadlock. When the feature is enabled, a thread that already holds an application-level write lock tries to acquire a queue-level write lock.

Root Cause:
- autoCorrectContainerAllocation is called while the application-level write lock is held.
- It directly calls completedContainer(), which requires the queue-level write lock.
{code:java}
CapacityScheduler#allocate --> ...
  application.getWriteLock().lock();   //1. requires app writeLock!!!
  try{
    ...
    AbstractYarnScheduler#autoCorrectContainerAllocation
      --> AbstractYarnScheduler#completedContainer
      --> AbstractYarnScheduler#completedContainerInternal
      --> AbstractLeafQueue#completedContainer
            writeLock.lock()   //2. requires queue writeLock!!!
            try{
              ...
              FiCaSchedulerApp#containerCompleted   //3. requires app writeLock!!!
            }finally{
              writeLock.unlock();
            }
  }finally{
    application.getWriteLock().unlock();
  }{code}
- This violates the lock hierarchy and can deadlock, because AbstractYarnScheduler#completedContainer can also be called from another thread during normal container completion. That path takes the queue-level write lock first (step 2) and the application-level write lock second (step 3), which is the reverse of the order used in the allocate path.

Solution:
Replace the direct completedContainer() call in the autoCorrectContainerAllocation method with asyncContainerRelease().

Before:
{code:java}
completedContainer(rmContainer, ...); // Direct call causes deadlock
{code}
After:
{code:java}
asyncContainerRelease(rmContainer); // Async call avoids deadlock
{code}
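
The lock-ordering problem and the idea behind the fix can be sketched outside of YARN as follows. This is a minimal illustration, not the scheduler code: the class AsyncReleaseSketch, its fields, and the in-memory queue that stands in for an asynchronous release are all hypothetical, and only the method names completedContainer and allocate are borrowed from the trace above. The allocation path holds only the application lock and defers the release, so the queue lock is never requested while the application lock is held.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the scheduler; none of these names are YARN APIs.
class AsyncReleaseSketch {
    private final ReentrantReadWriteLock queueLock = new ReentrantReadWriteLock();
    private final ReentrantReadWriteLock appLock = new ReentrantReadWriteLock();
    // Assumption: deferred releases are modeled as a simple in-memory queue.
    private final BlockingQueue<String> pendingReleases = new LinkedBlockingQueue<>();
    private final List<String> released = new ArrayList<>();

    // Normal completion path: queue lock first, then application lock.
    void completedContainer(String containerId) {
        queueLock.writeLock().lock();
        try {
            appLock.writeLock().lock();
            try {
                released.add(containerId);
            } finally {
                appLock.writeLock().unlock();
            }
        } finally {
            queueLock.writeLock().unlock();
        }
    }

    // Allocation path: holds the application lock only and defers the release,
    // instead of calling completedContainer() while the lock is held.
    void allocate(List<String> excessContainers) {
        appLock.writeLock().lock();
        try {
            pendingReleases.addAll(excessContainers);
        } finally {
            appLock.writeLock().unlock();
        }
    }

    // Processes deferred releases; the caller holds no locks at this point.
    int drainPendingReleases() {
        int count = 0;
        String id;
        while ((id = pendingReleases.poll()) != null) {
            completedContainer(id);
            count++;
        }
        return count;
    }

    // Runs both paths concurrently and returns the number of released containers.
    static int demo() {
        AsyncReleaseSketch s = new AsyncReleaseSketch();
        int rounds = 1000;
        Thread allocator = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                s.allocate(Collections.singletonList("excess-" + i));
            }
        });
        Thread completer = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                s.completedContainer("finished-" + i);
            }
        });
        allocator.start();
        completer.start();
        try {
            allocator.join();
            completer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        s.drainPendingReleases();
        return s.released.size();
    }

    public static void main(String[] args) {
        System.out.println("released containers: " + demo());
    }
}
```

If allocate() called completedContainer() directly while holding appLock, the allocator thread would take the locks in the order application then queue, while the completer thread takes them in the order queue then application, and the two threads could block each other indefinitely. Deferring the release keeps a single acquisition order across both paths.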