Shilun Fan created YARN-11935:
---------------------------------

             Summary: Fix deadlock in TestRMHA#testTransitionedToStandbyShouldNotHang
                 Key: YARN-11935
                 URL: https://issues.apache.org/jira/browse/YARN-11935
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
            Reporter: Shilun Fan
            Assignee: Shilun Fan


*Problem*

`testTransitionedToStandbyShouldNotHang` hangs and eventually times out after 
100 seconds. YARN-11898 only added the timeout; it did not fix the root cause.

*Root Cause*

The test creates a deadlock between the RM lock and the dispatcher thread:

1. Thread `t` calls `rm.transitionToStandby(true)` and holds the RM monitor 
(method is `synchronized`)

2. Inside, the overridden `stopActiveServices()` sleeps 10 seconds 
(intentionally widening the window)

3. Meanwhile, the main thread calls `updateApplicationState(null)`, which 
triggers a `StoreFencedException`

4. This emits `RMFatalEvent(STATE_STORE_FENCED)`, handled by the **async 
dispatcher thread**

5. The dispatcher thread calls `handleTransitionToStandByInNewThread()`, which 
is **synchronized** and needs the RM lock

6. At the same time `transitionToStandby(true)` stops the dispatcher and 
`join()`s the event thread

7. Deadlock:
   - `t` holds RM lock and waits for dispatcher thread to exit
   - dispatcher thread waits for RM lock
   - main thread waits on `t.join()`
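
The cycle in step 7 can be modeled with plain Java threads. This is only a sketch of the locking pattern, not the Hadoop code: `RmDeadlockSketch`, `rmLock`, and the thread bodies are hypothetical stand-ins, and the join is bounded to 500 ms so the example terminates instead of hanging the way the real test does.

```java
import java.util.concurrent.CountDownLatch;

// Plain-Java model of the cycle; none of these names are Hadoop classes.
final class RmDeadlockSketch {

    /**
     * Returns true if the "dispatcher" thread was still alive (stuck waiting
     * for the lock) when the lock holder's bounded join gave up.
     */
    static boolean dispatcherStuckWhileLockHeld() {
        final Object rmLock = new Object();  // stands in for the RM monitor
        final CountDownLatch lockHeld = new CountDownLatch(1);
        final boolean[] stuck = new boolean[1];

        // Models the dispatcher thread calling a synchronized RM method.
        Thread dispatcher = new Thread(() -> {
            try {
                lockHeld.await();  // start only once "t" owns the lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (rmLock) {
                // would handle the fatal event here
            }
        }, "dispatcher");

        // Models thread "t": holds the lock and joins the dispatcher thread.
        Thread t = new Thread(() -> {
            synchronized (rmLock) {
                lockHeld.countDown();
                try {
                    dispatcher.join(500);  // an unbounded join() would never return
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                stuck[0] = dispatcher.isAlive();
            }
        }, "t");

        dispatcher.start();
        t.start();
        try {
            t.join();           // lock is released when "t" finishes
            dispatcher.join();  // dispatcher can now take the lock and exit
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return stuck[0];
    }

    public static void main(String[] args) {
        System.out.println("dispatcher stuck while lock held: "
            + dispatcherStuckWhileLockHeld());
    }
}
```

With an unbounded `join()` inside the `synchronized` block, neither thread could make progress, which matches the hang described above.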

*Solution*

Use `InlineDispatcher` for this specific test. `InlineDispatcher` handles 
events synchronously on the calling thread:

1. No separate dispatcher thread → no lock contention
2. Override `drainEvents()` as a no-op to avoid the “Not a Drain Dispatcher” 
failure
3. Keep the 10s sleep to preserve the slow-shutdown simulation
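
The idea behind inline dispatch can be sketched without any Hadoop dependencies. `SketchInlineDispatcher` below is a hypothetical minimal class written for illustration, not Hadoop's `InlineDispatcher`; it only shows that the handler runs on the caller's thread, so there is no event thread to stop or join.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class InlineDispatchSketch {

    /** Hypothetical dispatcher that delivers each event on the caller's thread. */
    static final class SketchInlineDispatcher<E> {
        private final List<Consumer<E>> handlers = new ArrayList<>();

        void register(Consumer<E> handler) {
            handlers.add(handler);
        }

        void dispatch(E event) {
            // No queue and no worker thread: handlers run before dispatch returns.
            for (Consumer<E> handler : handlers) {
                handler.accept(event);
            }
        }

        void drainEvents() {
            // No-op: nothing is ever queued, so there is nothing to drain.
        }
    }

    /** Returns true if the handler ran on the same thread that dispatched. */
    static boolean handledOnCallingThread() {
        SketchInlineDispatcher<String> dispatcher = new SketchInlineDispatcher<>();
        Thread[] handlerThread = new Thread[1];
        dispatcher.register(event -> handlerThread[0] = Thread.currentThread());
        dispatcher.dispatch("STATE_STORE_FENCED");
        dispatcher.drainEvents();
        return handlerThread[0] == Thread.currentThread();
    }

    public static void main(String[] args) {
        System.out.println("handled on calling thread: " + handledOnCallingThread());
    }
}
```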



