Shilun Fan created YARN-11935:
---------------------------------
Summary: Fix deadlock in TestRMHA#testTransitionedToStandbyShouldNotHang
Key: YARN-11935
URL: https://issues.apache.org/jira/browse/YARN-11935
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Reporter: Shilun Fan
Assignee: Shilun Fan
*Problem*
`testTransitionedToStandbyShouldNotHang` hangs and eventually times out after
100 seconds. YARN-11898 only added a timeout; it did not fix the root cause.
*Root Cause*
The test creates a deadlock between the RM lock and the dispatcher thread:
1. Thread `t` calls `rm.transitionToStandby(true)` and holds the RM monitor
(method is `synchronized`)
2. Inside, the overridden `stopActiveServices()` sleeps 10 seconds
(intentionally widening the window)
3. Meanwhile the main thread calls `updateApplicationState(null)` → triggers
`StoreFencedException`
4. This emits `RMFatalEvent(STATE_STORE_FENCED)`, handled by the **async
dispatcher thread**
5. The dispatcher thread calls `handleTransitionToStandByInNewThread()`, which
is **synchronized** and needs the RM lock
6. At the same time `transitionToStandby(true)` stops the dispatcher and
`join()`s the event thread
7. Deadlock:
- `t` holds RM lock and waits for dispatcher thread to exit
- dispatcher thread waits for RM lock
- main thread waits on `t.join()`
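The lock cycle in steps 1 to 7 can be sketched outside Hadoop. The class below is a minimal illustration, not the RM code: `rmLock`, `t` and `dispatcher` are hypothetical stand-ins for the RM monitor, the thread calling `transitionToStandby(true)` and the async dispatcher thread. The `join()` is bounded here so the sketch terminates; with an unbounded `join()`, as in the scenario described above, neither thread would make progress.

```java
import java.util.concurrent.CountDownLatch;

final class RmLockDeadlockSketch {

    /**
     * Returns true if the stand-in dispatcher thread was still blocked on the
     * lock when the bounded join expired.
     */
    static boolean dispatcherStuckWhileLockHeld(long joinMillis) throws InterruptedException {
        if (joinMillis <= 0) {
            // join(0) waits forever, which would turn this sketch into a real deadlock.
            throw new IllegalArgumentException("joinMillis must be positive");
        }
        final Object rmLock = new Object();              // stand-in for the RM monitor
        final CountDownLatch lockHeld = new CountDownLatch(1);
        final boolean[] stuck = new boolean[1];

        // Stand-in for the dispatcher thread: its handler needs the RM lock.
        final Thread dispatcher = new Thread(() -> {
            try {
                lockHeld.await();                        // wait until t owns the lock
            } catch (InterruptedException e) {
                return;
            }
            synchronized (rmLock) {
                // Stand-in for the synchronized handler; reached only after t releases the lock.
            }
        }, "dispatcher");

        // Stand-in for thread t: holds the lock and waits for the dispatcher to exit.
        final Thread t = new Thread(() -> {
            synchronized (rmLock) {
                lockHeld.countDown();
                try {
                    dispatcher.join(joinMillis);         // bounded only so the sketch ends
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                stuck[0] = dispatcher.isAlive();         // still blocked on rmLock
            }
        }, "t");

        dispatcher.start();
        t.start();
        t.join();
        dispatcher.join();
        return stuck[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("dispatcher stuck while lock held: "
                + dispatcherStuckWhileLockHeld(200));
    }
}
```

`Thread.join` waits on the target thread object, not on `rmLock`, so `t` keeps the lock for the whole wait. That is what closes the cycle.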
*Solution*
Use `InlineDispatcher` for this specific test. `InlineDispatcher` handles
events synchronously on the calling thread, so:
1. No separate dispatcher thread → nothing for `transitionToStandby(true)` to
`join()` while it holds the RM lock, so the circular wait cannot form
2. Override `drainEvents()` as a no-op to avoid the “Not a Drain Dispatcher”
error
3. Keep the 10s sleep to preserve the slow-shutdown simulation
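Why inline dispatch removes the cycle can be sketched the same way. This is an illustration under assumptions, not the test change itself: the `Dispatcher` interface, `rmLock` and `t` below are hypothetical stand-ins, and no Hadoop classes are used. The handler runs on the calling thread, so the lock holder has no dispatcher thread to wait for; the caller only waits until the lock is released.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

final class InlineDispatchSketch {

    /** Hypothetical minimal dispatcher contract for this sketch. */
    interface Dispatcher {
        void dispatch(Runnable handler);
    }

    /**
     * Returns true if the handler ran on the calling thread and completed,
     * even though another thread held the lock for holdMillis.
     */
    static boolean runsOnCallerAndCompletes(long holdMillis) throws InterruptedException {
        final Object rmLock = new Object();              // stand-in for the RM monitor
        final CountDownLatch lockHeld = new CountDownLatch(1);
        final Dispatcher inline = Runnable::run;         // handle the event on the caller

        // Stand-in for thread t: slow shutdown under the lock, but no thread to join.
        final Thread t = new Thread(() -> {
            synchronized (rmLock) {
                lockHeld.countDown();
                try {
                    Thread.sleep(holdMillis);            // simulated slow shutdown
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "t");
        t.start();
        lockHeld.await();

        final Thread caller = Thread.currentThread();
        final AtomicReference<Thread> handlerThread = new AtomicReference<>();

        // The handler needs the lock; it blocks only until t releases it.
        inline.dispatch(() -> {
            synchronized (rmLock) {
                handlerThread.set(Thread.currentThread());
            }
        });

        t.join();
        return handlerThread.get() == caller;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("handler ran on caller: " + runsOnCallerAndCompletes(200));
    }
}
```

The caller still contends for the lock, but the wait is one-directional: `t` never waits on the caller, so the wait ends as soon as the simulated shutdown finishes.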