ConfX created YARN-11904:
----------------------------
Summary: TestNMClient has race condition causing intermittent IndexOutOfBoundsException
Key: YARN-11904
URL: https://issues.apache.org/jira/browse/YARN-11904
Project: Hadoop YARN
Issue Type: Bug
Components: test, yarn-client
Affects Versions: 3.3.5
Environment: - OS: macOS Darwin 25.0.0 (also reproducible on other
platforms)
- Java: OpenJDK 1.8
- Maven: 3.6+
- Test: org.apache.hadoop.yarn.client.api.impl.TestNMClient
Reporter: ConfX
## DESCRIPTION:
The TestNMClient test class has a race condition in its @Before setup() method
that causes intermittent test failures with IndexOutOfBoundsException.
The issue occurs because the test fetches NodeManager reports immediately after
starting the YARN cluster, without waiting for NodeManagers to fully register
and transition to RUNNING state. This results in an empty nodeReports list,
which later causes an IndexOutOfBoundsException when the test tries to access
nodeReports.get(0) in the allocateContainers() method.
This is a timing-dependent bug: the test may pass on fast hardware (where NMs
register quickly) but fail on slower systems or under load.
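The failure mode can be illustrated without a cluster. The following is a minimal sketch, not Hadoop code: `EmptyReportsSketch`, `getRunningNodes`, and `earlyGetThrows` are hypothetical names, and a latch forces the ordering that happens only intermittently on a real MiniYARNCluster (the snapshot is taken before the node registers).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class EmptyReportsSketch {

  // Hypothetical stand-in for the ResourceManager's list of RUNNING nodes.
  private final List<String> running =
      Collections.synchronizedList(new ArrayList<String>());

  // Returns a point-in-time copy, like a list of node reports.
  public List<String> getRunningNodes() {
    return new ArrayList<>(running);
  }

  // Takes a snapshot before the simulated NodeManager registers and
  // reports whether get(0) on that snapshot throws.
  public static boolean earlyGetThrows() throws InterruptedException {
    EmptyReportsSketch rm = new EmptyReportsSketch();
    CountDownLatch gate = new CountDownLatch(1);

    // Simulated NodeManager: registers only once the gate opens.
    Thread nodeManager = new Thread(() -> {
      try {
        gate.await();
        rm.running.add("host-1");
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    nodeManager.start();

    // Snapshot taken while registration is still pending.
    List<String> reports = rm.getRunningNodes();

    gate.countDown();
    nodeManager.join();

    // The node is registered now, but the earlier snapshot stays empty.
    try {
      reports.get(0);
      return false;
    } catch (IndexOutOfBoundsException e) {
      return true;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("early get(0) throws: " + earlyGetThrows());
  }
}
```

The point of the sketch is that the list is a snapshot: once it has been fetched empty, later registration does not repair it, so the fetch itself has to wait.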
## ROOT CAUSE:
File:
hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestNMClient.java
Method: setup() (lines 160-182)
Problematic code sequence:
{code:java}
@Before
public void setup() throws YarnException, IOException {
  // start minicluster
  yarnCluster = new MiniYARNCluster(TestAMRMClient.class.getName(),
      nodeCount, 1, 1);
  yarnCluster.init(conf);
  yarnCluster.start(); // ← NodeManagers start asynchronously

  // start rm client
  yarnClient = (YarnClientImpl) YarnClient.createYarnClient();
  yarnClient.init(conf);
  yarnClient.start();

  // get node info
  nodeReports = yarnClient.getNodeReports(NodeState.RUNNING); // ← RACE CONDITION!
  // At this point, NodeManagers may not have registered yet
  // Result: nodeReports is EMPTY

  // ... rest of setup ...
}
{code}
Later in the test, the allocateContainers() method accesses the first element of that list:
{code:java}
String node = nodeReports.get(0).getNodeId().getHost(); // ← IndexOutOfBoundsException!
{code}
## STEPS TO REPRODUCE:
# Check out Apache Hadoop 3.3.5 source code
# Navigate to hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client
# Run the test:
{code}
mvn test -Dtest=TestNMClient#testNMClientNoCleanupOnStop
{code}
# The test may fail intermittently depending on system performance

Note: The failure is timing-dependent. It may pass on fast systems but fail on
slower systems or when the system is under load.
## EXPECTED RESULT:
Test should wait for NodeManagers to register and be in RUNNING state before
fetching node reports. The test should pass consistently regardless of system
performance.
## ACTUAL RESULT:
Test fails with IndexOutOfBoundsException:
{code:java}
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:659)
    at java.util.ArrayList.get(ArrayList.java:435)
    at org.apache.hadoop.yarn.client.api.impl.TestNMClient.allocateContainers(TestNMClient.java:324)
    at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClientNoCleanupOnStop(TestNMClient.java:290)
{code}
## PROPOSED FIX:
Add proper synchronization in the setup() method to wait for NodeManagers
before fetching node reports:
{code:java}
@Before
public void setup() throws YarnException, IOException {
  // start minicluster
  conf = new YarnConfiguration();
  conf.set(YarnConfiguration.NM_CONTAINER_STATE_TRANSITION_LISTENERS,
      DebugSumContainerStateListener.class.getName());
  yarnCluster =
      new MiniYARNCluster(TestAMRMClient.class.getName(), nodeCount, 1, 1);
  yarnCluster.init(conf);
  yarnCluster.start();
  assertNotNull(yarnCluster);
  assertEquals(STATE.STARTED, yarnCluster.getServiceState());

  // Wait for NodeManagers to connect
  try {
    if (!yarnCluster.waitForNodeManagersToConnect(30000)) {
      fail("NodeManagers failed to connect within 30 seconds");
    }
  } catch (InterruptedException e) {
    fail("Interrupted while waiting for NodeManagers: " + e.getMessage());
  }

  // start rm client
  yarnClient = (YarnClientImpl) YarnClient.createYarnClient();
  yarnClient.init(conf);
  yarnClient.start();
  assertNotNull(yarnClient);
  assertEquals(STATE.STARTED, yarnClient.getServiceState());

  // get node info - wait for nodes to be in RUNNING state
  int retries = 10;
  while (retries > 0) {
    nodeReports = yarnClient.getNodeReports(NodeState.RUNNING);
    if (nodeReports != null && !nodeReports.isEmpty()) {
      break;
    }
    sleep(1000);
    retries--;
  }
  if (nodeReports == null || nodeReports.isEmpty()) {
    fail("No NodeManagers in RUNNING state after waiting");
  }

  // ... rest of setup ...
}
{code}
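The retry loop in the proposed fix is an instance of a poll-until-deadline pattern. As a rough, self-contained sketch of that pattern (not part of the patch): `WaitForNodesSketch`, `waitForNonEmpty`, and the simulated supplier below are hypothetical names chosen for illustration, and the intervals are arbitrary.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class WaitForNodesSketch {

  // Polls fetch until it returns a non-empty list or timeoutMs elapses.
  // Hypothetical helper; in a test, fetch would wrap the node-report call.
  public static <T> List<T> waitForNonEmpty(Supplier<List<T>> fetch,
      long intervalMs, long timeoutMs)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      List<T> result = fetch.get();
      if (result != null && !result.isEmpty()) {
        return result;
      }
      // Compare via subtraction so nanoTime overflow is handled correctly.
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException(
            "list still empty after " + timeoutMs + " ms");
      }
      Thread.sleep(intervalMs);
    }
  }

  public static void main(String[] args) throws Exception {
    // Simulated source: empty for the first two polls, then one node.
    AtomicInteger calls = new AtomicInteger();
    Supplier<List<String>> fetch = () -> calls.incrementAndGet() < 3
        ? Collections.<String>emptyList()
        : Arrays.asList("host-1");

    List<String> nodes = waitForNonEmpty(fetch, 10, 1000);
    System.out.println(nodes.get(0) + " after " + calls.get() + " polls");
  }
}
```

Bounding the wait by elapsed time rather than by a retry count keeps the total timeout independent of how long each poll takes. Because getNodeReports declares checked exceptions, a real supplier would have to catch and rethrow them.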
I'm happy to submit a patch for this.