ConfX created YARN-11904:
----------------------------

             Summary: TestNMClient has race condition causing intermittent 
IndexOutOfBoundsException
                 Key: YARN-11904
                 URL: https://issues.apache.org/jira/browse/YARN-11904
             Project: Hadoop YARN
          Issue Type: Bug
          Components: test, yarn-client
    Affects Versions: 3.3.5
         Environment: - OS: macOS Darwin 25.0.0 (also reproducible on other 
platforms)
- Java: OpenJDK 1.8
- Maven: 3.6+
- Test: org.apache.hadoop.yarn.client.api.impl.TestNMClient
            Reporter: ConfX


## DESCRIPTION:

The TestNMClient test class has a race condition in its @Before setup() method
that causes intermittent test failures with IndexOutOfBoundsException.

The issue occurs because the test fetches NodeManager reports immediately after
starting the YARN cluster, without waiting for NodeManagers to fully register
and transition to RUNNING state. This results in an empty nodeReports list,
which later causes an IndexOutOfBoundsException when the test tries to access
nodeReports.get(0) in the allocateContainers() method.

The bug is timing-dependent: the test may pass on fast hardware (where NMs
register quickly) but fail on slower systems or under load.

 

## ROOT CAUSE:

File: hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestNMClient.java
Method: setup() (lines 160-182)

Problematic code sequence:

 
{code:java}
@Before
public void setup() throws YarnException, IOException {
    // start minicluster
    yarnCluster = new MiniYARNCluster(TestAMRMClient.class.getName(), nodeCount, 1, 1);
    yarnCluster.init(conf);
    yarnCluster.start();              // ← NodeManagers start asynchronously

    // start rm client
    yarnClient = (YarnClientImpl) YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // get node info
    nodeReports = yarnClient.getNodeReports(NodeState.RUNNING);  // ← RACE CONDITION!
    // At this point, NodeManagers may not have registered yet
    // Result: nodeReports is EMPTY

    // ... rest of setup ...
}
{code}
Later in the test, the allocateContainers() method tries to access the first element:
{code:java}
String node = nodeReports.get(0).getNodeId().getHost();  // ← IndexOutOfBoundsException!
{code}
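
The failure mode can be reproduced without a cluster. The following is a minimal sketch, not Hadoop code: AsyncRegistrationDemo and all of its members are hypothetical names standing in for MiniYARNCluster and the node report query. Latches make the ordering deterministic, so the "too early" read always sees an empty list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for a cluster whose nodes register on a background thread.
class AsyncRegistrationDemo {
    // Stands in for the ResourceManager's list of RUNNING nodes.
    private final List<String> runningNodes = new CopyOnWriteArrayList<>();
    // Released by the caller to let registration proceed (models the startup delay).
    private final CountDownLatch allowRegistration = new CountDownLatch(1);
    // Counted down once the node has registered.
    private final CountDownLatch registered = new CountDownLatch(1);

    // Returns immediately; the node registers later on another thread.
    void start() {
        Thread worker = new Thread(() -> {
            try {
                allowRegistration.await();
                runningNodes.add("node-0");
                registered.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // Returns a point-in-time copy, like a one-off node report query.
    List<String> getRunningNodes() {
        return new ArrayList<>(runningNodes);
    }

    // Lets the background registration finish and waits for it.
    void finishRegistration() {
        allowRegistration.countDown();
        try {
            registered.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        }
    }

    // True if indexing into a snapshot taken right after start() throws.
    static boolean earlySnapshotFails() {
        AsyncRegistrationDemo cluster = new AsyncRegistrationDemo();
        cluster.start();
        List<String> snapshot = cluster.getRunningNodes(); // too early: empty
        cluster.finishRegistration();                      // node is now registered
        try {
            snapshot.get(0);                               // stale snapshot is still empty
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }
}
```

The snapshot stays empty even after the node registers, which mirrors how nodeReports is assigned once in setup() and never refreshed before allocateContainers() reads it.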
## STEPS TO REPRODUCE:
 # Check out the Apache Hadoop 3.3.5 source code
 # Navigate to hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client
 # Run the test:
{code:bash}
mvn test -Dtest=TestNMClient#testNMClientNoCleanupOnStop
{code}
 # The test may fail intermittently depending on system performance

Note: The failure is timing-dependent. The test may pass on fast systems but
fail on slower systems or when the system is under load.

## EXPECTED RESULT:

Test should wait for NodeManagers to register and be in RUNNING state before
fetching node reports. The test should pass consistently regardless of system
performance.

 

## ACTUAL RESULT:

Test fails with IndexOutOfBoundsException:
{code:java}
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:659)
    at java.util.ArrayList.get(ArrayList.java:435)
    at org.apache.hadoop.yarn.client.api.impl.TestNMClient.allocateContainers(TestNMClient.java:324)
    at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClientNoCleanupOnStop(TestNMClient.java:290)
{code}
## PROPOSED FIX:

Add proper synchronization in the setup() method to wait for NodeManagers
before fetching node reports:
{code:java}
@Before
public void setup() throws YarnException, IOException {
    // start minicluster
    conf = new YarnConfiguration();
    conf.set(YarnConfiguration.NM_CONTAINER_STATE_TRANSITION_LISTENERS,
        DebugSumContainerStateListener.class.getName());
    yarnCluster =
        new MiniYARNCluster(TestAMRMClient.class.getName(), nodeCount, 1, 1);
    yarnCluster.init(conf);
    yarnCluster.start();
    assertNotNull(yarnCluster);
    assertEquals(STATE.STARTED, yarnCluster.getServiceState());

    // Wait for NodeManagers to connect
    try {
        if (!yarnCluster.waitForNodeManagersToConnect(30000)) {
            fail("NodeManagers failed to connect within 30 seconds");
        }
    } catch (InterruptedException e) {
        fail("Interrupted while waiting for NodeManagers: " + e.getMessage());
    }

    // start rm client
    yarnClient = (YarnClientImpl) YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();
    assertNotNull(yarnClient);
    assertEquals(STATE.STARTED, yarnClient.getServiceState());

    // get node info - wait for nodes to be in RUNNING state
    int retries = 10;
    while (retries > 0) {
        nodeReports = yarnClient.getNodeReports(NodeState.RUNNING);
        if (nodeReports != null && !nodeReports.isEmpty()) {
            break;
        }
        sleep(1000);
        retries--;
    }
    if (nodeReports == null || nodeReports.isEmpty()) {
        fail("No NodeManagers in RUNNING state after waiting");
    }

    // ... rest of setup ...
}
{code}
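
The retry loop above could also be factored into a small deadline-based helper. The following is a rough sketch under stated assumptions, not part of the proposed patch: NodeReportPoller and pollUntilNonEmpty are hypothetical names, and the helper takes a Callable so that a fetch throwing checked exceptions (as getNodeReports does) can be passed as a lambda.

```java
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical helper: polls a fetch function until it returns a non-empty list.
class NodeReportPoller {
    // Returns the first non-empty result, or throws IllegalStateException
    // if the fetch fails, the thread is interrupted, or the timeout elapses.
    static <T> List<T> pollUntilNonEmpty(Callable<List<T>> fetch,
                                         long timeoutMs, long intervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            List<T> result;
            try {
                result = fetch.call();
            } catch (Exception e) {
                throw new IllegalStateException("Fetch failed", e);
            }
            if (result != null && !result.isEmpty()) {
                return result;
            }
            // Overflow-safe comparison against the nanoTime deadline.
            if (System.nanoTime() - deadline >= 0) {
                throw new IllegalStateException(
                    "No non-empty result within " + timeoutMs + " ms");
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while polling", e);
            }
        }
    }
}
```

With such a helper, setup() might read nodeReports = NodeReportPoller.pollUntilNonEmpty(() -> yarnClient.getNodeReports(NodeState.RUNNING), 30000, 500); the timeout and interval values here are illustrative only. A deadline bounds the total wait regardless of how long each fetch takes, whereas a fixed retry count does not.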
I'm happy to submit a patch for this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
