Shilun Fan created YARN-11932:
---------------------------------
Summary: Fix TestYarnFederationWithFairScheduler timeout caused by
shared NodeLabel storage
Key: YARN-11932
URL: https://issues.apache.org/jira/browse/YARN-11932
Project: Hadoop YARN
Issue Type: Bug
Components: router
Affects Versions: 3.5.1
Reporter: Shilun Fan
Assignee: Shilun Fan
*Problem*
TestYarnFederationWithFairScheduler#testMetricsInfo intermittently times out
during test execution.
The root cause is that multiple test subclusters share the same NodeLabel
storage directory ({{/tmp/hadoop-yarn-$USER/node-labels}}) by default. When
tests run sequentially, residual editlog entries containing "delete default
label" operations from previous tests cause the ResourceManager to fail during
startup recovery with the error:
{code:java}
Node label=default to be removed doesn't existed in cluster node labels collection
{code}
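The failure mode can be illustrated with a toy replay loop. This is only a sketch: the class name, the "add:"/"remove:" operation strings, and the in-memory label set are hypothetical stand-ins, not the real YARN node-label store or its editlog format. It shows why a leftover "remove" entry fails when replayed against a store that does not contain the label.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model only: NOT the real YARN node-label store or editlog format.
class EditLogReplaySketch {

    // Replays hypothetical "add:<label>" / "remove:<label>" entries in order.
    static Set<String> replay(List<String> editLog) {
        Set<String> labels = new LinkedHashSet<>();
        for (String op : editLog) {
            String[] parts = op.split(":", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed entry: " + op);
            }
            String label = parts[1];
            switch (parts[0]) {
                case "add":
                    labels.add(label);
                    break;
                case "remove":
                    // Removing a label that is not present aborts the replay,
                    // mirroring the error text quoted in this issue.
                    if (!labels.remove(label)) {
                        throw new IllegalStateException("Node label=" + label
                            + " to be removed doesn't existed in cluster node labels collection");
                    }
                    break;
                default:
                    throw new IllegalArgumentException("Unknown op: " + op);
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // A log written and replayed by the same run is consistent.
        System.out.println(replay(Arrays.asList("add:default", "remove:default")));

        // A residual entry from an earlier run has no matching "add".
        try {
            replay(Arrays.asList("remove:default"));
        } catch (IllegalStateException e) {
            System.out.println("Recovery failed: " + e.getMessage());
        }
    }
}
```

In this model the second replay fails on the first entry, which is the same shape of failure as the ResourceManager startup error above.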
*Solution*
Set an isolated NodeLabel storage directory for each subcluster startup to
avoid reusing old editlog files.
In {{TestMockSubCluster.java}}, configure a unique directory per subcluster
using:
* GenericTestUtils.getTestDir() to create test-specific directories
* Directory naming pattern: {{node-labels-{subClusterId}-{timestamp}}}
* Configuration key: {{YarnConfiguration.FS_NODE_LABELS_STORE_ROOT_DIR}}
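A minimal sketch of the three points above, using only the JDK so it runs without Hadoop on the classpath. It is not the actual patch: the class and method names are hypothetical, a plain Map stands in for YarnConfiguration, and the caller-supplied base directory stands in for the result of GenericTestUtils.getTestDir(). The property string is assumed to be the value behind YarnConfiguration.FS_NODE_LABELS_STORE_ROOT_DIR.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Sketch only: a Map stands in for YarnConfiguration, and baseTestDir
// stands in for the directory returned by GenericTestUtils.getTestDir().
class NodeLabelDirSketch {

    // Assumed value of YarnConfiguration.FS_NODE_LABELS_STORE_ROOT_DIR.
    static final String FS_NODE_LABELS_STORE_ROOT_DIR =
        "yarn.node-labels.fs-store.root-dir";

    // Builds node-labels-{subClusterId}-{timestamp} under the test directory.
    static Path isolatedNodeLabelDir(Path baseTestDir, String subClusterId,
                                     long timestamp) {
        return baseTestDir.resolve("node-labels-" + subClusterId + "-" + timestamp);
    }

    // Creates the directory and records it in the per-subcluster configuration.
    static Map<String, String> configure(Map<String, String> conf, Path baseTestDir,
                                         String subClusterId, long timestamp) {
        Path dir = isolatedNodeLabelDir(baseTestDir, subClusterId, timestamp);
        try {
            Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        conf.put(FS_NODE_LABELS_STORE_ROOT_DIR, dir.toAbsolutePath().toString());
        return conf;
    }

    public static void main(String[] args) {
        Path base = Paths.get(System.getProperty("java.io.tmpdir"), "federation-test");
        long now = System.currentTimeMillis();
        // Each subcluster gets its own store, so no editlog is shared.
        for (String id : new String[] {"SC-1", "SC-2"}) {
            Map<String, String> conf =
                configure(new HashMap<String, String>(), base, id, now);
            System.out.println(id + " -> " + conf.get(FS_NODE_LABELS_STORE_ROOT_DIR));
        }
    }
}
```

Including both the subcluster ID and a timestamp in the name keeps directories distinct across subclusters within one run and across repeated runs on the same machine.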
*Test Results*
After the fix, all 38 tests in TestYarnFederationWithFairScheduler pass
successfully:
* Tests run: 38, Failures: 0, Errors: 0, Skipped: 0