[ https://issues.apache.org/jira/browse/YARN-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18012704#comment-18012704 ]
ASF GitHub Bot commented on YARN-11844:
---------------------------------------

cnauroth commented on code in PR #7857:
URL: https://github.com/apache/hadoop/pull/7857#discussion_r2261601292


##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/TestGpuDiscoverer.java:
##########
@@ -297,6 +297,36 @@ public void testGetGpuDeviceInformationFaultyNvidiaSmiScriptConsecutiveRun()
     assertNotNull(discoverer.getGpusUsableByYarn());
   }
 
+  @Test
+  public void testGetGpuDeviceInformationDisableMaxErrors()
+      throws YarnException, IOException {
+    Configuration conf = new Configuration(false);
+    // A negative value should disable max errors enforcement.
+    conf.setInt(YarnConfiguration.NM_GPU_DISCOVERY_MAX_ERRORS, -1);
+
+    File fakeBinary = createFakeNvidiaSmiScriptAsRunnableFile(
+        this::createFaultyNvidiaSmiScript);
+
+    GpuDiscoverer discoverer = creatediscovererWithGpuPathDefined(conf);
+    assertEquals(fakeBinary.getAbsolutePath(),
+        discoverer.getPathOfGpuBinary());
+    assertNull(discoverer.getEnvironmentToRunCommand().get(PATH));
+
+    final String terminateMsg = "Failed to execute GPU device " +
+        "detection script (" + fakeBinary.getAbsolutePath() + ") for 10 times";
+    final String msg = "Failed to execute GPU device detection script";
+
+    // The default max errors is 10. Verify that it keeps going for an 11th try.
+    for (int i = 0; i < 11; ++i) {

Review Comment:
   This test is covering the case where you disable the max errors by setting a negative value. To make this clearer, I dialed it up to 20 attempts, and I also added another test that sets the configuration to 11 and confirms it tries exactly 11 times.

##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml:
##########
@@ -4650,6 +4650,34 @@
     <value></value>
   </property>
 
+  <property>
+    <description>
+      Sets the maximum duration for executions of the discovery binary defined in
+      yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables. If
+      the binary takes longer than this amount of time to run, then the process
+      is aborted. Discovery may be attempted again, depending on
+      yarn.nodemanager.resource-plugins.gpu.discovery-max-errors.
+    </description>
+    <name>yarn.nodemanager.resource-plugins.gpu.discovery-timeout</name>
+    <value>10000ms</value>

Review Comment:
   Sounds good, updated.



> Support configuration of retry policy on GPU discovery
> ------------------------------------------------------
>
>                 Key: YARN-11844
>                 URL: https://issues.apache.org/jira/browse/YARN-11844
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: gpu, nodemanager
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>            Priority: Major
>              Labels: pull-request-available
>
> The NodeManager invokes an external binary (e.g. {{nvidia-smi}}) to discover
> attached GPUs. Right now, there is a hard-coded 10-second timeout on
> execution of this binary and a hard-coded max error count of 10, beyond which
> the NodeManager will stop attempting discovery. This change will provide new
> configuration properties to control both the timeout and the max errors,
> which is useful in environments where there may be a delay in binding the GPU
> to the host. Default values for the new configuration properties will be set
> so as to maintain the existing behavior.
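
For illustration only, the retry policy described in the issue (a per-execution timeout plus a cap on consecutive errors, where a negative cap disables enforcement) might be sketched in plain Java as follows. This is not the NodeManager's GpuDiscoverer code. The class DiscoveryRetrySketch, its methods, and the choice to reset the error count after a success are all assumptions made for this sketch.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a discovery retry policy; not Hadoop code. */
final class DiscoveryRetrySketch {
  private final int maxErrors;     // assumption: a negative value disables the limit
  private final Duration timeout;  // upper bound on a single discovery execution
  private int consecutiveErrors = 0;

  DiscoveryRetrySketch(int maxErrors, Duration timeout) {
    this.maxErrors = maxErrors;
    this.timeout = timeout;
  }

  /** Runs one discovery attempt, or refuses once the error limit is reached. */
  String discover(Callable<String> probe) throws Exception {
    if (maxErrors >= 0 && consecutiveErrors >= maxErrors) {
      throw new IllegalStateException(
          "Giving up after " + consecutiveErrors + " failed attempts");
    }
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<String> result = executor.submit(probe);
      // Wait no longer than the configured timeout for the probe to finish.
      String output = result.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
      consecutiveErrors = 0;  // assumption: a success resets the error count
      return output;
    } catch (TimeoutException | ExecutionException e) {
      consecutiveErrors++;    // a timeout and a probe failure both count as errors
      throw e;
    } finally {
      executor.shutdownNow(); // interrupts a probe that is still running
    }
  }

  /** Counts how often an always-failing probe is actually executed. */
  static int countAttempts(int maxErrors, int tries) {
    DiscoveryRetrySketch sketch =
        new DiscoveryRetrySketch(maxErrors, Duration.ofSeconds(10));
    AtomicInteger executed = new AtomicInteger();
    for (int i = 0; i < tries; i++) {
      try {
        sketch.discover(() -> {
          executed.incrementAndGet();
          throw new IOException("simulated discovery failure");
        });
      } catch (Exception expected) {
        // Either the probe failed or the error limit was already reached.
      }
    }
    return executed.get();
  }
}
```

Under these assumptions, countAttempts(10, 20) executes the probe 10 times, countAttempts(11, 20) executes it 11 times, and countAttempts(-1, 20) executes it all 20 times, which mirrors the three cases discussed in the review comments.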