[ 
https://issues.apache.org/jira/browse/YARN-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18012704#comment-18012704
 ] 

ASF GitHub Bot commented on YARN-11844:
---------------------------------------

cnauroth commented on code in PR #7857:
URL: https://github.com/apache/hadoop/pull/7857#discussion_r2261601292


##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/TestGpuDiscoverer.java:
##########
@@ -297,6 +297,36 @@ public void testGetGpuDeviceInformationFaultyNvidiaSmiScriptConsecutiveRun()
     assertNotNull(discoverer.getGpusUsableByYarn());
   }
 
+  @Test
+  public void testGetGpuDeviceInformationDisableMaxErrors()
+      throws YarnException, IOException {
+    Configuration conf = new Configuration(false);
+    // A negative value should disable max errors enforcement.
+    conf.setInt(YarnConfiguration.NM_GPU_DISCOVERY_MAX_ERRORS, -1);
+
+    File fakeBinary = createFakeNvidiaSmiScriptAsRunnableFile(
+        this::createFaultyNvidiaSmiScript);
+
+    GpuDiscoverer discoverer = creatediscovererWithGpuPathDefined(conf);
+    assertEquals(fakeBinary.getAbsolutePath(),
+        discoverer.getPathOfGpuBinary());
+    assertNull(discoverer.getEnvironmentToRunCommand().get(PATH));
+
+    final String terminateMsg = "Failed to execute GPU device " +
+        "detection script (" + fakeBinary.getAbsolutePath() + ") for 10 times";
+    final String msg = "Failed to execute GPU device detection script";
+
+    // The default max errors is 10. Verify that it keeps going for an 11th try.
+    for (int i = 0; i < 11; ++i) {

Review Comment:
   This test covers the case where you disable max errors enforcement by setting
a negative value. To make this clearer, I dialed it up to 20 attempts, and I also
added another test that sets the configuration to 11 and confirms it tries
exactly 11 times.
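
For readers following along, the semantics those two tests exercise can be sketched roughly as below. This is a standalone illustration, not the `GpuDiscoverer` code: the class `DiscoveryRetrySketch` and its methods are hypothetical names, and it assumes the limit counts failed attempts and that every attempt fails (as with the faulty fake script).

```java
/**
 * Hypothetical sketch of the max-errors semantics discussed above.
 * Not the actual GpuDiscoverer implementation.
 */
public class DiscoveryRetrySketch {

  /** Returns true if another attempt is allowed after errorCount failures. */
  public static boolean mayAttempt(int maxErrors, int errorCount) {
    // Assumption: a negative limit disables enforcement entirely.
    return maxErrors < 0 || errorCount < maxErrors;
  }

  /**
   * Simulates a discovery binary that always fails and returns how many
   * attempts actually ran out of attemptsRequested.
   */
  public static int countAttempts(int maxErrors, int attemptsRequested) {
    int errors = 0;
    int attempts = 0;
    for (int i = 0; i < attemptsRequested; i++) {
      if (!mayAttempt(maxErrors, errors)) {
        break; // limit reached; stop attempting discovery
      }
      attempts++;
      errors++; // every simulated attempt fails
    }
    return attempts;
  }

  public static void main(String[] args) {
    System.out.println(countAttempts(-1, 20)); // negative limit: all 20 run
    System.out.println(countAttempts(11, 20)); // limit of 11: exactly 11 run
  }
}
```

With a negative limit all 20 requested attempts run; with a limit of 11 the loop stops after the 11th failure.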



##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml:
##########
@@ -4650,6 +4650,34 @@
     <value></value>
   </property>
 
+  <property>
+    <description>
+      Sets the maximum duration for executions of the discovery binary defined in
+      yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables. If
+      the binary takes longer than this amount of time to run, then the process
+      is aborted. Discovery may be attempted again, depending on
+      yarn.nodemanager.resource-plugins.gpu.discovery-max-errors.
+    </description>
+    <name>yarn.nodemanager.resource-plugins.gpu.discovery-timeout</name>
+    <value>10000ms</value>

Review Comment:
   Sounds good, updated.





> Support configuration of retry policy on GPU discovery
> ------------------------------------------------------
>
>                 Key: YARN-11844
>                 URL: https://issues.apache.org/jira/browse/YARN-11844
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: gpu, nodemanager
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>            Priority: Major
>              Labels: pull-request-available
>
> The NodeManager invokes an external binary (e.g. {{nvidia-smi}}) to discover 
> attached GPUs. Right now, there is a hard-coded 10-second timeout on 
> execution of this binary and a hard-coded max error count of 10, beyond which 
> the NodeManager will stop attempting discovery. This change will provide new 
> configuration properties to control both the timeout and the max errors, 
> which is useful in environments where there may be a delay in binding the GPU 
> to the host. Default values for the new configuration properties will be set 
> so as to maintain the existing behavior.


