Chris Nauroth created YARN-11844:
------------------------------------

             Summary: Support configuration of retry policy on GPU discovery
                 Key: YARN-11844
                 URL: https://issues.apache.org/jira/browse/YARN-11844
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: gpu, nodemanager
            Reporter: Chris Nauroth
            Assignee: Chris Nauroth


The NodeManager invokes an external binary (e.g. {{nvidia-smi}}) to discover 
attached GPUs. Currently, execution of this binary has a hard-coded 10-second 
timeout and a hard-coded maximum error count of 10, beyond which the 
NodeManager stops attempting discovery. This change will provide new 
configuration properties to control both the timeout and the max errors, which 
is useful in environments where there may be a delay in binding the GPU to the 
host. Default values for the new configuration properties will be set so as to 
maintain the existing behavior.
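
As a rough illustration of the behavior described above, the following Java 
sketch reads a timeout and a max error count from a plain map, with defaults 
of 10 seconds and 10 errors. It is not the actual NodeManager code. The class 
name, the two property names, and the use of a Map in place of a Hadoop 
Configuration are all assumptions made for this example, and the issue does 
not state whether errors are counted in total or consecutively, so this 
sketch simply counts every failure.

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; not the NodeManager implementation.
final class GpuDiscoveryRetrySketch {
  // Hypothetical property names, invented for this example.
  static final String TIMEOUT_MS_KEY = "example.gpu-discovery.timeout-ms";
  static final String MAX_ERRORS_KEY = "example.gpu-discovery.max-errors";

  // Defaults mirror the hard-coded values described in the issue.
  static final long DEFAULT_TIMEOUT_MS = 10_000L;
  static final int DEFAULT_MAX_ERRORS = 10;

  final long timeoutMs;
  final int maxErrors;
  private int errorCount = 0;

  GpuDiscoveryRetrySketch(Map<String, String> conf) {
    this.timeoutMs = Long.parseLong(
        conf.getOrDefault(TIMEOUT_MS_KEY, Long.toString(DEFAULT_TIMEOUT_MS)));
    this.maxErrors = Integer.parseInt(
        conf.getOrDefault(MAX_ERRORS_KEY, Integer.toString(DEFAULT_MAX_ERRORS)));
  }

  // Assumption: discovery stops once the error count reaches the maximum.
  boolean shouldAttempt() {
    return errorCount < maxErrors;
  }

  void recordFailure() {
    errorCount++;
  }

  // Runs the discovery command once, bounded by the configured timeout.
  // Returns true only if the process exits with status 0 in time.
  boolean attempt(List<String> command) throws InterruptedException {
    if (!shouldAttempt()) {
      return false;
    }
    try {
      Process p = new ProcessBuilder(command)
          .redirectErrorStream(true)
          .redirectOutput(ProcessBuilder.Redirect.DISCARD)
          .start();
      if (!p.waitFor(timeoutMs, TimeUnit.MILLISECONDS)) {
        p.destroyForcibly(); // timed out: kill the process, count an error
        recordFailure();
        return false;
      }
      if (p.exitValue() != 0) {
        recordFailure();
        return false;
      }
      return true;
    } catch (IOException e) {
      recordFailure(); // e.g. the binary could not be started
      return false;
    }
  }
}
```

With an empty map the sketch keeps the 10-second timeout and the limit of 10 
errors, matching the stated goal of preserving existing behavior by default. 
An environment where the GPU binds late could raise either value through the 
two (hypothetical) properties.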


