Chris Nauroth created YARN-11844:
------------------------------------

             Summary: Support configuration of retry policy on GPU discovery
                 Key: YARN-11844
                 URL: https://issues.apache.org/jira/browse/YARN-11844
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: gpu, nodemanager
            Reporter: Chris Nauroth
            Assignee: Chris Nauroth

The NodeManager invokes an external binary (e.g. {{nvidia-smi}}) to discover attached GPUs. Currently, execution of this binary has a hard-coded 10-second timeout, and there is a hard-coded maximum error count of 10, beyond which the NodeManager stops attempting discovery.

This change adds new configuration properties to control both the timeout and the maximum error count, which is useful in environments where there may be a delay in binding the GPU to the host. The default values of the new properties will preserve the existing behavior.
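
To illustrate the intended behavior, here is a minimal standalone sketch in plain Java. It is not the NodeManager code and does not use the Hadoop {{Configuration}} API. The property names ({{example.gpu-discovery.timeout-ms}}, {{example.gpu-discovery.max-errors}}), the class name, and the helper methods are hypothetical placeholders, not the names this change will introduce. Resetting the error count after a successful run is also an assumption of the sketch. Only the defaults (10 seconds, 10 errors) come from the description above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class GpuDiscoveryRetrySketch {
    // Hypothetical property names, used only for this illustration.
    static final String TIMEOUT_KEY = "example.gpu-discovery.timeout-ms";
    static final String MAX_ERRORS_KEY = "example.gpu-discovery.max-errors";
    // Defaults mirror the currently hard-coded values: 10 seconds, 10 errors.
    static final long DEFAULT_TIMEOUT_MS = 10_000L;
    static final int DEFAULT_MAX_ERRORS = 10;

    /** One discovery attempt; throws on timeout or any other failure. */
    @FunctionalInterface
    interface DiscoveryCommand {
        String run(long timeoutMs) throws Exception;
    }

    final long timeoutMs;
    final int maxErrors;
    int errorCount = 0;

    GpuDiscoveryRetrySketch(Map<String, String> conf) {
        this.timeoutMs = Long.parseLong(
            conf.getOrDefault(TIMEOUT_KEY, Long.toString(DEFAULT_TIMEOUT_MS)));
        this.maxErrors = Integer.parseInt(
            conf.getOrDefault(MAX_ERRORS_KEY, Integer.toString(DEFAULT_MAX_ERRORS)));
    }

    /** Returns the command output, or empty on failure or once the error limit is reached. */
    Optional<String> discover(DiscoveryCommand command) {
        if (errorCount >= maxErrors) {
            return Optional.empty(); // limit reached: stop attempting discovery
        }
        try {
            String output = command.run(timeoutMs);
            errorCount = 0; // assumption: a success resets the error count
            return Optional.of(output);
        } catch (Exception e) {
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt(); // preserve interrupt status
            }
            errorCount++;
            return Optional.empty();
        }
    }

    /** Runs an external binary, enforcing the timeout on its execution. */
    static String runCommand(long timeoutMs, String... cmd) throws Exception {
        // Output goes to a temp file so a chatty process cannot block on a full pipe.
        Path out = Files.createTempFile("gpu-discovery", ".out");
        try {
            Process p = new ProcessBuilder(cmd)
                .redirectErrorStream(true)
                .redirectOutput(out.toFile())
                .start();
            if (!p.waitFor(timeoutMs, TimeUnit.MILLISECONDS)) {
                p.destroyForcibly();
                throw new TimeoutException("no result after " + timeoutMs + " ms");
            }
            if (p.exitValue() != 0) {
                throw new IOException("exit code " + p.exitValue());
            }
            return Files.readString(out);
        } finally {
            Files.deleteIfExists(out);
        }
    }

    public static void main(String[] args) {
        // A host where GPU binding is slow might raise both limits.
        GpuDiscoveryRetrySketch discoverer = new GpuDiscoveryRetrySketch(
            Map.of(TIMEOUT_KEY, "30000", MAX_ERRORS_KEY, "20"));
        Optional<String> gpus = discoverer.discover(t -> runCommand(t, "nvidia-smi", "-L"));
        System.out.println(gpus.orElse("discovery failed, errors so far: " + discoverer.errorCount));
    }
}
```

With an empty configuration map the sketch falls back to the 10-second timeout and the limit of 10 errors, which matches the stated goal of keeping the existing behavior by default.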