[
https://issues.apache.org/jira/browse/YARN-9217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767203#comment-16767203
]
Peter Bacsko commented on YARN-9217:
------------------------------------
Minor comments:
1. Do we need a separate variable here?
{noformat}
70 if (usableGpus.isEmpty()) {
71 String message = "GPU is enabled on the NodeManager, but couldn't
find "
72 + "any usable GPU devices, please double check
configuration.";
73 LOG.warn(message);
{noformat}
2. Similar thing in GpuNodeResourceUpdateHandler
{noformat}
if (usableGpus.isEmpty()) {
String message = "GPU is enabled, but couldn't find any usable GPUs on the "
+ "NodeManager.";
LOG.warn(message);
{noformat}
3. I would rename {{checkErrorNumber()}} to {{checkErrorCount()}}
4. By the way -- is it reasonable to perform GPU discovery in a loop? What's
the idea here? Is "nvidia-smi" flaky sometimes? What condition are we trying to
avoid? I realized that this part of the code existed before, but still...
anyone? :)
5. {{NvidiaBinaryHelper}} - {{@returns}} clause is missing in the JavaDoc
6. {{NvidiaBinaryHelper}} - this class is very small. If it's introduced for
testing purposes, I strongly recommend using a replaceable lamba function, like
this:
{noformat}
Function<String, Optional<GpuDeviceInformation>> gpuDeviceRetriever =
this::getGpuDeviceInformation;
...
@VisibleForTesting
void setGpuDeviceRetriever(Function<String, Optional<GpuDeviceInformation>>
func) {
this.gpuDeviceRetriever = func;
}
...
lastDiscoveredGpuInformation = gpuDeviceRetriever.apply(pathOfGpuBinary);
{noformat}
Then you can set your own retrieving logic in the test. Lambdas can't throw
exceptions, so you have to wrap incorrect return values in {{Optional}}.
*Fundamental question*: is this the way how we want to use thig plugin? Just
asking because we might accidentally mask erratic behavior. Eg. a Hadoop user
might think that he has a cluster with 10 GPUs. In reality, the plugin failed
to detect some cards, and only 5 NMs support GPU scheduling. If it's not
explicitly displayed, the user might be under the impression that 10 GPUs are
ready to run YARN workloads. This can be very misleading.
At the very least, a fail-fast method should be considered.
> Nodemanager will fail to start if GPU is misconfigured on the node or GPU
> drivers missing
> -----------------------------------------------------------------------------------------
>
> Key: YARN-9217
> URL: https://issues.apache.org/jira/browse/YARN-9217
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Affects Versions: 3.0.0, 3.1.0
> Reporter: Antal Bálint Steinbach
> Assignee: Antal Bálint Steinbach
> Priority: Major
> Attachments: YARN-9217.001.patch, YARN-9217.002.patch,
> YARN-9217.003.patch, YARN-9217.004.patch
>
>
> Nodemanager will not start
> 1. If Autodiscovery is enabled:
> * If nvidia-smi path is misconfigured or the file does not exist.
> * There is 0 GPU found
> * If the file exists but it is not pointing to an nvidia-smi
> * if the binary is ok but there is an IOException
> 2. If the manually configured GPU devices are misconfigured
> * Any index:minor number format failure will cause a problem
> * 0 configured device will cause a problem
> * NumberFormatException is not handled
> It would be a better option to add warnings about the configuration, set 0
> available GPUs and let the node work and run non-gpu jobs.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]