[
https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025728#comment-17025728
]
Szilard Nemeth commented on YARN-10107:
---------------------------------------
Thanks [~prabhujoseph].
> Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery
> binary even if auto discovery is turned off
> -------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-10107
> URL: https://issues.apache.org/jira/browse/YARN-10107
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Szilard Nemeth
> Assignee: Szilard Nemeth
> Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-10107.001.patch, nm-config-afterchange-gpu.xml,
> nm-config-beforechange-gpu.xml.xml,
> request-response-afterchange-with-autodiscovery.txt,
> request-response-afterchange.txt, request-response-beforechange.txt
>
>
> During internal end-to-end testing, I found the following issue:
> Configuration:
> - GPU is enabled
> - yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables is set
> to "/usr/bin/ls" (any existing, valid binary file would do)
> - yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to
> "0:0,1:1,2:2", so auto-discovery is turned off.
> If the REST endpoint
> [http://quasar-tsjqpq-3.vpc.cloudera.com:8042/ws/v1/node/resources/yarn.io%2Fgpu]
> is called, the following exception is thrown in NM:
> {code:java}
> 2020-01-23 07:55:24,803 ERROR
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin:
> Failed to find GPU discovery executable, please double check
> yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
> org.apache.hadoop.yarn.exceptions.YarnException: Failed to find GPU discovery
> executable, please double check
> yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper.getGpuDeviceInformation(NvidiaBinaryHelper.java:54)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer.getGpuDeviceInformation(GpuDiscoverer.java:125)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin.getNMResourceInfo(GpuResourcePlugin.java:104)
> at
> org.apache.hadoop.yarn.server.nodemanager.webapp.NMWebServices.getNMResourceInfo(NMWebServices.java:515)
> {code}
> *Let's break this down:*
> 1.
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin#getNMResourceInfo
> simply makes this call:
> {code:java}
> gpuDeviceInformation = gpuDiscoverer.getGpuDeviceInformation();
> {code}
> 2. In
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#getGpuDeviceInformation,
> the following code calls NvidiaBinaryHelper.getGpuDeviceInformation:
> {code:java}
> try {
>   lastDiscoveredGpuInformation =
>       nvidiaBinaryHelper.getGpuDeviceInformation(pathOfGpuBinary);
> } catch (IOException e) {
> {code}
> 3.
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper#getGpuDeviceInformation
> is where the exception is finally thrown.
> This only happens if the parameter called "pathOfGpuBinary" is null.
> Since this method is only called from GpuDiscoverer#getGpuDeviceInformation,
> which passes its "pathOfGpuBinary" field as the only parameter, we can be
> sure that this field is null whenever we see the exception.
> 4. The only code path that sets the "pathOfGpuBinary" field is this call
> chain:
> {code:java}
> GpuDiscoverer.lookUpAutoDiscoveryBinary(Configuration)
> (org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
> GpuDiscoverer.initialize(Configuration, NvidiaBinaryHelper)
> (org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
> {code}
> 5. GpuDiscoverer#initialize contains this code:
> {code:java}
> if (isAutoDiscoveryEnabled()) {
>   numOfErrorExecutionSinceLastSucceed = 0;
>   lookUpAutoDiscoveryBinary(config);
>   ....
> {code}
> so
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#pathOfGpuBinary
> is set ONLY IF auto discovery is enabled.
> Since our tests don't have auto discovery enabled, we get this exception.
> In this sense, I find the exception message very misleading:
> {code:java}
> Failed to find GPU discovery executable, please double check
> yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
> {code}
>
> Related jira: https://issues.apache.org/jira/browse/YARN-9337
> I think this exception message is very misleading and, of course, it does not
> make any sense at all to try to execute the discovery binary when auto
> discovery is turned off.
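
The breakdown in the description can be condensed into a small, self-contained sketch. It is only a simplified model of the flow and not the actual YARN code: the class GpuDiscovererSketch, both method names, and the use of Optional and IllegalStateException are assumptions made for illustration, and the guard shown is just one possible way to avoid the problem, not necessarily what YARN-10107.001.patch does.

```java
import java.util.Optional;

/**
 * Hypothetical, simplified model of the flow described in the issue.
 * None of these names are the real YARN classes or methods.
 */
class GpuDiscovererSketch {
    private final boolean autoDiscoveryEnabled;
    // Mirrors the "pathOfGpuBinary" field: it is only set when auto
    // discovery is enabled, so it stays null otherwise.
    private String pathOfGpuBinary;

    GpuDiscovererSketch(boolean autoDiscoveryEnabled, String configuredBinaryPath) {
        this.autoDiscoveryEnabled = autoDiscoveryEnabled;
        if (autoDiscoveryEnabled) {
            this.pathOfGpuBinary = configuredBinaryPath;
        }
    }

    /** Models the reported behaviour: the binary path is used unconditionally. */
    String getGpuDeviceInformationUnguarded() {
        if (pathOfGpuBinary == null) {
            // Stand-in for the YarnException quoted in the log above.
            throw new IllegalStateException("Failed to find GPU discovery executable");
        }
        // Stand-in for actually running the binary and parsing its output.
        return "device information from " + pathOfGpuBinary;
    }

    /** One possible guard: skip the binary entirely when auto discovery is off. */
    Optional<String> getGpuDeviceInformationGuarded() {
        if (!autoDiscoveryEnabled) {
            return Optional.empty();
        }
        return Optional.of(getGpuDeviceInformationUnguarded());
    }

    public static void main(String[] args) {
        // Same situation as in the report: a valid binary path is configured,
        // but auto discovery is turned off.
        GpuDiscovererSketch discoverer = new GpuDiscovererSketch(false, "/usr/bin/ls");
        System.out.println("guarded result present: "
            + discoverer.getGpuDeviceInformationGuarded().isPresent());
        try {
            discoverer.getGpuDeviceInformationUnguarded();
        } catch (IllegalStateException e) {
            System.out.println("unguarded call failed: " + e.getMessage());
        }
    }
}
```

In this model the configured path is irrelevant once auto discovery is off, which is why the hint to double check path-to-discovery-executables points in the wrong direction.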