[ 
https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10107:
----------------------------------
    Comment: was deleted

(was: During internal end-to-end testing, I found the following issue:

Configuration:
 - GPU is enabled
 - yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables is set 
to "/usr/bin/ls" (any existing valid binary file)
 - yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to 
"0:0,1:1,2:2", so auto-discovery is turned off.
 If the REST endpoint 
[http://quasar-tsjqpq-3.vpc.cloudera.com:8042/ws/v1/node/resources/yarn.io%2Fgpu]
 is called, the NM throws the following exception:

{code:java}
2020-01-23 07:55:24,803 ERROR 
org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin:
 Failed to find GPU discovery executable, please double check 
yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
org.apache.hadoop.yarn.exceptions.YarnException: Failed to find GPU discovery 
executable, please double check 
yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper.getGpuDeviceInformation(NvidiaBinaryHelper.java:54)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer.getGpuDeviceInformation(GpuDiscoverer.java:125)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin.getNMResourceInfo(GpuResourcePlugin.java:104)
        at 
org.apache.hadoop.yarn.server.nodemanager.webapp.NMWebServices.getNMResourceInfo(NMWebServices.java:515)
{code}
*Let's break this down:* 
 1. 
org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin#getNMResourceInfo
 simply makes this call:
{code:java}
gpuDeviceInformation = gpuDiscoverer.getGpuDeviceInformation();
{code}
2. In 
org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#getGpuDeviceInformation,
 the following code calls NvidiaBinaryHelper.getGpuDeviceInformation:
{code:java}
 try {
      lastDiscoveredGpuInformation =
          nvidiaBinaryHelper.getGpuDeviceInformation(pathOfGpuBinary);
    } catch (IOException e) {
{code}
3. 
org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper#getGpuDeviceInformation
 is where the exception is finally thrown.
 This happens only when the "pathOfGpuBinary" parameter is null.
 This method is called only from GpuDiscoverer#getGpuDeviceInformation, which 
passes its "pathOfGpuBinary" field as the sole argument, so the exception is 
thrown exactly when this field is null.
 4. The "pathOfGpuBinary" field can be set only through this call 
chain:
{code:java}
GpuDiscoverer.lookUpAutoDiscoveryBinary(Configuration)  
(org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
  GpuDiscoverer.initialize(Configuration, NvidiaBinaryHelper)  
(org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
{code}
5. GpuDiscoverer#initialize contains this code:
{code:java}
if (isAutoDiscoveryEnabled()) {
      numOfErrorExecutionSinceLastSucceed = 0;
      lookUpAutoDiscoveryBinary(config);
      ....
{code}
, so 
org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#pathOfGpuBinary
 is set ONLY IF auto-discovery is enabled.
 Since our tests do not enable auto-discovery, we get this exception. Given 
that, I find the exception message very misleading:
{code:java}
Failed to find GPU discovery executable, please double check 
yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
{code}
 
 Related jira: https://issues.apache.org/jira/browse/YARN-9337

I think this exception message is very misleading and, of course, it makes no 
sense to try to execute the discovery binary when auto-discovery is turned off.)

> Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery 
> binary even if auto discovery is turned off
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10107
>                 URL: https://issues.apache.org/jira/browse/YARN-10107
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>         Attachments: YARN-10107.001.patch, nm-config-afterchange-gpu.xml, 
> nm-config-beforechange-gpu.xml.xml, 
> request-response-afterchange-with-autodiscovery.txt, 
> request-response-afterchange.txt, request-response-beforechange.txt
>
>
> During internal end-to-end testing, I found the following issue:
> Configuration:
>  - GPU is enabled
>  - yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables is set 
> to "/usr/bin/ls" (any existing valid binary file)
>  - yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to 
> "0:0,1:1,2:2", so auto-discovery is turned off.
>  If the REST endpoint 
> [http://quasar-tsjqpq-3.vpc.cloudera.com:8042/ws/v1/node/resources/yarn.io%2Fgpu]
>  is called, the NM throws the following exception:
> {code:java}
> 2020-01-23 07:55:24,803 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin:
>  Failed to find GPU discovery executable, please double check 
> yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
> org.apache.hadoop.yarn.exceptions.YarnException: Failed to find GPU discovery 
> executable, please double check 
> yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper.getGpuDeviceInformation(NvidiaBinaryHelper.java:54)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer.getGpuDeviceInformation(GpuDiscoverer.java:125)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin.getNMResourceInfo(GpuResourcePlugin.java:104)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.webapp.NMWebServices.getNMResourceInfo(NMWebServices.java:515)
> {code}
> *Let's break this down:* 
>  1. 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin#getNMResourceInfo
>  simply makes this call:
> {code:java}
> gpuDeviceInformation = gpuDiscoverer.getGpuDeviceInformation();
> {code}
> 2. In 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#getGpuDeviceInformation,
>  the following code calls NvidiaBinaryHelper.getGpuDeviceInformation:
> {code:java}
>  try {
>       lastDiscoveredGpuInformation =
>           nvidiaBinaryHelper.getGpuDeviceInformation(pathOfGpuBinary);
>     } catch (IOException e) {
> {code}
> 3. 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper#getGpuDeviceInformation
>  is where the exception is finally thrown.
>  This happens only when the "pathOfGpuBinary" parameter is null.
>  This method is called only from GpuDiscoverer#getGpuDeviceInformation, which 
> passes its "pathOfGpuBinary" field as the sole argument, so the exception is 
> thrown exactly when this field is null.
>  4. The "pathOfGpuBinary" field can be set only through this call 
> chain:
> {code:java}
> GpuDiscoverer.lookUpAutoDiscoveryBinary(Configuration)  
> (org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
>   GpuDiscoverer.initialize(Configuration, NvidiaBinaryHelper)  
> (org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
> {code}
> 5. GpuDiscoverer#initialize contains this code:
> {code:java}
> if (isAutoDiscoveryEnabled()) {
>       numOfErrorExecutionSinceLastSucceed = 0;
>       lookUpAutoDiscoveryBinary(config);
>       ....
> {code}
> , so 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#pathOfGpuBinary
>  is set ONLY IF auto-discovery is enabled.
>  Since our tests do not enable auto-discovery, we get this exception. Given 
> that, I find the exception message very misleading:
> {code:java}
> Failed to find GPU discovery executable, please double check 
> yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
> {code}
>  
>  Related jira: https://issues.apache.org/jira/browse/YARN-9337
> I think this exception message is very misleading and, of course, it makes no 
> sense to try to execute the discovery binary when auto-discovery is turned 
> off.
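
The call chain described above can be reduced to a small, self-contained sketch. This is not the Hadoop code and not the attached patch: GpuDiscoverySketch, Discoverer, and resourceInfo are hypothetical names, IllegalStateException stands in for YarnException, and the guard in resourceInfo is only one possible way to avoid calling discovery when auto-discovery is off.

```java
import java.util.Optional;

/** Simplified stand-in for the flow described above; not the actual Hadoop classes. */
public class GpuDiscoverySketch {

  /** Hypothetical, trimmed-down model of GpuDiscoverer. */
  static class Discoverer {
    private final boolean autoDiscoveryEnabled;
    // Stays null unless auto-discovery is enabled, mirroring step 5 of the analysis.
    private String pathOfGpuBinary;

    Discoverer(boolean autoDiscoveryEnabled) {
      this.autoDiscoveryEnabled = autoDiscoveryEnabled;
    }

    /** The binary path is looked up only when auto-discovery is enabled. */
    void initialize(String configuredBinaryPath) {
      if (autoDiscoveryEnabled) {
        pathOfGpuBinary = configuredBinaryPath;
      }
    }

    boolean isAutoDiscoveryEnabled() {
      return autoDiscoveryEnabled;
    }

    /** Fails when the path was never set, like the null check in step 3. */
    String getGpuDeviceInformation() {
      if (pathOfGpuBinary == null) {
        throw new IllegalStateException("Failed to find GPU discovery executable");
      }
      return "device information from " + pathOfGpuBinary;
    }
  }

  /** Hypothetical guard: ask the discoverer only when auto-discovery is on. */
  static Optional<String> resourceInfo(Discoverer discoverer) {
    if (!discoverer.isAutoDiscoveryEnabled()) {
      return Optional.empty();
    }
    return Optional.of(discoverer.getGpuDeviceInformation());
  }

  public static void main(String[] args) {
    // Devices are listed manually, so auto-discovery is off even though a
    // binary path is configured.
    Discoverer manual = new Discoverer(false);
    manual.initialize("/usr/bin/ls");

    try {
      manual.getGpuDeviceInformation();
    } catch (IllegalStateException e) {
      System.out.println("unguarded call fails: " + e.getMessage());
    }
    System.out.println("guarded call present: " + resourceInfo(manual).isPresent());
  }
}
```

With auto-discovery off, the unguarded call fails even though a valid binary path is configured, which is why the message pointing at path-to-discovery-executables is misleading. The guarded variant returns an empty result instead.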



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
