[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off

2020-01-29 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025728#comment-17025728
 ] 

Szilard Nemeth commented on YARN-10107:
---

Thanks [~prabhujoseph].

> Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery 
> binary even if auto discovery is turned off
> ---
>
> Key: YARN-10107
> URL: https://issues.apache.org/jira/browse/YARN-10107
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Szilard Nemeth
> Assignee: Szilard Nemeth
> Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-10107.001.patch, nm-config-afterchange-gpu.xml, 
> nm-config-beforechange-gpu.xml.xml, 
> request-response-afterchange-with-autodiscovery.txt, 
> request-response-afterchange.txt, request-response-beforechange.txt
>
>
> During internal end-to-end testing, I found the following issue:
> Configuration:
>  - GPU is enabled
>  - yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables is set 
> to "/usr/bin/ls" (any existing, valid binary file)
>  - yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to 
> "0:0,1:1,2:2", so auto-discovery is turned off.
>  If the REST endpoint 
> [http://quasar-tsjqpq-3.vpc.cloudera.com:8042/ws/v1/node/resources/yarn.io%2Fgpu]
>  is called, the following exception is thrown in the NM:
> {code:java}
> 2020-01-23 07:55:24,803 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin:
>  Failed to find GPU discovery executable, please double check 
> yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
> org.apache.hadoop.yarn.exceptions.YarnException: Failed to find GPU discovery 
> executable, please double check 
> yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper.getGpuDeviceInformation(NvidiaBinaryHelper.java:54)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer.getGpuDeviceInformation(GpuDiscoverer.java:125)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin.getNMResourceInfo(GpuResourcePlugin.java:104)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.webapp.NMWebServices.getNMResourceInfo(NMWebServices.java:515)
> {code}
> *Let's break this down:* 
>  1. 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin#getNMResourceInfo
>  simply calls:
> {code:java}
> gpuDeviceInformation = gpuDiscoverer.getGpuDeviceInformation();
> {code}
> 2. In 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#getGpuDeviceInformation,
>  the following code calls NvidiaBinaryHelper.getGpuDeviceInformation:
> {code:java}
> try {
>   lastDiscoveredGpuInformation =
>       nvidiaBinaryHelper.getGpuDeviceInformation(pathOfGpuBinary);
> } catch (IOException e) {
> {code}
> 3. 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper#getGpuDeviceInformation
>  finally throws the exception.
>  This only happens when the parameter "pathOfGpuBinary" is null.
>  This method is only called from GpuDiscoverer#getGpuDeviceInformation, 
> which passes its field "pathOfGpuBinary" as the only parameter, so 
> if this field is null, the exception is thrown.
>  4. The "pathOfGpuBinary" field can only be set through this 
> call chain:
> {code:java}
> GpuDiscoverer.lookUpAutoDiscoveryBinary(Configuration)  
> (org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
>   GpuDiscoverer.initialize(Configuration, NvidiaBinaryHelper)  
> (org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
> {code}
> 5. GpuDiscoverer#initialize contains this code:
> {code:java}
> if (isAutoDiscoveryEnabled()) {
>   numOfErrorExecutionSinceLastSucceed = 0;
>   lookUpAutoDiscoveryBinary(config);
>   
> {code}
> so 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#pathOfGpuBinary
>  is set ONLY IF auto-discovery is enabled.
>  Since our tests don't have auto-discovery enabled, we get this exception. 
> In this sense, I find the exception message very misleading:
> {code:java}
> Failed to find GPU discovery executable, please double check 
> yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
> {code}
>  
>  Related jira: https://issues.apache.org/jira/browse/YARN-9337
> I think this exception message is very misleading and of 

[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off

2020-01-29 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025684#comment-17025684
 ] 

Hudson commented on YARN-10107:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17915 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/17915/])
YARN-10107. Fix GpuResourcePlugin#getNMResourceInfo to honor Auto (pjoseph: rev 
825db8fe2ab37bd5a9a54485ea9ecbabf3766ed6)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuResourcePlugin.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/TestGpuResourcePlugin.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuDiscoverer.java


[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off

2020-01-29 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025677#comment-17025677
 ] 

Prabhu Joseph commented on YARN-10107:
--

Thank you [~snemeth] for the patch, [~pbacsko] for the review.

+1 for  [^YARN-10107.001.patch] . Have just committed this to trunk.

[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off

2020-01-28 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025081#comment-17025081
 ] 

Szilard Nemeth commented on YARN-10107:
---

Hi [~pbacsko], 
I think it's okay to return null. 
I patched my cluster with the updated code (jars) and tested how the endpoint 
responds to a request. Please check the attached files. 

[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off

2020-01-28 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025077#comment-17025077
 ] 

Peter Bacsko commented on YARN-10107:
-

[~snemeth] I just have one question.

Now in the {{else}} branch, you set {{gpuDeviceInformation}} to {{null}} and it 
will be wrapped in the response, right? What's the net effect of this? Is it OK 
to return a {{null}} {{GpuDeviceInformation}}?
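
For readers following the thread, the guard being discussed can be sketched roughly as below. This is only a minimal, self-contained illustration and not the contents of YARN-10107.001.patch: the class {{GpuGuardSketch}}, its nested {{Discoverer}}, the {{String}} return type and the unchecked exception are all hypothetical stand-ins for {{GpuDiscoverer}}, {{GpuDeviceInformation}} and {{YarnException}}. It assumes, as the issue description states, that the binary path is only set when auto-discovery is enabled.

```java
public class GpuGuardSketch {

  /** Hypothetical stand-in for GpuDiscoverer; not the real Hadoop class. */
  public static class Discoverer {
    private final boolean autoDiscoveryEnabled;
    // Stays null unless auto-discovery is enabled, mirroring the
    // behaviour of initialize() described in the issue.
    private String pathOfGpuBinary;

    public Discoverer(boolean autoDiscoveryEnabled, String configuredBinary) {
      this.autoDiscoveryEnabled = autoDiscoveryEnabled;
      if (autoDiscoveryEnabled) {
        this.pathOfGpuBinary = configuredBinary;
      }
    }

    public boolean isAutoDiscoveryEnabled() {
      return autoDiscoveryEnabled;
    }

    /** Stand-in for the discovery call; fails when no binary path was set. */
    public String getGpuDeviceInformation() {
      if (pathOfGpuBinary == null) {
        // Unchecked exception used here only to keep the sketch short.
        throw new IllegalStateException(
            "Failed to find GPU discovery executable");
      }
      return "device-info-from:" + pathOfGpuBinary;
    }
  }

  /** Behaviour before the change: always asks the discoverer. */
  public static String getDeviceInfoUnguarded(Discoverer d) {
    return d.getGpuDeviceInformation();
  }

  /** Guarded variant: only asks when auto-discovery is on, else null. */
  public static String getDeviceInfoGuarded(Discoverer d) {
    if (d.isAutoDiscoveryEnabled()) {
      return d.getGpuDeviceInformation();
    } else {
      return null;
    }
  }

  public static void main(String[] args) {
    Discoverer off = new Discoverer(false, "/usr/bin/ls");
    System.out.println("guarded, auto-discovery off: "
        + getDeviceInfoGuarded(off));
    try {
      getDeviceInfoUnguarded(off);
    } catch (IllegalStateException e) {
      System.out.println("unguarded, auto-discovery off: " + e.getMessage());
    }
  }
}
```

With auto-discovery off, the guarded variant returns {{null}} instead of reaching the discovery call, which is the case the question above is about; whether callers of the real method tolerate a {{null}} value is exactly what needs to be confirmed against the REST response.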

[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off

2020-01-28 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024961#comment-17024961
 ] 

Hadoop QA commented on YARN-10107:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  8s{color} | {color:red} YARN-10107 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-10107 |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/25448/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off

2020-01-28 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024960#comment-17024960
 ] 

Szilard Nemeth commented on YARN-10107:
---

Uploaded test evidence files.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off

2020-01-27 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024620#comment-17024620
 ] 

Hadoop QA commented on YARN-10107:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 36s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m  7s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 43s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 27s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 48s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 23s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 21m 27s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 31s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 77m 30s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | YARN-10107 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12991938/YARN-10107.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux f6d6a86ecb41 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 7f40e66 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_232 |
| findbugs | v3.1.0-RC1 |
|  Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/25446/testReport/ |
| Max. process+thread count | 343 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/25446/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off

2020-01-27 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024618#comment-17024618
 ] 

Hadoop QA commented on YARN-10107:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 35s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 49s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 26s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 47s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 23s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 21m 14s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 27s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 78m 13s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.nodemanager.amrmproxy.TestFederationInterceptor |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | YARN-10107 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12991938/YARN-10107.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux cda7e467fece 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 7f40e66 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_232 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-YARN-Build/25445/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt |
|  Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/25445/testReport/ |
| Max. process+thread count | 307 (vs. ulimit of 5500) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 

[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off

2020-01-27 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024417#comment-17024417
 ] 

Szilard Nemeth commented on YARN-10107:
---

During internal end-to-end testing, I found the following issue:

Configuration: 
- GPU is enabled
- yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables is set to "/usr/bin/ls" (any existing, valid binary file)
- yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to "0:0,1:1,2:2", so auto-discovery is turned off.

If the REST endpoint http://quasar-tsjqpq-3.vpc.cloudera.com:8042/ws/v1/node/resources/yarn.io%2Fgpu is called, the following exception is thrown in the NM:

{code:java}
2020-01-23 07:55:24,803 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin: Failed to find GPU discovery executable, please double check yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
org.apache.hadoop.yarn.exceptions.YarnException: Failed to find GPU discovery executable, please double check yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper.getGpuDeviceInformation(NvidiaBinaryHelper.java:54)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer.getGpuDeviceInformation(GpuDiscoverer.java:125)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin.getNMResourceInfo(GpuResourcePlugin.java:104)
	at org.apache.hadoop.yarn.server.nodemanager.webapp.NMWebServices.getNMResourceInfo(NMWebServices.java:515)
{code}

Let's break this down: 
1. org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin#getNMResourceInfo simply makes this call:
{code:java}
gpuDeviceInformation = gpuDiscoverer.getGpuDeviceInformation();
{code}
2. In org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#getGpuDeviceInformation, the following code calls NvidiaBinaryHelper.getGpuDeviceInformation:
{code:java}
try {
  lastDiscoveredGpuInformation =
      nvidiaBinaryHelper.getGpuDeviceInformation(pathOfGpuBinary);
} catch (IOException e) {
{code}
3. org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper#getGpuDeviceInformation finally throws the exception.
This only happens when its parameter "pathOfGpuBinary" is null.
Since this method is only called from GpuDiscoverer#getGpuDeviceInformation, which passes its own "pathOfGpuBinary" field as the sole argument, we can be sure that whenever this field is null, the exception is thrown.
4. The "pathOfGpuBinary" field can only be set through this call chain:
{code:java}
GpuDiscoverer.lookUpAutoDiscoveryBinary(Configuration)  
(org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
  GpuDiscoverer.initialize(Configuration, NvidiaBinaryHelper)  
(org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
{code}
5. GpuDiscoverer#initialize contains this code:
{code:java}
if (isAutoDiscoveryEnabled()) {
  numOfErrorExecutionSinceLastSucceed = 0;
  lookUpAutoDiscoveryBinary(config);
  
{code}
so org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#pathOfGpuBinary is set ONLY IF auto discovery is enabled.
Since our tests don't have auto discovery enabled, the field stays null and we get this exception. In this sense, I find the exception message very misleading:
{code:java}
Failed to find GPU discovery executable, please double check 
yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
{code}
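
The chain in steps 1 to 5 can be reproduced in isolation. What follows is only a minimal, self-contained sketch, not the NodeManager code: the names GpuDiscoverySketch, Discoverer and DiscoveryException are hypothetical stand-ins, and it assumes that auto discovery counts as enabled only when allowed-gpu-devices has the value "auto". The guarded variant shows one possible way to avoid the exception; it is not necessarily what YARN-10107.001.patch does.

```java
import java.util.Arrays;
import java.util.List;

public class GpuDiscoverySketch {

  /** Hypothetical stand-in for YarnException. */
  static class DiscoveryException extends Exception {
    DiscoveryException(String message) {
      super(message);
    }
  }

  /** Hypothetical, heavily simplified stand-in for GpuDiscoverer. */
  static class Discoverer {
    // Assumption: this value of allowed-gpu-devices turns auto discovery on.
    static final String AUTO = "auto";

    final String allowedDevices;
    // Stays null unless auto discovery is enabled (step 5).
    private String pathOfGpuBinary;

    Discoverer(String allowedDevices, String configuredBinaryPath) {
      this.allowedDevices = allowedDevices;
      if (isAutoDiscoveryEnabled()) {
        this.pathOfGpuBinary = configuredBinaryPath;
      }
    }

    boolean isAutoDiscoveryEnabled() {
      return AUTO.equals(allowedDevices);
    }

    /** Steps 2 and 3: fails whenever the binary path was never set. */
    String getGpuDeviceInformation() throws DiscoveryException {
      if (pathOfGpuBinary == null) {
        throw new DiscoveryException("Failed to find GPU discovery executable");
      }
      return "would execute " + pathOfGpuBinary;
    }

    /** Splits a manual list such as "0:0,1:1,2:2" into its entries. */
    List<String> manuallyAllowedDevices() {
      return Arrays.asList(allowedDevices.split(","));
    }
  }

  /** Step 1 as described above: always asks the discoverer. */
  static String resourceInfoUnguarded(Discoverer d) throws DiscoveryException {
    return d.getGpuDeviceInformation();
  }

  /** One possible guard: never touch the binary when auto discovery is off. */
  static String resourceInfoGuarded(Discoverer d) throws DiscoveryException {
    if (!d.isAutoDiscoveryEnabled()) {
      return "configured devices: " + d.manuallyAllowedDevices();
    }
    return d.getGpuDeviceInformation();
  }

  public static void main(String[] args) throws DiscoveryException {
    // Same settings as in the configuration listed at the top of this comment.
    Discoverer manual = new Discoverer("0:0,1:1,2:2", "/usr/bin/ls");
    try {
      resourceInfoUnguarded(manual);
    } catch (DiscoveryException e) {
      System.out.println("unguarded: " + e.getMessage());
    }
    System.out.println("guarded: " + resourceInfoGuarded(manual));
  }
}
```

With the manual device list, the unguarded path fails even though a valid binary path is configured, which is why the "please double check ... path-to-discovery-executables" hint points in the wrong direction.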
 
 Related jira: https://issues.apache.org/jira/browse/YARN-9337




