[ 
https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024961#comment-17024961
 ] 

Hadoop QA commented on YARN-10107:
----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  8s{color} | {color:red} YARN-10107 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-10107 |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/25448/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery 
> binary even if auto discovery is turned off
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10107
>                 URL: https://issues.apache.org/jira/browse/YARN-10107
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>         Attachments: YARN-10107.001.patch, nm-config-afterchange-gpu.xml, 
> nm-config-beforechange-gpu.xml.xml, 
> request-response-afterchange-with-autodiscovery.txt, 
> request-response-afterchange.txt, request-response-beforechange.txt
>
>
> During internal end-to-end testing, I found the following issue:
> Configuration:
>  - GPU is enabled
>  - yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables is set 
> to "/usr/bin/ls" (any existing valid binary file would do)
>  - yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to 
> "0:0,1:1,2:2", so auto-discovery is turned off.
>  If the REST endpoint 
> [http://quasar-tsjqpq-3.vpc.cloudera.com:8042/ws/v1/node/resources/yarn.io%2Fgpu]
>  is called, the NM throws the following exception:
> {code:java}
> 2020-01-23 07:55:24,803 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin:
>  Failed to find GPU discovery executable, please double check 
> yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
> org.apache.hadoop.yarn.exceptions.YarnException: Failed to find GPU discovery 
> executable, please double check 
> yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper.getGpuDeviceInformation(NvidiaBinaryHelper.java:54)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer.getGpuDeviceInformation(GpuDiscoverer.java:125)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin.getNMResourceInfo(GpuResourcePlugin.java:104)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.webapp.NMWebServices.getNMResourceInfo(NMWebServices.java:515)
> {code}
> *Let's break this down:* 
>  1. 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin#getNMResourceInfo
>  simply makes this call:
> {code:java}
> gpuDeviceInformation = gpuDiscoverer.getGpuDeviceInformation();
> {code}
> 2. In 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#getGpuDeviceInformation,
>  the following code calls NvidiaBinaryHelper.getGpuDeviceInformation:
> {code:java}
>  try {
>       lastDiscoveredGpuInformation =
>           nvidiaBinaryHelper.getGpuDeviceInformation(pathOfGpuBinary);
>     } catch (IOException e) {
> {code}
> 3. 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper#getGpuDeviceInformation
>  finally throws the exception.
>  This only happens when the parameter called "pathOfGpuBinary" is null.
>  Since this method is only called from GpuDiscoverer#getGpuDeviceInformation, 
> which passes its field called "pathOfGpuBinary" as the only parameter, we 
> can be sure that the exception is thrown exactly when this field is null.
>  4. The "pathOfGpuBinary" field can only be set through this call chain:
> {code:java}
> GpuDiscoverer.lookUpAutoDiscoveryBinary(Configuration)  
> (org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
>   GpuDiscoverer.initialize(Configuration, NvidiaBinaryHelper)  
> (org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
> {code}
> 5. GpuDiscoverer#initialize contains this code:
> {code:java}
> if (isAutoDiscoveryEnabled()) {
>       numOfErrorExecutionSinceLastSucceed = 0;
>       lookUpAutoDiscoveryBinary(config);
>       ....
> {code}
> so 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#pathOfGpuBinary
>  is set ONLY IF auto discovery is enabled.
>  Since our tests don't have auto discovery enabled, the field stays null and 
> we get this exception, which makes the exception message very misleading:
> {code:java}
> Failed to find GPU discovery executable, please double check 
> yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
> {code}
>  
>  Related jira: https://issues.apache.org/jira/browse/YARN-9337
> Besides the misleading exception message, it does not make sense to try to 
> execute the discovery binary at all when auto discovery is turned off.
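
The failure mode described in the quoted issue, and one possible guard against it, can be sketched as follows. This is a minimal illustration and not the Hadoop source or the attached patch: the class GpuInfoSketch and its constructor are hypothetical, only the names isAutoDiscoveryEnabled, getGpuDeviceInformation and pathOfGpuBinary are borrowed from the description, and treating the value "auto" as the auto-discovery switch is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the discovery logic; not the real GpuDiscoverer. */
final class GpuInfoSketch {

    private final String allowedDevices;   // assumed: "auto" or pairs like "0:0,1:1,2:2"
    private final String pathOfGpuBinary;  // stays null unless auto discovery is enabled

    GpuInfoSketch(String allowedDevices, String configuredBinaryPath) {
        this.allowedDevices = allowedDevices;
        // Mirrors step 5 of the description: the binary path is only
        // resolved when auto discovery is enabled.
        this.pathOfGpuBinary =
            "auto".equals(allowedDevices) ? configuredBinaryPath : null;
    }

    boolean isAutoDiscoveryEnabled() {
        return "auto".equals(allowedDevices);
    }

    /** Guarded variant: never touches the binary when devices are listed manually. */
    List<String> getGpuDeviceInformation() {
        if (!isAutoDiscoveryEnabled()) {
            // Report the manually configured devices instead of running discovery.
            List<String> devices = new ArrayList<>();
            for (String pair : allowedDevices.split(",")) {
                devices.add(pair.trim());
            }
            return devices;
        }
        if (pathOfGpuBinary == null) {
            throw new IllegalStateException(
                "Auto discovery is enabled but no discovery executable was found");
        }
        // A real implementation would execute the binary here; the sketch only
        // records which path would have been used.
        return Collections.singletonList("discovered-via:" + pathOfGpuBinary);
    }
}
```

With the configuration from the report ("0:0,1:1,2:2" and "/usr/bin/ls"), this sketch returns the three configured pairs without consulting the binary path, so the "Failed to find GPU discovery executable" branch is never reached.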



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
