[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off
[ https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025728#comment-17025728 ]

Szilard Nemeth commented on YARN-10107:
---------------------------------------

Thanks [~prabhujoseph].

> Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10107
>                 URL: https://issues.apache.org/jira/browse/YARN-10107
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: YARN-10107.001.patch, nm-config-afterchange-gpu.xml, nm-config-beforechange-gpu.xml.xml, request-response-afterchange-with-autodiscovery.txt, request-response-afterchange.txt, request-response-beforechange.txt
>
>
> During internal end-to-end testing, I found the following issue:
> Configuration:
> - GPU is enabled
> - yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables is set to "/usr/bin/ls" (any existing valid binary file)
> - yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to "0:0,1:1,2:2", so auto-discovery is turned off.
>
> If the REST endpoint [http://quasar-tsjqpq-3.vpc.cloudera.com:8042/ws/v1/node/resources/yarn.io%2Fgpu] is called, the following exception is thrown in the NM:
> {code:java}
> 2020-01-23 07:55:24,803 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin: Failed to find GPU discovery executable, please double check yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
> org.apache.hadoop.yarn.exceptions.YarnException: Failed to find GPU discovery executable, please double check yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper.getGpuDeviceInformation(NvidiaBinaryHelper.java:54)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer.getGpuDeviceInformation(GpuDiscoverer.java:125)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin.getNMResourceInfo(GpuResourcePlugin.java:104)
>   at org.apache.hadoop.yarn.server.nodemanager.webapp.NMWebServices.getNMResourceInfo(NMWebServices.java:515)
> {code}
>
> *Let's break this down:*
>
> 1. org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin#getNMResourceInfo simply calls:
> {code:java}
> gpuDeviceInformation = gpuDiscoverer.getGpuDeviceInformation();
> {code}
>
> 2. In org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#getGpuDeviceInformation, the following code calls NvidiaBinaryHelper.getGpuDeviceInformation:
> {code:java}
> try {
>   lastDiscoveredGpuInformation =
>       nvidiaBinaryHelper.getGpuDeviceInformation(pathOfGpuBinary);
> } catch (IOException e) {
> {code}
>
> 3. org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper#getGpuDeviceInformation finally throws the exception.
> This only happens if the parameter called "pathOfGpuBinary" is null.
> Since this method is only called from GpuDiscoverer#getGpuDeviceInformation, which passes its field called "pathOfGpuBinary" as the only parameter, we can be sure that if this field is null, the exception is thrown.
>
> 4. The only way the "pathOfGpuBinary" field can be set is through this call chain:
> {code:java}
> GpuDiscoverer.lookUpAutoDiscoveryBinary(Configuration) (org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
> GpuDiscoverer.initialize(Configuration, NvidiaBinaryHelper) (org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
> {code}
>
> 5. GpuDiscoverer#initialize contains this code:
> {code:java}
> if (isAutoDiscoveryEnabled()) {
>   numOfErrorExecutionSinceLastSucceed = 0;
>   lookUpAutoDiscoveryBinary(config);
>
> {code}
> , so org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#pathOfGpuBinary is set ONLY IF auto discovery is enabled.
> Since our tests don't have auto discovery enabled, we get this exception.
> In this sense, the exception message is very misleading to me:
> {code:java}
> Failed to find GPU discovery executable, please double check yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
> {code}
>
> Related jira: https://issues.apache.org/jira/browse/YARN-9337
> I think this exception message is very misleading and of
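The call chain in steps 1 to 5 of the description can be reproduced in isolation. The sketch below is only an illustration under stated assumptions: `NullBinaryPathSketch`, `SketchYarnException`, `BinaryHelper`, `Discoverer` and `failureMessage` are hypothetical stand-ins written for this example, not the Hadoop classes, and the real `lookUpAutoDiscoveryBinary` logic is reduced to a plain field assignment.

```java
public class NullBinaryPathSketch {

  /** Hypothetical stand-in for YarnException. */
  static class SketchYarnException extends Exception {
    SketchYarnException(String message) {
      super(message);
    }
  }

  /** Hypothetical stand-in for NvidiaBinaryHelper: rejects a null binary path. */
  static class BinaryHelper {
    String getGpuDeviceInformation(String pathOfGpuBinary) throws SketchYarnException {
      if (pathOfGpuBinary == null) {
        throw new SketchYarnException(
            "Failed to find GPU discovery executable, please double check "
                + "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.");
      }
      return "device info from " + pathOfGpuBinary;
    }
  }

  /** Hypothetical stand-in for GpuDiscoverer. */
  static class Discoverer {
    private final BinaryHelper helper = new BinaryHelper();
    // Stays null unless auto discovery is enabled (step 5 of the description).
    private String pathOfGpuBinary;

    void initialize(String configuredBinary, boolean autoDiscoveryEnabled) {
      if (autoDiscoveryEnabled) {
        // Assumption: the real binary lookup is simplified to an assignment.
        pathOfGpuBinary = configuredBinary;
      }
    }

    String getGpuDeviceInformation() throws SketchYarnException {
      // Step 2: the field is passed on without checking how it was initialized.
      return helper.getGpuDeviceInformation(pathOfGpuBinary);
    }
  }

  /** Returns the exception message, or null if the call succeeded. */
  static String failureMessage(boolean autoDiscoveryEnabled) {
    Discoverer discoverer = new Discoverer();
    discoverer.initialize("/usr/bin/ls", autoDiscoveryEnabled);
    try {
      discoverer.getGpuDeviceInformation();
      return null;
    } catch (SketchYarnException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println("auto discovery off: " + failureMessage(false));
    System.out.println("auto discovery on:  " + failureMessage(true));
  }
}
```

With auto discovery off, the configured path is valid but never stored, so the helper reports a missing executable even though the setting itself is fine. That is why the message reads as misleading.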
[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off
[ https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025684#comment-17025684 ]

Hudson commented on YARN-10107:
-------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17915 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/17915/])
YARN-10107. Fix GpuResourcePlugin#getNMResourceInfo to honor Auto (pjoseph: rev 825db8fe2ab37bd5a9a54485ea9ecbabf3766ed6)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuResourcePlugin.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/TestGpuResourcePlugin.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuDiscoverer.java
[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off
[ https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025677#comment-17025677 ]

Prabhu Joseph commented on YARN-10107:
--------------------------------------

Thank you [~snemeth] for the patch, [~pbacsko] for the review. +1 for [^YARN-10107.001.patch]. Have just committed this to trunk.
[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off
[ https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025081#comment-17025081 ]

Szilard Nemeth commented on YARN-10107:
---------------------------------------

Hi [~pbacsko],
I think it's okay to return null. I patched my cluster with the updated code (jars) and tested how the endpoint responds to a request. Please check the attached files.
[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off
[ https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025077#comment-17025077 ]

Peter Bacsko commented on YARN-10107:
-------------------------------------

[~snemeth] I just have one question. Now in the {{else}} branch, you set {{gpuDeviceInformation}} to {{null}} and it will be wrapped in the response, right? What's the net effect of this? Is it OK to return a {{null}} {{GpuDeviceInformation}}?
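The {{else}} branch discussed in this comment can be pictured with a small self-contained sketch. It is based only on the review comment, not on the contents of YARN-10107.001.patch, which this thread does not show. All names below (`AutoDiscoveryGuardSketch`, `Discoverer`, `ResourceInfo`, `binaryInvocations`) are hypothetical stand-ins, and the device information is reduced to a plain string.

```java
import java.util.Arrays;
import java.util.List;

public class AutoDiscoveryGuardSketch {

  /** Hypothetical stand-in for GpuDiscoverer; counts how often the binary would be run. */
  static class Discoverer {
    private final boolean autoDiscoveryEnabled;
    int binaryInvocations = 0;

    Discoverer(boolean autoDiscoveryEnabled) {
      this.autoDiscoveryEnabled = autoDiscoveryEnabled;
    }

    boolean isAutoDiscoveryEnabled() {
      return autoDiscoveryEnabled;
    }

    String getGpuDeviceInformation() {
      binaryInvocations++; // the real method would execute the discovery binary here
      return "gpu-device-information";
    }
  }

  /** Hypothetical stand-in for the REST response wrapper. */
  static class ResourceInfo {
    final String gpuDeviceInformation; // null when auto discovery is off
    final List<String> allowedDevices;

    ResourceInfo(String gpuDeviceInformation, List<String> allowedDevices) {
      this.gpuDeviceInformation = gpuDeviceInformation;
      this.allowedDevices = allowedDevices;
    }
  }

  /** Only consults the discoverer when auto discovery is enabled. */
  static ResourceInfo getNMResourceInfo(Discoverer discoverer, List<String> allowedDevices) {
    final String gpuDeviceInformation;
    if (discoverer.isAutoDiscoveryEnabled()) {
      gpuDeviceInformation = discoverer.getGpuDeviceInformation();
    } else {
      // The branch the review asks about: no binary call, null is wrapped in the response.
      gpuDeviceInformation = null;
    }
    return new ResourceInfo(gpuDeviceInformation, allowedDevices);
  }

  public static void main(String[] args) {
    List<String> allowed = Arrays.asList("0:0", "1:1", "2:2");
    Discoverer manual = new Discoverer(false);
    ResourceInfo info = getNMResourceInfo(manual, allowed);
    System.out.println("device info: " + info.gpuDeviceInformation);
    System.out.println("binary invocations: " + manual.binaryInvocations);
  }
}
```

In this sketch the net effect of the {{else}} branch is that the response still carries the manually configured devices while the device-information part is simply absent, and the discovery binary is never invoked. Whether a client tolerates the missing part is the question raised in the comment.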
[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off
[ https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024961#comment-17024961 ]

Hadoop QA commented on YARN-10107:
----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 8s{color} | {color:red} YARN-10107 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-10107 |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/25448/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off
[ https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024960#comment-17024960 ]

Szilard Nemeth commented on YARN-10107:
---------------------------------------

Uploaded test evidence files.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off
[ https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024620#comment-17024620 ]

Hadoop QA commented on YARN-10107:
----------------------------------

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 36s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 7s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 43s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 48s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 21m 27s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 31s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 77m 30s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | YARN-10107 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12991938/YARN-10107.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux f6d6a86ecb41 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 7f40e66 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_232 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/25446/testReport/ |
| Max. process+thread count | 343 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/25446/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off
[ https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024618#comment-17024618 ]

Hadoop QA commented on YARN-10107:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 35s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 49s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 47s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 21m 14s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 27s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 78m 13s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.nodemanager.amrmproxy.TestFederationInterceptor |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | YARN-10107 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12991938/YARN-10107.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux cda7e467fece 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 7f40e66 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_232 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-YARN-Build/25445/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/25445/testReport/ |
| Max. process+thread count | 307 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U:
[jira] [Commented] (YARN-10107) Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off
[ https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024417#comment-17024417 ]

Szilard Nemeth commented on YARN-10107:
---

During internal end-to-end testing, I found the following issue:

Configuration:
- GPU is enabled
- yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables is set to "/usr/bin/ls" (any existing, valid binary file)
- yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to "0:0,1:1,2:2", so auto-discovery is turned off.

If the REST endpoint http://quasar-tsjqpq-3.vpc.cloudera.com:8042/ws/v1/node/resources/yarn.io%2Fgpu is called, the following exception is thrown in the NM:
{code:java}
2020-01-23 07:55:24,803 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin: Failed to find GPU discovery executable, please double check yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
org.apache.hadoop.yarn.exceptions.YarnException: Failed to find GPU discovery executable, please double check yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper.getGpuDeviceInformation(NvidiaBinaryHelper.java:54)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer.getGpuDeviceInformation(GpuDiscoverer.java:125)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin.getNMResourceInfo(GpuResourcePlugin.java:104)
    at org.apache.hadoop.yarn.server.nodemanager.webapp.NMWebServices.getNMResourceInfo(NMWebServices.java:515)
{code}
Let's break this down:

1. org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin#getNMResourceInfo simply makes this call:
{code:java}
gpuDeviceInformation = gpuDiscoverer.getGpuDeviceInformation();
{code}
2. In org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#getGpuDeviceInformation, the following code calls NvidiaBinaryHelper.getGpuDeviceInformation:
{code:java}
try {
  lastDiscoveredGpuInformation =
      nvidiaBinaryHelper.getGpuDeviceInformation(pathOfGpuBinary);
} catch (IOException e) {
{code}
3. org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper#getGpuDeviceInformation is where the exception is finally thrown. This only happens when the parameter "pathOfGpuBinary" is null. Since this method is only called from GpuDiscoverer#getGpuDeviceInformation, which passes its own field "pathOfGpuBinary" as the only parameter, we can be sure that the exception is thrown whenever this field is null.
4. The only way the "pathOfGpuBinary" field can be set is through this call chain:
{code:java}
GpuDiscoverer.lookUpAutoDiscoveryBinary(Configuration) (org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
  GpuDiscoverer.initialize(Configuration, NvidiaBinaryHelper) (org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
{code}
5. GpuDiscoverer#initialize contains this code:
{code:java}
if (isAutoDiscoveryEnabled()) {
  numOfErrorExecutionSinceLastSucceed = 0;
  lookUpAutoDiscoveryBinary(config);
{code}
So org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#pathOfGpuBinary is set ONLY IF auto discovery is enabled. Since our tests don't have auto discovery enabled, we get this exception.

In this sense, I find the exception message very misleading:
{code:java}
Failed to find GPU discovery executable, please double check yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
{code}
Related jira: https://issues.apache.org/jira/browse/YARN-9337

> Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off
> ---
>
> Key: YARN-10107
> URL: https://issues.apache.org/jira/browse/YARN-10107
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Szilard Nemeth
> Assignee: Szilard Nemeth
> Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
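
The root cause traced in steps 1 to 5 of the comment above can be reproduced in a small standalone model. This is only a sketch under stated assumptions: `GpuDiscovererSketch`, its constructor parameters, and `describeResources` are hypothetical names invented for illustration and are not the Hadoop classes. The guard shown is one possible way to avoid the call, not necessarily what YARN-10107.001.patch does.

```java
/**
 * Hypothetical, simplified model of the behaviour described in the comment.
 * It is NOT the real GpuDiscoverer; all names here are illustrative.
 */
class GpuDiscovererSketch {
  private final boolean autoDiscoveryEnabled;
  private final String allowedDevices;
  // Mirrors the analysis: this field stays null unless auto discovery is on.
  private String pathOfGpuBinary;

  GpuDiscovererSketch(boolean autoDiscoveryEnabled,
                      String configuredBinaryPath,
                      String allowedDevices) {
    this.autoDiscoveryEnabled = autoDiscoveryEnabled;
    this.allowedDevices = allowedDevices;
    if (autoDiscoveryEnabled) {
      // Stand-in for the binary lookup done during initialization.
      this.pathOfGpuBinary = configuredBinaryPath;
    }
  }

  /** Unguarded call: fails whenever the binary path was never looked up. */
  String getGpuDeviceInformation() {
    if (pathOfGpuBinary == null) {
      throw new IllegalStateException(
          "Failed to find GPU discovery executable");
    }
    return "device info from " + pathOfGpuBinary;
  }

  /** Guarded variant: only consults the binary when auto discovery is on. */
  String describeResources() {
    if (autoDiscoveryEnabled) {
      return getGpuDeviceInformation();
    }
    return "configured devices: " + allowedDevices;
  }

  public static void main(String[] args) {
    // Same shape as the reported setup: a valid binary path is configured,
    // but the devices are listed manually, so auto discovery is off.
    GpuDiscovererSketch manual =
        new GpuDiscovererSketch(false, "/usr/bin/ls", "0:0,1:1,2:2");
    System.out.println(manual.describeResources());
    try {
      manual.getGpuDeviceInformation();
    } catch (IllegalStateException e) {
      System.out.println("unguarded call failed: " + e.getMessage());
    }
  }
}
```

In this model the configured path is valid, yet the unguarded call still fails because the field was never populated, which is why the "please double check ... setting" message points the reader in the wrong direction.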