[ https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Bacsko updated YARN-9265:
-------------------------------
Description:
The plugin cannot autodetect the Intel FPGA PAC (Programmable Acceleration Card).
There are two major issues.
Problem #1
The output of {{aocl diagnose}}:
{noformat}
--------------------------------------------------------------------
Device Name:
acl0

Package Pat:
/home/pbacsko/inteldevstack/intelFPGA_pro/hld/board/opencl_bsp

Vendor: Intel Corp

Physical Dev Name   Status   Information

pac_a10_f200000     Passed   PAC Arria 10 Platform (pac_a10_f200000)
                             PCIe 08:00.0
                             FPGA temperature = 79 degrees C.

DIAGNOSTIC_PASSED
--------------------------------------------------------------------

Call "aocl diagnose <device-names>" to run diagnose for specified devices
Call "aocl diagnose all" to run diagnose for all devices
{noformat}
The plugin does not recognize this output and fails with the following messages:
{noformat}
2019-01-25 06:46:02,834 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaResourcePlugin: Using FPGA vendor plugin: org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin
2019-01-25 06:46:02,943 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaDiscoverer: Trying to diagnose FPGA information ...
2019-01-25 06:46:03,085 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule: Using traffic control bandwidth handler
2019-01-25 06:46:03,108 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl: Initializing mounted controller cpu at /sys/fs/cgroup/cpu,cpuacct/yarn
2019-01-25 06:46:03,139 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.fpga.FpgaResourceHandlerImpl: FPGA Plugin bootstrap success.
2019-01-25 06:46:03,247 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin: Couldn't find (?i)bus:slot.func\s=\s.*, pattern
2019-01-25 06:46:03,248 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin: Couldn't find (?i)Total\sCard\sPower\sUsage\s=\s.* pattern
2019-01-25 06:46:03,251 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin: Failed to get major-minor number from reading /dev/pac_a10_f300000
2019-01-25 06:46:03,252 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to bootstrap configured resource subsystems!
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: No FPGA devices detected!
{noformat}
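The mismatch can be reproduced outside the NodeManager. The sketch below is illustrative Python, not plugin code: it applies the two patterns quoted in the WARN messages to the PAC output above and shows that neither matches. {{PAC_PCIE_PATTERN}} and {{find_pcie_address}} are hypothetical names, and the pattern is only an assumption about how the {{PCIe 08:00.0}} line could be parsed.

```python
import re

# Patterns copied from the WARN messages above.
LEGACY_BUS_PATTERN = re.compile(r"(?i)bus:slot.func\s=\s.*,")
LEGACY_POWER_PATTERN = re.compile(r"(?i)Total\sCard\sPower\sUsage\s=\s.*")

# Hypothetical pattern for the PAC layout (an assumption, not from the plugin):
# "PCIe" followed by a bus:slot.func address such as 08:00.0.
PAC_PCIE_PATTERN = re.compile(r"(?i)PCIe\s+([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])")

# Device section of the "aocl diagnose" output shown above.
PAC_OUTPUT = """\
Device Name:
acl0

Package Pat:
/home/pbacsko/inteldevstack/intelFPGA_pro/hld/board/opencl_bsp

Vendor: Intel Corp

Physical Dev Name   Status   Information

pac_a10_f200000     Passed   PAC Arria 10 Platform (pac_a10_f200000)
                             PCIe 08:00.0
                             FPGA temperature = 79 degrees C.

DIAGNOSTIC_PASSED
"""


def find_pcie_address(diagnose_output):
    """Return the bus:slot.func string from a 'PCIe ...' line, or None."""
    match = PAC_PCIE_PATTERN.search(diagnose_output)
    return match.group(1) if match else None


if __name__ == "__main__":
    print("legacy bus pattern matched:", bool(LEGACY_BUS_PATTERN.search(PAC_OUTPUT)))
    print("legacy power pattern matched:", bool(LEGACY_POWER_PATTERN.search(PAC_OUTPUT)))
    print("PCIe address:", find_pcie_address(PAC_OUTPUT))
```

Any real fix would need to handle both output layouts, since other boards may still print the {{bus:slot.func = ...}} form.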
Problem #2
The plugin assumes that the file name under {{/dev}} can be derived from the
"Physical Dev Name", but this assumption is wrong. For example, it expects the
device file to be {{/dev/pac_a10_f300000}}, whereas the actual file is
{{/dev/intel-fpga-port.0}}.
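To illustrate why the derived path breaks the major-minor lookup, here is a minimal Python sketch, not the plugin's implementation. {{guessed_device_path}} mirrors the assumption described above, and {{major_minor}} is a hypothetical helper that reads device numbers with {{os.stat}}; it returns {{None}} when the path does not exist or is not a device node, which is what happens for the guessed name.

```python
import os
import stat


def guessed_device_path(physical_dev_name):
    """Build /dev/<Physical Dev Name>, the assumption described above."""
    return os.path.join("/dev", physical_dev_name)


def major_minor(device_path):
    """Return (major, minor) for a device node, or None if it is missing
    or is not a character/block device."""
    try:
        st = os.stat(device_path)
    except OSError:
        return None
    if not (stat.S_ISCHR(st.st_mode) or stat.S_ISBLK(st.st_mode)):
        return None
    return os.major(st.st_rdev), os.minor(st.st_rdev)


if __name__ == "__main__":
    # Paths taken from the description; the results depend on the host.
    for path in (guessed_device_path("pac_a10_f300000"), "/dev/intel-fpga-port.0"):
        print(path, "->", major_minor(path))
```

On a host with the PAC installed, only the second path would be expected to yield device numbers, so the device file name has to come from somewhere other than the "Physical Dev Name".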
was:
The plugin cannot autodetect Intel FPGA PAC (Processing Accelerator Card).
There are two major issues.
Problem #1
The output of aocl diagnose:
{noformat}
--------------------------------------------------------------------
Device Name:
acl0
Package Pat:
/home/pbacsko/inteldevstack/intelFPGA_pro/hld/board/opencl_bsp
Vendor: Intel Corp
Physical Dev Name Status Information
pac_a10_f200000 Passed PAC Arria 10 Platform (pac_a10_f200000)
PCIe 08:00.0
FPGA temperature = 79 degrees C.
DIAGNOSTIC_PASSED
--------------------------------------------------------------------
Call "aocl diagnose <device-names>" to run diagnose for specified devices
Call "aocl diagnose all" to run diagnose for all devices
{noformat}
The plugin fails to recognize this and fails with the following message:
{noformat}
2019-01-25 06:46:02,834 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaResourcePlugin:
Using FPGA vendor plugin:
org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin
2019-01-25 06:46:02,943 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaDiscoverer:
Trying to diagnose FPGA information ...
2019-01-25 06:46:03,085 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule:
Using traffic control bandwidth handler
2019-01-25 06:46:03,108 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl:
Initializing mounted controller cpu at /sys/fs/cgroup/cpu,cpuacct/yarn
2019-01-25 06:46:03,139 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.fpga.FpgaResourceHandlerImpl:
FPGA Plugin bootstrap success.
2019-01-25 06:46:03,247 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
Couldn't find (?i)bus:slot.func\s=\s.*, pattern
2019-01-25 06:46:03,248 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
Couldn't find (?i)Total\sCard\sPower\sUsage\s=\s.* pattern
2019-01-25 06:46:03,251 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
Failed to get major-minor number from reading /dev/pac_a10_f300000
2019-01-25 06:46:03,252 ERROR
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to
bootstrap configured resource subsystems!
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException:
No FPGA devices detected!
{noformat}
Problem #2
The plugin assume that the file name under {{/dev}} can be derived from the
"Physical Dev Name". This is not the case. For example, it thinks that the
device file is {{ /dev/pac_a10_f300000}} which is not the case, the actual file
is {{/dev/intel-fpga-port.0}}.
> FPGA plugin fails to recognize Intel PAC card
> ---------------------------------------------
>
> Key: YARN-9265
> URL: https://issues.apache.org/jira/browse/YARN-9265
> Project: Hadoop YARN
> Issue Type: Sub-task
> Affects Versions: 3.1.0
> Reporter: Peter Bacsko
> Priority: Critical