You have been subscribed to a public bug:

An AMD Milan Delta system with an HGX A100 8-GPU board fails to detect
all 8 GPUs on both Ubuntu 18.04 and 20.04 because the fabric manager
cannot be enabled. Other Linux distributions, such as CentOS and RHEL,
detect all 8 GPUs without problems.

From a clean Ubuntu 18.04 install
A100 Delta board

Output from systemctl status nvidia-fabricmanager (process terminated due to
NVSwitch driver failure)
------------
Feb 06 04:44:11 milan-delta systemd[1]: Starting NVIDIA fabric manager service...
Feb 06 04:44:12 milan-delta nv-fabricmanager[64822]: request to query NVSwitch device information from NVSw>
Feb 06 04:44:12 milan-delta systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, >
Feb 06 04:44:12 milan-delta systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Feb 06 04:44:12 milan-delta systemd[1]: Failed to start NVIDIA fabric manager service.
------------

Syslog output
-----------
Feb 6 04:44:14 milan-delta kernel: [ 1185.231538] NVRM: GPU 0000:85:00.0: RmInitAdapter failed! (0x23:0xffff:624)
Feb 6 04:44:14 milan-delta kernel: [ 1185.231895] NVRM: GPU 0000:85:00.0: rm_init_adapter failed, device minor number 2
Feb 6 04:44:14 milan-delta nvidia-persistenced: device 0000:85:00.0 - failed to open.
Feb 6 04:44:14 milan-delta nvidia-persistenced: device 0000:8b:00.0 - registered
-----------


dmesg output
-----------
[ 1170.435712] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:45:00.0)
[ 1170.435714] NVRM: The system BIOS may have misconfigured your GPU.
[ 1170.435725] nvidia: probe of 0000:45:00.0 failed with error -1

[ 1182.379923] nvidia: loading out-of-tree module taints kernel.
[ 1182.379936] nvidia: module license 'NVIDIA' taints kernel.
[ 1182.379937] Disabling lock debugging due to kernel taint
[ 1182.389651] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 1182.406795] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[ 1182.406939] nvidia-nvswitch: Probing device 0000:d4:00.0, Vendor Id = 0x10de, Device Id = 0x1af1, Class = 0x68000
[ 1182.407252] nvidia-nvswitch0: Failed to map BAR0 region : -12
-----------
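
For triage, the devices named in the dmesg excerpt above can be pulled
out of a capture mechanically. The following is a minimal Python sketch,
assuming the kernel messages keep the exact wording shown in this report;
the function names (unassigned_bar0, failed_probes) are hypothetical
helpers, and the sample text is copied from the excerpt with wrapped
lines rejoined.

```python
import re

# Sample lines copied from the dmesg excerpt in this report
# (email-wrapped lines rejoined). Replace with a real capture as needed.
DMESG = """\
[ 1170.435712] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:45:00.0)
[ 1170.435725] nvidia: probe of 0000:45:00.0 failed with error -1
[ 1182.406939] nvidia-nvswitch: Probing device 0000:d4:00.0, Vendor Id = 0x10de, Device Id = 0x1af1, Class = 0x68000
[ 1182.407252] nvidia-nvswitch0: Failed to map BAR0 region : -12
"""

# PCI address in domain:bus:device.function form, e.g. 0000:45:00.0
PCI = r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]"


def unassigned_bar0(text):
    """Return PCI addresses that NVRM reports with BAR0 at '0M @ 0x0'.

    Assumption: the message wording matches the excerpt in this report.
    """
    return re.findall(r"BAR0 is 0M @ 0x0 \(PCI:(" + PCI + r")\)", text)


def failed_probes(text):
    """Return (pci_address, error_code) pairs for failed driver probes."""
    return re.findall(r"probe of (" + PCI + r") failed with error (-?\d+)", text)


if __name__ == "__main__":
    print("BAR0 unassigned:", unassigned_bar0(DMESG))
    print("Failed probes:  ", failed_probes(DMESG))
```

On a live system the same functions could be applied to the text of a
saved dmesg capture instead of the embedded sample.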

** Affects: ubuntu
     Importance: Undecided
         Status: New

-- 
Milan Delta A100 GPU fails to detect on Ubuntu 18.04 and 20.04 
https://bugs.launchpad.net/bugs/1915413
You received this bug notification because you are a member of Ubuntu Bugs, 
which is subscribed to Ubuntu.

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
