Some more information:

1. driver seems to be loading fine in the guest:
bronekk@euclid:~$ sudo dmesg | grep -E "nvidia|0d:00"
[    0.810066] pci 0000:0d:00.0: [10de:1eb1] type 00 class 0x030000
[    0.814518] pci 0000:0d:00.0: reg 0x10: [mem 0xc0000000-0xc0ffffff]
[    0.818518] pci 0000:0d:00.0: reg 0x14: [mem 0x1000000000-0x100fffffff 64bit pref]
[    0.825110] pci 0000:0d:00.0: reg 0x1c: [mem 0x1010000000-0x1011ffffff 64bit pref]
[    0.829048] pci 0000:0d:00.0: reg 0x24: [io  0x9000-0x907f]
[    0.834899] pci 0000:0d:00.0: PME# supported from D0 D3hot D3cold
[    0.836042] pci 0000:0d:00.1: [10de:10f8] type 00 class 0x040300
[    0.837841] pci 0000:0d:00.1: reg 0x10: [mem 0xc1000000-0xc1003fff]
[    0.845020] pci 0000:0d:00.2: [10de:1ad8] type 00 class 0x0c0330
[    0.847351] pci 0000:0d:00.2: reg 0x10: [mem 0x1012000000-0x101203ffff 64bit pref]
[    0.854518] pci 0000:0d:00.2: reg 0x1c: [mem 0x1012040000-0x101204ffff 64bit pref]
[    0.858820] pci 0000:0d:00.2: PME# supported from D0 D3hot D3cold
[    0.862836] pci 0000:0d:00.3: [10de:1ad9] type 00 class 0x0c8000
[    0.864838] pci 0000:0d:00.3: reg 0x10: [mem 0xc1004000-0xc1004fff]
[    0.873964] pci 0000:0d:00.3: PME# supported from D0 D3hot D3cold
[    0.932598] pci 0000:0d:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    0.934523] pci 0000:0d:00.0: vgaarb: bridge control possible
[    0.936134] pci 0000:0d:00.0: vgaarb: setting as boot device (VGA legacy resources not available)
[    1.440190] pci 0000:0d:00.1: D0 power state depends on 0000:0d:00.0
[    1.441170] pci 0000:0d:00.2: D0 power state depends on 0000:0d:00.0
[    1.443582] pci 0000:0d:00.3: D0 power state depends on 0000:0d:00.0
[    2.619525] xhci_hcd 0000:0d:00.2: xHCI Host Controller
[    2.620624] xhci_hcd 0000:0d:00.2: new USB bus registered, assigned bus 
number 11
[    2.622792] xhci_hcd 0000:0d:00.2: hcc params 0x0180ff05 hci version 0x110 
quirks 0x0000000000000010
[    2.672211] usb usb11: SerialNumber: 0000:0d:00.2
[    2.676422] xhci_hcd 0000:0d:00.2: xHCI Host Controller
[    2.677944] xhci_hcd 0000:0d:00.2: new USB bus registered, assigned bus 
number 12
[    2.681209] xhci_hcd 0000:0d:00.2: Host supports USB 3.1 Enhanced SuperSpeed
[    2.705956] usb usb12: SerialNumber: 0000:0d:00.2
[    3.926249] nvidia: loading out-of-tree module taints kernel.
[    3.927118] nvidia: module license 'NVIDIA' taints kernel.
[    3.938804] nvidia: module verification failed: signature and/or required 
key missing - tainting kernel
[    3.966693] nvidia-nvlink: Nvlink Core is being initialized, major device 
number 249
[    3.971181] nvidia 0000:0d:00.0: vgaarb: changed VGA decodes: 
olddecodes=io+mem,decodes=none:owns=none
[    4.070078] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for 
UNIX platforms  460.91.03  Fri Jul  2 05:43:38 UTC 2021
[    4.349705] [drm] [nvidia-drm] [GPU ID 0x00000d00] Loading driver
[    4.352647] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:0d:00.0 on 
minor 0
[    4.527067] audit: type=1400 audit(1640858541.112:5): apparmor="STATUS" 
operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=650 
comm="apparmor_parser"
[    4.527073] audit: type=1400 audit(1640858541.112:6): apparmor="STATUS" 
operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" 
pid=650 comm="apparmor_parser"
[    4.915963] snd_hda_intel 0000:0d:00.1: Disabling MSI
[    4.954737] snd_hda_intel 0000:0d:00.1: Handle vga_switcheroo audio client
[    5.244486] input: HDA NVidia HDMI/DP,pcm=3 as 
/devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input6
[    5.247732] input: HDA NVidia HDMI/DP,pcm=7 as 
/devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input7
[    5.250636] input: HDA NVidia HDMI/DP,pcm=8 as 
/devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input8
[    5.253520] input: HDA NVidia HDMI/DP,pcm=9 as 
/devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input9
[    5.256445] input: HDA NVidia HDMI/DP,pcm=10 as 
/devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input10
[    5.259401] input: HDA NVidia HDMI/DP,pcm=11 as 
/devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input11
[    5.262271] input: HDA NVidia HDMI/DP,pcm=12 as 
/devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input12


bronekk@euclid:~$ sudo nvidia-smi
Thu Dec 30 10:04:48 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 4000     On   | 00000000:0D:00.0 Off |                  N/A |
| 30%   39C    P8     3W / 125W |      1MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

bronekk@euclid:~$ sudo lspci -vnn -s 0d:00.0
0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104GL [Quadro 
RTX 4000] [10de:1eb1] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Dell TU104GL [Quadro RTX 4000] [1028:12a0]
        Physical Slot: 0-12
        Flags: bus master, fast devsel, latency 0, IRQ 116
        Memory at c0000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 1000000000 (64-bit, prefetchable) [size=256M]
        Memory at 1010000000 (64-bit, prefetchable) [size=32M]
        I/O ports at 9000 [size=128]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
<?>
        Kernel driver in use: nvidia
        Kernel modules: nvidia


2. host should not be trying to access the card:

bronekk@gauss ~ % sudo lspci -vnn -s 81:00.0
81:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104GL [Quadro 
RTX 4000] [10de:1eb1] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Dell Device [1028:12a0]
        Flags: bus master, fast devsel, latency 0, IRQ 381, IOMMU group 31
        Memory at bc000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 20000000000 (64-bit, prefetchable) [size=256M]
        Memory at 20010000000 (64-bit, prefetchable) [size=32M]
        I/O ports at b000 [size=128]
        Expansion ROM at bd000000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [258] L1 PM Substates
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 
<?>
        Capabilities: [900] Secondary PCI Express
        Capabilities: [bb0] Physical Resizable BAR
        Kernel driver in use: vfio-pci
        Kernel modules: nouveau


bronekk@gauss ~ % sudo cat /etc/modprobe.d/40-blacklist.conf
# This host is headless, prevent any modules from attaching to video hardware

# NVIDIA
blacklist nouveau
blacklist nvidia

# AMD
blacklist radeon
blacklist amdgpu
blacklist amdkfd
blacklist fglrx

# HDMI sound on a GPU
blacklist snd_hda_intel

# Framebuffers (ALL of them)
blacklist vesafb
blacklist aty128fb
blacklist atyfb
blacklist radeonfb
blacklist cirrusfb
blacklist cyber2000fb
blacklist cyblafb
blacklist gx1fb
blacklist hgafb
blacklist i810fb
blacklist intelfb
blacklist kyrofb
blacklist lxfb
blacklist matroxfb_base
blacklist neofb
blacklist nvidiafb
blacklist pm2fb
blacklist rivafb
blacklist s1d13xxxfb
blacklist savagefb
blacklist sisfb
blacklist sstfb
blacklist tdfxfb
blacklist tridentfb
blacklist vfb
blacklist viafb
blacklist vt8623fb
blacklist udlfb

bronekk@gauss ~ % sudo cat /etc/modprobe.d/30-vfio.conf
# 10de:* are NVIDIA
# 1912:0015 is Renesas Technology Corp. uPD720202 USB 3.0 Host Controller
options vfio-pci ids=10de:1eb1,10de:10f8,10de:1ad8,10de:1ad9,1912:0015
options vfio-pci disable_vga=1

bronekk@gauss ~ % sudo lspci -nn | grep -F "10de:"
81:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104GL [Quadro 
RTX 4000] [10de:1eb1] (rev a1)
81:00.1 Audio device [0403]: NVIDIA Corporation TU104 HD Audio Controller 
[10de:10f8] (rev a1)
81:00.2 USB controller [0c03]: NVIDIA Corporation TU104 USB 3.1 Host Controller 
[10de:1ad8] (rev a1)
81:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI 
Controller [10de:1ad9] (rev a1)
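
As a cross-check that the ids= list in 30-vfio.conf covers every function of the card, the vendor:device pairs can be pulled out of the `lspci -nn` output mechanically. This is only a minimal Python sketch, assuming the output is available as unwrapped text; the function name `vfio_ids` and the `SAMPLE` string (copied from the listing above) are illustrative and not part of any tool.

```python
import re

# `lspci -nn` prints the vendor:device pair as [vvvv:dddd]; class codes such
# as [0300] contain no colon, so this pattern does not match them.
ID_RE = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]")


def vfio_ids(lspci_nn_output, vendor="10de"):
    """Return unique vendor:device IDs for one vendor, in order of appearance."""
    ids = []
    for line in lspci_nn_output.splitlines():
        for ven, dev in ID_RE.findall(line):
            pair = f"{ven}:{dev}"
            if ven == vendor and pair not in ids:
                ids.append(pair)
    return ids


# Sample input: the four functions of the card as listed on the host.
SAMPLE = """\
81:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104GL [Quadro RTX 4000] [10de:1eb1] (rev a1)
81:00.1 Audio device [0403]: NVIDIA Corporation TU104 HD Audio Controller [10de:10f8] (rev a1)
81:00.2 USB controller [0c03]: NVIDIA Corporation TU104 USB 3.1 Host Controller [10de:1ad8] (rev a1)
81:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller [10de:1ad9] (rev a1)
"""

if __name__ == "__main__":
    # Print a candidate modprobe option line for comparison with the real file.
    print("options vfio-pci ids=" + ",".join(vfio_ids(SAMPLE)))
```

The resulting list can then be compared by eye with the ids= line in the modprobe configuration.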

3. device mapping in libvirt:

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x81' slot='0x00' function='0x0'/>
      </source>
      <rom bar='off'/>
      <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' function='0x0' 
multifunction='on'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x81' slot='0x00' function='0x1'/>
      </source>
      <rom bar='off'/>
      <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' 
function='0x1'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x81' slot='0x00' function='0x2'/>
      </source>
      <rom bar='off'/>
      <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' 
function='0x2'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x81' slot='0x00' function='0x3'/>
      </source>
      <rom bar='off'/>
      <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' 
function='0x3'/>
    </hostdev>
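
To double-check the mapping, the host and guest addresses of each hostdev entry can be listed side by side. The following is a rough Python sketch using only the standard library; the helper names `_bdf` and `hostdev_map` are made up for illustration, and the sample wraps two of the entries above in a dummy root element so the fragment parses as one document.

```python
import xml.etree.ElementTree as ET


def _bdf(addr):
    """Format a libvirt <address> element as bus:slot.function."""
    bus, slot, func = (int(addr.get(k), 16) for k in ("bus", "slot", "function"))
    return f"{bus:02x}:{slot:02x}.{func:x}"


def hostdev_map(xml_text):
    """Return (host address, guest address, multifunction flag) per <hostdev>."""
    root = ET.fromstring(xml_text)
    rows = []
    for hd in root.iter("hostdev"):
        src = hd.find("source/address")  # host-side PCI address
        dst = hd.find("address")         # guest-side PCI address (direct child)
        rows.append((_bdf(src), _bdf(dst), dst.get("multifunction") == "on"))
    return rows


# Two of the four entries, wrapped in a placeholder <devices> root element.
SAMPLE = """<devices>
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <driver name='vfio'/>
    <source>
      <address domain='0x0000' bus='0x81' slot='0x00' function='0x0'/>
    </source>
    <rom bar='off'/>
    <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' function='0x0' multifunction='on'/>
  </hostdev>
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <driver name='vfio'/>
    <source>
      <address domain='0x0000' bus='0x81' slot='0x00' function='0x1'/>
    </source>
    <rom bar='off'/>
    <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' function='0x1'/>
  </hostdev>
</devices>"""

if __name__ == "__main__":
    for host, guest, multi in hostdev_map(SAMPLE):
        print(host, "->", guest, "(multifunction)" if multi else "")
```

With the full fragment, this should show host functions 81:00.0 to 81:00.3 landing on guest 0d:00.0 to 0d:00.3, with the multifunction flag set only on function 0.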


4. something is definitely wrong inside the guest, since I am getting these:

[ 1236.179163] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [Xorg:2982]
[ 1236.179961] Modules linked in: hid_generic usbhid hid rfkill 
snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg soundwire_intel 
soundwire_generic_allocation snd_soc_core ghash_clmulni_intel snd_compress 
soundwire_cadence nls_ascii snd_hda_codec nls_cp437 vfat fat aesni_intel 
snd_hda_core libaes snd_hwdep crypto_simd soundwire_bus cryptd nvidia_drm(POE) 
snd_pcm glue_helper snd_timer drm_kms_helper snd iTCO_wdt intel_pmc_bxt joydev 
iTCO_vendor_support sg serio_raw cec watchdog soundcore virtio_console 
virtio_balloon pcspkr evdev efi_pstore qemu_fw_cfg nvidia_modeset(POE) 
nvidia(POE) drm fuse configfs efivarfs virtio_rng rng_core ip_tables x_tables 
autofs4 ext4 crc16 mbcache jbd2 crc32c_generic sd_mod t10_pi sr_mod crc_t10dif 
cdrom crct10dif_generic ahci libahci xhci_pci libata xhci_hcd virtio_scsi 
virtio_net net_failover failover scsi_mod usbcore crct10dif_pclmul psmouse 
crct10dif_common crc32_pclmul crc32c_intel i2c_i801 virtio_pci lpc_ich 
i2c_smbus virtio_ring usb_common virtio button
[ 1236.189681] CPU: 12 PID: 2982 Comm: Xorg Tainted: P           OEL    
5.10.0-10-amd64 #1 Debian 5.10.84-1
[ 1236.190725] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 
02/06/2015
[ 1236.191711] RIP: 0010:_nv032887rm+0x12/0x40 [nvidia]
[ 1236.192286] Code: d2 0e 31 c0 e8 af 7d 78 ff e8 ca 3c eb ff 31 c0 48 83 c4 
08 c3 0f 1f 00 48 83 ec 08 39 4a 10 76 17 48 8b 02 c1 e9 02 8b 04 88 <48> 83 c4 
08 c3 66 0f 1f 84 00 00 00 00 00 be 00 00 d5 09 bf 0a ad
[ 1236.194379] RSP: 0018:ffffa9b840f6ba98 EFLAGS: 00000256
[ 1236.194977] RAX: 00000000164000a1 RBX: 0000000000000020 RCX: 0000000000000000
[ 1236.195804] RDX: ffff9995889fd0a0 RSI: ffff9995889fc008 RDI: ffff99958b67d008
[ 1236.196617] RBP: ffff999586b02a00 R08: 0000000000000020 R09: 0000000000000000
[ 1236.197425] R10: ffff9995889fc008 R11: ffff9995889fd0a0 R12: 0000000000000000
[ 1236.198217] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9995889fc008
[ 1236.199011] FS:  00007f7bcbbd6a40(0000) GS:ffff999cdfb00000(0000) 
knlGS:0000000000000000
[ 1236.199931] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1236.200587] CR2: 0000564dd482a3a8 CR3: 0000000102d80005 CR4: 0000000000770ee0
[ 1236.201392] PKRU: 55555554
[ 1236.201699] Call Trace:
[ 1236.202118]  ? _nv009235rm+0x1f1/0x230 [nvidia]
[ 1236.202763]  ? _nv036126rm+0x62/0x70 [nvidia]
[ 1236.203393]  ? _nv028825rm+0x46/0x4a0 [nvidia]
[ 1236.204041]  ? _nv009323rm+0x7b/0x90 [nvidia]
[ 1236.204667]  ? _nv009319rm+0xfb/0x4f0 [nvidia]
[ 1236.205302]  ? _nv037231rm+0xfd/0x180 [nvidia]
[ 1236.205939]  ? _nv034489rm+0x248/0x370 [nvidia]
[ 1236.206528]  ? _nv009448rm+0x3d/0x90 [nvidia]
[ 1236.207153]  ? _nv029075rm+0x14c/0x670 [nvidia]
[ 1236.207759]  ? _nv028910rm+0x520/0x900 [nvidia]
[ 1236.208378]  ? _nv002525rm+0x9/0x20 [nvidia]
[ 1236.208966]  ? _nv003517rm+0x1b/0x80 [nvidia]
[ 1236.209551]  ? _nv013021rm+0x6fe/0x770 [nvidia]
[ 1236.210149]  ? _nv038021rm+0xb3/0x150 [nvidia]
[ 1236.210736]  ? _nv038020rm+0x388/0x4e0 [nvidia]
[ 1236.211336]  ? _nv036312rm+0xbe/0x140 [nvidia]
[ 1236.211939]  ? _nv036313rm+0x42/0x70 [nvidia]
[ 1236.212525]  ? _nv008273rm+0x4b/0x90 [nvidia]
[ 1236.213117]  ? _nv000709rm+0x4ef/0x880 [nvidia]
[ 1236.213709]  ? rm_ioctl+0x54/0xb0 [nvidia]
[ 1236.214228]  ? nvidia_ioctl+0x66c/0x880 [nvidia]
[ 1236.214816]  ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[ 1236.215516]  ? __x64_sys_ioctl+0x83/0xb0
[ 1236.215972]  ? do_syscall_64+0x33/0x80
[ 1236.216405]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
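
To see how often the lockups occur and which process triggers them, the watchdog lines can be extracted from a saved dmesg dump. A small Python sketch, assuming the message format shown above; the name `soft_lockups` and the dictionary keys are hypothetical choices for illustration.

```python
import re

# Matches lines such as:
#   watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [Xorg:2982]
LOCKUP_RE = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+?):(?P<pid>\d+)\]"
)


def soft_lockups(dmesg_text):
    """Return one dict per soft lockup report found in the given text."""
    events = []
    for m in LOCKUP_RE.finditer(dmesg_text):
        events.append({
            "cpu": int(m["cpu"]),
            "seconds": int(m["secs"]),
            "comm": m["comm"],
            "pid": int(m["pid"]),
        })
    return events


# The first line of the report above, used as sample input.
SAMPLE = "[ 1236.179163] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [Xorg:2982]"

if __name__ == "__main__":
    for event in soft_lockups(SAMPLE):
        print(event)
```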

On Wed, 29 Dec 2021, at 11:16 PM, Bronek Kozicki wrote:
> Hi
>
> Hoping someone solved this one before.
>
> My host is an Epyc Milan, running on an Asrock ROMED8-2T, the GPU is an
> NVIDIA Quadro RTX 4000, running on a fresh Arch Linux install. The guest
> is Debian 11, with NVIDIA-460 drivers. I can see the drivers are
> correctly loaded in the guest (with nvidia-smi), but Xorg fails to
> initialize. The /var/log/Xorg.0.log tail is:
>
>
> [   254.714] (II) NVIDIA: Using 24576.00 MB of virtual memory for 
> indirect memory
> [   254.714] (II) NVIDIA:     access.
> [   257.719] (EE) NVIDIA(GPU-0): Failed to initialize DMA.
> [   257.720] (EE) NVIDIA(0): Failed to allocate push buffer
> [   257.829] (EE) 
> Fatal server error:
> [   257.829] (EE) AddScreen/ScreenInit failed for driver 0
> [   257.829] (EE) 
> [   257.829] (EE) 
> Please consult the The X.Org Foundation support 
>        at http://wiki.x.org
>  for help. 
> [   257.829] (EE) Please also check the log file at 
> "/var/log/Xorg.0.log" for additional information.
> [   257.829] (EE) 
> [   257.829] (EE) Server terminated with error (1). Closing log file.
>
> I am running similar configuration (same card, also Debian 11 and 
> nvidia-460 drivers) on a different host, with an older Intel Xeon CPU. 
> No problems there.
>
> Any hints?
>
>
> B.
>
> -- 
>   Bronek Kozicki
>   b...@incorrekt.com
>
> _______________________________________________
> vfio-users mailing list
> vfio-users@redhat.com
> https://listman.redhat.com/mailman/listinfo/vfio-users

-- 
  Bronek Kozicki
  b...@incorrekt.com


