On Sat, 18 Aug 2018 01:48:24 +0100 Gary <g...@mups.co.uk> wrote:

> Hi all,
>
> I have vfio-pci configured to allow Linux host to run on intel iGPU
> whilst a 8GB Sapphire Nitro+ RX580 is passed through using virt-manager
> to a Windows 10 VM. As long as I eject the GPU in windows before
> shutting down the VM, everything works (amd reset bug?).
>
> I would however like to use the RX580 in the host when the VM is not
> running. In order to do this I removed the vfio-pci ids= option allowing
> the amdgpu module to bind as normal. I also updated my xorg config to:
>
> Section "Device"
> Identifier "Intel Graphics"
> Driver "intel"
> Option "DRI" "3"
> EndSection
>
> Section "ServerFlags"
> Option "AutoAddGPU" "off"
> EndSection
>
> Section "Device"
> Identifier "AMDGPU"
> Driver "amdgpu"
> Option "DRI3" "1"
> Option "Ignore" "1"
> EndSection
>
> This allows me to use the intel graphics or via DRI_PRIME=1 the AMD
> graphics. I can also start the VM and virt-manager will rebind the
> GPU/GPUAudio to vfio-pci and the VM works nicely.
>
> The problem with this setup comes when I eject the GPU in windows.
> virt-manager in the host locks up and dmesg shows a kernel bug message
> (full error at end of email)
>
>
> [ 423.535829] ------------[ cut here ]------------
> [ 423.535830] kernel BUG at /build/linux-hvYKKE/linux-4.17.8/drivers/iommu/intel-iommu.c:732!
> [ 423.535835] invalid opcode: 0000 [#1] SMP PTI
> [ 423.535836] Modules linked in: tun fuse ebtable_filter...
>
>
> After a power cycle and thinking this may be to do with the amdgpu
> module rebind, I tried unloading the amdgpu module whilst the the VM was
> running and thus the GPU bound to vfio-pci. Ejecting the GPU in windows
> no longer caused virt-manager to lockup and I could then shut down the
> VM via virt-manager.
>
> However, this just delays the issue, when an attempt is made to rebind
> the AMDGPU I once more get a lockup, this time with the dmesg error:
>
> [ 982.416988] BUG: unable to handle kernel paging request at
> ffffb9ad1281a2b4
> [ 982.416992] PGD 41e921067 P4D 41e921067 PUD 0
> [ 982.416995] Oops: 0002 [#1] SMP PTI
> [ 982.416997] Modules linked in: amdgpu(+) chash gpu_sched...
>
> Note, the lockup is of the graphics output. I can still SSH into the
> machine, although trying to shut the machine down does not get too far.
>
> Is this in anyway related to the AMD reset bug? If not, any idea if
> there's a fix or workaround or any further information I could provide
> to help troubleshoot this?
>
>
> Full trace from dmesg for the two errors follows
>
> ----------------------- FIRST Error ------------------------------
> [ 423.535829] ------------[ cut here ]------------
> [ 423.535830] kernel BUG at
> /build/linux-hvYKKE/linux-4.17.8/drivers/iommu/intel-iommu.c:732!
> [ 423.535835] invalid opcode: 0000 [#1] SMP PTI
> [ 423.535836] Modules linked in: tun fuse ebtable_filter ebtables
> bridge stp llc cpufreq_powersave cpufreq_userspace cpufreq_conservative
> binfmt_misc nls_ascii nls_cp437 vfat fat snd_hda_codec_realtek
> snd_hda_codec_generic amdkfd ip6t_REJECT nf_reject_ipv6 nf_log_ipv6
> xt_hl ip6t_rt amdgpu snd_hda_codec_hdmi iTCO_wdt iTCO_vendor_support
> intel_rapl nf_conntrack_ipv6 nf_defrag_ipv6 x86_pkg_temp_thermal
> intel_powerclamp snd_hda_intel coretemp chash snd_hda_codec gpu_sched
> snd_hda_core kvm_intel i915 kvm ttm snd_hwdep efi_pstore intel_cstate
> snd_pcm intel_uncore intel_rapl_perf ipt_REJECT nf_reject_ipv4 serio_raw
> snd_timer pcspkr efivars drm_kms_helper nf_log_ipv4 sg snd drm joydev
> evdev mei_me lpc_ich i2c_algo_bit soundcore mei shpchp ie31200_edac
> nf_log_common xt_LOG video button xt_limit xt_tcpudp xt_addrtype
> [ 423.535866] nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack
> ip6table_filter ip6_tables nf_conntrack_netbios_ns
> nf_conntrack_broadcast nf_nat_ftp nf_nat vfio_pci vfio_virqfd
> vfio_iommu_type1 nf_conntrack_ftp vfio irqbypass nf_conntrack parport_pc
> ppdev lp iptable_filter parport sunrpc efivarfs ip_tables x_tables
> autofs4 ext4 crc16 mbcache jbd2 fscrypto ecb btrfs zstd_decompress
> zstd_compress xxhash algif_skcipher af_alg dm_crypt raid10 raid456
> async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq
> libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod dm_mod
> sd_mod hid_generic usbhid hid crct10dif_pclmul crc32_pclmul crc32c_intel
> ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd
> glue_helper psmouse ahci i2c_i801 libahci xhci_pci ehci_pci libata
> xhci_hcd ehci_hcd
> [ 423.535894] alx scsi_mod mdio thermal usbcore usb_common fan
> [ 423.535899] CPU: 2 PID: 3815 Comm: libvirtd Not tainted
> 4.17.0-0.bpo.1-amd64 #1 Debian 4.17.8-1~bpo9+1
> [ 423.535900] Hardware name: Gigabyte Technology Co., Ltd. To be filled
> by O.E.M./B75-D3V, BIOS F9 10/23/2013
> [ 423.535905] RIP: 0010:domain_get_iommu+0x4e/0x60
> [ 423.535906] RSP: 0018:ffffa52d48a4bb48 EFLAGS: 00010202
> [ 423.535907] RAX: 0000000000000001 RBX: 0000000080c27000 RCX:
> 0000000000000000
> [ 423.535908] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffff8b4a595d4d00
> [ 423.535909] RBP: 0000000000000000 R08: 00000000000272d0 R09:
> ffffffff994ef4b7
> [ 423.535910] R10: ffffa52d48a4ba60 R11: ffffe0d58fd21f20 R12:
> ffff8b4a5c5fb0a0
> [ 423.535911] R13: 000000ffffffffff R14: ffff8b4a595d4d00 R15:
> 0000000000001000
> [ 423.535913] FS: 00007f287deb2700(0000) GS:ffff8b4a6e300000(0000)
> knlGS:0000000000000000
> [ 423.535914] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 423.535915] CR2: fffff80077770000 CR3: 000000041772c003 CR4:
> 00000000001626e0
> [ 423.535916] Call Trace:
> [ 423.535920] __intel_map_single+0x61/0x180

This one is because the GPU is still bound to a VM IOMMU domain,
probably because the audio function is still bound to the VM and
userspace bindings are done at the group level. This is a user/libvirt
error: your scenario has allowed libvirt to attempt to rebind the GPU to
a host driver while the audio device in the same IOMMU group is still
bound to vfio-pci and in use by the user. Had intel-iommu not hit a
BUG_ON, vfio would have, for the isolation violation.

> [ 423.535957] amdgpu_gart_init+0x5e/0x100 [amdgpu]
> [ 423.535983] gmc_v8_0_sw_init+0x669/0x700 [amdgpu]
> [ 423.535997] ? drm_detect_hdmi_monitor+0x3e/0xe0 [drm]
> [ 423.536017] amdgpu_device_init+0x102a/0x1490 [amdgpu]
> [ 423.536019] ? kmalloc_order+0x14/0x40
> [ 423.536039] amdgpu_driver_load_kms+0x86/0x2c0 [amdgpu]
> [ 423.536046] drm_dev_register+0x132/0x1c0 [drm]
> [ 423.536066] amdgpu_pci_probe+0x1b5/0x280 [amdgpu]
> [ 423.536069] local_pci_probe+0x44/0xa0
> [ 423.536072] ? _cond_resched+0x16/0x40
> [ 423.536074] pci_device_probe+0x102/0x1b0
> [ 423.536077] driver_probe_device+0x2b2/0x490
> [ 423.536079] ? __driver_attach+0xe0/0xe0
> [ 423.536080] bus_for_each_drv+0x64/0xb0
> [ 423.536082] __device_attach+0xd9/0x150
> [ 423.536084] bus_rescan_devices_helper+0x30/0x50
> [ 423.536086] store_drivers_probe+0x2d/0x60
> [ 423.536088] kernfs_fop_write+0x10f/0x190
> [ 423.536091] vfs_write+0xb0/0x190
> [ 423.536093] ksys_write+0x52/0xc0
> [ 423.536095] do_syscall_64+0x55/0x110
> [ 423.536097] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 423.536098] RIP: 0033:0x7f28a4c4b1ad
> [ 423.536099] RSP: 002b:00007f287deb1930 EFLAGS: 00000293 ORIG_RAX:
> 0000000000000001
> [ 423.536101] RAX: ffffffffffffffda RBX: 0000000000000016 RCX:
> 00007f28a4c4b1ad
> [ 423.536102] RDX: 000000000000000c RSI: 00007f2858008d24 RDI:
> 0000000000000016
> [ 423.536103] RBP: 000000000000000c R08: 00007f28540009e0 R09:
> 0000000000000000
> [ 423.536104] R10: 00007f28a84ce903 R11: 0000000000000293 R12:
> 00007f2858008d24
> [ 423.536105] R13: 0000000000000000 R14: 0000000000000016 R15:
> 00007f2854000a00
> [ 423.536106] Code: 74 0d eb 29 48 83 c7 04 8b 4f fc 85 c9 75 0a 83 c0
> 01 39 d0 75 ee 31 c0 c3 48 98 48 c1 e0 03 48 8b 15 a7 4e 14 01 48 8b 04
> 02 c3 <0f> 0b 31 c0 eb ee 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
> [ 423.536126] RIP: domain_get_iommu+0x4e/0x60 RSP: ffffa52d48a4bb48
> [ 423.536128] ---[ end trace 68f635a30860d3cb ]---
>
>
>
>
> ----------------------- SECOND Error ------------------------------
>
> [ 981.069606] [drm] amdgpu kernel modesetting enabled.
> [ 981.069826] [drm] initializing kernel modesetting (POLARIS10
> 0x1002:0x67DF 0x1DA2:0xE366 0xE7).
> [ 981.069845] [drm] register mmio base: 0xF7D00000
> [ 981.069845] [drm] register mmio size: 262144
> [ 981.069851] [drm] probing gen 2 caps for device 8086:151 = 261ad03/e
> [ 981.069852] [drm] probing mlw for device 8086:151 = 261ad03
> [ 981.069853] [drm] add ip block number 0 <vi_common>
> [ 981.069854] [drm] add ip block number 1 <gmc_v8_0>
> [ 981.069855] [drm] add ip block number 2 <tonga_ih>
> [ 981.069855] [drm] add ip block number 3 <powerplay>
> [ 981.069856] [drm] add ip block number 4 <dm>
> [ 981.069856] [drm] add ip block number 5 <gfx_v8_0>
> [ 981.069857] [drm] add ip block number 6 <sdma_v3_0>
> [ 981.069857] [drm] add ip block number 7 <uvd_v6_0>
> [ 981.069858] [drm] add ip block number 8 <vce_v3_0>
> [ 981.069861] kfd kfd: skipped device 1002:67df, PCI rejects atomics
> [ 981.069868] [drm] UVD is enabled in VM mode
> [ 981.069868] [drm] UVD ENC is enabled in VM mode
> [ 981.069869] [drm] VCE enabled in VM mode
> [ 982.413309] ATOM BIOS: 113-BE366EU-Z48
> [ 982.413358] [drm] vm size is 64 GB, 2 levels, block size is 10-bit,
> fragment size is 9-bit
> [ 982.413429] amdgpu 0000:01:00.0: firmware: direct-loading firmware
> amdgpu/polaris10_mc.bin
> [ 982.413437] amdgpu 0000:01:00.0: VRAM: 8192M 0x000000F400000000 -
> 0x000000F5FFFFFFFF (8192M used)
> [ 982.413438] amdgpu 0000:01:00.0: GTT: 256M 0x0000000000000000 -
> 0x000000000FFFFFFF
> [ 982.413446] [drm] Detected VRAM RAM=8192M, BAR=256M
> [ 982.413447] [drm] RAM width 256bits GDDR5
> [ 982.413562] [TTM] Zone kernel: Available graphics memory: 7701472 kiB
> [ 982.413563] [TTM] Zone dma32: Available graphics memory: 2097152 kiB
> [ 982.413564] [TTM] Initializing pool allocator
> [ 982.413568] [TTM] Initializing DMA pool allocator
> [ 982.413858] [drm] amdgpu: 8192M of VRAM memory ready
> [ 982.413859] [drm] amdgpu: 8192M of GTT memory ready.
> [ 982.413876] DMAR: 64bit 0000:01:00.0 uses identity mapping
> [ 982.413877] [drm] GART: num cpu pages 65536, num gpu pages 65536
> [ 982.413910] [drm] PCIE GART of 256M enabled (table at
> 0x000000F400040000).
> [ 982.414019] amdgpu 0000:01:00.0: firmware: direct-loading firmware
> amdgpu/polaris10_pfp_2.bin
> [ 982.414033] amdgpu 0000:01:00.0: firmware: direct-loading firmware
> amdgpu/polaris10_me_2.bin
> [ 982.414046] amdgpu 0000:01:00.0: firmware: direct-loading firmware
> amdgpu/polaris10_ce_2.bin
> [ 982.414046] [drm] Chained IB support enabled!
> [ 982.414058] amdgpu 0000:01:00.0: firmware: direct-loading firmware
> amdgpu/polaris10_rlc.bin
> [ 982.414138] amdgpu 0000:01:00.0: firmware: direct-loading firmware
> amdgpu/polaris10_mec_2.bin
> [ 982.414240] amdgpu 0000:01:00.0: firmware: direct-loading firmware
> amdgpu/polaris10_mec2_2.bin
> [ 982.415203] amdgpu 0000:01:00.0: firmware: direct-loading firmware
> amdgpu/polaris10_sdma.bin
> [ 982.415220] amdgpu 0000:01:00.0: firmware: direct-loading firmware
> amdgpu/polaris10_sdma1.bin
> [ 982.415397] amdgpu 0000:01:00.0: firmware: direct-loading firmware
> amdgpu/polaris10_uvd.bin
> [ 982.415400] [drm] Found UVD firmware Version: 1.130 Family ID: 16
> [ 982.416620] amdgpu 0000:01:00.0: firmware: direct-loading firmware
> amdgpu/polaris10_vce.bin
> [ 982.416624] [drm] Found VCE firmware Version: 53.26 Binary ID: 3
> [ 982.416988] BUG: unable to handle kernel paging request at
> ffffb9ad1281a2b4
> [ 982.416992] PGD 41e921067 P4D 41e921067 PUD 0
> [ 982.416995] Oops: 0002 [#1] SMP PTI
> [ 982.416997] Modules linked in: amdgpu(+) chash gpu_sched ttm tun fuse
> ebtable_filter ebtables bridge stp llc cpufreq_powersave
> cpufreq_userspace cpufreq_conservative binfmt_misc intel_rapl
> x86_pkg_temp_thermal intel_powerclamp nls_ascii nls_cp437 vfat fat
> coretemp iTCO_wdt iTCO_vendor_support kvm_intel ip6t_REJECT
> nf_reject_ipv6 snd_hda_codec_realtek nf_log_ipv6 kvm amdkfd intel_cstate
> snd_hda_codec_generic efi_pstore intel_uncore xt_hl intel_rapl_perf
> ip6t_rt i915 efivars serio_raw pcspkr snd_hda_codec_hdmi snd_hda_intel
> snd_hda_codec drm_kms_helper snd_hda_core snd_hwdep snd_pcm drm
> snd_timer joydev mei_me nf_conntrack_ipv6 evdev snd sg lpc_ich soundcore
> mei shpchp i2c_algo_bit ie31200_edac nf_defrag_ipv6 video button
> ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit
> xt_tcpudp
> [ 982.417033] xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4
> xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns
> nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack
> iptable_filter vfio_pci vfio_virqfd vfio_iommu_type1 vfio irqbypass
> sunrpc parport_pc ppdev lp parport efivarfs ip_tables x_tables autofs4
> ext4 crc16 mbcache jbd2 fscrypto ecb btrfs zstd_decompress zstd_compress
> xxhash algif_skcipher af_alg dm_crypt raid10 raid456 async_raid6_recov
> async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c
> crc32c_generic raid1 raid0 multipath linear md_mod dm_mod sd_mod
> hid_generic usbhid hid crct10dif_pclmul crc32_pclmul crc32c_intel
> ghash_clmulni_intel pcbc ahci aesni_intel aes_x86_64 crypto_simd libahci
> cryptd psmouse glue_helper i2c_i801 xhci_pci libata ehci_pci xhci_hcd
> [ 982.417071] ehci_hcd scsi_mod alx mdio usbcore usb_common fan
> thermal [last unloaded: chash]
> [ 982.417078] CPU: 2 PID: 3332 Comm: modprobe Not tainted
> 4.17.0-0.bpo.1-amd64 #1 Debian 4.17.8-1~bpo9+1
> [ 982.417080] Hardware name: Gigabyte Technology Co., Ltd. To be filled
> by O.E.M./B75-D3V, BIOS F9 10/23/2013
> [ 982.417142] RIP:
> 0010:smu7_populate_single_firmware_entry.isra.5+0x89/0xe0 [amdgpu]
> [ 982.417143] RSP: 0018:ffffb991420d7950 EFLAGS: 00010246
> [ 982.417145] RAX: 000000000000008c RBX: 0000000000000003 RCX:
> 0000000000000000
> [ 982.417147] RDX: ffffffffc0f68a64 RSI: 0000000000000004 RDI:
> ffff8cafdb9c4360
> [ 982.417148] RBP: ffffb9ad1281a2b4 R08: 0000000000000002 R09:
> ffffb991493be000
> [ 982.417149] R10: 00000000802a0001 R11: 0000000000000001 R12:
> ffff8cafd698d040
> [ 982.417151] R13: ffff8cafa26fe000 R14: 000000000000047e R15:
> 0000000000000003
> [ 982.417154] FS: 00007fb5f5737700(0000) GS:ffff8cafee300000(0000)
> knlGS:0000000000000000
> [ 982.417155] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 982.417157] CR2: ffffb9ad1281a2b4 CR3: 00000003ee264003 CR4:
> 00000000001606e0
> [ 982.417158] Call Trace:
> [ 982.417208] smu7_request_smu_load_fw+0x97/0x320 [amdgpu]
> [ 982.417252] polaris10_start_smu+0x64/0x4c0 [amdgpu]
> [ 982.417293] ? amdgpu_ucode_init_bo+0xe2/0x270 [amdgpu]
> [ 982.417341] pp_hw_init+0x4c/0xd0 [amdgpu]
> [ 982.417378] amdgpu_device_init+0x13c3/0x1490 [amdgpu]
> [ 982.417383] ? kmalloc_order+0x14/0x40
> [ 982.417419] amdgpu_driver_load_kms+0x86/0x2c0 [amdgpu]
> [ 982.417433] drm_dev_register+0x132/0x1c0 [drm]
> [ 982.417469] amdgpu_pci_probe+0x1b5/0x280 [amdgpu]
> [ 982.417474] local_pci_probe+0x44/0xa0
> [ 982.417478] ? _cond_resched+0x16/0x40
> [ 982.417481] pci_device_probe+0x102/0x1b0

This one looks more like "GPU drivers are not good at hotplug ¯\_(ツ)_/¯"

> [ 982.417484] driver_probe_device+0x2b2/0x490
> [ 982.417486] __driver_attach+0xdd/0xe0
> [ 982.417489] ? driver_probe_device+0x490/0x490
> [ 982.417491] bus_for_each_dev+0x67/0xc0
> [ 982.417494] ? klist_add_tail+0x3b/0x70
> [ 982.417496] bus_add_driver+0x16a/0x260
> [ 982.417499] driver_register+0x57/0xc0
> [ 982.417501] ? 0xffffffffc1199000
> [ 982.417503] do_one_initcall+0x4d/0x1c5
> [ 982.417506] ? _cond_resched+0x16/0x40
> [ 982.417509] ? kmem_cache_alloc_trace+0x15d/0x1c0
> [ 982.417512] ? do_init_module+0x22/0x218
> [ 982.417515] do_init_module+0x5b/0x218
> [ 982.417518] load_module.constprop.55+0x2548/0x2d50
> [ 982.417521] ? vfs_read+0x119/0x130
> [ 982.417524] ? __do_sys_finit_module+0xd2/0x100
> [ 982.417526] __do_sys_finit_module+0xd2/0x100
> [ 982.417530] do_syscall_64+0x55/0x110
> [ 982.417532] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 982.417535] RIP: 0033:0x7fb5f52ac229
> [ 982.417536] RSP: 002b:00007ffe1335d988 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000139
> [ 982.417538] RAX: ffffffffffffffda RBX: 00005596844ee4c0 RCX:
> 00007fb5f52ac229
> [ 982.417540] RDX: 0000000000000000 RSI: 0000559683708638 RDI:
> 0000000000000006
> [ 982.417541] RBP: 0000559683708638 R08: 0000000000000000 R09:
> 0000000000000000
> [ 982.417542] R10: 0000000000000006 R11: 0000000000000246 R12:
> 0000000000000000
> [ 982.417544] R13: 00005596844ef830 R14: 0000000000040000 R15:
> 0000000000000000
> [ 982.417545] Code: c0 83 e3 fb 0f 94 c0 66 89 45 18 31 c0 48 8b 4c 24
> 30 65 48 33 0c 25 28 00 00 00 75 5c 48 83 c4 38 5b 5d 41 5c c3 0f b7 44
> 24 02 <66> 89 5d 00 c7 45 0c 00 00 00 00 c7 45 10 00 00 00 00 66 89 45
> [ 982.417614] RIP: smu7_populate_single_firmware_entry.isra.5+0x89/0xe0
> [amdgpu] RSP: ffffb991420d7950
> [ 982.417615] CR2: ffffb9ad1281a2b4
> [ 982.417617] ---[ end trace 095f6331aad830c9 ]---
>
>
> Thanks,
>
> Gary
>
> _______________________________________________
> vfio-users mailing list
> vfio-users@redhat.com
> https://www.redhat.com/mailman/listinfo/vfio-users
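
For the first error, it can help to look at which driver every device in the GPU's IOMMU group is bound to before anything tries to hand the GPU back to amdgpu. Below is a rough Python sketch of that check, not a definitive tool. It assumes the usual sysfs layout (/sys/bus/pci/devices/<addr>/iommu_group and /sys/kernel/iommu_groups/<N>/devices); the function names are made up for illustration, and it only reports driver bindings, so it cannot tell whether the VM still has the group open.

```python
import os

# Standard sysfs locations; both are parameters so the helpers can be
# pointed at a different tree.
IOMMU_GROUPS = "/sys/kernel/iommu_groups"
PCI_DEVICES = "/sys/bus/pci/devices"


def device_group(addr, pci_root=PCI_DEVICES):
    """Return the IOMMU group number (as a string) of a PCI device."""
    link = os.path.join(pci_root, addr, "iommu_group")
    return os.path.basename(os.readlink(link))


def group_drivers(group, groups_root=IOMMU_GROUPS):
    """Map each PCI address in the group to its bound driver, or None if unbound."""
    devices_dir = os.path.join(groups_root, str(group), "devices")
    drivers = {}
    for addr in sorted(os.listdir(devices_dir)):
        link = os.path.join(devices_dir, addr, "driver")
        if os.path.islink(link):
            drivers[addr] = os.path.basename(os.readlink(link))
        else:
            drivers[addr] = None
    return drivers


def still_on_vfio(drivers):
    """Return the addresses in the mapping that are still bound to vfio-pci."""
    return [addr for addr, drv in drivers.items() if drv == "vfio-pci"]
```

With the GPU at 0000:01:00.0 as in the log, `still_on_vfio(group_drivers(device_group("0000:01:00.0")))` lists whatever in that group is still on vfio-pci. If the audio function shows up there (typically 0000:01:00.1, which is an assumption here since it is not in the log), the group is not ready for the GPU to go back to a host driver.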