[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #19 from Dan Horák ---
They (= https://src.fedoraproject.org/fork/sharkcz/rpms/kernel/blob/talos/f/ppc64-talos-amdgpu-reset.patch) can go upstream.

I have a 4.20 kernel on the host with recent firmware for polaris11, the skiroot boot environment (4.15 IIRC) uses polaris11 firmware files from 20181015. The system boots OK in the skiroot kernel -> kexec -> host kernel sequence.

The diff between the firmware sets is

--- a-old 2019-01-08 14:34:02.300578731 +0100
+++ a-new 2019-01-08 14:34:14.519024951 +0100
@@ -1,19 +1,21 @@
 5dc1006ce1896d997232b9165f2ce8ebad52a63c polaris11_ce.bin
 193b0baaabf2a0a9e37a201a251013f26b0b70ee polaris11_ce_2.bin
+8ee2e5db95fe589e8292642e638a15d1b8291bcb polaris11_k_mc.bin
 cf2b8cd1d7f723f0edb6f17123711d6fa21ef379 polaris11_k_smc.bin
+a4ab9c0484cc9957112ab803d88e5b967c412c01 polaris11_k2_smc.bin
 f6551d45c0b652955009560bea1694c5ca86c1af polaris11_mc.bin
 a3bfc83f5f52978365428e8756ed165655984a3b polaris11_me.bin
 eb17beac2b09e25cfdc4662afc353f51a1c23272 polaris11_mec.bin
-c960ba31f13806c889d076bbe0b796281fdb0075 polaris11_mec_2.bin
+4e2290c5e030d6211168802fbe60db53b7c076c5 polaris11_mec_2.bin
 eb17beac2b09e25cfdc4662afc353f51a1c23272 polaris11_mec2.bin
-a13f4c7ce6e30ed930c7a5756196b24114247691 polaris11_mec2_2.bin
+f7956bba6312950db2de12336f79ccb28201593a polaris11_mec2_2.bin
 a7fb9fab4529707592ce3dd449cd27fbf415fb94 polaris11_me_2.bin
 6377d75775fbe5353c8397ff03d230be0f4d6bcf polaris11_pfp.bin
-af8c1b94170ada698379d1dab33924003ee525c0 polaris11_pfp_2.bin
+01eda59a1f159889d9a0ea2a9744eae4e09eaa1c polaris11_pfp_2.bin
 38c512c82fe4773f33e53ba1ada414ddbe9b9e09 polaris11_rlc.bin
 82d8fcf56ac3051981b9e70199b115ed9d46995f polaris11_sdma.bin
 8b21e98cb7e0ab000d131543c13a3ed95aa6687a polaris11_sdma1.bin
 e01ac87abb011582d1da84eda9444353de082d11 polaris11_smc.bin
-6b804243472b5653ba449106426a0da1c46a9d84 polaris11_smc_sk.bin
+f8680ef51f84df00b388d9c230a28d150836ce08 polaris11_smc_sk.bin
 85a2f70f1f3b63e02a1bfbaba73a1729cee2104e polaris11_uvd.bin
 a9abead599bb8497f38d587f682726a00bc067d2 polaris11_vce.bin

-- 
You are receiving this mail because:
You are the assignee for the bug.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
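
A listing like the one in comment #19 can be produced by hashing the polaris11 firmware files in two directories and comparing the results by file name. The following is only a minimal sketch of that comparison, not the commands Dan actually ran; the function names and the default glob pattern are assumptions made for illustration, and the directories passed in are up to the caller.

```python
import hashlib
from pathlib import Path


def sha1_of_files(directory, pattern="polaris11_*.bin"):
    """Map each matching file name to the SHA-1 hex digest of its contents."""
    digests = {}
    for path in sorted(Path(directory).glob(pattern)):
        digests[path.name] = hashlib.sha1(path.read_bytes()).hexdigest()
    return digests


def compare_firmware(old, new):
    """Classify file names as added, removed, or changed between two digest maps."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(name for name in set(old) & set(new) if old[name] != new[name])
    return {"added": added, "removed": removed, "changed": changed}
```

Calling `compare_firmware(sha1_of_files(old_dir), sha1_of_files(new_dir))` with the firmware directory from the skiroot image and the one from the host initrd (both paths are placeholders here) would report which blobs differ between the loader and the host kernel.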
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

Alex Deucher changed:

What    |Removed |Added
See Also|        |https://bugs.freedesktop.org/show_bug.cgi?id=108754
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #18 from Alex Deucher ---
(In reply to Dan Horák from comment #17)
> Fedora/ppc64le users can find a pre-built kernel with the patchset at
> https://copr.fedorainfracloud.org/coprs/sharkcz/talos-kernel/build/817728/

Should these patches go upstream? Can you confirm they fix your issues?
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #17 from Dan Horák ---
Fedora/ppc64le users can find a pre-built kernel with the patchset at https://copr.fedorainfracloud.org/coprs/sharkcz/talos-kernel/build/817728/
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #16 from Dan Horák ---
Reset on init sounds better to me, as the loader kernel (in the kexec case) is more difficult to update than the host kernel.

And for the record: after updating the skiroot kernel firmware version to the latest, there is no problem/crash.
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #15 from Alex Deucher ---
Created attachment 142316
  --> https://bugs.freedesktop.org/attachment.cgi?id=142316&action=edit
more involved fix

These patches attempt to reset the GPU on init if the GPU was already running from a previous load of the driver. Compile tested only at the moment.
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #14 from Alex Deucher ---
Created attachment 142303
  --> https://bugs.freedesktop.org/attachment.cgi?id=142303&action=edit
possible fix

(In reply to Benjamin Herrenschmidt from comment #12)
> We have no control over what firmware is loaded by the target distro so the
> right thing is going to be to reset the adapter.
>
> We'll probably need to add something to the amdgpu shutdown() path to force
> an adapter reset.

Does the attached patch help? I'd been hesitant to add reset to the shutdown path because it adds latency to the regular shutdown path and users complain when that slows down.

>
> Do you have details of what specific PCIe config space write you use? FLR?

The reset sequence is ASIC-specific. Older parts just happened to use PCI config space to trigger a GPU reset via an AMD-specific sequence. Newer GPUs reset via the PSP. FLR is only available on SR-IOV capable SKUs, so it's not a general solution.
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #13 from Christian König ---
(In reply to Benjamin Herrenschmidt from comment #12)
> We'll probably need to add something to the amdgpu shutdown() path to force
> an adapter reset.

If that were possible we would have already done that.

The problem is that you do a full ASIC reset. So not only the GPU is affected, but also bridges, sound codecs etc. If any of those parts have a driver loaded while you do the reset, you usually crash the system.

In addition, AFAIK this doesn't work on APUs, because there the GPU is part of the CPU and so you would need to reset both.

How about no longer using amdgpu in the boot loader? For just displaying a splash screen, vesafb or efifb should do fine as well.

> Do you have details of what specific PCIe config space write you use? FLR?

Alex knows the details of that, but an FLR alone doesn't work AFAIK.
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #12 from Benjamin Herrenschmidt ---
We have no control over what firmware is loaded by the target distro so the right thing is going to be to reset the adapter.

We'll probably need to add something to the amdgpu shutdown() path to force an adapter reset.

Do you have details of what specific PCIe config space write you use? FLR?
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #11 from Dan Horák ---
Thanks for the info, I've documented that in the Talos wiki under https://wiki.raptorcs.com/wiki/Troubleshooting/GPU#AMDGPU_driver_crashes_after_firmware_update
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

Christian König changed:

What|Removed |Added
CC  |        |joel.s...@gmail.com

--- Comment #10 from Christian König ---
*** Bug 108607 has been marked as a duplicate of this bug. ***
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

Christian König changed:

What      |Removed |Added
Resolution|---     |NOTABUG
Status    |NEW     |RESOLVED

--- Comment #9 from Christian König ---
(In reply to Benjamin Herrenschmidt from comment #6)
> They may or may not be related ... Alex, kexec is how we boot these
> machines, there's a Linux kernel in flash that runs a Linux based bootloader.

Yeah, you guys should have noted that, because that combination is known not to work correctly.

The problem is that some parts of the hardware are explicitly designed in a way which only allows loading one firmware after an ASIC reset.

So as long as kexec doesn't perform a full PCIe-level ASIC reset, the second driver load is expected to fail.

We have the same problem with virtualization and used to have a workaround in KVM which triggers the ASIC reset with a PCIe config space write. Alex should know the details.

The only solution I can see is to either use the same workaround as the KVM guys or use the same firmware for both the loader and the final kernel.
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #8 from Dan Horák ---
I should have mentioned I'm kexec-ing too. It's from 4.15.9 (in skiroot) to Fedora kernels 4.16, 4.17, 4.18 and now 4.19 over time. It worked fine until the recent amdgpu firmware update. The skiroot kernel uses amdgpu firmware from ~June.
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #7 from Joel ---
(In reply to Benjamin Herrenschmidt from comment #6)
> Dan... did you do some firmware changes here? Could it have to do with the
> version differences between petitboot and the final kernel?

FWIW, Talos II machines use a fork of op-build that includes the amdgpu driver in petitboot. (They also appear to be stuck on 4.16.)

I was experimenting with the same.
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #6 from Benjamin Herrenschmidt ---
They may or may not be related ... Alex, kexec is how we boot these machines, there's a Linux kernel in flash that runs a Linux based bootloader.

Until recently however, that didn't have an amdgpu driver. This might have changed.

Dan... did you do some firmware changes here? Could it have to do with the version differences between petitboot and the final kernel?

Alex, whether we track it here or separately, we probably need to look into kexec support. It's not just us, there's a bit of momentum around kexec based bootloaders (Google's on it too) as was seen at the recent OSFC (firmware conf).

A workaround in the meantime (for kexec problems) could be to hot reset the card during the transition I suppose.
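
One generic way to request a device reset from user space on Linux is the per-device `reset` file under `/sys/bus/pci/devices/`. The sketch below only illustrates that interface and is not the workaround discussed in this bug: the helper name `reset_pci_function` is hypothetical, and it assumes that no driver is bound to the device when the reset is requested. Whether the reset method the kernel picks is enough for amdgpu to load new firmware is not established here (comments #13 and #14 note that the sequence is ASIC-specific and that FLR is not generally available).

```python
from pathlib import Path

# Standard sysfs location of PCI devices on Linux.
SYSFS_PCI_DEVICES = Path("/sys/bus/pci/devices")


def reset_pci_function(address, sysfs_root=SYSFS_PCI_DEVICES):
    """Ask the kernel to reset one PCI function through its sysfs 'reset' file.

    'address' is the full domain:bus:device.function string, as printed by
    `lspci -D`. Returns False when the kernel exposes no reset method for
    the device (the 'reset' file is absent), True after the request is written.
    """
    reset_file = Path(sysfs_root) / address / "reset"
    if not reset_file.exists():
        return False
    reset_file.write_text("1")
    return True
```

The `sysfs_root` parameter exists only so the helper can be exercised against a scratch directory; on a real system the default applies and the write requires root.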
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #5 from Alex Deucher ---
(In reply to Joel from comment #4)
> I see a similar backtrace on 4.19.0-11706-g11743c56785c (Linus' tree
> mid-merge window).
>
> My system has a "fiji" card. The first kernel is 4.19 (upstream release),
> and the second kernel where the backtrace occurs is with 4.19+.
>
> The second kernel is kexec'd from the first.

Please file your own bug. Kexec is not likely to work and should be tracked separately.
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #4 from Joel ---
I see a similar backtrace on 4.19.0-11706-g11743c56785c (Linus' tree mid-merge window).

My system has a "fiji" card. The first kernel is 4.19 (upstream release), and the second kernel where the backtrace occurs is with 4.19+.

The second kernel is kexec'd from the first.

If I don't load amdgpu in the first kernel, the second kernel works. So there is something missed in the shutdown path.
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #3 from Dan Horák ---
Ha, so it's the firmware stored in the initrds that is different (lsinitrd lied). And the latest polaris11* ones provoke the crash. When I manually replaced them with the ones from the rc8 initrd, I successfully booted into the 4.19 GA kernel.
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #2 from Dan Horák ---
(In reply to Michel Dänzer from comment #1)
> There were no amdgpu driver changes between rc8 and final... Are you sure
> this is 100% reproducible with the latter and not reproducible with the
> former? If so, can you bisect?

So far it is 100% reproducible. I will try bisecting the kernel sources and will also look at what else might have changed.

For the record:
- you can find the kernels at https://copr.fedorainfracloud.org/coprs/sharkcz/talos-kernel/builds/
- amdgpu.dc=0 or 1 makes no difference (0 is my default value, see bug 107049)
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #1 from Michel Dänzer ---
There were no amdgpu driver changes between rc8 and final... Are you sure this is 100% reproducible with the latter and not reproducible with the former? If so, can you bisect?
[Bug 108585] *ERROR* hw_init of IP block failed -22
https://bugs.freedesktop.org/show_bug.cgi?id=108585

Bug ID: 108585
Summary: *ERROR* hw_init of IP block failed -22
Product: DRI
Version: unspecified
Hardware: PowerPC
OS: Linux (All)
Status: NEW
Severity: normal
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: d...@danny.cz
CC: bcroc...@redhat.com

Created attachment 142253
  --> https://bugs.freedesktop.org/attachment.cgi?id=142253&action=edit
full dmesg output

amdgpu driver fails to initialize Radeon WX4100 PRO on my Talos Power9 system with kernel 4.19 (GA). There is no such problem with 4.19-rc8 (and earlier).

...
[2.421393] [drm] amdgpu kernel modesetting enabled.
[2.421512] amdgpu :01:00.0: enabling device (0540 -> 0542)
[2.421732] [drm] initializing kernel modesetting (POLARIS11 0x1002:0x67E3 0x1002:0x0B0D 0x00).
[2.421776] [drm] register mmio base: 0x
[2.421781] [drm] register mmio size: 262144
[2.421787] [drm] PCI I/O BAR is not found.
[2.421798] [drm] add ip block number 0
[2.421801] [drm] add ip block number 1
[2.421805] [drm] add ip block number 2
[2.421808] [drm] add ip block number 3
[2.421811] [drm] add ip block number 4
[2.421814] [drm] add ip block number 5
[2.421818] [drm] add ip block number 6
[2.421821] [drm] add ip block number 7
[2.421824] [drm] add ip block number 8
[2.421837] [drm] UVD is enabled in VM mode
[2.421840] [drm] UVD ENC is enabled in VM mode
[2.421845] [drm] VCE enabled in VM mode
[2.609475] md/raid1:md127: active with 2 out of 2 mirrors
[2.625800] md127: detected capacity change from 0 to 481708474368
[2.627770] md/raid1:md126: active with 2 out of 2 mirrors
[2.643643] md126: detected capacity change from 0 to 1072693248
[2.769520] usb 1-4: new high-speed USB device number 4 using xhci_hcd
[2.769550] ATOM BIOS: 113-D0150600-103
[2.769747] [drm] vm size is 256 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[2.769846] pci :01 : [PE# 00] pseudo-bypass sizes: tracker 32800 bitmap 8192 TCEs 65536
[2.769851] pci :01 : [PE# 00] TCE tables configured for pseudo-bypass
[2.769903] amdgpu :01:00.0: BAR 2: releasing [mem 0x61000-0x6101f 64bit pref]
[2.769907] amdgpu :01:00.0: BAR 0: releasing [mem 0x6-0x60fff 64bit pref]
[2.769939] pci :00:00.0: BAR 15: releasing [mem 0x6-0x6003fbff0 64bit pref]
[2.769956] pci :00:00.0: BAR 15: assigned [mem 0x6-0x600017fff 64bit pref]
[2.769961] amdgpu :01:00.0: BAR 0: assigned [mem 0x6-0x6 64bit pref]
[2.769972] amdgpu :01:00.0: BAR 2: assigned [mem 0x60001-0x60001001f 64bit pref]
[2.770004] pci :00:00.0: PCI bridge to [bus 01]
[2.770009] pci :00:00.0: bridge window [mem 0x600c0-0x600c07fef]
[2.770015] pci :00:00.0: bridge window [mem 0x6-0x6003fbff0 64bit pref]
[2.770066] amdgpu :01:00.0: VRAM: 4096M 0x00F4 - 0x00F4 (4096M used)
[2.770069] amdgpu :01:00.0: GART: 256M 0x - 0x0FFF
[2.770075] [drm] Detected VRAM RAM=4096M, BAR=4096M
[2.770077] [drm] RAM width 128bits GDDR5
[2.770162] [TTM] Zone kernel: Available graphics memory: 32717248 kiB
[2.770165] [TTM] Zone dma32: Available graphics memory: 2097152 kiB
[2.770166] [TTM] Initializing pool allocator
[2.771771] [drm] amdgpu: 4096M of VRAM memory ready
[2.771774] [drm] amdgpu: 4096M of GTT memory ready.
[2.771790] [drm] GART: num cpu pages 4096, num gpu pages 65536
[2.771839] [drm] PCIE GART of 256M enabled (table at 0x00F4008D).
[2.771911] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[2.771913] [drm] Driver supports precise vblank timestamp query.
[2.772311] [drm] AMDGPU Display Connectors
[2.772313] [drm] Connector 0:
[2.772315] [drm] DP-1
[2.772316] [drm] HPD5
[2.772318] [drm] DDC: 0x4868 0x4868 0x4869 0x4869 0x486a 0x486a 0x486b 0x486b
[2.772320] [drm] Encoders:
[2.772322] [drm] DFP1: INTERNAL_UNIPHY1
[2.772323] [drm] Connector 1:
[2.772325] [drm] DP-2
[2.772326] [drm] HPD4
[2.772328] [drm] DDC: 0x486c 0x486c 0x486d 0x486d 0x486e 0x486e 0x486f 0x486f
[2.772330] [drm] Encoders:
[2.772332] [drm] DFP2: INTERNAL_UNIPHY1
[2.772333] [drm] Connector 2:
[2.772335] [drm] DP-3
[2.772336] [drm] HPD3
[2.772338] [drm] DDC: 0x4870 0x4870 0x4871 0x4871 0x4872 0x4872 0x4873 0x4873
[2.772340] [drm] Encoders:
[2.772341] [drm] DFP3: INTERNAL_UNIPHY
[2.772343] [drm] Connector 3:
[2.772345] [drm] DP-4
[2.772346] [drm] HPD2
[2.772348] [drm] DDC: 0x4874