Re: regression/bisected/6.8 commit f7fe64ad0f22ff034f8ebcfbd7299ee9cc9b57d7 leads to GPU hang when I open GNOME activities
On Wed, Jan 24, 2024 at 7:19 AM Mikhail Gavrilov wrote:
>
> Who could dig into it, please?

Did you decide to revert it?
https://lkml.org/lkml/2024/1/22/1866

Also, I forgot to attach the kernel build .config to the previous message, so I am attaching it here. It may be useful when reproducing the bug with my script.

--
Best Regards,
Mike Gavrilov.

Attachment: .config.zip (Zip archive)
Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"
On Fri, Dec 15, 2023 at 5:37 PM Christian König wrote:
>
> I have no idea :)
>
> From the logs I can see that the AMDGPU now has the proper BARs assigned:
>
> [5.722015] pci :03:00.0: [1002:73df] type 00 class 0x038000
> [5.722051] pci :03:00.0: reg 0x10: [mem 0xf8-0xfb 64bit pref]
> [5.722081] pci :03:00.0: reg 0x18: [mem 0xfc-0xfc0fff 64bit pref]
> [5.722112] pci :03:00.0: reg 0x24: [mem 0xfca0-0xfcaf]
> [5.722134] pci :03:00.0: reg 0x30: [mem 0xfcb0-0xfcb1 pref]
> [5.722368] pci :03:00.0: PME# supported from D1 D2 D3hot D3cold
> [5.722484] pci :03:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at :00:01.1 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
>
> And with that the driver can work perfectly fine.
>
> Have you updated the BIOS or added/removed some other hardware? Maybe
> somebody added a quirk for your BIOS into the PCIe code or something
> like that.

No, nothing changed in hardware. But I found the commit which fixes it.

$ git bisect unfixed
92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6 is the first fixed commit
commit 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6
Author: Vasant Hegde
Date:   Thu Sep 21 09:21:45 2023 +

    iommu/amd: Introduce iommu_dev_data.flags to track device capabilities

    Currently we use struct iommu_dev_data.iommu_v2 to keep track of the device ATS, PRI, and PASID capabilities. But these capabilities can be enabled independently (except PRI requires ATS support). Hence, replace the iommu_v2 variable with a flags variable, which keep track of the device capabilities.

    From commit 9bf49e36d718 ("PCI/ATS: Handle sharing of PF PRI Capability with all VFs"), device PRI/PASID is shared between PF and any associated VFs. Hence use pci_pri_supported() and pci_pasid_features() instead of pci_find_ext_capability() to check device PRI/PASID support.

    Signed-off-by: Vasant Hegde
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Jerry Snitselaar
    Link: https://lore.kernel.org/r/20230921092147.5930-13-vasant.he...@amd.com
    Signed-off-by: Joerg Roedel

 drivers/iommu/amd/amd_iommu_types.h | 3 ++-
 drivers/iommu/amd/iommu.c | 46 ++---
 2 files changed, 30 insertions(+), 19 deletions(-)

$ git bisect log
git bisect start '--term-new=fixed' '--term-old=unfixed'
# status: waiting for both good and bad commits
# fixed: [33cc938e65a98f1d29d0a18403dbbee050dcad9a] Linux 6.7-rc4
git bisect fixed 33cc938e65a98f1d29d0a18403dbbee050dcad9a
# status: waiting for good commit(s), bad commit known
# unfixed: [ffc253263a1375a65fa6c9f62a893e9767fbebfa] Linux 6.6
git bisect unfixed ffc253263a1375a65fa6c9f62a893e9767fbebfa
# unfixed: [7d461b291e65938f15f56fe58da2303b07578a76] Merge tag 'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm
git bisect unfixed 7d461b291e65938f15f56fe58da2303b07578a76
# unfixed: [e14aec23025eeb1f2159ba34dbc1458467c4c347] s390/ap: fix AP bus crash on early config change callback invocation
git bisect unfixed e14aec23025eeb1f2159ba34dbc1458467c4c347
# unfixed: [be3ca57cfb777ad820c6659d52e60bbdd36bf5ff] Merge tag 'media/v6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect unfixed be3ca57cfb777ad820c6659d52e60bbdd36bf5ff
# fixed: [c0d12d769299e1e08338988c7745009e0db2a4a0] Merge tag 'drm-next-2023-11-10' of git://anongit.freedesktop.org/drm/drm
git bisect fixed c0d12d769299e1e08338988c7745009e0db2a4a0
# fixed: [4bbdb725a36b0d235f3b832bd0c1e885f0442d9f] Merge tag 'iommu-updates-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
git bisect fixed 4bbdb725a36b0d235f3b832bd0c1e885f0442d9f
# unfixed: [25b6377007ebe1c3ede773fd6979f613386db000] Merge tag 'drm-next-2023-11-07' of git://anongit.freedesktop.org/drm/drm
git bisect unfixed 25b6377007ebe1c3ede773fd6979f613386db000
# unfixed: [67c0afb6424fee94238d9a32b97c407d0c97155e] Merge tag 'exfat-for-6.7-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
git bisect unfixed 67c0afb6424fee94238d9a32b97c407d0c97155e
# unfixed: [3613047280ec42a4e1350fdc1a6dd161ff4008cc] Merge tag 'v6.6-rc7' into core
git bisect unfixed 3613047280ec42a4e1350fdc1a6dd161ff4008cc
# fixed: [cedc811c76778bdef91d405717acee0de54d8db5] iommu/amd: Remove DMA_FQ type from domain allocation path
git bisect fixed cedc811c76778bdef91d405717acee0de54d8db5
# unfixed: [b0cc5dae1ac0c18748706a4beb636e3b726dd744] iommu/amd: Rename ats related variables
git bisect unfixed b0cc5dae1ac0c18748706a4beb636e3b726dd744
# fixed: [5a0b11a180a9b82b4437a4be1cf73530053f139b] iommu/amd: Remove iommu_v2 module
git bisect fixed 5a0b11a180a9b82b4437a4be1cf73530053f139b
# fixed: [92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6] iommu/amd: Introduce iommu_dev_data.flags to track device capabilities
git bisect fixed 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6
# unfixed:
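The bisect in this message swaps the usual good/bad labels for unfixed/fixed (via `--term-old` and `--term-new`), so it converges on the commit that made the problem go away rather than the one that introduced it. As a rough aid for reading such a log, the sketch below counts how many commits were marked each way. The function name `summarize_bisect_log` is made up for this illustration, and it assumes the log uses exactly the terms `fixed` and `unfixed` as above.

```shell
# Hypothetical helper (not part of git): reads `git bisect log` output on
# stdin and counts the commits marked with the custom terms fixed/unfixed.
# Comment lines (starting with '#') and the `git bisect start` line are ignored.
summarize_bisect_log() {
    awk '
        $1 == "git" && $2 == "bisect" && $3 == "fixed"   { fixed++ }
        $1 == "git" && $2 == "bisect" && $3 == "unfixed" { unfixed++ }
        END { printf "fixed=%d unfixed=%d\n", fixed, unfixed }
    '
}
```

It could be used as `git bisect log | summarize_bisect_log` inside a repository where a bisect is in progress.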
Re: regression/bisected/6.7rc1: Instead of desktop I see a horizontal flashing bar with a picture of the desktop background on white screen
On Fri, Dec 15, 2023 at 9:14 PM Hamza Mahfooz wrote:
>
> Can you try the following patch with old fw (version 0x07002100 should
> be fine)?: https://patchwork.freedesktop.org/patch/572298/
>

Tested-by: Mikhail Gavrilov on 7900XTX hardware.

May I ask what SubVP actually does? I read on Phoronix that it is a new feature of DCN 3.2 hardware:
https://www.phoronix.com/news/AMDGPU-Linux-6.5-Improvements
But I didn't notice anything working better after this feature was enabled. On the contrary, my kernel logs started filling up with unpleasant errors. See here:
https://gitlab.freedesktop.org/drm/amd/-/issues/2796
I bisected this issue, and the bisect led me to commit 299004271cbf0315da327c4bd67aec3e7041cb32, which enables SubVP high refresh rate. But I already had 120Hz and 4K without SubVP. So I ask again: what is the benefit of SubVP?

--
Best Regards,
Mike Gavrilov.
Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"
On Tue, Feb 28, 2023 at 5:43 PM Christian König wrote:
>
> The point is it doesn't need to talk to the amdgpu hardware. What it
> does is that it talks to the good old VGA/VESA emulation and that just
> happens to be still enabled by the BIOS/GRUB.
>
> And that VGA/VESA emulation doesn't need any BAR or whatever to keep the
> hw running in the state where it was initialized before the kernel
> started. The kernel just grabs the addresses where it needs to write the
> display data and keeps going with that.
>
> But when a hw specific driver wants to load this is the first thing
> which gets disabled because we need to load new firmware. And with the
> BARs disabled this can't be re-enabled without rebooting the system.
>
> > My suggestion is that if
> > amdgpu fails to talk to the hardware, then let another suitable driver
> > do it. I attached a system log when I apply "pci=nocrs" with
> > "modprobe.blacklist=amdgpu" for showing that graphics work right in
> > this case.
> > To do this, does the Linux module loading mechanism need to be refined?
>
> That's actually working as expected. The real problem is that the BIOS
> on that system is so broken that we can't access the hw correctly.
>
> What we could to do is to check the BARs very early on and refuse to
> load when they are disable. The problem with this approach is that there
> are systems where it is normal that the BARs are disable until the
> driver loads and get enabled during the hardware initialization process.
>
> What you might want to look into is to find a quirk for the BIOS to
> properly enable the nvme controller.
>

That's interesting. I noticed that amdgpu now works even with the pci=nocrs parameter on 6.7.0-0.rc4 and later kernels. Does that mean the BARs became available? I have attached the kernel log and the lspci output. What changed?

--
Best Regards,
Mike Gavrilov.
Re: 6.7/regression/KASAN: null-ptr-deref in amdgpu_ras_reset_error_count+0x2d6
On Thu, Nov 16, 2023 at 11:56 PM Alex Deucher wrote:
>
> This patch should address the issue:
> https://patchwork.freedesktop.org/patch/567101/
> If you still see issues, you may also need this series:
> https://patchwork.freedesktop.org/series/126220/
>
> Alex

Thanks. The first patch alone is enough.
Tested-on: 7900XTX, 6900XT and 6800M.
Tested-by: Mikhail Gavrilov

--
Best Regards,
Mike Gavrilov.
Re: regression/bisected/6.7rc1: Instead of desktop I see a horizontal flashing bar with a picture of the desktop background on white screen
On Wed, Nov 15, 2023 at 11:39 PM Lee, Alvin wrote:
>
> This change has a DMCUB dependency - are you able to update your DMCUB
> version as well?
>

I can confirm that the issue is gone after updating the firmware.

❯ dmesg | grep DMUB
[ 11.496679] [drm] Loading DMUB firmware via PSP: version=0x07002300
[ 12.000314] [drm] DMUB hardware initialized: version=0x07002300

--
Best Regards,
Mike Gavrilov.
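For anyone checking their own system against this report, the sketch below pulls the DMUB firmware version out of dmesg-style text and compares it with a minimum. The function names `dmub_version` and `dmub_status` are invented for this illustration, and treating 0x07002300 as the minimum is an assumption based only on the confirmation in this message, not on any documented requirement.

```shell
# Hypothetical helper: print the first DMUB firmware version found in
# dmesg-style text read from stdin (for example "0x07002300").
dmub_version() {
    sed -n 's/.*DMUB.*version=\(0x[0-9a-fA-F]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: print "ok" if version $1 is at least $2,
# otherwise "too old". Both arguments are hex strings with a 0x prefix.
dmub_status() {
    if [ "$(( $1 ))" -ge "$(( $2 ))" ]; then
        echo "ok"
    else
        echo "too old"
    fi
}
```

On a live system this could be run as `dmub_status "$(dmesg | dmub_version)" 0x07002300`, assuming the kernel log still contains the DMUB lines.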
Re: regression/bisected/6.7rc1: Instead of desktop I see a horizontal flashing bar with a picture of the desktop background on white screen
On Wed, Nov 15, 2023 at 11:14 PM Hamza Mahfooz wrote:
>
> What version of DMUB firmware are you on?
> The easiest way to find out would be using the following:
>
> # dmesg | grep DMUB
>

Sapphire AMD Radeon RX 7900 XTX PULSE OC:
❯ dmesg | grep DMUB
[ 14.341362] [drm] Loading DMUB firmware via PSP: version=0x07002100
[ 14.725547] [drm] DMUB hardware initialized: version=0x07002100

Reference GIGABYTE Radeon RX 7900 XTX 24G:
❯ dmesg | grep DMUB
[ 11.405115] [drm] Loading DMUB firmware via PSP: version=0x07002100
[ 11.773395] [drm] DMUB hardware initialized: version=0x07002100

--
Best Regards,
Mike Gavrilov.
Re: regression/bisected/6.7rc1: Instead of desktop I see a horizontal flashing bar with a picture of the desktop background on white screen
On Tue, Nov 14, 2023 at 11:03 PM Mikhail Gavrilov wrote:
>
> On Tue, Nov 14, 2023 at 3:55 PM Mikhail Gavrilov wrote:
> >
> > Hi,
> > Yesterday came the 6.7-rc1 kernel.
> > And surprisingly it turned out it is not working with my LG C3.
> > I use this OLED TV as my primary monitor.
> > After login to GNOME I see a horizontal flashing bar with a picture of
> > the desktop background on white screen.
> > Demonstration: https://youtu.be/7F76VfRkrVo
> >
> > I made a bisection.
> > And bisect said that the first bad commit is:
> > commit ed6e2782e974750f671e1101250bb19045be
> > Author: Alvin Lee
> > Date: Mon Oct 23 14:33:16 2023 -0400
> >
> > drm/amd/display: For cursor P-State allow for SubVP
> >
> > [Description]
> > - Similar to FPO, SubVP should also force cursor P-State
> > allow instead of relying on natural assertion
> > - Implement code path to force and unforce cursor P-State
> > allow for SubVP
> >
> > Reviewed-by: Samson Tam
> > Acked-by: Hersen Wu
> > Signed-off-by: Alvin Lee
> > Tested-by: Daniel Wheeler
> > Signed-off-by: Alex Deucher
> >
> > drivers/gpu/drm/amd/display/dc/hwss/dcn32/dcn32_hwseq.c | 17 ++---
> > 1 file changed, 2 insertions(+), 15 deletions(-)
> >
> > My hardware specs: https://linux-hardware.org/?probe=1c989dab38
> >
> > --
> > Best Regards,
> > Mike Gavrilov.
>
> I forgot kernel logs. Not sure it would be helpful because I didn't
> notice anything unusual.
>

This only appears on 7900XTX and 120Hz.

--
Best Regards,
Mike Gavrilov.
regression/bisected/6.7rc1: Instead of desktop I see a horizontal flashing bar with a picture of the desktop background on white screen
Hi,
The 6.7-rc1 kernel came out yesterday, and surprisingly it turned out not to work with my LG C3. I use this OLED TV as my primary monitor. After logging in to GNOME, I see a horizontal flashing bar with a picture of the desktop background on a white screen.
Demonstration: https://youtu.be/7F76VfRkrVo

I ran a bisection, and it reported that the first bad commit is:

commit ed6e2782e974750f671e1101250bb19045be
Author: Alvin Lee
Date:   Mon Oct 23 14:33:16 2023 -0400

    drm/amd/display: For cursor P-State allow for SubVP

    [Description]
    - Similar to FPO, SubVP should also force cursor P-State allow instead of relying on natural assertion
    - Implement code path to force and unforce cursor P-State allow for SubVP

    Reviewed-by: Samson Tam
    Acked-by: Hersen Wu
    Signed-off-by: Alvin Lee
    Tested-by: Daniel Wheeler
    Signed-off-by: Alex Deucher

 drivers/gpu/drm/amd/display/dc/hwss/dcn32/dcn32_hwseq.c | 17 ++---
 1 file changed, 2 insertions(+), 15 deletions(-)

My hardware specs: https://linux-hardware.org/?probe=1c989dab38

--
Best Regards,
Mike Gavrilov.
Re: 6.7/regression/KASAN: null-ptr-deref in amdgpu_ras_reset_error_count+0x2d6
On Wed, Nov 8, 2023 at 12:12 AM Alex Deucher wrote:
>
> The attached patch should fix it. Not sure why your GPU shows up as
> busy. The AGP aperture was just disabled.

Tested-by: Mikhail Gavrilov

Thanks, after applying the patch the GPU load is as expected. Games are working, so overall everything looks good for now.

--
Best Regards,
Mike Gavrilov.
Re: 6.7/regression/KASAN: null-ptr-deref in amdgpu_ras_reset_error_count+0x2d6
On Mon, Nov 6, 2023 at 8:29 PM Alex Deucher wrote:
>
> Already fixed in this commit:
> https://gitlab.freedesktop.org/agd5f/linux/-/commit/d1d4c0b7b65b7fab2bc6f97af9e823b1c42ccdb0
> Which is in included in last weeks PR.
>

Thanks, it fixed the issue above. Unfortunately, this is not the only problem I see on my laptop. Now I am observing 100% GPU load all the time, as shown in this screenshot:
https://postimg.cc/QHLQncMg
Another bisect round blames this commit:

❯ git bisect good
de59b69932e64d77445d973a101d81d6e7e670c6 is the first bad commit
commit de59b69932e64d77445d973a101d81d6e7e670c6
Author: Alex Deucher
Date:   Wed Sep 20 13:27:58 2023 -0400

    drm/amdgpu/gmc: set a default disable value for AGP

    To disable AGP, the start needs to be set to a higher value than the end. Set a default disable value for the AGP aperture and allow the IP specific GMC code to enable it selectively be calling amdgpu_gmc_agp_location().

    Reviewed-by: Christian König
    Signed-off-by: Alex Deucher

 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 27 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 +++
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 3 ++-
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 3 ++-
 drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c | 4 ++--
 drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c | 4 ++--
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 4 ++--
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 3 ++-
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 2 +-
 10 files changed, 37 insertions(+), 18 deletions(-)

I checked twice and made sure that it does not happen on commit 29495d81457a483c2859ccde59cc063034bfe47d.

--
Best Regards,
Mike Gavrilov.
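The commit message quoted in this report says the AGP aperture is disabled by setting its start above its end. The snippet below only illustrates that convention with plain arithmetic. It is not the driver code, the function name `aperture_state` is invented here, and the addresses in the usage note are arbitrary example values.

```shell
# Illustration only: classify an aperture given its start ($1) and end ($2)
# addresses as hex strings. Following the convention described in the commit
# message, start > end is taken to mean "disabled".
aperture_state() {
    if [ "$(( $1 ))" -gt "$(( $2 ))" ]; then
        echo "disabled"
    else
        echo "enabled"
    fi
}
```

With made-up values, `aperture_state 0x1000 0x1fff` reports an enabled range, while `aperture_state 0x2000 0x1000` reports a disabled one.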
Re: [bug/bisected] commit a2848d08742c8e8494675892c02c0d22acbe3cf8 cause general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN NOPTI
On Tue, Jul 18, 2023 at 7:13 AM Chen, Guchun wrote:
>
> [Public]
>
> Hello Mike,
>
> I guess this patch can resolve your problem.
> https://patchwork.freedesktop.org/patch/547897/
>
> Regards,
> Guchun
>

Tested-by: Mikhail Gavrilov

Thanks, the issue is gone with this patch.
I did not describe above how to reproduce the problem. The case was this: on a dual-GPU laptop, I ran Google Chrome on the discrete graphics card, using this command:
$ DRI_PRIME=1 google-chrome-unstable --disable-features=Vulkan

--
Best Regards,
Mike Gavrilov.
Re: [regression][6.5] KASAN: slab-out-of-bounds in amdgpu_vm_pt_create+0x555/0x670 [amdgpu] on Radeon 7900XTX
On Fri, Jul 14, 2023 at 4:09 PM Chen, Guchun wrote:
>
> Thanks for your patience on this, Mike. I think
> https://patchwork.freedesktop.org/patch/547592/ can help this, please take a
> try.

Tested-by: Mikhail Gavrilov

Thanks, it looks good. I spent the whole weekend running with these patches on top of 3f01e9fed845 and didn't notice any regressions.

--
Best Regards,
Mike Gavrilov.
Re: [regression][6.5] KASAN: slab-out-of-bounds in amdgpu_vm_pt_create+0x555/0x670 [amdgpu] on Radeon 7900XTX
On Fri, Jul 7, 2023 at 6:01 AM Chen, Guchun wrote:
>
> [Public]
>
> Hi Mike,
>
> Yes, we are aware of this problem, and we are working on that. The problem is
> caused by recent code stores xcp_id to amdgpu bo for accounting memory usage
> and so on. However, not all VMs are attached to that like the case in
> amdgpu_mes_self_test.
>

I would like to take part in testing the fix.

--
Best Regards,
Mike Gavrilov.
Re: [6.4-rc7][regression] slab-out-of-bounds in amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu]
On Wed, Jun 21, 2023 at 12:47 PM Zhu, Jiadong wrote:
>
> [AMD Official Use Only - General]
>
> Hi,
>
> It is fixed on
> https://patchwork.freedesktop.org/patch/542647/?series=119384=2
>
> Could you make sure if this patch is included.
>

I confirm that this patch fixes the issue. But the patch has still not been merged into 6.4, which is a problem.

--
Best Regards,
Mike Gavrilov.
[6.4-rc7][regression] slab-out-of-bounds in amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu]
Hi, after commit 5b711e7f9c73e5ff44d6ac865711d9a05c2a0360 I see KASAN sanitizer bug message at every boot: Backtrace: [ 18.600551] == [ 18.600558] BUG: KASAN: slab-out-of-bounds in amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu] [ 18.600943] Write of size 8 at addr 8881e4d3a098 by task kworker/8:1/133 [ 18.600952] CPU: 8 PID: 133 Comm: kworker/8:1 Tainted: GW L--- --- 6.4.0-0.rc7.53.fc39.x86_64+debug #1 [ 18.600960] Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.331 02/24/2023 [ 18.600966] Workqueue: events amdgpu_device_delayed_init_work_handler [amdgpu] [ 18.601253] Call Trace: [ 18.601256] [ 18.601260] dump_stack_lvl+0x76/0xd0 [ 18.601267] print_report+0xcf/0x670 [ 18.601275] ? amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu] [ 18.601573] ? amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu] [ 18.601865] kasan_report+0xa8/0xe0 [ 18.601870] ? amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu] [ 18.602163] amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu] [ 18.602455] gfx_v9_0_ring_emit_ib_gfx+0x4cc/0xd50 [amdgpu] [ 18.602767] ? amdgpu_sw_ring_ib_begin+0x1b4/0x3d0 [amdgpu] [ 18.603061] amdgpu_ib_schedule+0x7cb/0x1570 [amdgpu] [ 18.603354] gfx_v9_0_ring_test_ib+0x375/0x540 [amdgpu] [ 18.603656] ? __pfx_gfx_v9_0_ring_test_ib+0x10/0x10 [amdgpu] [ 18.603959] ? __pfx_lock_acquire+0x10/0x10 [ 18.603966] amdgpu_ib_ring_tests+0x2bc/0x490 [amdgpu] [ 18.604260] amdgpu_device_delayed_init_work_handler+0x15/0x30 [amdgpu] [ 18.604544] process_one_work+0x888/0x1460 [ 18.604551] ? worker_thread+0x2c8/0x12c0 [ 18.604555] ? __pfx_process_one_work+0x10/0x10 [ 18.604562] worker_thread+0x104/0x12c0 [ 18.604567] ? __kthread_parkme+0xc1/0x1f0 [ 18.604573] ? __pfx_worker_thread+0x10/0x10 [ 18.604577] kthread+0x2ee/0x3c0 [ 18.604581] ? 
__pfx_kthread+0x10/0x10 [ 18.604586] ret_from_fork+0x2c/0x50 [ 18.604593] [ 18.604598] Allocated by task 466: [ 18.604601] kasan_save_stack+0x33/0x60 [ 18.604606] kasan_set_track+0x25/0x30 [ 18.604610] __kasan_kmalloc+0x8f/0xa0 [ 18.604614] __kmalloc+0x62/0x160 [ 18.604618] amdgpu_ring_mux_init+0x6e/0x1b0 [amdgpu] [ 18.604905] gfx_v9_0_sw_init+0xffe/0x2930 [amdgpu] [ 18.605197] amdgpu_device_init+0x3c36/0x7fc0 [amdgpu] [ 18.605476] amdgpu_driver_load_kms+0x1d/0x4b0 [amdgpu] [ 18.605753] amdgpu_pci_probe+0x279/0x9a0 [amdgpu] [ 18.606029] local_pci_probe+0xdd/0x190 [ 18.606034] pci_device_probe+0x23a/0x770 [ 18.606039] really_probe+0x3e2/0xb80 [ 18.606044] __driver_probe_device+0x18c/0x450 [ 18.606048] driver_probe_device+0x4a/0x120 [ 18.606052] __driver_attach+0x1e5/0x4a0 [ 18.606056] bus_for_each_dev+0x109/0x190 [ 18.606061] bus_add_driver+0x2a1/0x570 [ 18.606064] driver_register+0x134/0x460 [ 18.606069] do_one_initcall+0xd5/0x3b0 [ 18.606073] do_init_module+0x238/0x770 [ 18.606079] load_module+0x5581/0x6f10 [ 18.606082] __do_sys_init_module+0x1f2/0x220 [ 18.606086] do_syscall_64+0x60/0x90 [ 18.606091] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 18.606099] The buggy address belongs to the object at 8881e4d3a000 which belongs to the cache kmalloc-128 of size 128 [ 18.606106] The buggy address is located 24 bytes to the right of allocated 128-byte region [8881e4d3a000, 8881e4d3a080) [ 18.606115] The buggy address belongs to the physical page: [ 18.606119] page:024dbf3d refcount:1 mapcount:0 mapping: index:0x0 pfn:0x1e4d3a [ 18.606126] head:024dbf3d order:1 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [ 18.606132] flags: 0x17c0010200(slab|head|node=0|zone=2|lastcpupid=0x1f) [ 18.606138] page_type: 0x() [ 18.606143] raw: 0017c0010200 8881000428c0 dead0122 [ 18.606148] raw: 00200020 0001 [ 18.606153] page dumped because: kasan: bad access detected [ 18.606159] Memory state around the buggy address: [ 18.606162] 8881e4d39f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 [ 18.606167] 8881e4d3a000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 18.606172] >8881e4d3a080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 18.606176] ^ [ 18.606180] 8881e4d3a100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fc [ 18.606184] 8881e4d3a180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 18.606189] == [ 18.606201] Disabling lock debugging due to kernel taint >From bisect log: 5b711e7f9c73e5ff44d6ac865711d9a05c2a0360 is the first bad commit commit 5b711e7f9c73e5ff44d6ac865711d9a05c2a0360 Author: Jiadong Zhu Date: Thu May 25 18:42:15 2023 +0800 drm/amdgpu: Implement gfx9
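The "24 bytes to the right" figure in the KASAN report can be checked by hand: it is the written address minus the end of the allocated region. The sketch below does that subtraction with the addresses exactly as they appear in this archived copy of the report (they may have lost leading digits in archiving, which does not affect the difference). The function name `bytes_past_region_end` is invented for this illustration.

```shell
# Hypothetical helper: distance in bytes between an accessed address ($1)
# and the exclusive end of the allocated region ($2), both hex strings
# with a 0x prefix.
bytes_past_region_end() {
    echo $(( $1 - $2 ))
}

# Write address and region end copied from the report above.
bytes_past_region_end 0x8881e4d3a098 0x8881e4d3a080   # prints 24
```

The same subtraction applied to the region bounds, 0x8881e4d3a080 minus 0x8881e4d3a000, gives the 128-byte object size that the report attributes to the kmalloc-128 cache.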
Re: [PATCH 2/2] drm/amdgpu: make sure that BOs have a backing store
On Mon, Jun 5, 2023 at 2:11 PM Christian König wrote:
>
> It's perfectly possible that the BO is about to be destroyed and doesn't
> have a backing store associated with it.
>

Thanks, Christian. I appreciate your brilliant work. "KASAN: null-ptr-deref in range [0x0010-0x0017]" is finally fixed.

Tested-by: Mikhail Gavrilov

--
Best Regards,
Mike Gavrilov.
Re: KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] - RIP: 0010:amdgpu_bo_get_memory+0x80/0x360 [amdgpu]
On Mon, May 8, 2023 at 3:40 PM Mikhail Gavrilov wrote:
>
> No one can reproduce this?
> I prepared a video instruction which can helps:
> https://youtu.be/0ipQnMpZG1Y
>
> 1. Run script which would calculate watchers:
> $ for i in {1..9}; do sudo curl -s
> https://raw.githubusercontent.com/fatso83/dotfiles/master/utils/scripts/inotify-consumers
> | bash; done
>
> 2. Run the game "Devision 2"
>
> 3. Run 20 windows of Google Chrome with such script
> $ for i in {1..20}; do google-chrome-unstable
> --profile-directory="Test-2" --new-window --start-maximized
> "youtube.com" &; done
>
> I hope after it you see the desired backtrace.
>

I found another way to reproduce the problem.
Demonstration: https://youtu.be/6cvs4cCMo4M

1. Run the game "Division 2".
2. Open 20 Google Chrome windows with this script:
$ for i in {1..20}; do google-chrome-unstable --profile-directory="Test-2" --new-window --start-maximized "youtube.com" &; done
3. Run "nvtop" and get the kernel bug. After that, "nvtop" stops working until reboot.

Can anyone confirm it, please?

--
Best Regards,
Mike Gavrilov.
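A note for anyone replaying the reproduction steps: the `&; done` form in the loop is accepted by zsh but is a syntax error in bash, where the `;` after `&` has to be dropped. The sketch below is one shell-neutral way to start several background copies of a command. The function name `launch_many` is invented for this illustration and is not part of the original instructions.

```shell
# Hypothetical helper: start $1 background copies of the command given in
# the remaining arguments, without waiting for them to finish.
launch_many() {
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" &
        i=$(( i + 1 ))
    done
    echo "launched $count"
}
```

Under that assumption, step 2 could be written as `launch_many 20 google-chrome-unstable --profile-directory="Test-2" --new-window --start-maximized "youtube.com"`.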
Re: KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] - RIP: 0010:amdgpu_bo_get_memory+0x80/0x360 [amdgpu]
On Fri, May 5, 2023 at 6:44 PM Mikhail Gavrilov wrote:
> I need to say that it may not be easy to reproduce this bug.
> For helping reproduce:
> 1. I looped script above:
> $ for i in {1..9}; do sudo curl -s
> https://raw.githubusercontent.com/fatso83/dotfiles/master/utils/scripts/inotify-consumers
> | bash; done
> 2. Launched google chrome with 26 opened windows
> 3. And played in the game Division 2.
> A little time and luck and I get the desired backtrace again and again.
>
> I am ready to answer any question and open for testing any patches.
> Thanks.

Can no one reproduce this? I prepared a video walkthrough which may help:
https://youtu.be/0ipQnMpZG1Y

1. Run the script which counts inotify watchers:
$ for i in {1..9}; do sudo curl -s https://raw.githubusercontent.com/fatso83/dotfiles/master/utils/scripts/inotify-consumers | bash; done
2. Run the game "Division 2".
3. Open 20 Google Chrome windows with this script:
$ for i in {1..20}; do google-chrome-unstable --profile-directory="Test-2" --new-window --start-maximized "youtube.com" &; done

I hope that after this you will see the desired backtrace.

--
Best Regards,
Mike Gavrilov.
Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
On Thu, Apr 20, 2023 at 3:32 PM Mikhail Gavrilov wrote: > > Important don't give up. > https://youtu.be/25zhHBGIHJ8 [40 min] > https://youtu.be/utnDR26eYBY [50 min] > https://youtu.be/DJQ_tiimW6g [12 min] > https://youtu.be/Y6AH1oJKivA [6 min] > Yes the issue is everything reproducible, but time to time it not > happens at first attempt. > I also uploaded other videos which proves that the issue definitely > exists if someone will launch those games in turn. > Reproducibility is only a matter of time. > > Anyway I didn't want you to spend so much time trying to reproduce it. > This monkey business fits me more than you. > It would be better if I could collect more useful info. Christian, Did you manage to reproduce the problem? At the weekend I faced with slab-use-after-free in amdgpu_vm_handle_moved. I didn't play in the games at this time. The Xwayland process was affected so it leads to desktop hang. == BUG: KASAN: slab-use-after-free in amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu] Read of size 8 at addr 888295c66190 by task Xwayland:cs0/173185 CPU: 21 PID: 173185 Comm: Xwayland:cs0 Tainted: GWL --- --- 6.3.0-0.rc7.20230420gitcb0856346a60.59.fc39.x86_64+debug #1 Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4601 02/02/2023 Call Trace: dump_stack_lvl+0x76/0xd0 print_report+0xcf/0x670 ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu] ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu] kasan_report+0xa8/0xe0 ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu] amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu] amdgpu_cs_ioctl+0x2b7e/0x5630 [amdgpu] ? __pfx___lock_acquire+0x10/0x10 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] ? mark_lock+0x101/0x16e0 ? __lock_acquire+0xe54/0x59f0 ? __pfx_lock_release+0x10/0x10 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] drm_ioctl_kernel+0x1fc/0x3d0 ? __pfx_drm_ioctl_kernel+0x10/0x10 drm_ioctl+0x4c5/0xaa0 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] ? __pfx_drm_ioctl+0x10/0x10 ? _raw_spin_unlock_irqrestore+0x66/0x80 ? 
lockdep_hardirqs_on+0x81/0x110 ? _raw_spin_unlock_irqrestore+0x4f/0x80 amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu] __x64_sys_ioctl+0x131/0x1a0 do_syscall_64+0x60/0x90 ? do_syscall_64+0x6c/0x90 ? lockdep_hardirqs_on+0x81/0x110 ? do_syscall_64+0x6c/0x90 ? lockdep_hardirqs_on+0x81/0x110 ? do_syscall_64+0x6c/0x90 ? lockdep_hardirqs_on+0x81/0x110 ? do_syscall_64+0x6c/0x90 ? lockdep_hardirqs_on+0x81/0x110 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7ffb71b0892d Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00 RSP: 002b:7ffb677fe840 EFLAGS: 0246 ORIG_RAX: 0010 RAX: ffda RBX: 7ffb677fe9f8 RCX: 7ffb71b0892d RDX: 7ffb677fe900 RSI: c0186444 RDI: 000d RBP: 7ffb677fe890 R08: 7ffb677fea50 R09: 7ffb677fe8e0 R10: 556c4611bec0 R11: 0246 R12: 7ffb677fe900 R13: c0186444 R14: 000d R15: 7ffb677fe9f8 Allocated by task 173181: kasan_save_stack+0x33/0x60 kasan_set_track+0x25/0x30 __kasan_kmalloc+0x8f/0xa0 __kmalloc_node+0x65/0x160 amdgpu_bo_create+0x31e/0xfb0 [amdgpu] amdgpu_bo_create_user+0xca/0x160 [amdgpu] amdgpu_gem_create_ioctl+0x398/0x980 [amdgpu] drm_ioctl_kernel+0x1fc/0x3d0 drm_ioctl+0x4c5/0xaa0 amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu] __x64_sys_ioctl+0x131/0x1a0 do_syscall_64+0x60/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Freed by task 173185: kasan_save_stack+0x33/0x60 kasan_set_track+0x25/0x30 kasan_save_free_info+0x2e/0x50 __kasan_slab_free+0x10b/0x1a0 slab_free_freelist_hook+0x11e/0x1d0 __kmem_cache_free+0xc0/0x2e0 ttm_bo_release+0x667/0x9e0 [ttm] amdgpu_bo_unref+0x35/0x70 [amdgpu] amdgpu_gem_object_free+0x73/0xb0 [amdgpu] drm_gem_handle_delete+0xe3/0x150 drm_ioctl_kernel+0x1fc/0x3d0 drm_ioctl+0x4c5/0xaa0 amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu] __x64_sys_ioctl+0x131/0x1a0 do_syscall_64+0x60/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Last potentially related work creation: kasan_save_stack+0x33/0x60 
__kasan_record_aux_stack+0x97/0xb0 __call_rcu_common.constprop.0+0xf8/0x1af0 drm_sched_fence_release_scheduled+0xb8/0xe0 [gpu_sched] dma_resv_reserve_fences+0x4dc/0x7f0 ttm_eu_reserve_buffers+0x3f6/0x1190 [ttm] amdgpu_cs_ioctl+0x204d/0x5630 [amdgpu] drm_ioctl_kernel+0x1fc/0x3d0 drm_ioctl+0x4c5/0xaa0 amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu] __x64_sys_ioctl+0x131/0x1a0 do_syscall_64+0x60/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Second to last potentially related work creation: kasan_save_stack+0x33/0x60 __kasan_record_aux_stack+0x97/0xb0 __call_rcu_common.constprop.0+0xf8/0x1af0 drm_sched_fence_release_scheduled+0xb8/0xe0 [gpu_sched] amdgpu_ctx_add_fence+0x2b1/0x
Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
On Thu, Apr 20, 2023 at 2:59 PM Christian König wrote: > Could you try drm-misc-next as well? If as I assume I cloned right repo $ git clone -b drm-misc-next git://anongit.freedesktop.org/drm/drm-misc linux-drm-misc-next for my hardware last commit on this branch is turned out completely unworking. Instead of the GDM login screen I see a black screen and hear howls of GPU fans. In the kernel logs I see general protection fault: general protection fault, probably for non-canonical address 0xdc2b: [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0158-0x015f] CPU: 0 PID: 749 Comm: sdma0 Tainted: GWL 6.3.0-rc4-misc-next-91c249b2b9f6a80c744387b6713adf275ffd296b+ #1 Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4601 02/02/2023 RIP: 0010:drm_sched_get_cleanup_job+0x41b/0x5c0 [gpu_sched] Code: fa 48 c1 ea 03 80 3c 02 00 75 5c 49 8b 9f 80 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d bb 58 01 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 55 48 01 ab 58 01 00 00 e9 0c fd ff ff 48 89 ef e8 RSP: 0018:c9000548fdb8 EFLAGS: 00010216 RAX: dc00 RBX: RCX: RDX: 002b RSI: 0004 RDI: 0158 RBP: 085c R08: R09: 888170711783 R10: ed102e0e22f0 R11: 8da81678 R12: 8881707116b0 R13: 888170711780 R14: 888266f89820 R15: 888266f89808 FS: () GS:888fa200() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 560cea4a8000 CR3: 000191602000 CR4: 00350ef0 Call Trace: drm_sched_main+0xc3/0x930 [gpu_sched] ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched] ? __pfx_autoremove_wake_function+0x10/0x10 ? __kthread_parkme+0xc1/0x1f0 ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched] kthread+0x2a2/0x340 ? 
__pfx_kthread+0x10/0x10 ret_from_fork+0x2c/0x50 Modules linked in: amdgpu(+) drm_ttm_helper ttm video crct10dif_pclmul drm_suballoc_helper crc32_pclmul iommu_v2 crc32c_intel drm_buddy polyval_clmulni gpu_sched polyval_generic ucsi_ccg drm_display_helper typec_ucsi nvme ghash_clmulni_intel igb typec ccp sha512_ssse3 cec nvme_core sp5100_tco dca i2c_algo_bit nvme_common wmi ip6_tables ip_tables fuse ---[ end trace ]--- RIP: 0010:drm_sched_get_cleanup_job+0x41b/0x5c0 [gpu_sched] Code: fa 48 c1 ea 03 80 3c 02 00 75 5c 49 8b 9f 80 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d bb 58 01 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 55 48 01 ab 58 01 00 00 e9 0c fd ff ff 48 89 ef e8 RSP: 0018:c9000548fdb8 EFLAGS: 00010216 RAX: dc00 RBX: RCX: RDX: 002b RSI: 0004 RDI: 0158 RBP: 085c R08: R09: 888170711783 R10: ed102e0e22f0 R11: 8da81678 R12: 8881707116b0 R13: 888170711780 R14: 888266f89820 R15: 888266f89808 FS: () GS:888fa200() knlGS: I also attached a full system log. -- Best Regards, Mike Gavrilov. system-log.tar.xz Description: application/xz
Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
On Thu, Apr 20, 2023 at 2:59 PM Christian König wrote: > > Could you try drm-misc-next as well? > > Going to give drm-fixes another round of testing. > > Thanks, > Christian. The important thing is not to give up. https://youtu.be/25zhHBGIHJ8 [40 min] https://youtu.be/utnDR26eYBY [50 min] https://youtu.be/DJQ_tiimW6g [12 min] https://youtu.be/Y6AH1oJKivA [6 min] Yes, the issue is always reproducible, but from time to time it does not happen on the first attempt. I also uploaded other videos which prove that the issue definitely shows up if someone launches those games in turn. Reproducing it is only a matter of time. Anyway, I didn't want you to spend so much time trying to reproduce it. This monkey business suits me better than you. It would be better if I could collect more useful info. -- Best Regards, Mike Gavrilov.
Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
On Wed, Apr 19, 2023 at 1:12 PM Christian König wrote: > > I'm already looking into this, but can't figure out why we run into > problems here. > > What happens is that a CS is aborted without sending the job to the > scheduler and in this case the cleanup function doesn't seem to work. > > Christian. I can easily reproduce it on any AMD GPU hardware. You can add more debug logging and I will come back with new logs that may explain this. Thanks. -- Best Regards, Mike Gavrilov.
Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
Christian? ❯ /usr/src/kernels/6.3.0-0.rc7.56.fc39.x86_64/scripts/faddr2line /lib/debug/lib/modules/6.3.0-0.rc7.56.fc39.x86_64/kernel/drivers/gpu/drm/scheduler/gpu-sched.ko.debug drm_sched_job_cleanup+0x9a drm_sched_job_cleanup+0x9a/0x130: drm_sched_job_cleanup at /usr/src/debug/kernel-6.3-rc7/linux-6.3.0-0.rc7.56.fc39.x86_64/drivers/gpu/drm/scheduler/sched_main.c:808 (discriminator 3) ❯ cat -s -n /usr/src/debug/kernel-6.3-rc7/linux-6.3.0-0.rc7.56.fc39.x86_64/drivers/gpu/drm/scheduler/sched_main.c | head -818 | tail -20 799 /* drm_sched_job_arm() has been called */ 800 dma_fence_put(&job->s_fence->finished); 801 } else { 802 /* aborted job before committing to run it */ 803 drm_sched_fence_free(job->s_fence); 804 } 805 806 job->s_fence = NULL; 807 808 xa_for_each(&job->dependencies, index, fence) { 809 dma_fence_put(fence); 810 } 811 xa_destroy(&job->dependencies); 812 813 } 814 EXPORT_SYMBOL(drm_sched_job_cleanup); 815 816 /** 817 * drm_sched_ready - is the scheduler ready 818 * > git blame drivers/gpu/drm/scheduler/sched_main.c -L 800,819 dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-17 10:49:16 +0200 800) dma_fence_put(&job->s_fence->finished); dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-17 10:49:16 +0200 801) } else { dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-17 10:49:16 +0200 802) /* aborted job before committing to run it */ d4c16733e7960 drivers/gpu/drm/scheduler/sched_main.c(Boris Brezillon 2021-09-03 14:05:54 +0200 803) drm_sched_fence_free(job->s_fence); dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-17 10:49:16 +0200 804) } dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-17 10:49:16 +0200 805) 26efecf955889 drivers/gpu/drm/scheduler/sched_main.c(Sharat Masetty 2018-10-29 15:02:28 +0530 806) job->s_fence = NULL; ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-05 12:46:49 +0200 807) ebd5f74255b9f
drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-05 12:46:49 +0200 808) xa_for_each(&job->dependencies, index, fence) { ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-05 12:46:49 +0200 809) dma_fence_put(fence); ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-05 12:46:49 +0200 810) } ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-05 12:46:49 +0200 811) xa_destroy(&job->dependencies); ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-05 12:46:49 +0200 812) 26efecf955889 drivers/gpu/drm/scheduler/sched_main.c(Sharat Masetty 2018-10-29 15:02:28 +0530 813) } 26efecf955889 drivers/gpu/drm/scheduler/sched_main.c(Sharat Masetty 2018-10-29 15:02:28 +0530 814) EXPORT_SYMBOL(drm_sched_job_cleanup); 26efecf955889 drivers/gpu/drm/scheduler/sched_main.c(Sharat Masetty 2018-10-29 15:02:28 +0530 815) e688b728228b9 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c (Christian König 2015-08-20 17:01:01 +0200 816) /** 2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan Deshmukh 2018-05-29 11:23:07 +0530 817) * drm_sched_ready - is the scheduler ready 2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan Deshmukh 2018-05-29 11:23:07 +0530 818) * 2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan Deshmukh 2018-05-29 11:23:07 +0530 819) * @sched: scheduler instance Daniel, since Christian looks a little busy, can you help? git blame says that you are the author of the code which KASAN mentions in its report. The issue is reproducible on all the AMD hardware available to me: 6800M, 6900XT, 7900XTX. -- Best Regards, Mike Gavrilov.
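For anyone following along, the symbol+offset/size notation that was passed to faddr2line above can be unpacked mechanically. This is only a minimal sketch in Python, assuming the frame format seen in these traces (an optional "? " marker, the symbol, a hex offset, the hex function size, and an optional module in brackets). The names FRAME_RE and parse_frame are made up for illustration and are not part of any kernel tool.

```python
import re

# Hypothetical helper: matches frames such as
#   "drm_sched_job_cleanup+0x96/0x290 [gpu_sched]"
#   "? __pfx_kthread+0x10/0x10"
FRAME_RE = re.compile(
    r"^(?P<guess>\?\s+)?"            # "? " marks a frame the unwinder is unsure about
    r"(?P<symbol>[A-Za-z_][\w.]*)"   # symbol name, may contain ".part.0" style suffixes
    r"\+0x(?P<offset>[0-9a-f]+)"     # offset of the address inside the function
    r"/0x(?P<size>[0-9a-f]+)"        # total size of the function
    r"(?:\s+\[(?P<module>[\w-]+)\])?$"  # optional module name
)


def parse_frame(text):
    """Split one stack-trace frame into its parts."""
    match = FRAME_RE.match(text.strip())
    if match is None:
        raise ValueError("not a symbol+offset/size frame: %r" % text)
    return {
        "symbol": match.group("symbol"),
        "offset": int(match.group("offset"), 16),
        "size": int(match.group("size"), 16),
        "module": match.group("module"),
        "reliable": match.group("guess") is None,
    }


if __name__ == "__main__":
    print(parse_frame("drm_sched_job_cleanup+0x96/0x290 [gpu_sched]"))
```

The offset only makes sense against the exact binary that produced the trace, which is presumably why the debug and non-debug builds in this thread report different function sizes for the same symbol.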
Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
On Tue, Apr 11, 2023 at 10:40 PM Mikhail Gavrilov wrote: > > Hi, > KASAN continues to find problems in the drm_sched_job_cleanup code at 6.3rc6. > I not got any feedback in the thread > https://lore.kernel.org/lkml/cabxgcsmvub2ra4d+k5cna0_2521tox++d4nmoukki4x2-q_...@mail.gmail.com/ > Therefore, I decided to start a separate thread. Since the problems > are different, the symptoms are also different. > > Reproduction scenario. > After launching one of the listed games: > - Cyberpunk 2077 > - Forza Horizon 4 > - Forza Horizon 5 > - Sackboy: A Big Adventure > > Firstly after some time (may be after several attempts) appears bug > message from KASAN: > == > BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched] > Read of size 4 at addr 0078 by task ForzaHorizon4.e/31587 > > CPU: 15 PID: 31587 Comm: ForzaHorizon4.e Tainted: GWL > --- --- 6.3.0-0.rc6.49.fc39.x86_64+debug #1 > Hardware name: System manufacturer System Product Name/ROG STRIX > X570-I GAMING, BIOS 4601 02/02/2023 > Call Trace: > > dump_stack_lvl+0x72/0xc0 > kasan_report+0xa4/0xe0 > ? drm_sched_job_cleanup+0x96/0x290 [gpu_sched] > kasan_check_range+0x104/0x1b0 > drm_sched_job_cleanup+0x96/0x290 [gpu_sched] > ? __pfx_drm_sched_job_cleanup+0x10/0x10 [gpu_sched] > ? slab_free_freelist_hook+0x11e/0x1d0 > ? amdgpu_cs_parser_fini+0x363/0x5a0 [amdgpu] > amdgpu_job_free+0x40/0x1b0 [amdgpu] > amdgpu_cs_parser_fini+0x3c9/0x5a0 [amdgpu] > ? __pfx_amdgpu_cs_parser_fini+0x10/0x10 [amdgpu] > amdgpu_cs_ioctl+0x3d9/0x5630 [amdgpu] > ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] > ? __kmem_cache_free+0xbc/0x2e0 > ? mark_lock+0x101/0x16e0 > ? __lock_acquire+0xe54/0x59f0 > ? kasan_save_stack+0x3f/0x50 > ? __pfx_lock_release+0x10/0x10 > ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] > drm_ioctl_kernel+0x1f8/0x3d0 > ? __pfx_drm_ioctl_kernel+0x10/0x10 > drm_ioctl+0x4c1/0xaa0 > ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] > ? __pfx_drm_ioctl+0x10/0x10 > ? _raw_spin_unlock_irqrestore+0x62/0x80 > ? 
lockdep_hardirqs_on+0x7d/0x100 > ? _raw_spin_unlock_irqrestore+0x4b/0x80 > amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu] > __x64_sys_ioctl+0x12d/0x1a0 > do_syscall_64+0x5c/0x90 > ? do_syscall_64+0x68/0x90 > ? lockdep_hardirqs_on+0x7d/0x100 > ? do_syscall_64+0x68/0x90 > ? do_syscall_64+0x68/0x90 > ? lockdep_hardirqs_on+0x7d/0x100 > ? do_syscall_64+0x68/0x90 > ? asm_exc_page_fault+0x22/0x30 > ? lockdep_hardirqs_on+0x7d/0x100 > entry_SYSCALL_64_after_hwframe+0x72/0xdc > RIP: 0033:0x7fb8a270881d > Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 > 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 > 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00 > RSP: 002b:467ad060 EFLAGS: 0246 ORIG_RAX: 0010 > RAX: ffda RBX: 467ad358 RCX: 7fb8a270881d > RDX: 467ad140 RSI: c0186444 RDI: 005a > RBP: 467ad0b0 R08: 7fb7f00d3eb0 R09: 467ad100 > R10: 7fb88c68fb20 R11: 0246 R12: 467ad140 > R13: c0186444 R14: 005a R15: 7fb7f00d3e50 > > == > > Finally it ends up with the games listed above stopping working they > stuck after a kernel warning: > general protection fault, probably for non-canonical address > 0xdc0f: [#1] PREEMPT SMP KASAN NOPTI > KASAN: null-ptr-deref in range [0x0078-0x007f] > CPU: 15 PID: 31587 Comm: ForzaHorizon4.e Tainted: GB WL > --- --- 6.3.0-0.rc6.49.fc39.x86_64+debug #1 > Hardware name: System manufacturer System Product Name/ROG STRIX > X570-I GAMING, BIOS 4601 02/02/2023 > RIP: 0010:drm_sched_job_cleanup+0xa7/0x290 [gpu_sched] > Code: d6 01 00 00 4c 8b 75 20 be 04 00 00 00 4d 8d 66 78 4c 89 e7 e8 > ba 4d 4e c9 4c 89 e2 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6 > 14 02 4c 89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 8a > RSP: 0018:c9003676f5a8 EFLAGS: 00010216 > RAX: dc00 RBX: 88816f81f020 RCX: 0001 > RDX: 000f RSI: 0008 RDI: 9053e5e0 > RBP: 88816f81f000 R08: 0001 R09: 9053e5e7 > R10: fbfff20a7cbc R11: 6e696c6261736944 R12: 0078 > R13: 192006cedeb5 R14: R15: c9003676f870 > FS: 4680f6c0() GS:888fa5c0() knlGS:2991 > 
CS: 0010 DS: ES: CR0: 80050033 > CR2: 7fb854d6f010 CR3: 00017b2d6000 CR4: 00350ee0 > Call Trace
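A side note on how the "non-canonical address" in the general protection fault relates to the "null-ptr-deref in range [0x78-0x7f]" line. With generic KASAN, each 8 bytes of memory are tracked by one shadow byte, and the check reads that shadow byte before the real access. The sketch below assumes the x86_64 default shadow offset of 0xdffffc0000000000 and a scale shift of 3; the archive has stripped most zeros from the addresses in the log, so the full-width values here are a reconstruction under that assumption, and the function names are made up for illustration.

```python
# Assumed x86_64 defaults for generic KASAN (not taken from the log itself).
KASAN_SHADOW_OFFSET = 0xDFFFFC0000000000
KASAN_SHADOW_SCALE_SHIFT = 3
MASK64 = (1 << 64) - 1


def shadow_address(addr):
    """Address of the shadow byte that describes `addr`."""
    return ((addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET) & MASK64


def original_range(shadow):
    """Inclusive range of real addresses covered by one shadow byte."""
    start = (((shadow - KASAN_SHADOW_OFFSET) & MASK64)
             << KASAN_SHADOW_SCALE_SHIFT) & MASK64
    return start, start + (1 << KASAN_SHADOW_SCALE_SHIFT) - 1


if __name__ == "__main__":
    # R12 holds 0x78 in the trace: a field at offset 0x78 of a NULL pointer.
    shadow = shadow_address(0x78)
    low, high = original_range(shadow)
    print(hex(shadow), hex(low), hex(high))
```

Under these assumptions, a NULL pointer plus a small field offset maps to a shadow address just above the shadow offset, which is not a canonical address, so the shadow read itself faults and the kernel translates it back to the small range it reports.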
Re: BUG: KASAN: slab-use-after-free in drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched]
On Fri, Mar 24, 2023 at 7:37 PM Christian König wrote: > > Yeah, that one > > Thanks for the info, looks like this isn't fixed. > > Christian. > Hi, glad to see that "BUG: KASAN: slab-use-after-free in drm_sched_get_cleanup_job+0x47b/0x5c0" was fixed in 6.3-rc5. For the record, it would be good to know which commit fixed this issue. I waited for this moment because I know of another issue which was also found by the KASAN sanitizer. BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched] Read of size 4 at addr 0078 by task GameThread/23915 CPU: 10 PID: 23915 Comm: GameThread Tainted: GWL --- --- 6.3.0-0.rc5.42.fc39.x86_64+debug #1 Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4601 02/02/2023 Call Trace: dump_stack_lvl+0x72/0xc0 kasan_report+0xa4/0xe0 ? drm_sched_job_cleanup+0x96/0x290 [gpu_sched] kasan_check_range+0x104/0x1b0 drm_sched_job_cleanup+0x96/0x290 [gpu_sched] ? __pfx_drm_sched_job_cleanup+0x10/0x10 [gpu_sched] ? slab_free_freelist_hook+0x11e/0x1d0 ? amdgpu_cs_parser_fini+0x363/0x5a0 [amdgpu] amdgpu_job_free+0x40/0x1b0 [amdgpu] amdgpu_cs_parser_fini+0x3c9/0x5a0 [amdgpu] ? __pfx_amdgpu_cs_parser_fini+0x10/0x10 [amdgpu] amdgpu_cs_ioctl+0x3d9/0x5630 [amdgpu] ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] ? mark_lock+0x101/0x16e0 ? __lock_acquire+0xe54/0x59f0 ? __pfx_lock_release+0x10/0x10 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] drm_ioctl_kernel+0x1f8/0x3d0 ? __pfx_drm_ioctl_kernel+0x10/0x10 drm_ioctl+0x4c1/0xaa0 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] ? __pfx_drm_ioctl+0x10/0x10 ? _raw_spin_unlock_irqrestore+0x62/0x80 ? lockdep_hardirqs_on+0x7d/0x100 ? _raw_spin_unlock_irqrestore+0x4b/0x80 amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu] __x64_sys_ioctl+0x12d/0x1a0 do_syscall_64+0x5c/0x90 ? do_syscall_64+0x68/0x90 ? lockdep_hardirqs_on+0x7d/0x100 ? do_syscall_64+0x68/0x90 ? do_syscall_64+0x68/0x90 ? lockdep_hardirqs_on+0x7d/0x100 ? do_syscall_64+0x68/0x90 ? do_syscall_64+0x68/0x90 ?
lockdep_hardirqs_on+0x7d/0x100 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7fe97a50881d Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00 RSP: 002b:7c35d3f0 EFLAGS: 0246 ORIG_RAX: 0010 RAX: ffda RBX: 7c35d6e8 RCX: 7fe97a50881d RDX: 7c35d4d0 RSI: c0186444 RDI: 00ae RBP: 7c35d440 R08: 7fe8fc0f0970 R09: 7c35d490 R10: 7fb79000 R11: 0246 R12: 7c35d4d0 R13: c0186444 R14: 00ae R15: 7fe8fc0f0900 I know at least 3 games which trigger this bug 100% of the time: - Cyberpunk 2077 - Forza Horizon 4 - Forza Horizon 5 Should we continue to discuss it here, or would it be better to create a new thread (so that someone else who faces this issue can easily find a solution on the internet)? A full kernel log is attached here, as usual. -- Best Regards, Mike Gavrilov. dmesg.tar.xz Description: application/xz
Re: BUG: KASAN: slab-use-after-free in drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched]
On Tue, Mar 21, 2023 at 11:47 PM Christian König wrote: > > Hi Mikhail, > > That looks like a reference counting issue to me. > > I'm going to take a look, but we have already fixed one of those recently. > > Probably best that you try this on drm-fixes, just to double check that > this isn't the same issue. > Hi Christian, you meant this branch? $ git clone -b drm-fixes git://anongit.freedesktop.org/drm/drm linux-drm If yes I just checked and unfortunately see this issue unfixed there. [ 1984.295833] == [ 1984.295876] BUG: KASAN: slab-use-after-free in drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched] [ 1984.295898] Read of size 8 at addr 88814cadc4c0 by task sdma1/764 [ 1984.295924] CPU: 12 PID: 764 Comm: sdma1 Tainted: GWL 6.3.0-rc3-drm-fixes+ #1 [ 1984.295937] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4601 02/02/2023 [ 1984.295951] Call Trace: [ 1984.295963] [ 1984.295975] dump_stack_lvl+0x72/0xc0 [ 1984.295991] print_report+0xcf/0x670 [ 1984.296007] ? drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched] [ 1984.296030] ? drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched] [ 1984.296047] kasan_report+0xa4/0xe0 [ 1984.296118] ? drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched] [ 1984.296149] drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched] [ 1984.296175] drm_sched_main+0x643/0x990 [gpu_sched] [ 1984.296204] ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched] [ 1984.296222] ? __pfx_autoremove_wake_function+0x10/0x10 [ 1984.296290] ? __kthread_parkme+0xc1/0x1f0 [ 1984.296304] ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched] [ 1984.296321] kthread+0x29e/0x340 [ 1984.296334] ? 
__pfx_kthread+0x10/0x10 [ 1984.296501] ret_from_fork+0x2c/0x50 [ 1984.296518] [ 1984.296539] Allocated by task 12194: [ 1984.296552] kasan_save_stack+0x2f/0x50 [ 1984.296566] kasan_set_track+0x21/0x30 [ 1984.296578] __kasan_kmalloc+0x8b/0x90 [ 1984.296590] amdgpu_driver_open_kms+0x10b/0x5a0 [amdgpu] [ 1984.297051] drm_file_alloc+0x46e/0x880 [ 1984.297064] drm_open_helper+0x161/0x460 [ 1984.297076] drm_open+0x1e7/0x5c0 [ 1984.297089] drm_stub_open+0x24d/0x400 [ 1984.297107] chrdev_open+0x215/0x620 [ 1984.297125] do_dentry_open+0x5f1/0x1000 [ 1984.297146] path_openat+0x1b3d/0x28a0 [ 1984.297164] do_filp_open+0x1bd/0x400 [ 1984.297180] do_sys_openat2+0x140/0x420 [ 1984.297197] __x64_sys_openat+0x11f/0x1d0 [ 1984.297213] do_syscall_64+0x5b/0x80 [ 1984.297231] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 1984.297266] Freed by task 12195: [ 1984.297284] kasan_save_stack+0x2f/0x50 [ 1984.297303] kasan_set_track+0x21/0x30 [ 1984.297323] kasan_save_free_info+0x2a/0x50 [ 1984.297343] __kasan_slab_free+0x107/0x1a0 [ 1984.297361] slab_free_freelist_hook+0x11e/0x1d0 [ 1984.297373] __kmem_cache_free+0xbc/0x2e0 [ 1984.297385] amdgpu_driver_postclose_kms+0x582/0x8d0 [amdgpu] [ 1984.297821] drm_file_free.part.0+0x638/0xb70 [ 1984.297834] drm_release+0x1ea/0x470 [ 1984.297845] __fput+0x213/0x9e0 [ 1984.297857] task_work_run+0x11b/0x200 [ 1984.297869] exit_to_user_mode_prepare+0x23a/0x260 [ 1984.297883] syscall_exit_to_user_mode+0x16/0x50 [ 1984.297896] do_syscall_64+0x67/0x80 [ 1984.297907] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 1984.298033] Last potentially related work creation: [ 1984.298044] kasan_save_stack+0x2f/0x50 [ 1984.298057] __kasan_record_aux_stack+0x97/0xb0 [ 1984.298075] __call_rcu_common.constprop.0+0xf8/0x1af0 [ 1984.298095] amdgpu_bo_list_put+0x1a4/0x1f0 [amdgpu] [ 1984.298557] amdgpu_cs_parser_fini+0x293/0x5a0 [amdgpu] [ 1984.299055] amdgpu_cs_ioctl+0x4f2a/0x5630 [amdgpu] [ 1984.299624] drm_ioctl_kernel+0x1f8/0x3d0 [ 1984.299637] drm_ioctl+0x4c1/0xaa0 [ 
1984.299649] amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu] [ 1984.300083] __x64_sys_ioctl+0x12d/0x1a0 [ 1984.300097] do_syscall_64+0x5b/0x80 [ 1984.300109] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 1984.300135] Second to last potentially related work creation: [ 1984.300149] kasan_save_stack+0x2f/0x50 [ 1984.300167] __kasan_record_aux_stack+0x97/0xb0 [ 1984.300185] __call_rcu_common.constprop.0+0xf8/0x1af0 [ 1984.300203] amdgpu_bo_list_put+0x1a4/0x1f0 [amdgpu] [ 1984.300692] amdgpu_cs_parser_fini+0x293/0x5a0 [amdgpu] [ 1984.301133] amdgpu_cs_ioctl+0x4f2a/0x5630 [amdgpu] [ 1984.301577] drm_ioctl_kernel+0x1f8/0x3d0 [ 1984.301598] drm_ioctl+0x4c1/0xaa0 [ 1984.301610] amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu] [ 1984.302043] __x64_sys_ioctl+0x12d/0x1a0 [ 1984.302056] do_syscall_64+0x5b/0x80 [ 1984.302068] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 1984.302090] The buggy address belongs to the object at 88814cadc000 which belongs to the cache kmalloc-4k of size 4096 [ 1984.302103] The buggy address is located 1216 bytes inside of freed 4096-byte region [88814cadc000, 88814cadd000) [ 1984.302129] The buggy address belongs to the phys
[6.3][regression] commit a4e771729a51168bc36317effaa9962e336d4f5e lead to flood kernel logs with warning messages "at kernel/workqueue.c:3167 __flush_work+0x472/0x500"
Hi, I did not hit the drm_bridge_hpd_enable+0x94/0x9c [drm] issue myself, but the fix for it leads to warning messages on my laptop, an ASUS ROG Strix G15 Advantage Edition G513QY-HQ007, which has two AMD GPUs: a discrete Radeon 6800M and a Vega 8 integrated into the Cezanne CPU. I found the bad commit by bisecting: ❯ git bisect bad a4e771729a51168bc36317effaa9962e336d4f5e is the first bad commit commit a4e771729a51168bc36317effaa9962e336d4f5e Author: Dmitry Baryshkov Date: Tue Jan 24 12:45:48 2023 +0200 drm/probe_helper: sort out poll_running vs poll_enabled There are two flags attemting to guard connector polling: poll_enabled and poll_running. While poll_enabled semantics is clearly defined and fully adhered (mark that drm_kms_helper_poll_init() was called and not finalized by the _fini() call), the poll_running flag doesn't have such clearliness. This flag is used only in drm_helper_probe_single_connector_modes() to guard calling of drm_kms_helper_poll_enable, it doesn't guard the drm_kms_helper_poll_fini(), etc. Change it to only be set if the polling is actually running. Tie HPD enablement to this flag.
This fixes the following warning reported after merging the HPD series: Hot plug detection already enabled WARNING: CPU: 2 PID: 9 at drivers/gpu/drm/drm_bridge.c:1257 drm_bridge_hpd_enable+0x94/0x9c [drm] Modules linked in: videobuf2_memops snd_soc_simple_card snd_soc_simple_card_utils fsl_imx8_ddr_perf videobuf2_common snd_soc_imx_spdif adv7511 etnaviv imx8m_ddrc imx_dcss mc cec nwl_dsi gov CPU: 2 PID: 9 Comm: kworker/u8:0 Not tainted 6.2.0-rc2-15208-g25b283acd578 #6 Hardware name: NXP i.MX8MQ EVK (DT) Workqueue: events_unbound deferred_probe_work_func pstate: 6005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : drm_bridge_hpd_enable+0x94/0x9c [drm] lr : drm_bridge_hpd_enable+0x94/0x9c [drm] sp : 89ef3740 x29: 89ef3740 x28: 09331f00 x27: 1000 x26: 0020 x25: 81148ed8 x24: 0a8fe000 x23: fffd x22: 05086348 x21: 81133ee0 x20: 0550d800 x19: 05086288 x18: 0006 x17: x16: 896ef008 x15: 972891004260 x14: 2a1403e19400 x13: 972891004260 x12: 2a1403e19400 x11: 7100385f29400801 x10: 0aa0 x9 : 88112744 x8 : 00250b00 x7 : 0003 x6 : 0011 x5 : x4 : bd986a48 x3 : 0001 x2 : x1 : x0 : 0025 Call trace: drm_bridge_hpd_enable+0x94/0x9c [drm] drm_bridge_connector_enable_hpd+0x2c/0x3c [drm_kms_helper] drm_kms_helper_poll_enable+0x94/0x10c [drm_kms_helper] drm_helper_probe_single_connector_modes+0x1a8/0x510 [drm_kms_helper] drm_client_modeset_probe+0x204/0x1190 [drm] __drm_fb_helper_initial_config_and_unlock+0x5c/0x4a4 [drm_kms_helper] drm_fb_helper_initial_config+0x54/0x6c [drm_kms_helper] drm_fbdev_client_hotplug+0xd0/0x140 [drm_kms_helper] drm_fbdev_generic_setup+0x90/0x154 [drm_kms_helper] dcss_kms_attach+0x1c8/0x254 [imx_dcss] dcss_drv_platform_probe+0x90/0xfc [imx_dcss] platform_probe+0x70/0xcc really_probe+0xc4/0x2e0 __driver_probe_device+0x80/0xf0 driver_probe_device+0xe0/0x164 __device_attach_driver+0xc0/0x13c bus_for_each_drv+0x84/0xe0 __device_attach+0xa4/0x1a0 device_initial_probe+0x1c/0x30 bus_probe_device+0xa4/0xb0 deferred_probe_work_func+0x90/0xd0 
process_one_work+0x200/0x474 worker_thread+0x74/0x43c kthread+0xfc/0x110 ret_from_fork+0x10/0x20 ---[ end trace ]--- Reported-by: Laurentiu Palcu Fixes: c8268795c9a9 ("drm/probe-helper: enable and disable HPD on connectors") Tested-by: Marek Szyprowski Tested-by: Chen-Yu Tsai Acked-by: Laurentiu Palcu Tested-by: Laurentiu Palcu Tested-by: Laurent Pinchart Signed-off-by: Dmitry Baryshkov Signed-off-by: Neil Armstrong Link: https://patchwork.freedesktop.org/patch/msgid/20230124104548.3234554-2-dmitry.barysh...@linaro.org (cherry picked from commit d33a54e3991dfce88b4fc6d9c3360951c2c5660d) Signed-off-by: Thomas Zimmermann drivers/gpu/drm/drm_probe_helper.c | 42 +++--- 1 file changed, 21 insertions(+), 21 deletions(-) Of course I tried to check the bisect assumption by reverting this commit. And I can confirm without commit a4e771729a51168bc36317effaa9962e336d4f5e the warning messages do not appear within a day. I attached a full kernel log if someone would be interested to see it. -- Best Regards, Mike Gavrilov. git bisect start # status: waiting for both good and bad commits # good: [5b7c4cabbb65f5c469464da6c5f614cbd7f730f2] Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next git
Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"
On Mon, Feb 27, 2023 at 3:22 PM Christian König wrote: > > Unfortunately yes. We could clean that up a bit more so that you don't > run into a BUG() assertion, but what essentially happens here is that we > completely fail to talk to the hardware. > > In this situation we can't even re-enable vesa or text console any more. > Then I don't understand why, when amdgpu is blacklisted via modprobe.blacklist=amdgpu, I do get graphics and can log in to GNOME. Yes, without hardware acceleration, but that is better than non-working graphics. It means there is some other driver (I assume this is "video") which can successfully talk to the AMD hardware under conditions where amdgpu cannot. My suggestion is that if amdgpu fails to talk to the hardware, another suitable driver should be allowed to take over. I attached a system log taken with "pci=nocrs" and "modprobe.blacklist=amdgpu" to show that graphics work correctly in this case. Would the Linux module loading mechanism need to be refined to make this possible? -- Best Regards, Mike Gavrilov. system-without-amdgpu.tar.xz Description: application/xz
Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"
On Fri, Feb 24, 2023 at 8:31 PM Christian König wrote: > > Sorry I totally missed that you attached the full dmesg to your original > mail. > > Yeah, the driver did fail gracefully. But then X doesn't come up and > then gdm just dies. Are you sure that these messages should be present when the driver fails gracefully? turning off the locking correctness validator. CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L --- --- 6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug #1 Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.320 09/07/2022 Call Trace: dump_stack_lvl+0x57/0x90 register_lock_class+0x47d/0x490 __lock_acquire+0x74/0x21f0 ? lock_release+0x155/0x450 lock_acquire+0xd2/0x320 ? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu] ? lock_is_held_type+0xce/0x120 _raw_spin_lock_irqsave+0x4d/0xa0 ? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu] amdgpu_irq_disable_all+0x37/0xf0 [amdgpu] amdgpu_device_fini_hw+0x43/0x2c0 [amdgpu] amdgpu_driver_load_kms+0xe8/0x190 [amdgpu] amdgpu_pci_probe+0x140/0x420 [amdgpu] local_pci_probe+0x41/0x90 pci_device_probe+0xc3/0x230 really_probe+0x1b6/0x410 __driver_probe_device+0x78/0x170 driver_probe_device+0x1f/0x90 __driver_attach+0xd2/0x1c0 ? __pfx___driver_attach+0x10/0x10 bus_for_each_dev+0x8a/0xd0 bus_add_driver+0x141/0x230 driver_register+0x77/0x120 ? __pfx_init_module+0x10/0x10 [amdgpu] do_one_initcall+0x6e/0x350 do_init_module+0x4a/0x220 __do_sys_init_module+0x192/0x1c0 do_syscall_64+0x5b/0x80 ? asm_exc_page_fault+0x22/0x30 ? 
lockdep_hardirqs_on+0x7d/0x100 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7fd58cfcb1be Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01 RSP: 002b:7ffd1d1065d8 EFLAGS: 0246 ORIG_RAX: 00af RAX: ffda RBX: 55b0b5aa6d70 RCX: 7fd58cfcb1be RDX: 55b0b5a96670 RSI: 016b6156 RDI: 7fd589392010 RBP: 7ffd1d106690 R08: 55b0b5a93bd0 R09: 016b6ff0 R10: 55b5eea2c333 R11: 0246 R12: 55b0b5a96670 R13: 0002 R14: 55b0b5a9c170 R15: 55b0b5aa58a0 amdgpu: probe of :03:00.0 failed with error -12 amdgpu :08:00.0: enabling device (0006 -> 0007) [drm] initializing kernel modesetting (RENOIR 0x1002:0x1638 0x1043:0x16C2 0xC4). list_add corruption. prev->next should be next (c0940328), but was . (prev=8c9b734062b0). [ cut here ] kernel BUG at lib/list_debug.c:30! invalid opcode: [#1] PREEMPT SMP NOPTI CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L --- --- 6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug #1 Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.320 09/07/2022 RIP: 0010:__list_add_valid+0x74/0x90 Code: 8d ff 0f 0b 48 89 c1 48 c7 c7 a0 3d b3 99 e8 a3 ed 8d ff 0f 0b 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f8 3d b3 99 e8 8c ed 8d ff <0f> 0b 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 50 3e b3 99 e8 75 ed 8d RSP: 0018:a50f81aafa00 EFLAGS: 00010246 RAX: 0075 RBX: 8c9b734062b0 RCX: RDX: RSI: 0027 RDI: RBP: 8c9b734062b0 R08: R09: a50f81aaf8a0 R10: 0003 R11: 8caa1d2fffe8 R12: 8c9b7c0a5e48 R13: R14: c13a6d20 R15: FS: 7fd58c6a5940() GS:8ca9d9a0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 55b0b5a955e0 CR3: 00017e86 CR4: 00750ee0 PKRU: 5554 Call Trace: ttm_device_init+0x184/0x1c0 [ttm] amdgpu_ttm_init+0xb8/0x610 [amdgpu] ? 
_printk+0x60/0x80 gmc_v9_0_sw_init+0x4a3/0x7c0 [amdgpu] amdgpu_device_init+0x14e5/0x2520 [amdgpu] amdgpu_driver_load_kms+0x15/0x190 [amdgpu] amdgpu_pci_probe+0x140/0x420 [amdgpu] local_pci_probe+0x41/0x90 pci_device_probe+0xc3/0x230 really_probe+0x1b6/0x410 __driver_probe_device+0x78/0x170 driver_probe_device+0x1f/0x90 __driver_attach+0xd2/0x1c0 ? __pfx___driver_attach+0x10/0x10 bus_for_each_dev+0x8a/0xd0 bus_add_driver+0x141/0x230 driver_register+0x77/0x120 ? __pfx_init_module+0x10/0x10 [amdgpu] do_one_initcall+0x6e/0x350 do_init_module+0x4a/0x220 __do_sys_init_module+0x192/0x1c0 do_syscall_64+0x5b/0x80 ? asm_exc_page_fault+0x22/0x30 ? lockdep_hardirqs_on+0x7d/0x100 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7fd58cfcb1be Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01 48 RSP: 002b:7ffd1d1065d8 EFLAGS: 0246 ORIG_RAX: 00af RAX: ffda RBX: 55b0b5aa6d70 RCX: 7fd58cfcb1be RDX: 55b0b5a96670 RSI: 016b6156 RDI: 7fd589392010 RBP:
Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"
On Fri, Feb 24, 2023 at 12:13 PM Christian König wrote: > > Hi Mikhail, > > this is pretty clearly a problem with the system and/or it's BIOS and > not the GPU hw or the driver. > > The option pci=nocrs makes the kernel ignore additional resource windows > the BIOS reports through ACPI. This then most likely leads to problems > with amdgpu because it can't bring up its PCIe resources any more. > > The output of "sudo lspci - -s $BUSID_OF_AMDGPU" might help > understand the problem I attach both lspci for pci=nocrs and without pci=nocrs. The differences for Cezanne Radeon Vega Series: with pci=nocrs: Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Interrupt: pin A routed to IRQ 255 Region 4: I/O ports at e000 [disabled] [size=256] Capabilities: [c0] MSI-X: Enable- Count=4 Masked- Without pci=nocrs: Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Interrupt: pin A routed to IRQ 44 Region 4: I/O ports at e000 [size=256] Capabilities: [c0] MSI-X: Enable+ Count=4 Masked- The differences for Navi 22 Radeon 6800M: with pci=nocrs: Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Interrupt: pin A routed to IRQ 255 Region 0: Memory at f8 (64-bit, prefetchable) [disabled] [size=16G] Region 2: Memory at fc (64-bit, prefetchable) [disabled] [size=256M] Region 5: Memory at fca0 (32-bit, non-prefetchable) [disabled] [size=1M] AtomicOpsCtl: ReqEn- Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+ Address: Data: Without pci=nocrs: Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 103 Region 0: Memory at f8 (64-bit, prefetchable) [size=16G] Region 2: Memory at fc (64-bit, prefetchable) [size=256M] Region 5: Memory at fca0 (32-bit, non-prefetchable) [size=1M] AtomicOpsCtl: ReqEn+ Capabilities: [a0] 
MSI: Enable+ Count=1/1 Maskable- 64bit+ Address: fee0 Data: > but I strongly suggest to try a BIOS update first. This is the first thing that was done. And I am afraid no more BIOS updates. https://rog.asus.com/laptops/rog-strix/2021-rog-strix-g15-advantage-edition-series/helpdesk_bios/ I also have experience in dealing with manufacturers' tech support. Usually it ends with "we do not provide drivers for Linux". -- Best Regards, Mike Gavrilov. ❯ sudo lspci - -s 08:00.0 08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c4) (prog-if 00 [VGA controller]) Subsystem: ASUSTeK Computer Inc. Radeon Vega 8 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ SERR- Capabilities: [50] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ MaxPayload 256 bytes, MaxReadReq 512 bytes DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend- LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s, Width x16 TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR- 10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- FRS- AtomicOpsCap: 32bit- 64bit- 
128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled, AtomicOpsCtl: ReqEn- LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS- LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit
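The lspci comparison above was done by eye. For longer dumps the same check can be scripted. This is only a rough sketch, assuming flag lines of the form "Control: I/O- Mem+ BusMaster+ ..." as printed by lspci in this message; parse_flags and changed_flags are hypothetical helper names, and the two sample lines are the Navi 22 "Control:" lines quoted above.

```python
def parse_flags(line):
    """Map each 'Name+' / 'Name-' token on an lspci flag line to True / False."""
    flags = {}
    for token in line.split():
        # Tokens such as "Control:" do not end in + or - and are skipped.
        if token[-1] in "+-":
            flags[token[:-1]] = token.endswith("+")
    return flags


def changed_flags(before, after):
    """Return {flag: (before, after)} for flags whose state differs."""
    a, b = parse_flags(before), parse_flags(after)
    return {name: (a[name], b[name])
            for name in a
            if name in b and a[name] != b[name]}


if __name__ == "__main__":
    with_nocrs = ("Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- "
                  "ParErr- Stepping- SERR- FastB2B- DisINTx-")
    without_nocrs = ("Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- "
                     "ParErr- Stepping- SERR- FastB2B- DisINTx+")
    for name, (old, new) in changed_flags(with_nocrs, without_nocrs).items():
        print(name, "enabled" if new else "disabled",
              "(was", "enabled)" if old else "disabled)")
```

For the 6800M this singles out Mem, BusMaster and DisINTx, which fits the picture of a device whose memory decoding and bus mastering were never enabled under pci=nocrs.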
amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"
Hi, I have a laptop ASUS ROG Strix G15 Advantage Edition G513QY-HQ007. But it is impossible to use without AC power because the system loses the NVMe drive when I disconnect the power adapter. Messages from the kernel log when it happens: nvme nvme0: controller is down; will reset: CSTS=0x, PCI_STATUS=0x10 nvme nvme0: Does your device have a faulty power saving mode enabled? nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug I tried the recommended parameters (nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off) to resolve this issue, but without success. On the linux-nvme mailing list the last advice was to try the "pci=nocrs" parameter. But with this parameter the amdgpu driver refuses to work and makes the system unbootable. I can work around the boot problem by blacklisting the driver, but that is not a good solution, because I don't want to lose the GPU. Why does amdgpu not work with "pci=nocrs"? And is it possible to solve this incompatibility? It is very important, because when I boot the system with "pci=nocrs" and without the amdgpu driver, the NVMe drive is not lost when I disconnect the power adapter. So "pci=nocrs" really helps. Below is what I see in the kernel log when I add the "pci=nocrs" parameter: amdgpu :03:00.0: amdgpu: Fetched VBIOS from ATRM amdgpu: ATOM BIOS: SWBRT77321.001 [drm] VCN(0) decode is enabled in VM mode [drm] VCN(0) encode is enabled in VM mode [drm] JPEG decode is enabled in VM mode Console: switching to colour dummy device 80x25 amdgpu :03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default) [drm] GPU posting now...
[drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit amdgpu :03:00.0: amdgpu: VRAM: 12272M 0x0080 - 0x0082FEFF (12272M used) amdgpu :03:00.0: amdgpu: GART: 512M 0x - 0x1FFF amdgpu :03:00.0: amdgpu: AGP: 267894784M 0x0084 - 0x [drm] Detected VRAM RAM=12272M, BAR=16384M [drm] RAM width 192bits GDDR6 [drm] amdgpu: 12272M of VRAM memory ready [drm] amdgpu: 31774M of GTT memory ready. amdgpu :03:00.0: amdgpu: (-14) failed to allocate kernel bo [drm] Debug VRAM access will use slowpath MM access amdgpu :03:00.0: amdgpu: Failed to DMA MAP the dummy page [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block failed -12 amdgpu :03:00.0: amdgpu: amdgpu_device_ip_init failed amdgpu :03:00.0: amdgpu: Fatal error during GPU init amdgpu :03:00.0: amdgpu: amdgpu: finishing device. Of course a full system log is also attached. -- Best Regards, Mike Gavrilov. system-log-Fatal-error-during-GPU-init.tar.xz Description: application/xz
Re: [bug][vaapi][h264] The commit 7cbe08a930a132d84b4cf79953b00b074ec7a2a7 on certain video files leads to problems with VAAPI hardware decoding.
On Fri, Feb 17, 2023 at 8:30 PM Alex Deucher wrote: > > On Fri, Feb 17, 2023 at 1:10 AM Mikhail Gavrilov > wrote: > > > > On Fri, Dec 9, 2022 at 7:37 PM Leo Liu wrote: > > > > > > Please try the latest AMDGPU driver: > > > > > > https://gitlab.freedesktop.org/agd5f/linux/-/commits/amd-staging-drm-next/ > > > > > > > Sorry Leo, I miss your message. > > This issue is still actual for 6.2-rc8. > > > > In my first message I was mistaken. > > > > > Before kernel 5.16 this only led to an artifact in the form of > > > a green bar at the top of the screen, then starting from 5.17 > > > the GPU began to freeze. > > > > The real behaviour before 5.18: > > - vlc could plays video with small artifacts in the form of a green > > bar on top of the video > > - after playing video process vlc correctly exiting > > > > On 5.18 this behaviour changed: > > - vlc show black screen instead of playing video > > - after playing the process not exiting > > - if I tries kill vlc process with 'kill -9' vlc became zombi process > > and many other processes start hangs (in kernel log appears follow > > lines after 2 minutes) > > > > INFO: task vlc:sh8:5248 blocked for more than 122 seconds. > > Tainted: GWL --- 5.18.0-60.fc37.x86_64+debug > > #1 > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > task:vlc:sh8 state:D stack:13616 pid: 5248 ppid: 1934 > > flags:0x4006 > > Call Trace: > > > > __schedule+0x492/0x1650 > > ? _raw_spin_unlock_irqrestore+0x40/0x60 > > ? debug_check_no_obj_freed+0x12d/0x250 > > schedule+0x4e/0xb0 > > schedule_timeout+0xe1/0x120 > > ? lock_release+0x215/0x460 > > ? trace_hardirqs_on+0x1a/0xf0 > > ? _raw_spin_unlock_irqrestore+0x40/0x60 > > dma_fence_default_wait+0x197/0x240 > > ? __bpf_trace_dma_fence+0x10/0x10 > > dma_fence_wait_timeout+0x229/0x260 > > drm_sched_entity_fini+0x101/0x270 [gpu_sched] > > amdgpu_vm_fini+0x2b5/0x460 [amdgpu] > > ? idr_destroy+0x70/0xb0 > > ? 
mutex_destroy+0x1e/0x50 > > amdgpu_driver_postclose_kms+0x1ec/0x2c0 [amdgpu] > > drm_file_free.part.0+0x20d/0x260 > > drm_release+0x6a/0x120 > > __fput+0xab/0x270 > > task_work_run+0x5c/0xa0 > > do_exit+0x394/0xc40 > > ? rcu_read_lock_sched_held+0x10/0x70 > > do_group_exit+0x33/0xb0 > > get_signal+0xbbc/0xbc0 > > arch_do_signal_or_restart+0x30/0x770 > > ? do_futex+0xfd/0x190 > > ? __x64_sys_futex+0x63/0x190 > > exit_to_user_mode_prepare+0x172/0x270 > > syscall_exit_to_user_mode+0x16/0x50 > > do_syscall_64+0x67/0x80 > > ? do_syscall_64+0x67/0x80 > > ? rcu_read_lock_sched_held+0x10/0x70 > > ? trace_hardirqs_on_prepare+0x5e/0x110 > > ? do_syscall_64+0x67/0x80 > > ? rcu_read_lock_sched_held+0x10/0x70 > > entry_SYSCALL_64_after_hwframe+0x44/0xae > > RIP: 0033:0x7f82c2364529 > > RSP: 002b:7f8210ff8c00 EFLAGS: 0246 ORIG_RAX: 00ca > > RAX: fe00 RBX: RCX: 7f82c2364529 > > RDX: RSI: 0189 RDI: 7f823022542c > > RBP: 7f8210ff8c30 R08: R09: > > R10: R11: 0246 R12: > > R13: R14: 0001 R15: 7f823022542c > > > > INFO: lockdep is turned off. > > > > I bisected this issue and problematic commit is > > > > ❯ git bisect bad > > 5f3854f1f4e211f494018160b348a1c16e58013f is the first bad commit > > commit 5f3854f1f4e211f494018160b348a1c16e58013f > > Author: Alex Deucher > > Date: Thu Mar 24 18:04:00 2022 -0400 > > > > drm/amdgpu: add more cases to noretry=1 > > > > Port current list from amd-staging-drm-next. > > > > Signed-off-by: Alex Deucher > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > Unfortunately I couldn't simply revert this commit on 6.2-rc8 for > > checking, because it leads to conflicts. > > > > Alex, you as author of this commit could help me with it? > > append amdgpu.noretry=0 to the kernel command line in grub. Thanks, I checked the "amdgpu.noretry=0" and after the page fault occurs vlc could play video with little artifacts. So I have some questions: 1. 
Why were retries disabled by default if they are still needed for recoverable page faults? As Christian answered me before here: https
Re: [bug][vaapi][h264] The commit 7cbe08a930a132d84b4cf79953b00b074ec7a2a7 on certain video files leads to problems with VAAPI hardware decoding.
On Fri, Dec 9, 2022 at 7:37 PM Leo Liu wrote:
>
> Please try the latest AMDGPU driver:
>
> https://gitlab.freedesktop.org/agd5f/linux/-/commits/amd-staging-drm-next/
>

Sorry Leo, I missed your message. This issue is still present in 6.2-rc8.

In my first message I was mistaken.

> Before kernel 5.16 this only led to an artifact in the form of
> a green bar at the top of the screen, then starting from 5.17
> the GPU began to freeze.

The real behaviour before 5.18:
- vlc could play the video with small artifacts in the form of a green bar on top of the video
- after playing the video, the vlc process exited correctly

On 5.18 this behaviour changed:
- vlc shows a black screen instead of playing the video
- after playing, the process does not exit
- if I try to kill the vlc process with 'kill -9', vlc becomes a zombie process and many other processes start to hang (the following lines appear in the kernel log after 2 minutes)

INFO: task vlc:sh8:5248 blocked for more than 122 seconds.
Tainted: GWL --- 5.18.0-60.fc37.x86_64+debug #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:vlc:sh8 state:D stack:13616 pid: 5248 ppid: 1934 flags:0x4006
Call Trace:
 __schedule+0x492/0x1650
 ? _raw_spin_unlock_irqrestore+0x40/0x60
 ? debug_check_no_obj_freed+0x12d/0x250
 schedule+0x4e/0xb0
 schedule_timeout+0xe1/0x120
 ? lock_release+0x215/0x460
 ? trace_hardirqs_on+0x1a/0xf0
 ? _raw_spin_unlock_irqrestore+0x40/0x60
 dma_fence_default_wait+0x197/0x240
 ? __bpf_trace_dma_fence+0x10/0x10
 dma_fence_wait_timeout+0x229/0x260
 drm_sched_entity_fini+0x101/0x270 [gpu_sched]
 amdgpu_vm_fini+0x2b5/0x460 [amdgpu]
 ? idr_destroy+0x70/0xb0
 ? mutex_destroy+0x1e/0x50
 amdgpu_driver_postclose_kms+0x1ec/0x2c0 [amdgpu]
 drm_file_free.part.0+0x20d/0x260
 drm_release+0x6a/0x120
 __fput+0xab/0x270
 task_work_run+0x5c/0xa0
 do_exit+0x394/0xc40
 ? rcu_read_lock_sched_held+0x10/0x70
 do_group_exit+0x33/0xb0
 get_signal+0xbbc/0xbc0
 arch_do_signal_or_restart+0x30/0x770
 ? do_futex+0xfd/0x190
 ? __x64_sys_futex+0x63/0x190
 exit_to_user_mode_prepare+0x172/0x270
 syscall_exit_to_user_mode+0x16/0x50
 do_syscall_64+0x67/0x80
 ? do_syscall_64+0x67/0x80
 ? rcu_read_lock_sched_held+0x10/0x70
 ? trace_hardirqs_on_prepare+0x5e/0x110
 ? do_syscall_64+0x67/0x80
 ? rcu_read_lock_sched_held+0x10/0x70
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f82c2364529
RSP: 002b:7f8210ff8c00 EFLAGS: 0246 ORIG_RAX: 00ca
RAX: fe00 RBX: RCX: 7f82c2364529
RDX: RSI: 0189 RDI: 7f823022542c
RBP: 7f8210ff8c30 R08: R09:
R10: R11: 0246 R12:
R13: R14: 0001 R15: 7f823022542c

INFO: lockdep is turned off.

I bisected this issue and the problematic commit is:

❯ git bisect bad
5f3854f1f4e211f494018160b348a1c16e58013f is the first bad commit
commit 5f3854f1f4e211f494018160b348a1c16e58013f
Author: Alex Deucher
Date: Thu Mar 24 18:04:00 2022 -0400

    drm/amdgpu: add more cases to noretry=1

    Port current list from amd-staging-drm-next.

    Signed-off-by: Alex Deucher

 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 3 +++
 1 file changed, 3 insertions(+)

Unfortunately I couldn't simply revert this commit on 6.2-rc8 to check, because the revert leads to conflicts.

Alex, as the author of this commit, could you help me with it?

--
Best Regards,
Mike Gavrilov.
Re: [regression][6.0] After commit b261509952bc19d1012cf732f853659be6ebc61e I see WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70
On Thu, Feb 9, 2023 at 10:17 PM Leo Li wrote:
>
> Hi Mikhail, seems like your report flew past me, thanks for the ping.
>
> This might be a simple issue of not backing off when deadlock was hit.
> drm_atomic_normalize_zpos() can return an error code, and I ignored it
> (oops!)
>
> Can you give this patch a try?
> https://gitlab.freedesktop.org/-/snippets/7414
>
> - Leo
>

Thanks, I think the testing time was sufficient. I observed three computers with different GPUs (6800M, 6900XT and 7900XTX) for more than 3 days, and the warning message about drm_modeset_drop_locks no longer appears. I hope this patch can be merged into 6.2 before the release.

Tested-by: Mikhail Gavrilov

--
Best Regards,
Mike Gavrilov.

uptime.tar.xz
Description: application/xz
Re: [regression][6.0] After commit b261509952bc19d1012cf732f853659be6ebc61e I see WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70
Harry, please don't ignore me. This issue still happens in 6.1 and 6.2 Leo you are the author of the problematic commit please don't stand aside. Really nobody is interested in clean logs without warnings and errors? I am 100% sure that reverting commit b261509952bc19d1012cf732f853659be6ebc61e will stop these warnings. I also attached fresh logs from 6.2.0-0.rc6. 6.2-rc7 I started to build without commit b261509952bc19d1012cf732f853659be6ebc61e to avoid these warnings. On Thu, Oct 13, 2022 at 6:36 PM Mikhail Gavrilov > > Hi! > I bisected an issue of the 6.0 kernel which started happening after > 6.0-rc7 on all my machines. > > Backtrace of this issue looks like as: > > [ 2807.339439] [ cut here ] > [ 2807.339445] WARNING: CPU: 11 PID: 2061 at > drivers/gpu/drm/drm_modeset_lock.c:276 > drm_modeset_drop_locks+0x63/0x70 > [ 2807.339453] Modules linked in: tls uinput rfcomm snd_seq_dummy > snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast > nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet > nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat > nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink > qrtr bnep intel_rapl_msr intel_rapl_common snd_sof_amd_renoir > snd_sof_amd_acp snd_sof_pci snd_hda_codec_realtek sunrpc snd_sof > snd_hda_codec_hdmi snd_hda_codec_generic snd_sof_utils snd_hda_intel > snd_intel_dspcfg mt7921e snd_intel_sdw_acpi binfmt_misc snd_soc_core > mt7921_common snd_hda_codec snd_compress vfat ac97_bus edac_mce_amd > mt76_connac_lib snd_pcm_dmaengine fat snd_hda_core snd_rpl_pci_acp6x > snd_pci_acp6x mt76 btusb snd_hwdep kvm_amd btrtl snd_seq btbcm > mac80211 snd_seq_device kvm btintel btmtk libarc4 snd_pcm > snd_pci_acp5x bluetooth snd_timer snd_rn_pci_acp3x irqbypass > snd_acp_config snd_soc_acpi cfg80211 rapl snd joydev pcspkr > asus_nb_wmi wmi_bmof > [ 2807.339519] snd_pci_acp3x soundcore i2c_piix4 k10temp amd_pmc > asus_wireless zram amdgpu drm_ttm_helper ttm hid_asus asus_wmi > 
crct10dif_pclmul iommu_v2 crc32_pclmul ledtrig_audio crc32c_intel > gpu_sched sparse_keymap platform_profile hid_multitouch > polyval_clmulni nvme ucsi_acpi drm_buddy polyval_generic > drm_display_helper ghash_clmulni_intel serio_raw nvme_core ccp > typec_ucsi rfkill sp5100_tco r8169 cec nvme_common typec wmi video > i2c_hid_acpi i2c_hid ip6_tables ip_tables fuse > [ 2807.339540] Unloaded tainted modules: acpi_cpufreq():1 > acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 > acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 > amd64_edac():1 acpi_cpufreq():1 acpi_cpufreq():1 amd64_edac():1 > amd64_edac():1 acpi_cpufreq():1 pcc_cpufreq():1 fjes():1 > amd64_edac():1 acpi_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 > fjes():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 fjes():1 > amd64_edac():1 acpi_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 > fjes():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 > amd64_edac():1 fjes():1 acpi_cpufreq():1 amd64_edac():1 > pcc_cpufreq():1 acpi_cpufreq():1 fjes():1 amd64_edac():1 > pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 > fjes():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 > acpi_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 fjes():1 > acpi_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 > acpi_cpufreq():1 fjes():1 pcc_cpufreq():1 acpi_cpufreq():1 > pcc_cpufreq():1 fjes():1 > [ 2807.339579] acpi_cpufreq():1 fjes():1 pcc_cpufreq():1 > acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 fjes():1 > acpi_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 > acpi_cpufreq():1 fjes():1 acpi_cpufreq():1 fjes():1 fjes():1 fjes():1 > fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 > fjes():1 fjes():1 fjes():1 fjes():1 > [ 2807.339596] CPU: 11 PID: 2061 Comm: gnome-shell Tainted: GW >L 6.0.0-rc4-07-cb0eca01ad9756e853efec3301203c2b5b45aa9f+ #16 > [ 2807.339598] Hardware name: ASUSTeK COMPUTER INC. 
ROG Strix > G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022 > [ 2807.339600] RIP: 0010:drm_modeset_drop_locks+0x63/0x70 > [ 2807.339602] Code: 42 08 48 89 10 48 89 1b 48 8d bb 50 ff ff ff 48 > 89 5b 08 e8 3f 41 55 00 48 8b 45 78 49 39 c4 75 c6 5b 5d 41 5c c3 cc > cc cc cc <0f> 0b eb ac 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 > 41 54 > [ 2807.339604] RSP: 0018:b6ad46e07b80 EFLAGS: 00010282 > [ 2807.339606] RAX: 0001 RBX: RCX: > 0002 > [ 2807.339607] RDX: 0001 RSI: a6a118b1 RDI: > b6ad46e07c00 > [ 2807.339608] RBP: b6ad46e07c00 R08: R09: > > [ 2807.339609] R10: R11: 0001 R12: > > [ 2807.339610] R13: 9
Re: [PATCH] drm/amd: fix memory leak in amdgpu_cs_sync_rings
On Fri, Feb 3, 2023 at 12:10 AM Bert Karwatzki wrote:
>
> I hope I got it right this time:
> Here is the fix for
> Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/2360
>
> From 6e064c9565ef0da890f3fcb2a4f6a8cd44a12fdb Mon Sep 17 00:00:00 2001
> From: Bert Karwatzki
> Date: Thu, 2 Feb 2023 19:50:27 +0100
> Subject: [PATCH] Fix memory leak in amdgpu_cs_sync_rings.
>
> Signed-off-by: Bert Karwatzki
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 0f4cb41078c1..08eced097bd8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -1222,10 +1222,13 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
>  		 * next job actually sees the results from the previous one
>  		 * before we start executing on the same scheduler ring.
>  		 */
> -		if (!s_fence || s_fence->sched != sched)
> +		if (!s_fence || s_fence->sched != sched) {
> +			dma_fence_put(fence);
>  			continue;
> +		}
>
>  		r = amdgpu_sync_fence(&p->gang_leader->explicit_sync, fence);
> +		dma_fence_put(fence);
>  		if (r)
>  			return r;
>  	}
> --
> 2.39.1
>

As a bug reporter I can confirm this patch fixes a memory leak.

Tested-by: Mikhail Gavrilov

--
Best Regards,
Mike Gavrilov.
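
To illustrate why both dma_fence_put() calls in the patch above matter, here is a minimal userspace sketch in Python. It is not kernel code: the Fence class and the sync_rings() function are hypothetical stand-ins, and the sketch assumes (as the patch implies) that the loop receives each fence with a reference it owns, and that adding a fence to explicit_sync takes a separate reference.

```python
class Fence:
    """Toy reference-counted object standing in for a dma_fence (hypothetical)."""

    def __init__(self, sched):
        self.sched = sched
        self.refcount = 1  # the one reference handed to the loop below

    def get(self):
        self.refcount += 1
        return self

    def put(self):
        assert self.refcount > 0, "refcount underflow"
        self.refcount -= 1


def sync_rings(pending, sched, fixed):
    """Walk `pending`, keeping only fences that belong to `sched`.

    With fixed=False the loop never drops its own reference (the leak);
    with fixed=True it drops it on both paths, as the patch does.
    """
    explicit_sync = []
    for fence in pending:
        if fence.sched != sched:
            if fixed:
                fence.put()  # skipped fence: release the loop's reference
            continue
        explicit_sync.append(fence.get())  # the container takes its own reference
        if fixed:
            fence.put()  # the loop's reference is no longer needed
    return explicit_sync


def run(fixed):
    # "gfx" and "sdma" are arbitrary scheduler labels chosen for the example.
    same, other = Fence("gfx"), Fence("sdma")
    sync_rings([same, other], "gfx", fixed)
    return same.refcount, other.refcount


leaky_counts = run(fixed=False)
fixed_counts = run(fixed=True)
print("without the puts:", leaky_counts)
print("with the puts:   ", fixed_counts)
```

In the fixed run the kept fence ends with exactly one reference (the one held by the container) and the skipped fence with none. In the leaky run each fence keeps one extra reference that nothing will ever release.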
Re: [PATCH] drm/amdgpu: grab extra fence reference for drm_sched_job_add_dependency
On Thu, Jan 5, 2023 at 3:03 PM Christian König wrote:
>
> That one should be fixed by:
>
> commit 9f1ecfc5dcb47a7ca37be47b0eaca0f37f1ae93d
> Author: Dmitry Osipenko
> Date: Wed Nov 23 03:13:03 2022 +0300
>

Christian,
This patch was written on Nov. 23, 2022, but it has still not been merged into 6.2! Why? It would close all my open questions about amdgpu right now.

Tested-by: Mikhail Gavrilov

--
Best Regards,
Mike Gavrilov.
[6.2][regression] looks like commit aab9cf7b6954136f4339136a1a7fc0602a2c4d8b leads to use-after-free and random computer hangs
Hi, The kernel 6.2 preparation cycle has begun. And after the kernel was updated on my Fedora Rawhide I started receiving use-after-free errors with complete computer hangs. At least a good reproducer of this behaviour is launch of the game "Marvel's Avengers". The backtrace of the issue looks like: [ 550.435083] [ cut here ] [ 550.435110] refcount_t: underflow; use-after-free. [ 550.435808] WARNING: CPU: 9 PID: 738 at lib/refcount.c:25 refcount_warn_saturate+0x97/0x110 [ 550.435812] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer netconsole nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack [ 550.435887] refcount_t: saturated; leaking memory. [ 550.435893] nf_defrag_ipv6 nf_defrag_ipv4 [ 550.435902] WARNING: CPU: 26 PID: 5032 at lib/refcount.c:19 refcount_warn_saturate+0x74/0x110 [ 550.435907] ip_set [ 550.435909] Modules linked in: [ 550.435910] nf_tables [ 550.435912] uinput rfcomm [ 550.435918] nfnetlink [ 550.435919] snd_seq_dummy snd_hrtimer [ 550.435925] qrtr [ 550.435926] netconsole nft_objref [ 550.435931] bnep [ 550.435933] nf_conntrack_netbios_ns nf_conntrack_broadcast [ 550.435938] sunrpc [ 550.435939] nft_fib_inet [ 550.435941] binfmt_misc [ 550.435942] nft_fib_ipv4 [ 550.435943] iwlmvm [ 550.435130] WARNING: CPU: 25 PID: 740 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 [ 550.435945] nft_fib_ipv6 [ 550.435946] btusb [ 550.435947] nft_fib nft_reject_inet [ 550.435954] btrtl [ 550.435955] nf_reject_ipv4 nf_reject_ipv6 [ 550.435963] btbcm [ 550.435964] nft_reject nft_ct [ 550.435969] btintel [ 550.435971] nft_chain_nat nf_nat [ 550.435977] btmtk [ 550.435979] nf_conntrack nf_defrag_ipv6 [ 550.435984] snd_seq_midi [ 550.435985] nf_defrag_ipv4 ip_set [ 550.435991] snd_seq_midi_event [ 550.435992] nf_tables [ 550.435993] bluetooth [ 550.435995] nfnetlink [ 550.435996] hid_logitech_hidpp [ 
550.435142] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer netconsole nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc binfmt_misc iwlmvm btusb btrtl btbcm btintel btmtk snd_seq_midi snd_seq_midi_event bluetooth hid_logitech_hidpp snd_usb_audio iwlwifi xpad ff_memless snd_usbmidi_lib snd_rawmidi mc ecdh_generic intel_rapl_msr intel_rapl_common mt76x2u mt76x2_common joydev snd_hda_codec_realtek mt76x02_usb edac_mce_amd snd_hda_codec_generic mt76_usb snd_hda_codec_hdmi mt76x02_lib kvm_amd snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec mt76 vfat kvm snd_hda_core fat snd_seq snd_hwdep irqbypass snd_seq_device mac80211 snd_pcm eeepc_wmi asus_wmi ledtrig_audio sparse_keymap rapl platform_profile wmi_bmof snd_timer snd pcspkr i2c_piix4 [ 550.435997] qrtr bnep [ 550.436003] snd_usb_audio [ 550.436004] sunrpc binfmt_misc [ 550.436010] iwlwifi [ 550.436012] iwlmvm btusb [ 550.436018] xpad [ 550.436019] btrtl btbcm [ 550.436025] ff_memless [ 550.436026] btintel [ 550.436027] snd_usbmidi_lib [ 550.436029] btmtk [ 550.436030] snd_rawmidi [ 550.436031] snd_seq_midi snd_seq_midi_event [ 550.436037] mc [ 550.436038] bluetooth [ 550.436039] ecdh_generic [ 550.436041] hid_logitech_hidpp snd_usb_audio [ 550.436046] intel_rapl_msr [ 550.436048] iwlwifi xpad [ 550.436054] intel_rapl_common [ 550.436055] ff_memless [ 550.436056] mt76x2u [ 550.436058] snd_usbmidi_lib snd_rawmidi [ 550.436063] mt76x2_common [ 550.436064] mc ecdh_generic [ 550.436070] joydev [ 550.436071] intel_rapl_msr intel_rapl_common [ 550.436076] snd_hda_codec_realtek [ 550.436078] mt76x2u [ 550.436079] mt76x02_usb [ 550.436080] mt76x2_common joydev [ 550.436086] edac_mce_amd [ 550.436088] snd_hda_codec_realtek mt76x02_usb [ 550.436094] snd_hda_codec_generic [ 
550.436095] edac_mce_amd [ 550.436096] mt76_usb [ 550.436098] snd_hda_codec_generic mt76_usb [ 550.436104] snd_hda_codec_hdmi [ 550.436106] snd_hda_codec_hdmi [ 550.436107] mt76x02_lib [ 550.435234] k10temp soundcore libarc4 acpi_cpufreq cfg80211 hid_logitech_dj rfkill zram amdgpu drm_ttm_helper ttm video iommu_v2 gpu_sched drm_buddy crct10dif_pclmul crc32_pclmul crc32c_intel igb ucsi_ccg drm_display_helper nvme typec_ucsi ghash_clmulni_intel ccp typec cec sp5100_tco dca sha512_ssse3 nvme_core wmi ip6_tables ip_tables fuse [ 550.436108] mt76x02_lib kvm_amd [ 550.436115] kvm_amd [ 550.436116] snd_hda_intel snd_intel_dspcfg [ 550.436122] snd_hda_intel [ 550.436123] snd_intel_sdw_acpi [ 550.435284] CPU: 25 PID: 740 Comm: sdma2 Tainted: GWL
Re: Screen corruption using radeon kernel driver
On Wed, Nov 30, 2022 at 11:07:32AM -0500, Alex Deucher wrote: > On Wed, Nov 30, 2022 at 10:42 AM Robin Murphy wrote: > > > > On 2022-11-30 14:28, Alex Deucher wrote: > > > On Wed, Nov 30, 2022 at 7:54 AM Robin Murphy wrote: > > >> > > >> On 2022-11-29 17:11, Mikhail Krylov wrote: > > >>> On Tue, Nov 29, 2022 at 11:05:28AM -0500, Alex Deucher wrote: > > >>>> On Tue, Nov 29, 2022 at 10:59 AM Mikhail Krylov > > >>>> wrote: > > >>>>> > > >>>>> On Tue, Nov 29, 2022 at 09:44:19AM -0500, Alex Deucher wrote: > > >>>>>> On Mon, Nov 28, 2022 at 3:48 PM Mikhail Krylov > > >>>>>> wrote: > > >>>>>>> > > >>>>>>> On Mon, Nov 28, 2022 at 09:50:50AM -0500, Alex Deucher wrote: > > >>>>>>> > > >>>>>>>>>> [excessive quoting removed] > > >>>>>>> > > >>>>>>>>> So, is there any progress on this issue? I do understand it's not > > >>>>>>>>> a high > > >>>>>>>>> priority one, and today I've checked it on 6.0 kernel, and > > >>>>>>>>> unfortunately, it still persists... > > >>>>>>>>> > > >>>>>>>>> I'm considering writing a patch that will allow user to override > > >>>>>>>>> need_dma32/dma_bits setting with a module parameter. I'll have > > >>>>>>>>> some time > > >>>>>>>>> after the New Year for that. > > >>>>>>>>> > > >>>>>>>>> Is it at all possible that such a patch will be merged into > > >>>>>>>>> kernel? > > >>>>>>>>> > > >>>>>>>> On Mon, Nov 28, 2022 at 9:31 AM Mikhail Krylov > > >>>>>>>> wrote: > > >>>>>>>> Unless someone familiar with HIMEM can figure out what is going > > >>>>>>>> wrong > > >>>>>>>> we should just revert the patch. 
> > >>>>>>>> > > >>>>>>>> Alex > > >>>>>>> > > >>>>>>> > > >>>>>>> Okay, I was suggesting that mostly because > > >>>>>>> > > >>>>>>> a) it works for me with dma_bits = 40 (I understand that's what it > > >>>>>>> is > > >>>>>>> without the original patch applied); > > >>>>>>> > > >>>>>>> b) there's a hint of uncertainity on this line > > >>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/radeon/radeon_device.c#n1359 > > >>>>>>> saying that for AGP dma_bits = 32 is the safest option, so > > >>>>>>> apparently there are > > >>>>>>> setups, unlike mine, where dma_bits = 32 is better than 40. > > >>>>>>> > > >>>>>>> But I'm in no position to argue, just wanted to make myself clear. > > >>>>>>> I'm okay with rebuilding the kernel for my machine until the > > >>>>>>> original > > >>>>>>> patch is reverted or any other fix is applied. > > >>>>>> > > >>>>>> What GPU do you have and is it AGP? If it is AGP, does setting > > >>>>>> radeon.agpmode=-1 also fix it? > > >>>>>> > > >>>>>> Alex > > >>>>> > > >>>>> That is ATI Radeon X1950, and, unfortunately, radeon.agpmode=-1 > > >>>>> doesn't > > >>>>> help, it just makes 3D acceleration in games such as OpenArena stop > > >>>>> working. > > >>>> > > >>>> Just to confirm, is the board AGP or PCIe? > > >>>> > > >>>> Alex > > >>> > > >>> It is AGP. That's an old machine. > > >> > > >> Can you check whether dma_addressing_limited() is actually returning the > > >> expected result at the point of radeon_ttm_init()? Disabling highmem is > > >> presumably just hiding whatever problem exists, by throwing away all > > >> >32-bit RAM such that use_dma32 doesn't matter. > > > > > > The device in question only supports a 32 bit DMA mask so > > > dma_addressing_limited() should return true. Bounce buffers are not > > > really usable on GPUs because they map so much memory. If > > > dma_addressing_limited() returns false, that would explain it. 
> > >
> > Right, it appears to be the only part of the offending commit that
> > *could* reasonably make any difference, so I'm primarily wondering if
> > dma_get_required_mask() somehow gets confused.
>
> Mikhail,
>
> Can you see what dma_addressing_limited() and dma_get_required_mask()
> return in this case?
>
> Alex
>
> >
> > Thanks,
> > Robin.

Hello again,

By adding printk() calls to the functions and recompiling the kernel, I was able to confirm that dma_addressing_limited() returns *false* on the kernel with the bug, and that dma_get_required_mask() returns 0x7fff, as I said before.

signature.asc
Description: PGP signature
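
The result reported above can be reasoned about with a small model of the check. The sketch below is Python, not kernel code. It assumes that dma_addressing_limited() compares the smaller non-zero value of the device DMA mask and the bus DMA limit against dma_get_required_mask(), which is how the helper appears to be defined in kernels of that period. All mask values are illustrative and are not the values measured on this machine.

```python
def dma_bit_mask(bits):
    """Mask with the low `bits` bits set (models the kernel's DMA_BIT_MASK)."""
    return (1 << bits) - 1


def min_not_zero(a, b):
    """Smaller of two values, treating zero as 'not set'."""
    if a == 0:
        return b
    if b == 0:
        return a
    return min(a, b)


def addressing_limited(dev_mask, bus_limit, required_mask):
    """Model of the check: the device is 'limited' only if it cannot reach
    everything the platform reports as required."""
    return min_not_zero(dev_mask, bus_limit) < required_mask


dev_mask = dma_bit_mask(32)        # a 32-bit-only device, as described in the thread
required_low = dma_bit_mask(31)    # hypothetical: platform claims 31 bits are enough
required_high = dma_bit_mask(36)   # hypothetical: RAM above 4 GiB is reported

limited_low = addressing_limited(dev_mask, 0, required_low)
limited_high = addressing_limited(dev_mask, 0, required_high)
print("required mask below device mask ->", limited_low)
print("required mask above device mask ->", limited_high)
```

Under these assumptions, a required mask that is smaller than the 32-bit device mask makes the check return false, even if the machine actually has memory the device cannot address. That is consistent with the suspicion in the quoted text that dma_get_required_mask() is the piece reporting an unexpected value.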
Re: [bug][vaapi][h264] The commit 7cbe08a930a132d84b4cf79953b00b074ec7a2a7 on certain video files leads to problems with VAAPI hardware decoding.
On Wed, Dec 7, 2022 at 7:58 PM Alex Deucher wrote:
>
>
> What GPU do you have and what entries do you have in
> sys/class/drm/card0/device/ip_discovery/die/0/UVD for the device?

I bisected the issue on the Radeon 6800M.
The parent commit of 7cbe08a930a132d84b4cf79953b00b074ec7a2a7 is 46dd2965bdd1c5a4f6499c73ff32e636fa8f9769.
For both commits ip_discovery is absent.

# ls /sys/class/drm/card0/device/ | grep ip
# ls /sys/class/drm/card1/device/ | grep ip

But from the verbose output I see that on 7cbe08a930a132d84b4cf79953b00b074ec7a2a7 the player uses hardware acceleration:

$ vlc -v Downloads/test_sample_480_2.mp4
VLC media player 3.0.18 Vetinari (revision )
[561f72097520] main libvlc: Running vlc with the default interface. Use 'cvlc' to use vlc without interface.
[7fa224001190] mp4 demux warning: elst box found
[7fa224001190] mp4 demux warning: STTS table of 1 entries
[7fa224001190] mp4 demux warning: CTTS table of 78 entries
[7fa224001190] mp4 demux warning: elst box found
[7fa224001190] mp4 demux warning: STTS table of 1 entries
[7fa224001190] mp4 demux warning: elst old=0 new=1
[7fa224d19010] faad decoder warning: decoded zero sample
[7fa224001190] mp4 demux warning: elst old=0 new=1
[7fa214007030] gl gl: Initialized libplacebo v4.208.0 (API v208)
libva info: VA-API version 1.16.0
libva error: vaGetDriverNameByIndex() failed with unknown libva error, driver_name = (null)
[7fa214007030] glconv_vaapi_x11 gl error: vaInitialize: unknown libva error
libva info: VA-API version 1.16.0
libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
libva info: Found init function __vaDriverInit_1_16
libva info: va_openDriver() returns 0
[7fa224c0b3a0] avcodec decoder: Using Mesa Gallium driver 23.0.0-devel for AMD Radeon RX 6800M (navi22, LLVM 15.0.4, DRM 3.42, 5.14.0-rc4-14-7cbe08a930a132d84b4cf79953b00b074ec7a2a7+) for hardware decoding
[h264 @ 0x7fa224c3fa40] Using deprecated struct vaapi_context in decode.
[561f72174de0] pulse audio output warning: starting late (-9724 us)

And on commit 46dd2965bdd1c5a4f6499c73ff32e636fa8f9769 it did not use acceleration:

$ vlc -v Downloads/test_sample_480_2.mp4
VLC media player 3.0.18 Vetinari (revision )
[55f61ad35520] main libvlc: Running vlc with the default interface. Use 'cvlc' to use vlc without interface.
[7fc7e8001190] mp4 demux warning: elst box found
[7fc7e8001190] mp4 demux warning: STTS table of 1 entries
[7fc7e8001190] mp4 demux warning: CTTS table of 78 entries
[7fc7e8001190] mp4 demux warning: elst box found
[7fc7e8001190] mp4 demux warning: STTS table of 1 entries
[7fc7e8001190] mp4 demux warning: elst old=0 new=1
[7fc7e8d19010] faad decoder warning: decoded zero sample
[7fc7e8001190] mp4 demux warning: elst old=0 new=1
[7fc7d8007030] gl gl: Initialized libplacebo v4.208.0 (API v208)
libva info: VA-API version 1.16.0
libva error: vaGetDriverNameByIndex() failed with unknown libva error, driver_name = (null)
[7fc7d8007030] glconv_vaapi_x11 gl error: vaInitialize: unknown libva error
libva info: VA-API version 1.16.0
libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
libva info: Found init function __vaDriverInit_1_16
libva info: va_openDriver() returns 0
[7fc7d40b3260] vaapi generic error: profile(7) is not supported
[7fc7d8a089c0] gl gl: Initialized libplacebo v4.208.0 (API v208)
Failed to open VDPAU backend libvdpau_nvidia.so: cannot open shared object file: No such file or directory
Failed to open VDPAU backend libvdpau_nvidia.so: cannot open shared object file: No such file or directory
[7fc7d89e4f80] gl gl: Initialized libplacebo v4.208.0 (API v208)
[55f61ae12de0] pulse audio output warning: starting late (-13537 us)

So my bisect didn't make sense :(
Anyway, can you reproduce the issue with the attached sample file and vlc on a fresh kernel (6.1-rc8)?

Thanks!

--
Best Regards,
Mike Gavrilov.
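
The comparison made above between the two vlc runs can also be done mechanically. The following Python sketch is only an illustration: the function names are hypothetical, the two sample logs are abbreviated from the output quoted in this message, and the matched substrings are simply the ones that appear there, not a documented VLC interface.

```python
def used_hw_decoding(log_text):
    """True if a 'vlc -v' log contains the avcodec line announcing hardware decoding."""
    for line in log_text.splitlines():
        if "avcodec decoder: Using" in line and line.rstrip().endswith("for hardware decoding"):
            return True
    return False


def vaapi_profile_rejected(log_text):
    """True if the log contains a VAAPI 'profile(N) is not supported' error."""
    for line in log_text.splitlines():
        if "vaapi generic error: profile(" in line and "is not supported" in line:
            return True
    return False


# Abbreviated samples; "(...)" replaces the driver details printed in the real log.
log_on_7cbe08a = """libva info: va_openDriver() returns 0
[7fa224c0b3a0] avcodec decoder: Using Mesa Gallium driver 23.0.0-devel for AMD Radeon RX 6800M (...) for hardware decoding
"""
log_on_46dd296 = """libva info: va_openDriver() returns 0
[7fc7d40b3260] vaapi generic error: profile(7) is not supported
"""

hw_new = used_hw_decoding(log_on_7cbe08a)
hw_old = used_hw_decoding(log_on_46dd296)
rejected_old = vaapi_profile_rejected(log_on_46dd296)
print("7cbe08a uses hardware decoding:", hw_new)
print("46dd296 uses hardware decoding:", hw_old, "| profile rejected:", rejected_old)
```

A check like this makes it easier to confirm, at each bisect step, whether the kernel under test exercised the hardware decode path at all before judging the step good or bad.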
Re: Screen corruption using radeon kernel driver
On Thu, Dec 01, 2022 at 02:00:58PM +, Robin Murphy wrote: > On 2022-11-30 19:59, Mikhail Krylov wrote: > > On Wed, Nov 30, 2022 at 11:07:32AM -0500, Alex Deucher wrote: > > > On Wed, Nov 30, 2022 at 10:42 AM Robin Murphy > > > wrote: > > > > > > > > On 2022-11-30 14:28, Alex Deucher wrote: > > > > > On Wed, Nov 30, 2022 at 7:54 AM Robin Murphy > > > > > wrote: > > > > > > > > > > > > On 2022-11-29 17:11, Mikhail Krylov wrote: > > > > > > > On Tue, Nov 29, 2022 at 11:05:28AM -0500, Alex Deucher wrote: > > > > > > > > On Tue, Nov 29, 2022 at 10:59 AM Mikhail Krylov > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > On Tue, Nov 29, 2022 at 09:44:19AM -0500, Alex Deucher wrote: > > > > > > > > > > On Mon, Nov 28, 2022 at 3:48 PM Mikhail Krylov > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > On Mon, Nov 28, 2022 at 09:50:50AM -0500, Alex Deucher > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > [excessive quoting removed] > > > > > > > > > > > > > > > > > > > > > > > > So, is there any progress on this issue? I do > > > > > > > > > > > > > understand it's not a high > > > > > > > > > > > > > priority one, and today I've checked it on 6.0 > > > > > > > > > > > > > kernel, and > > > > > > > > > > > > > unfortunately, it still persists... > > > > > > > > > > > > > > > > > > > > > > > > > > I'm considering writing a patch that will allow user > > > > > > > > > > > > > to override > > > > > > > > > > > > > need_dma32/dma_bits setting with a module parameter. > > > > > > > > > > > > > I'll have some time > > > > > > > > > > > > > after the New Year for that. > > > > > > > > > > > > > > > > > > > > > > > > > > Is it at all possible that such a patch will be > > > > > > > > > > > > > merged into kernel? 
> > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Nov 28, 2022 at 9:31 AM Mikhail Krylov > > > > > > > > > > > > wrote: > > > > > > > > > > > > Unless someone familiar with HIMEM can figure out what > > > > > > > > > > > > is going wrong > > > > > > > > > > > > we should just revert the patch. > > > > > > > > > > > > > > > > > > > > > > > > Alex > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Okay, I was suggesting that mostly because > > > > > > > > > > > > > > > > > > > > > > a) it works for me with dma_bits = 40 (I understand > > > > > > > > > > > that's what it is > > > > > > > > > > > without the original patch applied); > > > > > > > > > > > > > > > > > > > > > > b) there's a hint of uncertainity on this line > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/radeon/radeon_device.c#n1359 > > > > > > > > > > > saying that for AGP dma_bits = 32 is the safest option, > > > > > > > > > > > so apparently there are > > > > > > > > > > > setups, unlike mine, where dma_bits = 32 is better than > > > > > > > > > > > 40. > > > > > > > > > > > > > > > > > > > > > > But I'm in no position to argue, just wanted to make > > > > > > > > > > > myself clear. > >
Re: Screen corruption using radeon kernel driver
On Wed, Nov 30, 2022 at 11:07:32AM -0500, Alex Deucher wrote: > On Wed, Nov 30, 2022 at 10:42 AM Robin Murphy wrote: > > > > On 2022-11-30 14:28, Alex Deucher wrote: > > > On Wed, Nov 30, 2022 at 7:54 AM Robin Murphy wrote: > > >> > > >> On 2022-11-29 17:11, Mikhail Krylov wrote: > > >>> On Tue, Nov 29, 2022 at 11:05:28AM -0500, Alex Deucher wrote: > > >>>> On Tue, Nov 29, 2022 at 10:59 AM Mikhail Krylov > > >>>> wrote: > > >>>>> > > >>>>> On Tue, Nov 29, 2022 at 09:44:19AM -0500, Alex Deucher wrote: > > >>>>>> On Mon, Nov 28, 2022 at 3:48 PM Mikhail Krylov > > >>>>>> wrote: > > >>>>>>> > > >>>>>>> On Mon, Nov 28, 2022 at 09:50:50AM -0500, Alex Deucher wrote: > > >>>>>>> > > >>>>>>>>>> [excessive quoting removed] > > >>>>>>> > > >>>>>>>>> So, is there any progress on this issue? I do understand it's not > > >>>>>>>>> a high > > >>>>>>>>> priority one, and today I've checked it on 6.0 kernel, and > > >>>>>>>>> unfortunately, it still persists... > > >>>>>>>>> > > >>>>>>>>> I'm considering writing a patch that will allow user to override > > >>>>>>>>> need_dma32/dma_bits setting with a module parameter. I'll have > > >>>>>>>>> some time > > >>>>>>>>> after the New Year for that. > > >>>>>>>>> > > >>>>>>>>> Is it at all possible that such a patch will be merged into > > >>>>>>>>> kernel? > > >>>>>>>>> > > >>>>>>>> On Mon, Nov 28, 2022 at 9:31 AM Mikhail Krylov > > >>>>>>>> wrote: > > >>>>>>>> Unless someone familiar with HIMEM can figure out what is going > > >>>>>>>> wrong > > >>>>>>>> we should just revert the patch. 
> > >>>>>>>> > > >>>>>>>> Alex > > >>>>>>> > > >>>>>>> > > >>>>>>> Okay, I was suggesting that mostly because > > >>>>>>> > > >>>>>>> a) it works for me with dma_bits = 40 (I understand that's what it > > >>>>>>> is > > >>>>>>> without the original patch applied); > > >>>>>>> > > >>>>>>> b) there's a hint of uncertainity on this line > > >>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/radeon/radeon_device.c#n1359 > > >>>>>>> saying that for AGP dma_bits = 32 is the safest option, so > > >>>>>>> apparently there are > > >>>>>>> setups, unlike mine, where dma_bits = 32 is better than 40. > > >>>>>>> > > >>>>>>> But I'm in no position to argue, just wanted to make myself clear. > > >>>>>>> I'm okay with rebuilding the kernel for my machine until the > > >>>>>>> original > > >>>>>>> patch is reverted or any other fix is applied. > > >>>>>> > > >>>>>> What GPU do you have and is it AGP? If it is AGP, does setting > > >>>>>> radeon.agpmode=-1 also fix it? > > >>>>>> > > >>>>>> Alex > > >>>>> > > >>>>> That is ATI Radeon X1950, and, unfortunately, radeon.agpmode=-1 > > >>>>> doesn't > > >>>>> help, it just makes 3D acceleration in games such as OpenArena stop > > >>>>> working. > > >>>> > > >>>> Just to confirm, is the board AGP or PCIe? > > >>>> > > >>>> Alex > > >>> > > >>> It is AGP. That's an old machine. > > >> > > >> Can you check whether dma_addressing_limited() is actually returning the > > >> expected result at the point of radeon_ttm_init()? Disabling highmem is > > >> presumably just hiding whatever problem exists, by throwing away all > > >> >32-bit RAM such that use_dma32 doesn't matter. > > > > > > The device in question only supports a 32 bit DMA mask so > > > dma_addressing_limited() should return true. Bounce buffers are not > > > really usable on GPUs because they map so much memory. If > > > dma_addressing_limited() returns false, that would explain it. 
> >
> > Right, it appears to be the only part of the offending commit that
> > *could* reasonably make any difference, so I'm primarily wondering if
> > dma_get_required_mask() somehow gets confused.
>
> Mikhail,
>
> Can you see what dma_addressing_limited() and dma_get_required_mask()
> return in this case?
>
> Alex
>
> >
> > Thanks,
> > Robin.

Unfortunately, right now I don't have enough time for kernel modifications and rebuilds (I will later!), so I did some quick-and-dirty research with kprobe.

The problem is that dma_addressing_limited() seems to be inlined, so kprobe fails to intercept it. But I managed to get the result of dma_get_required_mask(). It returns 0x7fff (!) on the vanilla (with the patch, buggy) kernel:

$ sudo kprobe-perf 'r:dma_get_required_mask $retval'
Tracing kprobe dma_get_required_mask. Ctrl-C to end.
modprobe-1244[000] d... 105.582816: dma_get_required_mask: (radeon_ttm_init+0x61/0x240 [radeon] <- dma_get_required_mask) arg1=0x7fff

This function does not even get called in the kernel without the patch that I built myself. I believe that's because ttm_bo_device_init() doesn't call it without the patch.

Hope that helps at least a bit. If not, I'll be able to do more thorough research in a couple of weeks, probably.
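For readers following the dma_addressing_limited() discussion, here is a rough userspace model of the comparison being debated. It is only a sketch: it assumes the check compares the smaller of the device DMA mask and the bus limit against a "required mask" derived from the highest RAM address, which is how the direct-mapping path is understood to behave in kernels of this era. The function names and the example addresses are illustrative and are not the values reported in this thread.

```python
"""Userspace model of the DMA-limit check discussed above (not kernel code)."""


def dma_bit_mask(bits: int) -> int:
    """All-ones mask with `bits` low bits set, like DMA_BIT_MASK(n)."""
    return (1 << bits) - 1


def min_not_zero(a: int, b: int) -> int:
    """Smaller of two values, treating 0 as 'no limit configured'."""
    if a == 0:
        return b
    if b == 0:
        return a
    return min(a, b)


def required_mask(highest_phys_addr: int) -> int:
    """Smallest all-ones mask that covers the highest RAM address."""
    return (1 << highest_phys_addr.bit_length()) - 1


def addressing_limited(dev_mask: int, bus_limit: int, highest_phys_addr: int) -> bool:
    """True when the device cannot reach every RAM address directly."""
    return min_not_zero(dev_mask, bus_limit) < required_mask(highest_phys_addr)


# Hypothetical machines: a 32-bit-only device with RAM ending below and
# above the 4 GiB boundary.
small_ram = addressing_limited(dma_bit_mask(32), 0, 0x7FFFFFFF)
large_ram = addressing_limited(dma_bit_mask(32), 0, 0x13FFFFFFF)
```

In this model, a required mask that fits within 32 bits makes the check report "not limited" for a 32-bit device, which is the situation Robin and Alex are trying to confirm or rule out.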
Re: Screen corruption using radeon kernel driver
On Tue, Nov 29, 2022 at 11:05:28AM -0500, Alex Deucher wrote: > On Tue, Nov 29, 2022 at 10:59 AM Mikhail Krylov wrote: > > > > On Tue, Nov 29, 2022 at 09:44:19AM -0500, Alex Deucher wrote: > > > On Mon, Nov 28, 2022 at 3:48 PM Mikhail Krylov wrote: > > > > > > > > On Mon, Nov 28, 2022 at 09:50:50AM -0500, Alex Deucher wrote: > > > > > > > > >>> [excessive quoting removed] > > > > > > > > >> So, is there any progress on this issue? I do understand it's not a > > > > >> high > > > > >> priority one, and today I've checked it on 6.0 kernel, and > > > > >> unfortunately, it still persists... > > > > >> > > > > >> I'm considering writing a patch that will allow user to override > > > > >> need_dma32/dma_bits setting with a module parameter. I'll have some > > > > >> time > > > > >> after the New Year for that. > > > > >> > > > > >> Is it at all possible that such a patch will be merged into kernel? > > > > >> > > > > > On Mon, Nov 28, 2022 at 9:31 AM Mikhail Krylov > > > > > wrote: > > > > > Unless someone familiar with HIMEM can figure out what is going wrong > > > > > we should just revert the patch. > > > > > > > > > > Alex > > > > > > > > > > > > Okay, I was suggesting that mostly because > > > > > > > > a) it works for me with dma_bits = 40 (I understand that's what it is > > > > without the original patch applied); > > > > > > > > b) there's a hint of uncertainity on this line > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/radeon/radeon_device.c#n1359 > > > > saying that for AGP dma_bits = 32 is the safest option, so apparently > > > > there are > > > > setups, unlike mine, where dma_bits = 32 is better than 40. > > > > > > > > But I'm in no position to argue, just wanted to make myself clear. > > > > I'm okay with rebuilding the kernel for my machine until the original > > > > patch is reverted or any other fix is applied. > > > > > > What GPU do you have and is it AGP? 
If it is AGP, does setting > > > radeon.agpmode=-1 also fix it? > > > > > > Alex > > > > That is ATI Radeon X1950, and, unfortunately, radeon.agpmode=-1 doesn't > > help, it just makes 3D acceleration in games such as OpenArena stop > > working. > > Just to confirm, is the board AGP or PCIe? > > Alex

It is AGP. That's an old machine.
Re: Screen corruption using radeon kernel driver
On Tue, Nov 29, 2022 at 09:44:19AM -0500, Alex Deucher wrote:
> On Mon, Nov 28, 2022 at 3:48 PM Mikhail Krylov wrote:
> >
> > On Mon, Nov 28, 2022 at 09:50:50AM -0500, Alex Deucher wrote:
> >
> > >>> [excessive quoting removed]
> >
> > >> So, is there any progress on this issue? I do understand it's not a high
> > >> priority one, and today I've checked it on 6.0 kernel, and
> > >> unfortunately, it still persists...
> > >>
> > >> I'm considering writing a patch that will allow user to override
> > >> need_dma32/dma_bits setting with a module parameter. I'll have some time
> > >> after the New Year for that.
> > >>
> > >> Is it at all possible that such a patch will be merged into kernel?
> > >>
> > > On Mon, Nov 28, 2022 at 9:31 AM Mikhail Krylov wrote:
> > > Unless someone familiar with HIMEM can figure out what is going wrong
> > > we should just revert the patch.
> > >
> > > Alex
> >
> >
> > Okay, I was suggesting that mostly because
> >
> > a) it works for me with dma_bits = 40 (I understand that's what it is
> > without the original patch applied);
> >
> > b) there's a hint of uncertainty on this line
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/radeon/radeon_device.c#n1359
> > saying that for AGP dma_bits = 32 is the safest option, so apparently
> > there are setups, unlike mine, where dma_bits = 32 is better than 40.
> >
> > But I'm in no position to argue, just wanted to make myself clear.
> > I'm okay with rebuilding the kernel for my machine until the original
> > patch is reverted or any other fix is applied.
>
> What GPU do you have and is it AGP? If it is AGP, does setting
> radeon.agpmode=-1 also fix it?
>
> Alex

That is an ATI Radeon X1950, and, unfortunately, radeon.agpmode=-1 doesn't help; it just makes 3D acceleration in games such as OpenArena stop working.
Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start
On Tue, Nov 22, 2022 at 12:16 PM Christian König wrote:
>
> Ah, thanks a lot for this. I've already pushed the patches into our
> internal branch, but getting this confirmation is still great!
>
> This was quite some fundamental bug in the handling and I hope to get
> this completely reworked at some point since it is currently only mitigated.

It looks like the final version of this patch was successfully merged in 6.1-rc7.
Big thanks, all games work again!

> No idea what that could be. Modesetting is not something I work on.
>
> The best advice I can give you is to maybe ping Harry and our other
> display people, they should know that stuff better than I do.

Unfortunately, Harry didn't answer. I hope my email wasn't marked as spam.

--
Best Regards,
Mike Gavrilov.
Re: Screen corruption using radeon kernel driver
On Mon, Nov 28, 2022 at 09:50:50AM -0500, Alex Deucher wrote:

>>> [excessive quoting removed]

>> So, is there any progress on this issue? I do understand it's not a high
>> priority one, and today I've checked it on 6.0 kernel, and
>> unfortunately, it still persists...
>>
>> I'm considering writing a patch that will allow user to override
>> need_dma32/dma_bits setting with a module parameter. I'll have some time
>> after the New Year for that.
>>
>> Is it at all possible that such a patch will be merged into kernel?
>>
> On Mon, Nov 28, 2022 at 9:31 AM Mikhail Krylov wrote:
> Unless someone familiar with HIMEM can figure out what is going wrong
> we should just revert the patch.
>
> Alex

Okay, I was suggesting that mostly because

a) it works for me with dma_bits = 40 (I understand that's what it is without the original patch applied);

b) there's a hint of uncertainty on this line
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/radeon/radeon_device.c#n1359
saying that for AGP dma_bits = 32 is the safest option, so apparently there are setups, unlike mine, where dma_bits = 32 is better than 40.

But I'm in no position to argue, just wanted to make myself clear. I'm okay with rebuilding the kernel for my machine until the original patch is reverted or any other fix is applied.
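To make the proposal above concrete, the following is a small model of the decision being discussed: a default of 40 DMA bits, 32 bits for AGP boards, and an optional user override. This is a sketch under stated assumptions, not driver code. The override argument stands in for the module parameter proposed in this thread, which is hypothetical and does not exist in the radeon driver, and the function name is invented for illustration.

```python
"""Model of the dma_bits choice discussed above (illustrative only)."""
from typing import Optional

DEFAULT_DMA_BITS = 40  # value the thread describes for non-AGP setups
AGP_DMA_BITS = 32      # "safest option" for AGP according to the cited comment


def select_dma_bits(is_agp: bool, override: Optional[int] = None) -> int:
    """Pick the DMA address width, honouring a hypothetical user override."""
    if override is not None:
        # Hypothetical module-parameter path: validate, then trust the user.
        if not 1 <= override <= 64:
            raise ValueError("dma_bits override must be between 1 and 64")
        return override
    return AGP_DMA_BITS if is_agp else DEFAULT_DMA_BITS


def need_dma32(dma_bits: int) -> bool:
    """Whether buffers must be placed below the 4 GiB boundary."""
    return dma_bits <= 32
```

With such an override, an AGP system like the one described here could be tested with 40 bits without rebuilding the kernel, while every other AGP setup would keep the conservative 32-bit default.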
Re: Screen corruption using radeon kernel driver
On Mon, Apr 25, 2022 at 01:22:04PM -0400, Alex Deucher wrote: > + dri-devel > > On Mon, Apr 25, 2022 at 3:33 AM Krylov Michael wrote: > > > > Hello! > > > > After updating my Linux kernel from version 4.19 (Debian 10 version) to > > 5.10 (packaged with Debian 11), I've noticed that the image > > displayed on my older computer, 32-bit Pentium 4 using ATI Radeon X1950 > > AGP video card is severely corrupted in the graphical (Xorg and Wayland) > > mode: all kinds of black and white stripes across the screen, some > > letters missing, etc. > > > > I've checked several options (Xorg drivers, Wayland instead of > > Xorg, radeon.agpmode=-1 in kernel command line and so on), but the > > problem persisted. I've managed to find that the problem was in the > > kernel, as everything worked well with 4.19 kernel with everything > > else being from Debian 11. > > > > I have managed to find the culprit of that corruption, that is the > > commit 33b3ad3788aba846fc8b9a065fe2685a0b64f713 on the linux kernel. > > Reverting this commit and building the kernel with that commit reverted > > fixes the problem. Disabling HIMEM also gets rid of that problem. But it > > also leaves the system with less that 1G of RAM, which is, of course, > > undesirable. > > > > Apparently this problem is somewhat known, as I can tell after googling > > for the commit id, see this link for example: > > https://lkml.org/lkml/2020/1/9/518 > > > > Mageia distro, for example, reverted this commit in the kernel they are > > building: > > > > http://sophie.zarb.org/distrib/Mageia/7/i586/by-pkgid/b9193a4f85192bc57f4d770fb9bb399c/files/32 > > > > I've reported this bug to Debian bugtracker, checked the recent verion > > of the kernel (5.17), bug still persists. 
Here's a link to the Debian
> > bug page:
> >
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=993670
> >
> > I'm not sure if reverting this commit is the correct way to go, so if
> > you need to check any changes/patches that I could apply and test on
> > the real hardware, I'll be glad to do that (but please keep in mind
> > that testing could take some time, I don't have access to this computer
> > 24/7, but I'll do my best to respond ASAP).
>
> I would be happy to revert that commit. I attempted to revert it a
> year or so ago, but Christoph didn't want to. He was going to look
> further into it. I was not able to repro the issue. It seemed to be
> related to highmem support. You might try disabling that. Here is
> the previous thread for reference:
> https://lists.freedesktop.org/archives/amd-gfx/2020-September/053922.html
>
> Alex

So, is there any progress on this issue? I do understand it's not a high priority one. Today I checked it on the 6.0 kernel and, unfortunately, it still persists...

I'm considering writing a patch that will allow the user to override the need_dma32/dma_bits setting with a module parameter. I'll have some time after the New Year for that.

Is it at all possible that such a patch will be merged into the kernel?
Re: [regression][6.0] After commit b261509952bc19d1012cf732f853659be6ebc61e I see WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70
On Thu, Oct 13, 2022 at 6:36 PM Mikhail Gavrilov wrote:
>
> Hi!
> I bisected an issue in the 6.0 kernel which started happening after
> 6.0-rc7 on all my machines.
>
> The backtrace of this issue looks like this:
>
> [ 2807.339439] ------------[ cut here ]------------
> [ 2807.339445] WARNING: CPU: 11 PID: 2061 at
> drivers/gpu/drm/drm_modeset_lock.c:276
> drm_modeset_drop_locks+0x63/0x70
>
> bisect points to this commit: b261509952bc19d1012cf732f853659be6ebc61e.
>
> After reverting this commit the WARNING messages described here disappeared.
>

Hi Harry, Christian says that you can help with it.
Thanks.

--
Best Regards,
Mike Gavrilov.
Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start
On Mon, Nov 14, 2022 at 6:22 PM Christian König wrote:
>
> I've found and fixed a few problems around the userptr handling which
> might explain what you see here.
>
> A series of four patches starting with "drm/amdgpu: always register an
> MMU notifier for userptr" is under review now.
>
> Going to give that a bit cleanup later today and will CC you when I send
> that out. Would be nice if you could give that some testing.
>
> Thanks,
> Christian.
>

Christian, I tested all four patches for about a week and can say that this issue is completely gone. All known broken games are working.

Tested-by: Mikhail Gavrilov

The only thing I don't like is the flood of "WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70" messages in the kernel logs, but this is not related to the patches being tested.
All kernel logs are uploaded to pastebin [1][2][3][4][5][6][7][8].
I wrote a separate bug report about "drm_modeset_lock" [9]; it's a pity that no one paid attention to it. I even found the first bad commit. It is b261509952bc19d1012cf732f853659be6ebc61e.

[1] https://pastebin.com/WZWczupk
[2] https://pastebin.com/f4i9pvjS
[3] https://pastebin.com/rsDWaMR1
[4] https://pastebin.com/tDNEYJq0
[5] https://pastebin.com/xfZVbm1f
[6] https://pastebin.com/Vx9gDyKt
[7] https://pastebin.com/XvRkLckV
[8] https://pastebin.com/pd8WBkgx
[9] https://www.spinics.net/lists/dri-devel/msg367543.html

Thanks.

--
Best Regards,
Mike Gavrilov.
Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start
On Tue, Nov 1, 2022 at 10:52 PM Christian König wrote: > > Let's focus on one problem at a time. > > The issue here is that somehow userptr handling became racy after we > removed the lock, but I don't see why. > > We need to fix this ASAP since it is probably a much wider problem and > the additional lock just hides it somehow. > > Going to provide you with an updated patch tomorrow. > > Thanks, > Christian. Recently sackboy has been updated and now the kernel log contains a trace very similar to the one in the first post, even with the patch applied. [ 155.948044] [ cut here ] [ 155.948164] WARNING: CPU: 3 PID: 4850 at drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:678 amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu] [ 155.948342] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep intel_rapl_msr intel_rapl_common snd_hda_codec_realtek snd_sof_amd_renoir snd_sof_amd_acp snd_hda_codec_generic snd_hda_codec_hdmi snd_sof_pci sunrpc binfmt_misc snd_sof snd_hda_intel snd_sof_utils snd_intel_dspcfg mt7921e snd_intel_sdw_acpi snd_hda_codec mt7921_common snd_soc_core edac_mce_amd mt76_connac_lib btusb snd_hda_core snd_compress snd_hwdep mt76 btrtl ac97_bus kvm_amd snd_pcm_dmaengine btbcm snd_rpl_pci_acp6x snd_pci_acp6x btintel mac80211 btmtk snd_seq snd_seq_device kvm snd_pcm snd_pci_acp5x libarc4 bluetooth irqbypass vfat snd_timer snd_rn_pci_acp3x fat rapl snd_acp_config asus_nb_wmi snd cfg80211 snd_soc_acpi wmi_bmof k10temp pcspkr [ 155.948436] snd_pci_acp3x i2c_piix4 soundcore asus_wireless amd_pmc joydev zram amdgpu drm_ttm_helper ttm crct10dif_pclmul hid_asus crc32_pclmul asus_wmi crc32c_intel iommu_v2 ledtrig_audio polyval_clmulni gpu_sched sparse_keymap polyval_generic platform_profile drm_buddy 
drm_display_helper nvme rfkill ghash_clmulni_intel hid_multitouch ucsi_acpi sha512_ssse3 nvme_core typec_ucsi serio_raw sp5100_tco r8169 ccp cec nvme_common typec i2c_hid_acpi i2c_hid video wmi ip6_tables ip_tables fuse [ 155.948540] CPU: 3 PID: 4850 Comm: Sackboy-Win64-T Tainted: G WL--- --- 6.1.0-0.rc3.20221101git5aaef24b5c6d.29.fc38.x86_64 #1 [ 155.948544] Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022 [ 155.948547] RIP: 0010:amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu] [ 155.948748] Code: 9e f1 e9 32 ff ff ff 4c 89 e9 89 ea 48 c7 c6 a8 a3 fd c0 48 c7 c7 88 81 1e c1 e8 af 97 ea f1 eb 8e 66 90 bd f2 ff ff ff eb 8d <0f> 0b eb f5 bd fd ff ff ff eb 82 bd f2 ff ff ff e9 62 ff ff ff 48 [ 155.948751] RSP: 0018:960b544d3a50 EFLAGS: 00010282 [ 155.948756] RAX: 8a4e40d44e00 RBX: 8a4f0e564140 RCX: 0001 [ 155.948759] RDX: RSI: 8a4e40d44e00 RDI: 8a4f4b52b400 [ 155.948761] RBP: 8a4e8c979000 R08: 0dc0 R09: [ 155.948764] R10: 0001 R11: R12: 8a4e8aaad558 [ 155.948767] R13: 3b91 R14: 8a4f0e667180 R15: 8a4f4b52b458 [ 155.948770] FS: 7fa13fe006c0() GS:8a5d16e0() knlGS:36f8 [ 155.948772] CS: 0010 DS: ES: CR0: 80050033 [ 155.948775] CR2: 25c9e1d0 CR3: 00036199 CR4: 00750ee0 [ 155.948778] PKRU: 5554 [ 155.948780] Call Trace: [ 155.948783] [ 155.948790] amdgpu_cs_ioctl+0x9fd/0x2030 [amdgpu] [ 155.948992] ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu] [ 155.949155] drm_ioctl_kernel+0xac/0x160 [ 155.949165] drm_ioctl+0x1e7/0x450 [ 155.949172] ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu] [ 155.949344] amdgpu_drm_ioctl+0x4a/0x80 [amdgpu] [ 155.949528] __x64_sys_ioctl+0x90/0xd0 [ 155.949537] do_syscall_64+0x5b/0x80 [ 155.949547] ? lock_is_held_type+0xe8/0x140 [ 155.949559] ? do_syscall_64+0x67/0x80 [ 155.949565] ? lockdep_hardirqs_on+0x7d/0x100 [ 155.949573] ? do_syscall_64+0x67/0x80 [ 155.949579] ? do_syscall_64+0x67/0x80 [ 155.949586] ? do_syscall_64+0x67/0x80 [ 155.949592] ? 
lockdep_hardirqs_on+0x7d/0x100 [ 155.949597] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 155.949603] RIP: 0033:0x7fa1b7fd912f [ 155.949610] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00 [ 155.949615] RSP: 002b:7fa13fdfe920 EFLAGS: 0246 ORIG_RAX: 0010 [ 155.949621] RAX: ffda RBX: 7fa13fdfebe8 RCX: 7fa1b7fd912f [ 155.949625] RDX: 7fa13fdfea10 RSI: c0186444 RDI: 0165 [ 155.949629] RBP: 7fa13fdfea10 R08: 7f9ff80018e0 R09: 7fa13fdfe9c0 [ 155.949633] R10: 7eb11590 R11: 0246 R12: c0186444 [
Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start
On Wed, Oct 26, 2022 at 12:29 PM Christian König wrote:
>
> Attached is the original test patch rebased on current amd-staging-drm-next.
>
> Can you test if this is enough to make sure that the games start without
> crashing by fetching the userptrs?

1. Over the past week, the list of games affected by this issue has grown: The Outlast Trials, Gotham Knights, Sackboy: A Big Adventure.
2. I tested the patch. It really solves the problem with launching all the listed games and does not create new problems.
3. The only thing I noticed is that in the game Sackboy: A Big Adventure, when using the kernel built from commit b229b6ca5abbd63ff40c1396095b1b36b18139c3 + the attached patch, I can't connect to a friend's co-op session because the Steam client hangs. The kernel built from commit 736ec9fadd7a1fde8480df7e5cfac465c07ff6f3 (the commit prior to dd80d9c8eecac8c516da5b240d01a35660ba6cb6) is free of this problem.
I need to spend some more time to find the commit that causes the Steam client hang described in point 3.

Thanks.

--
Best Regards,
Mike Gavrilov.
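The search described in point 3, between a known-good and a known-bad commit, is the kind of work `git bisect` automates. As a minimal sketch of the idea, assuming a linear history and a reliable good/bad test (real kernel history is a graph with merges, which git handles itself), the function below finds the first bad commit by binary search. The commit names and the test callback are hypothetical.

```python
"""Binary search for the first bad commit in a linear history (sketch)."""
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit.

    `commits` is ordered oldest to newest; the first entry must be known
    good and the last entry known bad, as with `git bisect good/bad`.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid    # regression is at mid or earlier
        else:
            good = mid   # regression was introduced after mid
    return commits[bad]
```

Each step halves the remaining range, so a window of roughly a thousand commits needs about ten build-and-test cycles, which is why a bisect on real hardware can still take days when every cycle means a kernel rebuild and a game launch.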
Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start
On Fri, Oct 21, 2022 at 1:33 PM Christian König wrote:
>
> Hi,
>
> yes Bas already reported this issue, but I couldn't reproduce it. Need
> to come up with a patch to narrow this down further.
>
> Can I send you something to test?

I would be glad to test any patches and ideas.

--
Best Regards,
Mike Gavrilov.
[6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start
Hi!
I found that some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6.

dd80d9c8eecac8c516da5b240d01a35660ba6cb6 is the first bad commit
commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6
Author: Christian König
Date: Thu Jul 14 10:23:38 2022 +0200

drm/amdgpu: revert "partial revert "remove ctx->lock" v2"

This reverts commit 94f4c4965e5513ba624488f4b601d6b385635aec.

We found that the bo_list is missing a protection for its list entries.
Since that is fixed now this workaround can be removed again.

Signed-off-by: Christian König
Reviewed-by: Alex Deucher
Signed-off-by: Alex Deucher

drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 21 ++---
drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 --
drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h | 1 -
3 files changed, 6 insertions(+), 18 deletions(-)

When this happens, a backtrace like the following appears in the kernel log:

[ 231.331210] ------------[ cut here ]------------
[ 231.331262] WARNING: CPU: 11 PID: 6555 at drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:675 amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu]
[ 231.331424] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep intel_rapl_msr intel_rapl_common snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_hda_codec_realtek snd_sof snd_hda_codec_generic snd_hda_codec_hdmi snd_sof_utils mt7921e snd_hda_intel sunrpc snd_intel_dspcfg mt7921_common binfmt_misc snd_intel_sdw_acpi snd_hda_codec mt76_connac_lib edac_mce_amd btusb snd_soc_core mt76 snd_hda_core btrtl snd_hwdep snd_compress kvm_amd ac97_bus snd_seq btbcm snd_pcm_dmaengine btintel snd_rpl_pci_acp6x mac80211 btmtk snd_pci_acp6x kvm snd_seq_device snd_pcm snd_pci_acp5x libarc4 irqbypass bluetooth snd_rn_pci_acp3x snd_timer pcspkr asus_nb_wmi rapl
joydev wmi_bmof snd_acp_config cfg80211 snd_soc_acpi vfat snd [ 231.331490] snd_pci_acp3x i2c_piix4 soundcore fat k10temp amd_pmc asus_wireless zram amdgpu drm_ttm_helper ttm hid_asus asus_wmi iommu_v2 crct10dif_pclmul crc32_pclmul gpu_sched crc32c_intel ledtrig_audio sparse_keymap polyval_clmulni platform_profile drm_buddy polyval_generic hid_multitouch drm_display_helper rfkill nvme ucsi_acpi ghash_clmulni_intel nvme_core video typec_ucsi serio_raw ccp sha512_ssse3 sp5100_tco r8169 cec nvme_common typec wmi i2c_hid_acpi i2c_hid ip6_tables ip_tables fuse [ 231.331532] CPU: 11 PID: 6555 Comm: GameThread Tainted: GW L--- --- 6.1.0-0.rc1.20221019gitaae703b02f92.17.fc38.x86_64 #1 [ 231.331534] Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022 [ 231.331537] RIP: 0010:amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu] [ 231.331654] Code: a8 d0 e9 32 ff ff ff 4c 89 e9 89 ea 48 c7 c6 40 82 f3 c0 48 c7 c7 10 60 14 c1 e8 2f a0 f4 d0 eb 8e 66 90 bd f2 ff ff ff eb 8d <0f> 0b eb f5 bd fd ff ff ff eb 82 bd f2 ff ff ff e9 62 ff ff ff 48 [ 231.331656] RSP: 0018:aad4c705bae8 EFLAGS: 00010286 [ 231.331659] RAX: 8e9cbdbe3200 RBX: 8e997e3f2440 RCX: [ 231.331661] RDX: RSI: 8e9cbdbe3200 RDI: 8e9c31208000 [ 231.331663] RBP: 0001 R08: 0dc0 R09: [ 231.331665] R10: 0001 R11: R12: aad4c705bb90 [ 231.331666] R13: 7651 R14: 8e9c89f334e0 R15: 8e991fda8000 [ 231.331668] FS: 7c2af6c0() GS:8ea7d8e0() knlGS:7b2c [ 231.331671] CS: 0010 DS: ES: CR0: 80050033 [ 231.331673] CR2: 7ff65ffd8000 CR3: 0004f90f CR4: 00750ee0 [ 231.331674] PKRU: 5554 [ 231.331676] Call Trace: [ 231.331678] [ 231.331682] amdgpu_cs_ioctl+0x87e/0x1fc0 [amdgpu] [ 231.331824] ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu] [ 231.331981] drm_ioctl_kernel+0xac/0x160 [ 231.331990] drm_ioctl+0x1e7/0x450 [ 231.331994] ? 
amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu] [ 231.332118] amdgpu_drm_ioctl+0x4a/0x80 [amdgpu] [ 231.332233] __x64_sys_ioctl+0x90/0xd0 [ 231.332238] do_syscall_64+0x5b/0x80 [ 231.332243] ? asm_exc_page_fault+0x22/0x30 [ 231.332247] ? lockdep_hardirqs_on+0x7d/0x100 [ 231.332250] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 231.332253] RIP: 0033:0x7ff677c5704f [ 231.332256] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00 [ 231.332258] RSP: 002b:7c2ad470 EFLAGS: 0246 ORIG_RAX: 0010 [ 231.332261] RAX: ffda RBX: 7c2ad718 RCX: 7ff677c5704f [ 231.332263] RDX: 7c2ad540 RSI: c0186444
Re: [Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some ga
On Wed, May 11, 2022 at 5:01 PM Christian König wrote:
>
>
> We have implemented a workaround, but still don't know the exact root cause.
>
> If anybody wants to look into this it would be rather helpful to be able
> to reproduce the issue.
>
> Regards,
> Christian.

I see that the issue has returned after this commit:

dd80d9c8eecac8c516da5b240d01a35660ba6cb6 is the first bad commit
commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6
Author: Christian König
Date: Thu Jul 14 10:23:38 2022 +0200

drm/amdgpu: revert "partial revert "remove ctx->lock" v2"

This reverts commit 94f4c4965e5513ba624488f4b601d6b385635aec.

We found that the bo_list is missing a protection for its list entries.
Since that is fixed now this workaround can be removed again.

Signed-off-by: Christian König
Reviewed-by: Alex Deucher
Signed-off-by: Alex Deucher

drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 21 ++---
drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 --
drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h | 1 -
3 files changed, 6 insertions(+), 18 deletions(-)

The games Forza Horizon 4 and Cyberpunk 2077 hang at start again.

--
Best Regards,
Mike Gavrilov.
[regression][6.0] After commit b261509952bc19d1012cf732f853659be6ebc61e I see WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70
Hi!
I bisected an issue in the 6.0 kernel which started happening after 6.0-rc7 on all my machines.

The backtrace of this issue looks like this:

[ 2807.339439] ------------[ cut here ]------------
[ 2807.339445] WARNING: CPU: 11 PID: 2061 at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70
[ 2807.339453] Modules linked in: tls uinput rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep intel_rapl_msr intel_rapl_common snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_hda_codec_realtek sunrpc snd_sof snd_hda_codec_hdmi snd_hda_codec_generic snd_sof_utils snd_hda_intel snd_intel_dspcfg mt7921e snd_intel_sdw_acpi binfmt_misc snd_soc_core mt7921_common snd_hda_codec snd_compress vfat ac97_bus edac_mce_amd mt76_connac_lib snd_pcm_dmaengine fat snd_hda_core snd_rpl_pci_acp6x snd_pci_acp6x mt76 btusb snd_hwdep kvm_amd btrtl snd_seq btbcm mac80211 snd_seq_device kvm btintel btmtk libarc4 snd_pcm snd_pci_acp5x bluetooth snd_timer snd_rn_pci_acp3x irqbypass snd_acp_config snd_soc_acpi cfg80211 rapl snd joydev pcspkr asus_nb_wmi wmi_bmof
[ 2807.339519] snd_pci_acp3x soundcore i2c_piix4 k10temp amd_pmc asus_wireless zram amdgpu drm_ttm_helper ttm hid_asus asus_wmi crct10dif_pclmul iommu_v2 crc32_pclmul ledtrig_audio crc32c_intel gpu_sched sparse_keymap platform_profile hid_multitouch polyval_clmulni nvme ucsi_acpi drm_buddy polyval_generic drm_display_helper ghash_clmulni_intel serio_raw nvme_core ccp typec_ucsi rfkill sp5100_tco r8169 cec nvme_common typec wmi video i2c_hid_acpi i2c_hid ip6_tables ip_tables fuse
[ 2807.339540] Unloaded tainted modules: acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 amd64_edac():1 acpi_cpufreq():1
acpi_cpufreq():1 amd64_edac():1 amd64_edac():1 acpi_cpufreq():1 pcc_cpufreq():1 fjes():1 amd64_edac():1 acpi_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 fjes():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 fjes():1 amd64_edac():1 acpi_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 fjes():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 fjes():1 acpi_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 acpi_cpufreq():1 fjes():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 fjes():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 acpi_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 fjes():1 acpi_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 fjes():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 fjes():1 [ 2807.339579] acpi_cpufreq():1 fjes():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 fjes():1 acpi_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 acpi_cpufreq():1 fjes():1 acpi_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 [ 2807.339596] CPU: 11 PID: 2061 Comm: gnome-shell Tainted: GW L 6.0.0-rc4-07-cb0eca01ad9756e853efec3301203c2b5b45aa9f+ #16 [ 2807.339598] Hardware name: ASUSTeK COMPUTER INC. 
ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022 [ 2807.339600] RIP: 0010:drm_modeset_drop_locks+0x63/0x70 [ 2807.339602] Code: 42 08 48 89 10 48 89 1b 48 8d bb 50 ff ff ff 48 89 5b 08 e8 3f 41 55 00 48 8b 45 78 49 39 c4 75 c6 5b 5d 41 5c c3 cc cc cc cc <0f> 0b eb ac 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 41 54 [ 2807.339604] RSP: 0018:b6ad46e07b80 EFLAGS: 00010282 [ 2807.339606] RAX: 0001 RBX: RCX: 0002 [ 2807.339607] RDX: 0001 RSI: a6a118b1 RDI: b6ad46e07c00 [ 2807.339608] RBP: b6ad46e07c00 R08: R09: [ 2807.339609] R10: R11: 0001 R12: [ 2807.339610] R13: 9801ca24bb00 R14: 9801ca24bb00 R15: [ 2807.339611] FS: 7f57445b0600() GS:981198e0() knlGS: [ 2807.339613] CS: 0010 DS: ES: CR0: 80050033 [ 2807.339614] CR2: 7f574367f000 CR3: 0001236ae000 CR4: 00750ee0 [ 2807.339615] PKRU: 5554 [ 2807.339616] Call Trace: [ 2807.339618] [ 2807.339621] drm_mode_atomic_ioctl+0x3b9/0xac0 [ 2807.339627] ? drm_atomic_set_property+0xb60/0xb60 [ 2807.339629] drm_ioctl_kernel+0xac/0x160 [ 2807.339633] drm_ioctl+0x22d/0x410 [ 2807.339635] ? drm_atomic_set_property+0xb60/0xb60 [ 2807.339639] amdgpu_drm_ioctl+0x4a/0x80 [amdgpu] [ 2807.339834] __x64_sys_ioctl+0x90/0xd0 [ 2807.339838] do_syscall_64+0x5b/0x80 [ 2807.339843] ? rcu_read_lock_sched_held+0x10/0x80 [ 2807.339846] ? trace_hardirqs_on_prepare+0x55/0xe0 [ 2807.339849] ? do_syscall_64+0x67/0x80 [ 2807.339851] ?
[regression][6.1] After commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86 system randomly hangs
Hi! The hangs occur randomly, but I found a reliable way to reproduce them: running the campaign in the game Halo Infinite. The backtrace looks like this: [ 147.260971] BUG: kernel NULL pointer dereference, address: 0088 [ 147.260987] [ cut here ] [ 147.260988] WARNING: CPU: 3 PID: 0 at kernel/softirq.c:321 __local_bh_disable_ip+0x9e/0xb0 [ 147.260993] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer netconsole nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc snd_sof_amd_renoir intel_rapl_msr snd_sof_amd_acp intel_rapl_common mt7921e snd_sof_pci mt7921_common binfmt_misc snd_sof mt76_connac_lib snd_sof_utils vfat snd_hda_codec_realtek snd_soc_core snd_hda_codec_generic mt76 fat snd_hda_codec_hdmi snd_hda_intel edac_mce_amd snd_compress ac97_bus btusb kvm_amd snd_intel_dspcfg snd_pcm_dmaengine btrtl snd_intel_sdw_acpi btbcm snd_hda_codec snd_pci_acp6x mac80211 kvm snd_hda_core btintel btmtk irqbypass snd_hwdep snd_seq libarc4 snd_seq_device bluetooth snd_pcm snd_pci_acp5x snd_timer snd_rn_pci_acp3x cfg80211 rapl pcspkr joydev asus_nb_wmi wmi_bmof snd_acp_config snd snd_soc_acpi k10temp [ 147.261033] soundcore i2c_piix4 snd_pci_acp3x asus_wireless amd_pmc zram amdgpu drm_ttm_helper ttm hid_asus iommu_v2 asus_wmi gpu_sched ledtrig_audio sparse_keymap drm_buddy platform_profile drm_display_helper crct10dif_pclmul crc32_pclmul nvme rfkill crc32c_intel ucsi_acpi hid_multitouch video ghash_clmulni_intel nvme_core ccp typec_ucsi serio_raw r8169 cec sp5100_tco typec i2c_hid_acpi wmi i2c_hid ip6_tables ip_tables fuse [ 147.261045] CPU: 3 PID: 0 Comm: swapper/3 Tainted: GWL 6.0.0-rc2-02-907cc346ff6a69a08b4786c4ed2a78ac0120b9da+ #124 [ 147.261046] Hardware name: ASUSTeK COMPUTER INC. 
ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022 [ 147.261047] RIP: 0010:__local_bh_disable_ip+0x9e/0xb0 [ 147.261048] Code: 25 00 1e 02 00 48 89 df e8 6f 23 08 00 85 c0 75 0e 48 89 9d 30 1c 00 00 5b 5d c3 cc cc cc cc 31 ff 31 db e8 54 23 08 00 eb e7 <0f> 0b e9 76 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 [ 147.261049] RSP: 0018:a4e1c028c8d8 EFLAGS: 00010006 [ 147.261050] RAX: 80010005 RBX: 0201 RCX: 0018 [ 147.261051] RDX: 0f440b255950 RSI: 0201 RDI: c1b652e5 [ 147.261051] RBP: 93a4eaf00fd8 R08: 0001 R09: [ 147.261052] R10: 7635d840c31a8942 R11: fcca632b3d1b0d46 R12: 93a4f7831000 [ 147.261052] R13: 93a4eaf00ee0 R14: 93a4efd84178 R15: 93a4efd84000 [ 147.261053] FS: () GS:93b396e0() knlGS: [ 147.261054] CS: 0010 DS: ES: CR0: 80050033 [ 147.261055] CR2: 0088 CR3: 00012a61 CR4: 00750ee0 [ 147.261056] PKRU: 5554 [ 147.261056] Call Trace: [ 147.261060] [ 147.261068] _raw_spin_lock_bh+0x1d/0x80 [ 147.261074] ieee80211_queue_skb+0x125/0x7a0 [mac80211] [ 147.261113] ? __skb_get_hash+0x55/0x200 [ 147.261117] ieee80211_tx_8023+0x9c/0x1c0 [mac80211] [ 147.261155] ieee80211_subif_start_xmit_8023+0x2b5/0x510 [mac80211] [ 147.261191] netpoll_start_xmit+0x121/0x190 [ 147.261199] netpoll_send_skb+0x1fc/0x300 [ 147.261202] write_msg+0xdc/0xf0 [netconsole] [ 147.261207] console_emit_next_record.constprop.0+0x17d/0x300 [ 147.261214] console_unlock+0xf3/0x1f0 [ 147.261215] vprintk_emit+0x152/0x350 [ 147.261217] ? plist_add+0xba/0xf0 [ 147.261223] _printk+0x48/0x4e [ 147.261231] ? rcu_read_lock_sched_held+0x10/0x80 [ 147.261235] page_fault_oops.cold+0xcf/0x1f9 [ 147.261240] ? do_user_addr_fault+0x65/0x6b0 [ 147.261243] ? 
_raw_spin_unlock_irqrestore+0x40/0x60 [ 147.261247] exc_page_fault+0x7e/0x300 [ 147.261249] asm_exc_page_fault+0x22/0x30 [ 147.261252] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x1e0 [gpu_sched] [ 147.261255] Code: 89 d7 e8 87 02 0d f0 e9 54 ff ff ff 48 89 d7 e8 ea 66 37 f0 e9 47 ff ff ff 0f 1f 44 00 00 0f 1f 44 00 00 41 54 55 53 48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d 70 02 00 00 48 8b 85 a8 03 00 00 f0 [ 147.261256] RSP: 0018:a4e1c028cdc8 EFLAGS: 00010093 [ 147.261257] RAX: c06dc380 RBX: RCX: 0018 [ 147.261257] RDX: 0efa9afe3594 RSI: 93a7a4c1ec90 RDI: [ 147.261258] RBP: 93a7a4c1ee10 R08: 0001 R09: [ 147.261259] R10: R11: 0001 R12: a4e1c028cde8 [ 147.261259] R13: 0086 R14: R15: 93a4fbed0198 [ 147.261261] ? drm_sched_job_done.isra.0+0x1e0/0x1e0 [gpu_sched] [ 147.261266] dma_fence_signal_timestamp_locked+0x9e/0x1c0 [ 147.261274] dma_fence_signal+0x36/0x70 [ 147.261276]
Re: [BUG][5.20] refcount_t: underflow; use-after-free
Hi! Unfortunately the use-after-free issue still happens on the 6.0-rc5 kernel. The issue became hard to repeat. I spent the whole day at the computer when use-after-free again happened, I was playing the game Tiny Tina's Wonderlands. Therefore, forget about repeatability. It remains only to hope for logs and tracing. I didn't see anything new in the logs. It seems that we need to somehow expand the logging so that the next time this happens we have more information. Sep 18 20:52:16 primary-ws gnome-shell[2388]: meta_window_set_stack_position_no_sync: assertion 'window->stack_position >= 0' failed Sep 18 20:52:27 primary-ws gnome-shell[2388]: meta_window_set_stack_position_no_sync: assertion 'window->stack_position >= 0' failed Sep 18 20:53:44 primary-ws gnome-shell[2388]: Window manager warning: Window 0x4e3 sets an MWM hint indicating it isn't resizable, but sets min size 1 x 1 and max size 2147483647 x 2147483647; this doesn't make much sense. Sep 18 20:53:45 primary-ws kernel: umip_printk: 11 callbacks suppressed Sep 18 20:53:45 primary-ws kernel: umip: Wonderlands.exe[213853] ip:14ebb0d03 sp:4ee528: SGDT instruction cannot be used by applications. Sep 18 20:53:45 primary-ws kernel: umip: Wonderlands.exe[213853] ip:14ebb0d03 sp:4ee528: For now, expensive software emulation returns the result. Sep 18 20:53:53 primary-ws gnome-shell[2388]: meta_window_set_stack_position_no_sync: assertion 'window->stack_position >= 0' failed Sep 18 20:53:53 primary-ws kernel: umip: Wonderlands.exe[213853] ip:14ebb0d03 sp:4ee528: SGDT instruction cannot be used by applications. Sep 18 20:53:53 primary-ws kernel: umip: Wonderlands.exe[213853] ip:14ebb0d03 sp:4ee528: For now, expensive software emulation returns the result. Sep 18 20:54:15 primary-ws kernel: umip: Wonderlands.exe[214194] ip:15a270815 sp:6eaef490: SGDT instruction cannot be used by applications. 
Sep 18 20:56:01 primary-ws kernel: umip_printk: 15 callbacks suppressed Sep 18 20:56:01 primary-ws kernel: umip: Wonderlands.exe[213853] ip:15e3a82b0 sp:4ed178: SGDT instruction cannot be used by applications. Sep 18 20:56:01 primary-ws kernel: umip: Wonderlands.exe[213853] ip:15e3a82b0 sp:4ed178: For now, expensive software emulation returns the result. Sep 18 20:56:03 primary-ws kernel: umip: Wonderlands.exe[213853] ip:15e3a82b0 sp:4edbe8: SGDT instruction cannot be used by applications. Sep 18 20:56:03 primary-ws kernel: umip: Wonderlands.exe[213853] ip:15e3a82b0 sp:4edbe8: For now, expensive software emulation returns the result. Sep 18 20:56:03 primary-ws kernel: umip: Wonderlands.exe[213853] ip:15e3a82b0 sp:4ebf18: SGDT instruction cannot be used by applications. Sep 18 20:57:55 primary-ws kernel: [ cut here ] Sep 18 20:57:55 primary-ws kernel: refcount_t: underflow; use-after-free. Sep 18 20:57:55 primary-ws kernel: WARNING: CPU: 22 PID: 235114 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 Sep 18 20:57:55 primary-ws kernel: Modules linked in: tls uinput rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_> Sep 18 20:57:55 primary-ws kernel: asus_wmi ledtrig_audio sparse_keymap platform_profile irqbypass rfkill mc rapl snd_timer video wmi_bmof pcspkr snd k10temp i2c_piix4 soundcore acpi_cpufreq zram amdgpu drm_ttm_helper ttm iommu_v2 crct1> Sep 18 20:57:55 primary-ws kernel: Unloaded tainted modules: amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_eda> Sep 18 20:57:55 primary-ws kernel: pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 fjes():1 pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 Sep 18 20:57:55 primary-ws kernel: CPU: 22 PID: 235114 Comm: kworker/22:0 Tainted: GWL--- --- 
6.0.0-0.rc5.20220914git3245cb65fd91.39.fc38.x86_64 #1 Sep 18 20:57:55 primary-ws kernel: Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022 Sep 18 20:57:55 primary-ws kernel: Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched] Sep 18 20:57:55 primary-ws kernel: RIP: 0010:refcount_warn_saturate+0xba/0x110 Sep 18 20:57:55 primary-ws kernel: Code: 01 01 e8 69 6b 6f 00 0f 0b e9 32 38 a5 00 80 3d 4d 7d be 01 00 75 85 48 c7 c7 80 b7 8e 95 c6 05 3d 7d be 01 01 e8 46 6b 6f 00 <0f> 0b e9 0f 38 a5 00 80 3d 28 7d be 01 00 0f 85 5e ff ff ff 48 c7 Sep 18 20:57:55 primary-ws kernel: RSP: 0018:a1a853ccbe60 EFLAGS: 00010286 Sep 18 20:57:55 primary-ws kernel: RAX: 0026 RBX: 8e0e60a96c28 RCX: Sep 18 20:57:55 primary-ws kernel: RDX: 0001 RSI: 958d255c RDI: Sep 18 20:57:55 primary-ws kernel: RBP: 8e19a83f5600 R08: R09: a1a853ccbd10 Sep 18 20:57:55 primary-ws kernel: R10: 0003 R11: 8e19ee2fffe8 R12: 8e19a83fc800 Sep 18
Re: [BUG][5.20] refcount_t: underflow; use-after-free
On Fri, Aug 19, 2022 at 5:13 PM Maíra Canal wrote: > > Hi Mikhail, > > Could you please specify the steps to reproduce this use-after-free? I > will try to reproduce it on the RX5700 XT and bisect the issue. > Hi Maíra, thanks for help. I'm afraid that it will be unrealistic to reproduce, because on a laptop with 6800M (also RDNA 2 graphics) the problem does not repeat. Sorry for the long silence, but I was trying to bisect the problem myself. git bisect start # status: waiting for both good and bad commits # good: [3d7cb6b04c3f3115719235cc6866b10326de34cd] Linux 5.19 git bisect good 3d7cb6b04c3f3115719235cc6866b10326de34cd # status: waiting for bad commit, 1 good commit known # bad: [7ebfc85e2cd7b08f518b526173e9a33b56b3913b] Merge tag 'net-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net git bisect bad 7ebfc85e2cd7b08f518b526173e9a33b56b3913b # bad: [b44f2fd87919b5ae6e1756d4c7ba2cbba22238e1] Merge tag 'drm-next-2022-08-03' of git://anongit.freedesktop.org/drm/drm # 001: GPU hangs + use-after-free issue - https://pastebin.com/z86E9ydx git bisect bad b44f2fd87919b5ae6e1756d4c7ba2cbba22238e1 # good: [526942b8134cc34d25d27f95dfff98b8ce2f6fcd] Merge tag 'ata-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata # 002: good - https://pastebin.com/9qki65Sj git bisect good 526942b8134cc34d25d27f95dfff98b8ce2f6fcd # good: [45490ce2ff833c4ec0de66705e46ba41320860cb] nfp: flower: add support for tunnel offload without key ID # 003: good - https://pastebin.com/vHk5eRkw git bisect good 45490ce2ff833c4ec0de66705e46ba41320860cb # skip: [e23a5e14aa278858c2e3d81ec34e83aa9a4177c5] Backmerge tag 'v5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into drm-next # 004: GPU not switched in graphic mode - https://pastebin.com/RmqCTMLD git bisect skip e23a5e14aa278858c2e3d81ec34e83aa9a4177c5 # bad: [b2065fb21d9a789b14f737ea90facedabadeb8a4] drm/amdgpu: fix i2s_pdata out of bound array access # 005: GPU hangs + use-after-free 
issue - https://pastebin.com/Zgw5Hc48 git bisect bad b2065fb21d9a789b14f737ea90facedabadeb8a4 # skip: [344feb7ccf764756937cfd74fa4ac5caba069c99] Merge tag 'amd-drm-next-5.20-2022-07-05' of https://gitlab.freedesktop.org/agd5f/linux into drm-next # 006: GPU not switched in graphic mode - https://pastebin.com/b8BUBE7Q git bisect skip 344feb7ccf764756937cfd74fa4ac5caba069c99 # skip: [869b10ac8d2300327f554d83f4dbab041bf27d49] drm/amdgpu: add dm ip block for dcn 3.1.4 # 007: GPU not switched in graphic mode - https://pastebin.com/byd7HECH git bisect skip 869b10ac8d2300327f554d83f4dbab041bf27d49 # skip: [676ad8e997036e2f815c293b76c356fb7cc97a08] drm: rcar-du: Lift z-pos restriction on primary plane for Gen3 # 008: GPU not switched in graphic mode - https://pastebin.com/3fXCTinb git bisect skip 676ad8e997036e2f815c293b76c356fb7cc97a08 # skip: [5c57cbc390b166950c2e6c2f0c4edaeb0f47e97d] drm/bridge: lt9211: Convert to drm_of_get_data_lanes_count # 009: Build error - https://pastebin.com/rxHe9QRB git bisect skip 5c57cbc390b166950c2e6c2f0c4edaeb0f47e97d # skip: [6db5e0c8692e590734a7ec7455365d9cbaa15ef1] Merge tag 'drm-intel-next-2022-07-06' of git://anongit.freedesktop.org/drm/drm-intel into drm-next # 010: GPU not switched in graphic mode - https://pastebin.com/rqubSuc8 git bisect skip 6db5e0c8692e590734a7ec7455365d9cbaa15ef1 # skip: [5d763a9955f0fbf2681a2f1fa87c416056bd0c89] drm/amd/display: Remove compiler warning # 011: GPU not switched in graphic mode - https://pastebin.com/BrJs6ybP git bisect skip 5d763a9955f0fbf2681a2f1fa87c416056bd0c89 # skip: [e6c2db2be986158afb9991d9fa8a38fe65a88516] drm/i915: Don't use DRM_DEBUG_WARN_ON for unexpected l3bank/mslice config # 012: GPU not switched in graphic mode - https://pastebin.com/yxppyqbD git bisect skip e6c2db2be986158afb9991d9fa8a38fe65a88516 # bad: [cb6b81b21bd9cf09d72b7fe711be1b55001eb166] Merge tag 'drm-misc-next-fixes-2022-07-21' of git://anongit.freedesktop.org/drm/drm-misc into drm-next # 013: GPU hangs without 
use-after-free issue - https://pastebin.com/iRek4bBy git bisect bad cb6b81b21bd9cf09d72b7fe711be1b55001eb166 # skip: [48b927770f8ad3f8cf4a024a552abf272af9f592] drm/exynos/exynos7_drm_decon: free resources when clk_set_parent() failed. # 014: GPU not switched in graphic mode - https://pastebin.com/ekp10xhP git bisect skip 48b927770f8ad3f8cf4a024a552abf272af9f592 # skip: [c5da61cf5bab30059f22ea368702c445ee87171a] drm/amdgpu/display: add missing FP_START/END checks dcn32_clk_mgr.c # 015: GPU not switched in graphic mode - https://pastebin.com/YbskKWmA git bisect skip c5da61cf5bab30059f22ea368702c445ee87171a # skip: [a77f7c89e62c6dfe405a64995812746f27adc510] drm/edid: convert drm_gtf_modes_for_range() to drm_edid # 016: GPU not switched in graphic mode - https://pastebin.com/bA2AwkJ7 git bisect skip a77f7c89e62c6dfe405a64995812746f27adc510 # skip: [6fde8eec71796f3534f0c274066862829813b21f] drm/doc: Add KUnit documentation # 017: GPU not switched in gr
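With this many skipped commits it helps to see how the verdicts are distributed, since a bisection with many skips may only narrow the result down to a range of commits rather than a single one. Below is a minimal sketch, assuming the log uses the `# good:` / `# bad:` / `# skip:` comment lines shown above; the helper name `tally_bisect_log` is made up for illustration and is not part of git.

```python
import re
from collections import Counter

# Matches the verdict comments found in a bisect log, e.g.
# "# bad: [b44f2fd8...] Merge tag 'drm-next-2022-08-03' ...".
# Numbered notes such as "# 002: good - <url>" are deliberately not matched.
VERDICT_RE = re.compile(r"#\s*(good|bad|skip):\s*\[([0-9a-f]{7,40})\]")


def tally_bisect_log(text):
    """Return (Counter of verdicts, ordered list of (verdict, commit))."""
    steps = [(m.group(1), m.group(2)) for m in VERDICT_RE.finditer(text)]
    return Counter(verdict for verdict, _ in steps), steps


# Three entries copied from the log above (flattened onto one line on purpose).
sample = (
    "# good: [3d7cb6b04c3f3115719235cc6866b10326de34cd] Linux 5.19 "
    "git bisect good 3d7cb6b04c3f3115719235cc6866b10326de34cd "
    "# bad: [b44f2fd87919b5ae6e1756d4c7ba2cbba22238e1] Merge tag 'drm-next-2022-08-03' "
    "# 001: GPU hangs + use-after-free issue "
    "# skip: [e23a5e14aa278858c2e3d81ec34e83aa9a4177c5] Backmerge tag 'v5.19-rc6' "
)

counts, steps = tally_bisect_log(sample)
print(dict(counts))
for verdict, commit in steps:
    print(f"{verdict:5s} {commit[:12]}")
```

Because the pattern is applied with `finditer`, it works both on the original multi-line log and on a copy that has been flattened into one line, as in this archive.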
Re: [BUG][5.20] refcount_t: underflow; use-after-free
On Wed, Aug 17, 2022 at 11:43 PM Maíra Canal wrote: > > Hi Mikhail, > > Looks like 45ecaea738830b9d521c93520c8f201359dcbd95 ("drm/sched: Partial > revert of 'drm/sched: Keep s_fence->parent pointer'") introduced the > error. Try reverting it and check if the use-after-free still happens. Thanks, but unfortunately this did not lead to the expected result. The use-after-free happens again, in a context I cannot make sense of. What is new: a "suspicious RCU usage" warning appeared, but it looks completely unrelated to the use-after-free issue. [ 215.434115] [ cut here ] [ 215.434184] refcount_t: underflow; use-after-free. [ 215.434204] WARNING: CPU: 7 PID: 1258 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 [ 215.434214] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc binfmt_misc snd_seq_midi snd_seq_midi_event intel_rapl_msr intel_rapl_common snd_hda_codec_realtek vfat snd_hda_codec_generic snd_hda_codec_hdmi mt76x2u fat mt76x2_common snd_hda_intel mt76x02_usb snd_intel_dspcfg snd_intel_sdw_acpi mt76_usb iwlmvm edac_mce_amd snd_usb_audio snd_hda_codec mt76x02_lib snd_hda_core snd_usbmidi_lib snd_hwdep snd_rawmidi uvcvideo mt76 kvm_amd snd_seq videobuf2_vmalloc videobuf2_memops snd_seq_device mac80211 videobuf2_v4l2 videobuf2_common kvm btusb iwlwifi snd_pcm btrtl videodev libarc4 eeepc_wmi btbcm asus_wmi iwlmei btintel ledtrig_audio xpad irqbypass sparse_keymap btmtk platform_profile joydev [ 215.434436] hid_logitech_hidpp rapl ff_memless mc snd_timer bluetooth cfg80211 video pcspkr wmi_bmof snd soundcore k10temp i2c_piix4 rfkill mei asus_ec_sensors acpi_cpufreq zram amdgpu drm_ttm_helper ttm iommu_v2 ucsi_ccg gpu_sched crct10dif_pclmul crc32_pclmul typec_ucsi drm_buddy 
crc32c_intel ghash_clmulni_intel ccp igb sp5100_tco typec drm_display_helper nvme dca nvme_core cec wmi ip6_tables ip_tables fuse [ 215.434528] Unloaded tainted modules: amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 [ 215.434672] pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 [ 215.434702] CPU: 7 PID: 1258 Comm: kworker/7:3 Tainted: G W L --- --- 6.0.0-0.rc1.20220817git3cc40a443a04.14.fc38.x86_64 #1 [ 215.434709] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022 [ 215.434715] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched] [ 215.434728] RIP: 0010:refcount_warn_saturate+0xba/0x110 [ 215.434734] Code: 01 01 e8 59 59 6f 00 0f 0b e9 22 46 a5 00 80 3d be 7d be 01 00 75 85 48 c7 c7 c0 99 8e 92 c6 05 ae 7d be 01 01 e8 36 59 6f 00 <0f> 0b e9 ff 45 a5 00 80 3d 99 7d be 01 00 0f 85 5e ff ff ff 48 c7 [ 215.434740] RSP: 0018:9ccb0237fe60 EFLAGS: 00010286 [ 215.434747] RAX: 0026 RBX: 8d531f6f2828 RCX: [ 215.434753] RDX: 0001 RSI: 928d07a4 RDI: [ 215.434757] RBP: 
8d61e47f5600 R08: R09: 9ccb0237fd10 [ 215.434762] R10: 0003 R11: 8d622e2fffe8 R12: 8d61e47fc800 [ 215.434767] R13: 8d5313e95500 R14: 8d61e47fc805 R15: 8d531f6f2830 [ 215.434772] FS: () GS:8d61e460() knlGS: [ 215.434777] CS: 0010 DS: ES: CR0: 80050033 [ 215.434782] CR2: 7f0c8b815048 CR3: 0001ab0e8000 CR4: 00350ee0 [ 215.434788] Call Trace: [ 215.434792] [ 215.434797] process_one_work+0x2a0/0x600 [ 215.434819] worker_thread+0x4f/0x3a0 [ 215.434830] ? process_one_work+0x600/0x600 [ 215.434836] kthread+0xf5/0x120 [ 215.434842] ? kthread_complete_and_exit+0x20/0x20 [ 215.434854] ret_from_fork+0x22/0x30 [ 215.434881] [ 215.434885] irq event stamp: 134873 [ 215.434890] hardirqs last enabled at (134881): []
Re: [BUG][5.20] refcount_t: underflow; use-after-free
On Wed, Aug 17, 2022 at 9:08 PM Melissa Wen wrote: > > Hi Mikhail, > > IIUC, you got this second user-after-free by applying the first version > of Maíra's patch, right? So, that version was adding another unbalanced > unlock to the cs ioctl flow, but it was solved in the latest version, > that you can find here: https://patchwork.freedesktop.org/patch/497680/ > If this is the situation, can you check this last version? > > Thanks, > > Melissa With the latest version the "bad unlock balance detected!" warning is gone, but the use-after-free issue remains. And again the trace shows "Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched]". [ 297.834779] [ cut here ] [ 297.834818] refcount_t: underflow; use-after-free. [ 297.834831] WARNING: CPU: 30 PID: 2377 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 [ 297.834838] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc binfmt_misc snd_seq_midi snd_seq_midi_event mt76x2u mt76x2_common mt76x02_usb mt76_usb mt76x02_lib snd_hda_codec_realtek iwlmvm intel_rapl_msr snd_hda_codec_generic snd_hda_codec_hdmi mt76 vfat fat snd_hda_intel intel_rapl_common mac80211 snd_intel_dspcfg snd_intel_sdw_acpi snd_usb_audio snd_hda_codec snd_usbmidi_lib btusb edac_mce_amd iwlwifi libarc4 uvcvideo snd_hda_core btrtl snd_rawmidi snd_hwdep videobuf2_vmalloc btbcm kvm_amd videobuf2_memops snd_seq iwlmei btintel videobuf2_v4l2 eeepc_wmi snd_seq_device videobuf2_common btmtk kvm xpad videodev joydev irqbypass snd_pcm asus_wmi hid_logitech_hidpp ff_memless cfg80211 bluetooth rapl mc [ 297.834932] ledtrig_audio snd_timer sparse_keymap platform_profile wmi_bmof snd video pcspkr k10temp i2c_piix4 rfkill soundcore mei asus_ec_sensors acpi_cpufreq zram amdgpu drm_ttm_helper 
ttm crct10dif_pclmul crc32_pclmul crc32c_intel iommu_v2 ucsi_ccg gpu_sched typec_ucsi drm_buddy ghash_clmulni_intel drm_display_helper ccp igb typec sp5100_tco nvme cec nvme_core dca wmi ip6_tables ip_tables fuse [ 297.834978] Unloaded tainted modules: amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 [ 297.835055] pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 [ 297.835071] CPU: 30 PID: 2377 Comm: kworker/30:6 Tainted: G WL--- --- 6.0.0-0.rc1.20220817git3cc40a443a04.14.fc38.x86_64 #1 [ 297.835075] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022 [ 297.835078] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched] [ 297.835085] RIP: 0010:refcount_warn_saturate+0xba/0x110 [ 297.835088] Code: 01 01 e8 59 59 6f 00 0f 0b e9 22 46 a5 00 80 3d be 7d be 01 00 75 85 48 c7 c7 c0 99 8e aa c6 05 ae 7d be 01 01 e8 36 59 6f 00 <0f> 0b e9 ff 45 a5 00 80 3d 99 7d be 01 00 0f 85 5e ff ff ff 48 c7 [ 297.835091] RSP: 0018:bd3506df7e60 EFLAGS: 00010286 [ 297.835095] RAX: 0026 
RBX: 961b250cbc28 RCX: [ 297.835097] RDX: 0001 RSI: aa8d07a4 RDI: [ 297.835100] RBP: 96276a3f5600 R08: R09: bd3506df7d10 [ 297.835102] R10: 0003 R11: 9627ae2fffe8 R12: 96276a3fc800 [ 297.835105] R13: 9618c03e6600 R14: 96276a3fc805 R15: 961b250cbc30 [ 297.835108] FS: () GS:96276a20() knlGS: [ 297.835110] CS: 0010 DS: ES: CR0: 80050033 [ 297.835113] CR2: 621001e4a000 CR3: 00018d958000 CR4: 00350ee0 [ 297.835116] Call Trace: [ 297.835118] [ 297.835121] process_one_work+0x2a0/0x600 [ 297.835133] worker_thread+0x4f/0x3a0 [ 297.835139] ? process_one_work+0x600/0x600 [ 297.835142] kthread+0xf5/0x120 [ 297.835145] ? kthread_complete_and_exit+0
Re: [BUG][5.20] refcount_t: underflow; use-after-free
On Mon, Aug 15, 2022 at 3:37 PM Mikhail Gavrilov wrote: > > Thanks, I tested this patch. > But with this patch use-after-free problem happening in another place: Does anyone have an idea why the second use-after-free happened? From the trace I can't tell which code is involved, and I don't quite understand what the "Workqueue" entry in the trace means. [ 408.358737] [ cut here ] [ 408.358743] refcount_t: underflow; use-after-free. [ 408.358760] WARNING: CPU: 9 PID: 62 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 [ 408.358769] Modules linked in: uinput snd_seq_dummy rfcomm snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc binfmt_misc snd_seq_midi snd_seq_midi_event mt76x2u mt76x2_common snd_hda_codec_realtek mt76x02_usb snd_hda_codec_generic iwlmvm snd_hda_codec_hdmi mt76_usb intel_rapl_msr snd_hda_intel mt76x02_lib intel_rapl_common snd_intel_dspcfg snd_intel_sdw_acpi mt76 snd_hda_codec vfat fat snd_usb_audio snd_hda_core edac_mce_amd mac80211 snd_usbmidi_lib snd_hwdep snd_rawmidi mc snd_seq btusb kvm_amd iwlwifi snd_seq_device btrtl btbcm libarc4 btintel eeepc_wmi snd_pcm iwlmei kvm btmtk asus_wmi ledtrig_audio irqbypass joydev snd_timer sparse_keymap bluetooth platform_profile rapl cfg80211 snd video wmi_bmof soundcore i2c_piix4 k10temp rfkill mei [ 408.358853] asus_ec_sensors acpi_cpufreq zram hid_logitech_hidpp amdgpu igb dca drm_ttm_helper ttm iommu_v2 crct10dif_pclmul gpu_sched crc32_pclmul ucsi_ccg crc32c_intel drm_buddy nvme typec_ucsi drm_display_helper ghash_clmulni_intel ccp typec nvme_core sp5100_tco cec wmi ip6_tables ip_tables fuse [ 408.358880] Unloaded tainted modules: amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1 [ 408.358953] pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 [ 408.358967] CPU: 9 PID: 62 Comm: kworker/9:0 Tainted: G W L --- --- 6.0.0-0.rc1.13.fc38.x86_64+debug #1 [ 408.358971] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022 [ 408.358974] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched] [ 408.358982] RIP: 0010:refcount_warn_saturate+0xba/0x110 [ 408.358987] Code: 01 01 e8 d9 59 6f 00 0f 0b e9 a2 46 a5 00 80 3d 3e 7e be 01 00 75 85 48 c7 c7 70 99 8e 92 c6 05 2e 7e be 01 01 e8 b6 59 6f 00 <0f> 0b e9 7f 46 a5 00 80 3d 19 7e be 01 00 0f 85 5e ff ff ff 48 c7 [ 408.358990] RSP: 0018:b124003efe60 EFLAGS: 00010286 [ 408.358994] RAX: 0026 RBX: 9987a025d428 RCX: [ 408.358997] RDX: 0001 RSI: 928d0754 RDI: [ 408.358999] RBP: 9994e4ff5600 R08: R09: b124003efd10 [ 408.359001] R10: 0003 R11: 99952e2fffe8 R12: 9994e4ffc800 [ 408.359004] R13: 998600228cc0 R14: 9994e4ffc805 R15: 9987a025d430 [ 408.359006] FS: () GS:9994e4e0() knlGS: [ 408.359009] CS: 0010 DS: ES: CR0: 80050033 [ 408.359012] CR2: 27ac39e78000 CR3: 0001a66d8000 CR4: 00350ee0 [ 
408.359015] Call Trace: [ 408.359017] [ 408.359020] process_one_work+0x2a0/0x600 [ 408.359032] worker_thread+0x4f/0x3a0 [ 408.359036] ? process_one_work+0x600/0x600 [ 408.359039] kthread+0xf5/0x120 [ 408.359044] ? kthread_complete_and_exit+0x20/0x20 [ 408.359049] ret_from_fork+0x22/0x30 [ 408.359061] [ 408.359063] irq event stamp: 5468 [ 408.359064] hardirqs last enabled at (5467): [] _raw_spin_unlock_irq+0x24/0x50 [ 408.359071] hardirqs last disabled at (5468): [] __schedule+0xe2c/0x16d0 [ 408.359076] softirqs last enabled at (2482): [] rht_deferred_worker+0x708/0xc00 [ 408.359079] softirqs last disabled at (2480): [] rht_deferred_worker+0x1f7/0xc00 [ 408.359082] ---[ end trace ]--- Full kernel log i
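On the "Workqueue" question: as far as I understand, that line is printed when the warning fires inside a kworker thread, and it names the workqueue ("events" appears to be the shared system workqueue) followed by the work function that was executing, here `drm_sched_entity_kill_jobs_work` from the `gpu_sched` module. A minimal sketch for pulling these fields out of a saved log follows; the helper name `parse_workqueue` and the exact field layout are assumptions based on the lines quoted in this thread.

```python
import re

# Assumed layout, taken from the traces in this thread:
#   "Workqueue: <workqueue name> <work function> [<module>]"
# The "[module]" part is treated as optional.
WQ_RE = re.compile(r"Workqueue:\s+(\S+)\s+(\S+)(?:\s+\[(\w+)\])?")


def parse_workqueue(line):
    """Return the workqueue name, work function and module, or None."""
    m = WQ_RE.search(line)
    if m is None:
        return None
    return {
        "workqueue": m.group(1),
        "function": m.group(2),
        "module": m.group(3),
    }


line = "[ 408.358974] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched]"
info = parse_workqueue(line)
print(info)
```

This suggests the code to look at is the work function itself and whatever it drops references on, even though the call trace only shows the generic `process_one_work` / `worker_thread` frames.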
Re: [BUG][5.20] refcount_t: underflow; use-after-free
On Mon, Aug 15, 2022 at 5:20 AM Maíra Canal wrote: > > Hi Mikhail > > Looks like this use-after-free problem was introduced on > 90af0ca047f3049c4b46e902f432ad6ef1e2ded6. Checking this patch it seems > like: if amdgpu_cs_vm_handling return r != 0, then it will unlock > bo_list_mutex inside the function amdgpu_cs_vm_handling and again on > amdgpu_cs_parser_fini. > > Maybe the following patch will help: Thanks, I tested this patch. But with it applied, the use-after-free now happens in another place: [ 894.012920] [ cut here ] [ 894.012939] refcount_t: underflow; use-after-free. [ 894.012968] WARNING: CPU: 14 PID: 205 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 [ 894.012999] Modules linked in: tls uinput rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc snd_seq_midi snd_seq_midi_event snd_hda_codec_realtek mt76x2u mt76x2_common snd_hda_codec_generic snd_hda_codec_hdmi intel_rapl_msr mt76x02_usb intel_rapl_common snd_hda_intel mt76_usb snd_intel_dspcfg vfat iwlmvm snd_intel_sdw_acpi mt76x02_lib fat snd_usb_audio snd_hda_codec mt76 edac_mce_amd snd_usbmidi_lib snd_hda_core btusb snd_rawmidi snd_hwdep mac80211 mc iwlwifi btrtl eeepc_wmi asus_wmi btbcm snd_seq kvm_amd libarc4 ledtrig_audio snd_seq_device btintel iwlmei sparse_keymap btmtk kvm snd_pcm irqbypass platform_profile snd_timer xpad joydev cfg80211 rapl hid_logitech_hidpp bluetooth ff_memless wmi_bmof video pcspkr snd k10temp i2c_piix4 [ 894.013086] soundcore rfkill mei asus_ec_sensors acpi_cpufreq zram amdgpu drm_ttm_helper ttm iommu_v2 crct10dif_pclmul ucsi_ccg gpu_sched crc32_pclmul crc32c_intel typec_ucsi drm_buddy typec drm_display_helper ghash_clmulni_intel igb ccp cec nvme sp5100_tco nvme_core dca wmi ip6_tables ip_tables fuse [ 
894.013322] Unloaded tainted modules: amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1 [ 894.013455] pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 [ 894.013690] CPU: 14 PID: 205 Comm: kworker/14:1 Tainted: GW L--- --- 5.20.0-0.rc0.20220812git7ebfc85e2cd7.11.fc38.x86_64 #1 [ 894.013725] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022 [ 894.013756] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched] [ 894.013779] RIP: 0010:refcount_warn_saturate+0xba/0x110 [ 894.013796] Code: 01 01 e8 79 4a 6f 00 0f 0b e9 42 47 a5 00 80 3d de 7e be 01 00 75 85 48 c7 c7 f8 98 8e 9c c6 05 ce 7e be 01 01 e8 56 4a 6f 00 <0f> 0b e9 1f 47 a5 00 80 3d b9 7e be 01 00 0f 85 5e ff ff ff 48 c7 [ 894.013842] RSP: 0018:b48681153e60 EFLAGS: 00010286 [ 894.013858] RAX: 0026 RBX: 9bad16f1f028 RCX: [ 894.013878] RDX: 0001 RSI: 9c8d06dc RDI: [ 894.013897] RBP: 9bba663f5600 R08: R09: b48681153d10 [ 894.013916] R10: 0003 R11: 9bbaae2fffe8 R12: 9bba663fc800 [ 894.013934] R13: 9bab93fcab40 
R14: 9bba663fc805 R15: 9bad16f1f030 [ 894.013954] FS: () GS:9bba6620() knlGS: [ 894.013975] CS: 0010 DS: ES: CR0: 80050033 [ 894.013991] CR2: 1aa46b2ec008 CR3: 000101516000 CR4: 00350ee0 [ 894.014011] Call Trace: [ 894.014022] [ 894.014030] process_one_work+0x2a0/0x600 [ 894.014051] worker_thread+0x4f/0x3a0 [ 894.014065] ? process_one_work+0x600/0x600 [ 894.014079] kthread+0xf5/0x120 [ 894.014092] ? kthread_complete_and_exit+0x20/0x20 [ 894.014109] ret_from_fork+0x22/0x30 [ 894.014129] [ 894.014137] irq event stamp: 5802 [ 894.014148] hardirqs last enabled at (5801): [] _raw_spin_unlock_irq+0x24/0x50 [ 894.014178] hardirqs last disabled at (5802): [] _
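The diagnosis quoted in this message describes a double unlock: the error path of amdgpu_cs_vm_handling drops bo_list_mutex, and amdgpu_cs_parser_fini then drops it a second time. The userspace sketch below only illustrates that pattern and one common way to make the cleanup idempotent. The `parser` struct, its `locked` flag and all function names are hypothetical stand-ins; this is not the amdgpu code and not the patch that was proposed in the thread.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the CS parser state; not the amdgpu struct. */
struct parser {
    pthread_mutex_t mutex;
    bool locked;        /* true while the owning thread holds the mutex */
    int unlock_calls;   /* counts real unlocks, for demonstration only */
};

static void parser_init(struct parser *p)
{
    pthread_mutex_init(&p->mutex, NULL);
    p->locked = false;
    p->unlock_calls = 0;
}

static void parser_lock(struct parser *p)
{
    pthread_mutex_lock(&p->mutex);
    p->locked = true;
}

/* Unlock only if the lock is still held, so a second call is a no-op. */
static void parser_unlock(struct parser *p)
{
    if (!p->locked)
        return;
    p->locked = false;
    p->unlock_calls++;
    pthread_mutex_unlock(&p->mutex);
}

/* Models a step that takes the lock and releases it on its error path. */
static int vm_handling(struct parser *p, int fail)
{
    parser_lock(p);
    if (fail) {
        parser_unlock(p);
        return -1;
    }
    return 0;
}

/* Models the common cleanup, which also tries to release the lock. */
static void parser_fini(struct parser *p)
{
    parser_unlock(p);
}
```

With the flag in place, calling `parser_fini()` after a failed `vm_handling()` performs no second unlock. Without it, the same call sequence would unlock a mutex that is no longer held.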
[BUG][5.20] refcount_t: underflow; use-after-free
Hi folks. I joined testing of 5.20 today (7ebfc85e2cd7). I encountered frequent GPU freezes, after which a message appears in the kernel logs: [ 220.280990] [ cut here ] [ 220.281000] refcount_t: underflow; use-after-free. [ 220.281019] WARNING: CPU: 1 PID: 3746 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 [ 220.281029] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc snd_seq_midi snd_seq_midi_event vfat intel_rapl_msr fat intel_rapl_common snd_hda_codec_realtek mt76x2u snd_hda_codec_generic snd_hda_codec_hdmi mt76x2_common iwlmvm mt76x02_usb edac_mce_amd mt76_usb snd_hda_intel snd_intel_dspcfg mt76x02_lib snd_intel_sdw_acpi snd_usb_audio snd_hda_codec mt76 kvm_amd uvcvideo mac80211 snd_hda_core btusb eeepc_wmi snd_usbmidi_lib videobuf2_vmalloc videobuf2_memops kvm btrtl snd_rawmidi asus_wmi snd_hwdep videobuf2_v4l2 btbcm iwlwifi ledtrig_audio libarc4 btintel snd_seq videobuf2_common sparse_keymap btmtk irqbypass videodev snd_seq_device joydev xpad iwlmei platform_profile bluetooth ff_memless snd_pcm mc rapl [ 220.281185] video snd_timer cfg80211 wmi_bmof snd pcspkr soundcore k10temp i2c_piix4 rfkill mei asus_ec_sensors acpi_cpufreq zram hid_logitech_hidpp amdgpu igb dca drm_ttm_helper ttm crct10dif_pclmul iommu_v2 crc32_pclmul gpu_sched crc32c_intel ucsi_ccg drm_buddy nvme typec_ucsi ghash_clmulni_intel drm_display_helper ccp nvme_core typec sp5100_tco cec wmi ip6_tables ip_tables fuse [ 220.281258] Unloaded tainted modules: amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 [ 220.281388] pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 [ 220.281415] CPU: 1 PID: 3746 Comm: chrome:cs0 Tainted: G W L --- --- 5.20.0-0.rc0.20220812git7ebfc85e2cd7.10.fc38.x86_64 #1 [ 220.281421] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022 [ 220.281426] RIP: 0010:refcount_warn_saturate+0xba/0x110 [ 220.281431] Code: 01 01 e8 79 4a 6f 00 0f 0b e9 42 47 a5 00 80 3d de 7e be 01 00 75 85 48 c7 c7 f8 98 8e 98 c6 05 ce 7e be 01 01 e8 56 4a 6f 00 <0f> 0b e9 1f 47 a5 00 80 3d b9 7e be 01 00 0f 85 5e ff ff ff 48 c7 [ 220.281437] RSP: 0018:b4b0d18d7a80 EFLAGS: 00010282 [ 220.281443] RAX: 0026 RBX: 0003 RCX: [ 220.281448] RDX: 0001 RSI: 988d06dc RDI: [ 220.281452] RBP: R08: R09: b4b0d18d7930 [ 220.281457] R10: 0003 R11: a0672e2fffe8 R12: a058ca360400 [ 220.281461] R13: a05846c50a18 R14: fe00 R15: 0003 [ 220.281465] FS: 7f82683e06c0() GS:a066e2e0() knlGS: [ 220.281470] CS: 0010 DS: ES: CR0: 80050033 [ 220.281475] CR2: 3590005cc000 CR3: 0001fca46000 CR4: 00350ee0 [ 220.281480] Call Trace: [ 220.281485] [ 220.281490] amdgpu_cs_ioctl+0x4e2/0x2070 [amdgpu] [ 220.281806] ? 
amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu] [ 220.282028] drm_ioctl_kernel+0xa4/0x150 [ 220.282043] drm_ioctl+0x21f/0x420 [ 220.282053] ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu] [ 220.282275] ? lock_release+0x14f/0x460 [ 220.282282] ? _raw_spin_unlock_irqrestore+0x30/0x60 [ 220.282290] ? _raw_spin_unlock_irqrestore+0x30/0x60 [ 220.282297] ? lockdep_hardirqs_on+0x7d/0x100 [ 220.282305] ? _raw_spin_unlock_irqrestore+0x40/0x60 [ 220.282317] amdgpu_drm_ioctl+0x4a/0x80 [amdgpu] [ 220.282534] __x64_sys_ioctl+0x90/0xd0 [ 220.282545] do_syscall_64+0x5b/0x80 [ 220.282551] ? futex_wake+0x6c/0x150 [ 220.282568] ? lock_is_held_type+0xe8/0x140 [ 220.282580] ? do_syscall_64+0x67/0x80 [ 220.282585] ? lockdep_hardirqs_on+0x7d/0x100 [ 220.282592] ? do_syscall_64+0x67/0x80 [ 220.282597] ? do_syscall_64+0x67/0x80 [ 220.282602] ?
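For readers unfamiliar with the warning text: "refcount_t: underflow; use-after-free" is reported when a reference counter is decremented more often than it was incremented, which usually means some path dropped a reference it did not own. The sketch below is a deliberately simplified, single-threaded model of a counter that detects this and then refuses further updates. The `ref` type and its helpers are hypothetical and do not reproduce the kernel's refcount_t implementation.

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified, single-threaded model of a saturating reference counter. */
struct ref {
    unsigned int count;
    bool saturated;     /* set once misuse was detected; counter is then frozen */
};

static void ref_init(struct ref *r)
{
    r->count = 1;       /* the creator holds the first reference */
    r->saturated = false;
}

static void ref_get(struct ref *r)
{
    if (!r->saturated)
        r->count++;
}

/*
 * Drop one reference. Returns true only when the caller dropped the last
 * reference and is therefore responsible for freeing the object.
 */
static bool ref_put(struct ref *r)
{
    if (r->saturated)
        return false;
    if (r->count == 0) {
        /* More puts than gets: the object may already have been freed. */
        fprintf(stderr, "ref: underflow; use-after-free\n");
        r->saturated = true;
        return false;
    }
    return --r->count == 0;
}
```

In this model an extra `ref_put()` after the count reached zero triggers the warning, which mirrors the situation in the trace above where an unbalanced put is reported from the command submission path.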
Re: Command "clinfo" causes BUG: kernel NULL pointer dereference, address: 0000000000000008 on driver amdgpu
On Tue, Jul 19, 2022 at 4:26 PM Mikhail Gavrilov wrote: > In the kernel log there is no error so it is most likely a user space issue, but I am not sure about it. But I am confused by the message in the kernel log: [ 1962.000909] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out [ 1962.000912] amdgpu: Failed to evict process queues [ 1962.000918] amdgpu: Failed to quiesce KFD [ 1966.010395] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out [ 1966.010406] amdgpu: Resetting wave fronts (cpsch) on dev b40e7982 -- Best Regards, Mike Gavrilov.
Re: Command "clinfo" causes BUG: kernel NULL pointer dereference, address: 0000000000000008 on driver amdgpu
On Tue, Jul 19, 2022 at 1:40 PM Mike Lothian wrote: > > I was told that this patch replaces the patch you mentioned > https://patchwork.freedesktop.org/series/106078/ and it the one > that'll hopefully land in Linus's tree > Great, I confirm that both patches solve the issue. As I understand it, the second patch [1] is the more correct one and should be merged into 5.19 soon, right? And since we are talking about clinfo, I have a question. Has anyone else encountered the problem that, on configurations with two GPUs, clinfo hangs in a loop and completely occupies one processor core? In my case, one GPU is in the RENOIR processor, and the other is a discrete AMD Radeon 6800M. The BIOS offers no way to turn off the integrated GPU, so I cannot check this configuration with each GPU separately. In the kernel log there is no error, so it is most likely a user space issue, but I am not sure about it. The clinfo backtrace is here [2]. [1] https://patchwork.freedesktop.org/series/106078/ [2] https://pastebin.com/wv5iGibi -- Best Regards, Mike Gavrilov.
Re: [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.
On Wed, Jul 13, 2022 at 5:38 PM Mikhail Gavrilov wrote: > # first bad commit: [9cbbd694a58bdf24def2462276514c90cab7cf80] Merge > drm/drm-next into drm-misc-next > Don't know who to thank but the issue disappeared in 5.19 rc7. -- Best Regards, Mike Gavrilov.
Command "clinfo" causes BUG: kernel NULL pointer dereference, address: 0000000000000008 on driver amdgpu
Hi guys, I am continuing to test 5.19-rc7 and found a bug. The command "clinfo" causes "BUG: kernel NULL pointer dereference, address: 0000000000000008" in the amdgpu driver. Here is the trace: [ 1320.203332] BUG: kernel NULL pointer dereference, address: 0008 [ 1320.203338] #PF: supervisor read access in kernel mode [ 1320.203340] #PF: error_code(0x) - not-present page [ 1320.203341] PGD 0 P4D 0 [ 1320.203344] Oops: [#1] PREEMPT SMP NOPTI [ 1320.203346] CPU: 5 PID: 1226 Comm: kworker/5:2 Tainted: G W L --- 5.19.0-0.rc7.53.fc37.x86_64+debug #1 [ 1320.203348] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022 [ 1320.203350] Workqueue: events delayed_fput [ 1320.203354] RIP: 0010:dma_resv_add_fence+0x5a/0x2d0 [ 1320.203358] Code: 85 c0 0f 84 43 02 00 00 8d 50 01 09 c2 0f 88 47 02 00 00 8b 15 73 10 99 01 49 8d 45 70 48 89 44 24 10 85 d2 0f 85 05 02 00 00 <49> 8b 44 24 08 48 3d 80 93 53 97 0f 84 06 01 00 00 48 3d 20 93 53 [ 1320.203360] RSP: 0018:af4cc1adfc68 EFLAGS: 00010246 [ 1320.203362] RAX: 976660408208 RBX: 975f545f2000 RCX: [ 1320.203363] RDX: RSI: RDI: 976660408198 [ 1320.203364] RBP: 976806f6e800 R08: R09: [ 1320.203366] R10: R11: 0001 R12: [ 1320.203367] R13: 976660408198 R14: 975f545f2000 R15: 976660408198 [ 1320.203368] FS: () GS:976de120() knlGS: [ 1320.203370] CS: 0010 DS: ES: CR0: 80050033 [ 1320.203371] CR2: 0008 CR3: 0007fb31c000 CR4: 00350ee0 [ 1320.203372] Call Trace: [ 1320.203374] [ 1320.203378] amdgpu_amdkfd_gpuvm_destroy_cb+0x5d/0x1e0 [amdgpu] [ 1320.203516] amdgpu_vm_fini+0x2f/0x4e0 [amdgpu] [ 1320.203625] ? mutex_destroy+0x21/0x50 [ 1320.203629] amdgpu_driver_postclose_kms+0x1da/0x2b0 [amdgpu] [ 1320.203734] drm_file_free.part.0+0x20d/0x260 [ 1320.203738] drm_release+0x6a/0x120 [ 1320.203741] __fput+0xab/0x270 [ 1320.203743] delayed_fput+0x1f/0x30 [ 1320.203745] process_one_work+0x2a0/0x600 [ 1320.203749] worker_thread+0x4f/0x3a0 [ 1320.203751] ? process_one_work+0x600/0x600 [ 1320.203753] kthread+0xf5/0x120 [ 1320.203755] ?
kthread_complete_and_exit+0x20/0x20 [ 1320.203758] ret_from_fork+0x22/0x30 [ 1320.203764] Full kernel log is here: https://pastebin.com/EeKh2LEr One hour later, after a lot of "BUG: workqueue lockup" messages, the GPU hung completely. I will be glad to test patches that fix this bug. -- Best Regards, Mike Gavrilov.
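A note on reading this oops: a fault address of 0x8 is what one typically sees when code reads a member located 8 bytes into a structure through a NULL pointer. The marked bytes in the Code line, `<49> 8b 44 24 08`, decode to `mov 0x8(%r12),%rax`, which fits that reading. The sketch below shows the arithmetic only. The `fence_like` struct and its field names are hypothetical and are not taken from the kernel sources; the offset of 8 assumes a typical 64-bit ABI with 8-byte pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical object; the field names are illustrative, not kernel structures. */
struct fence_like {
    void *first;   /* offset 0 */
    void *ops;     /* offset 8 on LP64 targets, where pointers are 8 bytes */
};

/*
 * Address the CPU would access when reading base->ops, computed without
 * dereferencing the pointer. For a NULL base this is simply the member
 * offset, which is why small fault addresses hint at a NULL struct pointer.
 */
static uintptr_t field_address(const struct fence_like *base)
{
    return (uintptr_t)base + offsetof(struct fence_like, ops);
}
```

So when an oops reports a small non-zero address such as 0x8 or 0x38, a useful first step is to look for a structure member at that offset in the function named by RIP.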
Re: [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.
On Sat, Jul 9, 2022 at 5:10 PM Mikhail Gavrilov wrote: > Hi Christian, > if you read my initial post. You should see that I tried to bisect the issue. > But it is very problematic because on each step I see different symptoms. > And if mark different symptoms with skip step we got at end lot of > possible commits: > Here is my bisect from initial post: https://pastebin.com/AhLMNfyv > [8.291298] [ cut here ] > [8.291309] kernel BUG at mm/page_alloc.c:1329! > [8.291324] invalid opcode: [#1] PREEMPT SMP NOPTI > [8.291328] CPU: 8 PID: 599 Comm: systemd-udevd Not tainted > 5.18.0-rc2-003-790b45f1bc6736a8dd48ba5731b6871e0217311e+ #361 > [8.291333] Hardware name: System manufacturer System Product > Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022 > [8.291338] RIP: 0010:free_pcp_prepare+0x58d/0x5a0 The 5.19 release is coming soon, and I still have no working kernel newer than commit fdaf9a5840ac on any machine (all of them have AMD graphics). Bisecting with the mutex issue marked as "bad" and every other non-working state marked as "skip" did not lead to anything useful. Marking every commit on which the kernel does not work as "bad" did not lead to anything useful either.
Below I did it: $ git bisect log git bisect start # status: waiting for both good and bad commits # good: [fdaf9a5840acaab18694a19e0eb0aa51162d] Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache git bisect good fdaf9a5840acaab18694a19e0eb0aa51162d # status: waiting for bad commit, 1 good commit known # bad: [babf0bb978e3c9fce6c4eba6b744c8754fd43d8e] Merge tag 'xfs-5.19-for-linus' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux git bisect bad babf0bb978e3c9fce6c4eba6b744c8754fd43d8e # 01 - good: [86c87bea6b42100c67418af690919c44de6ede6e] Merge tag 'devicetree-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux git bisect good 86c87bea6b42100c67418af690919c44de6ede6e # 02 - observed initial problem with mutex # bad: [43ab20c599f4dc4c3972a8386ef4ca3943b5f9cd] drm/i915/gt: Fix build error without CONFIG_PM git bisect bad 43ab20c599f4dc4c3972a8386ef4ca3943b5f9cd # 03 - observed invalid opcode: [#1] PREEMPT SMP NOPTI - RIP: 0010:free_pcp_prepare+0x58d/0x5a0 # bad: [790b45f1bc6736a8dd48ba5731b6871e0217311e] drm/i915/bios: Parse the seamless DRRS min refresh rate git bisect bad 790b45f1bc6736a8dd48ba5731b6871e0217311e # 04 - observed invalid opcode: [#1] PREEMPT SMP NOPTI - RIP: 0010:free_pcp_prepare+0x455/0x650 # bad: [c6ed9f66eb70aeaac9998bd3552ada740d90e20c] drm/nouveau/gr/gf100-: change gf108_gr_fwif from global to static git bisect bad c6ed9f66eb70aeaac9998bd3552ada740d90e20c # 05 good: [3123109284176b1532874591f7c81f3837bbdc17] Linux 5.18-rc1 git bisect good 3123109284176b1532874591f7c81f3837bbdc17 # 06 good: [711c7adc4687250deb550ee8a6994203f817b2ca] drm: exynos: dsi: Use drm panel_bridge API git bisect good 711c7adc4687250deb550ee8a6994203f817b2ca # 07 - observed invalid opcode: [#1] PREEMPT SMP NOPTI - RIP: 0010:free_pcp_prepare+0x35e/0x410 # bad: [047a1b877ed48098bed71fcfb1d4891e1b54441d] dma-buf & drm/amdgpu: remove dma_resv workaround git bisect bad 047a1b877ed48098bed71fcfb1d4891e1b54441d # 08 good: 
[644704740b8282c9ee9483a38666ee4a4561c37c] drm/amdgpu: use dma_resv_for_each_fence for CS workaround v2 git bisect good 644704740b8282c9ee9483a38666ee4a4561c37c # 09 - observed invalid opcode: [#1] PREEMPT SMP NOPTI - RIP: 0010:free_pcp_prepare+0x35e/0x410 # bad: [61fe0ab26e36998cebec48805d6873e31f0d79d7] drm/gma500: fix a missing break in psb_intel_crtc_mode_set git bisect bad 61fe0ab26e36998cebec48805d6873e31f0d79d7 # 10 good: [1c3b2a27def609473ed13b1cd668cb10deab49b4] drm/nouveau/clk: Fix an incorrect NULL check on list iterator git bisect good 1c3b2a27def609473ed13b1cd668cb10deab49b4 # 11 - observed invalid opcode: [#1] PREEMPT SMP NOPTI - RIP: 0010:free_pcp_prepare+0x35e/0x410 # bad: [aa46154355e1e81ef746470d2e88bdb283508bff] drm/ingenic: Add ingenic_drm_bridge_atomic_enable and disable git bisect bad aa46154355e1e81ef746470d2e88bdb283508bff # 12 good: [71d637823cac7748079a912e0373476c7cf6f985] dma-buf: finally make dma_resv_excl_fence private v2 git bisect good 71d637823cac7748079a912e0373476c7cf6f985 # 13 - observed invalid opcode: [#1] PREEMPT SMP NOPTI - RIP: 0010:free_pcp_prepare+0x35e/0x410 # bad: [33f2069fb6a9c2d6509accc39521d3f4d6369576] drm/nouveau: support more than one write fence in fenv50_wndw_prepare_fb git bisect bad 33f2069fb6a9c2d6509accc39521d3f4d6369576 # 14 - observed invalid opcode: [#1] PREEMPT SMP NOPTI - RIP: 0010:free_pcp_prepare+0x35e/0x410 # bad: [9cbbd694a58bdf24def2462276514c90cab7cf80] Merge drm/drm-next into drm-misc-next git bisect bad 9cbbd694a58bdf24def2462276514c90cab7cf80 # first bad commit: [9cbbd694a58bdf24def2462276514c90cab7cf80] Merge d
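One possible reason a bisect like the one above is hard to interpret is that the good..bad range may contain more than one regression. The sketch below is a simplified model only: it assumes a linear history (git bisect works on a commit graph with merges) and two hypothetical regressions at made-up positions. It shows that a binary search converges on whichever failure the tester chose to call "bad", so marking every failure as bad finds the earliest breakage, which is not necessarily the one being hunted.

```c
#include <stddef.h>

enum state { GOOD, BAD };

/* Predicate type: reports whether the build at index i is good or bad. */
typedef enum state (*test_fn)(size_t i);

/*
 * Binary search for the first bad index in the range (good, bad], assuming
 * a linear history in which everything after the first bad commit is bad.
 */
static size_t first_bad(size_t good, size_t bad, test_fn test)
{
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;
        if (test(mid) == BAD)
            bad = mid;
        else
            good = mid;
    }
    return bad;
}

/*
 * Hypothetical history of 16 commits: regression A starts at commit 5 and
 * regression B at commit 11. The two predicates model two marking policies.
 */
static enum state only_b(size_t i)      { return i >= 11 ? BAD : GOOD; }
static enum state any_failure(size_t i) { return i >= 5 ? BAD : GOOD; }
```

Here `first_bad(0, 15, only_b)` lands on commit 11 and `first_bad(0, 15, any_failure)` lands on commit 5. The `only_b` policy further assumes that regression A does not prevent testing for B; when it does, those commits have to be skipped and the result becomes a range of candidates rather than a single commit.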
Re: [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.
On Thu, Jul 7, 2022 at 2:50 PM Christian König wrote: > > Am 07.07.22 um 02:20 schrieb Mikhail Gavrilov: > > On Tue, Jun 28, 2022 at 2:21 PM Mikhail Gavrilov > > wrote: > > Christian can you look why > > drm_aperture_remove_conflicting_pci_framebuffers cause this kernel bug > > on my machine? > > That looks like a problem outside of the amdgpu driver. > > What happens is that during load amdgpu requests whatever driver > (vesafb,vgafb or efifb) is currently handling the framebuffer to unload. > This unload in turn now crashes for some reason. > > My best suggestion is to try to bisect this. Hi Christian, if you read my initial post, you will see that I tried to bisect the issue. But it is very problematic because at each step I see different symptoms. And if I mark the different symptoms as skip, we end up with a lot of possible commits. Here is my bisect from the initial post: https://pastebin.com/AhLMNfyv If you want me to finish the bisection successfully, please help me get past this oops: [8.291177] page:af2b6334 refcount:0 mapcount:0 mapping: index:0x0 pfn:0x102a000 [8.291202] head:af2b6334 order:0 compound_mapcount:-1226 compound_pincount:0 [8.291221] flags: 0x17c001(head|node=0|zone=2|lastcpupid=0x1f) [8.291239] raw: 0017c001 fb35c0a80008 fb35c0a80008 [8.291257] raw: [8.291275] page dumped because: VM_BUG_ON_PAGE(compound && compound_order(page) != order) [8.291298] [ cut here ] [8.291309] kernel BUG at mm/page_alloc.c:1329!
[8.291324] invalid opcode: [#1] PREEMPT SMP NOPTI [8.291328] CPU: 8 PID: 599 Comm: systemd-udevd Not tainted 5.18.0-rc2-003-790b45f1bc6736a8dd48ba5731b6871e0217311e+ #361 [8.291333] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022 [8.291338] RIP: 0010:free_pcp_prepare+0x58d/0x5a0 [8.291343] Code: c6 18 a2 85 a7 e8 d3 b7 fc ff 0f 0b 31 f6 48 89 df e8 97 cf 06 00 e9 29 ff ff ff 48 c7 c6 00 f1 85 a7 48 89 df e8 b3 b7 fc ff <0f> 0b 48 c7 c6 58 92 85 a7 e8 a5 b7 fc ff 0f 0b 0f 1f 00 0f 1f 44 [8.291351] RSP: 0018:b07c023ab9d8 EFLAGS: 00010296 [8.291354] RAX: 004e RBX: fb35c0a8 RCX: [8.291358] RDX: 0001 RSI: a789dbaf RDI: [8.291361] RBP: 0009 R08: R09: b07c023ab7c0 [8.291365] R10: 0003 R11: 92ee2e2fffe8 R12: [8.291368] R13: 92ee2a55d180 R14: fe00 R15: fb35c0a8 [8.291371] FS: 7f80aa398680() GS:92edda20() knlGS: [8.291376] CS: 0010 DS: ES: CR0: 80050033 [8.291379] CR2: 7f80aa38e616 CR3: 00017d726000 CR4: 00350ee0 [8.291382] Call Trace: [8.291384] [8.291386] ? 
find_held_lock+0x32/0x80 [8.291391] free_unref_page+0x25/0x2a0 [8.291395] __vunmap+0x261/0x3d0 [8.291399] drm_fbdev_cleanup+0x6b/0xc0 [8.291403] drm_fbdev_fb_destroy+0x15/0x30 [8.291407] unregister_framebuffer+0x2e/0x40 [8.291411] drm_client_dev_unregister+0x6e/0xe0 [8.291416] drm_dev_unregister+0x34/0x90 [8.291419] drm_dev_unplug+0x24/0x40 [8.291422] simpledrm_remove+0x11/0x20 [8.291426] platform_remove+0x1f/0x40 [8.291429] device_release_driver_internal+0x1b8/0x220 [8.291433] bus_remove_device+0xef/0x160 [8.291437] device_del+0x18c/0x3f0 [8.291440] platform_device_del.part.0+0x13/0x70 [8.291444] platform_device_unregister+0x1c/0x30 [8.291447] drm_aperture_detach_drivers+0xa3/0xd0 [8.291452] drm_aperture_remove_conflicting_pci_framebuffers+0x3f/0x70 [8.291457] amdgpu_pci_probe+0x126/0x3c0 [amdgpu] [8.291599] local_pci_probe+0x41/0x80 [8.291604] pci_device_probe+0xaa/0x200 [8.291607] really_probe+0x1a0/0x370 [8.291611] __driver_probe_device+0xfb/0x170 [8.291615] driver_probe_device+0x1f/0x90 [8.291618] __driver_attach+0xbe/0x1a0 [8.291622] ? __device_attach_driver+0xe0/0xe0 [8.291625] bus_for_each_dev+0x65/0x90 [8.291629] bus_add_driver+0x150/0x1f0 [8.291632] driver_register+0x89/0xd0 [8.291636] ? 0xc067b000 [8.291641] do_one_initcall+0x69/0x350 [8.291645] ? do_init_module+0x22/0x260 [8.291650] ? rcu_read_lock_sched_held+0x3b/0x70 [8.291654] ? trace_kmalloc+0x3b/0x100 [8.291658] ? kmem_cache_alloc_trace+0x1eb/0x3a0 [8.291662] do_init_module+0x4a/0x260 [8.291666] __do_sys_finit_module+0x93/0xf0 [8.291673] do_syscall_64+0x3a/0x80 [8.291677] entry_SYSCALL_64_after_hwframe+0x44/0xae [8.291681] RIP: 0033:0x7f80aaf4507d [8.291685] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89
Re: [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.
On Tue, Jun 28, 2022 at 2:21 PM Mikhail Gavrilov wrote: > Christian can you look why drm_aperture_remove_conflicting_pci_framebuffers cause this kernel bug on my machine? [6.822385] amdgpu: Ignoring ACPI CRAT on non-APU system [6.822462] amdgpu: Virtual CRAT table created for CPU [6.822654] amdgpu: Topology: Add CPU node [6.827643] Console: switching to colour dummy device 80x25 [6.845504] BUG: kernel NULL pointer dereference, address: 0038 [6.845509] #PF: supervisor read access in kernel mode [6.845512] #PF: error_code(0x) - not-present page [6.845515] PGD 0 P4D 0 [6.845518] Oops: [#1] PREEMPT SMP NOPTI [6.845522] CPU: 27 PID: 612 Comm: systemd-udevd Tainted: G W --- 5.19.0-0.rc5.20220705gitc1084b6c5620.40.fc37.x86_64 #1 [6.845528] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022 [6.845533] RIP: 0010:kernfs_find_and_get_ns+0x11/0x70 [6.845539] Code: 78 e8 c3 fa 31 00 48 85 c0 75 e1 eb 93 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 49 89 d5 41 54 49 89 f4 55 53 <48> 8b 47 38 48 89 fb 48 85 c0 48 0f 44 c7 48 8b a8 80 00 00 00 48 [6.845546] RSP: 0018:a98c022f3aa0 EFLAGS: 00010246 [6.845550] RAX: RBX: af52c3c0 RCX: 9e150147b640 [6.845553] RDX: RSI: af52c508 RDI: [6.845557] RBP: R08: R09: 249249d4 [6.845560] R10: 0001 R11: R12: af52c508 [6.845563] R13: R14: 9e157aa93900 R15: [6.845567] FS: 7fabaafbf680() GS:9e23e6a0() knlGS: [6.845571] CS: 0010 DS: ES: CR0: 80050033 [6.845574] CR2: 0038 CR3: 00017cb56000 CR4: 00350ee0 [6.845578] Call Trace: [6.845579] [6.845582] sysfs_unmerge_group+0x18/0x60 [6.845585] dpm_sysfs_remove+0x20/0x60 [6.845590] device_del+0xa4/0x3f0 [6.845594] platform_device_del.part.0+0x13/0x70 [6.845599] platform_device_unregister+0x1c/0x30 [6.845602] sysfb_disable+0x2d/0x60 [6.845605] remove_conflicting_framebuffers+0x1b/0xc0 [6.845610] remove_conflicting_pci_framebuffers+0xce/0x120 [6.845614] drm_aperture_remove_conflicting_pci_framebuffers+0x57/0x80 [6.845620] amdgpu_pci_probe+0xcb/0x360 
[amdgpu] [6.845760] local_pci_probe+0x41/0x80 [6.845764] pci_device_probe+0xaa/0x210 [6.845768] really_probe+0x1bf/0x390 [6.845771] __driver_probe_device+0xfc/0x170 [6.845775] driver_probe_device+0x1f/0x90 [6.845778] __driver_attach+0xbf/0x1b0 [6.845782] ? __device_attach_driver+0xe0/0xe0 [6.845785] bus_for_each_dev+0x65/0x90 [6.845789] bus_add_driver+0x15c/0x200 [6.845792] driver_register+0x89/0xe0 [6.845796] ? 0xc0c8d000 [6.845801] do_one_initcall+0x69/0x350 [6.845806] ? rcu_read_lock_sched_held+0x3c/0x70 [6.845810] ? trace_kmalloc+0x3c/0x100 [6.845814] ? kmem_cache_alloc_trace+0x1e8/0x350 [6.845818] do_init_module+0x4a/0x200 [6.845822] __do_sys_init_module+0x13a/0x190 [6.845827] do_syscall_64+0x5b/0x80 [6.845832] ? asm_exc_page_fault+0x27/0x30 [6.845835] ? lockdep_hardirqs_on+0x7d/0x100 [6.845839] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [6.845842] RIP: 0033:0x7fababb7463e [6.845845] Code: 48 8b 0d e5 57 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b2 57 0c 00 f7 d8 64 89 01 48 [6.845852] RSP: 002b:7ffc6a6c9658 EFLAGS: 0246 ORIG_RAX: 00af [6.845857] RAX: ffda RBX: 5620deef53f0 RCX: 7fababb7463e [6.845860] RDX: 5620deeb2df0 RSI: 010bfac6 RDI: 7faba943e010 [6.845864] RBP: 5620deeb2df0 R08: 5620deef4880 R09: [6.845867] R10: 0005 R11: 0246 R12: 0002 [6.845870] R13: 5620deeb5330 R14: R15: 5620deef0410 [6.845875] [6.845877] Modules linked in: amdgpu(+) drm_ttm_helper ttm iommu_v2 crct10dif_pclmul gpu_sched crc32_pclmul crc32c_intel drm_buddy drm_display_helper ucsi_ccg nvme igb typec_ucsi ghash_clmulni_intel ccp cec typec sp5100_tco nvme_core dca wmi ip6_tables ip_tables ipmi_devintf ipmi_msghandler fuse [6.845898] CR2: 0038 [6.845900] ---[ end trace ]--- $ /usr/src/kernels/5.19.0-0.rc5.20220705gitc1084b6c5620.40.fc37.x86_64/scripts/faddr2line /lib/debug/lib/modules/5.19.0-0.rc5.20220705gitc1084b6c5620.40.fc37.x86_64/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko.debug 
amdgpu_pci_probe+0xcb amdgpu_pci_probe+0xcb/0x360: amdgpu_pci_probe at /usr/src/debug/kernel-5.19-rc5-49-gc1084b6c5620/linux-5.19.0-0.rc5.20220705gitc1084b6c5620.40.fc37.x86_64/driv
[Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.
Hi guys. Between commits fdaf9a5840ac and babf0bb978e3 the GPU stopped entering graphics mode; instead I see a black screen with a constantly glowing cursor. Demonstration: https://youtu.be/rGL4LsHMae4 In the kernel logs there are references to hung processes: [ 149.363465] rfkill: input handler disabled [ 249.072478] INFO: task (brt-dbus):1645 blocked for more than 122 seconds. [ 249.072515] Tainted: GWL --- 5.19.0-0.rc0.20220526gitbabf0bb978e3.4.fc37.x86_64 #1 [ 249.072520] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 249.072524] task:(brt-dbus) state:D stack:14384 pid: 1645 ppid: 1 flags:0x0002 [ 249.072536] Call Trace: [ 249.072540] [ 249.072551] __schedule+0x492/0x1640 [ 249.072560] ? lock_is_held_type+0xe8/0x140 [ 249.072569] ? find_held_lock+0x32/0x80 [ 249.072584] schedule+0x4e/0xb0 [ 249.072591] schedule_preempt_disabled+0x14/0x20 [ 249.072597] __mutex_lock+0x423/0x890 [ 249.072608] ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu] [ 249.072818] ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu] [ 249.073010] amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu] [ 249.073207] amdgpu_flush+0x25/0x40 [amdgpu] [ 249.074088] filp_close+0x31/0x70 [ 249.074097] __close_range+0x130/0x320 [ 249.074108] __x64_sys_close_range+0x13/0x20 [ 249.074113] do_syscall_64+0x5b/0x80 [ 249.074120] ? lockdep_hardirqs_on+0x7d/0x100 [ 249.074127] ? do_syscall_64+0x67/0x80 [ 249.074135] ? do_syscall_64+0x67/0x80 [ 249.074140] ? lockdep_hardirqs_on+0x7d/0x100 [ 249.074147] ? do_syscall_64+0x67/0x80 [ 249.074154] ? lock_is_held_type+0xe8/0x140 [ 249.074164] ? asm_exc_page_fault+0x27/0x30 [ 249.074171] ?
lockdep_hardirqs_on+0x7d/0x100 [ 249.074178] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 249.074184] RIP: 0033:0x7fd71f54f97b [ 249.074208] RSP: 002b:7fffc8e752a8 EFLAGS: 0246 ORIG_RAX: 01b4 [ 249.074215] RAX: ffda RBX: 7fffc8e752b0 RCX: 7fd71f54f97b [ 249.074220] RDX: RSI: RDI: 0027 [ 249.074224] RBP: 7fffc8e75330 R08: R09: 7fffc8e75380 [ 249.074228] R10: 7fffc8e751f0 R11: 0246 R12: 0002 [ 249.074232] R13: 7fffc8e75340 R14: R15: 0002 [ 249.074252] [ 249.074261] INFO: task (ostnamed):1718 blocked for more than 122 seconds. [ 249.074266] Tainted: GWL --- 5.19.0-0.rc0.20220526gitbabf0bb978e3.4.fc37.x86_64 #1 [ 249.074285] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 249.074289] task:(ostnamed) state:D stack:14552 pid: 1718 ppid: 1 flags:0x0006 [ 249.074299] Call Trace: [ 249.074302] [ 249.074310] __schedule+0x492/0x1640 [ 249.074316] ? lock_is_held_type+0xe8/0x140 [ 249.074324] ? find_held_lock+0x32/0x80 [ 249.074339] schedule+0x4e/0xb0 [ 249.074346] schedule_preempt_disabled+0x14/0x20 [ 249.074352] __mutex_lock+0x423/0x890 [ 249.074361] ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu] [ 249.074564] ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu] [ 249.074754] amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu] [ 249.074950] amdgpu_flush+0x25/0x40 [amdgpu] [ 249.075133] filp_close+0x31/0x70 [ 249.075140] __close_range+0x130/0x320 [ 249.075150] __x64_sys_close_range+0x13/0x20 [ 249.075154] do_syscall_64+0x5b/0x80 [ 249.075164] ? lock_is_held_type+0xe8/0x140 [ 249.075175] ? do_syscall_64+0x67/0x80 [ 249.075180] ? lockdep_hardirqs_on+0x7d/0x100 [ 249.075187] ? do_syscall_64+0x67/0x80 [ 249.075194] ? lock_is_held_type+0xe8/0x140 [ 249.075204] ? asm_exc_page_fault+0x27/0x30 [ 249.075210] ? 
lockdep_hardirqs_on+0x7d/0x100 [ 249.075217] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 249.075222] RIP: 0033:0x7fd71f54f97b [ 249.075231] RSP: 002b:7fffc8e752a8 EFLAGS: 0246 ORIG_RAX: 01b4 [ 249.075237] RAX: ffda RBX: 7fffc8e752b0 RCX: 7fd71f54f97b [ 249.075241] RDX: RSI: 00b9 RDI: 0027 [ 249.075245] RBP: 7fffc8e75330 R08: R09: 7fffc8e75380 [ 249.075249] R10: 7fffc8e751f0 R11: 0246 R12: 0004 [ 249.075253] R13: 7fffc8e75340 R14: R15: 0003 [ 249.075289] [ 249.075294] INFO: task (pcscd):1749 blocked for more than 122 seconds. [ 249.075298] Tainted: GWL --- 5.19.0-0.rc0.20220526gitbabf0bb978e3.4.fc37.x86_64 #1 [ 249.075302] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 249.075306] task:(pcscd) state:D stack:14256 pid: 1749 ppid: 1 flags:0x0002 [ 249.075314] Call Trace: [ 249.075318] [ 249.075325] __schedule+0x492/0x1640 [ 249.075331] ? lock_is_held_type+0xe8/0x140 [ 249.075339] ? find_held_lock+0x32/0x80 [ 249.075353] schedule+0x4e/0xb0 [ 249.075360]
Re: Screen corruption using radeon kernel driver
On Mon, Apr 25, 2022 at 01:22:04PM -0400, Alex Deucher wrote: > + dri-devel > > On Mon, Apr 25, 2022 at 3:33 AM Krylov Michael wrote: > > > > Hello! > > > > After updating my Linux kernel from version 4.19 (Debian 10 version) to > > 5.10 (packaged with Debian 11), I've noticed that the image > > displayed on my older computer, 32-bit Pentium 4 using ATI Radeon X1950 > > AGP video card is severely corrupted in the graphical (Xorg and Wayland) > > mode: all kinds of black and white stripes across the screen, some > > letters missing, etc. > > > > I've checked several options (Xorg drivers, Wayland instead of > > Xorg, radeon.agpmode=-1 in kernel command line and so on), but the > > problem persisted. I've managed to find that the problem was in the > > kernel, as everything worked well with 4.19 kernel with everything > > else being from Debian 11. > > > > I have managed to find the culprit of that corruption, that is the > > commit 33b3ad3788aba846fc8b9a065fe2685a0b64f713 on the linux kernel. > > Reverting this commit and building the kernel with that commit reverted > > fixes the problem. Disabling HIMEM also gets rid of that problem. But it > > also leaves the system with less that 1G of RAM, which is, of course, > > undesirable. > > > > Apparently this problem is somewhat known, as I can tell after googling > > for the commit id, see this link for example: > > https://lkml.org/lkml/2020/1/9/518 > > > > Mageia distro, for example, reverted this commit in the kernel they are > > building: > > > > http://sophie.zarb.org/distrib/Mageia/7/i586/by-pkgid/b9193a4f85192bc57f4d770fb9bb399c/files/32 > > > > I've reported this bug to Debian bugtracker, checked the recent verion > > of the kernel (5.17), bug still persists. 
Here's a link to the Debian > > bug page: > > > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=993670 > > > > I'm not sure if reverting this commit is the correct way to go, so if > > you need to check any changes/patches that I could apply and test on > > the real hardware, I'll be glad to do that (but please keep in mind > > that testing could take some time, I don't have access to this computer > > 24/7, but I'll do my best to respond ASAP). > > I would be happy to revert that commit. I attempted to revert it a > year or so ago, but Christoph didn't want to. He was going to look > further into it. I was not able to repro the issue. It seemed to be > related to highmem support. You might try disabling that. Here is > the previous thread for reference: > https://lists.freedesktop.org/archives/amd-gfx/2020-September/053922.html > > Alex Yeah, I tried to disable HIMEM, and that indeed fixes the problem, but it leaves me with less than 1G of available memory which is undesirable.
Re: [Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some ga
On Fri, Apr 15, 2022 at 1:04 PM Christian König wrote: > > No, I just couldn't find time during all that bug fixing :) > > Sorry for the delay, going to take a look after the eastern holiday here. > > Christian. This message is just for the record. The issue was fixed somewhere between commit b253435746d9a4a and 5.18-rc4. -- Best Regards, Mike Gavrilov.
Re: [Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some ga
On Sat, Apr 9, 2022 at 7:27 PM Christian König wrote: > > That's unfortunately not the end of the story. > > This is fixing your problem, but reintroducing the original problem that > we call the syncobj with a lock held which can crash badly as well. > > Going to take a closer look on Monday. I hope you can test a few more > patches to help narrow down what's actually going wrong here. > > Thanks, > Christian. > Hi Christian. I'm sorry to trouble you. Have you forgotten about this issue? -- Best Regards, Mike Gavrilov.
Re: [Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some ga
On Fri, 8 Apr 2022 at 19:27, Christian König wrote: > > Please test the attached patch, it just re-introduce the lock without > doing much else. > > And does your branch contain the following patch: > > commit d18b8eadd83e3d8d63a45f9479478640dbcfca02 > Author: Christian König > Date: Wed Feb 23 14:35:31 2022 +0100 > > drm/amdgpu: install ctx entities with cmpxchg > > Since we removed the context lock we need to make sure that not two > threads > are trying to install an entity at the same time. > > Signed-off-by: Christian König > Fixes: 461fa7b0ac565e ("drm/amdgpu: remove ctx->lock") > Reviewed-by: Andrey Grodzovsky > Signed-off-by: Alex Deucher All of the listed games now work with the attached patch. The flood of "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" messages is also gone. Thanks. Tested-by: Mikhail Gavrilov -- Best Regards, Mike Gavrilov.
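Whether the warning flood is really gone on a test kernel can be checked by counting the warning in the kernel log. A minimal sketch, assuming the log is piped in on stdin (for example from `dmesg`); the function name `count_fence_warnings` is made up for illustration and is not part of any tool mentioned in this thread:

```shell
#!/bin/sh
# Hypothetical helper: count dma_fence_array_create warnings in a kernel log
# read from stdin. Prints 0 when the warning does not occur.
count_fence_warnings() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # nothing matches, so "|| true" keeps the helper usable under "set -e".
    grep -c 'WARNING: .*dma_fence_array_create' || true
}

# Usage (assumes dmesg is readable by the current user):
#   dmesg | count_fence_warnings
```

Only the "WARNING:" header line of each splat is counted, so the RIP and call-trace lines that also mention the symbol do not inflate the number.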
Re: [Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some ga
On Fri, 8 Apr 2022 at 16:13, Christian König wrote: > I own you a beer. > > I still don't know what happens here, but that makes at least a bit more > sense than a patch which only changes comments :) > > Looks like we are missing something here. Can I send you a patch to try > something later today? Yes, please feel free to send me a patch for testing. -- Best Regards, Mike Gavrilov.
Re: [Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some ga
Hi Christian > those are two independent and already known problems. > > The warning triggered from the sync_file is already fixed in > drm-misc-next-fixes, but so far I couldn't figure out why the games > suddenly doesn't work any more. I thought that these warnings are related to the stuck of the listed games. > There is a bug report for that, but bisecting the changes didn't yielded > anything valuable so far. > > So if you can come up with something that would be rather valuable. I found how to fix my build problems. They are all related to gcc12. And making again git bisect and found which commit lead to stuck the games "Forza Horizon 5", "Forza Horizon 4", "Cyberpunk 2077". At least it affected hardware Radeon 6900 XT, Radeon 6800M and Radeon VII. $ git bisect log git bisect start # good: [ed4643521e6af8ab8ed1e467630a85884d2696cf] Merge tag 'arm-dt-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc git bisect good ed4643521e6af8ab8ed1e467630a85884d2696cf # bad: [34af78c4e616c359ed428d79fe4758a35d2c5473] Merge tag 'iommu-updates-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu git bisect bad 34af78c4e616c359ed428d79fe4758a35d2c5473 # good: [4a0cb83ba6e0cd73a50fa4f84736846bf0029f2b] netdevice: add missing dm_private kdoc git bisect good 4a0cb83ba6e0cd73a50fa4f84736846bf0029f2b # skip: [2ab82efeeed885c0210a0029df93bb95a316e8c7] Merge tag 'drm-intel-gt-next-2022-03-03' of git://anongit.freedesktop.org/drm/drm-intel into drm-next git bisect skip 2ab82efeeed885c0210a0029df93bb95a316e8c7 # good: [00598b056aa6d46c7a6819efa850ec9d0d690d76] scsi: smartpqi: Expose SAS address for SATA drives git bisect good 00598b056aa6d46c7a6819efa850ec9d0d690d76 # good: [00598b056aa6d46c7a6819efa850ec9d0d690d76] scsi: smartpqi: Expose SAS address for SATA drives git bisect good 00598b056aa6d46c7a6819efa850ec9d0d690d76 # skip: [c674c5b9342e5cb0f3d9e9bcaf37dbe2087845e5] drm/i915/xehp: CCS should use RCS setup functions git bisect skip 
c674c5b9342e5cb0f3d9e9bcaf37dbe2087845e5 # good: [f0d4ce59f4d48622044933054a0e0cefa91ba15e] drm/i915: Disable DRRS on IVB/HSW port != A git bisect good f0d4ce59f4d48622044933054a0e0cefa91ba15e # skip: [6de7e4f02640fba2ffa6ac04e2be13785d614175] Merge tag 'drm-msm-next-2022-03-01' of https://gitlab.freedesktop.org/drm/msm into drm-next git bisect skip 6de7e4f02640fba2ffa6ac04e2be13785d614175 # bad: [868f4357ed0d1e2f96bbd67d4ac862aa6335effe] drm/amd/display: Add DMUB support for DCN316 git bisect bad 868f4357ed0d1e2f96bbd67d4ac862aa6335effe # good: [39da460fd4c0f8e7290dcc9cbfc9375de9d0eeca] drm/amd/display: Fix DP LT sequence on EQ fail git bisect good 39da460fd4c0f8e7290dcc9cbfc9375de9d0eeca # good: [3f268ef06f8cf3c481dbd5843d564f5170c6df54] drm/ttm: add back a reference to the bdev to the res manager git bisect good 3f268ef06f8cf3c481dbd5843d564f5170c6df54 # bad: [123db17ddff007080d464e785689fb14f94cbc7a] Merge tag 'amd-drm-next-5.18-2022-02-11-1' of https://gitlab.freedesktop.org/agd5f/linux into drm-next git bisect bad 123db17ddff007080d464e785689fb14f94cbc7a # bad: [24992ab0b8b0d2521caa9c3dcbed0e2a56cbe3d0] drm/amdkfd: Fix prototype warning for get_process_num_bos git bisect bad 24992ab0b8b0d2521caa9c3dcbed0e2a56cbe3d0 # good: [1cbbc8d4f788af4c260ef3cae05902ef7b191197] drm/radeon/uvd: Fix forgotten unmap buffer objects git bisect good 1cbbc8d4f788af4c260ef3cae05902ef7b191197 # good: [69f915cc97c4bb82b34105a47abf613f7c87215d] drm/amdgpu: loose check for umc poison mode git bisect good 69f915cc97c4bb82b34105a47abf613f7c87215d # good: [8bbd4d83a68beaf54ae01b2e2aa2024ff1dfc0ba] drm/amdgpu: Reset OOB table error count info git bisect good 8bbd4d83a68beaf54ae01b2e2aa2024ff1dfc0ba # bad: [1915a433954262ac7466469d1a4684ac54218af4] drm/amdgpu: adjust register address calculation git bisect bad 1915a433954262ac7466469d1a4684ac54218af4 # bad: [461fa7b0ac565ef25c1da0ced31005dd437883a7] drm/amdgpu: remove ctx->lock git bisect bad 461fa7b0ac565ef25c1da0ced31005dd437883a7 # 
first bad commit: [461fa7b0ac565ef25c1da0ced31005dd437883a7] drm/amdgpu: remove ctx->lock 461fa7b0ac565ef25c1da0ced31005dd437883a7 is the first bad commit commit 461fa7b0ac565ef25c1da0ced31005dd437883a7 Author: Ken Xue Date: Fri Feb 11 16:18:46 2022 -0500 drm/amdgpu: remove ctx->lock KMD reports a warning on holding a lock from drm_syncobj_find_fence, when running amdgpu_test case “syncobj timeline test”. ctx->lock was designed to prevent concurrent "amdgpu_ctx_wait_prev_fence" calls and avoid dead reservation lock from GPU reset. since no reservation lock is held in latest GPU reset any more, ctx->lock can be simply removed and concurrent "amdgpu_ctx_wait_prev_fence" call also can be prevented by PD root bo reservation lock. call stacks: = //hold lock amdgpu_cs_ioctl->amdgpu_cs_parser_init->mutex_lock(>ctx->lock); … //report warning amdgpu_cs_dependencies->amdgpu_cs_process_syncobj_timeline_in_dep \
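A long bisect log like the one above is easier to review with a quick tally of the verdicts, since many skipped commits can widen the range of candidates. A minimal sketch, assuming the output of `git bisect log` is piped in on stdin with one command per line; the function name `bisect_summary` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: summarise a "git bisect log" read from stdin.
# Comment lines (starting with "#") and "git bisect start" are ignored.
bisect_summary() {
    awk '
        $1 == "git" && $2 == "bisect" &&
        ($3 == "good" || $3 == "bad" || $3 == "skip") { n[$3]++ }
        END { printf "good=%d bad=%d skip=%d\n", n["good"], n["bad"], n["skip"] }
    '
}

# Usage, inside the kernel tree while a bisection is in progress:
#   git bisect log | bisect_summary
```

The tally only counts verdict commands; it does not validate the hashes or detect repeated entries.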
[Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some games
Hi, Between commits ed4643521e6a and 34af78c4e616 something was broken. I noted that kernel log flooded with warning message "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" when some games are running: "Resident Evil Village", "Marvel's Avengers", "The Dark Pictures Anthology: House of Ashes". [16999.958726] [ cut here ] [16999.958731] WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120 [16999.958738] Modules linked in: xone_gip_chatpad(OE) xone_gip_gamepad(OE) xone_gip_common(OE) ff_memless tls uinput rfcomm snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_midi_event nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc binfmt_misc iwlmvm vfat intel_rapl_msr fat intel_rapl_common snd_hda_codec_realtek mac80211 snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi libarc4 snd_hda_intel edac_mce_amd snd_intel_dspcfg snd_usb_audio snd_intel_sdw_acpi btusb kvm_amd snd_hda_codec btrtl btbcm iwlwifi btintel snd_hda_core snd_usbmidi_lib uvcvideo snd_hwdep kvm iwlmei snd_rawmidi videobuf2_vmalloc xone_dongle(OE) videobuf2_memops xone_gip_bus(OE) snd_seq btmtk videobuf2_v4l2 videobuf2_common snd_seq_device irqbypass bluetooth cfg80211 snd_pcm rapl videodev [16999.958799] eeepc_wmi asus_wmi snd_timer sparse_keymap platform_profile ecdh_generic video wmi_bmof pcspkr snd k10temp i2c_piix4 joydev mc soundcore rfkill mei acpi_cpufreq zram hid_logitech_hidpp hid_logitech_dj amdgpu drm_ttm_helper ttm crct10dif_pclmul ccp crc32_pclmul ucsi_ccg iommu_v2 crc32c_intel typec_ucsi gpu_sched ghash_clmulni_intel sp5100_tco drm_dp_helper typec igb nvme nvme_core dca wmi scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables dm_multipath ipmi_devintf 
ipmi_msghandler fuse [16999.958862] CPU: 31 PID: 51848 Comm: GWT.exe Tainted: GB W OEL - --- 5.18.0-0.rc0.20220401gite8b767f5e04097a.15.fc37.x86_64 #1 [16999.958865] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4204 02/24/2022 [16999.958867] RIP: 0010:dma_fence_array_create+0x101/0x120 [16999.958871] Code: 45 85 e4 75 10 eb 2a 48 81 fa c0 aa 52 ab 74 1a 83 e8 01 72 1c 48 63 d0 48 8b 54 d5 00 48 8b 52 08 48 81 fa 60 aa 52 ab 75 dd <0f> 0b 83 e8 01 73 e4 48 83 c4 08 48 89 d8 5b 5d 41 5c 41 5d 41 5e [16999.958874] RSP: 0018:b03c071f7e08 EFLAGS: 00010246 [16999.958877] RAX: 0001 RBX: 98fdb03c6d00 RCX: 00510e99 [16999.958879] RDX: ab52aac0 RSI: 98fdb03c6d10 RDI: 98fdb03c6d00 [16999.958880] RBP: 98fa31c59e40 R08: 0001 R09: [16999.958882] R10: R11: R12: 0002 [16999.958883] R13: R14: 98fdb03c6d40 R15: 0001 [16999.958885] FS: 4789f640() GS:9907ea60() knlGS:29b7 [16999.958887] CS: 0010 DS: ES: CR0: 80050033 [16999.95] CR2: 7ff41eee8000 CR3: 2856a000 CR4: 00350ee0 [16999.958890] Call Trace: [16999.958893] [16999.958897] sync_file_ioctl+0x83d/0x9f0 [16999.958904] __x64_sys_ioctl+0x8d/0xc0 [16999.958908] do_syscall_64+0x3a/0x80 [16999.958913] entry_SYSCALL_64_after_hwframe+0x44/0xae [16999.958917] RIP: 0033:0x7ff5e850b29f [16999.958941] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00 [16999.958943] RSP: 002b:4789d540 EFLAGS: 0246 ORIG_RAX: 0010 [16999.958946] RAX: ffda RBX: 7ff5d5637040 RCX: 7ff5e850b29f [16999.958948] RDX: 4789d740 RSI: c0303e03 RDI: 0260 [16999.958949] RBP: 0260 R08: 0001 R09: [16999.958951] R10: R11: 0246 R12: 4789d740 [16999.958953] R13: R14: c0303e03 R15: [16999.958958] [16999.958959] irq event stamp: 0 [16999.958961] hardirqs last enabled at (0): [<>] 0x0 [16999.958964] hardirqs last disabled at (0): [] copy_process+0x9f1/0x1e20 [16999.958968] softirqs last enabled 
at (0): [] copy_process+0x9f1/0x1e20 [16999.958971] softirqs last disabled at (0): [<>] 0x0 [16999.958974] ---[ end trace ]--- The games "Forza Horizon 5", "Forza Horizon 4", "Cyberpunk 2077", "Ghostwire: Tokyo" stopped working. When these games crashed I again saw the same warning message as above [2]. Difference only in thead name and addresses. [ 643.442353] [ cut here ] [ 643.442358] WARNING: CPU: 24 PID: 7824 at
Re: [BUG] VAAPI encoder cause kernel panic if encoded video in 4K
On Wed, 15 Sept 2021 at 14:55, Christian König wrote: > > Yes, absolutely. You should see GPU resets and recovery in the system log > after that. Unfortunately, no desktop environment (DE) survives a GPU reset. All applications terminate abnormally, so in practice a reset is equivalent to a reboot (and a denial of service). :( -- Best Regards, Mike Gavrilov.
Re: [BUG] VAAPI encoder cause kernel panic if encoded video in 4K
On Wed, 14 Apr 2021 at 11:48, Christian König < ckoenig.leichtzumer...@gmail.com> wrote: > > That is expected behavior, the application is just buggy and causing a > page fault on the GPU. > > The kernel should just not crash with a backtrace. > > Regards, > Christian. > If the GPU then hangs with the message "[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!", is that also expected behavior? Kernel log: https://pastebin.com/WkhATKXX -- Best Regards, Mike Gavrilov.
Re: [BUG] VAAPI encoder cause kernel panic if encoded video in 4K
On Wed, 21 Apr 2021 at 11:42, Christian König wrote: > I can try, but I'm not sure if we even have the full page fault handling > for Navi in 5.12. > That would be great. For me this patch works as expected: I have not seen the panic "kernel BUG at drivers/dma-buf/dma-resv.c:287!" for several days. In any case, I will be waiting for news. -- Best Regards, Mike Gavrilov. ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [BUG] VAAPI encoder cause kernel panic if encoded video in 4K
On Wed, 14 Apr 2021 at 11:48, Christian König wrote: > > >> commit f63da9ae7584280582cbc834b20cc18bfb203b14 > >> Author: Philip Yang > >> Date: Thu Apr 1 00:22:23 2021 -0400 > >> > >> drm/amdgpu: reserve fence slot to update page table > >> > > That is expected behavior, the application is just buggy and causing a > page fault on the GPU. > > The kernel should just not crash with a backtrace. > Is there any chance this commit will be backported to 5.12? I plan to submit a bug report to the OBS developers and don't want my system to hang again and again while I test their patches. -- Best Regards, Mike Gavrilov.
Re: [BUG] VAAPI encoder cause kernel panic if encoded video in 4K
On Wed, 14 Apr 2021 at 03:22, Leo Liu wrote: > > This is decode command line, are you seeing issue with encode or > decode? I meant that the kernel panic described above happens only when OBS records or streams video with the VAAPI encoder. Grabbing and encoding video with ffmpeg (the command given as an example) does not trigger it, but the video encoded by ffmpeg does not play properly. I believe this is not a bug in ffmpeg itself, because with the CPU encoder (libx264) the resulting video plays properly. > you also said `ffmpeg -f x11grab -framerate 60 -video_size > 3840x2160 -i :0.0 -vf 'format=nv12,hwupload' -vaapi_device > /dev/dri/renderD128 -vcodec h264_vaapi output3.mp4` doesn't cause such > issue, right? This command does not cause the described kernel panic, but the resulting video looks like it plays at 0.01 FPS. > > Yes. > I filed a bug report about the VAAPI encoder in ffmpeg here: https://gitlab.freedesktop.org/drm/amd/-/issues/1570 We can continue there. -- Best Regards, Mike Gavrilov.
Re: [BUG] VAAPI encoder cause kernel panic if encoded video in 4K
On Tue, 13 Apr 2021 at 04:55, Leo Liu wrote: > > >It curious why ffmpeg does not cause such issues. > >For example such command not cause kernel panic: > >$ ffmpeg -f x11grab -framerate 60 -video_size 3840x2160 -i :0.0 -vf > >'format=nv12,hwupload' -vaapi_device /dev/dri/renderD128 -vcodec > >h264_vaapi output3.mp4 > > What command are you using to see the issue or how can the issue be > reproduced? $ mpv output4.mp4 And of course, I know how it should work, because when I encode the video with the CPU encoder (libx264) everything is fine. $ ffmpeg -f x11grab -framerate 60 -video_size 3840x2160 -i :0.0 -vcodec libx264 output3.mp4 > Please file a freedesktop gitlab issue, so we can keep track of it. Here? https://gitlab.freedesktop.org/drm/amd/-/issues Also, I found that other users face the same problem. https://bbs.archlinux.org/viewtopic.php?id=261965 -- Best Regards, Mike Gavrilov.
Re: [BUG] VAAPI encoder cause kernel panic if encoded video in 4K
On Tue, 13 Apr 2021 at 12:29, Christian König wrote: > > Hi Mikhail, > > the crash is a known issue and should be fixed by: > > commit f63da9ae7584280582cbc834b20cc18bfb203b14 > Author: Philip Yang > Date: Thu Apr 1 00:22:23 2021 -0400 > > drm/amdgpu: reserve fence slot to update page table > Unfortunately, this commit does not fix the initial problem. 1. The resulting video is jerky if it is grabbed and encoded with ffmpeg (h264_vaapi codec). 2. OBS still crashes if I try to record or stream video. 3. The kernel log still shows the message "amdgpu: [mmhub] page fault (src_id:0 ring:0 vmid:4 pasid:32770, for process obs" when I try to record or stream video with OBS. -- Best Regards, Mike Gavrilov.
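To see which processes trigger the mmhub page faults mentioned above, and how often, the kernel log can be reduced to a per-process count. A minimal sketch, assuming the log is piped in on stdin and that the lines follow the "for process NAME pid N" format seen in the traces from the original report; the function name `summarize_mmhub_faults` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: count amdgpu mmhub page faults per process and pid.
# Reads a kernel log on stdin; prints "COUNT NAME PID", most frequent first.
summarize_mmhub_faults() {
    # Keep only fault lines and reduce each to "NAME PID". Process names
    # containing spaces are not handled by this simple pattern.
    sed -n 's/.*\[mmhub] page fault .*for process \([^ ]*\) pid \([0-9]*\).*/\1 \2/p' |
        awk '{ n[$0]++ } END { for (k in n) print n[k], k }' |
        sort -rn
}

# Usage (assumes dmesg is readable by the current user):
#   dmesg | summarize_mmhub_faults
```

If OBS is the only entry after a recording attempt, that would be consistent with the faults coming from the VAAPI encode path rather than from the desktop session.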
[BUG] VAAPI encoder cause kernel panic if encoded video in 4K
Video demonstration: https://youtu.be/3nkvUeB0GSw How looks kernel traces. 1. [ 7315.156460] amdgpu :0b:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:0 vmid:6 pasid:32779, for process obs pid 23963 thread obs:cs0 pid 23977) [ 7315.156490] amdgpu :0b:00.0: amdgpu: in page starting at address 0x80011fdf5000 from client 18 [ 7315.156495] amdgpu :0b:00.0: amdgpu: MMVM_L2_PROTECTION_FAULT_STATUS:0x00641A51 [ 7315.156500] amdgpu :0b:00.0: amdgpu: Faulty UTCL2 client ID: VCN1 (0xd) [ 7315.156503] amdgpu :0b:00.0: amdgpu: MORE_FAULTS: 0x1 [ 7315.156505] amdgpu :0b:00.0: amdgpu: WALKER_ERROR: 0x0 [ 7315.156509] amdgpu :0b:00.0: amdgpu: PERMISSION_FAULTS: 0x5 [ 7315.156510] amdgpu :0b:00.0: amdgpu: MAPPING_ERROR: 0x0 [ 7315.156513] amdgpu :0b:00.0: amdgpu: RW: 0x1 [ 7315.156545] amdgpu :0b:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:0 vmid:6 pasid:32779, for process obs pid 23963 thread obs:cs0 pid 23977) [ 7315.156549] amdgpu :0b:00.0: amdgpu: in page starting at address 0x80011fdf6000 from client 18 [ 7315.156551] amdgpu :0b:00.0: amdgpu: MMVM_L2_PROTECTION_FAULT_STATUS:0x00641A51 [ 7315.156554] amdgpu :0b:00.0: amdgpu: Faulty UTCL2 client ID: VCN1 (0xd) [ 7315.156556] amdgpu :0b:00.0: amdgpu: MORE_FAULTS: 0x1 [ 7315.156559] amdgpu :0b:00.0: amdgpu: WALKER_ERROR: 0x0 [ 7315.156561] amdgpu :0b:00.0: amdgpu: PERMISSION_FAULTS: 0x5 [ 7315.156564] amdgpu :0b:00.0: amdgpu: MAPPING_ERROR: 0x0 [ 7315.156566] amdgpu :0b:00.0: amdgpu: RW: 0x1 This is a harmless panic, but nevertheless VAAPI does not work and the application that tried to use the encoder crashed. 2. If we tries again and again encode 4K stream through VAAPI we can encounter the next trace: [12341.860944] [ cut here ] [12341.860961] kernel BUG at drivers/dma-buf/dma-resv.c:287! 
[12341.860968] invalid opcode: [#1] SMP NOPTI [12341.860972] CPU: 28 PID: 18261 Comm: kworker/28:0 Tainted: G W- --- 5.12.0-0.rc5.180.fc35.x86_64+debug #1 [12341.860977] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 3402 01/13/2021 [12341.860981] Workqueue: events amdgpu_irq_handle_ih_soft [amdgpu] [12341.861102] RIP: 0010:dma_resv_add_shared_fence+0x2ab/0x2c0 [12341.861108] Code: fd ff ff be 01 00 00 00 e8 e2 74 dc ff e9 ac fd ff ff 48 83 c4 18 be 03 00 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f e9 c5 74 dc ff <0f> 0b 31 ed e9 73 fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f [12341.861112] RSP: 0018:b2f084c87bb0 EFLAGS: 00010246 [12341.861115] RAX: 0002 RBX: 9f9551184998 RCX: [12341.861119] RDX: 0002 RSI: RDI: 9f9551184a50 [12341.861122] RBP: 0002 R08: R09: [12341.861124] R10: R11: R12: 9f91b9a18140 [12341.861127] R13: 9f91c9020740 R14: 9f91c9020768 R15: [12341.861130] FS: () GS:9f984a20() knlGS: [12341.861133] CS: 0010 DS: ES: CR0: 80050033 [12341.861136] CR2: 144e080d8000 CR3: 00010e98c000 CR4: 00350ee0 [12341.861139] Call Trace: [12341.861143] amdgpu_vm_sdma_commit+0x182/0x220 [amdgpu] [12341.861251] amdgpu_vm_bo_update_mapping.constprop.0+0x278/0x3c0 [amdgpu] [12341.861356] amdgpu_vm_handle_fault+0x145/0x290 [amdgpu] [12341.861461] gmc_v10_0_process_interrupt+0xb3/0x250 [amdgpu] [12341.861571] ? _raw_spin_unlock_irqrestore+0x37/0x40 [12341.861577] ? lock_acquire+0x179/0x3a0 [12341.861583] ? lock_acquire+0x179/0x3a0 [12341.861587] ? amdgpu_irq_dispatch+0xc6/0x240 [amdgpu] [12341.861692] amdgpu_irq_dispatch+0xc6/0x240 [amdgpu] [12341.861796] amdgpu_ih_process+0x90/0x110 [amdgpu] [12341.861900] process_one_work+0x2b0/0x5e0 [12341.861906] worker_thread+0x55/0x3c0 [12341.861910] ? process_one_work+0x5e0/0x5e0 [12341.861915] kthread+0x13a/0x150 [12341.861918] ? 
__kthread_bind_mask+0x60/0x60 [12341.861922] ret_from_fork+0x22/0x30 [12341.861928] Modules linked in: uinput snd_seq_dummy rfcomm snd_hrtimer netconsole nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink cmac bnep sunrpc vfat fat hid_logitech_hidpp joydev hid_logitech_dj mt76x2u mt76x2_common mt76x02_usb mt76_usb mt76x02_lib intel_rapl_msr intel_rapl_common mt76 iwlmvm mac80211 snd_hda_codec_realtek edac_mce_amd snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi btusb kvm_amd snd_hda_intel btrtl snd_intel_dspcfg btbcm snd_intel_sdw_acpi snd_usb_audio uvcvideo btintel snd_hda_codec videobuf2_vmalloc snd_usbmidi_lib videobuf2_memops iwlwifi kvm bluetooth snd_rawmidi snd_hda_core snd_seq videobuf2_v4l2 snd_hwdep videobuf2_common snd_seq_device eeepc_wmi snd_pcm videodev asus_wmi
Re: Unexpected multihop in swaput - likely driver bug.
On Wed, 7 Apr 2021 at 15:46, Christian König wrote: > > What hardware are you using $ inxi -bM System:Host: fedora Kernel: 5.12.0-0.rc6.184.fc35.x86_64+debug x86_64 bits: 64 Desktop: GNOME 40.0 Distro: Fedora release 35 (Rawhide) Machine: Type: Desktop Mobo: ASUSTeK model: ROG STRIX X570-I GAMING v: Rev X.0x serial: UEFI: American Megatrends v: 3603 date: 03/20/2021 Battery: ID-1: hidpp_battery_0 charge: N/A condition: N/A CPU: Info: 16-Core (2-Die) AMD Ryzen 9 3950X [MT MCP MCM] speed: 2365 MHz min/max: 2200/3500 MHz Graphics: Device-1: Advanced Micro Devices [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] driver: amdgpu v: kernel Device-2: AVerMedia Live Streamer CAM 513 type: USB driver: hid-generic,usbhid,uvcvideo Device-3: AVerMedia Live Gamer Ultra-Video type: USB driver: hid-generic,snd-usb-audio,usbhid,uvcvideo Display: wayland server: X.Org 1.21.1 driver: loaded: amdgpu,ati unloaded: fbdev,modesetting,radeon,vesa resolution: 3840x2160~60Hz OpenGL: renderer: AMD SIENNA_CICHLID (DRM 3.40.0 5.12.0-0.rc6.184.fc35.x86_64+debug LLVM 12.0.0) v: 4.6 Mesa 21.1.0-devel Network: Device-1: Intel Wi-Fi 6 AX200 driver: iwlwifi Device-2: Intel I211 Gigabit Network driver: igb Drives:Local Storage: total: 11.35 TiB used: 10.82 TiB (95.3%) Info: Processes: 805 Uptime: 12h 56m Memory: 31.18 GiB used: 21.88 GiB (70.2%) Shell: Bash inxi: 3.3.02 > and how do you exactly trigger this? I am running heavy games like "Zombie Army 4: Dead War" and switching to Gnome Activities and other applications while the game is running. -- Best Regards, Mike Gavrilov.
Unexpected multihop in swaput - likely driver bug.
Hi! During the 5.12 testing cycle I observed the repeatable bug when launching heavy graphic applications. The kernel log is flooded with the message "Unexpected multihop in swaput - likely driver bug.". Trace: [ 8707.814899] [ cut here ] [ 8707.814920] Unexpected multihop in swaput - likely driver bug. [ 8707.814998] WARNING: CPU: 19 PID: 28231 at drivers/gpu/drm/ttm/ttm_bo.c:1484 ttm_bo_swapout+0x40b/0x420 [ttm] [ 8707.815011] Modules linked in: tun uinput snd_seq_dummy rfcomm snd_hrtimer netconsole nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink cmac bnep sunrpc vfat fat hid_logitech_hidpp hid_logitech_dj intel_rapl_msr snd_hda_codec_realtek intel_rapl_common mt76x2u snd_hda_codec_generic mt76x2_common mt76x02_usb iwlmvm ledtrig_audio snd_hda_codec_hdmi mt76_usb mt76x02_lib snd_hda_intel mt76 snd_intel_dspcfg snd_intel_sdw_acpi mac80211 joydev snd_usb_audio snd_hda_codec uvcvideo edac_mce_amd videobuf2_vmalloc snd_hda_core snd_usbmidi_lib videobuf2_memops snd_hwdep iwlwifi snd_rawmidi btusb videobuf2_v4l2 kvm_amd snd_seq videobuf2_common btrtl btbcm videodev btintel snd_seq_device kvm mc cfg80211 bluetooth snd_pcm libarc4 eeepc_wmi snd_timer asus_wmi irqbypass xpad sp5100_tco [ 8707.815065] sparse_keymap ecdh_generic ff_memless video ecc wmi_bmof i2c_piix4 snd rapl k10temp soundcore rfkill acpi_cpufreq ip_tables amdgpu drm_ttm_helper ttm iommu_v2 gpu_sched drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel cec drm ghash_clmulni_intel igb ccp nvme dca nvme_core i2c_algo_bit wmi pinctrl_amd fuse [ 8707.815096] CPU: 19 PID: 28231 Comm: kworker/u64:1 Tainted: G W- --- 5.12.0-0.rc6.184.fc35.x86_64+debug #1 [ 8707.815101] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 3603 03/20/2021 [ 8707.815106] Workqueue: 
ttm_swap ttm_shrink_work [ttm] [ 8707.815114] RIP: 0010:ttm_bo_swapout+0x40b/0x420 [ttm] [ 8707.815122] Code: 10 00 00 48 c1 e2 0c 48 c1 e6 0c e8 3f 37 fa c8 e9 71 fe ff ff 83 f8 b8 0f 85 a9 fe ff ff 48 c7 c7 28 32 37 c0 e8 02 2b 98 c9 <0f> 0b e9 96 fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f [ 8707.815126] RSP: 0018:a306d20e7d58 EFLAGS: 00010292 [ 8707.815130] RAX: 0032 RBX: c0379260 RCX: 0027 [ 8707.815133] RDX: 918c091daae8 RSI: 0001 RDI: 918c091daae0 [ 8707.815136] RBP: 918602210058 R08: R09: [ 8707.815138] R10: a306d20e7b90 R11: 918c2e2fffe8 R12: ffb8 [ 8707.815141] R13: c03792a0 R14: 9186022102c0 R15: 0001 [ 8707.815145] FS: () GS:918c0900() knlGS: [ 8707.815148] CS: 0010 DS: ES: CR0: 80050033 [ 8707.815151] CR2: 325c84d12000 CR3: 000776c28000 CR4: 00350ee0 [ 8707.815154] Call Trace: [ 8707.815164] ttm_shrink+0xa6/0xe0 [ttm] [ 8707.815171] ttm_shrink_work+0x36/0x40 [ttm] [ 8707.815177] process_one_work+0x2b0/0x5e0 [ 8707.815185] worker_thread+0x55/0x3c0 [ 8707.815188] ? process_one_work+0x5e0/0x5e0 [ 8707.815192] kthread+0x13a/0x150 [ 8707.815196] ? __kthread_bind_mask+0x60/0x60 [ 8707.815199] ret_from_fork+0x22/0x30 [ 8707.815207] irq event stamp: 0 [ 8707.815209] hardirqs last enabled at (0): [<>] 0x0 [ 8707.815213] hardirqs last disabled at (0): [] copy_process+0x91b/0x1e10 [ 8707.815218] softirqs last enabled at (0): [] copy_process+0x91b/0x1e10 [ 8707.815222] softirqs last disabled at (0): [<>] 0x0 [ 8707.815224] ---[ end trace 29252aa87289bbaa ]--- Full kernel log: https://pastebin.com/mmAxwBYc $ /usr/src/kernels/`uname -r`/scripts/faddr2line /lib/debug/lib/modules/`uname -r`/kernel/drivers/gpu/drm/ttm/ttm.ko.debug ttm_bo_swapout+0x40b ttm_bo_swapout+0x40b/0x420: ttm_bo_swapout at /usr/src/debug/kernel-5.12-rc6/linux-5.12.0-0.rc6.184.fc35.x86_64/drivers/gpu/drm/ttm/ttm_bo.c:1484 (discriminator 1) $ git blame drivers/gpu/drm/ttm/ttm_bo.c -L 1475,1494 Blaming lines: 1% (20/1530), done. 
ebdf565169af0 (Dave Airlie 2020-10-29 13:58:52 +1000 1475) memset(, 0, sizeof(hop)); ba4e7d973dd09 (Thomas Hellstrom 2009-06-10 15:20:19 +0200 1476) ba4e7d973dd09 (Thomas Hellstrom 2009-06-10 15:20:19 +0200 1477) evict_mem = bo->mem; ba4e7d973dd09 (Thomas Hellstrom 2009-06-10 15:20:19 +0200 1478) evict_mem.mm_node = NULL; ce65b874001d7 (Christian König 2020-09-30 16:44:16 +0200 1479) evict_mem.placement = 0; ba4e7d973dd09 (Thomas Hellstrom 2009-06-10 15:20:19 +0200 1480) evict_mem.mem_type = TTM_PL_SYSTEM; ba4e7d973dd09 (Thomas Hellstrom 2009-06-10 15:20:19 +0200 1481) ebdf565169af0 (Dave Airlie 2020-10-29 13:58:52 +1000 1482) ret = ttm_bo_handle_move_mem(bo, _mem, true, , );
Re: [bug] 5.11-rc5 brought page allocation failure issue [ttm][amdgpu]
On Mon, 8 Feb 2021 at 14:18, Christian König wrote: > > Are the other problems gone as well? > And yes and no. The issue with monitor turns off was gone after rc6 (git3aaf0a27ffc2) But both traces 1) BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196 (kernel 5.11 specific) 2) WARNING: CPU: 14 PID: 504 at kernel/locking/lockdep.c:4618 lockdep_init_map_waits+0x18b/0x210 (Navi specific) are still happening on every boot. 1) [5.806032] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196 [5.806048] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 504, name: systemd-udevd [5.806064] 1 lock held by systemd-udevd/504: [5.806073] #0: 9c5ac2e4f258 (>mutex){}-{3:3}, at: device_driver_attach+0x3b/0xb0 [5.806097] CPU: 14 PID: 504 Comm: systemd-udevd Not tainted 5.11.0-0.rc6.20210204git61556703b610.145.fc34.x86_64 #1 [5.806117] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 3402 01/13/2021 [5.806135] Call Trace: [5.806142] dump_stack+0x8b/0xb0 [5.806153] ___might_sleep.cold+0xb6/0xc6 [5.806163] ? dcn30_clock_source_create+0x34/0xb0 [amdgpu] [5.806338] kmem_cache_alloc_trace+0x204/0x230 [5.806353] dcn30_clock_source_create+0x34/0xb0 [amdgpu] [5.806516] dcn30_create_resource_pool+0x1de/0x13b0 [amdgpu] [5.806678] ? rcu_read_lock_sched_held+0x3f/0x80 [5.806690] ? trace_kmalloc+0xb2/0xe0 [5.806699] ? __kmalloc+0x191/0x280 [5.806710] ? dc_create_resource_pool+0x110/0x1d0 [amdgpu] [5.806869] dc_create_resource_pool+0x110/0x1d0 [amdgpu] [5.807026] dc_create+0x205/0x790 [amdgpu] [5.807181] ? trace_kmalloc+0xb2/0xe0 [5.807190] ? kmem_cache_alloc_trace+0x174/0x230 [5.807203] amdgpu_dm_init.isra.0+0x1b9/0x250 [amdgpu] [5.807369] ? dev_vprintk_emit+0x171/0x195 [5.807385] ? dev_printk_emit+0x3e/0x40 [5.807403] dm_hw_init+0xe/0x20 [amdgpu] [5.807563] amdgpu_device_init.cold+0x179f/0x1afd [amdgpu] [5.807728] ? 
pci_conf1_read+0x9b/0xf0 [5.807741] amdgpu_driver_load_kms+0x68/0x280 [amdgpu] [5.807877] amdgpu_pci_probe+0x129/0x1b0 [amdgpu] [5.808009] local_pci_probe+0x42/0x80 [5.808020] pci_device_probe+0xd9/0x1a0 [5.808031] really_probe+0xf2/0x440 [5.808042] driver_probe_device+0xe1/0x150 [5.808053] device_driver_attach+0xa8/0xb0 [5.808063] __driver_attach+0x8c/0x150 [5.808071] ? device_driver_attach+0xb0/0xb0 [5.808080] ? device_driver_attach+0xb0/0xb0 [5.808090] bus_for_each_dev+0x67/0x90 [5.808101] bus_add_driver+0x12e/0x1f0 [5.808111] driver_register+0x8f/0xe0 [5.808119] ? 0xc0c02000 [5.808128] do_one_initcall+0x67/0x320 [5.808138] ? rcu_read_lock_sched_held+0x3f/0x80 [5.808148] ? trace_kmalloc+0xb2/0xe0 [5.808157] ? kmem_cache_alloc_trace+0x174/0x230 [5.808169] do_init_module+0x5c/0x270 [5.808179] __do_sys_init_module+0x130/0x190 [5.808196] do_syscall_64+0x33/0x40 [5.808205] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [5.808216] RIP: 0033:0x7f4d133aa40e [5.808225] Code: 48 8b 0d 65 1a 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 32 1a 0c 00 f7 d8 64 89 01 48 [5.808256] RSP: 002b:7ffc81317fb8 EFLAGS: 0246 ORIG_RAX: 00af [5.808272] RAX: ffda RBX: 563f79509ee0 RCX: 7f4d133aa40e [5.808285] RDX: 563f7951daa0 RSI: 00b8a85e RDI: 563f79f03db0 [5.808298] RBP: 563f79f03db0 R08: 563f79509fd0 R09: 7ffc813146be [5.808311] R10: 563a1aa70959 R11: 0246 R12: 563f7951daa0 [5.808324] R13: 563f7950e9c0 R14: R15: 563f7951f100 2) [6.064107] BUG: key 9c5adb339148 has not been registered! 
[6.064119] [ cut here ] [6.064121] DEBUG_LOCKS_WARN_ON(1) [6.064124] WARNING: CPU: 14 PID: 504 at kernel/locking/lockdep.c:4618 lockdep_init_map_waits+0x18b/0x210 [6.064131] Modules linked in: amdgpu(+) drm_ttm_helper ttm iommu_v2 gpu_sched drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel cec igb drm ghash_clmulni_intel ccp nvme dca i2c_algo_bit nvme_core wmi pinctrl_amd fuse [6.064147] CPU: 14 PID: 504 Comm: systemd-udevd Tainted: G W- --- 5.11.0-0.rc6.20210204git61556703b610.145.fc34.x86_64 #1 [6.064152] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 3402 01/13/2021 [6.064156] RIP: 0010:lockdep_init_map_waits+0x18b/0x210 [6.064159] Code: 00 85 c0 0f 84 77 ff ff ff 8b 3d 08 5e f1 01 85 ff 0f 85 69 ff ff ff 48 c7 c6 cc 98 60 9a 48 c7 c7 7d d4 5a 9a e8 51 3a b7 00 <0f> 0b e9 4f ff ff ff e8 c9 82 bd 00 85 c0 74 21 44 8b 15 d6 5d f1 [6.064165] RSP: 0018:bba701be78c8 EFLAGS: 00010292 [6.064168] RAX: 0016 RBX: 9a247b80
Re: [bug] 5.11-rc5 brought page allocation failure issue [ttm][amdgpu]
On Sun, 31 Jan 2021 at 22:22, Christian König wrote: > > > Yeah, known issue. I already pushed Michel's fix to drm-misc-fixes. > Should land in the next -rc by the weekend. > > Regards, > Christian. I tested this patch [1] for several days, and I can confirm that the reported issue is gone. [1] https://lore.kernel.org/lkml/20210128095346.2421-1-mic...@daenzer.net/ -- Best Regards, Mike Gavrilov. ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[bug] 5.11-rc5 brought page allocation failure issue [ttm][amdgpu]
The 5.11-rc5 (git 76c057c84d28) brought a new issue. Now the kernel log is flooded with the message "page allocation failure". Trace: msedge:cs0: page allocation failure: order:10, mode:0x190cc2(GFP_HIGHUSER|__GFP_NORETRY|__GFP_NOMEMALLOC), nodemask=(null),cpuset=/,mems_allowed=0 CPU: 18 PID: 4540 Comm: msedge:cs0 Tainted: GW - --- 5.11.0-0.rc5.20210128git76c057c84d28.138.fc34.x86_64 #1 Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 3402 01/13/2021 Call Trace: dump_stack+0x8b/0xb0 warn_alloc.cold+0x72/0xd6 ? _cond_resched+0x16/0x50 ? __alloc_pages_direct_compact+0x1a1/0x210 __alloc_pages_slowpath.constprop.0+0xf64/0xf90 ? kmem_cache_alloc+0x299/0x310 ? lock_acquire+0x173/0x380 ? trace_hardirqs_on+0x1b/0xe0 ? lock_release+0x1e9/0x400 __alloc_pages_nodemask+0x37d/0x400 ttm_pool_alloc+0x2a3/0x630 [ttm] ttm_tt_populate+0x37/0xe0 [ttm] ttm_bo_handle_move_mem+0x142/0x180 [ttm] ttm_bo_evict+0x12e/0x1b0 [ttm] ? kfree+0xeb/0x660 ? amdgpu_vram_mgr_new+0x34d/0x3d0 [amdgpu] ttm_mem_evict_first+0x101/0x4d0 [ttm] ttm_bo_mem_space+0x2c8/0x330 [ttm] ttm_bo_validate+0x163/0x1c0 [ttm] amdgpu_cs_bo_validate+0x82/0x190 [amdgpu] amdgpu_cs_list_validate+0x105/0x150 [amdgpu] amdgpu_cs_ioctl+0x803/0x1ef0 [amdgpu] ? trace_hardirqs_off_caller+0x41/0xd0 ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu] drm_ioctl_kernel+0x8c/0xe0 [drm] drm_ioctl+0x20f/0x3c0 [drm] ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu] ? selinux_file_ioctl+0x147/0x200 ? lock_acquired+0x1fa/0x380 ? lock_release+0x1e9/0x400 ? 
trace_hardirqs_on+0x1b/0xe0 amdgpu_drm_ioctl+0x49/0x80 [amdgpu] __x64_sys_ioctl+0x82/0xb0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f829c36c11b Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 25 bd 0c 00 f7 d8 64 89 01 48 RSP: 002b:7f8282c14f38 EFLAGS: 0246 ORIG_RAX: 0010 RAX: ffda RBX: 7f8282c14fa0 RCX: 7f829c36c11b RDX: 7f8282c14fa0 RSI: c0186444 RDI: 0018 RBP: c0186444 R08: 7f8282c15640 R09: 7f8282c14f80 R10: R11: 0246 R12: 1f592c0fe088 R13: 0018 R14: R15: fffd Mem-Info: active_anon:24325 inactive_anon:3569299 isolated_anon:0 active_file:704540 inactive_file:2709725 isolated_file:0 unevictable:1230 dirty:256317 writeback:7074 slab_reclaimable:222328 slab_unreclaimable:112852 mapped:838359 shmem:469422 pagetables:47722 bounce:0 free:107165 free_pcp:1298 free_cma:0 Node 0 active_anon:97300kB inactive_anon:14277196kB active_file:2818160kB inactive_file:10838900kB unevictable:4920kB isolated(anon):0kB isolated(file):0kB mapped:3353436kB dirty:1025268kB writeback:28296kB shmem:1877688kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:62528kB pagetables:190888kB all_unreclaimable? 
no Node 0 DMA free:11800kB min:32kB low:44kB high:56kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15900kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 3056 31787 31787 31787 Node 0 DMA32 free:303044kB min:6492kB low:9620kB high:12748kB reserved_highatomic:0KB active_anon:20kB inactive_anon:1322808kB active_file:5136kB inactive_file:483136kB unevictable:0kB writepending:220876kB present:3314552kB managed:3246620kB mlocked:0kB bounce:0kB free_pcp:4kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 0 28731 28731 28731 Node 0 Normal free:113816kB min:61052kB low:90472kB high:119892kB reserved_highatomic:0KB active_anon:97280kB inactive_anon:12953852kB active_file:2812656kB inactive_file:10355000kB unevictable:4920kB writepending:832688kB present:30133248kB managed:29421044kB mlocked:4920kB bounce:0kB free_pcp:5180kB local_pcp:4kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 0 Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11800kB Node 0 DMA32: 1009*4kB (UME) 724*8kB (UME) 488*16kB (UME) *32kB (UME) 950*64kB (UME) 620*128kB (UME) 223*256kB (UME) 74*512kB (M) 11*1024kB (M) 2*2048kB (ME) 0*4096kB = 303684kB Node 0 Normal: 964*4kB (UME) 719*8kB (ME) 379*16kB (UME) 192*32kB (UME) 127*64kB (UME) 130*128kB (UME) 122*256kB (UME) 18*512kB (UME) 4*1024kB (UM) 11*2048kB (UM) 0*4096kB = 113656kB Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB 3881804 total pagecache pages 0 pages in swap cache Swap cache stats: add 0, delete 0, find 0/0 Free swap = 67108860kB Total swap = 67108860kB 8365948 pages RAM 0 pages HighMem/MovableOnly 195057 pages reserved 0 pages cma reserved 0 pages hwpoisoned Full kernel log:
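A note on reading the report above: `order:10` asks the page allocator for 2^10 physically contiguous pages, and the per-zone free lists at the end of the dump show how many free blocks of each size existed at that moment. The following Python sketch is an illustration only and is not part of the original report. The helper names `parse_buddy_line` and `order_to_kb` are made up for this example, and a 4 KiB base page size is assumed. It parses the `Node 0 Normal` line copied from the log and checks whether any block large enough for the request was free.

```python
import re

PAGE_KB = 4  # assumption: 4 KiB base pages, the usual x86_64 configuration


def parse_buddy_line(line):
    """Return {block_size_kb: free_block_count} for one zone line of the dump."""
    # Entries look like "964*4kB"; the trailing "= 113656kB" has no "*" and is skipped.
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


def order_to_kb(order):
    """Size in kB of one block of the given allocation order."""
    return PAGE_KB << order


# Copied verbatim from the "Node 0 Normal" free-list line in the report.
NORMAL_ZONE = (
    "Node 0 Normal: 964*4kB (UME) 719*8kB (ME) 379*16kB (UME) 192*32kB (UME) "
    "127*64kB (UME) 130*128kB (UME) 122*256kB (UME) 18*512kB (UME) "
    "4*1024kB (UM) 11*2048kB (UM) 0*4096kB = 113656kB"
)

free = parse_buddy_line(NORMAL_ZONE)
total_kb = sum(size * count for size, count in free.items())
needed_kb = order_to_kb(10)
usable = sum(count for size, count in free.items() if size >= needed_kb)

print(f"free in zone: {total_kb} kB")
print(f"order-10 request needs one {needed_kb} kB block; free blocks that large: {usable}")
```

The zone total computed this way agrees with the figure printed by the kernel, and no block of the requested size is free in the Normal zone. That would be consistent with fragmentation rather than plain memory exhaustion, but this reading is an inference from the dump and not something stated in the thread.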
Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
On Thu, 21 Jan 2021 at 18:27, Christian König wrote: > > I still have no idea what's going on here. > > The KASAN messages from the DC code are completely unrelated. > > Please add the full dmesg to your bug report. > I did it. https://gitlab.freedesktop.org/drm/amd/-/issues/1439#note_776267 -- Best Regards, Mike Gavrilov.
Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
On Fri, 15 Jan 2021 at 03:43, Mikhail Gavrilov wrote: > In rc4, the number of warnings has dropped dramatically. No more errors "kasan slab-out-of-bounds" and no "DMA-API device driver failed to check map error". But still not fixed "sleeping function called from invalid context at include/linux/sched/mm.h:196" and "BUG: key 88810b0d9148 has not been registered!" Second issue Navi specific because it started to happen in 5.10 kernel after replacing Radeon VII to 6900XT. 1. BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 500, name: systemd-udevd 1 lock held by systemd-udevd/500: #0: 888107690258 (>mutex){}-{3:3}, at: device_driver_attach+0xa3/0x250 CPU: 9 PID: 500 Comm: systemd-udevd Not tainted 5.11.0-0.rc4.129.fc34.x86_64+debug #1 Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020 Call Trace: dump_stack+0xae/0xe5 ___might_sleep.cold+0x150/0x17e ? dcn30_clock_source_create+0x53/0x110 [amdgpu] kmem_cache_alloc_trace+0x23f/0x270 dcn30_clock_source_create+0x53/0x110 [amdgpu] dcn30_create_resource_pool+0x998/0x4890 [amdgpu] ? dcn30_calc_max_scaled_time+0x40/0x40 [amdgpu] ? lock_is_held_type+0xb8/0xf0 ? unpoison_range+0x3a/0x60 ? kasan_kmalloc.constprop.0+0x84/0xa0 ? dc_create_resource_pool+0x26e/0x5e0 [amdgpu] dc_create_resource_pool+0x26e/0x5e0 [amdgpu] dc_create+0x636/0x1bc0 [amdgpu] ? lock_acquire+0x2dd/0x7a0 ? sched_clock+0x5/0x10 ? sched_clock_cpu+0x18/0x170 ? find_held_lock+0x33/0x110 ? dc_create_state+0xa0/0xa0 [amdgpu] ? lock_downgrade+0x6b0/0x6b0 ? module_assert_mutex_or_preempt+0x3e/0x70 ? lock_is_held_type+0xb8/0xf0 ? unpoison_range+0x3a/0x60 ? kasan_kmalloc.constprop.0+0x84/0xa0 amdgpu_dm_init.isra.0+0x479/0x640 [amdgpu] ? vprintk_emit+0x1c0/0x460 ? dev_vprintk_emit+0x2d8/0x31a ? sched_clock+0x5/0x10 ? dm_resume+0x13b0/0x13b0 [amdgpu] ? dev_attr_show.cold+0x35/0x35 ? lock_downgrade+0x6b0/0x6b0 ? 
dev_printk_emit+0x8c/0xa8 ? dev_vprintk_emit+0x31a/0x31a ? wait_for_completion_io+0x240/0x240 ? __dev_printk+0x71/0xdf ? smu_hw_init.cold+0x16b/0x18a [amdgpu] ? smu_suspend+0x240/0x240 [amdgpu] ? navi10_ih_irq_init+0xea3/0x2420 [amdgpu] dm_hw_init+0xe/0x20 [amdgpu] amdgpu_device_init.cold+0x3031/0x4940 [amdgpu] ? amdgpu_device_cache_pci_state+0xf0/0xf0 [amdgpu] ? pci_bus_read_config_byte+0x140/0x140 ? do_pci_enable_device+0x1f8/0x260 ? pci_find_saved_ext_cap+0x110/0x110 ? pci_enable_bridge+0xf9/0x1e0 ? pci_dev_check_d3cold+0x107/0x250 ? pci_enable_device_flags+0x201/0x340 amdgpu_driver_load_kms+0x167/0x8a0 [amdgpu] amdgpu_pci_probe+0x235/0x360 [amdgpu] ? amdgpu_pci_remove+0xd0/0xd0 [amdgpu] local_pci_probe+0xd8/0x170 pci_device_probe+0x318/0x5c0 ? kernfs_create_link+0x16c/0x230 ? pci_device_remove+0x1d0/0x1d0 really_probe+0x224/0xc40 driver_probe_device+0x1f2/0x380 device_driver_attach+0x1df/0x250 __driver_attach+0xf6/0x260 ? device_driver_attach+0x250/0x250 bus_for_each_dev+0x114/0x180 ? subsys_dev_iter_exit+0x10/0x10 bus_add_driver+0x352/0x570 driver_register+0x20f/0x390 ? __pci_register_driver+0x13a/0x210 ? 0xc1d8d000 do_one_initcall+0xfb/0x530 ? perf_trace_initcall_level+0x3d0/0x3d0 ? __memset+0x2b/0x30 ? unpoison_range+0x3a/0x60 do_init_module+0x1ce/0x7a0 load_module+0x9841/0xa380 ? module_frob_arch_sections+0x20/0x20 ? lockdep_hardirqs_on_prepare+0x3e0/0x3e0 ? sched_clock_cpu+0x18/0x170 ? sched_clock+0x5/0x10 ? lock_acquire+0x2dd/0x7a0 ? sched_clock+0x5/0x10 ? lock_is_held_type+0xb8/0xf0 ? __do_sys_init_module+0x18b/0x220 __do_sys_init_module+0x18b/0x220 ? load_module+0xa380/0xa380 ? 
ktime_get_coarse_real_ts64+0x12f/0x160 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f2c109da07e Code: 48 8b 0d f5 1d 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c2 1d 0c 00 f7 d8 64 89 01 48 RSP: 002b:7ffc84d33f88 EFLAGS: 0246 ORIG_RAX: 00af RAX: ffda RBX: 55b87f8260a0 RCX: 7f2c109da07e RDX: 55b87f834060 RSI: 01e2cbf6 RDI: 7f2c0b7e0010 RBP: 7f2c0b7e0010 R08: 55b87f8281e0 R09: 7ffc84d30a26 R10: 55bd2404cc18 R11: 0246 R12: 55b87f834060 R13: 55b87f831ca0 R14: R15: 55b87f832640 [drm] Display Core initialized with v3.2.116! [drm] DMUB hardware initialized: version=0x0201 usb 1-3.2: Device not responding to setup address. usb 1-3.2: device not accepting address 5, error -71 [drm] REG_WAIT timeout 1us * 10 tries - mpc2_assert_idle_mpcc line:480 2. BUG: key 88810b0d9148 has not been registered! [ cut here ] DEBUG_LOCKS_WARN_ON(1) WARNING: CPU: 25 PID: 500 at kernel/locking/lockdep.c:4618 lockdep_init_map_waits+0x592/0x770 Modules linked
Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
On Thu, 14 Jan 2021 at 18:56, Christian König wrote:
> Unfortunately not of hand.
>
> I also don't see any bug reports from other people and can't reproduce
> the last backtrace you send out TTM here.

That is because only the most desperate users install kernels with debug flags enabled and then load the system by opening a huge number of programs and tabs, so you shouldn't be surprised that I'm the only one reporting this. This is what my desktop looks like every day: https://imgur.com/a/Kxlmrem

> Do you have any local modifications or special setup in your system?
> Like bpf scripts or something like that?

No, I didn't write any bpf scripts, but it looks like my distribution, Fedora Rawhide, uses some bpf scripts by default out of the box:

# bpftool prog
20: cgroup_device tag 40ddf486530245f5 gpl loaded_at 2021-01-15T01:30:04+0500 uid 0 xlated 504B jited 309B memlock 4096B
21: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:04+0500 uid 0 xlated 64B jited 54B memlock 4096B
22: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:04+0500 uid 0 xlated 64B jited 54B memlock 4096B
23: cgroup_device tag ca8e50a3c7fb034b gpl loaded_at 2021-01-15T01:30:05+0500 uid 0 xlated 496B jited 307B memlock 4096B
24: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:05+0500 uid 0 xlated 64B jited 54B memlock 4096B
25: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:05+0500 uid 0 xlated 64B jited 54B memlock 4096B
26: cgroup_device tag be31ae23198a0378 gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 464B jited 288B memlock 4096B
27: cgroup_device tag ee0e253c78993a24 gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 416B jited 255B memlock 4096B
28: cgroup_device tag 438c5618576e5b0c gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 568B jited 354B memlock 4096B
29: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 64B jited 54B memlock 4096B
30: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 64B jited 54B memlock 4096B
31: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 64B jited 54B memlock 4096B
32: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 64B jited 54B memlock 4096B
33: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:14+0500 uid 0 xlated 64B jited 54B memlock 4096B
34: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:14+0500 uid 0 xlated 64B jited 54B memlock 4096B
35: cgroup_device tag ee0e253c78993a24 gpl loaded_at 2021-01-15T01:30:14+0500 uid 0 xlated 416B jited 255B memlock 4096B
38: cgroup_device tag 3a0ef5414c2f6fca gpl loaded_at 2021-01-15T01:30:14+0500 uid 0 xlated 744B jited 447B memlock 4096B
39: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:14+0500 uid 0 xlated 64B jited 54B memlock 4096B
40: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:14+0500 uid 0 xlated 64B jited 54B memlock 4096B
41: cgroup_device tag ee0e253c78993a24 gpl loaded_at 2021-01-15T01:30:18+0500 uid 0 xlated 416B jited 255B memlock 4096B
42: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:18+0500 uid 0 xlated 64B jited 54B memlock 4096B
43: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:18+0500 uid 0 xlated 64B jited 54B memlock 4096B

I caught another couple of leaks, but nothing new: https://pastebin.com/2EgvYJdz
[1] do_detailed_mode+0x7c1/0x13d0 [drm]
[2] drm_mode_duplicate+0x45/0x220 [drm]
[3] do_seccomp+0x215/0x2280
[4] __vmalloc_node_range+0x464/0x7b0
[5] bpf_prog_alloc_no_stats+0xa2/0x2b0
[6] bpf_prog_store_orig_filter+0x7b/0x1c0
[7] kmemdup+0x1a/0x40

Did the following trace message confuse anyone?
== BUG: KASAN: slab-out-of-bounds in kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu] Read of size 1 at addr 88812a6b4181 by task systemd-udevd/491 CPU: 20 PID: 491 Comm: systemd-udevd Not tainted 5.11.0-0.rc3.20210114git65f0d2414b70.125.fc34.x86_64 #1 Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020 Call Trace: dump_stack+0xae/0xe5 print_address_description.constprop.0+0x18/0x160 ? kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu] kasan_report.cold+0x7f/0x10e ? kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu] kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu] ? kfd_create_crat_image_acpi+0x340/0x340 [amdgpu] ? __raw_spin_lock_init+0x39/0x110 kfd_topology_init+0x2ac/0x400 [amdgpu] ? kfd_create_topology_device+0x320/0x320 [amdgpu] ? __class_register+0x2ad/0x430 ? __class_create+0xc5/0x130 kgd2kfd_init+0x95/0xf0 [amdgpu]
Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
On Tue, 12 Jan 2021 at 01:45, Christian König wrote: > > But what you have in your logs so far are only unrelated symptoms, the > root of the problem is that somebody is leaking memory. > > What you could do as well is to try to enable kmemleak I captured some memleaks. Do they contain any useful information? [1] https://pastebin.com/n0FE7Hsu [2] https://pastebin.com/MUX55L1k [3] https://pastebin.com/a3FT7DVG [4] https://pastebin.com/1ALvJKz7 -- Best Regards, Mike Gavrilov.
Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
Hi Christian,

On Tue, 12 Jan 2021 at 01:45, Christian König wrote:
>
> Hi Mike,
>
> Unfortunately not, that's DC stuff. Easiest is to assign this as a bug
> tracker to our DC team.

Ok.

> At least some progress. Any objections that I add your e-mail address as
> tested-by tag?

No objections, feel free to add me.

> I can take a look at this one here. Looks like some missing error
> handling when allocating memory.
> Can you decode to which line number ttm_tt_swapin+0x34 points to?

$ /usr/src/kernels/`uname -r`/scripts/faddr2line /lib/debug/lib/modules/`uname -r`/kernel/drivers/gpu/drm/ttm/ttm.ko.debug ttm_tt_swapin+0x34
ttm_tt_swapin+0x34/0xd0:
mapping_gfp_mask at /usr/src/debug/kernel-20210108gitf5e6c330254a/linux-5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64/./include/linux/pagemap.h:105 (discriminator 2)
(inlined by) ttm_tt_swapin at /usr/src/debug/kernel-20210108gitf5e6c330254a/linux-5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64/drivers/gpu/drm/ttm/ttm_tt.c:210 (discriminator 2)

$ cat -s -n /usr/src/debug/kernel-20210108gitf5e6c330254a/linux-5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64/drivers/gpu/drm/ttm/ttm_tt.c | head -220 | tail -20
   201      struct page *from_page;
   202      struct page *to_page;
   203      gfp_t gfp_mask;
   204      int i, ret;
   205
   206      swap_storage = ttm->swap_storage;
   207      BUG_ON(swap_storage == NULL);
   208
   209      swap_space = swap_storage->f_mapping;
   210      gfp_mask = mapping_gfp_mask(swap_space);
   211
   212      for (i = 0; i < ttm->num_pages; ++i) {
   213              from_page = shmem_read_mapping_page_gfp(swap_space, i,
   214                                                      gfp_mask);
   215              if (IS_ERR(from_page)) {
   216                      ret = PTR_ERR(from_page);
   217                      goto out_err;
   218              }
   219              to_page = ttm->pages[i];
   220              if (unlikely(to_page == NULL)) {

> Please use this one here:
> https://gitlab.freedesktop.org/drm/amd/-/issues/new
>
> If you can't find the DC guys of hand in the assignee list just assign
> to me and I will forward.

https://gitlab.freedesktop.org/drm/amd/-/issues/1439
Ok, let's continue there.

--
Best Regards,
Mike Gavrilov.
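The `ttm_tt_swapin+0x34` argument given to faddr2line in the message above uses the same `symbol+offset/size [module]` notation as every frame in the call traces of this thread. As a rough aid for picking frames out of a pasted trace, here is a small Python sketch. It is an illustration under stated assumptions and not a kernel tool: the function name `parse_frame` and the regular expression are invented for this example, and they only cover the frame shapes that appear in these logs.

```python
import re

# Matches frames such as "ttm_tt_swapin+0x34/0x1b0 [ttm]" or
# "? dcn30_clock_source_create+0x34/0xb0 [amdgpu]".
FRAME_RE = re.compile(
    r"(?P<unreliable>\?\s+)?"          # leading "?" marks a frame the unwinder is unsure about
    r"(?P<symbol>[A-Za-z_][\w.]*)"     # function name, may carry suffixes like ".isra.0"
    r"\+(?P<offset>0x[0-9a-f]+)"       # offset of the address inside the function
    r"(?:/(?P<size>0x[0-9a-f]+))?"     # total function size, optional
    r"(?:\s+\[(?P<module>[\w-]+)\])?"  # owning module, absent for built-in code
)


def parse_frame(text):
    """Split one stack-trace frame into its parts, or return None if it does not match."""
    match = FRAME_RE.fullmatch(text.strip())
    if match is None:
        return None
    size = match.group("size")
    return {
        "symbol": match.group("symbol"),
        "offset": int(match.group("offset"), 16),
        "size": int(size, 16) if size else None,
        "module": match.group("module"),
        "unreliable": match.group("unreliable") is not None,
    }


frame = parse_frame("ttm_tt_swapin+0x34/0x1b0 [ttm]")
print(frame)
```

The `symbol+offset` pair is what faddr2line needs; the module name tells you which `.ko.debug` file to point it at.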
Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
On Mon, 11 Jan 2021 at 19:01, Christian König wrote: > Changing the page table attributes while releasing memory might sleep. > So we can't use a spinlock here. > > Thanks for the report, a patch to fix this is on the mailing list now. Can you look also the first trace? Here a same error message "sleeping function called from invalid context" and a lot of [amdgpu] code. BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 501, name: systemd-udevd 1 lock held by systemd-udevd/501: #0: 978e0278d258 (>mutex){}-{3:3}, at: device_driver_attach+0x3b/0xb0 CPU: 25 PID: 501 Comm: systemd-udevd Not tainted 5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64 #1 Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020 Call Trace: dump_stack+0x8b/0xb0 ___might_sleep.cold+0xb6/0xc6 ? dcn30_clock_source_create+0x34/0xb0 [amdgpu] kmem_cache_alloc_trace+0x204/0x230 dcn30_clock_source_create+0x34/0xb0 [amdgpu] dcn30_create_resource_pool+0x1d9/0x13a0 [amdgpu] ? rcu_read_lock_sched_held+0x3f/0x80 ? trace_kmalloc+0xb2/0xe0 ? __kmalloc+0x191/0x280 ? dc_create_resource_pool+0x110/0x1d0 [amdgpu] dc_create_resource_pool+0x110/0x1d0 [amdgpu] dc_create+0x205/0x790 [amdgpu] ? trace_kmalloc+0xb2/0xe0 ? kmem_cache_alloc_trace+0x174/0x230 amdgpu_dm_init.isra.0+0x1b9/0x250 [amdgpu] ? dev_vprintk_emit+0x171/0x195 ? dev_printk_emit+0x3e/0x40 dm_hw_init+0xe/0x20 [amdgpu] amdgpu_device_init.cold+0x179f/0x1afd [amdgpu] ? pci_conf1_read+0xa4/0x100 amdgpu_driver_load_kms+0x68/0x280 [amdgpu] amdgpu_pci_probe+0x129/0x1b0 [amdgpu] local_pci_probe+0x42/0x80 pci_device_probe+0xd9/0x1a0 really_probe+0x205/0x460 driver_probe_device+0xe1/0x150 device_driver_attach+0xa8/0xb0 __driver_attach+0x8c/0x150 ? device_driver_attach+0xb0/0xb0 ? device_driver_attach+0xb0/0xb0 bus_for_each_dev+0x67/0x90 bus_add_driver+0x12e/0x1f0 driver_register+0x8f/0xe0 ? 
0xc0d9c000 do_one_initcall+0x67/0x320 ? rcu_read_lock_sched_held+0x3f/0x80 ? trace_kmalloc+0xb2/0xe0 ? kmem_cache_alloc_trace+0x174/0x230 do_init_module+0x5c/0x270 __do_sys_init_module+0x130/0x190 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f363661deee Code: 48 8b 0d 85 1f 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 52 1f 0c 00 f7 d8 64 89 01 48 RSP: 002b:7ffeb7191588 EFLAGS: 0246 ORIG_RAX: 00af RAX: ffda RBX: 561b94563170 RCX: 7f363661deee RDX: 561b94579df0 RSI: 00b8a356 RDI: 7f3633b9e010 RBP: 7f3633b9e010 R08: 561b94565240 R09: 7ffeb718d786 R10: 561ef5ef1595 R11: 0246 R12: 561b94579df0 R13: 561b9457a3e0 R14: R15: 561b94576530 [drm] Display Core initialized with v3.2.116! [drm] DMUB hardware initialized: version=0x0201 usb 1-3.2: new high-speed USB device number 5 using xhci_hcd [drm] REG_WAIT timeout 1us * 10 tries - mpc2_assert_idle_mpcc line:480 > > -12 is just -ENOMEM. Looks like a memory leak to me, maybe caused by > > the problem above, maybe something completely unrelated. > > > > I will take a look. > > The looks like a completely unrelated memory leak to me. > > Probably best if you open up a bug report for this. Yes, the monitor still turns off after applying patch "make the pool shrinker lock a mutex". Anyway patch fixed the issue with flood of message "BUG: sleeping function called from invalid context at mm/vmalloc.c:1756" so kernel log became cleaner. 
Now the issue with turns off monitor looks in logs so: DMA-API: cacheline tracking ENOMEM, dma-debug disabled amdgpu :0b:00.0: amdgpu: 6b791523 pin failed [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12 BUG: kernel NULL pointer dereference, address: 0060 #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 0 P4D 0 Oops: [#1] SMP NOPTI CPU: 20 PID: 3780 Comm: brave:cs0 Tainted: GW- --- 5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64 #1 Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020 RIP: 0010:ttm_tt_swapin+0x34/0x1b0 [ttm] Code: 55 41 54 55 53 48 83 ec 10 48 8b 47 20 48 89 44 24 08 48 85 c0 0f 84 86 01 00 00 48 8b 44 24 08 49 89 fc 4c 8b a8 e0 01 00 00 <41> 8b 45 60 89 44 24 04 8b 47 0c 85 c0 0f 84 df 00 00 00 31 db 65 RSP: 0018:a7400532b9c0 EFLAGS: 00010286 RAX: 978e2ae25800 RBX: 97910ec12058 RCX: 978e12caac70 RDX: 8010 RSI: RDI: 97912c3d99c0 RBP: 97912c3d99c0 R08: R09: 70b3a000 R10: 0002 R11: R12: 97912c3d99c0 R13: R14: a7400532ba90 R15: 978e182c6350 FS: 7f070bb1b640()
[drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
Hi folks,

today I joined the testing of kernel 5.11 and saw that the kernel log was flooded with BUG messages:

BUG: sleeping function called from invalid context at mm/vmalloc.c:1756
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 266, name: kswapd0
INFO: lockdep is turned off.
CPU: 15 PID: 266 Comm: kswapd0 Tainted: GW- --- 5.11.0-0.rc2.20210108gitf5e6c330254a.119.fc34.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020
Call Trace:
 dump_stack+0x8b/0xb0
 ___might_sleep.cold+0xb6/0xc6
 vm_unmap_aliases+0x21/0x40
 change_page_attr_set_clr+0x9e/0x190
 set_memory_wb+0x2f/0x80
 ttm_pool_free_page+0x28/0x90 [ttm]
 ttm_pool_shrink+0x45/0xb0 [ttm]
 ttm_pool_shrinker_scan+0xa/0x20 [ttm]
 do_shrink_slab+0x177/0x3a0
 shrink_slab+0x9c/0x290
 shrink_node+0x2e6/0x700
 balance_pgdat+0x2f5/0x650
 kswapd+0x21d/0x4d0
 ? do_wait_intr_irq+0xd0/0xd0
 ? balance_pgdat+0x650/0x650
 kthread+0x13a/0x150
 ? __kthread_bind_mask+0x60/0x60
 ret_from_fork+0x22/0x30

But the most unpleasant thing is that after a while the monitor turns off and does not come back on until a reboot. This is accompanied by an entry in the kernel log:

amdgpu :0b:00.0: amdgpu: ff7d8b94 pin failed
[drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12

$ grep "Failed to pin framebuffer with error" -Rn .
./drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c:5816: DRM_ERROR("Failed to pin framebuffer with error %d\n", r);

$ git blame -L 5811,5821 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
Blaming lines: 0% (11/9167), done.
5d43be0ccbc2f (Christian König 2017-10-26 18:06:23 +0200 5811)         domain = AMDGPU_GEM_DOMAIN_VRAM;
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5812)
7b7c6c81b3a37 (Junwei Zhang    2018-06-25 12:51:14 +0800 5813)     r = amdgpu_bo_pin(rbo, domain);
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5814)     if (unlikely(r != 0)) {
30b7c6147d18d (Harry Wentland  2017-10-26 15:35:14 -0400 5815)         if (r != -ERESTARTSYS)
30b7c6147d18d (Harry Wentland  2017-10-26 15:35:14 -0400 5816)             DRM_ERROR("Failed to pin framebuffer with error %d\n", r);
0f257b09531b4 (Chunming Zhou   2019-05-07 19:45:31 +0800 5817)         ttm_eu_backoff_reservation(&ticket, &list);
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5818)         return r;
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5819)     }
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5820)
bb812f1ea87dd (Junwei Zhang    2018-06-25 13:32:24 +0800 5821)     r = amdgpu_ttm_alloc_gart(&rbo->tbo);

Does anyone know how to fix it?

Full kernel logs are here:
[1] https://pastebin.com/fLasjDHX
[2] https://pastebin.com/g3wR2r9e

--
Best Regards,
Mike Gavrilov.
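For readers decoding the "error -12" in the message above: kernel functions usually return a negated errno value, so the number can be looked up directly. The Python sketch below is only a convenience illustration; the helper name `describe_kernel_error` is made up for this example, and it relies on the userspace errno table of the machine it runs on, which is assumed to match the kernel's numbering for the common codes.

```python
import errno
import os


def describe_kernel_error(ret):
    """Map a kernel-style return value such as -12 to (errno name, description)."""
    code = -ret if ret < 0 else ret
    # Codes that exist only inside the kernel (for example ERESTARTSYS, which the
    # blame output above tests for) are not in the userspace table and are
    # reported here as UNKNOWN.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


name, text = describe_kernel_error(-12)
print(f"-12 -> {name}: {text}")
```

So the pin failure reports that an allocation could not be satisfied, which by itself does not say which allocation failed or why.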
BUG: key ffff8b521bda9148 has not been registered!
Hi folks! I started to see this message every boot after replacing Radeon VII to 6900XT. $ journalctl | grep "BUG: key" Dec 31 05:19:42 localhost.localdomain kernel: BUG: key 98b59ab01148 has not been registered! Dec 31 05:25:44 localhost.localdomain kernel: BUG: key 8d425ba01148 has not been registered! Jan 02 17:36:25 localhost.localdomain kernel: BUG: key 935e5a959148 has not been registered! Jan 03 03:29:08 localhost.localdomain kernel: BUG: key 8d425b0b9148 has not been registered! Jan 03 03:33:35 localhost.localdomain kernel: BUG: key 8bc35aef9148 has not been registered! Jan 03 16:47:44 localhost.localdomain kernel: BUG: key 9a3cdb959148 has not been registered! Jan 06 14:59:58 localhost.localdomain kernel: BUG: key 97b6db9f9148 has not been registered! Jan 07 14:51:49 localhost.localdomain kernel: BUG: key 8f2dda569148 has not been registered! Jan 07 15:08:23 localhost.localdomain kernel: BUG: key a0849bd31148 has not been registered! Jan 08 18:07:28 localhost.localdomain kernel: BUG: key 89721a0e9148 has not been registered! Jan 08 18:12:51 localhost.localdomain kernel: BUG: key 8b521bda9148 has not been registered! Here is trace: [6.333672] [drm] REG_WAIT timeout 1us * 10 tries - mpc2_assert_idle_mpcc line:480 [6.335258] BUG: key 8b521bda9148 has not been registered! 
[6.335271] [ cut here ] [6.335273] DEBUG_LOCKS_WARN_ON(1) [6.335279] WARNING: CPU: 18 PID: 525 at kernel/locking/lockdep.c:4618 lockdep_init_map_waits+0x18b/0x210 [6.335284] Modules linked in: fjes(-) amdgpu(+) iommu_v2 gpu_sched ttm drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel cec drm ghash_clmulni_intel ccp igb nvme nvme_core dca i2c_algo_bit wmi pinctrl_amd fuse [6.335298] CPU: 18 PID: 525 Comm: systemd-udevd Not tainted 5.10.0-0.rc6.20201204git34816d20f173.92.fc34.x86_64 #1 [6.335302] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020 [6.335306] RIP: 0010:lockdep_init_map_waits+0x18b/0x210 [6.335309] Code: 00 85 c0 0f 84 75 ff ff ff 8b 3d 18 c4 f1 01 85 ff 0f 85 67 ff ff ff 48 c7 c6 68 43 60 97 48 c7 c7 1d 90 5a 97 e8 70 1f b6 00 <0f> 0b e9 4d ff ff ff e8 19 59 bc 00 85 c0 74 21 44 8b 1d e6 c3 f1 [6.335315] RSP: 0018:9e5a013d3910 EFLAGS: 00010282 [6.335317] RAX: 0016 RBX: 97247d80 RCX: 8b5908fdb238 [6.335320] RDX: ffd8 RSI: 0027 RDI: 8b5908fdb230 [6.335322] RBP: 8b520e2a7978 R08: R09: [6.335325] R10: 9e5a013d3740 R11: 8b592e2fffe8 R12: 8b521bda9148 [6.335327] R13: R14: 8b521bc30330 R15: 8b521bc30330 [6.335330] FS: 7fe019eb9140() GS:8b5908e0() knlGS: [6.335333] CS: 0010 DS: ES: CR0: 80050033 [6.335336] CR2: 7fe018f5e000 CR3: 0001142ee000 CR4: 00350ee0 [6.335338] Call Trace: [6.335342] __kernfs_create_file+0x7b/0x100 [6.335344] sysfs_add_file_mode_ns+0xa3/0x190 [6.335347] sysfs_create_bin_file+0x50/0x70 [6.335428] hdcp_create_workqueue+0x3bd/0x410 [amdgpu] [6.335499] amdgpu_dm_init.isra.0.cold+0x136/0x126d [amdgpu] [6.335570] ? psp_set_srm+0xb0/0xb0 [amdgpu] [6.335637] ? hdcp_update_display+0x1f0/0x1f0 [amdgpu] [6.335641] ? dev_printk_emit+0x3e/0x40 [6.335709] dm_hw_init+0xe/0x20 [amdgpu] [6.335776] amdgpu_device_init.cold+0x18c3/0x1bbc [amdgpu] [6.335781] ? 
pci_bus_read_config_word+0x39/0x50 [6.335831] amdgpu_driver_load_kms+0x2b/0x1f0 [amdgpu] [6.335879] amdgpu_pci_probe+0x129/0x1b0 [amdgpu] [6.335889] local_pci_probe+0x42/0x80 [6.335891] pci_device_probe+0xd9/0x1a0 [6.335896] really_probe+0x205/0x460 [6.335898] driver_probe_device+0xe1/0x150 [6.335901] device_driver_attach+0xa8/0xb0 [6.335904] __driver_attach+0x8c/0x150 [6.335907] ? device_driver_attach+0xb0/0xb0 [6.335909] ? device_driver_attach+0xb0/0xb0 [6.335911] bus_for_each_dev+0x67/0x90 [6.335914] bus_add_driver+0x12e/0x1f0 [6.335917] driver_register+0x8b/0xe0 [6.335919] ? 0xc0e4c000 [6.335922] do_one_initcall+0x67/0x320 [6.335925] ? rcu_read_lock_sched_held+0x3f/0x80 [6.335928] ? trace_kmalloc+0xb2/0xe0 [6.335930] ? kmem_cache_alloc_trace+0x157/0x270 [6.335934] do_init_module+0x5c/0x260 [6.335936] __do_sys_init_module+0x13d/0x1a0 [6.335940] do_syscall_64+0x33/0x40 [6.335943] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [6.335945] RIP: 0033:0x7fe01aab2efe [6.335948] Code: 48 8b 0d 7d 1f 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4a 1f 0c 00 f7 d8 64 89 01 48 [6.335953] RSP: 002b:7ffdf4879928 EFLAGS: 0246 ORIG_RAX: 00af [6.335957] RAX: