
Re: regression/bisected/6.8 commit f7fe64ad0f22ff034f8ebcfbd7299ee9cc9b57d7 leads to GPU hang when I open GNOME activities

2024-01-24 Thread Mikhail Gavrilov

On Wed, Jan 24, 2024 at 7:19 AM Mikhail Gavrilov
 wrote:
>
> Who could dig into it, please?

Have you decided to revert it?
https://lkml.org/lkml/2024/1/22/1866

Also, I forgot to attach the kernel build .config to the previous
message, so I am attaching it here.
It may be useful for reproducing the bug.

-- 
Best Regards,
Mike Gavrilov.

.config.zip
Description: Zip archive

Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

2023-12-19 Thread Mikhail Gavrilov

On Fri, Dec 15, 2023 at 5:37 PM Christian König
 wrote:
>
> I have no idea :)
>
>  From the logs I can see that the AMDGPU now has the proper BARs assigned:
>
> [5.722015] pci :03:00.0: [1002:73df] type 00 class 0x038000
> [5.722051] pci :03:00.0: reg 0x10: [mem
> 0xf8-0xfb 64bit pref]
> [5.722081] pci :03:00.0: reg 0x18: [mem
> 0xfc-0xfc0fff 64bit pref]
> [5.722112] pci :03:00.0: reg 0x24: [mem 0xfca0-0xfcaf]
> [5.722134] pci :03:00.0: reg 0x30: [mem 0xfcb0-0xfcb1 pref]
> [5.722368] pci :03:00.0: PME# supported from D1 D2 D3hot D3cold
> [5.722484] pci :03:00.0: 63.008 Gb/s available PCIe bandwidth,
> limited by 8.0 GT/s PCIe x8 link at :00:01.1 (capable of 252.048
> Gb/s with 16.0 GT/s PCIe x16 link)
>
> And with that the driver can work perfectly fine.
>
> Have you updated the BIOS or added/removed some other hardware? Maybe
> somebody added a quirk for your BIOS into the PCIe code or something
> like that.
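
The bandwidth figures in the quoted log line can be reproduced with a little arithmetic. This is only a sketch: it assumes the numbers come from the per-lane transfer rate scaled by the 128b/130b encoding used by PCIe Gen3 and later, with integer truncation per lane. The function name is made up for illustration.

```python
def pcie_bandwidth_mbps(rate_mtps: int, lanes: int) -> int:
    """Usable link bandwidth in Mb/s for a 128b/130b-encoded PCIe link."""
    # Per-lane payload rate. Integer division is an assumption that
    # happens to reproduce the values printed in the log.
    per_lane = rate_mtps * 128 // 130
    return per_lane * lanes


# 8.0 GT/s x8 link (what the card currently gets)
available = pcie_bandwidth_mbps(8000, 8)
# 16.0 GT/s x16 link (what the card is capable of)
capable = pcie_bandwidth_mbps(16000, 16)

print(f"{available / 1000:.3f} Gb/s available")
print(f"{capable / 1000:.3f} Gb/s capable")
```

Under these assumptions the two results match the 63.008 Gb/s and 252.048 Gb/s values in the log.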

No, nothing changed in hardware.
But I found the commit which fixes it.

> git bisect unfixed
92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6 is the first fixed commit
commit 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6
Author: Vasant Hegde 
Date:   Thu Sep 21 09:21:45 2023 +

iommu/amd: Introduce iommu_dev_data.flags to track device capabilities

Currently we use struct iommu_dev_data.iommu_v2 to keep track of the device
ATS, PRI, and PASID capabilities. But these capabilities can be enabled
independently (except PRI requires ATS support). Hence, replace
the iommu_v2 variable with a flags variable, which keep track of the device
capabilities.

From commit 9bf49e36d718 ("PCI/ATS: Handle sharing of PF PRI Capability
with all VFs"), device PRI/PASID is shared between PF and any associated
VFs. Hence use pci_pri_supported() and pci_pasid_features() instead of
pci_find_ext_capability() to check device PRI/PASID support.

Signed-off-by: Vasant Hegde 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Jerry Snitselaar 
Link: https://lore.kernel.org/r/20230921092147.5930-13-vasant.he...@amd.com
Signed-off-by: Joerg Roedel 

 drivers/iommu/amd/amd_iommu_types.h |  3 ++-
 drivers/iommu/amd/iommu.c   | 46 ++---
 2 files changed, 30 insertions(+), 19 deletions(-)


> git bisect log
git bisect start '--term-new=fixed' '--term-old=unfixed'
# status: waiting for both good and bad commits
# fixed: [33cc938e65a98f1d29d0a18403dbbee050dcad9a] Linux 6.7-rc4
git bisect fixed 33cc938e65a98f1d29d0a18403dbbee050dcad9a
# status: waiting for good commit(s), bad commit known
# unfixed: [ffc253263a1375a65fa6c9f62a893e9767fbebfa] Linux 6.6
git bisect unfixed ffc253263a1375a65fa6c9f62a893e9767fbebfa
# unfixed: [7d461b291e65938f15f56fe58da2303b07578a76] Merge tag
'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm
git bisect unfixed 7d461b291e65938f15f56fe58da2303b07578a76
# unfixed: [e14aec23025eeb1f2159ba34dbc1458467c4c347] s390/ap: fix AP
bus crash on early config change callback invocation
git bisect unfixed e14aec23025eeb1f2159ba34dbc1458467c4c347
# unfixed: [be3ca57cfb777ad820c6659d52e60bbdd36bf5ff] Merge tag
'media/v6.7-1' of
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect unfixed be3ca57cfb777ad820c6659d52e60bbdd36bf5ff
# fixed: [c0d12d769299e1e08338988c7745009e0db2a4a0] Merge tag
'drm-next-2023-11-10' of git://anongit.freedesktop.org/drm/drm
git bisect fixed c0d12d769299e1e08338988c7745009e0db2a4a0
# fixed: [4bbdb725a36b0d235f3b832bd0c1e885f0442d9f] Merge tag
'iommu-updates-v6.7' of
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
git bisect fixed 4bbdb725a36b0d235f3b832bd0c1e885f0442d9f
# unfixed: [25b6377007ebe1c3ede773fd6979f613386db000] Merge tag
'drm-next-2023-11-07' of git://anongit.freedesktop.org/drm/drm
git bisect unfixed 25b6377007ebe1c3ede773fd6979f613386db000
# unfixed: [67c0afb6424fee94238d9a32b97c407d0c97155e] Merge tag
'exfat-for-6.7-rc1-part2' of
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
git bisect unfixed 67c0afb6424fee94238d9a32b97c407d0c97155e
# unfixed: [3613047280ec42a4e1350fdc1a6dd161ff4008cc] Merge tag
'v6.6-rc7' into core
git bisect unfixed 3613047280ec42a4e1350fdc1a6dd161ff4008cc
# fixed: [cedc811c76778bdef91d405717acee0de54d8db5] iommu/amd: Remove
DMA_FQ type from domain allocation path
git bisect fixed cedc811c76778bdef91d405717acee0de54d8db5
# unfixed: [b0cc5dae1ac0c18748706a4beb636e3b726dd744] iommu/amd:
Rename ats related variables
git bisect unfixed b0cc5dae1ac0c18748706a4beb636e3b726dd744
# fixed: [5a0b11a180a9b82b4437a4be1cf73530053f139b] iommu/amd: Remove
iommu_v2 module
git bisect fixed 5a0b11a180a9b82b4437a4be1cf73530053f139b
# fixed: [92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6] iommu/amd:
Introduce iommu_dev_data.flags to track device capabilities
git bisect fixed 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6
# unfixed:
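
Since this bisect uses the custom terms "fixed" and "unfixed", the log can be summarized mechanically. The sketch below is a hypothetical helper, not part of git: it counts the marking steps and reports the last commit marked "fixed", which is the best known candidate for the first fixed commit at that point in the bisect. The sample is a shortened excerpt of the log above.

```python
import re

# Matches lines such as "git bisect fixed <40-hex-sha>".
BISECT_LINE = re.compile(r"^git bisect (fixed|unfixed) ([0-9a-f]{40})$")

SAMPLE_LOG = """\
git bisect start '--term-new=fixed' '--term-old=unfixed'
# fixed: [33cc938e65a98f1d29d0a18403dbbee050dcad9a] Linux 6.7-rc4
git bisect fixed 33cc938e65a98f1d29d0a18403dbbee050dcad9a
git bisect unfixed ffc253263a1375a65fa6c9f62a893e9767fbebfa
git bisect fixed 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6
"""


def summarize_bisect(log_text: str) -> dict:
    """Count marking steps and find the last commit marked 'fixed'."""
    marks = []
    for line in log_text.splitlines():
        match = BISECT_LINE.match(line.strip())
        if match:
            marks.append((match.group(1), match.group(2)))
    fixed = [sha for term, sha in marks if term == "fixed"]
    return {"steps": len(marks), "last_fixed": fixed[-1] if fixed else None}


summary = summarize_bisect(SAMPLE_LOG)
print(summary)
```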

Re: regression/bisected/6.7rc1: Instead of desktop I see a horizontal flashing bar with a picture of the desktop background on white screen

2023-12-18 Thread Mikhail Gavrilov

On Fri, Dec 15, 2023 at 9:14 PM Hamza Mahfooz  wrote:
>
> Can you try the following patch with old fw (version 0x07002100 should
> be fine)?: https://patchwork.freedesktop.org/patch/572298/
>

Tested-by: Mikhail Gavrilov  on 7900XTX hardware.

May I ask what SubVP actually does?
I read on Phoronix that it is a new feature of DCN 3.2 hardware:
https://www.phoronix.com/news/AMDGPU-Linux-6.5-Improvements
But I didn't notice anything working better after this feature was
enabled.
On the contrary, my kernel logs started filling up with unpleasant
errors.
See here: https://gitlab.freedesktop.org/drm/amd/-/issues/2796
I bisected that issue, and the bisect points to commit
299004271cbf0315da327c4bd67aec3e7041cb32, which enables SubVP high
refresh rate.
But I had 120Hz and 4K without SubVP too. So I ask again: what is the
benefit of SubVP?

-- 
Best Regards,
Mike Gavrilov.

Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

2023-12-15 Thread Mikhail Gavrilov

On Tue, Feb 28, 2023 at 5:43 PM Christian König
 wrote:
>
> The point is it doesn't need to talk to the amdgpu hardware. What it
> does is that it talks to the good old VGA/VESA emulation and that just
> happens to be still enabled by the BIOS/GRUB.
>
> And that VGA/VESA emulation doesn't need any BAR or whatever to keep the
> hw running in the state where it was initialized before the kernel
> started. The kernel just grabs the addresses where it needs to write the
> display data and keeps going with that.
>
> But when a hw specific driver wants to load this is the first thing
> which gets disabled because we need to load new firmware. And with the
> BARs disabled this can't be re-enabled without rebooting the system.
>
> > My suggestion is that if
> > amdgpu fails to talk to the hardware, then let another suitable driver
> > do it. I attached a system log when I apply "pci=nocrs" with
> > "modprobe.blacklist=amdgpu" for showing that graphics work right in
> > this case.
> > To do this, does the Linux module loading mechanism need to be refined?
>
> That's actually working as expected. The real problem is that the BIOS
> on that system is so broken that we can't access the hw correctly.
>
> What we could do is to check the BARs very early on and refuse to
> load when they are disabled. The problem with this approach is that there
> are systems where it is normal that the BARs are disabled until the
> driver loads and get enabled during the hardware initialization process.
>
> What you might want to look into is to find a quirk for the BIOS to
> properly enable the nvme controller.
>

That's interesting. I noticed that amdgpu now works even with the
[pci=nocrs] parameter on 6.7.0-0.rc4 and higher kernels.
Does that mean the BARs became available?
I attached the kernel log and lspci output here. What changed?

-- 
Best Regards,
Mike Gavrilov.

Re: 6.7/regression/KASAN: null-ptr-deref in amdgpu_ras_reset_error_count+0x2d6

2023-11-17 Thread Mikhail Gavrilov

On Thu, Nov 16, 2023 at 11:56 PM Alex Deucher  wrote:
>
> This patch should address the issue:
> https://patchwork.freedesktop.org/patch/567101/
> If you still see issues, you may also need this series:
> https://patchwork.freedesktop.org/series/126220/
>
> Alex

Thanks.
The first patch alone is enough.
Tested-on: 7900XTX, 6900XT and 6800M.
Tested-by: Mikhail Gavrilov 

-- 
Best Regards,
Mike Gavrilov.

Re: regression/bisected/6.7rc1: Instead of desktop I see a horizontal flashing bar with a picture of the desktop background on white screen

2023-11-15 Thread Mikhail Gavrilov

On Wed, Nov 15, 2023 at 11:39 PM Lee, Alvin  wrote:
>
> This change has a DMCUB dependency - are you able to update your DMCUB 
> version as well?
>

I can confirm the issue is gone after updating the firmware.

❯ dmesg | grep DMUB
[   11.496679] [drm] Loading DMUB firmware via PSP: version=0x07002300
[   12.000314] [drm] DMUB hardware initialized: version=0x07002300
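
For anyone who wants to check this on their own machine, here is a rough sketch that pulls the DMUB firmware version out of dmesg text and compares it with the version that showed the problem in this thread (0x07002100). Comparing the versions numerically is my assumption, and the helper name is invented for this example.

```python
import re

# Matches the version field on dmesg lines that mention DMUB.
DMUB_VERSION = re.compile(r"DMUB .*version=(0x[0-9a-fA-F]+)")

# Firmware version reported earlier in this thread on the affected setup.
AFFECTED_VERSION = 0x07002100

SAMPLE_DMESG = """\
[   11.496679] [drm] Loading DMUB firmware via PSP: version=0x07002300
[   12.000314] [drm] DMUB hardware initialized: version=0x07002300
"""


def dmub_versions(dmesg_text: str) -> list:
    """Return every DMUB firmware version found in the text, as integers."""
    return [int(m.group(1), 16) for m in DMUB_VERSION.finditer(dmesg_text)]


versions = dmub_versions(SAMPLE_DMESG)
newer = all(v > AFFECTED_VERSION for v in versions)
print([hex(v) for v in versions], "newer than affected:", newer)
```

In practice the text would come from `dmesg` output rather than a hard-coded sample.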



-- 
Best Regards,
Mike Gavrilov.

Re: regression/bisected/6.7rc1: Instead of desktop I see a horizontal flashing bar with a picture of the desktop background on white screen

2023-11-15 Thread Mikhail Gavrilov

On Wed, Nov 15, 2023 at 11:14 PM Hamza Mahfooz  wrote:
>
> What version of DMUB firmware are you on?
> The easiest way to find out would be using the following:
>
> # dmesg | grep DMUB
>

Sapphire AMD Radeon RX 7900 XTX PULSE OC:
❯ dmesg | grep DMUB
[   14.341362] [drm] Loading DMUB firmware via PSP: version=0x07002100
[   14.725547] [drm] DMUB hardware initialized: version=0x07002100

Reference GIGABYTE Radeon RX 7900 XTX 24G:
❯ dmesg | grep DMUB
[   11.405115] [drm] Loading DMUB firmware via PSP: version=0x07002100
[   11.773395] [drm] DMUB hardware initialized: version=0x07002100


-- 
Best Regards,
Mike Gavrilov.

Re: regression/bisected/6.7rc1: Instead of desktop I see a horizontal flashing bar with a picture of the desktop background on white screen

2023-11-15 Thread Mikhail Gavrilov

On Tue, Nov 14, 2023 at 11:03 PM Mikhail Gavrilov
 wrote:
>
> On Tue, Nov 14, 2023 at 3:55 PM Mikhail Gavrilov
>  wrote:
> >
> > Hi,
> > The 6.7-rc1 kernel came out yesterday.
> > Surprisingly, it turned out not to work with my LG C3.
> > I use this OLED TV as my primary monitor.
> > After logging in to GNOME I see a horizontal flashing bar with a picture of
> > the desktop background on a white screen.
> > Demonstration: https://youtu.be/7F76VfRkrVo
> >
> > I bisected the issue,
> > and the first bad commit is:
> > commit ed6e2782e974750f671e1101250bb19045be
> > Author: Alvin Lee 
> > Date:   Mon Oct 23 14:33:16 2023 -0400
> >
> > drm/amd/display: For cursor P-State allow for SubVP
> >
> > [Description]
> > - Similar to FPO, SubVP should also force cursor P-State
> >   allow instead of relying on natural assertion
> > - Implement code path to force and unforce cursor P-State
> >   allow for SubVP
> >
> > Reviewed-by: Samson Tam 
> > Acked-by: Hersen Wu 
> > Signed-off-by: Alvin Lee 
> > Tested-by: Daniel Wheeler 
> > Signed-off-by: Alex Deucher 
> >
> >  drivers/gpu/drm/amd/display/dc/hwss/dcn32/dcn32_hwseq.c | 17 ++---
> >  1 file changed, 2 insertions(+), 15 deletions(-)
> >
> > My hardware specs: https://linux-hardware.org/?probe=1c989dab38
> >
> > --
> > Best Regards,
> > Mike Gavrilov.
>
> I forgot to attach the kernel logs. I am not sure they will be helpful,
> because I didn't notice anything unusual in them.
>

This appears only on the 7900XTX at 120Hz.

-- 
Best Regards,
Mike Gavrilov.

regression/bisected/6.7rc1: Instead of desktop I see a horizontal flashing bar with a picture of the desktop background on white screen

2023-11-14 Thread Mikhail Gavrilov

Hi,
The 6.7-rc1 kernel came out yesterday.
Surprisingly, it turned out not to work with my LG C3.
I use this OLED TV as my primary monitor.
After logging in to GNOME I see a horizontal flashing bar with a picture of
the desktop background on a white screen.
Demonstration: https://youtu.be/7F76VfRkrVo

I bisected the issue,
and the first bad commit is:
commit ed6e2782e974750f671e1101250bb19045be
Author: Alvin Lee 
Date:   Mon Oct 23 14:33:16 2023 -0400

drm/amd/display: For cursor P-State allow for SubVP

[Description]
- Similar to FPO, SubVP should also force cursor P-State
  allow instead of relying on natural assertion
- Implement code path to force and unforce cursor P-State
  allow for SubVP

Reviewed-by: Samson Tam 
Acked-by: Hersen Wu 
Signed-off-by: Alvin Lee 
Tested-by: Daniel Wheeler 
Signed-off-by: Alex Deucher 

 drivers/gpu/drm/amd/display/dc/hwss/dcn32/dcn32_hwseq.c | 17 ++---
 1 file changed, 2 insertions(+), 15 deletions(-)

My hardware specs: https://linux-hardware.org/?probe=1c989dab38

-- 
Best Regards,
Mike Gavrilov.

Re: 6.7/regression/KASAN: null-ptr-deref in amdgpu_ras_reset_error_count+0x2d6

2023-11-07 Thread Mikhail Gavrilov

On Wed, Nov 8, 2023 at 12:12 AM Alex Deucher  wrote:
>
> The attached patch should fix it.  Not sure why your GPU shows up as
> busy.  The AGP aperture was just disabled.

Tested-by: Mikhail Gavrilov 
Thanks, after applying the patch the GPU load is back to what I expect.
Games are working, so overall everything looks good for now.

-- 
Best Regards,
Mike Gavrilov.

Re: 6.7/regression/KASAN: null-ptr-deref in amdgpu_ras_reset_error_count+0x2d6

2023-11-07 Thread Mikhail Gavrilov

On Mon, Nov 6, 2023 at 8:29 PM Alex Deucher  wrote:
>
> Already fixed in this commit:
> https://gitlab.freedesktop.org/agd5f/linux/-/commit/d1d4c0b7b65b7fab2bc6f97af9e823b1c42ccdb0
> Which is included in last week's PR.
>

Thanks, it fixed the issue above.
But unfortunately this is not the only problem I see on my laptop.
Now I am observing 100% GPU load all the time.
It looks like this screenshot: https://postimg.cc/QHLQncMg

Another round of bisecting says that this commit is to blame:
❯ git bisect good
de59b69932e64d77445d973a101d81d6e7e670c6 is the first bad commit
commit de59b69932e64d77445d973a101d81d6e7e670c6
Author: Alex Deucher 
Date:   Wed Sep 20 13:27:58 2023 -0400

drm/amdgpu/gmc: set a default disable value for AGP

To disable AGP, the start needs to be set to a higher
value than the end.  Set a default disable value for
the AGP aperture and allow the IP specific GMC code
to enable it selectively be calling amdgpu_gmc_agp_location().

Reviewed-by: Christian König 
Signed-off-by: Alex Deucher 

 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c   | 27 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h   |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c|  3 +++
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c|  3 ++-
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c|  3 ++-
 drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c |  4 ++--
 drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c |  4 ++--
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c |  4 ++--
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c |  3 ++-
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  2 +-
 10 files changed, 37 insertions(+), 18 deletions(-)

I checked twice and made sure that it does not happen on commit
29495d81457a483c2859ccde59cc063034bfe47d

-- 
Best Regards,
Mike Gavrilov.

Re: [bug/bisected] commit a2848d08742c8e8494675892c02c0d22acbe3cf8 cause general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN NOPTI

2023-07-18 Thread Mikhail Gavrilov

On Tue, Jul 18, 2023 at 7:13 AM Chen, Guchun  wrote:
>
> [Public]
>
> Hello Mike,
>
> I guess this patch can resolve your problem.
> https://patchwork.freedesktop.org/patch/547897/
>
> Regards,
> Guchun
>

Tested-by: Mikhail Gavrilov 
Thanks, the issue is gone with this patch.

I didn't say anything above about how to reproduce this problem.
The case was like this:
On a dual-GPU laptop, I ran Google Chrome on the discrete graphics card.
I used this command for it:
$ DRI_PRIME=1 google-chrome-unstable --disable-features=Vulkan

-- 
Best Regards,
Mike Gavrilov.

Re: [regression][6.5] KASAN: slab-out-of-bounds in amdgpu_vm_pt_create+0x555/0x670 [amdgpu] on Radeon 7900XTX

2023-07-16 Thread Mikhail Gavrilov

On Fri, Jul 14, 2023 at 4:09 PM Chen, Guchun  wrote:
>
> Thanks for your patience on this, Mike. I think 
> https://patchwork.freedesktop.org/patch/547592/ can help this, please take a 
> try.

Tested-by: Mikhail Gavrilov 
Thanks, it looks good. I spent the whole weekend with these patches on
top of 3f01e9fed845 and didn't notice any regressions.

-- 
Best Regards,
Mike Gavrilov.

Re: [regression][6.5] KASAN: slab-out-of-bounds in amdgpu_vm_pt_create+0x555/0x670 [amdgpu] on Radeon 7900XTX

2023-07-07 Thread Mikhail Gavrilov

On Fri, Jul 7, 2023 at 6:01 AM Chen, Guchun  wrote:
>
> [Public]
>
> Hi Mike,
>
> Yes, we are aware of this problem, and we are working on that. The problem is
> caused by recent code that stores xcp_id in the amdgpu bo for accounting memory
> usage and so on. However, not all VMs are attached to that, like the case in
> amdgpu_mes_self_test.
>

I would like to take part in testing the fix.

-- 
Best Regards,
Mike Gavrilov.

Re: [6.4-rc7][regression] slab-out-of-bounds in amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu]

2023-06-21 Thread Mikhail Gavrilov

On Wed, Jun 21, 2023 at 12:47 PM Zhu, Jiadong  wrote:
>
> [AMD Official Use Only - General]
>
> Hi,
>
> It is fixed on  
> https://patchwork.freedesktop.org/patch/542647/?series=119384=2
>
> Could you check whether this patch is included?
>

I confirm this patch fixes the issue.
But the patch is still not merged in 6.4, which is a problem.

-- 
Best Regards,
Mike Gavrilov.

[6.4-rc7][regression] slab-out-of-bounds in amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu]

2023-06-21 Thread Mikhail Gavrilov

Hi,
after commit 5b711e7f9c73e5ff44d6ac865711d9a05c2a0360 I see a KASAN
bug report at every boot:

Backtrace:
[   18.600551] 
==
[   18.600558] BUG: KASAN: slab-out-of-bounds in
amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu]
[   18.600943] Write of size 8 at addr 8881e4d3a098 by task kworker/8:1/133

[   18.600952] CPU: 8 PID: 133 Comm: kworker/8:1 Tainted: GW
 L---  ---  6.4.0-0.rc7.53.fc39.x86_64+debug #1
[   18.600960] Hardware name: ASUSTeK COMPUTER INC. ROG Strix
G513QY_G513QY/G513QY, BIOS G513QY.331 02/24/2023
[   18.600966] Workqueue: events
amdgpu_device_delayed_init_work_handler [amdgpu]
[   18.601253] Call Trace:
[   18.601256]  
[   18.601260]  dump_stack_lvl+0x76/0xd0
[   18.601267]  print_report+0xcf/0x670
[   18.601275]  ? amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu]
[   18.601573]  ? amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu]
[   18.601865]  kasan_report+0xa8/0xe0
[   18.601870]  ? amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu]
[   18.602163]  amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu]
[   18.602455]  gfx_v9_0_ring_emit_ib_gfx+0x4cc/0xd50 [amdgpu]
[   18.602767]  ? amdgpu_sw_ring_ib_begin+0x1b4/0x3d0 [amdgpu]
[   18.603061]  amdgpu_ib_schedule+0x7cb/0x1570 [amdgpu]
[   18.603354]  gfx_v9_0_ring_test_ib+0x375/0x540 [amdgpu]
[   18.603656]  ? __pfx_gfx_v9_0_ring_test_ib+0x10/0x10 [amdgpu]
[   18.603959]  ? __pfx_lock_acquire+0x10/0x10
[   18.603966]  amdgpu_ib_ring_tests+0x2bc/0x490 [amdgpu]
[   18.604260]  amdgpu_device_delayed_init_work_handler+0x15/0x30 [amdgpu]
[   18.604544]  process_one_work+0x888/0x1460
[   18.604551]  ? worker_thread+0x2c8/0x12c0
[   18.604555]  ? __pfx_process_one_work+0x10/0x10
[   18.604562]  worker_thread+0x104/0x12c0
[   18.604567]  ? __kthread_parkme+0xc1/0x1f0
[   18.604573]  ? __pfx_worker_thread+0x10/0x10
[   18.604577]  kthread+0x2ee/0x3c0
[   18.604581]  ? __pfx_kthread+0x10/0x10
[   18.604586]  ret_from_fork+0x2c/0x50
[   18.604593]  

[   18.604598] Allocated by task 466:
[   18.604601]  kasan_save_stack+0x33/0x60
[   18.604606]  kasan_set_track+0x25/0x30
[   18.604610]  __kasan_kmalloc+0x8f/0xa0
[   18.604614]  __kmalloc+0x62/0x160
[   18.604618]  amdgpu_ring_mux_init+0x6e/0x1b0 [amdgpu]
[   18.604905]  gfx_v9_0_sw_init+0xffe/0x2930 [amdgpu]
[   18.605197]  amdgpu_device_init+0x3c36/0x7fc0 [amdgpu]
[   18.605476]  amdgpu_driver_load_kms+0x1d/0x4b0 [amdgpu]
[   18.605753]  amdgpu_pci_probe+0x279/0x9a0 [amdgpu]
[   18.606029]  local_pci_probe+0xdd/0x190
[   18.606034]  pci_device_probe+0x23a/0x770
[   18.606039]  really_probe+0x3e2/0xb80
[   18.606044]  __driver_probe_device+0x18c/0x450
[   18.606048]  driver_probe_device+0x4a/0x120
[   18.606052]  __driver_attach+0x1e5/0x4a0
[   18.606056]  bus_for_each_dev+0x109/0x190
[   18.606061]  bus_add_driver+0x2a1/0x570
[   18.606064]  driver_register+0x134/0x460
[   18.606069]  do_one_initcall+0xd5/0x3b0
[   18.606073]  do_init_module+0x238/0x770
[   18.606079]  load_module+0x5581/0x6f10
[   18.606082]  __do_sys_init_module+0x1f2/0x220
[   18.606086]  do_syscall_64+0x60/0x90
[   18.606091]  entry_SYSCALL_64_after_hwframe+0x72/0xdc

[   18.606099] The buggy address belongs to the object at 8881e4d3a000
which belongs to the cache kmalloc-128 of size 128
[   18.606106] The buggy address is located 24 bytes to the right of
allocated 128-byte region [8881e4d3a000, 8881e4d3a080)

[   18.606115] The buggy address belongs to the physical page:
[   18.606119] page:024dbf3d refcount:1 mapcount:0
mapping: index:0x0 pfn:0x1e4d3a
[   18.606126] head:024dbf3d order:1 entire_mapcount:0
nr_pages_mapped:0 pincount:0
[   18.606132] flags:
0x17c0010200(slab|head|node=0|zone=2|lastcpupid=0x1f)
[   18.606138] page_type: 0x()
[   18.606143] raw: 0017c0010200 8881000428c0 dead0122

[   18.606148] raw:  00200020 0001

[   18.606153] page dumped because: kasan: bad access detected

[   18.606159] Memory state around the buggy address:
[   18.606162]  8881e4d39f80: 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00
[   18.606167]  8881e4d3a000: 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00
[   18.606172] >8881e4d3a080: fc fc fc fc fc fc fc fc fc fc fc fc
fc fc fc fc
[   18.606176] ^
[   18.606180]  8881e4d3a100: 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 fc
[   18.606184]  8881e4d3a180: fc fc fc fc fc fc fc fc fc fc fc fc
fc fc fc fc
[   18.606189] 
==
[   18.606201] Disabling lock debugging due to kernel taint
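
The "24 bytes to the right" figure in the report follows directly from the addresses it prints. A small sketch of that arithmetic is below. The addresses are copied as they appear in this archive, where the leading digits seem to have been stripped; only their difference matters here. The function name is hypothetical.

```python
def describe_oob(access_addr: int, object_start: int, object_size: int):
    """Return (offset into the object, bytes past the end of the object)."""
    offset = access_addr - object_start
    bytes_past_end = offset - object_size
    return offset, bytes_past_end


# Write at ...a098, object at ...a000, kmalloc-128 region of 128 bytes.
offset, past_end = describe_oob(0x8881E4D3A098, 0x8881E4D3A000, 128)
print(f"offset {offset} bytes, {past_end} bytes past the 128-byte region")
```

So the write lands 152 bytes from the start of the object, 24 bytes beyond the end of the 128-byte allocation, which agrees with the KASAN message.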

From bisect log:
5b711e7f9c73e5ff44d6ac865711d9a05c2a0360 is the first bad commit
commit 5b711e7f9c73e5ff44d6ac865711d9a05c2a0360
Author: Jiadong Zhu 
Date:   Thu May 25 18:42:15 2023 +0800

drm/amdgpu: Implement gfx9

Re: [PATCH 2/2] drm/amdgpu: make sure that BOs have a backing store

2023-06-06 Thread Mikhail Gavrilov

On Mon, Jun 5, 2023 at 2:11 PM Christian König
 wrote:
>
> It's perfectly possible that the BO is about to be destroyed and doesn't
> have a backing store associated with it.
>

Thanks Christian. I appreciate your brilliant work.
"KASAN: null-ptr-deref in range
[0x0000000000000010-0x0000000000000017]" is finally fixed.
Tested-by: Mikhail Gavrilov 

-- 
Best Regards,
Mike Gavrilov.

Re: KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] - RIP: 0010:amdgpu_bo_get_memory+0x80/0x360 [amdgpu]

2023-05-18 Thread Mikhail Gavrilov

On Mon, May 8, 2023 at 3:40 PM Mikhail Gavrilov
 wrote:
>
> Can no one reproduce this?
> I prepared a video instruction which can help:
> https://youtu.be/0ipQnMpZG1Y
>
> 1. Run the script which counts watchers:
> $ for i in {1..9}; do sudo curl -s
> https://raw.githubusercontent.com/fatso83/dotfiles/master/utils/scripts/inotify-consumers
> | bash; done
>
> 2. Run the game "Division 2"
>
> 3. Run 20 windows of Google Chrome with this script:
> $ for i in {1..20}; do google-chrome-unstable
> --profile-directory="Test-2" --new-window --start-maximized
> "youtube.com" & done
>
> I hope that after this you will see the desired backtrace.
>

I found another way to reproduce the problem.

Demonstration: https://youtu.be/6cvs4cCMo4M

1. Run the game "Division 2"
2. Run 20 windows of Google Chrome with this script:
$ for i in {1..20}; do google-chrome-unstable
--profile-directory="Test-2" --new-window --start-maximized
"youtube.com" & done
3. Run "nvtop" and get the kernel bug.

After that, "nvtop" stops working until reboot.

Can anyone confirm it, please?

-- 
Best Regards,
Mike Gavrilov.

Re: KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] - RIP: 0010:amdgpu_bo_get_memory+0x80/0x360 [amdgpu]

2023-05-08 Thread Mikhail Gavrilov

On Fri, May 5, 2023 at 6:44 PM Mikhail Gavrilov
 wrote:
> I need to say that it may not be easy to reproduce this bug.
> For helping reproduce:
> 1. I looped script above:
> $ for i in {1..9}; do sudo curl -s
> https://raw.githubusercontent.com/fatso83/dotfiles/master/utils/scripts/inotify-consumers
> | bash; done
> 2. Launched Google Chrome with 26 open windows
> 3. And played the game Division 2.
> With a little time and luck I get the desired backtrace again and again.
>
> I am ready to answer any question and open for testing any patches.
> Thanks.

Can no one reproduce this?
I prepared a video instruction which can help:
https://youtu.be/0ipQnMpZG1Y

1. Run the script which counts watchers:
$ for i in {1..9}; do sudo curl -s
https://raw.githubusercontent.com/fatso83/dotfiles/master/utils/scripts/inotify-consumers
| bash; done

2. Run the game "Division 2"

3. Run 20 windows of Google Chrome with this script:
$ for i in {1..20}; do google-chrome-unstable
--profile-directory="Test-2" --new-window --start-maximized
"youtube.com" & done

I hope that after this you will see the desired backtrace.

-- 
Best Regards,
Mike Gavrilov.

Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

2023-04-25 Thread Mikhail Gavrilov

On Thu, Apr 20, 2023 at 3:32 PM Mikhail Gavrilov
 wrote:
>
> It is important not to give up.
> https://youtu.be/25zhHBGIHJ8 [40 min]
> https://youtu.be/utnDR26eYBY [50 min]
> https://youtu.be/DJQ_tiimW6g [12 min]
> https://youtu.be/Y6AH1oJKivA [6 min]
> Yes, the issue is fully reproducible, but from time to time it does not
> happen on the first attempt.
> I also uploaded other videos which prove that the issue definitely
> exists if someone launches those games in turn.
> Reproducibility is only a matter of time.
>
> Anyway, I didn't want you to spend so much time trying to reproduce it.
> This monkey business suits me more than you.
> It would be better if I could collect more useful info.

Christian,
Did you manage to reproduce the problem?

Over the weekend I ran into a slab-use-after-free in amdgpu_vm_handle_moved.
I wasn't playing any games at the time.
The Xwayland process was affected, which led to a desktop hang.

==
BUG: KASAN: slab-use-after-free in amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
Read of size 8 at addr 888295c66190 by task Xwayland:cs0/173185

CPU: 21 PID: 173185 Comm: Xwayland:cs0 Tainted: GWL
---  ---  6.3.0-0.rc7.20230420gitcb0856346a60.59.fc39.x86_64+debug
#1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 4601 02/02/2023
Call Trace:
 
 dump_stack_lvl+0x76/0xd0
 print_report+0xcf/0x670
 ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
 ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
 kasan_report+0xa8/0xe0
 ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
 amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
 amdgpu_cs_ioctl+0x2b7e/0x5630 [amdgpu]
 ? __pfx___lock_acquire+0x10/0x10
 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
 ? mark_lock+0x101/0x16e0
 ? __lock_acquire+0xe54/0x59f0
 ? __pfx_lock_release+0x10/0x10
 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
 drm_ioctl_kernel+0x1fc/0x3d0
 ? __pfx_drm_ioctl_kernel+0x10/0x10
 drm_ioctl+0x4c5/0xaa0
 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
 ? __pfx_drm_ioctl+0x10/0x10
 ? _raw_spin_unlock_irqrestore+0x66/0x80
 ? lockdep_hardirqs_on+0x81/0x110
 ? _raw_spin_unlock_irqrestore+0x4f/0x80
 amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
 __x64_sys_ioctl+0x131/0x1a0
 do_syscall_64+0x60/0x90
 ? do_syscall_64+0x6c/0x90
 ? lockdep_hardirqs_on+0x81/0x110
 ? do_syscall_64+0x6c/0x90
 ? lockdep_hardirqs_on+0x81/0x110
 ? do_syscall_64+0x6c/0x90
 ? lockdep_hardirqs_on+0x81/0x110
 ? do_syscall_64+0x6c/0x90
 ? lockdep_hardirqs_on+0x81/0x110
 entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7ffb71b0892d
Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00
00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2
3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
RSP: 002b:7ffb677fe840 EFLAGS: 0246 ORIG_RAX: 0010
RAX: ffda RBX: 7ffb677fe9f8 RCX: 7ffb71b0892d
RDX: 7ffb677fe900 RSI: c0186444 RDI: 000d
RBP: 7ffb677fe890 R08: 7ffb677fea50 R09: 7ffb677fe8e0
R10: 556c4611bec0 R11: 0246 R12: 7ffb677fe900
R13: c0186444 R14: 000d R15: 7ffb677fe9f8
 

Allocated by task 173181:
 kasan_save_stack+0x33/0x60
 kasan_set_track+0x25/0x30
 __kasan_kmalloc+0x8f/0xa0
 __kmalloc_node+0x65/0x160
 amdgpu_bo_create+0x31e/0xfb0 [amdgpu]
 amdgpu_bo_create_user+0xca/0x160 [amdgpu]
 amdgpu_gem_create_ioctl+0x398/0x980 [amdgpu]
 drm_ioctl_kernel+0x1fc/0x3d0
 drm_ioctl+0x4c5/0xaa0
 amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
 __x64_sys_ioctl+0x131/0x1a0
 do_syscall_64+0x60/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

Freed by task 173185:
 kasan_save_stack+0x33/0x60
 kasan_set_track+0x25/0x30
 kasan_save_free_info+0x2e/0x50
 __kasan_slab_free+0x10b/0x1a0
 slab_free_freelist_hook+0x11e/0x1d0
 __kmem_cache_free+0xc0/0x2e0
 ttm_bo_release+0x667/0x9e0 [ttm]
 amdgpu_bo_unref+0x35/0x70 [amdgpu]
 amdgpu_gem_object_free+0x73/0xb0 [amdgpu]
 drm_gem_handle_delete+0xe3/0x150
 drm_ioctl_kernel+0x1fc/0x3d0
 drm_ioctl+0x4c5/0xaa0
 amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
 __x64_sys_ioctl+0x131/0x1a0
 do_syscall_64+0x60/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

Last potentially related work creation:
 kasan_save_stack+0x33/0x60
 __kasan_record_aux_stack+0x97/0xb0
 __call_rcu_common.constprop.0+0xf8/0x1af0
 drm_sched_fence_release_scheduled+0xb8/0xe0 [gpu_sched]
 dma_resv_reserve_fences+0x4dc/0x7f0
 ttm_eu_reserve_buffers+0x3f6/0x1190 [ttm]
 amdgpu_cs_ioctl+0x204d/0x5630 [amdgpu]
 drm_ioctl_kernel+0x1fc/0x3d0
 drm_ioctl+0x4c5/0xaa0
 amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
 __x64_sys_ioctl+0x131/0x1a0
 do_syscall_64+0x60/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

Second to last potentially related work creation:
 kasan_save_stack+0x33/0x60
 __kasan_record_aux_stack+0x97/0xb0
 __call_rcu_common.constprop.0+0xf8/0x1af0
 drm_sched_fence_release_scheduled+0xb8/0xe0 [gpu_sched]
 amdgpu_ctx_add_fence+0x2b1/0x

Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

2023-04-20 Thread Mikhail Gavrilov

On Thu, Apr 20, 2023 at 2:59 PM Christian König
 wrote:
> Could you try drm-misc-next as well?

If, as I assume, I cloned the right repo
$ git clone -b drm-misc-next
git://anongit.freedesktop.org/drm/drm-misc linux-drm-misc-next
then the last commit on this branch turned out to be completely unusable on my hardware.
Instead of the GDM login screen I see a black screen and hear the GPU fans howling.

In the kernel logs I see general protection fault:
general protection fault, probably for non-canonical address
0xdc2b:  [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0158-0x015f]
CPU: 0 PID: 749 Comm: sdma0 Tainted: GWL
6.3.0-rc4-misc-next-91c249b2b9f6a80c744387b6713adf275ffd296b+ #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 4601 02/02/2023
RIP: 0010:drm_sched_get_cleanup_job+0x41b/0x5c0 [gpu_sched]
Code: fa 48 c1 ea 03 80 3c 02 00 75 5c 49 8b 9f 80 00 00 00 48 b8 00
00 00 00 00 fc ff df 48 8d bb 58 01 00 00 48 89 fa 48 c1 ea 03 <80> 3c
02 00 75 55 48 01 ab 58 01 00 00 e9 0c fd ff ff 48 89 ef e8
RSP: 0018:c9000548fdb8 EFLAGS: 00010216
RAX: dc00 RBX:  RCX: 
RDX: 002b RSI: 0004 RDI: 0158
RBP: 085c R08:  R09: 888170711783
R10: ed102e0e22f0 R11: 8da81678 R12: 8881707116b0
R13: 888170711780 R14: 888266f89820 R15: 888266f89808
FS:  () GS:888fa200() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 560cea4a8000 CR3: 000191602000 CR4: 00350ef0
Call Trace:
 
 drm_sched_main+0xc3/0x930 [gpu_sched]
 ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched]
 ? __pfx_autoremove_wake_function+0x10/0x10
 ? __kthread_parkme+0xc1/0x1f0
 ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched]
 kthread+0x2a2/0x340
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2c/0x50
 
Modules linked in: amdgpu(+) drm_ttm_helper ttm video crct10dif_pclmul
drm_suballoc_helper crc32_pclmul iommu_v2 crc32c_intel drm_buddy
polyval_clmulni gpu_sched polyval_generic ucsi_ccg drm_display_helper
typec_ucsi nvme ghash_clmulni_intel igb typec ccp sha512_ssse3 cec
nvme_core sp5100_tco dca i2c_algo_bit nvme_common wmi ip6_tables
ip_tables fuse
---[ end trace  ]---
RIP: 0010:drm_sched_get_cleanup_job+0x41b/0x5c0 [gpu_sched]
Code: fa 48 c1 ea 03 80 3c 02 00 75 5c 49 8b 9f 80 00 00 00 48 b8 00
00 00 00 00 fc ff df 48 8d bb 58 01 00 00 48 89 fa 48 c1 ea 03 <80> 3c
02 00 75 55 48 01 ab 58 01 00 00 e9 0c fd ff ff 48 89 ef e8
RSP: 0018:c9000548fdb8 EFLAGS: 00010216
RAX: dc00 RBX:  RCX: 
RDX: 002b RSI: 0004 RDI: 0158
RBP: 085c R08:  R09: 888170711783
R10: ed102e0e22f0 R11: 8da81678 R12: 8881707116b0
R13: 888170711780 R14: 888266f89820 R15: 888266f89808
FS:  () GS:888fa200() knlGS:


I also attached a full system log.
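
A note for readers decoding the report above: the register dump shows RDI holding 0x0158 and RDX holding 0x002b, and the Code bytes shift RDX right by 3 before the faulting access. With generic KASAN, one shadow byte describes 8 bytes of memory, so that shadow index maps back to the range printed on the "KASAN: null-ptr-deref in range" line. A minimal Python sketch of the arithmetic, assuming the generic KASAN shadow scale shift of 3 (the helper name is made up for illustration):

```python
# Assumption: generic KASAN, where one shadow byte describes 8 bytes of
# memory (shadow scale shift of 3). The helper name is hypothetical.
KASAN_SHADOW_SCALE_SHIFT = 3


def shadow_index_to_range(shadow_index):
    """Return the inclusive address range covered by one shadow byte."""
    start = shadow_index << KASAN_SHADOW_SCALE_SHIFT
    end = start + (1 << KASAN_SHADOW_SCALE_SHIFT) - 1
    return start, end


# RDX in the register dump above holds 0x2b (the accessed address >> 3).
start, end = shadow_index_to_range(0x2B)
print(f"[{start:#06x}-{end:#06x}]")  # prints [0x0158-0x015f]
```

The same arithmetic applies to the earlier drm_sched_job_cleanup report in this thread, where RDX is 0x0f and the reported range starts at 0x0078.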

-- 
Best Regards,
Mike Gavrilov.


system-log.tar.xz
Description: application/xz

Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

2023-04-20 Thread Mikhail Gavrilov

On Thu, Apr 20, 2023 at 2:59 PM Christian König
 wrote:
>
> Could you try drm-misc-next as well?
>
> Going to give drm-fixes another round of testing.
>
> Thanks,
> Christian.

It is important not to give up.
https://youtu.be/25zhHBGIHJ8 [40 min]
https://youtu.be/utnDR26eYBY [50 min]
https://youtu.be/DJQ_tiimW6g [12 min]
https://youtu.be/Y6AH1oJKivA [6 min]
Yes, the issue is always reproducible, but from time to time it does
not happen on the first attempt.
I also uploaded other videos which prove that the issue definitely
occurs when those games are launched one after another.
Reproducing it is only a matter of time.

Anyway, I didn't want you to spend so much time trying to reproduce it.
This kind of monkey work suits me more than you.
It would be better if I could collect more useful info.

-- 
Best Regards,
Mike Gavrilov.

Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

2023-04-19 Thread Mikhail Gavrilov

On Wed, Apr 19, 2023 at 1:12 PM Christian König
 wrote:
>
> I'm already looking into this, but can't figure out why we run into
> problems here.
>
> What happens is that a CS is aborted without sending the job to the
> scheduler and in this case the cleanup function doesn't seem to work.
>
> Christian.

I can easily reproduce it on any AMD GPU hardware.
If you add more debug logging, I will come back with new logs that explain this.
Thanks.

-- 
Best Regards,
Mike Gavrilov.

Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

2023-04-19 Thread Mikhail Gavrilov

Christian?

❯ /usr/src/kernels/6.3.0-0.rc7.56.fc39.x86_64/scripts/faddr2line
/lib/debug/lib/modules/6.3.0-0.rc7.56.fc39.x86_64/kernel/drivers/gpu/drm/scheduler/gpu-sched.ko.debug
drm_sched_job_cleanup+0x9a
drm_sched_job_cleanup+0x9a/0x130:
drm_sched_job_cleanup at
/usr/src/debug/kernel-6.3-rc7/linux-6.3.0-0.rc7.56.fc39.x86_64/drivers/gpu/drm/scheduler/sched_main.c:808
(discriminator 3)
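
For readers unfamiliar with the notation: a frame such as "drm_sched_job_cleanup+0x96/0x290 [gpu_sched]" reads as symbol, offset into the function, total function size, and module. The offsets depend on the build, which is presumably why the debug kernel reports +0x96/0x290 while faddr2line on the non-debug build is queried with +0x9a and prints /0x130. A small Python sketch for splitting such frames; the regular expression and the function name are illustrative assumptions, not part of any kernel tooling:

```python
import re

# Matches frames such as "drm_sched_job_cleanup+0x96/0x290 [gpu_sched]".
# The pattern is an assumption tuned to the traces quoted in this thread.
FRAME_RE = re.compile(
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frame(line):
    """Split one stack-trace frame into symbol, offset, size and module."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return {
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),
        "size": int(m.group("size"), 16),
        "module": m.group("module"),  # None for built-in code
    }


frame = parse_frame(" ? drm_sched_job_cleanup+0x96/0x290 [gpu_sched]")
print(frame)
```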

❯ cat -s -n /usr/src/debug/kernel-6.3-rc7/linux-6.3.0-0.rc7.56.fc39.x86_64/drivers/gpu/drm/scheduler/sched_main.c | head -818 | tail -20
   799 /* drm_sched_job_arm() has been called */
   800 dma_fence_put(&job->s_fence->finished);
   801 } else {
   802 /* aborted job before committing to run it */
   803 drm_sched_fence_free(job->s_fence);
   804 }
   805
   806 job->s_fence = NULL;
   807
   808 xa_for_each(&job->dependencies, index, fence) {
   809 dma_fence_put(fence);
   810 }
   811 xa_destroy(&job->dependencies);
   812
   813 }
   814 EXPORT_SYMBOL(drm_sched_job_cleanup);
   815
   816 /**
   817 * drm_sched_ready - is the scheduler ready
   818 *

> git blame drivers/gpu/drm/scheduler/sched_main.c -L 800,819
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-17 10:49:16 +0200 800) dma_fence_put(&job->s_fence->finished);
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-17 10:49:16 +0200 801) } else {
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-17 10:49:16 +0200 802) /* aborted job before committing to run it */
d4c16733e7960 drivers/gpu/drm/scheduler/sched_main.c (Boris Brezillon 2021-09-03 14:05:54 +0200 803) drm_sched_fence_free(job->s_fence);
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-17 10:49:16 +0200 804) }
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-17 10:49:16 +0200 805)
26efecf955889 drivers/gpu/drm/scheduler/sched_main.c (Sharat Masetty 2018-10-29 15:02:28 +0530 806) job->s_fence = NULL;
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-05 12:46:49 +0200 807)
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-05 12:46:49 +0200 808) xa_for_each(&job->dependencies, index, fence) {
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-05 12:46:49 +0200 809) dma_fence_put(fence);
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-05 12:46:49 +0200 810) }
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-05 12:46:49 +0200 811) xa_destroy(&job->dependencies);
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-05 12:46:49 +0200 812)
26efecf955889 drivers/gpu/drm/scheduler/sched_main.c (Sharat Masetty 2018-10-29 15:02:28 +0530 813) }
26efecf955889 drivers/gpu/drm/scheduler/sched_main.c (Sharat Masetty 2018-10-29 15:02:28 +0530 814) EXPORT_SYMBOL(drm_sched_job_cleanup);
26efecf955889 drivers/gpu/drm/scheduler/sched_main.c (Sharat Masetty 2018-10-29 15:02:28 +0530 815)
e688b728228b9 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c (Christian König 2015-08-20 17:01:01 +0200 816) /**
2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan Deshmukh 2018-05-29 11:23:07 +0530 817)  * drm_sched_ready - is the scheduler ready
2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan Deshmukh 2018-05-29 11:23:07 +0530 818)  *
2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan Deshmukh 2018-05-29 11:23:07 +0530 819)  * @sched: scheduler instance

Daniel, since Christian looks a little busy, can you help? git blame
says that you are the author of the code which KASAN mentions in its
report.
The issue is reproducible on all AMD hardware available to me: 6800M, 6900XT, 7900XTX.

-- 
Best Regards,
Mike Gavrilov.

Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

2023-04-14 Thread Mikhail Gavrilov

On Tue, Apr 11, 2023 at 10:40 PM Mikhail Gavrilov
 wrote:
>
> Hi,
> KASAN continues to find problems in the drm_sched_job_cleanup code in 6.3-rc6.
> I did not get any feedback in the thread
> https://lore.kernel.org/lkml/cabxgcsmvub2ra4d+k5cna0_2521tox++d4nmoukki4x2-q_...@mail.gmail.com/
> Therefore, I decided to start a separate thread. Since the problems
> are different, the symptoms are also different.
>
> Reproduction scenario.
> After launching one of the listed games:
> - Cyberpunk 2077
> - Forza Horizon 4
> - Forza Horizon 5
> - Sackboy: A Big Adventure
>
> First, after some time (maybe after several attempts), a bug
> message from KASAN appears:
> ==
> BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
> Read of size 4 at addr 0078 by task ForzaHorizon4.e/31587
>
> CPU: 15 PID: 31587 Comm: ForzaHorizon4.e Tainted: GWL
> ---  ---  6.3.0-0.rc6.49.fc39.x86_64+debug #1
> Hardware name: System manufacturer System Product Name/ROG STRIX
> X570-I GAMING, BIOS 4601 02/02/2023
> Call Trace:
>  
>  dump_stack_lvl+0x72/0xc0
>  kasan_report+0xa4/0xe0
>  ? drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
>  kasan_check_range+0x104/0x1b0
>  drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
>  ? __pfx_drm_sched_job_cleanup+0x10/0x10 [gpu_sched]
>  ? slab_free_freelist_hook+0x11e/0x1d0
>  ? amdgpu_cs_parser_fini+0x363/0x5a0 [amdgpu]
>  amdgpu_job_free+0x40/0x1b0 [amdgpu]
>  amdgpu_cs_parser_fini+0x3c9/0x5a0 [amdgpu]
>  ? __pfx_amdgpu_cs_parser_fini+0x10/0x10 [amdgpu]
>  amdgpu_cs_ioctl+0x3d9/0x5630 [amdgpu]
>  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
>  ? __kmem_cache_free+0xbc/0x2e0
>  ? mark_lock+0x101/0x16e0
>  ? __lock_acquire+0xe54/0x59f0
>  ? kasan_save_stack+0x3f/0x50
>  ? __pfx_lock_release+0x10/0x10
>  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
>  drm_ioctl_kernel+0x1f8/0x3d0
>  ? __pfx_drm_ioctl_kernel+0x10/0x10
>  drm_ioctl+0x4c1/0xaa0
>  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
>  ? __pfx_drm_ioctl+0x10/0x10
>  ? _raw_spin_unlock_irqrestore+0x62/0x80
>  ? lockdep_hardirqs_on+0x7d/0x100
>  ? _raw_spin_unlock_irqrestore+0x4b/0x80
>  amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu]
>  __x64_sys_ioctl+0x12d/0x1a0
>  do_syscall_64+0x5c/0x90
>  ? do_syscall_64+0x68/0x90
>  ? lockdep_hardirqs_on+0x7d/0x100
>  ? do_syscall_64+0x68/0x90
>  ? do_syscall_64+0x68/0x90
>  ? lockdep_hardirqs_on+0x7d/0x100
>  ? do_syscall_64+0x68/0x90
>  ? asm_exc_page_fault+0x22/0x30
>  ? lockdep_hardirqs_on+0x7d/0x100
>  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> RIP: 0033:0x7fb8a270881d
> Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00
> 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2
> 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
> RSP: 002b:467ad060 EFLAGS: 0246 ORIG_RAX: 0010
> RAX: ffda RBX: 467ad358 RCX: 7fb8a270881d
> RDX: 467ad140 RSI: c0186444 RDI: 005a
> RBP: 467ad0b0 R08: 7fb7f00d3eb0 R09: 467ad100
> R10: 7fb88c68fb20 R11: 0246 R12: 467ad140
> R13: c0186444 R14: 005a R15: 7fb7f00d3e50
>  
> ==
>
> Eventually the games listed above stop working; they hang
> after this kernel warning:
> general protection fault, probably for non-canonical address
> 0xdc0f:  [#1] PREEMPT SMP KASAN NOPTI
> KASAN: null-ptr-deref in range [0x0078-0x007f]
> CPU: 15 PID: 31587 Comm: ForzaHorizon4.e Tainted: GB   WL
> ---  ---  6.3.0-0.rc6.49.fc39.x86_64+debug #1
> Hardware name: System manufacturer System Product Name/ROG STRIX
> X570-I GAMING, BIOS 4601 02/02/2023
> RIP: 0010:drm_sched_job_cleanup+0xa7/0x290 [gpu_sched]
> Code: d6 01 00 00 4c 8b 75 20 be 04 00 00 00 4d 8d 66 78 4c 89 e7 e8
> ba 4d 4e c9 4c 89 e2 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6
> 14 02 4c 89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 8a
> RSP: 0018:c9003676f5a8 EFLAGS: 00010216
> RAX: dc00 RBX: 88816f81f020 RCX: 0001
> RDX: 000f RSI: 0008 RDI: 9053e5e0
> RBP: 88816f81f000 R08: 0001 R09: 9053e5e7
> R10: fbfff20a7cbc R11: 6e696c6261736944 R12: 0078
> R13: 192006cedeb5 R14:  R15: c9003676f870
> FS:  4680f6c0() GS:888fa5c0() knlGS:2991
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7fb854d6f010 CR3: 00017b2d6000 CR4: 00350ee0
> Call Trace

Re: BUG: KASAN: slab-use-after-free in drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched]

2023-04-04 Thread Mikhail Gavrilov

On Fri, Mar 24, 2023 at 7:37 PM Christian König
 wrote:
>
> Yeah, that one
>
> Thanks for the info, looks like this isn't fixed.
>
> Christian.
>

Hi,
glad to see that "BUG: KASAN: slab-use-after-free in
drm_sched_get_cleanup_job+0x47b/0x5c0" was fixed in 6.3-rc5.
For the record, it would be good to know which commit fixes this issue.
I waited for this moment because I know of another issue which was also
found by the KASAN sanitizer.

BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
Read of size 4 at addr 0078 by task GameThread/23915

CPU: 10 PID: 23915 Comm: GameThread Tainted: GWL
---  ---  6.3.0-0.rc5.42.fc39.x86_64+debug #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 4601 02/02/2023
Call Trace:
 
 dump_stack_lvl+0x72/0xc0
 kasan_report+0xa4/0xe0
 ? drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
 kasan_check_range+0x104/0x1b0
 drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
 ? __pfx_drm_sched_job_cleanup+0x10/0x10 [gpu_sched]
 ? slab_free_freelist_hook+0x11e/0x1d0
 ? amdgpu_cs_parser_fini+0x363/0x5a0 [amdgpu]
 amdgpu_job_free+0x40/0x1b0 [amdgpu]
 amdgpu_cs_parser_fini+0x3c9/0x5a0 [amdgpu]
 ? __pfx_amdgpu_cs_parser_fini+0x10/0x10 [amdgpu]
 amdgpu_cs_ioctl+0x3d9/0x5630 [amdgpu]
 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
 ? mark_lock+0x101/0x16e0
 ? __lock_acquire+0xe54/0x59f0
 ? __pfx_lock_release+0x10/0x10
 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
 drm_ioctl_kernel+0x1f8/0x3d0
 ? __pfx_drm_ioctl_kernel+0x10/0x10
 drm_ioctl+0x4c1/0xaa0
 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
 ? __pfx_drm_ioctl+0x10/0x10
 ? _raw_spin_unlock_irqrestore+0x62/0x80
 ? lockdep_hardirqs_on+0x7d/0x100
 ? _raw_spin_unlock_irqrestore+0x4b/0x80
 amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu]
 __x64_sys_ioctl+0x12d/0x1a0
 do_syscall_64+0x5c/0x90
 ? do_syscall_64+0x68/0x90
 ? lockdep_hardirqs_on+0x7d/0x100
 ? do_syscall_64+0x68/0x90
 ? do_syscall_64+0x68/0x90
 ? lockdep_hardirqs_on+0x7d/0x100
 ? do_syscall_64+0x68/0x90
 ? do_syscall_64+0x68/0x90
 ? lockdep_hardirqs_on+0x7d/0x100
 entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7fe97a50881d
Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00
00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2
3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
RSP: 002b:7c35d3f0 EFLAGS: 0246 ORIG_RAX: 0010
RAX: ffda RBX: 7c35d6e8 RCX: 7fe97a50881d
RDX: 7c35d4d0 RSI: c0186444 RDI: 00ae
RBP: 7c35d440 R08: 7fe8fc0f0970 R09: 7c35d490
R10: 7fb79000 R11: 0246 R12: 7c35d4d0
R13: c0186444 R14: 00ae R15: 7fe8fc0f0900
 

I know at least 3 games which trigger this bug 100% of the time:
- Cyberpunk 2077
- Forza Horizon 4
- Forza Horizon 5

Should we continue to discuss it here, or is it better to create a new
thread (so that anyone else facing this issue can easily find a
solution on the internet)?

A full kernel log is attached here, as usual.

-- 
Best Regards,
Mike Gavrilov.


dmesg.tar.xz
Description: application/xz

Re: BUG: KASAN: slab-use-after-free in drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched]

2023-03-23 Thread Mikhail Gavrilov

On Tue, Mar 21, 2023 at 11:47 PM Christian König
 wrote:
>
> Hi Mikhail,
>
> That looks like a reference counting issue to me.
>
> I'm going to take a look, but we have already fixed one of those recently.
>
> Probably best that you try this on drm-fixes, just to double check that
> this isn't the same issue.
>

Hi Christian,
did you mean this branch?
$ git clone -b drm-fixes git://anongit.freedesktop.org/drm/drm linux-drm

If yes, I just checked and unfortunately the issue is still not fixed there.

[ 1984.295833] 
==
[ 1984.295876] BUG: KASAN: slab-use-after-free in
drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched]
[ 1984.295898] Read of size 8 at addr 88814cadc4c0 by task sdma1/764

[ 1984.295924] CPU: 12 PID: 764 Comm: sdma1 Tainted: GWL
  6.3.0-rc3-drm-fixes+ #1
[ 1984.295937] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4601 02/02/2023
[ 1984.295951] Call Trace:
[ 1984.295963]  
[ 1984.295975]  dump_stack_lvl+0x72/0xc0
[ 1984.295991]  print_report+0xcf/0x670
[ 1984.296007]  ? drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched]
[ 1984.296030]  ? drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched]
[ 1984.296047]  kasan_report+0xa4/0xe0
[ 1984.296118]  ? drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched]
[ 1984.296149]  drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched]
[ 1984.296175]  drm_sched_main+0x643/0x990 [gpu_sched]
[ 1984.296204]  ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched]
[ 1984.296222]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 1984.296290]  ? __kthread_parkme+0xc1/0x1f0
[ 1984.296304]  ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched]
[ 1984.296321]  kthread+0x29e/0x340
[ 1984.296334]  ? __pfx_kthread+0x10/0x10
[ 1984.296501]  ret_from_fork+0x2c/0x50
[ 1984.296518]  

[ 1984.296539] Allocated by task 12194:
[ 1984.296552]  kasan_save_stack+0x2f/0x50
[ 1984.296566]  kasan_set_track+0x21/0x30
[ 1984.296578]  __kasan_kmalloc+0x8b/0x90
[ 1984.296590]  amdgpu_driver_open_kms+0x10b/0x5a0 [amdgpu]
[ 1984.297051]  drm_file_alloc+0x46e/0x880
[ 1984.297064]  drm_open_helper+0x161/0x460
[ 1984.297076]  drm_open+0x1e7/0x5c0
[ 1984.297089]  drm_stub_open+0x24d/0x400
[ 1984.297107]  chrdev_open+0x215/0x620
[ 1984.297125]  do_dentry_open+0x5f1/0x1000
[ 1984.297146]  path_openat+0x1b3d/0x28a0
[ 1984.297164]  do_filp_open+0x1bd/0x400
[ 1984.297180]  do_sys_openat2+0x140/0x420
[ 1984.297197]  __x64_sys_openat+0x11f/0x1d0
[ 1984.297213]  do_syscall_64+0x5b/0x80
[ 1984.297231]  entry_SYSCALL_64_after_hwframe+0x72/0xdc

[ 1984.297266] Freed by task 12195:
[ 1984.297284]  kasan_save_stack+0x2f/0x50
[ 1984.297303]  kasan_set_track+0x21/0x30
[ 1984.297323]  kasan_save_free_info+0x2a/0x50
[ 1984.297343]  __kasan_slab_free+0x107/0x1a0
[ 1984.297361]  slab_free_freelist_hook+0x11e/0x1d0
[ 1984.297373]  __kmem_cache_free+0xbc/0x2e0
[ 1984.297385]  amdgpu_driver_postclose_kms+0x582/0x8d0 [amdgpu]
[ 1984.297821]  drm_file_free.part.0+0x638/0xb70
[ 1984.297834]  drm_release+0x1ea/0x470
[ 1984.297845]  __fput+0x213/0x9e0
[ 1984.297857]  task_work_run+0x11b/0x200
[ 1984.297869]  exit_to_user_mode_prepare+0x23a/0x260
[ 1984.297883]  syscall_exit_to_user_mode+0x16/0x50
[ 1984.297896]  do_syscall_64+0x67/0x80
[ 1984.297907]  entry_SYSCALL_64_after_hwframe+0x72/0xdc

[ 1984.298033] Last potentially related work creation:
[ 1984.298044]  kasan_save_stack+0x2f/0x50
[ 1984.298057]  __kasan_record_aux_stack+0x97/0xb0
[ 1984.298075]  __call_rcu_common.constprop.0+0xf8/0x1af0
[ 1984.298095]  amdgpu_bo_list_put+0x1a4/0x1f0 [amdgpu]
[ 1984.298557]  amdgpu_cs_parser_fini+0x293/0x5a0 [amdgpu]
[ 1984.299055]  amdgpu_cs_ioctl+0x4f2a/0x5630 [amdgpu]
[ 1984.299624]  drm_ioctl_kernel+0x1f8/0x3d0
[ 1984.299637]  drm_ioctl+0x4c1/0xaa0
[ 1984.299649]  amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu]
[ 1984.300083]  __x64_sys_ioctl+0x12d/0x1a0
[ 1984.300097]  do_syscall_64+0x5b/0x80
[ 1984.300109]  entry_SYSCALL_64_after_hwframe+0x72/0xdc

[ 1984.300135] Second to last potentially related work creation:
[ 1984.300149]  kasan_save_stack+0x2f/0x50
[ 1984.300167]  __kasan_record_aux_stack+0x97/0xb0
[ 1984.300185]  __call_rcu_common.constprop.0+0xf8/0x1af0
[ 1984.300203]  amdgpu_bo_list_put+0x1a4/0x1f0 [amdgpu]
[ 1984.300692]  amdgpu_cs_parser_fini+0x293/0x5a0 [amdgpu]
[ 1984.301133]  amdgpu_cs_ioctl+0x4f2a/0x5630 [amdgpu]
[ 1984.301577]  drm_ioctl_kernel+0x1f8/0x3d0
[ 1984.301598]  drm_ioctl+0x4c1/0xaa0
[ 1984.301610]  amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu]
[ 1984.302043]  __x64_sys_ioctl+0x12d/0x1a0
[ 1984.302056]  do_syscall_64+0x5b/0x80
[ 1984.302068]  entry_SYSCALL_64_after_hwframe+0x72/0xdc

[ 1984.302090] The buggy address belongs to the object at 88814cadc000
which belongs to the cache kmalloc-4k of size 4096
[ 1984.302103] The buggy address is located 1216 bytes inside of
freed 4096-byte region [88814cadc000, 88814cadd000)

[ 1984.302129] The buggy address belongs to the phys
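
The "1216 bytes inside of freed 4096-byte region" figure in the report above follows directly from the two addresses it prints. A quick Python check of that arithmetic, using the addresses as they appear in this archive (they look shortened here, which does not affect the subtraction):

```python
# Addresses copied from the KASAN report above, as shown in this archive.
object_start = 0x88814CADC000  # start of the kmalloc-4k object
buggy_addr = 0x88814CADC4C0    # address of the 8-byte read
object_size = 4096

offset = buggy_addr - object_start
# The read lands inside the freed object, not past its end.
assert 0 <= offset < object_size
print(offset)  # prints 1216
```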

[6.3][regression] commit a4e771729a51168bc36317effaa9962e336d4f5e lead to flood kernel logs with warning messages "at kernel/workqueue.c:3167 __flush_work+0x472/0x500"

2023-03-08 Thread Mikhail Gavrilov

Hi,
I did not encounter the drm_bridge_hpd_enable+0x94/0x9c [drm] issue
myself, but the fix for it leads to warning messages on my laptop, an
ASUS ROG Strix G15 Advantage Edition G513QY-HQ007, which has two AMD
GPUs: a discrete Radeon 6800M and a Vega 8 integrated into the Cezanne CPU.

I found the bad commit by bisecting:
❯ git bisect bad
a4e771729a51168bc36317effaa9962e336d4f5e is the first bad commit
commit a4e771729a51168bc36317effaa9962e336d4f5e
Author: Dmitry Baryshkov 
Date:   Tue Jan 24 12:45:48 2023 +0200

drm/probe_helper: sort out poll_running vs poll_enabled

There are two flags attemting to guard connector polling:
poll_enabled and poll_running. While poll_enabled semantics is clearly
defined and fully adhered (mark that drm_kms_helper_poll_init() was
called and not finalized by the _fini() call), the poll_running flag
doesn't have such clearliness.

This flag is used only in drm_helper_probe_single_connector_modes() to
guard calling of drm_kms_helper_poll_enable, it doesn't guard the
drm_kms_helper_poll_fini(), etc. Change it to only be set if the polling
is actually running. Tie HPD enablement to this flag.

This fixes the following warning reported after merging the HPD series:

Hot plug detection already enabled
WARNING: CPU: 2 PID: 9 at drivers/gpu/drm/drm_bridge.c:1257
drm_bridge_hpd_enable+0x94/0x9c [drm]
Modules linked in: videobuf2_memops snd_soc_simple_card
snd_soc_simple_card_utils fsl_imx8_ddr_perf videobuf2_common
snd_soc_imx_spdif adv7511 etnaviv imx8m_ddrc imx_dcss mc cec nwl_dsi
gov
CPU: 2 PID: 9 Comm: kworker/u8:0 Not tainted
6.2.0-rc2-15208-g25b283acd578 #6
Hardware name: NXP i.MX8MQ EVK (DT)
Workqueue: events_unbound deferred_probe_work_func
pstate: 6005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : drm_bridge_hpd_enable+0x94/0x9c [drm]
lr : drm_bridge_hpd_enable+0x94/0x9c [drm]
sp : 89ef3740
x29: 89ef3740 x28: 09331f00 x27: 1000
x26: 0020 x25: 81148ed8 x24: 0a8fe000
x23: fffd x22: 05086348 x21: 81133ee0
x20: 0550d800 x19: 05086288 x18: 0006
x17:  x16: 896ef008 x15: 972891004260
x14: 2a1403e19400 x13: 972891004260 x12: 2a1403e19400
x11: 7100385f29400801 x10: 0aa0 x9 : 88112744
x8 : 00250b00 x7 : 0003 x6 : 0011
x5 :  x4 : bd986a48 x3 : 0001
x2 :  x1 :  x0 : 0025
Call trace:
 drm_bridge_hpd_enable+0x94/0x9c [drm]
 drm_bridge_connector_enable_hpd+0x2c/0x3c [drm_kms_helper]
 drm_kms_helper_poll_enable+0x94/0x10c [drm_kms_helper]
 drm_helper_probe_single_connector_modes+0x1a8/0x510 [drm_kms_helper]
 drm_client_modeset_probe+0x204/0x1190 [drm]
 __drm_fb_helper_initial_config_and_unlock+0x5c/0x4a4 [drm_kms_helper]
 drm_fb_helper_initial_config+0x54/0x6c [drm_kms_helper]
 drm_fbdev_client_hotplug+0xd0/0x140 [drm_kms_helper]
 drm_fbdev_generic_setup+0x90/0x154 [drm_kms_helper]
 dcss_kms_attach+0x1c8/0x254 [imx_dcss]
 dcss_drv_platform_probe+0x90/0xfc [imx_dcss]
 platform_probe+0x70/0xcc
 really_probe+0xc4/0x2e0
 __driver_probe_device+0x80/0xf0
 driver_probe_device+0xe0/0x164
 __device_attach_driver+0xc0/0x13c
 bus_for_each_drv+0x84/0xe0
 __device_attach+0xa4/0x1a0
 device_initial_probe+0x1c/0x30
 bus_probe_device+0xa4/0xb0
 deferred_probe_work_func+0x90/0xd0
 process_one_work+0x200/0x474
 worker_thread+0x74/0x43c
 kthread+0xfc/0x110
 ret_from_fork+0x10/0x20
---[ end trace  ]---

Reported-by: Laurentiu Palcu 
Fixes: c8268795c9a9 ("drm/probe-helper: enable and disable HPD on
connectors")
Tested-by: Marek Szyprowski 
Tested-by: Chen-Yu Tsai 
Acked-by: Laurentiu Palcu 
Tested-by: Laurentiu Palcu 
Tested-by: Laurent Pinchart 
Signed-off-by: Dmitry Baryshkov 
Signed-off-by: Neil Armstrong 
Link: 
https://patchwork.freedesktop.org/patch/msgid/20230124104548.3234554-2-dmitry.barysh...@linaro.org
(cherry picked from commit d33a54e3991dfce88b4fc6d9c3360951c2c5660d)
Signed-off-by: Thomas Zimmermann 

 drivers/gpu/drm/drm_probe_helper.c | 42 +++---
 1 file changed, 21 insertions(+), 21 deletions(-)

Of course I verified the bisect result by reverting this commit, and
I can confirm that without commit
a4e771729a51168bc36317effaa9962e336d4f5e the warning messages do not
appear within a day.

I attached a full kernel log in case someone is interested in seeing it.

-- 
Best Regards,
Mike Gavrilov.
git bisect start
# status: waiting for both good and bad commits
# good: [5b7c4cabbb65f5c469464da6c5f614cbd7f730f2] Merge tag 'net-next-6.3' of 
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
git

Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

2023-02-28 Thread Mikhail Gavrilov

On Mon, Feb 27, 2023 at 3:22 PM Christian König
>
> Unfortunately yes. We could clean that up a bit more so that you don't
> run into a BUG() assertion, but what essentially happens here is that we
> completely fail to talk to the hardware.
>
> In this situation we can't even re-enable vesa or text console any more.
>
Then I don't understand why, when amdgpu is blacklisted via
modprobe.blacklist=amdgpu, I see graphics and can log in to
GNOME. Yes, without hardware acceleration, but that is better than
non-working graphics. It means there is some other driver (I assume it
is "video") which can successfully talk to the AMD hardware under
conditions where amdgpu cannot. My suggestion is that if
amdgpu fails to talk to the hardware, another suitable driver should
take over. I attached a system log taken with "pci=nocrs" and
"modprobe.blacklist=amdgpu" to show that graphics work correctly in
this case.
Would the Linux module loading mechanism need to be refined to do this?


-- 
Best Regards,
Mike Gavrilov.


system-without-amdgpu.tar.xz
Description: application/xz

Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

2023-02-24 Thread Mikhail Gavrilov

On Fri, Feb 24, 2023 at 8:31 PM Christian König
 wrote:
>
> Sorry I totally missed that you attached the full dmesg to your original
> mail.
>
> Yeah, the driver did fail gracefully. But then X doesn't come up and
> then gdm just dies.

Are you sure that these messages should be present when the driver
fails gracefully?

turning off the locking correctness validator.
CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L
---  ---  6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug
#1
Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY,
BIOS G513QY.320 09/07/2022
Call Trace:
 
 dump_stack_lvl+0x57/0x90
 register_lock_class+0x47d/0x490
 __lock_acquire+0x74/0x21f0
 ? lock_release+0x155/0x450
 lock_acquire+0xd2/0x320
 ? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu]
 ? lock_is_held_type+0xce/0x120
 _raw_spin_lock_irqsave+0x4d/0xa0
 ? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu]
 amdgpu_irq_disable_all+0x37/0xf0 [amdgpu]
 amdgpu_device_fini_hw+0x43/0x2c0 [amdgpu]
 amdgpu_driver_load_kms+0xe8/0x190 [amdgpu]
 amdgpu_pci_probe+0x140/0x420 [amdgpu]
 local_pci_probe+0x41/0x90
 pci_device_probe+0xc3/0x230
 really_probe+0x1b6/0x410
 __driver_probe_device+0x78/0x170
 driver_probe_device+0x1f/0x90
 __driver_attach+0xd2/0x1c0
 ? __pfx___driver_attach+0x10/0x10
 bus_for_each_dev+0x8a/0xd0
 bus_add_driver+0x141/0x230
 driver_register+0x77/0x120
 ? __pfx_init_module+0x10/0x10 [amdgpu]
 do_one_initcall+0x6e/0x350
 do_init_module+0x4a/0x220
 __do_sys_init_module+0x192/0x1c0
 do_syscall_64+0x5b/0x80
 ? asm_exc_page_fault+0x22/0x30
 ? lockdep_hardirqs_on+0x7d/0x100
 entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7fd58cfcb1be
Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f
84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01
RSP: 002b:7ffd1d1065d8 EFLAGS: 0246 ORIG_RAX: 00af
RAX: ffda RBX: 55b0b5aa6d70 RCX: 7fd58cfcb1be
RDX: 55b0b5a96670 RSI: 016b6156 RDI: 7fd589392010
RBP: 7ffd1d106690 R08: 55b0b5a93bd0 R09: 016b6ff0
R10: 55b5eea2c333 R11: 0246 R12: 55b0b5a96670
R13: 0002 R14: 55b0b5a9c170 R15: 55b0b5aa58a0
 
amdgpu: probe of :03:00.0 failed with error -12
amdgpu :08:00.0: enabling device (0006 -> 0007)
[drm] initializing kernel modesetting (RENOIR 0x1002:0x1638 0x1043:0x16C2 0xC4).


list_add corruption. prev->next should be next (c0940328), but
was . (prev=8c9b734062b0).
[ cut here ]
kernel BUG at lib/list_debug.c:30!
invalid opcode:  [#1] PREEMPT SMP NOPTI
CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L
---  ---  6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug
#1
Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY,
BIOS G513QY.320 09/07/2022
RIP: 0010:__list_add_valid+0x74/0x90
Code: 8d ff 0f 0b 48 89 c1 48 c7 c7 a0 3d b3 99 e8 a3 ed 8d ff 0f 0b
48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f8 3d b3 99 e8 8c ed 8d ff <0f> 0b
48 89 f2 48 89 c1 48 89 fe 48 c7 c7 50 3e b3 99 e8 75 ed 8d
RSP: 0018:a50f81aafa00 EFLAGS: 00010246
RAX: 0075 RBX: 8c9b734062b0 RCX: 
RDX:  RSI: 0027 RDI: 
RBP: 8c9b734062b0 R08:  R09: a50f81aaf8a0
R10: 0003 R11: 8caa1d2fffe8 R12: 8c9b7c0a5e48
R13:  R14: c13a6d20 R15: 
FS:  7fd58c6a5940() GS:8ca9d9a0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 55b0b5a955e0 CR3: 00017e86 CR4: 00750ee0
PKRU: 5554
Call Trace:
 
 ttm_device_init+0x184/0x1c0 [ttm]
 amdgpu_ttm_init+0xb8/0x610 [amdgpu]
 ? _printk+0x60/0x80
 gmc_v9_0_sw_init+0x4a3/0x7c0 [amdgpu]
 amdgpu_device_init+0x14e5/0x2520 [amdgpu]
 amdgpu_driver_load_kms+0x15/0x190 [amdgpu]
 amdgpu_pci_probe+0x140/0x420 [amdgpu]
 local_pci_probe+0x41/0x90
 pci_device_probe+0xc3/0x230
 really_probe+0x1b6/0x410
 __driver_probe_device+0x78/0x170
 driver_probe_device+0x1f/0x90
 __driver_attach+0xd2/0x1c0
 ? __pfx___driver_attach+0x10/0x10
 bus_for_each_dev+0x8a/0xd0
 bus_add_driver+0x141/0x230
 driver_register+0x77/0x120
 ? __pfx_init_module+0x10/0x10 [amdgpu]
 do_one_initcall+0x6e/0x350
 do_init_module+0x4a/0x220
 __do_sys_init_module+0x192/0x1c0
 do_syscall_64+0x5b/0x80
 ? asm_exc_page_fault+0x22/0x30
 ? lockdep_hardirqs_on+0x7d/0x100
 entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7fd58cfcb1be
Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f
84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01 48
RSP: 002b:7ffd1d1065d8 EFLAGS: 0246 ORIG_RAX: 00af
RAX: ffda RBX: 55b0b5aa6d70 RCX: 7fd58cfcb1be
RDX: 55b0b5a96670 RSI: 016b6156 RDI: 7fd589392010
RBP:

Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

2023-02-24 Thread Mikhail Gavrilov

On Fri, Feb 24, 2023 at 12:13 PM Christian König
 wrote:
>
> Hi Mikhail,
>
> this is pretty clearly a problem with the system and/or it's BIOS and
> not the GPU hw or the driver.
>
> The option pci=nocrs makes the kernel ignore additional resource windows
> the BIOS reports through ACPI. This then most likely leads to problems
> with amdgpu because it can't bring up its PCIe resources any more.
>
> The output of "sudo lspci - -s $BUSID_OF_AMDGPU" might help
> understand the problem

I attach the lspci output both with and without pci=nocrs.

The differences for Cezanne Radeon Vega Series:
with pci=nocrs:
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx-
Interrupt: pin A routed to IRQ 255
Region 4: I/O ports at e000 [disabled] [size=256]
Capabilities: [c0] MSI-X: Enable- Count=4 Masked-

Without pci=nocrs:
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Interrupt: pin A routed to IRQ 44
Region 4: I/O ports at e000 [size=256]
Capabilities: [c0] MSI-X: Enable+ Count=4 Masked-


The differences for Navi 22 Radeon 6800M:
with pci=nocrs:
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx-
Interrupt: pin A routed to IRQ 255
Region 0: Memory at f8 (64-bit, prefetchable) [disabled] [size=16G]
Region 2: Memory at fc (64-bit, prefetchable) [disabled] [size=256M]
Region 5: Memory at fca0 (32-bit, non-prefetchable) [disabled] [size=1M]
AtomicOpsCtl: ReqEn-
Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
Address:   Data: 

Without pci=nocrs:
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 103
Region 0: Memory at f8 (64-bit, prefetchable) [size=16G]
Region 2: Memory at fc (64-bit, prefetchable) [size=256M]
Region 5: Memory at fca0 (32-bit, non-prefetchable) [size=1M]
AtomicOpsCtl: ReqEn+
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: fee0  Data: 
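
The comparison above was done by eye. As a rough sketch, the same kind of difference can be extracted mechanically by treating each lspci token ending in "+" or "-" as a boolean flag and reporting the ones whose state changed. The function names are made up for illustration, and the two strings are the Navi 22 Control lines quoted above, joined onto one line:

```python
def parse_flags(line):
    """Map lspci flag tokens such as 'Mem+' or 'DisINTx-' to booleans."""
    flags = {}
    for token in line.split():
        if len(token) > 1 and token[-1] in "+-":
            flags[token[:-1]] = token[-1] == "+"
    return flags


def changed_flags(before, after):
    """Return {flag: (state_before, state_after)} for flags that differ."""
    a, b = parse_flags(before), parse_flags(after)
    return {
        name: (a[name], b[name])
        for name in a
        if name in b and a[name] != b[name]
    }


with_nocrs = (
    "Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- "
    "Stepping- SERR- FastB2B- DisINTx-"
)
without_nocrs = (
    "Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- "
    "Stepping- SERR- FastB2B- DisINTx+"
)
print(changed_flags(with_nocrs, without_nocrs))
```

For these two lines the flags that differ are Mem, BusMaster and DisINTx, all of them off with pci=nocrs and on without it.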

> but I strongly suggest to try a BIOS update first.

That was the first thing I did, and I am afraid there are no more BIOS updates.
https://rog.asus.com/laptops/rog-strix/2021-rog-strix-g15-advantage-edition-series/helpdesk_bios/

I also have experience in dealing with manufacturers' tech support.
Usually it ends with "we do not provide drivers for Linux".

-- 
Best Regards,
Mike Gavrilov.
❯ sudo lspci - -s 08:00.0
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] 
Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c4) (prog-if 00 
[VGA controller])
Subsystem: ASUSTeK Computer Inc. Radeon Vega 8
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ SERR- 
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA 
PME(D0-,D1+,D2+,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 
unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- 
TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit 
Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- 
LTR-
 10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt+ 
EETLPPrefix+, MaxEETLPPrefixes 1
 EmergencyPowerReduction Not Supported, 
EmergencyPowerReductionInit-
 FRS-
 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 
10BitTagReq- OBFF Disabled,
 AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 
2Retimers+ DRS-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
 Transmit

amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

2023-02-23 Thread Mikhail Gavrilov

Hi,
I have an ASUS ROG Strix G15 Advantage Edition G513QY-HQ007 laptop, but
it is impossible to use without AC power because the system loses the
NVMe drive when I disconnect the power adapter.

Messages from kernel log when it happens:
nvme nvme0: controller is down; will reset: CSTS=0x, PCI_STATUS=0x10
nvme nvme0: Does your device have a faulty power saving mode enabled?
nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
and report a bug

I tried the recommended parameters
(nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off) to resolve
this issue, but without success.

On the linux-nvme mailing list, the latest advice was to try the
"pci=nocrs" parameter.

But with this parameter the amdgpu driver refuses to work and makes
the system unbootable. I can work around the boot problem by
blacklisting the driver, but that is not a good solution because I
don't want to lose the GPU.

Why does amdgpu not work with "pci=nocrs"?
And is it possible to solve this incompatibility?
It is very important, because when I boot the system with "pci=nocrs"
and without the amdgpu driver, the NVMe drive is not lost when I
disconnect the power adapter. So "pci=nocrs" really helps.
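
Since several boot parameters are being juggled here, a quick way to confirm which ones actually reached the kernel is to parse the command line. The following is a minimal sketch: the helper names and the sample command line are made up for illustration, and quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {key: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def has_option(cmdline, key, option):
    """True if 'key=' is present and 'option' is one of its comma-separated values."""
    value = parse_cmdline(cmdline).get(key)
    return value is not None and option in value.split(",")


# Hypothetical command line; on a running system the real one can be read
# with open("/proc/cmdline").read().
sample = ("BOOT_IMAGE=/vmlinuz root=/dev/nvme0n1p2 ro quiet "
          "pci=nocrs nvme_core.default_ps_max_latency_us=0")

print("pci=nocrs active:    ", has_option(sample, "pci", "nocrs"))
print("pcie_aspm=off active:", has_option(sample, "pcie_aspm", "off"))
```

Splitting the value on commas matters because "pci=" accepts several options in one argument, so a plain string comparison against "nocrs" would miss a combined setting.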

Below is what I see in the kernel log when I add the "pci=nocrs" parameter:

amdgpu :03:00.0: amdgpu: Fetched VBIOS from ATRM
amdgpu: ATOM BIOS: SWBRT77321.001
[drm] VCN(0) decode is enabled in VM mode
[drm] VCN(0) encode is enabled in VM mode
[drm] JPEG decode is enabled in VM mode
Console: switching to colour dummy device 80x25
amdgpu :03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature
disabled as experimental (default)
[drm] GPU posting now...
[drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment
size is 9-bit
amdgpu :03:00.0: amdgpu: VRAM: 12272M 0x0080 -
0x0082FEFF (12272M used)
amdgpu :03:00.0: amdgpu: GART: 512M 0x - 0x1FFF
amdgpu :03:00.0: amdgpu: AGP: 267894784M 0x0084 -
0x
[drm] Detected VRAM RAM=12272M, BAR=16384M
[drm] RAM width 192bits GDDR6
[drm] amdgpu: 12272M of VRAM memory ready
[drm] amdgpu: 31774M of GTT memory ready.
amdgpu :03:00.0: amdgpu: (-14) failed to allocate kernel bo
[drm] Debug VRAM access will use slowpath MM access
amdgpu :03:00.0: amdgpu: Failed to DMA MAP the dummy page
[drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block
 failed -12
amdgpu :03:00.0: amdgpu: amdgpu_device_ip_init failed
amdgpu :03:00.0: amdgpu: Fatal error during GPU init
amdgpu :03:00.0: amdgpu: amdgpu: finishing device.

Of course a full system log is also attached.

-- 
Best Regards,
Mike Gavrilov.


system-log-Fatal-error-during-GPU-init.tar.xz
Description: application/xz

Re: [bug][vaapi][h264] The commit 7cbe08a930a132d84b4cf79953b00b074ec7a2a7 on certain video files leads to problems with VAAPI hardware decoding.

2023-02-17 Thread Mikhail Gavrilov

On Fri, Feb 17, 2023 at 8:30 PM Alex Deucher  wrote:
>
> On Fri, Feb 17, 2023 at 1:10 AM Mikhail Gavrilov
>  wrote:
> >
> > On Fri, Dec 9, 2022 at 7:37 PM Leo Liu  wrote:
> > >
> > > Please try the latest AMDGPU driver:
> > >
> > > https://gitlab.freedesktop.org/agd5f/linux/-/commits/amd-staging-drm-next/
> > >
> >
> > Sorry Leo, I missed your message.
> > This issue is still present in 6.2-rc8.
> >
> > In my first message I was mistaken.
> >
> > > Before kernel 5.16 this only led to an artifact in the form of
> > > a green bar at the top of the screen, then starting from 5.17
> > > the GPU began to freeze.
> >
> > The real behaviour before 5.18:
> > - vlc could play the video with small artifacts in the form of a green
> > bar on top of the video
> > - after playback the vlc process exited correctly
> >
> > On 5.18 this behaviour changed:
> > - vlc shows a black screen instead of playing the video
> > - after playback the process does not exit
> > - if I try to kill the vlc process with 'kill -9', vlc becomes a zombie
> > process and many other processes start to hang (the following lines
> > appear in the kernel log after 2 minutes)
> >
> > INFO: task vlc:sh8:5248 blocked for more than 122 seconds.
> >   Tainted: GWL     ---  5.18.0-60.fc37.x86_64+debug 
> > #1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > task:vlc:sh8 state:D stack:13616 pid: 5248 ppid:  1934 
> > flags:0x4006
> > Call Trace:
> >  
> >  __schedule+0x492/0x1650
> >  ? _raw_spin_unlock_irqrestore+0x40/0x60
> >  ? debug_check_no_obj_freed+0x12d/0x250
> >  schedule+0x4e/0xb0
> >  schedule_timeout+0xe1/0x120
> >  ? lock_release+0x215/0x460
> >  ? trace_hardirqs_on+0x1a/0xf0
> >  ? _raw_spin_unlock_irqrestore+0x40/0x60
> >  dma_fence_default_wait+0x197/0x240
> >  ? __bpf_trace_dma_fence+0x10/0x10
> >  dma_fence_wait_timeout+0x229/0x260
> >  drm_sched_entity_fini+0x101/0x270 [gpu_sched]
> >  amdgpu_vm_fini+0x2b5/0x460 [amdgpu]
> >  ? idr_destroy+0x70/0xb0
> >  ? mutex_destroy+0x1e/0x50
> >  amdgpu_driver_postclose_kms+0x1ec/0x2c0 [amdgpu]
> >  drm_file_free.part.0+0x20d/0x260
> >  drm_release+0x6a/0x120
> >  __fput+0xab/0x270
> >  task_work_run+0x5c/0xa0
> >  do_exit+0x394/0xc40
> >  ? rcu_read_lock_sched_held+0x10/0x70
> >  do_group_exit+0x33/0xb0
> >  get_signal+0xbbc/0xbc0
> >  arch_do_signal_or_restart+0x30/0x770
> >  ? do_futex+0xfd/0x190
> >  ? __x64_sys_futex+0x63/0x190
> >  exit_to_user_mode_prepare+0x172/0x270
> >  syscall_exit_to_user_mode+0x16/0x50
> >  do_syscall_64+0x67/0x80
> >  ? do_syscall_64+0x67/0x80
> >  ? rcu_read_lock_sched_held+0x10/0x70
> >  ? trace_hardirqs_on_prepare+0x5e/0x110
> >  ? do_syscall_64+0x67/0x80
> >  ? rcu_read_lock_sched_held+0x10/0x70
> >  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > RIP: 0033:0x7f82c2364529
> > RSP: 002b:7f8210ff8c00 EFLAGS: 0246 ORIG_RAX: 00ca
> > RAX: fe00 RBX:  RCX: 7f82c2364529
> > RDX:  RSI: 0189 RDI: 7f823022542c
> > RBP: 7f8210ff8c30 R08:  R09: 
> > R10:  R11: 0246 R12: 
> > R13:  R14: 0001 R15: 7f823022542c
> >  
> > INFO: lockdep is turned off.
> >
> > I bisected this issue and the problematic commit is
> >
> > ❯ git bisect bad
> > 5f3854f1f4e211f494018160b348a1c16e58013f is the first bad commit
> > commit 5f3854f1f4e211f494018160b348a1c16e58013f
> > Author: Alex Deucher 
> > Date:   Thu Mar 24 18:04:00 2022 -0400
> >
> > drm/amdgpu: add more cases to noretry=1
> >
> > Port current list from amd-staging-drm-next.
> >
> > Signed-off-by: Alex Deucher 
> >
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > Unfortunately I couldn't simply revert this commit on 6.2-rc8 to
> > check, because the revert leads to conflicts.
> >
> > Alex, as the author of this commit, could you help me with it?
>
> append amdgpu.noretry=0 to the kernel command line in grub.

Thanks, I checked "amdgpu.noretry=0": after the page fault occurs,
vlc can still play the video, with small artifacts.

So I have some questions:

1. Why were retries disabled by default if they are still needed for
recoverable page faults? As Christian answered me before here:
https

Re: [bug][vaapi][h264] The commit 7cbe08a930a132d84b4cf79953b00b074ec7a2a7 on certain video files leads to problems with VAAPI hardware decoding.

2023-02-16 Thread Mikhail Gavrilov

On Fri, Dec 9, 2022 at 7:37 PM Leo Liu  wrote:
>
> Please try the latest AMDGPU driver:
>
> https://gitlab.freedesktop.org/agd5f/linux/-/commits/amd-staging-drm-next/
>

Sorry Leo, I missed your message.
This issue is still present in 6.2-rc8.

In my first message I was mistaken.

> Before kernel 5.16 this only led to an artifact in the form of
> a green bar at the top of the screen, then starting from 5.17
> the GPU began to freeze.

The real behaviour before 5.18:
- vlc could play the video with small artifacts in the form of a green
bar on top of the video
- after playback the vlc process exited correctly

On 5.18 this behaviour changed:
- vlc shows a black screen instead of playing the video
- after playback the process does not exit
- if I try to kill the vlc process with 'kill -9', vlc becomes a zombie
process and many other processes start to hang (the following lines
appear in the kernel log after 2 minutes)

INFO: task vlc:sh8:5248 blocked for more than 122 seconds.
  Tainted: GWL     ---  5.18.0-60.fc37.x86_64+debug #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:vlc:sh8 state:D stack:13616 pid: 5248 ppid:  1934 flags:0x4006
Call Trace:
 
 __schedule+0x492/0x1650
 ? _raw_spin_unlock_irqrestore+0x40/0x60
 ? debug_check_no_obj_freed+0x12d/0x250
 schedule+0x4e/0xb0
 schedule_timeout+0xe1/0x120
 ? lock_release+0x215/0x460
 ? trace_hardirqs_on+0x1a/0xf0
 ? _raw_spin_unlock_irqrestore+0x40/0x60
 dma_fence_default_wait+0x197/0x240
 ? __bpf_trace_dma_fence+0x10/0x10
 dma_fence_wait_timeout+0x229/0x260
 drm_sched_entity_fini+0x101/0x270 [gpu_sched]
 amdgpu_vm_fini+0x2b5/0x460 [amdgpu]
 ? idr_destroy+0x70/0xb0
 ? mutex_destroy+0x1e/0x50
 amdgpu_driver_postclose_kms+0x1ec/0x2c0 [amdgpu]
 drm_file_free.part.0+0x20d/0x260
 drm_release+0x6a/0x120
 __fput+0xab/0x270
 task_work_run+0x5c/0xa0
 do_exit+0x394/0xc40
 ? rcu_read_lock_sched_held+0x10/0x70
 do_group_exit+0x33/0xb0
 get_signal+0xbbc/0xbc0
 arch_do_signal_or_restart+0x30/0x770
 ? do_futex+0xfd/0x190
 ? __x64_sys_futex+0x63/0x190
 exit_to_user_mode_prepare+0x172/0x270
 syscall_exit_to_user_mode+0x16/0x50
 do_syscall_64+0x67/0x80
 ? do_syscall_64+0x67/0x80
 ? rcu_read_lock_sched_held+0x10/0x70
 ? trace_hardirqs_on_prepare+0x5e/0x110
 ? do_syscall_64+0x67/0x80
 ? rcu_read_lock_sched_held+0x10/0x70
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f82c2364529
RSP: 002b:7f8210ff8c00 EFLAGS: 0246 ORIG_RAX: 00ca
RAX: fe00 RBX:  RCX: 7f82c2364529
RDX:  RSI: 0189 RDI: 7f823022542c
RBP: 7f8210ff8c30 R08:  R09: 
R10:  R11: 0246 R12: 
R13:  R14: 0001 R15: 7f823022542c
 
INFO: lockdep is turned off.
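
When a log contains many such reports, it can help to pull out the blocked task and the driver frames automatically. The following is a minimal sketch under the assumption that the log uses the same line shapes as the excerpt above; the function names are invented for this example, and only frames without a leading "?" (the reliable ones) that carry a "[module]" tag are reported.

```python
import re

# "INFO: task <comm>:<pid> blocked for more than <N> seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than "
    r"(?P<secs>\d+) seconds\.")

# Reliable frames (no leading '?') that belong to a loadable module.
MODULE_FRAME = re.compile(
    r"^[ \t]*(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(?P<module>\w+)\]$",
    re.MULTILINE)


def find_hung_tasks(log_text):
    """Return (comm, pid, seconds) for every hung-task report in the log."""
    return [(m["comm"], int(m["pid"]), int(m["secs"]))
            for m in HUNG_TASK.finditer(log_text)]


def module_frames(log_text):
    """Return (function, module) pairs for reliable frames inside modules."""
    return [(m["func"], m["module"]) for m in MODULE_FRAME.finditer(log_text)]


# A few lines copied from the trace above.
sample = """INFO: task vlc:sh8:5248 blocked for more than 122 seconds.
 schedule+0x4e/0xb0
 ? lock_release+0x215/0x460
 dma_fence_wait_timeout+0x229/0x260
 drm_sched_entity_fini+0x101/0x270 [gpu_sched]
 amdgpu_vm_fini+0x2b5/0x460 [amdgpu]
 ? idr_destroy+0x70/0xb0
"""

print(find_hung_tasks(sample))
print(module_frames(sample))
```

On the excerpt this singles out drm_sched_entity_fini and amdgpu_vm_fini, i.e. the process is stuck waiting on a fence while its GPU VM is being torn down at exit.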

I bisected this issue and the problematic commit is

❯ git bisect bad
5f3854f1f4e211f494018160b348a1c16e58013f is the first bad commit
commit 5f3854f1f4e211f494018160b348a1c16e58013f
Author: Alex Deucher 
Date:   Thu Mar 24 18:04:00 2022 -0400

drm/amdgpu: add more cases to noretry=1

Port current list from amd-staging-drm-next.

Signed-off-by: Alex Deucher 

 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 3 +++
 1 file changed, 3 insertions(+)

Unfortunately I couldn't simply revert this commit on 6.2-rc8 to
check, because the revert leads to conflicts.

Alex, as the author of this commit, could you help me with it?


-- 
Best Regards,
Mike Gavrilov.

Re: [regression][6.0] After commit b261509952bc19d1012cf732f853659be6ebc61e I see WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70

2023-02-13 Thread Mikhail Gavrilov

On Thu, Feb 9, 2023 at 10:17 PM Leo Li  wrote:
>
> Hi Mikhail, seems like your report flew past me, thanks for the ping.
>
> This might be a simple issue of not backing off when deadlock was hit.
> drm_atomic_normalize_zpos() can return an error code, and I ignored it
> (oops!)
>
> Can you give this patch a try?
> https://gitlab.freedesktop.org/-/snippets/7414
>
> - Leo
>

Thanks,
I think the testing time was sufficient.
I observed three computers with different GPUs (6800M, 6900XT and
7900XTX) for more than 3 days, and the warning message about
drm_modeset_drop_locks no longer appears.

I hope this patch can be merged into 6.2 before the release.

Tested-by: Mikhail Gavrilov 

-- 
Best Regards,
Mike Gavrilov.

uptime.tar.xz
Description: application/xz

Re: [regression][6.0] After commit b261509952bc19d1012cf732f853659be6ebc61e I see WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70

2023-02-09 Thread Mikhail Gavrilov

Harry, please don't ignore me.
This issue still happens in 6.1 and 6.2.
Leo, you are the author of the problematic commit, so please don't stand aside.
Is really nobody interested in clean logs without warnings and errors?
I am 100% sure that reverting commit
b261509952bc19d1012cf732f853659be6ebc61e will stop these warnings. I
also attached fresh logs from 6.2.0-0.rc6.
I have started building 6.2-rc7 without commit
b261509952bc19d1012cf732f853659be6ebc61e to avoid these warnings.


On Thu, Oct 13, 2022 at 6:36 PM Mikhail Gavrilov
>
> Hi!
> I bisected an issue of the 6.0 kernel which started happening after
> 6.0-rc7 on all my machines.
>
> The backtrace of this issue looks like this:
>
> [ 2807.339439] [ cut here ]
> [ 2807.339445] WARNING: CPU: 11 PID: 2061 at
> drivers/gpu/drm/drm_modeset_lock.c:276
> drm_modeset_drop_locks+0x63/0x70
> [ 2807.339453] Modules linked in: tls uinput rfcomm snd_seq_dummy
> snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
> nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
> nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
> qrtr bnep intel_rapl_msr intel_rapl_common snd_sof_amd_renoir
> snd_sof_amd_acp snd_sof_pci snd_hda_codec_realtek sunrpc snd_sof
> snd_hda_codec_hdmi snd_hda_codec_generic snd_sof_utils snd_hda_intel
> snd_intel_dspcfg mt7921e snd_intel_sdw_acpi binfmt_misc snd_soc_core
> mt7921_common snd_hda_codec snd_compress vfat ac97_bus edac_mce_amd
> mt76_connac_lib snd_pcm_dmaengine fat snd_hda_core snd_rpl_pci_acp6x
> snd_pci_acp6x mt76 btusb snd_hwdep kvm_amd btrtl snd_seq btbcm
> mac80211 snd_seq_device kvm btintel btmtk libarc4 snd_pcm
> snd_pci_acp5x bluetooth snd_timer snd_rn_pci_acp3x irqbypass
> snd_acp_config snd_soc_acpi cfg80211 rapl snd joydev pcspkr
> asus_nb_wmi wmi_bmof
> [ 2807.339519]  snd_pci_acp3x soundcore i2c_piix4 k10temp amd_pmc
> asus_wireless zram amdgpu drm_ttm_helper ttm hid_asus asus_wmi
> crct10dif_pclmul iommu_v2 crc32_pclmul ledtrig_audio crc32c_intel
> gpu_sched sparse_keymap platform_profile hid_multitouch
> polyval_clmulni nvme ucsi_acpi drm_buddy polyval_generic
> drm_display_helper ghash_clmulni_intel serio_raw nvme_core ccp
> typec_ucsi rfkill sp5100_tco r8169 cec nvme_common typec wmi video
> i2c_hid_acpi i2c_hid ip6_tables ip_tables fuse
> [ 2807.339540] Unloaded tainted modules: acpi_cpufreq():1
> acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1
> acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1
> amd64_edac():1 acpi_cpufreq():1 acpi_cpufreq():1 amd64_edac():1
> amd64_edac():1 acpi_cpufreq():1 pcc_cpufreq():1 fjes():1
> amd64_edac():1 acpi_cpufreq():1 amd64_edac():1 acpi_cpufreq():1
> fjes():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 fjes():1
> amd64_edac():1 acpi_cpufreq():1 pcc_cpufreq():1 amd64_edac():1
> fjes():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1
> amd64_edac():1 fjes():1 acpi_cpufreq():1 amd64_edac():1
> pcc_cpufreq():1 acpi_cpufreq():1 fjes():1 amd64_edac():1
> pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1
> fjes():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1
> acpi_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 fjes():1
> acpi_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
> acpi_cpufreq():1 fjes():1 pcc_cpufreq():1 acpi_cpufreq():1
> pcc_cpufreq():1 fjes():1
> [ 2807.339579]  acpi_cpufreq():1 fjes():1 pcc_cpufreq():1
> acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 fjes():1
> acpi_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1
> acpi_cpufreq():1 fjes():1 acpi_cpufreq():1 fjes():1 fjes():1 fjes():1
> fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1
> fjes():1 fjes():1 fjes():1 fjes():1
> [ 2807.339596] CPU: 11 PID: 2061 Comm: gnome-shell Tainted: GW
>L 6.0.0-rc4-07-cb0eca01ad9756e853efec3301203c2b5b45aa9f+ #16
> [ 2807.339598] Hardware name: ASUSTeK COMPUTER INC. ROG Strix
> G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022
> [ 2807.339600] RIP: 0010:drm_modeset_drop_locks+0x63/0x70
> [ 2807.339602] Code: 42 08 48 89 10 48 89 1b 48 8d bb 50 ff ff ff 48
> 89 5b 08 e8 3f 41 55 00 48 8b 45 78 49 39 c4 75 c6 5b 5d 41 5c c3 cc
> cc cc cc <0f> 0b eb ac 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55
> 41 54
> [ 2807.339604] RSP: 0018:b6ad46e07b80 EFLAGS: 00010282
> [ 2807.339606] RAX: 0001 RBX:  RCX: 
> 0002
> [ 2807.339607] RDX: 0001 RSI: a6a118b1 RDI: 
> b6ad46e07c00
> [ 2807.339608] RBP: b6ad46e07c00 R08:  R09: 
> 
> [ 2807.339609] R10:  R11: 0001 R12: 
> 
> [ 2807.339610] R13: 9

Re: [PATCH] drm/amd: fix memory leak in amdgpu_cs_sync_rings

2023-02-03 Thread Mikhail Gavrilov

On Fri, Feb 3, 2023 at 12:10 AM Bert Karwatzki  wrote:
>
> I hope I got it right this time:
> Here is the fix for
> Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/2360
>
> From 6e064c9565ef0da890f3fcb2a4f6a8cd44a12fdb Mon Sep 17 00:00:00 2001
> From: Bert Karwatzki 
> Date: Thu, 2 Feb 2023 19:50:27 +0100
> Subject: [PATCH] Fix memory leak in amdgpu_cs_sync_rings.
>
> Signed-off-by: Bert Karwatzki 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 0f4cb41078c1..08eced097bd8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -1222,10 +1222,13 @@ static int amdgpu_cs_sync_rings(struct
> amdgpu_cs_parser *p)
>  * next job actually sees the results from the
> previous one
>  * before we start executing on the same scheduler
> ring.
>  */
> -   if (!s_fence || s_fence->sched != sched)
> +   if (!s_fence || s_fence->sched != sched) {
> +   dma_fence_put(fence);
> continue;
> +   }
>
> r = amdgpu_sync_fence(&p->gang_leader->explicit_sync,
> fence);
> +   dma_fence_put(fence);
> if (r)
>     return r;
> }
> --
> 2.39.1
>

As the bug reporter, I can confirm that this patch fixes the memory leak.
Tested-by: Mikhail Gavrilov 
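
For readers who do not have the surrounding function in front of them, the idea of the patch can be modelled in a few lines. This is a toy sketch, not the kernel code: it assumes (as the patch implies) that each fence handed to the loop carries a reference the loop must drop, and that the sync object takes its own reference when it keeps a fence. The class and function names are invented for the illustration.

```python
class Fence:
    """Toy stand-in for a refcounted dma_fence."""

    def __init__(self, sched):
        self.sched = sched
        self.refcount = 1  # reference held by whoever created the fence

    def get(self):
        self.refcount += 1
        return self

    def put(self):
        assert self.refcount > 0, "refcount underflow"
        self.refcount -= 1


def sync_rings(fences, sched, fixed):
    """Model of the loop: 'fixed' toggles the two dma_fence_put() calls."""
    kept = []
    for fence in fences:
        fence.get()  # assumption: the iterator hands the loop its own reference
        if fence.sched != sched:
            if fixed:
                fence.put()  # first hunk: drop the reference before 'continue'
            continue
        kept.append(fence.get())  # the sync object keeps its own reference
        if fixed:
            fence.put()  # second hunk: drop the loop's reference afterwards
    return kept


def refcounts_after(fixed):
    fences = [Fence("gfx"), Fence("sdma"), Fence("gfx")]
    sync_rings(fences, "gfx", fixed)
    return [f.refcount for f in fences]


print("without the patch:", refcounts_after(fixed=False))
print("with the patch:   ", refcounts_after(fixed=True))
```

Without the two puts every fence ends the loop with one reference more than anyone will ever release, so it can never be freed; with them, the only references left are the creator's and the sync object's.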

-- 
Best Regards,
Mike Gavrilov.

Re: [PATCH] drm/amdgpu: grab extra fence reference for drm_sched_job_add_dependency

2023-01-06 Thread Mikhail Gavrilov

On Thu, Jan 5, 2023 at 3:03 PM Christian König
 wrote:
>
> That one should be fixed by:
>
> commit 9f1ecfc5dcb47a7ca37be47b0eaca0f37f1ae93d
> Author: Dmitry Osipenko 
> Date:   Wed Nov 23 03:13:03 2022 +0300
>

Christian,
This patch was written on Nov. 23, 2022, but it has still not been merged into 6.2!
Why?
It would close my questions about amdgpu right now.

Tested-by: Mikhail Gavrilov 

-- 
Best Regards,
Mike Gavrilov.

[6.2][regression] looks like commit aab9cf7b6954136f4339136a1a7fc0602a2c4d8b leads to use-after-free and random computer hangs

2022-12-18 Thread Mikhail Gavrilov

Hi,
The kernel 6.2 development cycle has begun.
After the kernel was updated on my Fedora Rawhide, I started
receiving use-after-free errors with complete computer hangs.
A good reproducer of this behaviour is launching the game
"Marvel's Avengers".

The backtrace of the issue looks like:
[  550.435083] [ cut here ]
[  550.435110] refcount_t: underflow; use-after-free.
[  550.435808] WARNING: CPU: 9 PID: 738 at lib/refcount.c:25
refcount_warn_saturate+0x97/0x110
[  550.435812] Modules linked in: uinput rfcomm snd_seq_dummy
snd_hrtimer netconsole nft_objref nf_conntrack_netbios_ns
nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack
[  550.435887] refcount_t: saturated; leaking memory.
[  550.435893]  nf_defrag_ipv6 nf_defrag_ipv4
[  550.435902] WARNING: CPU: 26 PID: 5032 at lib/refcount.c:19
refcount_warn_saturate+0x74/0x110
[  550.435907]  ip_set
[  550.435909] Modules linked in:
[  550.435910]  nf_tables
[  550.435912]  uinput rfcomm
[  550.435918]  nfnetlink
[  550.435919]  snd_seq_dummy snd_hrtimer
[  550.435925]  qrtr
[  550.435926]  netconsole nft_objref
[  550.435931]  bnep
[  550.435933]  nf_conntrack_netbios_ns nf_conntrack_broadcast
[  550.435938]  sunrpc
[  550.435939]  nft_fib_inet
[  550.435941]  binfmt_misc
[  550.435942]  nft_fib_ipv4
[  550.435943]  iwlmvm
[  550.435130] WARNING: CPU: 25 PID: 740 at lib/refcount.c:28
refcount_warn_saturate+0xba/0x110
[  550.435945]  nft_fib_ipv6
[  550.435946]  btusb
[  550.435947]  nft_fib nft_reject_inet
[  550.435954]  btrtl
[  550.435955]  nf_reject_ipv4 nf_reject_ipv6
[  550.435963]  btbcm
[  550.435964]  nft_reject nft_ct
[  550.435969]  btintel
[  550.435971]  nft_chain_nat nf_nat
[  550.435977]  btmtk
[  550.435979]  nf_conntrack nf_defrag_ipv6
[  550.435984]  snd_seq_midi
[  550.435985]  nf_defrag_ipv4 ip_set
[  550.435991]  snd_seq_midi_event
[  550.435992]  nf_tables
[  550.435993]  bluetooth
[  550.435995]  nfnetlink
[  550.435996]  hid_logitech_hidpp
[  550.435142] Modules linked in: uinput rfcomm snd_seq_dummy
snd_hrtimer netconsole nft_objref nf_conntrack_netbios_ns
nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
nf_tables nfnetlink qrtr bnep sunrpc binfmt_misc iwlmvm btusb btrtl
btbcm btintel btmtk snd_seq_midi snd_seq_midi_event bluetooth
hid_logitech_hidpp snd_usb_audio iwlwifi xpad ff_memless
snd_usbmidi_lib snd_rawmidi mc ecdh_generic intel_rapl_msr
intel_rapl_common mt76x2u mt76x2_common joydev snd_hda_codec_realtek
mt76x02_usb edac_mce_amd snd_hda_codec_generic mt76_usb
snd_hda_codec_hdmi mt76x02_lib kvm_amd snd_hda_intel snd_intel_dspcfg
snd_intel_sdw_acpi snd_hda_codec mt76 vfat kvm snd_hda_core fat
snd_seq snd_hwdep irqbypass snd_seq_device mac80211 snd_pcm eeepc_wmi
asus_wmi ledtrig_audio sparse_keymap rapl platform_profile wmi_bmof
snd_timer snd pcspkr i2c_piix4
[  550.435997]  qrtr bnep
[  550.436003]  snd_usb_audio
[  550.436004]  sunrpc binfmt_misc
[  550.436010]  iwlwifi
[  550.436012]  iwlmvm btusb
[  550.436018]  xpad
[  550.436019]  btrtl btbcm
[  550.436025]  ff_memless
[  550.436026]  btintel
[  550.436027]  snd_usbmidi_lib
[  550.436029]  btmtk
[  550.436030]  snd_rawmidi
[  550.436031]  snd_seq_midi snd_seq_midi_event
[  550.436037]  mc
[  550.436038]  bluetooth
[  550.436039]  ecdh_generic
[  550.436041]  hid_logitech_hidpp snd_usb_audio
[  550.436046]  intel_rapl_msr
[  550.436048]  iwlwifi xpad
[  550.436054]  intel_rapl_common
[  550.436055]  ff_memless
[  550.436056]  mt76x2u
[  550.436058]  snd_usbmidi_lib snd_rawmidi
[  550.436063]  mt76x2_common
[  550.436064]  mc ecdh_generic
[  550.436070]  joydev
[  550.436071]  intel_rapl_msr intel_rapl_common
[  550.436076]  snd_hda_codec_realtek
[  550.436078]  mt76x2u
[  550.436079]  mt76x02_usb
[  550.436080]  mt76x2_common joydev
[  550.436086]  edac_mce_amd
[  550.436088]  snd_hda_codec_realtek mt76x02_usb
[  550.436094]  snd_hda_codec_generic
[  550.436095]  edac_mce_amd
[  550.436096]  mt76_usb
[  550.436098]  snd_hda_codec_generic mt76_usb
[  550.436104]  snd_hda_codec_hdmi
[  550.436106]  snd_hda_codec_hdmi
[  550.436107]  mt76x02_lib
[  550.435234]  k10temp soundcore libarc4 acpi_cpufreq cfg80211
hid_logitech_dj rfkill zram amdgpu drm_ttm_helper ttm video iommu_v2
gpu_sched drm_buddy crct10dif_pclmul crc32_pclmul crc32c_intel igb
ucsi_ccg drm_display_helper nvme typec_ucsi ghash_clmulni_intel ccp
typec cec sp5100_tco dca sha512_ssse3 nvme_core wmi ip6_tables
ip_tables fuse
[  550.436108]  mt76x02_lib kvm_amd
[  550.436115]  kvm_amd
[  550.436116]  snd_hda_intel snd_intel_dspcfg
[  550.436122]  snd_hda_intel
[  550.436123]  snd_intel_sdw_acpi
[  550.435284] CPU: 25 PID: 740 Comm: sdma2 Tainted: GWL

Re: Screen corruption using radeon kernel driver

2022-12-10 Thread Mikhail Krylov

On Wed, Nov 30, 2022 at 11:07:32AM -0500, Alex Deucher wrote:
> On Wed, Nov 30, 2022 at 10:42 AM Robin Murphy  wrote:
> >
> > On 2022-11-30 14:28, Alex Deucher wrote:
> > > On Wed, Nov 30, 2022 at 7:54 AM Robin Murphy  wrote:
> > >>
> > >> On 2022-11-29 17:11, Mikhail Krylov wrote:
> > >>> On Tue, Nov 29, 2022 at 11:05:28AM -0500, Alex Deucher wrote:
> > >>>> On Tue, Nov 29, 2022 at 10:59 AM Mikhail Krylov  
> > >>>> wrote:
> > >>>>>
> > >>>>> On Tue, Nov 29, 2022 at 09:44:19AM -0500, Alex Deucher wrote:
> > >>>>>> On Mon, Nov 28, 2022 at 3:48 PM Mikhail Krylov  
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>> On Mon, Nov 28, 2022 at 09:50:50AM -0500, Alex Deucher wrote:
> > >>>>>>>
> > >>>>>>>>>> [excessive quoting removed]
> > >>>>>>>
> > >>>>>>>>> So, is there any progress on this issue? I do understand it's not 
> > >>>>>>>>> a high
> > >>>>>>>>> priority one, and today I've checked it on 6.0 kernel, and
> > >>>>>>>>> unfortunately, it still persists...
> > >>>>>>>>>
> > >>>>>>>>> I'm considering writing a patch that will allow user to override
> > >>>>>>>>> need_dma32/dma_bits setting with a module parameter. I'll have 
> > >>>>>>>>> some time
> > >>>>>>>>> after the New Year for that.
> > >>>>>>>>>
> > >>>>>>>>> Is it at all possible that such a patch will be merged into 
> > >>>>>>>>> kernel?
> > >>>>>>>>>
> > >>>>>>>> On Mon, Nov 28, 2022 at 9:31 AM Mikhail Krylov  
> > >>>>>>>> wrote:
> > >>>>>>>> Unless someone familiar with HIMEM can figure out what is going 
> > >>>>>>>> wrong
> > >>>>>>>> we should just revert the patch.
> > >>>>>>>>
> > >>>>>>>> Alex
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Okay, I was suggesting that mostly because
> > >>>>>>>
> > >>>>>>> a) it works for me with dma_bits = 40 (I understand that's what it 
> > >>>>>>> is
> > >>>>>>> without the original patch applied);
> > >>>>>>>
> > >>>>>>> b) there's a hint of uncertainty on this line
> > >>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/radeon/radeon_device.c#n1359
> > >>>>>>> saying that for AGP dma_bits = 32 is the safest option, so 
> > >>>>>>> apparently there are
> > >>>>>>> setups, unlike mine, where dma_bits = 32 is better than 40.
> > >>>>>>>
> > >>>>>>> But I'm in no position to argue, just wanted to make myself clear.
> > >>>>>>> I'm okay with rebuilding the kernel for my machine until the 
> > >>>>>>> original
> > >>>>>>> patch is reverted or any other fix is applied.
> > >>>>>>
> > >>>>>> What GPU do you have and is it AGP?  If it is AGP, does setting
> > >>>>>> radeon.agpmode=-1 also fix it?
> > >>>>>>
> > >>>>>> Alex
> > >>>>>
> > >>>>> That is ATI Radeon X1950, and, unfortunately, radeon.agpmode=-1 
> > >>>>> doesn't
> > >>>>> help, it just makes 3D acceleration in games such as OpenArena stop
> > >>>>> working.
> > >>>>
> > >>>> Just to confirm, is the board AGP or PCIe?
> > >>>>
> > >>>> Alex
> > >>>
> > >>> It is AGP. That's an old machine.
> > >>
> > >> Can you check whether dma_addressing_limited() is actually returning the
> > >> expected result at the point of radeon_ttm_init()? Disabling highmem is
> > >> presumably just hiding whatever problem exists, by throwing away all
> > >>   >32-bit RAM such that use_dma32 doesn't matter.
> > >
> > > The device in question only supports a 32 bit DMA mask so
> > > dma_addressing_limited() should return true.  Bounce buffers are not
> > > really usable on GPUs because they map so much memory.  If
> > > dma_addressing_limited() returns false, that would explain it.
> >
> > Right, it appears to be the only part of the offending commit that
> > *could* reasonably make any difference, so I'm primarily wondering if
> > dma_get_required_mask() somehow gets confused.
> 
> Mikhail,
> 
> Can you see what dma_addressing_limited() and dma_get_required_mask()
> return in this case?
> 
> Alex
> 
> 
> >
> > Thanks,
> > Robin.

Hello again, I was able to confirm by adding printk() to the functions
and recompiling the kernel that dma_addressing_limited() returns
*false* on the kernel with the bug. 

And dma_get_required_mask() returns 0x7fff, as I said before.
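
To see why these two results fit together, here is a small model of the check. It is a sketch based on my reading of dma_addressing_limited() in include/linux/dma-mapping.h of that era (the smaller non-zero value of the device mask and the bus limit, compared against the required mask), so please treat the exact form as an assumption. The 36-bit case is a made-up contrast, not a measured value, and the required mask is used as it is printed above.

```python
def dma_bit_mask(bits):
    """Equivalent of DMA_BIT_MASK(bits) for bits < 64."""
    return (1 << bits) - 1


def min_not_zero(a, b):
    """Smaller of the two values, ignoring a value of zero (meaning 'unset')."""
    if a == 0:
        return b
    if b == 0:
        return a
    return min(a, b)


def dma_addressing_limited(dma_mask, bus_dma_limit, required_mask):
    """True when the device cannot address everything the system requires."""
    return min_not_zero(dma_mask, bus_dma_limit) < required_mask


device_mask = dma_bit_mask(32)  # the card supports only a 32-bit DMA mask
reported_required = 0x7fff      # value as printed in the message above

# Reported case: the required mask does not exceed the device mask.
print(dma_addressing_limited(device_mask, 0, reported_required))

# Hypothetical contrast: a required mask covering more than 4 GiB.
print(dma_addressing_limited(device_mask, 0, dma_bit_mask(36)))
```

In this model any required mask that does not exceed the 32-bit device mask makes the function return false, which is consistent with the printk() result, so the question becomes why dma_get_required_mask() reports such a small value on this machine.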


signature.asc
Description: PGP signature

Re: [bug][vaapi][h264] The commit 7cbe08a930a132d84b4cf79953b00b074ec7a2a7 on certain video files leads to problems with VAAPI hardware decoding.

2022-12-07 Thread Mikhail Gavrilov

On Wed, Dec 7, 2022 at 7:58 PM Alex Deucher  wrote:
>
>
> What GPU do you have and what entries do you have in
> sys/class/drm/card0/device/ip_discovery/die/0/UVD for the device?

I bisected the issue on the Radeon 6800M.

The parent commit of 7cbe08a930a132d84b4cf79953b00b074ec7a2a7 is
46dd2965bdd1c5a4f6499c73ff32e636fa8f9769.
For both commits ip_discovery is absent.
# ls /sys/class/drm/card0/device/ | grep ip
# ls /sys/class/drm/card1/device/ | grep ip

But from the verbose output I see that the player on
7cbe08a930a132d84b4cf79953b00b074ec7a2a7 uses hardware acceleration:
$ vlc -v Downloads/test_sample_480_2.mp4
VLC media player 3.0.18 Vetinari (revision )
[561f72097520] main libvlc: Running vlc with the default
interface. Use 'cvlc' to use vlc without interface.
[7fa224001190] mp4 demux warning: elst box found
[7fa224001190] mp4 demux warning: STTS table of 1 entries
[7fa224001190] mp4 demux warning: CTTS table of 78 entries
[7fa224001190] mp4 demux warning: elst box found
[7fa224001190] mp4 demux warning: STTS table of 1 entries
[7fa224001190] mp4 demux warning: elst old=0 new=1
[7fa224d19010] faad decoder warning: decoded zero sample
[7fa224001190] mp4 demux warning: elst old=0 new=1
[7fa214007030] gl gl: Initialized libplacebo v4.208.0 (API v208)
libva info: VA-API version 1.16.0
libva error: vaGetDriverNameByIndex() failed with unknown libva error,
driver_name = (null)
[7fa214007030] glconv_vaapi_x11 gl error: vaInitialize: unknown libva error
libva info: VA-API version 1.16.0
libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
libva info: Found init function __vaDriverInit_1_16
libva info: va_openDriver() returns 0
[7fa224c0b3a0] avcodec decoder: Using Mesa Gallium driver
23.0.0-devel for AMD Radeon RX 6800M (navi22, LLVM 15.0.4, DRM 3.42,
5.14.0-rc4-14-7cbe08a930a132d84b4cf79953b00b074ec7a2a7+) for hardware
decoding
[h264 @ 0x7fa224c3fa40] Using deprecated struct vaapi_context in decode.
[561f72174de0] pulse audio output warning: starting late (-9724 us)

And on commit 46dd2965bdd1c5a4f6499c73ff32e636fa8f9769 it does not use
acceleration:
$ vlc -v Downloads/test_sample_480_2.mp4
VLC media player 3.0.18 Vetinari (revision )
[55f61ad35520] main libvlc: Running vlc with the default
interface. Use 'cvlc' to use vlc without interface.
[7fc7e8001190] mp4 demux warning: elst box found
[7fc7e8001190] mp4 demux warning: STTS table of 1 entries
[7fc7e8001190] mp4 demux warning: CTTS table of 78 entries
[7fc7e8001190] mp4 demux warning: elst box found
[7fc7e8001190] mp4 demux warning: STTS table of 1 entries
[7fc7e8001190] mp4 demux warning: elst old=0 new=1
[7fc7e8d19010] faad decoder warning: decoded zero sample
[7fc7e8001190] mp4 demux warning: elst old=0 new=1
[7fc7d8007030] gl gl: Initialized libplacebo v4.208.0 (API v208)
libva info: VA-API version 1.16.0
libva error: vaGetDriverNameByIndex() failed with unknown libva error,
driver_name = (null)
[7fc7d8007030] glconv_vaapi_x11 gl error: vaInitialize: unknown libva error
libva info: VA-API version 1.16.0
libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
libva info: Found init function __vaDriverInit_1_16
libva info: va_openDriver() returns 0
[7fc7d40b3260] vaapi generic error: profile(7) is not supported
[7fc7d8a089c0] gl gl: Initialized libplacebo v4.208.0 (API v208)
Failed to open VDPAU backend libvdpau_nvidia.so: cannot open shared
object file: No such file or directory
Failed to open VDPAU backend libvdpau_nvidia.so: cannot open shared
object file: No such file or directory
[7fc7d89e4f80] gl gl: Initialized libplacebo v4.208.0 (API v208)
[55f61ae12de0] pulse audio output warning: starting late (-13537 us)

So my bisect didn't make sense :(
Anyway, can you reproduce the issue with the attached sample file and
vlc on a fresh kernel (6.1-rc8)?
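
The two logs above differ in one telling line each: the working kernel
prints "avcodec decoder: Using ... for hardware decoding", while the
other prints "vaapi generic error: profile(7) is not supported". A
minimal Python sketch for checking many `vlc -v` logs during a bisect
could look like the following. Treating these two strings as reliable
markers is an assumption drawn only from the logs in this message, and
the function name is hypothetical.

```python
import re

# Markers copied from the two logs in this message. Treating them as
# reliable indicators is an assumption of this sketch, not documented
# VLC behaviour.
HW_OK = re.compile(r"avcodec decoder: Using .* for hardware\s+decoding")
HW_FAIL = re.compile(r"vaapi generic error: profile\(\d+\) is not supported")


def classify_vlc_log(text: str) -> str:
    """Return 'hardware', 'software' or 'unknown' for a `vlc -v` log."""
    # Mail clients and terminals wrap long lines, so collapse all
    # whitespace (including newlines) into single spaces first.
    flat = " ".join(text.split())
    if HW_FAIL.search(flat):
        return "software"
    if HW_OK.search(flat):
        return "hardware"
    return "unknown"
```

Saving the output of each test run to a file and passing its contents to
classify_vlc_log() gives a mechanical good/bad answer per kernel build,
which may help avoid marking a commit wrongly by eye.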

Thanks!

-- 
Best Regards,
Mike Gavrilov.

Re: Screen corruption using radeon kernel driver

2022-12-01 Thread Mikhail Krylov

On Thu, Dec 01, 2022 at 02:00:58PM +, Robin Murphy wrote:
> On 2022-11-30 19:59, Mikhail Krylov wrote:
> > On Wed, Nov 30, 2022 at 11:07:32AM -0500, Alex Deucher wrote:
> > > On Wed, Nov 30, 2022 at 10:42 AM Robin Murphy  
> > > wrote:
> > > > 
> > > > On 2022-11-30 14:28, Alex Deucher wrote:
> > > > > On Wed, Nov 30, 2022 at 7:54 AM Robin Murphy  
> > > > > wrote:
> > > > > > 
> > > > > > On 2022-11-29 17:11, Mikhail Krylov wrote:
> > > > > > > On Tue, Nov 29, 2022 at 11:05:28AM -0500, Alex Deucher wrote:
> > > > > > > > On Tue, Nov 29, 2022 at 10:59 AM Mikhail Krylov 
> > > > > > > >  wrote:
> > > > > > > > > 
> > > > > > > > > On Tue, Nov 29, 2022 at 09:44:19AM -0500, Alex Deucher wrote:
> > > > > > > > > > On Mon, Nov 28, 2022 at 3:48 PM Mikhail Krylov 
> > > > > > > > > >  wrote:
> > > > > > > > > > > 
> > > > > > > > > > > On Mon, Nov 28, 2022 at 09:50:50AM -0500, Alex Deucher 
> > > > > > > > > > > wrote:
> > > > > > > > > > > 
> > > > > > > > > > > > > > [excessive quoting removed]
> > > > > > > > > > > 
> > > > > > > > > > > > > So, is there any progress on this issue? I do 
> > > > > > > > > > > > > understand it's not a high
> > > > > > > > > > > > > priority one, and today I've checked it on 6.0 
> > > > > > > > > > > > > kernel, and
> > > > > > > > > > > > > unfortunately, it still persists...
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I'm considering writing a patch that will allow user 
> > > > > > > > > > > > > to override
> > > > > > > > > > > > > need_dma32/dma_bits setting with a module parameter. 
> > > > > > > > > > > > > I'll have some time
> > > > > > > > > > > > > after the New Year for that.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Is it at all possible that such a patch will be 
> > > > > > > > > > > > > merged into kernel?
> > > > > > > > > > > > > 
> > > > > > > > > > > > On Mon, Nov 28, 2022 at 9:31 AM Mikhail Krylov 
> > > > > > > > > > > >  wrote:
> > > > > > > > > > > > Unless someone familiar with HIMEM can figure out what 
> > > > > > > > > > > > is going wrong
> > > > > > > > > > > > we should just revert the patch.
> > > > > > > > > > > > 
> > > > > > > > > > > > Alex
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Okay, I was suggesting that mostly because
> > > > > > > > > > > 
> > > > > > > > > > > a) it works for me with dma_bits = 40 (I understand 
> > > > > > > > > > > that's what it is
> > > > > > > > > > > without the original patch applied);
> > > > > > > > > > > 
> > > > > > > > > > > > b) there's a hint of uncertainty on this line
> > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/radeon/radeon_device.c#n1359
> > > > > > > > > > > saying that for AGP dma_bits = 32 is the safest option, 
> > > > > > > > > > > so apparently there are
> > > > > > > > > > > setups, unlike mine, where dma_bits = 32 is better than 
> > > > > > > > > > > 40.
> > > > > > > > > > > 
> > > > > > > > > > > But I'm in no position to argue, just wanted to make 
> > > > > > > > > > > myself clear.
> >

Re: Screen corruption using radeon kernel driver

2022-11-30 Thread Mikhail Krylov

On Wed, Nov 30, 2022 at 11:07:32AM -0500, Alex Deucher wrote:
> On Wed, Nov 30, 2022 at 10:42 AM Robin Murphy  wrote:
> >
> > On 2022-11-30 14:28, Alex Deucher wrote:
> > > On Wed, Nov 30, 2022 at 7:54 AM Robin Murphy  wrote:
> > >>
> > >> On 2022-11-29 17:11, Mikhail Krylov wrote:
> > >>> On Tue, Nov 29, 2022 at 11:05:28AM -0500, Alex Deucher wrote:
> > >>>> On Tue, Nov 29, 2022 at 10:59 AM Mikhail Krylov  
> > >>>> wrote:
> > >>>>>
> > >>>>> On Tue, Nov 29, 2022 at 09:44:19AM -0500, Alex Deucher wrote:
> > >>>>>> On Mon, Nov 28, 2022 at 3:48 PM Mikhail Krylov  
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>> On Mon, Nov 28, 2022 at 09:50:50AM -0500, Alex Deucher wrote:
> > >>>>>>>
> > >>>>>>>>>> [excessive quoting removed]
> > >>>>>>>
> > >>>>>>>>> So, is there any progress on this issue? I do understand it's not 
> > >>>>>>>>> a high
> > >>>>>>>>> priority one, and today I've checked it on 6.0 kernel, and
> > >>>>>>>>> unfortunately, it still persists...
> > >>>>>>>>>
> > >>>>>>>>> I'm considering writing a patch that will allow user to override
> > >>>>>>>>> need_dma32/dma_bits setting with a module parameter. I'll have 
> > >>>>>>>>> some time
> > >>>>>>>>> after the New Year for that.
> > >>>>>>>>>
> > >>>>>>>>> Is it at all possible that such a patch will be merged into 
> > >>>>>>>>> kernel?
> > >>>>>>>>>
> > >>>>>>>> On Mon, Nov 28, 2022 at 9:31 AM Mikhail Krylov  
> > >>>>>>>> wrote:
> > >>>>>>>> Unless someone familiar with HIMEM can figure out what is going 
> > >>>>>>>> wrong
> > >>>>>>>> we should just revert the patch.
> > >>>>>>>>
> > >>>>>>>> Alex
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Okay, I was suggesting that mostly because
> > >>>>>>>
> > >>>>>>> a) it works for me with dma_bits = 40 (I understand that's what it 
> > >>>>>>> is
> > >>>>>>> without the original patch applied);
> > >>>>>>>
> > >>>>>>> b) there's a hint of uncertainty on this line
> > >>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/radeon/radeon_device.c#n1359
> > >>>>>>> saying that for AGP dma_bits = 32 is the safest option, so 
> > >>>>>>> apparently there are
> > >>>>>>> setups, unlike mine, where dma_bits = 32 is better than 40.
> > >>>>>>>
> > >>>>>>> But I'm in no position to argue, just wanted to make myself clear.
> > >>>>>>> I'm okay with rebuilding the kernel for my machine until the 
> > >>>>>>> original
> > >>>>>>> patch is reverted or any other fix is applied.
> > >>>>>>
> > >>>>>> What GPU do you have and is it AGP?  If it is AGP, does setting
> > >>>>>> radeon.agpmode=-1 also fix it?
> > >>>>>>
> > >>>>>> Alex
> > >>>>>
> > >>>>> That is ATI Radeon X1950, and, unfortunately, radeon.agpmode=-1 
> > >>>>> doesn't
> > >>>>> help, it just makes 3D acceleration in games such as OpenArena stop
> > >>>>> working.
> > >>>>
> > >>>> Just to confirm, is the board AGP or PCIe?
> > >>>>
> > >>>> Alex
> > >>>
> > >>> It is AGP. That's an old machine.
> > >>
> > >> Can you check whether dma_addressing_limited() is actually returning the
> > >> expected result at the point of radeon_ttm_init()? Disabling highmem is
> > >> presumably just hiding whatever problem exists, by throwing away all
> > >>   >32-bit RAM such that use_dma32 doesn't matter.
> > >
> > > The device in question only supports a 32 bit DMA mask so
> > > dma_addressing_limited() should return true.  Bounce buffers are not
> > > really usable on GPUs because they map so much memory.  If
> > > dma_addressing_limited() returns false, that would explain it.
> >
> > Right, it appears to be the only part of the offending commit that
> > *could* reasonably make any difference, so I'm primarily wondering if
> > dma_get_required_mask() somehow gets confused.
> 
> Mikhail,
> 
> Can you see what dma_addressing_limited() and dma_get_required_mask()
> return in this case?
> 
> Alex
> 
> 
> >
> > Thanks,
> > Robin.

Unfortunately, right now I don't have enough time for kernel
modifications and rebuilds (I will later!), so I did some quick-and-dirty
research with kprobe.

The problem is that dma_addressing_limited() seems to be inlined and
kprobe fails to intercept it.

But I managed to get the result of dma_get_required_mask(). It returns
0x7fff (!) on the vanilla (with the patch, buggy) kernel:
 
$ sudo kprobe-perf 'r:dma_get_required_mask $retval'
Tracing kprobe dma_get_required_mask. Ctrl-C to end.
modprobe-1244[000] d...   105.582816: dma_get_required_mask: 
(radeon_ttm_init+0x61/0x240 [radeon] <- dma_get_required_mask) arg1=0x7fff

This function does not even get called in the kernel without the patch
that I built myself. I believe that's because ttm_bo_device_init()
doesn't call it without the patch.

Hope that helps at least a bit. If not, I'll be able to do more thorough
research in a couple of weeks, probably.
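
For readers following the thread, the comparison being discussed can be
modelled outside the kernel. To the best of my knowledge,
dma_addressing_limited() in kernels of that period compares the smaller
of the device DMA mask and the bus DMA limit against
dma_get_required_mask(). The Python sketch below is only a model of that
comparison, not kernel code. The mask values are illustrative
assumptions, since the value printed in the trace above appears
shortened in this archive.

```python
def min_not_zero(a: int, b: int) -> int:
    """Model of the kernel's min_not_zero(): a zero operand is ignored."""
    if a == 0:
        return b
    if b == 0:
        return a
    return min(a, b)


def dma_addressing_limited(dma_mask: int, bus_dma_limit: int,
                           required_mask: int) -> bool:
    """True when the device cannot address all memory the system needs."""
    return min_not_zero(dma_mask, bus_dma_limit) < required_mask


DMA_BIT_MASK_32 = (1 << 32) - 1  # a device limited to 32-bit DMA

# Assumed example values, not taken from the trace:
# a required mask below 32 bits makes the device look "not limited".
small_required = (1 << 31) - 1
# a required mask above 32 bits makes the same device "limited".
large_required = (1 << 36) - 1

print(dma_addressing_limited(DMA_BIT_MASK_32, 0, small_required))
print(dma_addressing_limited(DMA_BIT_MASK_32, 0, large_required))
```

If the required mask reported on the affected machine really is smaller
than the 32-bit device mask, the comparison would come out false, which
matches the case Alex describes above as one that "would explain it".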


signature.asc
Description: PGP signature

Re: Screen corruption using radeon kernel driver

2022-11-29 Thread Mikhail Krylov

On Tue, Nov 29, 2022 at 11:05:28AM -0500, Alex Deucher wrote:
> On Tue, Nov 29, 2022 at 10:59 AM Mikhail Krylov  wrote:
> >
> > On Tue, Nov 29, 2022 at 09:44:19AM -0500, Alex Deucher wrote:
> > > On Mon, Nov 28, 2022 at 3:48 PM Mikhail Krylov  wrote:
> > > >
> > > > On Mon, Nov 28, 2022 at 09:50:50AM -0500, Alex Deucher wrote:
> > > >
> > > > >>> [excessive quoting removed]
> > > >
> > > > >> So, is there any progress on this issue? I do understand it's not a 
> > > > >> high
> > > > >> priority one, and today I've checked it on 6.0 kernel, and
> > > > >> unfortunately, it still persists...
> > > > >>
> > > > >> I'm considering writing a patch that will allow user to override
> > > > >> need_dma32/dma_bits setting with a module parameter. I'll have some 
> > > > >> time
> > > > >> after the New Year for that.
> > > > >>
> > > > >> Is it at all possible that such a patch will be merged into kernel?
> > > > >>
> > > > > On Mon, Nov 28, 2022 at 9:31 AM Mikhail Krylov  
> > > > > wrote:
> > > > > Unless someone familiar with HIMEM can figure out what is going wrong
> > > > > we should just revert the patch.
> > > > >
> > > > > Alex
> > > >
> > > >
> > > > Okay, I was suggesting that mostly because
> > > >
> > > > a) it works for me with dma_bits = 40 (I understand that's what it is
> > > > without the original patch applied);
> > > >
> > > > b) there's a hint of uncertainty on this line
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/radeon/radeon_device.c#n1359
> > > > saying that for AGP dma_bits = 32 is the safest option, so apparently 
> > > > there are
> > > > setups, unlike mine, where dma_bits = 32 is better than 40.
> > > >
> > > > But I'm in no position to argue, just wanted to make myself clear.
> > > > I'm okay with rebuilding the kernel for my machine until the original
> > > > patch is reverted or any other fix is applied.
> > >
> > > What GPU do you have and is it AGP?  If it is AGP, does setting
> > > radeon.agpmode=-1 also fix it?
> > >
> > > Alex
> >
> > That is ATI Radeon X1950, and, unfortunately, radeon.agpmode=-1 doesn't
> > help, it just makes 3D acceleration in games such as OpenArena stop
> > working.
> 
> Just to confirm, is the board AGP or PCIe?
> 
> Alex

It is AGP. That's an old machine.


signature.asc
Description: PGP signature

Re: Screen corruption using radeon kernel driver

2022-11-29 Thread Mikhail Krylov

On Tue, Nov 29, 2022 at 09:44:19AM -0500, Alex Deucher wrote:
> On Mon, Nov 28, 2022 at 3:48 PM Mikhail Krylov  wrote:
> >
> > On Mon, Nov 28, 2022 at 09:50:50AM -0500, Alex Deucher wrote:
> >
> > >>> [excessive quoting removed]
> >
> > >> So, is there any progress on this issue? I do understand it's not a high
> > >> priority one, and today I've checked it on 6.0 kernel, and
> > >> unfortunately, it still persists...
> > >>
> > >> I'm considering writing a patch that will allow user to override
> > >> need_dma32/dma_bits setting with a module parameter. I'll have some time
> > >> after the New Year for that.
> > >>
> > >> Is it at all possible that such a patch will be merged into kernel?
> > >>
> > > On Mon, Nov 28, 2022 at 9:31 AM Mikhail Krylov  wrote:
> > > Unless someone familiar with HIMEM can figure out what is going wrong
> > > we should just revert the patch.
> > >
> > > Alex
> >
> >
> > Okay, I was suggesting that mostly because
> >
> > a) it works for me with dma_bits = 40 (I understand that's what it is
> > without the original patch applied);
> >
> > b) there's a hint of uncertainty on this line
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/radeon/radeon_device.c#n1359
> > saying that for AGP dma_bits = 32 is the safest option, so apparently there 
> > are
> > setups, unlike mine, where dma_bits = 32 is better than 40.
> >
> > But I'm in no position to argue, just wanted to make myself clear.
> > I'm okay with rebuilding the kernel for my machine until the original
> > patch is reverted or any other fix is applied.
> 
> What GPU do you have and is it AGP?  If it is AGP, does setting
> radeon.agpmode=-1 also fix it?
> 
> Alex

That is ATI Radeon X1950, and, unfortunately, radeon.agpmode=-1 doesn't
help, it just makes 3D acceleration in games such as OpenArena stop
working.


signature.asc
Description: PGP signature

Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start

2022-11-28 Thread Mikhail Gavrilov

On Tue, Nov 22, 2022 at 12:16 PM Christian König
 wrote:
>
> Ah, thanks a lot for this. I've already pushed the patches into our
> internal branch, but getting this confirmation is still great!
>
> This was quite some fundamental bug in the handling and I hope to get
> this completely reworked at some point since it is currently only mitigated.

Looks like the final version of this patch was successfully merged in 6.1-rc7.
Big thanks, all games work again!

> No idea what that could be. Modesetting is not something I work on.
>
> The best advice I can give you is to maybe ping Harry and our other
> display people, they should know that stuff better than I do.

Unfortunately Harry didn't answer. I hope my email wasn't marked as spam.

-- 
Best Regards,
Mike Gavrilov.

Re: Screen corruption using radeon kernel driver

2022-11-28 Thread Mikhail Krylov

On Mon, Nov 28, 2022 at 09:50:50AM -0500, Alex Deucher wrote:

>>> [excessive quoting removed]

>> So, is there any progress on this issue? I do understand it's not a high
>> priority one, and today I've checked it on 6.0 kernel, and
>> unfortunately, it still persists...
>>
>> I'm considering writing a patch that will allow user to override
>> need_dma32/dma_bits setting with a module parameter. I'll have some time
>> after the New Year for that.
>>
>> Is it at all possible that such a patch will be merged into kernel?
>>
> On Mon, Nov 28, 2022 at 9:31 AM Mikhail Krylov  wrote:
> Unless someone familiar with HIMEM can figure out what is going wrong
> we should just revert the patch.
> 
> Alex

Okay, I was suggesting that mostly because 

a) it works for me with dma_bits = 40 (I understand that's what it is
without the original patch applied);

b) there's a hint of uncertainty on this line
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/radeon/radeon_device.c#n1359
saying that for AGP dma_bits = 32 is the safest option, so apparently there are
setups, unlike mine, where dma_bits = 32 is better than 40.

But I'm in no position to argue, just wanted to make myself clear.
I'm okay with rebuilding the kernel for my machine until the original
patch is reverted or any other fix is applied.

signature.asc
Description: PGP signature

Re: Screen corruption using radeon kernel driver

2022-11-28 Thread Mikhail Krylov

On Mon, Apr 25, 2022 at 01:22:04PM -0400, Alex Deucher wrote:
> + dri-devel
> 
> On Mon, Apr 25, 2022 at 3:33 AM Krylov Michael  wrote:
> >
> > Hello!
> >
> > After updating my Linux kernel from version 4.19 (Debian 10 version) to
> > 5.10 (packaged with Debian 11), I've noticed that the image
> > displayed on my older computer, 32-bit Pentium 4 using ATI Radeon X1950
> > AGP video card is severely corrupted in the graphical (Xorg and Wayland)
> > mode: all kinds of black and white stripes across the screen, some
> > letters missing, etc.
> >
> > I've checked several options (Xorg drivers, Wayland instead of
> > Xorg, radeon.agpmode=-1 in kernel command line and so on), but the
> > problem persisted. I've managed to find that the problem was in the
> > kernel, as everything worked well with 4.19 kernel with everything
> > else being from Debian 11.
> >
> > I have managed to find the culprit of that corruption, that is the
> > commit 33b3ad3788aba846fc8b9a065fe2685a0b64f713 on the linux kernel.
> > Reverting this commit and building the kernel with that commit reverted
> > fixes the problem. Disabling HIMEM also gets rid of that problem. But it
> > also leaves the system with less than 1G of RAM, which is, of course,
> > undesirable.
> >
> > Apparently this problem is somewhat known, as I can tell after googling
> > for the commit id, see this link for example:
> > https://lkml.org/lkml/2020/1/9/518
> >
> > Mageia distro, for example, reverted this commit in the kernel they are
> > building:
> >
> > http://sophie.zarb.org/distrib/Mageia/7/i586/by-pkgid/b9193a4f85192bc57f4d770fb9bb399c/files/32
> >
> > I've reported this bug to Debian bugtracker, checked the recent version
> > of the kernel (5.17), bug still persists. Here's a link to the Debian
> > bug page:
> >
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=993670
> >
> > I'm not sure if reverting this commit is the correct way to go, so if
> > you need to check any changes/patches that I could apply and test on
> > the real hardware, I'll be glad to do that (but please keep in mind
> > that testing could take some time, I don't have access to this computer
> > 24/7, but I'll do my best to respond ASAP).
> 
> I would be happy to revert that commit.  I attempted to revert it a
> year or so ago, but Christoph didn't want to.  He was going to look
> further into it.  I was not able to repro the issue.  It seemed to be
> related to highmem support.  You might try disabling that.  Here is
> the previous thread for reference:
> https://lists.freedesktop.org/archives/amd-gfx/2020-September/053922.html
> 
> Alex

So, is there any progress on this issue? I do understand it's not a high
priority one, and today I've checked it on 6.0 kernel, and
unfortunately, it still persists...

I'm considering writing a patch that will allow user to override
need_dma32/dma_bits setting with a module parameter. I'll have some time
after the New Year for that.

Is it at all possible that such a patch will be merged into kernel?



signature.asc
Description: PGP signature

Re: [regression][6.0] After commit b261509952bc19d1012cf732f853659be6ebc61e I see WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70

2022-11-22 Thread Mikhail Gavrilov

On Thu, Oct 13, 2022 at 6:36 PM Mikhail Gavrilov
 wrote:
>
> Hi!
> I bisected an issue of the 6.0 kernel which started happening after
> 6.0-rc7 on all my machines.
>
> The backtrace of this issue looks like this:
>
> [ 2807.339439] [ cut here ]
> [ 2807.339445] WARNING: CPU: 11 PID: 2061 at
> drivers/gpu/drm/drm_modeset_lock.c:276
> drm_modeset_drop_locks+0x63/0x70
>
> bisect points to this commit: b261509952bc19d1012cf732f853659be6ebc61e.
>
> After reverting this commit the WARNING messages described here disappeared.
>

Hi Harry, Christian says that you can help with it.

Thanks.

-- 
Best Regards,
Mike Gavrilov.

Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start

2022-11-21 Thread Mikhail Gavrilov

On Mon, Nov 14, 2022 at 6:22 PM Christian König
 wrote:
>
> I've found and fixed a few problems around the userptr handling which
> might explain what you see here.
>
> A series of four patches starting with "drm/amdgpu: always register an
> MMU notifier for userptr" is under review now.
>
> Going to give that a bit cleanup later today and will CC you when I send
> that out. Would be nice if you could give that some testing.
>
> Thanks,
> Christian.
>

Christian, I tested all four patches for about a week and can say that
this issue is completely gone.
All known broken games are working.
Tested-by: Mikhail Gavrilov 

The only thing I don't like is the flood in the kernel logs of the
message "WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276
drm_modeset_drop_locks+0x63/0x70", but this is not related to the
patches being checked.
All kernel logs uploaded to pastebin [1][2][3][4][5][6][7][8]
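
To put a number on that flood, the WARNING lines in a saved kernel log
can be grouped by source location. The Python sketch below is one way to
do it. The regular expression is written against the WARNING format
quoted in this thread, the function name is hypothetical, and it assumes
the log was saved as plain text (for example from dmesg).

```python
import re
from collections import Counter

# Matches e.g. "WARNING: CPU: 11 PID: 2061 at <file>:<line> <function>"
WARNING_RE = re.compile(r"WARNING: CPU: \d+ PID: \d+ at (\S+:\d+) (\S+)")
# Leading "[  123.456789] " timestamp printed by the kernel
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def count_warnings(log: str) -> Counter:
    """Count kernel WARNINGs per 'file:line function' location."""
    # Strip timestamps, then re-join lines that a mail client wrapped.
    lines = (TIMESTAMP_RE.sub("", line) for line in log.splitlines())
    flat = " ".join(" ".join(lines).split())
    return Counter(f"{loc} {func}" for loc, func in WARNING_RE.findall(flat))
```

Calling count_warnings(text).most_common() on each saved log would show
how many of the warnings come from drm_modeset_lock.c:276 and how many
come from somewhere else.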

I wrote a separate bug report about "drm_modeset_lock" [9], it's a
pity that no one paid attention to it. I even found the first bad
commit. It is b261509952bc19d1012cf732f853659be6ebc61e.

[1] https://pastebin.com/WZWczupk
[2] https://pastebin.com/f4i9pvjS
[3] https://pastebin.com/rsDWaMR1
[4] https://pastebin.com/tDNEYJq0
[5] https://pastebin.com/xfZVbm1f
[6] https://pastebin.com/Vx9gDyKt
[7] https://pastebin.com/XvRkLckV
[8] https://pastebin.com/pd8WBkgx
[9] https://www.spinics.net/lists/dri-devel/msg367543.html

Thanks.

-- 
Best Regards,
Mike Gavrilov.

Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start

2022-11-02 Thread Mikhail Gavrilov

On Tue, Nov 1, 2022 at 10:52 PM Christian König
 wrote:
>
> Let's focus on one problem at a time.
>
> The issue here is that somehow userptr handling became racy after we
> removed the lock, but I don't see why.
>
> We need to fix this ASAP since it is probably a much wider problem and
> the additional lock just hides it somehow.
>
> Going to provide you with an updated patch tomorrow.
>
> Thanks,
> Christian.

Recently sackboy has been updated and now the kernel log contains a
trace very similar to the one in the first post, even with the patch
applied.

[  155.948044] [ cut here ]
[  155.948164] WARNING: CPU: 3 PID: 4850 at
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:678
amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu]
[  155.948342] Modules linked in: uinput rfcomm snd_seq_dummy
snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
qrtr bnep intel_rapl_msr intel_rapl_common snd_hda_codec_realtek
snd_sof_amd_renoir snd_sof_amd_acp snd_hda_codec_generic
snd_hda_codec_hdmi snd_sof_pci sunrpc binfmt_misc snd_sof
snd_hda_intel snd_sof_utils snd_intel_dspcfg mt7921e
snd_intel_sdw_acpi snd_hda_codec mt7921_common snd_soc_core
edac_mce_amd mt76_connac_lib btusb snd_hda_core snd_compress snd_hwdep
mt76 btrtl ac97_bus kvm_amd snd_pcm_dmaengine btbcm snd_rpl_pci_acp6x
snd_pci_acp6x btintel mac80211 btmtk snd_seq snd_seq_device kvm
snd_pcm snd_pci_acp5x libarc4 bluetooth irqbypass vfat snd_timer
snd_rn_pci_acp3x fat rapl snd_acp_config asus_nb_wmi snd cfg80211
snd_soc_acpi wmi_bmof k10temp pcspkr
[  155.948436]  snd_pci_acp3x i2c_piix4 soundcore asus_wireless
amd_pmc joydev zram amdgpu drm_ttm_helper ttm crct10dif_pclmul
hid_asus crc32_pclmul asus_wmi crc32c_intel iommu_v2 ledtrig_audio
polyval_clmulni gpu_sched sparse_keymap polyval_generic
platform_profile drm_buddy drm_display_helper nvme rfkill
ghash_clmulni_intel hid_multitouch ucsi_acpi sha512_ssse3 nvme_core
typec_ucsi serio_raw sp5100_tco r8169 ccp cec nvme_common typec
i2c_hid_acpi i2c_hid video wmi ip6_tables ip_tables fuse
[  155.948540] CPU: 3 PID: 4850 Comm: Sackboy-Win64-T Tainted: G
 WL---  ---
6.1.0-0.rc3.20221101git5aaef24b5c6d.29.fc38.x86_64 #1
[  155.948544] Hardware name: ASUSTeK COMPUTER INC. ROG Strix
G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022
[  155.948547] RIP: 0010:amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu]
[  155.948748] Code: 9e f1 e9 32 ff ff ff 4c 89 e9 89 ea 48 c7 c6 a8
a3 fd c0 48 c7 c7 88 81 1e c1 e8 af 97 ea f1 eb 8e 66 90 bd f2 ff ff
ff eb 8d <0f> 0b eb f5 bd fd ff ff ff eb 82 bd f2 ff ff ff e9 62 ff ff
ff 48
[  155.948751] RSP: 0018:960b544d3a50 EFLAGS: 00010282
[  155.948756] RAX: 8a4e40d44e00 RBX: 8a4f0e564140 RCX: 0001
[  155.948759] RDX:  RSI: 8a4e40d44e00 RDI: 8a4f4b52b400
[  155.948761] RBP: 8a4e8c979000 R08: 0dc0 R09: 
[  155.948764] R10: 0001 R11:  R12: 8a4e8aaad558
[  155.948767] R13: 3b91 R14: 8a4f0e667180 R15: 8a4f4b52b458
[  155.948770] FS:  7fa13fe006c0() GS:8a5d16e0()
knlGS:36f8
[  155.948772] CS:  0010 DS:  ES:  CR0: 80050033
[  155.948775] CR2: 25c9e1d0 CR3: 00036199 CR4: 00750ee0
[  155.948778] PKRU: 5554
[  155.948780] Call Trace:
[  155.948783]  
[  155.948790]  amdgpu_cs_ioctl+0x9fd/0x2030 [amdgpu]
[  155.948992]  ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[  155.949155]  drm_ioctl_kernel+0xac/0x160
[  155.949165]  drm_ioctl+0x1e7/0x450
[  155.949172]  ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[  155.949344]  amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]
[  155.949528]  __x64_sys_ioctl+0x90/0xd0
[  155.949537]  do_syscall_64+0x5b/0x80
[  155.949547]  ? lock_is_held_type+0xe8/0x140
[  155.949559]  ? do_syscall_64+0x67/0x80
[  155.949565]  ? lockdep_hardirqs_on+0x7d/0x100
[  155.949573]  ? do_syscall_64+0x67/0x80
[  155.949579]  ? do_syscall_64+0x67/0x80
[  155.949586]  ? do_syscall_64+0x67/0x80
[  155.949592]  ? lockdep_hardirqs_on+0x7d/0x100
[  155.949597]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  155.949603] RIP: 0033:0x7fa1b7fd912f
[  155.949610] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24
10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00
00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28
00 00
[  155.949615] RSP: 002b:7fa13fdfe920 EFLAGS: 0246 ORIG_RAX:
0010
[  155.949621] RAX: ffda RBX: 7fa13fdfebe8 RCX: 7fa1b7fd912f
[  155.949625] RDX: 7fa13fdfea10 RSI: c0186444 RDI: 0165
[  155.949629] RBP: 7fa13fdfea10 R08: 7f9ff80018e0 R09: 7fa13fdfe9c0
[  155.949633] R10: 7eb11590 R11: 0246 R12: c0186444
[

Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start

2022-10-30 Thread Mikhail Gavrilov

On Wed, Oct 26, 2022 at 12:29 PM Christian König
 wrote:
>
> Attached is the original test patch rebased on current amd-staging-drm-next.
>
> Can you test if this is enough to make sure that the games start without
> crashing by fetching the userptrs?

1. Over the past week the list of games affected by this issue has
grown: The Outlast Trials, Gotham Knights, Sackboy: A Big
Adventure.

2. I tested the patch and it really solves the problem with the launch
of all the listed games and does not create new problems.

3. The only thing I noticed is that in the game Sackboy: A Big
Adventure, when using the kernel built from the commit
b229b6ca5abbd63ff40c1396095b1b36b18139c3 + the attached patch, I can't
connect to a friend's co-op session because the Steam client hangs. The
kernel built from commit 736ec9fadd7a1fde8480df7e5cfac465c07ff6f3
(this is the commit prior to dd80d9c8eecac8c516da5b240d01a35660ba6cb6)
is free of this problem.

I need to spend some more time to find the commit that leads to the
Steam client hang [3].

Thanks.

-- 
Best Regards,
Mike Gavrilov.

Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start

2022-10-21 Thread Mikhail Gavrilov

On Fri, Oct 21, 2022 at 1:33 PM Christian König
 wrote:
>
> Hi,
>
> yes Bas already reported this issue, but I couldn't reproduce it. Need
> to come up with a patch to narrow this down further.
>
> Can I send you something to test?

I would be glad to test any patches and ideas.

-- 
Best Regards,
Mike Gavrilov.

[6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start

2022-10-21 Thread Mikhail Gavrilov

Hi!
I found that some games (Cyberpunk 2077, Forza Horizon 4/5) hang at
start after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6.

dd80d9c8eecac8c516da5b240d01a35660ba6cb6 is the first bad commit
commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6
Author: Christian König 
Date:   Thu Jul 14 10:23:38 2022 +0200

drm/amdgpu: revert "partial revert "remove ctx->lock" v2"

This reverts commit 94f4c4965e5513ba624488f4b601d6b385635aec.

We found that the bo_list is missing a protection for its list entries.
Since that is fixed now this workaround can be removed again.

Signed-off-by: Christian König 
Reviewed-by: Alex Deucher 
Signed-off-by: Alex Deucher 

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 21 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c |  2 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h |  1 -
 3 files changed, 6 insertions(+), 18 deletions(-)


When it happens, a backtrace like this appears in the kernel log:
[  231.331210] [ cut here ]
[  231.331262] WARNING: CPU: 11 PID: 6555 at
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:675
amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu]
[  231.331424] Modules linked in: uinput rfcomm snd_seq_dummy
snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
qrtr bnep intel_rapl_msr intel_rapl_common snd_sof_amd_renoir
snd_sof_amd_acp snd_sof_pci snd_hda_codec_realtek snd_sof
snd_hda_codec_generic snd_hda_codec_hdmi snd_sof_utils mt7921e
snd_hda_intel sunrpc snd_intel_dspcfg mt7921_common binfmt_misc
snd_intel_sdw_acpi snd_hda_codec mt76_connac_lib edac_mce_amd btusb
snd_soc_core mt76 snd_hda_core btrtl snd_hwdep snd_compress kvm_amd
ac97_bus snd_seq btbcm snd_pcm_dmaengine btintel snd_rpl_pci_acp6x
mac80211 btmtk snd_pci_acp6x kvm snd_seq_device snd_pcm snd_pci_acp5x
libarc4 irqbypass bluetooth snd_rn_pci_acp3x snd_timer pcspkr
asus_nb_wmi rapl joydev wmi_bmof snd_acp_config cfg80211 snd_soc_acpi
vfat snd
[  231.331490]  snd_pci_acp3x i2c_piix4 soundcore fat k10temp amd_pmc
asus_wireless zram amdgpu drm_ttm_helper ttm hid_asus asus_wmi
iommu_v2 crct10dif_pclmul crc32_pclmul gpu_sched crc32c_intel
ledtrig_audio sparse_keymap polyval_clmulni platform_profile drm_buddy
polyval_generic hid_multitouch drm_display_helper rfkill nvme
ucsi_acpi ghash_clmulni_intel nvme_core video typec_ucsi serio_raw ccp
sha512_ssse3 sp5100_tco r8169 cec nvme_common typec wmi i2c_hid_acpi
i2c_hid ip6_tables ip_tables fuse
[  231.331532] CPU: 11 PID: 6555 Comm: GameThread Tainted: GW
  L---  ---
6.1.0-0.rc1.20221019gitaae703b02f92.17.fc38.x86_64 #1
[  231.331534] Hardware name: ASUSTeK COMPUTER INC. ROG Strix
G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022
[  231.331537] RIP: 0010:amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu]
[  231.331654] Code: a8 d0 e9 32 ff ff ff 4c 89 e9 89 ea 48 c7 c6 40
82 f3 c0 48 c7 c7 10 60 14 c1 e8 2f a0 f4 d0 eb 8e 66 90 bd f2 ff ff
ff eb 8d <0f> 0b eb f5 bd fd ff ff ff eb 82 bd f2 ff ff ff e9 62 ff ff
ff 48
[  231.331656] RSP: 0018:aad4c705bae8 EFLAGS: 00010286
[  231.331659] RAX: 8e9cbdbe3200 RBX: 8e997e3f2440 RCX: 
[  231.331661] RDX:  RSI: 8e9cbdbe3200 RDI: 8e9c31208000
[  231.331663] RBP: 0001 R08: 0dc0 R09: 
[  231.331665] R10: 0001 R11:  R12: aad4c705bb90
[  231.331666] R13: 7651 R14: 8e9c89f334e0 R15: 8e991fda8000
[  231.331668] FS:  7c2af6c0() GS:8ea7d8e0()
knlGS:7b2c
[  231.331671] CS:  0010 DS:  ES:  CR0: 80050033
[  231.331673] CR2: 7ff65ffd8000 CR3: 0004f90f CR4: 00750ee0
[  231.331674] PKRU: 5554
[  231.331676] Call Trace:
[  231.331678]  
[  231.331682]  amdgpu_cs_ioctl+0x87e/0x1fc0 [amdgpu]
[  231.331824]  ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[  231.331981]  drm_ioctl_kernel+0xac/0x160
[  231.331990]  drm_ioctl+0x1e7/0x450
[  231.331994]  ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[  231.332118]  amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]
[  231.332233]  __x64_sys_ioctl+0x90/0xd0
[  231.332238]  do_syscall_64+0x5b/0x80
[  231.332243]  ? asm_exc_page_fault+0x22/0x30
[  231.332247]  ? lockdep_hardirqs_on+0x7d/0x100
[  231.332250]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  231.332253] RIP: 0033:0x7ff677c5704f
[  231.332256] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24
10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00
00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28
00 00
[  231.332258] RSP: 002b:7c2ad470 EFLAGS: 0246 ORIG_RAX:
0010
[  231.332261] RAX: ffda RBX: 7c2ad718 RCX: 7ff677c5704f
[  231.332263] RDX: 7c2ad540 RSI: c0186444

Re: [Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some ga

2022-10-17 Thread Mikhail Gavrilov

On Wed, May 11, 2022 at 5:01 PM Christian König
 wrote:
>
>
> We have implemented a workaround, but still don't know the exact root cause.
>
> If anybody wants to look into this it would be rather helpful to be able
> to reproduce the issue.
>
> Regards,
> Christian.

I see that the issue returned after this commit:
dd80d9c8eecac8c516da5b240d01a35660ba6cb6 is the first bad commit
commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6
Author: Christian König 
Date:   Thu Jul 14 10:23:38 2022 +0200

drm/amdgpu: revert "partial revert "remove ctx->lock" v2"

This reverts commit 94f4c4965e5513ba624488f4b601d6b385635aec.

We found that the bo_list is missing a protection for its list entries.
Since that is fixed now this workaround can be removed again.

Signed-off-by: Christian König 
Reviewed-by: Alex Deucher 
Signed-off-by: Alex Deucher 

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 21 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c |  2 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h |  1 -
 3 files changed, 6 insertions(+), 18 deletions(-)

The games Forza Horizon 4 and Cyberpunk 2077 hang at start again.


-- 
Best Regards,
Mike Gavrilov.

[regression][6.0] After commit b261509952bc19d1012cf732f853659be6ebc61e I see WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70

2022-10-13 Thread Mikhail Gavrilov

Hi!
I bisected an issue in the 6.0 kernel that started happening after
6.0-rc7 on all my machines.

The backtrace of this issue looks like this:

[ 2807.339439] [ cut here ]
[ 2807.339445] WARNING: CPU: 11 PID: 2061 at
drivers/gpu/drm/drm_modeset_lock.c:276
drm_modeset_drop_locks+0x63/0x70
[ 2807.339453] Modules linked in: tls uinput rfcomm snd_seq_dummy
snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
qrtr bnep intel_rapl_msr intel_rapl_common snd_sof_amd_renoir
snd_sof_amd_acp snd_sof_pci snd_hda_codec_realtek sunrpc snd_sof
snd_hda_codec_hdmi snd_hda_codec_generic snd_sof_utils snd_hda_intel
snd_intel_dspcfg mt7921e snd_intel_sdw_acpi binfmt_misc snd_soc_core
mt7921_common snd_hda_codec snd_compress vfat ac97_bus edac_mce_amd
mt76_connac_lib snd_pcm_dmaengine fat snd_hda_core snd_rpl_pci_acp6x
snd_pci_acp6x mt76 btusb snd_hwdep kvm_amd btrtl snd_seq btbcm
mac80211 snd_seq_device kvm btintel btmtk libarc4 snd_pcm
snd_pci_acp5x bluetooth snd_timer snd_rn_pci_acp3x irqbypass
snd_acp_config snd_soc_acpi cfg80211 rapl snd joydev pcspkr
asus_nb_wmi wmi_bmof
[ 2807.339519]  snd_pci_acp3x soundcore i2c_piix4 k10temp amd_pmc
asus_wireless zram amdgpu drm_ttm_helper ttm hid_asus asus_wmi
crct10dif_pclmul iommu_v2 crc32_pclmul ledtrig_audio crc32c_intel
gpu_sched sparse_keymap platform_profile hid_multitouch
polyval_clmulni nvme ucsi_acpi drm_buddy polyval_generic
drm_display_helper ghash_clmulni_intel serio_raw nvme_core ccp
typec_ucsi rfkill sp5100_tco r8169 cec nvme_common typec wmi video
i2c_hid_acpi i2c_hid ip6_tables ip_tables fuse
[ 2807.339540] Unloaded tainted modules: acpi_cpufreq():1
acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1
acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1
amd64_edac():1 acpi_cpufreq():1 acpi_cpufreq():1 amd64_edac():1
amd64_edac():1 acpi_cpufreq():1 pcc_cpufreq():1 fjes():1
amd64_edac():1 acpi_cpufreq():1 amd64_edac():1 acpi_cpufreq():1
fjes():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 fjes():1
amd64_edac():1 acpi_cpufreq():1 pcc_cpufreq():1 amd64_edac():1
fjes():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 fjes():1 acpi_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 acpi_cpufreq():1 fjes():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1
fjes():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1
acpi_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 fjes():1
acpi_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
acpi_cpufreq():1 fjes():1 pcc_cpufreq():1 acpi_cpufreq():1
pcc_cpufreq():1 fjes():1
[ 2807.339579]  acpi_cpufreq():1 fjes():1 pcc_cpufreq():1
acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 fjes():1
acpi_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1
acpi_cpufreq():1 fjes():1 acpi_cpufreq():1 fjes():1 fjes():1 fjes():1
fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1
fjes():1 fjes():1 fjes():1 fjes():1
[ 2807.339596] CPU: 11 PID: 2061 Comm: gnome-shell Tainted: GW
   L 6.0.0-rc4-07-cb0eca01ad9756e853efec3301203c2b5b45aa9f+ #16
[ 2807.339598] Hardware name: ASUSTeK COMPUTER INC. ROG Strix
G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022
[ 2807.339600] RIP: 0010:drm_modeset_drop_locks+0x63/0x70
[ 2807.339602] Code: 42 08 48 89 10 48 89 1b 48 8d bb 50 ff ff ff 48
89 5b 08 e8 3f 41 55 00 48 8b 45 78 49 39 c4 75 c6 5b 5d 41 5c c3 cc
cc cc cc <0f> 0b eb ac 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55
41 54
[ 2807.339604] RSP: 0018:b6ad46e07b80 EFLAGS: 00010282
[ 2807.339606] RAX: 0001 RBX:  RCX: 0002
[ 2807.339607] RDX: 0001 RSI: a6a118b1 RDI: b6ad46e07c00
[ 2807.339608] RBP: b6ad46e07c00 R08:  R09: 
[ 2807.339609] R10:  R11: 0001 R12: 
[ 2807.339610] R13: 9801ca24bb00 R14: 9801ca24bb00 R15: 
[ 2807.339611] FS:  7f57445b0600() GS:981198e0()
knlGS:
[ 2807.339613] CS:  0010 DS:  ES:  CR0: 80050033
[ 2807.339614] CR2: 7f574367f000 CR3: 0001236ae000 CR4: 00750ee0
[ 2807.339615] PKRU: 5554
[ 2807.339616] Call Trace:
[ 2807.339618]  
[ 2807.339621]  drm_mode_atomic_ioctl+0x3b9/0xac0
[ 2807.339627]  ? drm_atomic_set_property+0xb60/0xb60
[ 2807.339629]  drm_ioctl_kernel+0xac/0x160
[ 2807.339633]  drm_ioctl+0x22d/0x410
[ 2807.339635]  ? drm_atomic_set_property+0xb60/0xb60
[ 2807.339639]  amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]
[ 2807.339834]  __x64_sys_ioctl+0x90/0xd0
[ 2807.339838]  do_syscall_64+0x5b/0x80
[ 2807.339843]  ? rcu_read_lock_sched_held+0x10/0x80
[ 2807.339846]  ? trace_hardirqs_on_prepare+0x55/0xe0
[ 2807.339849]  ? do_syscall_64+0x67/0x80
[ 2807.339851]  ?
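
Each frame in a trace like the one above has the form
`symbol+0xOFFSET/0xSIZE [module]`, and a leading `?` marks a frame the
unwinder could not confirm. A minimal sketch for pulling those fields out
of a log line, assuming that layout; `parse_frame` is a hypothetical helper
name, not part of any kernel tooling:

```python
import re

# One stack frame: optional "? " (unreliable frame), symbol name,
# hex offset into the function, hex function size, optional [module].
FRAME_RE = re.compile(
    r"(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)

def parse_frame(line):
    """Return the frame fields of a kernel log line, or None if it has none."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return {
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),
        "size": int(m.group("size"), 16),
        "module": m.group("module"),  # None for built-in code
        "reliable": m.group("unreliable") is None,
    }

if __name__ == "__main__":
    for raw in (
        "[ 2807.339621]  drm_mode_atomic_ioctl+0x3b9/0xac0",
        "[ 2807.339627]  ? drm_atomic_set_property+0xb60/0xb60",
        "[ 2807.339639]  amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]",
    ):
        print(parse_frame(raw))
```

Keeping the offset and size as integers makes it easy to compare frames
between reports even when the timestamps differ.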

[regression][6.1] After commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86 system randomly hungs

2022-10-11 Thread Mikhail Gavrilov

Hi!
The hangs occur randomly, but I found a good way to reproduce them:
running the campaign in the game Halo Infinite.
The backtrace looks like this:

[  147.260971] BUG: kernel NULL pointer dereference, address: 0088
[  147.260987] [ cut here ]
[  147.260988] WARNING: CPU: 3 PID: 0 at kernel/softirq.c:321
__local_bh_disable_ip+0x9e/0xb0
[  147.260993] Modules linked in: uinput rfcomm snd_seq_dummy
snd_hrtimer netconsole nft_objref nf_conntrack_netbios_ns
nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
nf_tables nfnetlink qrtr bnep sunrpc snd_sof_amd_renoir intel_rapl_msr
snd_sof_amd_acp intel_rapl_common mt7921e snd_sof_pci mt7921_common
binfmt_misc snd_sof mt76_connac_lib snd_sof_utils vfat
snd_hda_codec_realtek snd_soc_core snd_hda_codec_generic mt76 fat
snd_hda_codec_hdmi snd_hda_intel edac_mce_amd snd_compress ac97_bus
btusb kvm_amd snd_intel_dspcfg snd_pcm_dmaengine btrtl
snd_intel_sdw_acpi btbcm snd_hda_codec snd_pci_acp6x mac80211 kvm
snd_hda_core btintel btmtk irqbypass snd_hwdep snd_seq libarc4
snd_seq_device bluetooth snd_pcm snd_pci_acp5x snd_timer
snd_rn_pci_acp3x cfg80211 rapl pcspkr joydev asus_nb_wmi wmi_bmof
snd_acp_config snd snd_soc_acpi k10temp
[  147.261033]  soundcore i2c_piix4 snd_pci_acp3x asus_wireless
amd_pmc zram amdgpu drm_ttm_helper ttm hid_asus iommu_v2 asus_wmi
gpu_sched ledtrig_audio sparse_keymap drm_buddy platform_profile
drm_display_helper crct10dif_pclmul crc32_pclmul nvme rfkill
crc32c_intel ucsi_acpi hid_multitouch video ghash_clmulni_intel
nvme_core ccp typec_ucsi serio_raw r8169 cec sp5100_tco typec
i2c_hid_acpi wmi i2c_hid ip6_tables ip_tables fuse
[  147.261045] CPU: 3 PID: 0 Comm: swapper/3 Tainted: GWL
   6.0.0-rc2-02-907cc346ff6a69a08b4786c4ed2a78ac0120b9da+ #124
[  147.261046] Hardware name: ASUSTeK COMPUTER INC. ROG Strix
G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022
[  147.261047] RIP: 0010:__local_bh_disable_ip+0x9e/0xb0
[  147.261048] Code: 25 00 1e 02 00 48 89 df e8 6f 23 08 00 85 c0 75
0e 48 89 9d 30 1c 00 00 5b 5d c3 cc cc cc cc 31 ff 31 db e8 54 23 08
00 eb e7 <0f> 0b e9 76 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
44 00
[  147.261049] RSP: 0018:a4e1c028c8d8 EFLAGS: 00010006
[  147.261050] RAX: 80010005 RBX: 0201 RCX: 0018
[  147.261051] RDX: 0f440b255950 RSI: 0201 RDI: c1b652e5
[  147.261051] RBP: 93a4eaf00fd8 R08: 0001 R09: 
[  147.261052] R10: 7635d840c31a8942 R11: fcca632b3d1b0d46 R12: 93a4f7831000
[  147.261052] R13: 93a4eaf00ee0 R14: 93a4efd84178 R15: 93a4efd84000
[  147.261053] FS:  () GS:93b396e0()
knlGS:
[  147.261054] CS:  0010 DS:  ES:  CR0: 80050033
[  147.261055] CR2: 0088 CR3: 00012a61 CR4: 00750ee0
[  147.261056] PKRU: 5554
[  147.261056] Call Trace:
[  147.261060]  
[  147.261068]  _raw_spin_lock_bh+0x1d/0x80
[  147.261074]  ieee80211_queue_skb+0x125/0x7a0 [mac80211]
[  147.261113]  ? __skb_get_hash+0x55/0x200
[  147.261117]  ieee80211_tx_8023+0x9c/0x1c0 [mac80211]
[  147.261155]  ieee80211_subif_start_xmit_8023+0x2b5/0x510 [mac80211]
[  147.261191]  netpoll_start_xmit+0x121/0x190
[  147.261199]  netpoll_send_skb+0x1fc/0x300
[  147.261202]  write_msg+0xdc/0xf0 [netconsole]
[  147.261207]  console_emit_next_record.constprop.0+0x17d/0x300
[  147.261214]  console_unlock+0xf3/0x1f0
[  147.261215]  vprintk_emit+0x152/0x350
[  147.261217]  ? plist_add+0xba/0xf0
[  147.261223]  _printk+0x48/0x4e
[  147.261231]  ? rcu_read_lock_sched_held+0x10/0x80
[  147.261235]  page_fault_oops.cold+0xcf/0x1f9
[  147.261240]  ? do_user_addr_fault+0x65/0x6b0
[  147.261243]  ? _raw_spin_unlock_irqrestore+0x40/0x60
[  147.261247]  exc_page_fault+0x7e/0x300
[  147.261249]  asm_exc_page_fault+0x22/0x30
[  147.261252] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x1e0 [gpu_sched]
[  147.261255] Code: 89 d7 e8 87 02 0d f0 e9 54 ff ff ff 48 89 d7 e8
ea 66 37 f0 e9 47 ff ff ff 0f 1f 44 00 00 0f 1f 44 00 00 41 54 55 53
48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d 70 02 00 00 48 8b 85 a8 03 00
00 f0
[  147.261256] RSP: 0018:a4e1c028cdc8 EFLAGS: 00010093
[  147.261257] RAX: c06dc380 RBX:  RCX: 0018
[  147.261257] RDX: 0efa9afe3594 RSI: 93a7a4c1ec90 RDI: 
[  147.261258] RBP: 93a7a4c1ee10 R08: 0001 R09: 
[  147.261259] R10:  R11: 0001 R12: a4e1c028cde8
[  147.261259] R13: 0086 R14:  R15: 93a4fbed0198
[  147.261261]  ? drm_sched_job_done.isra.0+0x1e0/0x1e0 [gpu_sched]
[  147.261266]  dma_fence_signal_timestamp_locked+0x9e/0x1c0
[  147.261274]  dma_fence_signal+0x36/0x70
[  147.261276]

Re: [BUG][5.20] refcount_t: underflow; use-after-free

2022-09-19 Thread Mikhail Gavrilov

Hi!
Unfortunately the use-after-free issue still happens on the 6.0-rc5 kernel.
The issue has become hard to reproduce. I spent the whole day at the
computer before the use-after-free happened again, while I was playing
the game Tiny Tina's Wonderlands.
So I can't offer reliable reproduction steps. We can only hope for
logs and tracing.
I didn't see anything new in the logs. It seems that we need to
extend the logging somehow so that the next time this happens we have
more information.

Sep 18 20:52:16 primary-ws gnome-shell[2388]:
meta_window_set_stack_position_no_sync: assertion
'window->stack_position >= 0' failed
Sep 18 20:52:27 primary-ws gnome-shell[2388]:
meta_window_set_stack_position_no_sync: assertion
'window->stack_position >= 0' failed
Sep 18 20:53:44 primary-ws gnome-shell[2388]: Window manager warning:
Window 0x4e3 sets an MWM hint indicating it isn't resizable, but
sets min size 1 x 1 and max size 2147483647 x 2147483647; this doesn't
make much sense.
Sep 18 20:53:45 primary-ws kernel: umip_printk: 11 callbacks suppressed
Sep 18 20:53:45 primary-ws kernel: umip: Wonderlands.exe[213853]
ip:14ebb0d03 sp:4ee528: SGDT instruction cannot be used by
applications.
Sep 18 20:53:45 primary-ws kernel: umip: Wonderlands.exe[213853]
ip:14ebb0d03 sp:4ee528: For now, expensive software emulation returns
the result.
Sep 18 20:53:53 primary-ws gnome-shell[2388]:
meta_window_set_stack_position_no_sync: assertion
'window->stack_position >= 0' failed
Sep 18 20:53:53 primary-ws kernel: umip: Wonderlands.exe[213853]
ip:14ebb0d03 sp:4ee528: SGDT instruction cannot be used by
applications.
Sep 18 20:53:53 primary-ws kernel: umip: Wonderlands.exe[213853]
ip:14ebb0d03 sp:4ee528: For now, expensive software emulation returns
the result.
Sep 18 20:54:15 primary-ws kernel: umip: Wonderlands.exe[214194]
ip:15a270815 sp:6eaef490: SGDT instruction cannot be used by
applications.
Sep 18 20:56:01 primary-ws kernel: umip_printk: 15 callbacks suppressed
Sep 18 20:56:01 primary-ws kernel: umip: Wonderlands.exe[213853]
ip:15e3a82b0 sp:4ed178: SGDT instruction cannot be used by
applications.
Sep 18 20:56:01 primary-ws kernel: umip: Wonderlands.exe[213853]
ip:15e3a82b0 sp:4ed178: For now, expensive software emulation returns
the result.
Sep 18 20:56:03 primary-ws kernel: umip: Wonderlands.exe[213853]
ip:15e3a82b0 sp:4edbe8: SGDT instruction cannot be used by
applications.
Sep 18 20:56:03 primary-ws kernel: umip: Wonderlands.exe[213853]
ip:15e3a82b0 sp:4edbe8: For now, expensive software emulation returns
the result.
Sep 18 20:56:03 primary-ws kernel: umip: Wonderlands.exe[213853]
ip:15e3a82b0 sp:4ebf18: SGDT instruction cannot be used by
applications.
Sep 18 20:57:55 primary-ws kernel: [ cut here ]
Sep 18 20:57:55 primary-ws kernel: refcount_t: underflow; use-after-free.
Sep 18 20:57:55 primary-ws kernel: WARNING: CPU: 22 PID: 235114 at
lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
Sep 18 20:57:55 primary-ws kernel: Modules linked in: tls uinput
rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns
nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_>
Sep 18 20:57:55 primary-ws kernel:  asus_wmi ledtrig_audio
sparse_keymap platform_profile irqbypass rfkill mc rapl snd_timer
video wmi_bmof pcspkr snd k10temp i2c_piix4 soundcore acpi_cpufreq
zram amdgpu drm_ttm_helper ttm iommu_v2 crct1>
Sep 18 20:57:55 primary-ws kernel: Unloaded tainted modules:
amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1
amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_eda>
Sep 18 20:57:55 primary-ws kernel:  pcc_cpufreq():1 pcc_cpufreq():1
fjes():1 fjes():1 pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1
fjes():1
Sep 18 20:57:55 primary-ws kernel: CPU: 22 PID: 235114 Comm:
kworker/22:0 Tainted: GWL---  ---
6.0.0-0.rc5.20220914git3245cb65fd91.39.fc38.x86_64 #1
Sep 18 20:57:55 primary-ws kernel: Hardware name: System manufacturer
System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
Sep 18 20:57:55 primary-ws kernel: Workqueue: events
drm_sched_entity_kill_jobs_work [gpu_sched]
Sep 18 20:57:55 primary-ws kernel: RIP: 0010:refcount_warn_saturate+0xba/0x110
Sep 18 20:57:55 primary-ws kernel: Code: 01 01 e8 69 6b 6f 00 0f 0b e9
32 38 a5 00 80 3d 4d 7d be 01 00 75 85 48 c7 c7 80 b7 8e 95 c6 05 3d
7d be 01 01 e8 46 6b 6f 00 <0f> 0b e9 0f 38 a5 00 80 3d 28 7d be 01 00
0f 85 5e ff ff ff 48 c7
Sep 18 20:57:55 primary-ws kernel: RSP: 0018:a1a853ccbe60 EFLAGS: 00010286
Sep 18 20:57:55 primary-ws kernel: RAX: 0026 RBX:
8e0e60a96c28 RCX: 
Sep 18 20:57:55 primary-ws kernel: RDX: 0001 RSI:
958d255c RDI: 
Sep 18 20:57:55 primary-ws kernel: RBP: 8e19a83f5600 R08:
 R09: a1a853ccbd10
Sep 18 20:57:55 primary-ws kernel: R10: 0003 R11:
8e19ee2fffe8 R12: 8e19a83fc800
Sep 18

Re: [BUG][5.20] refcount_t: underflow; use-after-free

2022-08-24 Thread Mikhail Gavrilov

On Fri, Aug 19, 2022 at 5:13 PM Maíra Canal  wrote:
>
> Hi Mikhail,
>
> Could you please specify the steps to reproduce this use-after-free? I
> will try to reproduce it on the RX5700 XT and bisect the issue.
>

Hi Maíra, thanks for the help.

I'm afraid that it will be unrealistic to reproduce, because on a
laptop with 6800M (also RDNA 2 graphics) the problem does not occur.

Sorry for the long silence, but I was trying to bisect the problem myself.

git bisect start
# status: waiting for both good and bad commits
# good: [3d7cb6b04c3f3115719235cc6866b10326de34cd] Linux 5.19
git bisect good 3d7cb6b04c3f3115719235cc6866b10326de34cd
# status: waiting for bad commit, 1 good commit known
# bad: [7ebfc85e2cd7b08f518b526173e9a33b56b3913b] Merge tag
'net-6.0-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
git bisect bad 7ebfc85e2cd7b08f518b526173e9a33b56b3913b

# bad: [b44f2fd87919b5ae6e1756d4c7ba2cbba22238e1] Merge tag
'drm-next-2022-08-03' of git://anongit.freedesktop.org/drm/drm
# 001: GPU hangs + use-after-free issue - https://pastebin.com/z86E9ydx
git bisect bad b44f2fd87919b5ae6e1756d4c7ba2cbba22238e1

# good: [526942b8134cc34d25d27f95dfff98b8ce2f6fcd] Merge tag
'ata-5.20-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata
# 002: good - https://pastebin.com/9qki65Sj
git bisect good 526942b8134cc34d25d27f95dfff98b8ce2f6fcd

# good: [45490ce2ff833c4ec0de66705e46ba41320860cb] nfp: flower: add
support for tunnel offload without key ID
# 003: good - https://pastebin.com/vHk5eRkw
git bisect good 45490ce2ff833c4ec0de66705e46ba41320860cb

# skip: [e23a5e14aa278858c2e3d81ec34e83aa9a4177c5] Backmerge tag
'v5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux
into drm-next
# 004: GPU not switched in graphic mode - https://pastebin.com/RmqCTMLD
git bisect skip e23a5e14aa278858c2e3d81ec34e83aa9a4177c5

# bad: [b2065fb21d9a789b14f737ea90facedabadeb8a4] drm/amdgpu: fix
i2s_pdata out of bound array access
# 005: GPU hangs + use-after-free issue - https://pastebin.com/Zgw5Hc48
git bisect bad b2065fb21d9a789b14f737ea90facedabadeb8a4

# skip: [344feb7ccf764756937cfd74fa4ac5caba069c99] Merge tag
'amd-drm-next-5.20-2022-07-05' of
https://gitlab.freedesktop.org/agd5f/linux into drm-next
# 006: GPU not switched in graphic mode - https://pastebin.com/b8BUBE7Q
git bisect skip 344feb7ccf764756937cfd74fa4ac5caba069c99

# skip: [869b10ac8d2300327f554d83f4dbab041bf27d49] drm/amdgpu: add dm
ip block for dcn 3.1.4
# 007: GPU not switched in graphic mode - https://pastebin.com/byd7HECH
git bisect skip 869b10ac8d2300327f554d83f4dbab041bf27d49

# skip: [676ad8e997036e2f815c293b76c356fb7cc97a08] drm: rcar-du: Lift
z-pos restriction on primary plane for Gen3
# 008: GPU not switched in graphic mode - https://pastebin.com/3fXCTinb
git bisect skip 676ad8e997036e2f815c293b76c356fb7cc97a08

# skip: [5c57cbc390b166950c2e6c2f0c4edaeb0f47e97d] drm/bridge: lt9211:
Convert to drm_of_get_data_lanes_count
# 009: Build error - https://pastebin.com/rxHe9QRB
git bisect skip 5c57cbc390b166950c2e6c2f0c4edaeb0f47e97d

# skip: [6db5e0c8692e590734a7ec7455365d9cbaa15ef1] Merge tag
'drm-intel-next-2022-07-06' of
git://anongit.freedesktop.org/drm/drm-intel into drm-next
# 010: GPU not switched in graphic mode - https://pastebin.com/rqubSuc8
git bisect skip 6db5e0c8692e590734a7ec7455365d9cbaa15ef1

# skip: [5d763a9955f0fbf2681a2f1fa87c416056bd0c89] drm/amd/display:
Remove compiler warning
# 011: GPU not switched in graphic mode - https://pastebin.com/BrJs6ybP
git bisect skip 5d763a9955f0fbf2681a2f1fa87c416056bd0c89

# skip: [e6c2db2be986158afb9991d9fa8a38fe65a88516] drm/i915: Don't use
DRM_DEBUG_WARN_ON for unexpected l3bank/mslice config
# 012: GPU not switched in graphic mode - https://pastebin.com/yxppyqbD
git bisect skip e6c2db2be986158afb9991d9fa8a38fe65a88516

# bad: [cb6b81b21bd9cf09d72b7fe711be1b55001eb166] Merge tag
'drm-misc-next-fixes-2022-07-21' of
git://anongit.freedesktop.org/drm/drm-misc into drm-next
# 013: GPU hangs without use-after-free issue - https://pastebin.com/iRek4bBy
git bisect bad cb6b81b21bd9cf09d72b7fe711be1b55001eb166

# skip: [48b927770f8ad3f8cf4a024a552abf272af9f592]
drm/exynos/exynos7_drm_decon: free resources when clk_set_parent()
failed.
# 014: GPU not switched in graphic mode - https://pastebin.com/ekp10xhP
git bisect skip 48b927770f8ad3f8cf4a024a552abf272af9f592

# skip: [c5da61cf5bab30059f22ea368702c445ee87171a] drm/amdgpu/display:
add missing FP_START/END checks dcn32_clk_mgr.c
# 015: GPU not switched in graphic mode - https://pastebin.com/YbskKWmA
git bisect skip c5da61cf5bab30059f22ea368702c445ee87171a

# skip: [a77f7c89e62c6dfe405a64995812746f27adc510] drm/edid: convert
drm_gtf_modes_for_range() to drm_edid
# 016: GPU not switched in graphic mode - https://pastebin.com/bA2AwkJ7
git bisect skip a77f7c89e62c6dfe405a64995812746f27adc510

# skip: [6fde8eec71796f3534f0c274066862829813b21f] drm/doc: Add KUnit
documentation
# 017: GPU not switched in gr
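
A log like the one above is easier to review once the verdicts are
tallied. A minimal sketch, assuming the log uses the plain
`git bisect good|bad|skip <sha>` lines that `git bisect log` prints;
`tally_bisect_log` is a hypothetical helper name and the sample reuses a
few entries from this log:

```python
import re
from collections import Counter

# Verdict lines as written by `git bisect log`, e.g.
# "git bisect bad b44f2fd87919b5ae6e1756d4c7ba2cbba22238e1".
VERDICT_RE = re.compile(r"^git bisect (good|bad|skip) ([0-9a-f]{7,40})$")

def tally_bisect_log(text):
    """Return (Counter of verdicts, ordered list of (verdict, sha))."""
    steps = []
    for line in text.splitlines():
        m = VERDICT_RE.match(line.strip())
        if m:  # comment lines and "git bisect start" are ignored
            steps.append((m.group(1), m.group(2)))
    return Counter(verdict for verdict, _ in steps), steps

if __name__ == "__main__":
    sample = """\
git bisect start
# good: [3d7cb6b04c3f3115719235cc6866b10326de34cd] Linux 5.19
git bisect good 3d7cb6b04c3f3115719235cc6866b10326de34cd
git bisect bad 7ebfc85e2cd7b08f518b526173e9a33b56b3913b
# 004: GPU not switched in graphic mode
git bisect skip e23a5e14aa278858c2e3d81ec34e83aa9a4177c5
"""
    counts, steps = tally_bisect_log(sample)
    print(dict(counts))
    for verdict, sha in steps:
        print(f"{verdict:5s} {sha[:12]}")
```

A high share of `skip` verdicts, as in this log, means bisect may only be
able to narrow the problem down to a range of commits rather than a single
one.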

Re: [BUG][5.20] refcount_t: underflow; use-after-free

2022-08-17 Thread Mikhail Gavrilov

On Wed, Aug 17, 2022 at 11:43 PM Maíra Canal  wrote:
>
> Hi Mikhail,
>
> Looks like 45ecaea738830b9d521c93520c8f201359dcbd95 ("drm/sched: Partial
> revert of 'drm/sched: Keep s_fence->parent pointer'") introduced the
> error. Try reverting it and check if the use-after-free still happens.

Thanks, but unfortunately, reverting it did not help.
The use-after-free happens again, in a context I can't make sense of.
One new thing: a "suspicious RCU usage" warning appeared, but it looks
completely unrelated to the use-after-free issue.

[ 215.434115] [ cut here ]
[ 215.434184] refcount_t: underflow; use-after-free.
[ 215.434204] WARNING: CPU: 7 PID: 1258 at lib/refcount.c:28
refcount_warn_saturate+0xba/0x110
[ 215.434214] Modules linked in: uinput rfcomm snd_seq_dummy
snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
qrtr bnep sunrpc binfmt_misc snd_seq_midi snd_seq_midi_event
intel_rapl_msr intel_rapl_common snd_hda_codec_realtek vfat
snd_hda_codec_generic snd_hda_codec_hdmi mt76x2u fat mt76x2_common
snd_hda_intel mt76x02_usb snd_intel_dspcfg snd_intel_sdw_acpi mt76_usb
iwlmvm edac_mce_amd snd_usb_audio snd_hda_codec mt76x02_lib
snd_hda_core snd_usbmidi_lib snd_hwdep snd_rawmidi uvcvideo mt76
kvm_amd snd_seq videobuf2_vmalloc videobuf2_memops snd_seq_device
mac80211 videobuf2_v4l2 videobuf2_common kvm btusb iwlwifi snd_pcm
btrtl videodev libarc4 eeepc_wmi btbcm asus_wmi iwlmei btintel
ledtrig_audio xpad irqbypass sparse_keymap btmtk platform_profile
joydev
[ 215.434436] hid_logitech_hidpp rapl ff_memless mc snd_timer
bluetooth cfg80211 video pcspkr wmi_bmof snd soundcore k10temp
i2c_piix4 rfkill mei asus_ec_sensors acpi_cpufreq zram amdgpu
drm_ttm_helper ttm iommu_v2 ucsi_ccg gpu_sched crct10dif_pclmul
crc32_pclmul typec_ucsi drm_buddy crc32c_intel ghash_clmulni_intel ccp
igb sp5100_tco typec drm_display_helper nvme dca nvme_core cec wmi
ip6_tables ip_tables fuse
[ 215.434528] Unloaded tainted modules: amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1
pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1
amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 fjes():1
[ 215.434672] pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1
pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1
[ 215.434702] CPU: 7 PID: 1258 Comm: kworker/7:3 Tainted: G W L
--- --- 6.0.0-0.rc1.20220817git3cc40a443a04.14.fc38.x86_64 #1
[ 215.434709] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[ 215.434715] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched]
[ 215.434728] RIP: 0010:refcount_warn_saturate+0xba/0x110
[ 215.434734] Code: 01 01 e8 59 59 6f 00 0f 0b e9 22 46 a5 00 80 3d be
7d be 01 00 75 85 48 c7 c7 c0 99 8e 92 c6 05 ae 7d be 01 01 e8 36 59
6f 00 <0f> 0b e9 ff 45 a5 00 80 3d 99 7d be 01 00 0f 85 5e ff ff ff 48
c7
[ 215.434740] RSP: 0018:9ccb0237fe60 EFLAGS: 00010286
[ 215.434747] RAX: 0026 RBX: 8d531f6f2828 RCX: 
[ 215.434753] RDX: 0001 RSI: 928d07a4 RDI: 
[ 215.434757] RBP: 8d61e47f5600 R08:  R09: 9ccb0237fd10
[ 215.434762] R10: 0003 R11: 8d622e2fffe8 R12: 8d61e47fc800
[ 215.434767] R13: 8d5313e95500 R14: 8d61e47fc805 R15: 8d531f6f2830
[ 215.434772] FS: () GS:8d61e460()
knlGS:
[ 215.434777] CS: 0010 DS:  ES:  CR0: 80050033
[ 215.434782] CR2: 7f0c8b815048 CR3: 0001ab0e8000 CR4: 00350ee0
[ 215.434788] Call Trace:
[ 215.434792] 
[ 215.434797] process_one_work+0x2a0/0x600
[ 215.434819] worker_thread+0x4f/0x3a0
[ 215.434830] ? process_one_work+0x600/0x600
[ 215.434836] kthread+0xf5/0x120
[ 215.434842] ? kthread_complete_and_exit+0x20/0x20
[ 215.434854] ret_from_fork+0x22/0x30
[ 215.434881] 
[ 215.434885] irq event stamp: 134873
[ 215.434890] hardirqs last enabled at (134881): []

Re: [BUG][5.20] refcount_t: underflow; use-after-free

2022-08-17 Thread Mikhail Gavrilov

On Wed, Aug 17, 2022 at 9:08 PM Melissa Wen  wrote:
>
> Hi Mikhail,
>
> IIUC, you got this second use-after-free by applying the first version
> of Maíra's patch, right? So, that version was adding another unbalanced
> unlock to the cs ioctl flow, but it was solved in the latest version,
> that you can find here: https://patchwork.freedesktop.org/patch/497680/
> If this is the situation, can you check this last version?
>
> Thanks,
>
> Melissa

With the latest version the "bad unlock balance detected!" warning is
gone, but the use-after-free issue remains.
And again it shows "Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched]".
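
The "Workqueue:" line in these traces names the workqueue the worker
thread was serving ("events" is the kernel's default shared queue),
followed by the work function it was running and, in brackets, the module
that function belongs to. A minimal sketch for extracting those fields so
that reports can be compared, assuming that layout; `parse_workqueue` is a
hypothetical helper name:

```python
import re

# "Workqueue: <queue name> <work function> [module]"; the module part is
# absent when the work function is built into the kernel image.
WORKQUEUE_RE = re.compile(
    r"Workqueue:\s+(?P<queue>\S+)\s+(?P<func>\S+)(?:\s+\[(?P<module>\w+)\])?"
)

def parse_workqueue(line):
    """Return queue, func and module from a 'Workqueue:' log line, or None."""
    m = WORKQUEUE_RE.search(line)
    return m.groupdict() if m else None

if __name__ == "__main__":
    line = ("[  297.835078] Workqueue: events "
            "drm_sched_entity_kill_jobs_work [gpu_sched]")
    print(parse_workqueue(line))
```

If the same work function shows up in every report, as it does here, that
function is a reasonable place to start reading even when the call trace
itself only shows generic worker frames.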

[  297.834779] [ cut here ]
[  297.834818] refcount_t: underflow; use-after-free.
[  297.834831] WARNING: CPU: 30 PID: 2377 at lib/refcount.c:28
refcount_warn_saturate+0xba/0x110
[  297.834838] Modules linked in: uinput rfcomm snd_seq_dummy
snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
qrtr bnep sunrpc binfmt_misc snd_seq_midi snd_seq_midi_event mt76x2u
mt76x2_common mt76x02_usb mt76_usb mt76x02_lib snd_hda_codec_realtek
iwlmvm intel_rapl_msr snd_hda_codec_generic snd_hda_codec_hdmi mt76
vfat fat snd_hda_intel intel_rapl_common mac80211 snd_intel_dspcfg
snd_intel_sdw_acpi snd_usb_audio snd_hda_codec snd_usbmidi_lib btusb
edac_mce_amd iwlwifi libarc4 uvcvideo snd_hda_core btrtl snd_rawmidi
snd_hwdep videobuf2_vmalloc btbcm kvm_amd videobuf2_memops snd_seq
iwlmei btintel videobuf2_v4l2 eeepc_wmi snd_seq_device
videobuf2_common btmtk kvm xpad videodev joydev irqbypass snd_pcm
asus_wmi hid_logitech_hidpp ff_memless cfg80211 bluetooth rapl mc
[  297.834932]  ledtrig_audio snd_timer sparse_keymap platform_profile
wmi_bmof snd video pcspkr k10temp i2c_piix4 rfkill soundcore mei
asus_ec_sensors acpi_cpufreq zram amdgpu drm_ttm_helper ttm
crct10dif_pclmul crc32_pclmul crc32c_intel iommu_v2 ucsi_ccg gpu_sched
typec_ucsi drm_buddy ghash_clmulni_intel drm_display_helper ccp igb
typec sp5100_tco nvme cec nvme_core dca wmi ip6_tables ip_tables fuse
[  297.834978] Unloaded tainted modules: amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 fjes():1
[  297.835055]  pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1
pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1
[  297.835071] CPU: 30 PID: 2377 Comm: kworker/30:6 Tainted: G
WL---  ---
6.0.0-0.rc1.20220817git3cc40a443a04.14.fc38.x86_64 #1
[  297.835075] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[  297.835078] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched]
[  297.835085] RIP: 0010:refcount_warn_saturate+0xba/0x110
[  297.835088] Code: 01 01 e8 59 59 6f 00 0f 0b e9 22 46 a5 00 80 3d
be 7d be 01 00 75 85 48 c7 c7 c0 99 8e aa c6 05 ae 7d be 01 01 e8 36
59 6f 00 <0f> 0b e9 ff 45 a5 00 80 3d 99 7d be 01 00 0f 85 5e ff ff ff
48 c7
[  297.835091] RSP: 0018:bd3506df7e60 EFLAGS: 00010286
[  297.835095] RAX: 0026 RBX: 961b250cbc28 RCX: 
[  297.835097] RDX: 0001 RSI: aa8d07a4 RDI: 
[  297.835100] RBP: 96276a3f5600 R08:  R09: bd3506df7d10
[  297.835102] R10: 0003 R11: 9627ae2fffe8 R12: 96276a3fc800
[  297.835105] R13: 9618c03e6600 R14: 96276a3fc805 R15: 961b250cbc30
[  297.835108] FS:  () GS:96276a20()
knlGS:
[  297.835110] CS:  0010 DS:  ES:  CR0: 80050033
[  297.835113] CR2: 621001e4a000 CR3: 00018d958000 CR4: 00350ee0
[  297.835116] Call Trace:
[  297.835118]  
[  297.835121]  process_one_work+0x2a0/0x600
[  297.835133]  worker_thread+0x4f/0x3a0
[  297.835139]  ? process_one_work+0x600/0x600
[  297.835142]  kthread+0xf5/0x120
[  297.835145]  ? kthread_complete_and_exit+0

Re: [BUG][5.20] refcount_t: underflow; use-after-free

2022-08-16 Thread Mikhail Gavrilov

On Mon, Aug 15, 2022 at 3:37 PM Mikhail Gavrilov
 wrote:
>
> Thanks, I tested this patch.
> But with this patch the use-after-free problem happens in another place:

Does anyone have an idea why the second use-after-free happened?
From the trace I can't tell which code is involved.
I also don't quite understand what the "Workqueue" entry in the trace means.
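
As far as I know, the "Workqueue:" line names the workqueue the kworker was servicing, followed by the work function it was running and that function's module. The sketch below is only a reading aid under that assumption: it pulls those fields out of an oops header. The sample lines are copied from the trace in this message with the timestamps dropped; the helper name summarize_oops is made up for illustration.

```python
import re

# Header lines copied from the trace in this message (timestamps dropped).
SAMPLE = """\
CPU: 9 PID: 62 Comm: kworker/9:0 Tainted: G W L
Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched]
RIP: 0010:refcount_warn_saturate+0xba/0x110
"""


def summarize_oops(text):
    """Hypothetical helper: extract task, workqueue, work function and RIP symbol."""
    info = {}
    m = re.search(r"Comm: (\S+)", text)
    if m:
        info["task"] = m.group(1)
    # Assumed layout: "Workqueue: <queue name> <work function> [<module>]"
    m = re.search(r"Workqueue: (\S+) (\S+)(?: \[(\w+)\])?", text)
    if m:
        info["workqueue"], info["work_fn"], info["module"] = m.groups()
    # "RIP: <segment>:<symbol>+0x<offset>/0x<size>"
    m = re.search(r"RIP: [0-9a-f]+:(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", text)
    if m:
        info["rip_symbol"] = m.group(1)
        info["rip_offset"] = int(m.group(2), 16)
    return info


summary = summarize_oops(SAMPLE)
print(summary)
```

Read this way, the warning fired while a kworker was running drm_sched_entity_kill_jobs_work from gpu_sched on the "events" workqueue.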

[ 408.358737] ------------[ cut here ]------------
[ 408.358743] refcount_t: underflow; use-after-free.
[ 408.358760] WARNING: CPU: 9 PID: 62 at lib/refcount.c:28
refcount_warn_saturate+0xba/0x110
[ 408.358769] Modules linked in: uinput snd_seq_dummy rfcomm
snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
qrtr bnep sunrpc binfmt_misc snd_seq_midi snd_seq_midi_event mt76x2u
mt76x2_common snd_hda_codec_realtek mt76x02_usb snd_hda_codec_generic
iwlmvm snd_hda_codec_hdmi mt76_usb intel_rapl_msr snd_hda_intel
mt76x02_lib intel_rapl_common snd_intel_dspcfg snd_intel_sdw_acpi mt76
snd_hda_codec vfat fat snd_usb_audio snd_hda_core edac_mce_amd
mac80211 snd_usbmidi_lib snd_hwdep snd_rawmidi mc snd_seq btusb
kvm_amd iwlwifi snd_seq_device btrtl btbcm libarc4 btintel eeepc_wmi
snd_pcm iwlmei kvm btmtk asus_wmi ledtrig_audio irqbypass joydev
snd_timer sparse_keymap bluetooth platform_profile rapl cfg80211 snd
video wmi_bmof soundcore i2c_piix4 k10temp rfkill mei
[ 408.358853] asus_ec_sensors acpi_cpufreq zram hid_logitech_hidpp
amdgpu igb dca drm_ttm_helper ttm iommu_v2 crct10dif_pclmul gpu_sched
crc32_pclmul ucsi_ccg crc32c_intel drm_buddy nvme typec_ucsi
drm_display_helper ghash_clmulni_intel ccp typec nvme_core sp5100_tco
cec wmi ip6_tables ip_tables fuse
[ 408.358880] Unloaded tainted modules: amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1
[ 408.358953] pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1
fjes():1 fjes():1 fjes():1 fjes():1 fjes():1
[ 408.358967] CPU: 9 PID: 62 Comm: kworker/9:0 Tainted: G W L ---
--- 6.0.0-0.rc1.13.fc38.x86_64+debug #1
[ 408.358971] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[ 408.358974] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched]
[ 408.358982] RIP: 0010:refcount_warn_saturate+0xba/0x110
[ 408.358987] Code: 01 01 e8 d9 59 6f 00 0f 0b e9 a2 46 a5 00 80 3d 3e
7e be 01 00 75 85 48 c7 c7 70 99 8e 92 c6 05 2e 7e be 01 01 e8 b6 59
6f 00 <0f> 0b e9 7f 46 a5 00 80 3d 19 7e be 01 00 0f 85 5e ff ff ff 48
c7
[ 408.358990] RSP: 0018:b124003efe60 EFLAGS: 00010286
[ 408.358994] RAX: 0026 RBX: 9987a025d428 RCX: 
[ 408.358997] RDX: 0001 RSI: 928d0754 RDI: 
[ 408.358999] RBP: 9994e4ff5600 R08:  R09: b124003efd10
[ 408.359001] R10: 0003 R11: 99952e2fffe8 R12: 9994e4ffc800
[ 408.359004] R13: 998600228cc0 R14: 9994e4ffc805 R15: 9987a025d430
[ 408.359006] FS: () GS:9994e4e0()
knlGS:
[ 408.359009] CS: 0010 DS:  ES:  CR0: 80050033
[ 408.359012] CR2: 27ac39e78000 CR3: 0001a66d8000 CR4: 00350ee0
[ 408.359015] Call Trace:
[ 408.359017] 
[ 408.359020] process_one_work+0x2a0/0x600
[ 408.359032] worker_thread+0x4f/0x3a0
[ 408.359036] ? process_one_work+0x600/0x600
[ 408.359039] kthread+0xf5/0x120
[ 408.359044] ? kthread_complete_and_exit+0x20/0x20
[ 408.359049] ret_from_fork+0x22/0x30
[ 408.359061] 
[ 408.359063] irq event stamp: 5468
[ 408.359064] hardirqs last enabled at (5467): []
_raw_spin_unlock_irq+0x24/0x50
[ 408.359071] hardirqs last disabled at (5468): []
__schedule+0xe2c/0x16d0
[ 408.359076] softirqs last enabled at (2482): []
rht_deferred_worker+0x708/0xc00
[ 408.359079] softirqs last disabled at (2480): []
rht_deferred_worker+0x1f7/0xc00
[ 408.359082] ---[ end trace  ]---


Full kernel log i

Re: [BUG][5.20] refcount_t: underflow; use-after-free

2022-08-15 Thread Mikhail Gavrilov

On Mon, Aug 15, 2022 at 5:20 AM Maíra Canal  wrote:
>
> Hi Mikhail
>
> Looks like this use-after-free problem was introduced on
> 90af0ca047f3049c4b46e902f432ad6ef1e2ded6. Checking this patch it seems
> like: if amdgpu_cs_vm_handling return r != 0, then it will unlock
> bo_list_mutex inside the function amdgpu_cs_vm_handling and again on
> amdgpu_cs_parser_fini.
>
> Maybe the following patch will help:
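
The patch mentioned in the quote is not reproduced here, and the block below is not that patch. It is only a rough user-space analogy of the double unlock described above: one helper releases a lock on its error path, and the cleanup routine releases it again. The function names are hypothetical stand-ins and the code is not taken from amdgpu. Python raises on the second release, whereas unlocking an already unlocked kernel mutex is simply a bug with no such guarantee.

```python
import threading

bo_list_mutex = threading.Lock()  # name borrowed from the quoted message


def vm_handling(fail):
    """Hypothetical helper: on failure it releases the lock itself."""
    if fail:
        bo_list_mutex.release()
        return -1
    return 0


def parser_fini():
    """Hypothetical cleanup that unconditionally releases the same lock."""
    bo_list_mutex.release()


def submit(fail):
    bo_list_mutex.acquire()
    r = vm_handling(fail)
    try:
        parser_fini()  # second release if vm_handling() already unlocked
    except RuntimeError as exc:
        return f"double unlock detected: {exc}"
    return "ok" if r == 0 else "failed"


ok_result = submit(fail=False)
bad_result = submit(fail=True)
print(ok_result)
print(bad_result)
```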

Thanks, I tested this patch.
But with this patch the use-after-free problem happens in another place:

[  894.012920] ------------[ cut here ]------------
[  894.012939] refcount_t: underflow; use-after-free.
[  894.012968] WARNING: CPU: 14 PID: 205 at lib/refcount.c:28
refcount_warn_saturate+0xba/0x110
[  894.012999] Modules linked in: tls uinput rfcomm snd_seq_dummy
snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
qrtr bnep sunrpc snd_seq_midi snd_seq_midi_event snd_hda_codec_realtek
mt76x2u mt76x2_common snd_hda_codec_generic snd_hda_codec_hdmi
intel_rapl_msr mt76x02_usb intel_rapl_common snd_hda_intel mt76_usb
snd_intel_dspcfg vfat iwlmvm snd_intel_sdw_acpi mt76x02_lib fat
snd_usb_audio snd_hda_codec mt76 edac_mce_amd snd_usbmidi_lib
snd_hda_core btusb snd_rawmidi snd_hwdep mac80211 mc iwlwifi btrtl
eeepc_wmi asus_wmi btbcm snd_seq kvm_amd libarc4 ledtrig_audio
snd_seq_device btintel iwlmei sparse_keymap btmtk kvm snd_pcm
irqbypass platform_profile snd_timer xpad joydev cfg80211 rapl
hid_logitech_hidpp bluetooth ff_memless wmi_bmof video pcspkr snd
k10temp i2c_piix4
[  894.013086]  soundcore rfkill mei asus_ec_sensors acpi_cpufreq zram
amdgpu drm_ttm_helper ttm iommu_v2 crct10dif_pclmul ucsi_ccg gpu_sched
crc32_pclmul crc32c_intel typec_ucsi drm_buddy typec
drm_display_helper ghash_clmulni_intel igb ccp cec nvme sp5100_tco
nvme_core dca wmi ip6_tables ip_tables fuse
[  894.013322] Unloaded tainted modules: amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1
[  894.013455]  pcc_cpufreq():1 pcc_cpufreq():1 fjes():1
pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1
[  894.013690] CPU: 14 PID: 205 Comm: kworker/14:1 Tainted: GW
   L---  ---
5.20.0-0.rc0.20220812git7ebfc85e2cd7.11.fc38.x86_64 #1
[  894.013725] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[  894.013756] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched]
[  894.013779] RIP: 0010:refcount_warn_saturate+0xba/0x110
[  894.013796] Code: 01 01 e8 79 4a 6f 00 0f 0b e9 42 47 a5 00 80 3d
de 7e be 01 00 75 85 48 c7 c7 f8 98 8e 9c c6 05 ce 7e be 01 01 e8 56
4a 6f 00 <0f> 0b e9 1f 47 a5 00 80 3d b9 7e be 01 00 0f 85 5e ff ff ff
48 c7
[  894.013842] RSP: 0018:b48681153e60 EFLAGS: 00010286
[  894.013858] RAX: 0026 RBX: 9bad16f1f028 RCX: 
[  894.013878] RDX: 0001 RSI: 9c8d06dc RDI: 
[  894.013897] RBP: 9bba663f5600 R08:  R09: b48681153d10
[  894.013916] R10: 0003 R11: 9bbaae2fffe8 R12: 9bba663fc800
[  894.013934] R13: 9bab93fcab40 R14: 9bba663fc805 R15: 9bad16f1f030
[  894.013954] FS:  () GS:9bba6620()
knlGS:
[  894.013975] CS:  0010 DS:  ES:  CR0: 80050033
[  894.013991] CR2: 1aa46b2ec008 CR3: 000101516000 CR4: 00350ee0
[  894.014011] Call Trace:
[  894.014022]  
[  894.014030]  process_one_work+0x2a0/0x600
[  894.014051]  worker_thread+0x4f/0x3a0
[  894.014065]  ? process_one_work+0x600/0x600
[  894.014079]  kthread+0xf5/0x120
[  894.014092]  ? kthread_complete_and_exit+0x20/0x20
[  894.014109]  ret_from_fork+0x22/0x30
[  894.014129]  
[  894.014137] irq event stamp: 5802
[  894.014148] hardirqs last  enabled at (5801): []
_raw_spin_unlock_irq+0x24/0x50
[  894.014178] hardirqs last disabled at (5802): []
_

[BUG][5.20] refcount_t: underflow; use-after-free

2022-08-14 Thread Mikhail Gavrilov

Hi folks.
I joined testing of 5.20 today (7ebfc85e2cd7).
I encountered frequent GPU freezes, after which a message appears
in the kernel logs:
[ 220.280990] ------------[ cut here ]------------
[ 220.281000] refcount_t: underflow; use-after-free.
[ 220.281019] WARNING: CPU: 1 PID: 3746 at lib/refcount.c:28
refcount_warn_saturate+0xba/0x110
[ 220.281029] Modules linked in: uinput rfcomm snd_seq_dummy
snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
qrtr bnep sunrpc snd_seq_midi snd_seq_midi_event vfat intel_rapl_msr
fat intel_rapl_common snd_hda_codec_realtek mt76x2u
snd_hda_codec_generic snd_hda_codec_hdmi mt76x2_common iwlmvm
mt76x02_usb edac_mce_amd mt76_usb snd_hda_intel snd_intel_dspcfg
mt76x02_lib snd_intel_sdw_acpi snd_usb_audio snd_hda_codec mt76
kvm_amd uvcvideo mac80211 snd_hda_core btusb eeepc_wmi snd_usbmidi_lib
videobuf2_vmalloc videobuf2_memops kvm btrtl snd_rawmidi asus_wmi
snd_hwdep videobuf2_v4l2 btbcm iwlwifi ledtrig_audio libarc4 btintel
snd_seq videobuf2_common sparse_keymap btmtk irqbypass videodev
snd_seq_device joydev xpad iwlmei platform_profile bluetooth
ff_memless snd_pcm mc rapl
[ 220.281185] video snd_timer cfg80211 wmi_bmof snd pcspkr soundcore
k10temp i2c_piix4 rfkill mei asus_ec_sensors acpi_cpufreq zram
hid_logitech_hidpp amdgpu igb dca drm_ttm_helper ttm crct10dif_pclmul
iommu_v2 crc32_pclmul gpu_sched crc32c_intel ucsi_ccg drm_buddy nvme
typec_ucsi ghash_clmulni_intel drm_display_helper ccp nvme_core typec
sp5100_tco cec wmi ip6_tables ip_tables fuse
[ 220.281258] Unloaded tainted modules: amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1
[ 220.281388] pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1
fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1
[ 220.281415] CPU: 1 PID: 3746 Comm: chrome:cs0 Tainted: G W L ---
--- 5.20.0-0.rc0.20220812git7ebfc85e2cd7.10.fc38.x86_64 #1
[ 220.281421] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[ 220.281426] RIP: 0010:refcount_warn_saturate+0xba/0x110
[ 220.281431] Code: 01 01 e8 79 4a 6f 00 0f 0b e9 42 47 a5 00 80 3d de
7e be 01 00 75 85 48 c7 c7 f8 98 8e 98 c6 05 ce 7e be 01 01 e8 56 4a
6f 00 <0f> 0b e9 1f 47 a5 00 80 3d b9 7e be 01 00 0f 85 5e ff ff ff 48
c7
[ 220.281437] RSP: 0018:b4b0d18d7a80 EFLAGS: 00010282
[ 220.281443] RAX: 0026 RBX: 0003 RCX: 
[ 220.281448] RDX: 0001 RSI: 988d06dc RDI: 
[ 220.281452] RBP:  R08:  R09: b4b0d18d7930
[ 220.281457] R10: 0003 R11: a0672e2fffe8 R12: a058ca360400
[ 220.281461] R13: a05846c50a18 R14: fe00 R15: 0003
[ 220.281465] FS: 7f82683e06c0() GS:a066e2e0()
knlGS:
[ 220.281470] CS: 0010 DS:  ES:  CR0: 80050033
[ 220.281475] CR2: 3590005cc000 CR3: 0001fca46000 CR4: 00350ee0
[ 220.281480] Call Trace:
[ 220.281485] 
[ 220.281490] amdgpu_cs_ioctl+0x4e2/0x2070 [amdgpu]
[ 220.281806] ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[ 220.282028] drm_ioctl_kernel+0xa4/0x150
[ 220.282043] drm_ioctl+0x21f/0x420
[ 220.282053] ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[ 220.282275] ? lock_release+0x14f/0x460
[ 220.282282] ? _raw_spin_unlock_irqrestore+0x30/0x60
[ 220.282290] ? _raw_spin_unlock_irqrestore+0x30/0x60
[ 220.282297] ? lockdep_hardirqs_on+0x7d/0x100
[ 220.282305] ? _raw_spin_unlock_irqrestore+0x40/0x60
[ 220.282317] amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]
[ 220.282534] __x64_sys_ioctl+0x90/0xd0
[ 220.282545] do_syscall_64+0x5b/0x80
[ 220.282551] ? futex_wake+0x6c/0x150
[ 220.282568] ? lock_is_held_type+0xe8/0x140
[ 220.282580] ? do_syscall_64+0x67/0x80
[ 220.282585] ? lockdep_hardirqs_on+0x7d/0x100
[ 220.282592] ? do_syscall_64+0x67/0x80
[ 220.282597] ? do_syscall_64+0x67/0x80
[ 220.282602] ?

Re: Command "clinfo" causes BUG: kernel NULL pointer dereference, address: 0000000000000008 on driver amdgpu

2022-07-19 Thread Mikhail Gavrilov

On Tue, Jul 19, 2022 at 4:26 PM Mikhail Gavrilov
 wrote:
> In the kernel log there is no error so it is most likely a user space issue,
> but I am not sure about it.

But I am confused by the message in the kernel log:
[ 1962.000909] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue
preemption time out
[ 1962.000912] amdgpu: Failed to evict process queues
[ 1962.000918] amdgpu: Failed to quiesce KFD
[ 1966.010395] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue
preemption time out
[ 1966.010406] amdgpu: Resetting wave fronts (cpsch) on dev b40e7982


-- 
Best Regards,
Mike Gavrilov.

Re: Command "clinfo" causes BUG: kernel NULL pointer dereference, address: 0000000000000008 on driver amdgpu

2022-07-19 Thread Mikhail Gavrilov

On Tue, Jul 19, 2022 at 1:40 PM Mike Lothian  wrote:
>
> I was told that this patch replaces the patch you mentioned
> https://patchwork.freedesktop.org/series/106078/ and it the one
> that'll hopefully land in Linus's tree
>

Great, I confirm that both patches solve the issue.
As I understand it, the second patch [1] is the more correct one and it
should be merged into 5.19 soon, right?

And since we are talking about clinfo, I have a question.
Has anyone encountered the problem that, on configurations with two
GPUs, clinfo hangs in a loop and fully occupies one processor
core? In my case, one GPU is in the RENOIR processor, and the other is
a discrete AMD Radeon 6800M. The BIOS has no option to turn
off the integrated GPU in the processor, so there is no way to check
this configuration with each GPU separately. There is no error in the
kernel log, so it is most likely a user space issue, but I am not
sure about it.

clinfo backtrace is here [2]

[1] https://patchwork.freedesktop.org/series/106078/
[2] https://pastebin.com/wv5iGibi

-- 
Best Regards,
Mike Gavrilov.

Re: [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.

2022-07-18 Thread Mikhail Gavrilov

On Wed, Jul 13, 2022 at 5:38 PM Mikhail Gavrilov
 wrote:
> # first bad commit: [9cbbd694a58bdf24def2462276514c90cab7cf80] Merge
> drm/drm-next into drm-misc-next
>

Don't know who to thank but the issue disappeared in 5.19 rc7.

-- 
Best Regards,
Mike Gavrilov.

Command "clinfo" causes BUG: kernel NULL pointer dereference, address: 0000000000000008 on driver amdgpu

2022-07-18 Thread Mikhail Gavrilov

Hi guys, I continued testing 5.19-rc7 and found a bug.
Command "clinfo" causes BUG: kernel NULL pointer dereference, address:
0000000000000008 on driver amdgpu.

Here is the trace:
[ 1320.203332] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 1320.203338] #PF: supervisor read access in kernel mode
[ 1320.203340] #PF: error_code(0x) - not-present page
[ 1320.203341] PGD 0 P4D 0
[ 1320.203344] Oops:  [#1] PREEMPT SMP NOPTI
[ 1320.203346] CPU: 5 PID: 1226 Comm: kworker/5:2 Tainted: G W L
 --- 5.19.0-0.rc7.53.fc37.x86_64+debug #1
[ 1320.203348] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[ 1320.203350] Workqueue: events delayed_fput
[ 1320.203354] RIP: 0010:dma_resv_add_fence+0x5a/0x2d0
[ 1320.203358] Code: 85 c0 0f 84 43 02 00 00 8d 50 01 09 c2 0f 88 47
02 00 00 8b 15 73 10 99 01 49 8d 45 70 48 89 44 24 10 85 d2 0f 85 05
02 00 00 <49> 8b 44 24 08 48 3d 80 93 53 97 0f 84 06 01 00 00 48 3d 20
93 53
[ 1320.203360] RSP: 0018:af4cc1adfc68 EFLAGS: 00010246
[ 1320.203362] RAX: 976660408208 RBX: 975f545f2000 RCX: 
[ 1320.203363] RDX:  RSI:  RDI: 976660408198
[ 1320.203364] RBP: 976806f6e800 R08:  R09: 
[ 1320.203366] R10:  R11: 0001 R12: 
[ 1320.203367] R13: 976660408198 R14: 975f545f2000 R15: 976660408198
[ 1320.203368] FS: () GS:976de120()
knlGS:
[ 1320.203370] CS: 0010 DS:  ES:  CR0: 80050033
[ 1320.203371] CR2: 0008 CR3: 0007fb31c000 CR4: 00350ee0
[ 1320.203372] Call Trace:
[ 1320.203374] 
[ 1320.203378] amdgpu_amdkfd_gpuvm_destroy_cb+0x5d/0x1e0 [amdgpu]
[ 1320.203516] amdgpu_vm_fini+0x2f/0x4e0 [amdgpu]
[ 1320.203625] ? mutex_destroy+0x21/0x50
[ 1320.203629] amdgpu_driver_postclose_kms+0x1da/0x2b0 [amdgpu]
[ 1320.203734] drm_file_free.part.0+0x20d/0x260
[ 1320.203738] drm_release+0x6a/0x120
[ 1320.203741] __fput+0xab/0x270
[ 1320.203743] delayed_fput+0x1f/0x30
[ 1320.203745] process_one_work+0x2a0/0x600
[ 1320.203749] worker_thread+0x4f/0x3a0
[ 1320.203751] ? process_one_work+0x600/0x600
[ 1320.203753] kthread+0xf5/0x120
[ 1320.203755] ? kthread_complete_and_exit+0x20/0x20
[ 1320.203758] ret_from_fork+0x22/0x30
[ 1320.203764] 

Full kernel log is here:
https://pastebin.com/EeKh2LEr

And one hour later, after a lot of "BUG: workqueue lockup" messages, the
GPU hung completely.

I will be glad to test patches that fix this bug.

-- 
Best Regards,
Mike Gavrilov.

Re: [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.

2022-07-13 Thread Mikhail Gavrilov

On Sat, Jul 9, 2022 at 5:10 PM Mikhail Gavrilov
 wrote:

> Hi Christian,
> if you read my initial post, you will see that I tried to bisect the issue.
> But it is very problematic because at each step I see different symptoms.
> And if I mark the steps with different symptoms as skip, we end up with
> a lot of possible commits.
> Here is my bisect from the initial post: https://pastebin.com/AhLMNfyv

> [8.291298] [ cut here ]
> [8.291309] kernel BUG at mm/page_alloc.c:1329!
> [8.291324] invalid opcode:  [#1] PREEMPT SMP NOPTI
> [8.291328] CPU: 8 PID: 599 Comm: systemd-udevd Not tainted
> 5.18.0-rc2-003-790b45f1bc6736a8dd48ba5731b6871e0217311e+ #361
> [8.291333] Hardware name: System manufacturer System Product
> Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
> [8.291338] RIP: 0010:free_pcp_prepare+0x58d/0x5a0

There will be a 5.19 release soon. I haven't got a working kernel
newer than the fdaf9a5840ac commit on any machine (all machines have
AMD graphics).

Bisecting the kernel with the mutex issue treated as the "bad" state
and every other non-working state as "skip" did not lead to anything
useful.
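
Why neither marking strategy discussed here pins down the mutex regression can be shown with a toy model. This is a sketch under strong assumptions: a linear, made-up history of eight commits in which an unrelated oops appears before the mutex bug and hides it. Real git bisect walks a commit graph and handles skips differently in detail. The function names are hypothetical.

```python
def bisect_first_bad(verdicts):
    """Binary search for the first "bad" verdict.

    Assumes verdicts[0] is "good", verdicts[-1] is "bad", and that every
    commit after the first bad one is bad as well.
    """
    lo, hi = 0, len(verdicts) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if verdicts[mid] == "bad":
            hi = mid
        else:
            lo = mid
    return hi


def candidates_with_skip(verdicts):
    """Commits that could be the first bad one when some are untestable."""
    first_bad = verdicts.index("bad")
    last_good = max(i for i in range(first_bad) if verdicts[i] == "good")
    return list(range(last_good + 1, first_bad + 1))


# Made-up symptoms seen when booting each commit, oldest first.
symptoms = ["ok", "ok", "oops", "oops", "oops", "mutex", "mutex", "mutex"]

# Strategy 1: every failure counts as "bad".
any_failure = ["good" if s == "ok" else "bad" for s in symptoms]
first_any = bisect_first_bad(any_failure)

# Strategy 2: only the mutex issue is "bad", other failures are "skip".
mutex_only = [{"ok": "good", "mutex": "bad"}.get(s, "skip") for s in symptoms]
candidates = candidates_with_skip(mutex_only)

print(first_any, candidates)
```

In this model the first strategy converges on the commit that introduced the oops rather than the mutex bug, and the second can only narrow the answer to the whole skipped range plus the first commit that shows the mutex issue.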

Even if we treat as "bad" all commits on which the kernel does not
work, this does not lead to anything good either.
Here is that bisection:
$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# good: [fdaf9a5840acaab18694a19e0eb0aa51162d] Merge tag
'folio-5.19' of git://git.infradead.org/users/willy/pagecache
git bisect good fdaf9a5840acaab18694a19e0eb0aa51162d
# status: waiting for bad commit, 1 good commit known
# bad: [babf0bb978e3c9fce6c4eba6b744c8754fd43d8e] Merge tag
'xfs-5.19-for-linus' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
git bisect bad babf0bb978e3c9fce6c4eba6b744c8754fd43d8e

# 01 - good: [86c87bea6b42100c67418af690919c44de6ede6e] Merge tag
'devicetree-for-5.19' of
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
git bisect good 86c87bea6b42100c67418af690919c44de6ede6e

# 02 - observed initial problem with mutex
# bad: [43ab20c599f4dc4c3972a8386ef4ca3943b5f9cd] drm/i915/gt: Fix
build error without CONFIG_PM
git bisect bad 43ab20c599f4dc4c3972a8386ef4ca3943b5f9cd

# 03 - observed invalid opcode:  [#1] PREEMPT SMP NOPTI - RIP:
0010:free_pcp_prepare+0x58d/0x5a0
# bad: [790b45f1bc6736a8dd48ba5731b6871e0217311e] drm/i915/bios: Parse
the seamless DRRS min refresh rate
git bisect bad 790b45f1bc6736a8dd48ba5731b6871e0217311e

# 04 - observed invalid opcode:  [#1] PREEMPT SMP NOPTI - RIP:
0010:free_pcp_prepare+0x455/0x650
# bad: [c6ed9f66eb70aeaac9998bd3552ada740d90e20c]
drm/nouveau/gr/gf100-: change gf108_gr_fwif from global to static
git bisect bad c6ed9f66eb70aeaac9998bd3552ada740d90e20c

# 05 good: [3123109284176b1532874591f7c81f3837bbdc17] Linux 5.18-rc1
git bisect good 3123109284176b1532874591f7c81f3837bbdc17

# 06 good: [711c7adc4687250deb550ee8a6994203f817b2ca] drm: exynos:
dsi: Use drm panel_bridge API
git bisect good 711c7adc4687250deb550ee8a6994203f817b2ca

# 07 - observed invalid opcode:  [#1] PREEMPT SMP NOPTI - RIP:
0010:free_pcp_prepare+0x35e/0x410
# bad: [047a1b877ed48098bed71fcfb1d4891e1b54441d] dma-buf &
drm/amdgpu: remove dma_resv workaround
git bisect bad 047a1b877ed48098bed71fcfb1d4891e1b54441d

# 08 good: [644704740b8282c9ee9483a38666ee4a4561c37c] drm/amdgpu: use
dma_resv_for_each_fence for CS workaround v2
git bisect good 644704740b8282c9ee9483a38666ee4a4561c37c

# 09 - observed invalid opcode:  [#1] PREEMPT SMP NOPTI - RIP:
0010:free_pcp_prepare+0x35e/0x410
# bad: [61fe0ab26e36998cebec48805d6873e31f0d79d7] drm/gma500: fix a
missing break in psb_intel_crtc_mode_set
git bisect bad 61fe0ab26e36998cebec48805d6873e31f0d79d7

# 10 good: [1c3b2a27def609473ed13b1cd668cb10deab49b4] drm/nouveau/clk:
Fix an incorrect NULL check on list iterator
git bisect good 1c3b2a27def609473ed13b1cd668cb10deab49b4

# 11 - observed invalid opcode:  [#1] PREEMPT SMP NOPTI - RIP:
0010:free_pcp_prepare+0x35e/0x410
# bad: [aa46154355e1e81ef746470d2e88bdb283508bff] drm/ingenic: Add
ingenic_drm_bridge_atomic_enable and disable
git bisect bad aa46154355e1e81ef746470d2e88bdb283508bff

# 12 good: [71d637823cac7748079a912e0373476c7cf6f985] dma-buf: finally
make dma_resv_excl_fence private v2
git bisect good 71d637823cac7748079a912e0373476c7cf6f985

# 13 - observed invalid opcode:  [#1] PREEMPT SMP NOPTI - RIP:
0010:free_pcp_prepare+0x35e/0x410
# bad: [33f2069fb6a9c2d6509accc39521d3f4d6369576] drm/nouveau: support
more than one write fence in fenv50_wndw_prepare_fb
git bisect bad 33f2069fb6a9c2d6509accc39521d3f4d6369576

# 14 - observed invalid opcode:  [#1] PREEMPT SMP NOPTI - RIP:
0010:free_pcp_prepare+0x35e/0x410
# bad: [9cbbd694a58bdf24def2462276514c90cab7cf80] Merge drm/drm-next
into drm-misc-next
git bisect bad 9cbbd694a58bdf24def2462276514c90cab7cf80

# first bad commit: [9cbbd694a58bdf24def2462276514c90cab7cf80] Merge
d

Re: [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.

2022-07-09 Thread Mikhail Gavrilov

On Thu, Jul 7, 2022 at 2:50 PM Christian König
 wrote:
>
> Am 07.07.22 um 02:20 schrieb Mikhail Gavrilov:
> > On Tue, Jun 28, 2022 at 2:21 PM Mikhail Gavrilov
> >  wrote:
> > Christian can you look why
> > drm_aperture_remove_conflicting_pci_framebuffers cause this kernel bug
> > on my machine?
>
> That looks like a problem outside of the amdgpu driver.
>
> What happens is that during load amdgpu requests whatever driver
> (vesafb,vgafb or efifb) is currently handling the framebuffer to unload.
> This unload in turn now crashes for some reason.
>
> My best suggestion is to try to bisect this.

Hi Christian,
if you read my initial post, you will see that I tried to bisect the issue.
But it is very problematic because at each step I see different symptoms.
And if I mark the steps with different symptoms as skip, we end up with
a lot of possible commits.
Here is my bisect from the initial post: https://pastebin.com/AhLMNfyv

If you want me to finish the bisection successfully, please help me fix
this oops:
[8.291177] page:af2b6334 refcount:0 mapcount:0
mapping: index:0x0 pfn:0x102a000
[8.291202] head:af2b6334 order:0 compound_mapcount:-1226
compound_pincount:0
[8.291221] flags: 0x17c001(head|node=0|zone=2|lastcpupid=0x1f)
[8.291239] raw: 0017c001 fb35c0a80008 fb35c0a80008

[8.291257] raw:   

[8.291275] page dumped because: VM_BUG_ON_PAGE(compound &&
compound_order(page) != order)
[8.291298] ------------[ cut here ]------------
[8.291309] kernel BUG at mm/page_alloc.c:1329!
[8.291324] invalid opcode:  [#1] PREEMPT SMP NOPTI
[8.291328] CPU: 8 PID: 599 Comm: systemd-udevd Not tainted
5.18.0-rc2-003-790b45f1bc6736a8dd48ba5731b6871e0217311e+ #361
[8.291333] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[8.291338] RIP: 0010:free_pcp_prepare+0x58d/0x5a0
[8.291343] Code: c6 18 a2 85 a7 e8 d3 b7 fc ff 0f 0b 31 f6 48 89
df e8 97 cf 06 00 e9 29 ff ff ff 48 c7 c6 00 f1 85 a7 48 89 df e8 b3
b7 fc ff <0f> 0b 48 c7 c6 58 92 85 a7 e8 a5 b7 fc ff 0f 0b 0f 1f 00 0f
1f 44
[8.291351] RSP: 0018:b07c023ab9d8 EFLAGS: 00010296
[8.291354] RAX: 004e RBX: fb35c0a8 RCX: 
[8.291358] RDX: 0001 RSI: a789dbaf RDI: 
[8.291361] RBP: 0009 R08:  R09: b07c023ab7c0
[8.291365] R10: 0003 R11: 92ee2e2fffe8 R12: 
[8.291368] R13: 92ee2a55d180 R14: fe00 R15: fb35c0a8
[8.291371] FS:  7f80aa398680() GS:92edda20()
knlGS:
[8.291376] CS:  0010 DS:  ES:  CR0: 80050033
[8.291379] CR2: 7f80aa38e616 CR3: 00017d726000 CR4: 00350ee0
[8.291382] Call Trace:
[8.291384]  
[8.291386]  ? find_held_lock+0x32/0x80
[8.291391]  free_unref_page+0x25/0x2a0
[8.291395]  __vunmap+0x261/0x3d0
[8.291399]  drm_fbdev_cleanup+0x6b/0xc0
[8.291403]  drm_fbdev_fb_destroy+0x15/0x30
[8.291407]  unregister_framebuffer+0x2e/0x40
[8.291411]  drm_client_dev_unregister+0x6e/0xe0
[8.291416]  drm_dev_unregister+0x34/0x90
[8.291419]  drm_dev_unplug+0x24/0x40
[8.291422]  simpledrm_remove+0x11/0x20
[8.291426]  platform_remove+0x1f/0x40
[8.291429]  device_release_driver_internal+0x1b8/0x220
[8.291433]  bus_remove_device+0xef/0x160
[8.291437]  device_del+0x18c/0x3f0
[8.291440]  platform_device_del.part.0+0x13/0x70
[8.291444]  platform_device_unregister+0x1c/0x30
[8.291447]  drm_aperture_detach_drivers+0xa3/0xd0
[8.291452]  drm_aperture_remove_conflicting_pci_framebuffers+0x3f/0x70
[8.291457]  amdgpu_pci_probe+0x126/0x3c0 [amdgpu]
[8.291599]  local_pci_probe+0x41/0x80
[8.291604]  pci_device_probe+0xaa/0x200
[8.291607]  really_probe+0x1a0/0x370
[8.291611]  __driver_probe_device+0xfb/0x170
[8.291615]  driver_probe_device+0x1f/0x90
[8.291618]  __driver_attach+0xbe/0x1a0
[8.291622]  ? __device_attach_driver+0xe0/0xe0
[8.291625]  bus_for_each_dev+0x65/0x90
[8.291629]  bus_add_driver+0x150/0x1f0
[8.291632]  driver_register+0x89/0xd0
[8.291636]  ? 0xc067b000
[8.291641]  do_one_initcall+0x69/0x350
[8.291645]  ? do_init_module+0x22/0x260
[8.291650]  ? rcu_read_lock_sched_held+0x3b/0x70
[8.291654]  ? trace_kmalloc+0x3b/0x100
[8.291658]  ? kmem_cache_alloc_trace+0x1eb/0x3a0
[8.291662]  do_init_module+0x4a/0x260
[8.291666]  __do_sys_finit_module+0x93/0xf0
[8.291673]  do_syscall_64+0x3a/0x80
[8.291677]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[8.291681] RIP: 0033:0x7f80aaf4507d
[8.291685] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e
fa 48 89 f8 48 89 f7 48 89

Re: [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.

2022-07-06 Thread Mikhail Gavrilov

On Tue, Jun 28, 2022 at 2:21 PM Mikhail Gavrilov
 wrote:
>

Christian, can you look into why
drm_aperture_remove_conflicting_pci_framebuffers causes this kernel bug
on my machine?

[6.822385] amdgpu: Ignoring ACPI CRAT on non-APU system
[6.822462] amdgpu: Virtual CRAT table created for CPU
[6.822654] amdgpu: Topology: Add CPU node
[6.827643] Console: switching to colour dummy device 80x25
[6.845504] BUG: kernel NULL pointer dereference, address: 0038
[6.845509] #PF: supervisor read access in kernel mode
[6.845512] #PF: error_code(0x) - not-present page
[6.845515] PGD 0 P4D 0
[6.845518] Oops:  [#1] PREEMPT SMP NOPTI
[6.845522] CPU: 27 PID: 612 Comm: systemd-udevd Tainted: G
W  ---
5.19.0-0.rc5.20220705gitc1084b6c5620.40.fc37.x86_64 #1
[6.845528] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[6.845533] RIP: 0010:kernfs_find_and_get_ns+0x11/0x70
[6.845539] Code: 78 e8 c3 fa 31 00 48 85 c0 75 e1 eb 93 66 66 2e
0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 49 89 d5 41 54 49 89
f4 55 53 <48> 8b 47 38 48 89 fb 48 85 c0 48 0f 44 c7 48 8b a8 80 00 00
00 48
[6.845546] RSP: 0018:a98c022f3aa0 EFLAGS: 00010246
[6.845550] RAX:  RBX: af52c3c0 RCX: 9e150147b640
[6.845553] RDX:  RSI: af52c508 RDI: 
[6.845557] RBP:  R08:  R09: 249249d4
[6.845560] R10: 0001 R11:  R12: af52c508
[6.845563] R13:  R14: 9e157aa93900 R15: 
[6.845567] FS:  7fabaafbf680() GS:9e23e6a0()
knlGS:
[6.845571] CS:  0010 DS:  ES:  CR0: 80050033
[6.845574] CR2: 0038 CR3: 00017cb56000 CR4: 00350ee0
[6.845578] Call Trace:
[6.845579]  
[6.845582]  sysfs_unmerge_group+0x18/0x60
[6.845585]  dpm_sysfs_remove+0x20/0x60
[6.845590]  device_del+0xa4/0x3f0
[6.845594]  platform_device_del.part.0+0x13/0x70
[6.845599]  platform_device_unregister+0x1c/0x30
[6.845602]  sysfb_disable+0x2d/0x60
[6.845605]  remove_conflicting_framebuffers+0x1b/0xc0
[6.845610]  remove_conflicting_pci_framebuffers+0xce/0x120
[6.845614]  drm_aperture_remove_conflicting_pci_framebuffers+0x57/0x80
[6.845620]  amdgpu_pci_probe+0xcb/0x360 [amdgpu]
[6.845760]  local_pci_probe+0x41/0x80
[6.845764]  pci_device_probe+0xaa/0x210
[6.845768]  really_probe+0x1bf/0x390
[6.845771]  __driver_probe_device+0xfc/0x170
[6.845775]  driver_probe_device+0x1f/0x90
[6.845778]  __driver_attach+0xbf/0x1b0
[6.845782]  ? __device_attach_driver+0xe0/0xe0
[6.845785]  bus_for_each_dev+0x65/0x90
[6.845789]  bus_add_driver+0x15c/0x200
[6.845792]  driver_register+0x89/0xe0
[6.845796]  ? 0xc0c8d000
[6.845801]  do_one_initcall+0x69/0x350
[6.845806]  ? rcu_read_lock_sched_held+0x3c/0x70
[6.845810]  ? trace_kmalloc+0x3c/0x100
[6.845814]  ? kmem_cache_alloc_trace+0x1e8/0x350
[6.845818]  do_init_module+0x4a/0x200
[6.845822]  __do_sys_init_module+0x13a/0x190
[6.845827]  do_syscall_64+0x5b/0x80
[6.845832]  ? asm_exc_page_fault+0x27/0x30
[6.845835]  ? lockdep_hardirqs_on+0x7d/0x100
[6.845839]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[6.845842] RIP: 0033:0x7fababb7463e
[6.845845] Code: 48 8b 0d e5 57 0c 00 f7 d8 64 89 01 48 83 c8 ff
c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b2 57 0c 00 f7 d8 64 89
01 48
[6.845852] RSP: 002b:7ffc6a6c9658 EFLAGS: 0246 ORIG_RAX:
00af
[6.845857] RAX: ffda RBX: 5620deef53f0 RCX: 7fababb7463e
[6.845860] RDX: 5620deeb2df0 RSI: 010bfac6 RDI: 7faba943e010
[6.845864] RBP: 5620deeb2df0 R08: 5620deef4880 R09: 
[6.845867] R10: 0005 R11: 0246 R12: 0002
[6.845870] R13: 5620deeb5330 R14:  R15: 5620deef0410
[6.845875]  
[6.845877] Modules linked in: amdgpu(+) drm_ttm_helper ttm
iommu_v2 crct10dif_pclmul gpu_sched crc32_pclmul crc32c_intel
drm_buddy drm_display_helper ucsi_ccg nvme igb typec_ucsi
ghash_clmulni_intel ccp cec typec sp5100_tco nvme_core dca wmi
ip6_tables ip_tables ipmi_devintf ipmi_msghandler fuse
[6.845898] CR2: 0038
[6.845900] ---[ end trace  ]---


$ 
/usr/src/kernels/5.19.0-0.rc5.20220705gitc1084b6c5620.40.fc37.x86_64/scripts/faddr2line
/lib/debug/lib/modules/5.19.0-0.rc5.20220705gitc1084b6c5620.40.fc37.x86_64/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko.debug
amdgpu_pci_probe+0xcb
amdgpu_pci_probe+0xcb/0x360:
amdgpu_pci_probe at
/usr/src/debug/kernel-5.19-rc5-49-gc1084b6c5620/linux-5.19.0-0.rc5.20220705gitc1084b6c5620.40.fc37.x86_64/driv

[Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.

2022-06-28 Thread Mikhail Gavrilov

Hi guys.
Between commits fdaf9a5840ac and babf0bb978e3 the GPU stopped entering
graphics mode; instead I see a black screen with a constantly glowing
cursor. Demonstration: https://youtu.be/rGL4LsHMae4
In the kernel logs there are references to hung processes:
[  149.363465] rfkill: input handler disabled
[  249.072478] INFO: task (brt-dbus):1645 blocked for more than 122 seconds.
[  249.072515]   Tainted: GWL     ---
5.19.0-0.rc0.20220526gitbabf0bb978e3.4.fc37.x86_64 #1
[  249.072520] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  249.072524] task:(brt-dbus)  state:D stack:14384 pid: 1645
ppid: 1 flags:0x0002
[  249.072536] Call Trace:
[  249.072540]  
[  249.072551]  __schedule+0x492/0x1640
[  249.072560]  ? lock_is_held_type+0xe8/0x140
[  249.072569]  ? find_held_lock+0x32/0x80
[  249.072584]  schedule+0x4e/0xb0
[  249.072591]  schedule_preempt_disabled+0x14/0x20
[  249.072597]  __mutex_lock+0x423/0x890
[  249.072608]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.072818]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.073010]  amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.073207]  amdgpu_flush+0x25/0x40 [amdgpu]
[  249.074088]  filp_close+0x31/0x70
[  249.074097]  __close_range+0x130/0x320
[  249.074108]  __x64_sys_close_range+0x13/0x20
[  249.074113]  do_syscall_64+0x5b/0x80
[  249.074120]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.074127]  ? do_syscall_64+0x67/0x80
[  249.074135]  ? do_syscall_64+0x67/0x80
[  249.074140]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.074147]  ? do_syscall_64+0x67/0x80
[  249.074154]  ? lock_is_held_type+0xe8/0x140
[  249.074164]  ? asm_exc_page_fault+0x27/0x30
[  249.074171]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.074178]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  249.074184] RIP: 0033:0x7fd71f54f97b
[  249.074208] RSP: 002b:7fffc8e752a8 EFLAGS: 0246 ORIG_RAX:
01b4
[  249.074215] RAX: ffda RBX: 7fffc8e752b0 RCX: 7fd71f54f97b
[  249.074220] RDX:  RSI:  RDI: 0027
[  249.074224] RBP: 7fffc8e75330 R08:  R09: 7fffc8e75380
[  249.074228] R10: 7fffc8e751f0 R11: 0246 R12: 0002
[  249.074232] R13: 7fffc8e75340 R14:  R15: 0002
[  249.074252]  
[  249.074261] INFO: task (ostnamed):1718 blocked for more than 122 seconds.
[  249.074266]   Tainted: GWL     ---
5.19.0-0.rc0.20220526gitbabf0bb978e3.4.fc37.x86_64 #1
[  249.074285] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  249.074289] task:(ostnamed)  state:D stack:14552 pid: 1718
ppid: 1 flags:0x0006
[  249.074299] Call Trace:
[  249.074302]  
[  249.074310]  __schedule+0x492/0x1640
[  249.074316]  ? lock_is_held_type+0xe8/0x140
[  249.074324]  ? find_held_lock+0x32/0x80
[  249.074339]  schedule+0x4e/0xb0
[  249.074346]  schedule_preempt_disabled+0x14/0x20
[  249.074352]  __mutex_lock+0x423/0x890
[  249.074361]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.074564]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.074754]  amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.074950]  amdgpu_flush+0x25/0x40 [amdgpu]
[  249.075133]  filp_close+0x31/0x70
[  249.075140]  __close_range+0x130/0x320
[  249.075150]  __x64_sys_close_range+0x13/0x20
[  249.075154]  do_syscall_64+0x5b/0x80
[  249.075164]  ? lock_is_held_type+0xe8/0x140
[  249.075175]  ? do_syscall_64+0x67/0x80
[  249.075180]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.075187]  ? do_syscall_64+0x67/0x80
[  249.075194]  ? lock_is_held_type+0xe8/0x140
[  249.075204]  ? asm_exc_page_fault+0x27/0x30
[  249.075210]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.075217]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  249.075222] RIP: 0033:0x7fd71f54f97b
[  249.075231] RSP: 002b:7fffc8e752a8 EFLAGS: 0246 ORIG_RAX:
01b4
[  249.075237] RAX: ffda RBX: 7fffc8e752b0 RCX: 7fd71f54f97b
[  249.075241] RDX:  RSI: 00b9 RDI: 0027
[  249.075245] RBP: 7fffc8e75330 R08:  R09: 7fffc8e75380
[  249.075249] R10: 7fffc8e751f0 R11: 0246 R12: 0004
[  249.075253] R13: 7fffc8e75340 R14:  R15: 0003
[  249.075289]  
[  249.075294] INFO: task (pcscd):1749 blocked for more than 122 seconds.
[  249.075298]   Tainted: GWL     ---
5.19.0-0.rc0.20220526gitbabf0bb978e3.4.fc37.x86_64 #1
[  249.075302] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  249.075306] task:(pcscd) state:D stack:14256 pid: 1749
ppid: 1 flags:0x0002
[  249.075314] Call Trace:
[  249.075318]  
[  249.075325]  __schedule+0x492/0x1640
[  249.075331]  ? lock_is_held_type+0xe8/0x140
[  249.075339]  ? find_held_lock+0x32/0x80
[  249.075353]  schedule+0x4e/0xb0
[  249.075360]
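
When triaging a report like the one above, it can help to list which tasks the hung-task detector flagged. A minimal Python sketch, assuming log lines in the form quoted above ("INFO: task NAME:PID blocked for more than N seconds."); the function name and the regular expression are illustrative assumptions, not part of any kernel tool:

```python
import re

# Pattern modelled on the hung-task lines quoted in this report; this is an
# assumption about the format, not an official specification.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return (task name, pid, seconds) for each hung-task message in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = HUNG_TASK_RE.search(line)
        if match:
            hits.append((match["name"], int(match["pid"]), int(match["secs"])))
    return hits


# Sample lines copied from the log excerpt above.
sample = """\
[  249.072478] INFO: task (brt-dbus):1645 blocked for more than 122 seconds.
[  249.074261] INFO: task (ostnamed):1718 blocked for more than 122 seconds.
[  249.075294] INFO: task (pcscd):1749 blocked for more than 122 seconds.
"""

for name, pid, secs in find_hung_tasks(sample):
    print(f"{name} (pid {pid}) blocked for more than {secs}s")
```

Feeding it the full `dmesg` output instead of the sample gives a quick list of every process stuck on the same mutex.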

Re: Screen corruption using radeon kernel driver

2022-05-16 Thread Mikhail Krylov

On Mon, Apr 25, 2022 at 01:22:04PM -0400, Alex Deucher wrote:
> + dri-devel
> 
> On Mon, Apr 25, 2022 at 3:33 AM Krylov Michael  wrote:
> >
> > Hello!
> >
> > After updating my Linux kernel from version 4.19 (Debian 10 version) to
> > 5.10 (packaged with Debian 11), I've noticed that the image
> > displayed on my older computer, 32-bit Pentium 4 using ATI Radeon X1950
> > AGP video card is severely corrupted in the graphical (Xorg and Wayland)
> > mode: all kinds of black and white stripes across the screen, some
> > letters missing, etc.
> >
> > I've checked several options (Xorg drivers, Wayland instead of
> > Xorg, radeon.agpmode=-1 in kernel command line and so on), but the
> > problem persisted. I've managed to find that the problem was in the
> > kernel, as everything worked well with 4.19 kernel with everything
> > else being from Debian 11.
> >
> > I have managed to find the culprit of that corruption, that is the
> > commit 33b3ad3788aba846fc8b9a065fe2685a0b64f713 on the linux kernel.
> > Reverting this commit and building the kernel with that commit reverted
> > fixes the problem. Disabling HIMEM also gets rid of that problem. But it
> > also leaves the system with less than 1G of RAM, which is, of course,
> > undesirable.
> >
> > Apparently this problem is somewhat known, as I can tell after googling
> > for the commit id, see this link for example:
> > https://lkml.org/lkml/2020/1/9/518
> >
> > Mageia distro, for example, reverted this commit in the kernel they are
> > building:
> >
> > http://sophie.zarb.org/distrib/Mageia/7/i586/by-pkgid/b9193a4f85192bc57f4d770fb9bb399c/files/32
> >
> > I've reported this bug to the Debian bugtracker and checked a recent version
> > of the kernel (5.17); the bug still persists. Here's a link to the Debian
> > bug page:
> >
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=993670
> >
> > I'm not sure if reverting this commit is the correct way to go, so if
> > you need to check any changes/patches that I could apply and test on
> > the real hardware, I'll be glad to do that (but please keep in mind
> > that testing could take some time, I don't have access to this computer
> > 24/7, but I'll do my best to respond ASAP).
> 
> I would be happy to revert that commit.  I attempted to revert it a
> year or so ago, but Christoph didn't want to.  He was going to look
> further into it.  I was not able to repro the issue.  It seemed to be
> related to highmem support.  You might try disabling that.  Here is
> the previous thread for reference:
> https://lists.freedesktop.org/archives/amd-gfx/2020-September/053922.html
> 
> Alex

Yeah, I tried to disable HIMEM, and that indeed fixes the problem, but
it leaves me with less than 1G of available memory which is undesirable.



Re: [Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some ga

2022-05-11 Thread Mikhail Gavrilov

On Fri, Apr 15, 2022 at 1:04 PM Christian König
 wrote:
>
> No, I just couldn't find time during all that bug fixing :)
>
> Sorry for the delay, going to take a look after the Easter holiday here.
>
> Christian.

This message is just for the record. The issue was fixed between
b253435746d9a4a and 5.18-rc4.

-- 
Best Regards,
Mike Gavrilov.

Re: [Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some ga

2022-04-14 Thread Mikhail Gavrilov

On Sat, Apr 9, 2022 at 7:27 PM Christian König
 wrote:
>
> That's unfortunately not the end of the story.
>
> This is fixing your problem, but reintroducing the original problem that
> we call the syncobj with a lock held which can crash badly as well.
>
> Going to take a closer look on Monday. I hope you can test a few more
> patches to help narrow down what's actually going wrong here.
>
> Thanks,
> Christian.
>

Hi Christian.
I'm sorry to trouble you.
Have you forgotten about this issue?

-- 
Best Regards,
Mike Gavrilov.

Re: [Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some ga

2022-04-08 Thread Mikhail Gavrilov

On Fri, 8 Apr 2022 at 19:27, Christian König  wrote:
>
> Please test the attached patch, it just re-introduces the lock without
> doing much else.
>
> And does your branch contain the following patch:
>
> commit d18b8eadd83e3d8d63a45f9479478640dbcfca02
> Author: Christian König 
> Date:   Wed Feb 23 14:35:31 2022 +0100
>
>  drm/amdgpu: install ctx entities with cmpxchg
>
>  Since we removed the context lock we need to make sure that not two
>  threads are trying to install an entity at the same time.
>
>  Signed-off-by: Christian König 
>  Fixes: 461fa7b0ac565e ("drm/amdgpu: remove ctx->lock")
>  Reviewed-by: Andrey Grodzovsky 
>  Signed-off-by: Alex Deucher 

All the listed games now work with the attached patch.
The flood of "WARNING: CPU: 31 PID: 51848 at
drivers/dma-buf/dma-fence-array.c:191
dma_fence_array_create+0x101/0x120" messages is also gone.

Thanks.

Tested-by: Mikhail Gavrilov 

-- 
Best Regards,
Mike Gavrilov.

Re: [Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some ga

2022-04-08 Thread Mikhail Gavrilov

On Fri, 8 Apr 2022 at 16:13, Christian König  wrote:

> I owe you a beer.
>
> I still don't know what happens here, but that makes at least a bit more
> sense than a patch which only changes comments :)
>
> Looks like we are missing something here. Can I send you a patch to try
> something later today?

Yes, please feel free to send me a patch for testing.

-- 
Best Regards,
Mike Gavrilov.

Re: [Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some ga

2022-04-08 Thread Mikhail Gavrilov

Hi Christian

> those are two independent and already known problems.
>
> The warning triggered from the sync_file is already fixed in
> drm-misc-next-fixes, but so far I couldn't figure out why the games
> suddenly don't work any more.

I thought these warnings were related to the listed games getting stuck.

> There is a bug report for that, but bisecting the changes didn't yield
> anything valuable so far.
>
> So if you can come up with something that would be rather valuable.

I found how to fix my build problems; they were all related to gcc 12.
I then ran git bisect again and found the commit that causes the games
"Forza Horizon 5", "Forza Horizon 4" and "Cyberpunk 2077" to get stuck.
It affects at least the Radeon 6900 XT, Radeon 6800M and Radeon VII.

$ git bisect log
git bisect start
# good: [ed4643521e6af8ab8ed1e467630a85884d2696cf] Merge tag
'arm-dt-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect good ed4643521e6af8ab8ed1e467630a85884d2696cf
# bad: [34af78c4e616c359ed428d79fe4758a35d2c5473] Merge tag
'iommu-updates-v5.18' of
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
git bisect bad 34af78c4e616c359ed428d79fe4758a35d2c5473
# good: [4a0cb83ba6e0cd73a50fa4f84736846bf0029f2b] netdevice: add
missing dm_private kdoc
git bisect good 4a0cb83ba6e0cd73a50fa4f84736846bf0029f2b
# skip: [2ab82efeeed885c0210a0029df93bb95a316e8c7] Merge tag
'drm-intel-gt-next-2022-03-03' of
git://anongit.freedesktop.org/drm/drm-intel into drm-next
git bisect skip 2ab82efeeed885c0210a0029df93bb95a316e8c7
# good: [00598b056aa6d46c7a6819efa850ec9d0d690d76] scsi: smartpqi:
Expose SAS address for SATA drives
git bisect good 00598b056aa6d46c7a6819efa850ec9d0d690d76
# good: [00598b056aa6d46c7a6819efa850ec9d0d690d76] scsi: smartpqi:
Expose SAS address for SATA drives
git bisect good 00598b056aa6d46c7a6819efa850ec9d0d690d76
# skip: [c674c5b9342e5cb0f3d9e9bcaf37dbe2087845e5] drm/i915/xehp: CCS
should use RCS setup functions
git bisect skip c674c5b9342e5cb0f3d9e9bcaf37dbe2087845e5
# good: [f0d4ce59f4d48622044933054a0e0cefa91ba15e] drm/i915: Disable
DRRS on IVB/HSW port != A
git bisect good f0d4ce59f4d48622044933054a0e0cefa91ba15e
# skip: [6de7e4f02640fba2ffa6ac04e2be13785d614175] Merge tag
'drm-msm-next-2022-03-01' of https://gitlab.freedesktop.org/drm/msm
into drm-next
git bisect skip 6de7e4f02640fba2ffa6ac04e2be13785d614175
# bad: [868f4357ed0d1e2f96bbd67d4ac862aa6335effe] drm/amd/display: Add
DMUB support for DCN316
git bisect bad 868f4357ed0d1e2f96bbd67d4ac862aa6335effe
# good: [39da460fd4c0f8e7290dcc9cbfc9375de9d0eeca] drm/amd/display:
Fix DP LT sequence on EQ fail
git bisect good 39da460fd4c0f8e7290dcc9cbfc9375de9d0eeca
# good: [3f268ef06f8cf3c481dbd5843d564f5170c6df54] drm/ttm: add back a
reference to the bdev to the res manager
git bisect good 3f268ef06f8cf3c481dbd5843d564f5170c6df54
# bad: [123db17ddff007080d464e785689fb14f94cbc7a] Merge tag
'amd-drm-next-5.18-2022-02-11-1' of
https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect bad 123db17ddff007080d464e785689fb14f94cbc7a
# bad: [24992ab0b8b0d2521caa9c3dcbed0e2a56cbe3d0] drm/amdkfd: Fix
prototype warning for get_process_num_bos
git bisect bad 24992ab0b8b0d2521caa9c3dcbed0e2a56cbe3d0
# good: [1cbbc8d4f788af4c260ef3cae05902ef7b191197] drm/radeon/uvd: Fix
forgotten unmap buffer objects
git bisect good 1cbbc8d4f788af4c260ef3cae05902ef7b191197
# good: [69f915cc97c4bb82b34105a47abf613f7c87215d] drm/amdgpu: loose
check for umc poison mode
git bisect good 69f915cc97c4bb82b34105a47abf613f7c87215d
# good: [8bbd4d83a68beaf54ae01b2e2aa2024ff1dfc0ba] drm/amdgpu: Reset
OOB table error count info
git bisect good 8bbd4d83a68beaf54ae01b2e2aa2024ff1dfc0ba
# bad: [1915a433954262ac7466469d1a4684ac54218af4] drm/amdgpu: adjust
register address calculation
git bisect bad 1915a433954262ac7466469d1a4684ac54218af4
# bad: [461fa7b0ac565ef25c1da0ced31005dd437883a7] drm/amdgpu: remove ctx->lock
git bisect bad 461fa7b0ac565ef25c1da0ced31005dd437883a7
# first bad commit: [461fa7b0ac565ef25c1da0ced31005dd437883a7]
drm/amdgpu: remove ctx->lock
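
A long bisect log like this one is easier to review with a quick tally. A minimal Python sketch, assuming the log is the plain output of `git bisect log` as pasted above; the helper name `summarize_bisect_log` is hypothetical, and only the `git bisect good|bad|skip <hash>` command lines are read because the `#` comment lines are wrapped by the mail client:

```python
import re
from collections import Counter

# Matches the command lines of a `git bisect log`; the "# ..." comment lines
# (which may be wrapped by the mail client) are ignored on purpose.
BISECT_CMD_RE = re.compile(r"^git bisect (good|bad|skip) ([0-9a-f]{7,40})$")


def summarize_bisect_log(log_text):
    """Return (verdict counts, last commit marked bad) for a bisect log."""
    counts = Counter()
    last_bad = None
    for line in log_text.splitlines():
        match = BISECT_CMD_RE.match(line.strip())
        if not match:
            continue
        verdict, commit = match.groups()
        counts[verdict] += 1
        if verdict == "bad":
            last_bad = commit
    return counts, last_bad


# A shortened sample built from lines of the log above.
sample = """\
git bisect start
# good: [ed4643521e6af8ab8ed1e467630a85884d2696cf] Merge tag
'arm-dt-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect good ed4643521e6af8ab8ed1e467630a85884d2696cf
git bisect bad 34af78c4e616c359ed428d79fe4758a35d2c5473
git bisect skip 2ab82efeeed885c0210a0029df93bb95a316e8c7
git bisect bad 461fa7b0ac565ef25c1da0ced31005dd437883a7
"""

counts, last_bad = summarize_bisect_log(sample)
print(dict(counts), last_bad)
```

In a bisect that ran to completion, the last commit marked bad is normally the one git reports as the first bad commit, so `last_bad` is a cheap cross-check against the final verdict.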

461fa7b0ac565ef25c1da0ced31005dd437883a7 is the first bad commit
commit 461fa7b0ac565ef25c1da0ced31005dd437883a7
Author: Ken Xue 
Date:   Fri Feb 11 16:18:46 2022 -0500

drm/amdgpu: remove ctx->lock

KMD reports a warning on holding a lock from drm_syncobj_find_fence,
when running amdgpu_test case “syncobj timeline test”.

ctx->lock was designed to prevent concurrent "amdgpu_ctx_wait_prev_fence"
calls and avoid dead reservation lock from GPU reset. since no reservation
lock is held in latest GPU reset any more, ctx->lock can be simply removed
and concurrent "amdgpu_ctx_wait_prev_fence" call also can be prevented by
PD root bo reservation lock.

call stacks:
=
//hold lock
amdgpu_cs_ioctl->amdgpu_cs_parser_init->mutex_lock(>ctx->lock);
…
//report warning
amdgpu_cs_dependencies->amdgpu_cs_process_syncobj_timeline_in_dep \

[Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some games

2022-04-03 Thread Mikhail Gavrilov

Hi,
Between commits ed4643521e6a and 34af78c4e616 something was broken.
I noticed that the kernel log is flooded with the warning message
"WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191
dma_fence_array_create+0x101/0x120" when some games are running:
"Resident Evil Village", "Marvel's Avengers", "The Dark Pictures
Anthology: House of Ashes".

[16999.958726] [ cut here ]
[16999.958731] WARNING: CPU: 31 PID: 51848 at
drivers/dma-buf/dma-fence-array.c:191
dma_fence_array_create+0x101/0x120
[16999.958738] Modules linked in: xone_gip_chatpad(OE)
xone_gip_gamepad(OE) xone_gip_common(OE) ff_memless tls uinput rfcomm
snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_midi_event nft_objref
nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet
nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4
nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack
nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep
sunrpc binfmt_misc iwlmvm vfat intel_rapl_msr fat intel_rapl_common
snd_hda_codec_realtek mac80211 snd_hda_codec_generic ledtrig_audio
snd_hda_codec_hdmi libarc4 snd_hda_intel edac_mce_amd snd_intel_dspcfg
snd_usb_audio snd_intel_sdw_acpi btusb kvm_amd snd_hda_codec btrtl
btbcm iwlwifi btintel snd_hda_core snd_usbmidi_lib uvcvideo snd_hwdep
kvm iwlmei snd_rawmidi videobuf2_vmalloc xone_dongle(OE)
videobuf2_memops xone_gip_bus(OE) snd_seq btmtk videobuf2_v4l2
videobuf2_common snd_seq_device irqbypass bluetooth cfg80211 snd_pcm
rapl videodev
[16999.958799]  eeepc_wmi asus_wmi snd_timer sparse_keymap
platform_profile ecdh_generic video wmi_bmof pcspkr snd k10temp
i2c_piix4 joydev mc soundcore rfkill mei acpi_cpufreq zram
hid_logitech_hidpp hid_logitech_dj amdgpu drm_ttm_helper ttm
crct10dif_pclmul ccp crc32_pclmul ucsi_ccg iommu_v2 crc32c_intel
typec_ucsi gpu_sched ghash_clmulni_intel sp5100_tco drm_dp_helper
typec igb nvme nvme_core dca wmi scsi_dh_rdac scsi_dh_emc scsi_dh_alua
ip6_tables ip_tables dm_multipath ipmi_devintf ipmi_msghandler fuse
[16999.958862] CPU: 31 PID: 51848 Comm: GWT.exe Tainted: GB   W
OEL   - ---
5.18.0-0.rc0.20220401gite8b767f5e04097a.15.fc37.x86_64 #1
[16999.958865] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4204 02/24/2022
[16999.958867] RIP: 0010:dma_fence_array_create+0x101/0x120
[16999.958871] Code: 45 85 e4 75 10 eb 2a 48 81 fa c0 aa 52 ab 74 1a
83 e8 01 72 1c 48 63 d0 48 8b 54 d5 00 48 8b 52 08 48 81 fa 60 aa 52
ab 75 dd <0f> 0b 83 e8 01 73 e4 48 83 c4 08 48 89 d8 5b 5d 41 5c 41 5d
41 5e
[16999.958874] RSP: 0018:b03c071f7e08 EFLAGS: 00010246
[16999.958877] RAX: 0001 RBX: 98fdb03c6d00 RCX: 00510e99
[16999.958879] RDX: ab52aac0 RSI: 98fdb03c6d10 RDI: 98fdb03c6d00
[16999.958880] RBP: 98fa31c59e40 R08: 0001 R09: 
[16999.958882] R10:  R11:  R12: 0002
[16999.958883] R13:  R14: 98fdb03c6d40 R15: 0001
[16999.958885] FS:  4789f640() GS:9907ea60()
knlGS:29b7
[16999.958887] CS:  0010 DS:  ES:  CR0: 80050033
[16999.95] CR2: 7ff41eee8000 CR3: 2856a000 CR4: 00350ee0
[16999.958890] Call Trace:
[16999.958893]  
[16999.958897]  sync_file_ioctl+0x83d/0x9f0
[16999.958904]  __x64_sys_ioctl+0x8d/0xc0
[16999.958908]  do_syscall_64+0x3a/0x80
[16999.958913]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[16999.958917] RIP: 0033:0x7ff5e850b29f
[16999.958941] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24
10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00
00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28
00 00
[16999.958943] RSP: 002b:4789d540 EFLAGS: 0246 ORIG_RAX:
0010
[16999.958946] RAX: ffda RBX: 7ff5d5637040 RCX: 7ff5e850b29f
[16999.958948] RDX: 4789d740 RSI: c0303e03 RDI: 0260
[16999.958949] RBP: 0260 R08: 0001 R09: 
[16999.958951] R10:  R11: 0246 R12: 4789d740
[16999.958953] R13:  R14: c0303e03 R15: 
[16999.958958]  
[16999.958959] irq event stamp: 0
[16999.958961] hardirqs last  enabled at (0): [<>] 0x0
[16999.958964] hardirqs last disabled at (0): []
copy_process+0x9f1/0x1e20
[16999.958968] softirqs last  enabled at (0): []
copy_process+0x9f1/0x1e20
[16999.958971] softirqs last disabled at (0): [<>] 0x0
[16999.958974] ---[ end trace  ]---


The games "Forza Horizon 5", "Forza Horizon 4", "Cyberpunk 2077" and
"Ghostwire: Tokyo" stopped working. When these games crashed I again
saw the same warning message as above [2]. The only difference is in the
thread name and addresses.

[  643.442353] [ cut here ]
[  643.442358] WARNING: CPU: 24 PID: 7824 at

Re: [BUG] VAAPI encoder cause kernel panic if encoded video in 4K

2021-09-15 Thread Mikhail Gavrilov

On Wed, 15 Sept 2021 at 14:55, Christian König  wrote:
>
> Yes, absolutely. You should see GPU resets and recovery in the system log 
> after that.

Unfortunately, no DE survives a GPU reset. All applications
terminate abnormally, so in practice this is equivalent to a reboot
(and a denial of service). :(

-- 
Best Regards,
Mike Gavrilov.

Re: [BUG] VAAPI encoder cause kernel panic if encoded video in 4K

2021-09-14 Thread Mikhail Gavrilov

On Wed, 14 Apr 2021 at 11:48, Christian König <
ckoenig.leichtzumer...@gmail.com> wrote:

>
> That is expected behavior, the application is just buggy and causing a
> page fault on the GPU.
>
> The kernel should just not crash with a backtrace.
>
> Regards,
> Christian.
>

If the GPU then hangs with the message "[drm:amdgpu_dm_atomic_commit_tail
[amdgpu]] *ERROR* Waiting for fences timed out!", is that also expected
behavior?
Kernel log: https://pastebin.com/WkhATKXX

-- 
Best Regards,
Mike Gavrilov.

Re: [BUG] VAAPI encoder cause kernel panic if encoded video in 4K

2021-04-21 Thread Mikhail Gavrilov

On Wed, 21 Apr 2021 at 11:42, Christian König  wrote:
> I can try, but I'm not sure if we even have the full page fault handling
> for Navi in 5.12.
>

That would be great. For me the patch works as expected, and I
haven't seen the panic "kernel BUG at
drivers/dma-buf/dma-resv.c:287!" for several days now.
Anyway, I will be waiting for any news.

-- 
Best Regards,
Mike Gavrilov.
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [BUG] VAAPI encoder cause kernel panic if encoded video in 4K

2021-04-20 Thread Mikhail Gavrilov

On Wed, 14 Apr 2021 at 11:48, Christian König
 wrote:
>
> >> commit f63da9ae7584280582cbc834b20cc18bfb203b14
> >> Author: Philip Yang 
> >> Date:   Thu Apr 1 00:22:23 2021 -0400
> >>
> >>   drm/amdgpu: reserve fence slot to update page table
> >>
>
> That is expected behavior, the application is just buggy and causing a
> page fault on the GPU.
>
> The kernel should just not crash with a backtrace.
>

Any chance of this commit being backported to 5.12?
I plan to submit a bug report to the OBS devs and don't want my system to
hang again and again while I test their patches.

-- 
Best Regards,
Mike Gavrilov.

Re: [BUG] VAAPI encoder cause kernel panic if encoded video in 4K

2021-04-14 Thread Mikhail Gavrilov

On Wed, 14 Apr 2021 at 03:22, Leo Liu  wrote:
>
> This is decode command line, are you seeing issue with encode or
> decode?

I meant that the kernel panic described above happens only when
OBS records or streams video with the VAAPI encoder.
Grabbing and encoding video with ffmpeg (see the command example) is
free from this issue, but the resulting video does not play
properly. I believe that this is not a bug in ffmpeg itself,
because with the CPU encoder (libx264) the resulting video plays
properly.

> you also said `ffmpeg -f x11grab -framerate 60 -video_size
> 3840x2160 -i :0.0 -vf 'format=nv12,hwupload' -vaapi_device
> /dev/dri/renderD128 -vcodec h264_vaapi output3.mp4` doesn't cause such
> issue, right?

This command does not cause the described kernel panic, but the resulting
video plays at what looks like 0.01 FPS.

>
> Yes.
>

I filed a bug report about the VAAPI encoder in ffmpeg here:
https://gitlab.freedesktop.org/drm/amd/-/issues/1570

We can continue there.

-- 
Best Regards,
Mike Gavrilov.

Re: [BUG] VAAPI encoder cause kernel panic if encoded video in 4K

2021-04-13 Thread Mikhail Gavrilov

On Tue, 13 Apr 2021 at 04:55, Leo Liu  wrote:
>
> >It is curious why ffmpeg does not cause such issues.
> >For example, this command does not cause a kernel panic:
> >$ ffmpeg -f x11grab -framerate 60 -video_size 3840x2160 -i :0.0 -vf
> >'format=nv12,hwupload' -vaapi_device /dev/dri/renderD128 -vcodec
> >h264_vaapi output3.mp4
>
> What command are you using to see the issue or how can the issue be 
> reproduced?
$ mpv output4.mp4

And of course I know how it should work, because when I encode video
with the CPU encoder (libx264) everything is fine.
$ ffmpeg -f x11grab -framerate 60 -video_size 3840x2160 -i :0.0
-vcodec libx264 output3.mp4

> Please file a freedesktop gitlab issue, so we can keep track of it.
Here? https://gitlab.freedesktop.org/drm/amd/-/issues

Also, I found that other users face the same problem.
https://bbs.archlinux.org/viewtopic.php?id=261965

-- 
Best Regards,
Mike Gavrilov.

Re: [BUG] VAAPI encoder cause kernel panic if encoded video in 4K

2021-04-13 Thread Mikhail Gavrilov

On Tue, 13 Apr 2021 at 12:29, Christian König  wrote:
>
> Hi Mikhail,
>
> the crash is a known issue and should be fixed by:
>
> commit f63da9ae7584280582cbc834b20cc18bfb203b14
> Author: Philip Yang 
> Date:   Thu Apr 1 00:22:23 2021 -0400
>
>  drm/amdgpu: reserve fence slot to update page table
>

Unfortunately, this commit does not fix the initial problem.
1. The resulting video is jerky if it is grabbed and encoded with ffmpeg
(h264_vaapi codec).
2. OBS still crashes if I try to record or stream video.
3. The kernel log still shows the message "amdgpu: [mmhub] page
fault (src_id:0 ring:0 vmid:4 pasid:32770, for process obs" if I try
to record or stream video with OBS.

-- 
Best Regards,
Mike Gavrilov.

[BUG] VAAPI encoder cause kernel panic if encoded video in 4K

2021-04-12 Thread Mikhail Gavrilov

Video demonstration: https://youtu.be/3nkvUeB0GSw

Here is what the kernel traces look like.

1.
[ 7315.156460] amdgpu :0b:00.0: amdgpu: [mmhub] page fault
(src_id:0 ring:0 vmid:6 pasid:32779, for process obs pid 23963 thread
obs:cs0 pid 23977)
[ 7315.156490] amdgpu :0b:00.0: amdgpu:   in page starting at
address 0x80011fdf5000 from client 18
[ 7315.156495] amdgpu :0b:00.0: amdgpu:
MMVM_L2_PROTECTION_FAULT_STATUS:0x00641A51
[ 7315.156500] amdgpu :0b:00.0: amdgpu: Faulty UTCL2 client ID: VCN1 (0xd)
[ 7315.156503] amdgpu :0b:00.0: amdgpu: MORE_FAULTS: 0x1
[ 7315.156505] amdgpu :0b:00.0: amdgpu: WALKER_ERROR: 0x0
[ 7315.156509] amdgpu :0b:00.0: amdgpu: PERMISSION_FAULTS: 0x5
[ 7315.156510] amdgpu :0b:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 7315.156513] amdgpu :0b:00.0: amdgpu: RW: 0x1
[ 7315.156545] amdgpu :0b:00.0: amdgpu: [mmhub] page fault
(src_id:0 ring:0 vmid:6 pasid:32779, for process obs pid 23963 thread
obs:cs0 pid 23977)
[ 7315.156549] amdgpu :0b:00.0: amdgpu:   in page starting at
address 0x80011fdf6000 from client 18
[ 7315.156551] amdgpu :0b:00.0: amdgpu:
MMVM_L2_PROTECTION_FAULT_STATUS:0x00641A51
[ 7315.156554] amdgpu :0b:00.0: amdgpu: Faulty UTCL2 client ID: VCN1 (0xd)
[ 7315.156556] amdgpu :0b:00.0: amdgpu: MORE_FAULTS: 0x1
[ 7315.156559] amdgpu :0b:00.0: amdgpu: WALKER_ERROR: 0x0
[ 7315.156561] amdgpu :0b:00.0: amdgpu: PERMISSION_FAULTS: 0x5
[ 7315.156564] amdgpu :0b:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 7315.156566] amdgpu :0b:00.0: amdgpu: RW: 0x1

This is a harmless panic, but nevertheless VAAPI does not work and the
application that tried to use the encoder crashed.
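
To see at a glance which process and VM context each fault belongs to, the "[mmhub] page fault" lines can be parsed mechanically. A minimal Python sketch, assuming the message layout shown in the excerpt above; the regular expression and the `parse_page_faults` helper are illustrative assumptions, and whitespace is collapsed first because the mail client wraps each message across several lines:

```python
import re

# Layout copied from the fault messages quoted above; an assumption, since the
# driver may print other variants of this line.
PAGE_FAULT_RE = re.compile(
    r"\[mmhub\] page fault \(src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) "
    r"vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+), for process (?P<process>\S+) "
    r"pid (?P<pid>\d+) thread (?P<thread>\S+) pid (?P<tid>\d+)\)"
)


def parse_page_faults(log_text):
    """Return one dict per mmhub page-fault message found in log_text."""
    # Undo mail-client line wrapping by collapsing all whitespace runs.
    flat = " ".join(log_text.split())
    faults = []
    for match in PAGE_FAULT_RE.finditer(flat):
        faults.append({
            "vmid": int(match["vmid"]),
            "pasid": int(match["pasid"]),
            "process": match["process"],
            "pid": int(match["pid"]),
            "thread": match["thread"],
        })
    return faults


# Two wrapped messages copied from the excerpt above.
sample = """\
[ 7315.156460] amdgpu :0b:00.0: amdgpu: [mmhub] page fault
(src_id:0 ring:0 vmid:6 pasid:32779, for process obs pid 23963 thread
obs:cs0 pid 23977)
[ 7315.156545] amdgpu :0b:00.0: amdgpu: [mmhub] page fault
(src_id:0 ring:0 vmid:6 pasid:32779, for process obs pid 23963 thread
obs:cs0 pid 23977)
"""

for fault in parse_page_faults(sample):
    print(fault)
```

Grouping the result by `process` makes it easy to confirm that every fault in the log comes from the same application.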

2.
If we try again and again to encode a 4K stream through VAAPI, we can
hit the following trace:
[12341.860944] [ cut here ]
[12341.860961] kernel BUG at drivers/dma-buf/dma-resv.c:287!
[12341.860968] invalid opcode:  [#1] SMP NOPTI
[12341.860972] CPU: 28 PID: 18261 Comm: kworker/28:0 Tainted: G
W- ---  5.12.0-0.rc5.180.fc35.x86_64+debug #1
[12341.860977] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 3402 01/13/2021
[12341.860981] Workqueue: events amdgpu_irq_handle_ih_soft [amdgpu]
[12341.861102] RIP: 0010:dma_resv_add_shared_fence+0x2ab/0x2c0
[12341.861108] Code: fd ff ff be 01 00 00 00 e8 e2 74 dc ff e9 ac fd
ff ff 48 83 c4 18 be 03 00 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f e9 c5
74 dc ff <0f> 0b 31 ed e9 73 fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00
90 0f
[12341.861112] RSP: 0018:b2f084c87bb0 EFLAGS: 00010246
[12341.861115] RAX: 0002 RBX: 9f9551184998 RCX: 
[12341.861119] RDX: 0002 RSI:  RDI: 9f9551184a50
[12341.861122] RBP: 0002 R08:  R09: 
[12341.861124] R10:  R11:  R12: 9f91b9a18140
[12341.861127] R13: 9f91c9020740 R14: 9f91c9020768 R15: 
[12341.861130] FS:  () GS:9f984a20()
knlGS:
[12341.861133] CS:  0010 DS:  ES:  CR0: 80050033
[12341.861136] CR2: 144e080d8000 CR3: 00010e98c000 CR4: 00350ee0
[12341.861139] Call Trace:
[12341.861143]  amdgpu_vm_sdma_commit+0x182/0x220 [amdgpu]
[12341.861251]  amdgpu_vm_bo_update_mapping.constprop.0+0x278/0x3c0 [amdgpu]
[12341.861356]  amdgpu_vm_handle_fault+0x145/0x290 [amdgpu]
[12341.861461]  gmc_v10_0_process_interrupt+0xb3/0x250 [amdgpu]
[12341.861571]  ? _raw_spin_unlock_irqrestore+0x37/0x40
[12341.861577]  ? lock_acquire+0x179/0x3a0
[12341.861583]  ? lock_acquire+0x179/0x3a0
[12341.861587]  ? amdgpu_irq_dispatch+0xc6/0x240 [amdgpu]
[12341.861692]  amdgpu_irq_dispatch+0xc6/0x240 [amdgpu]
[12341.861796]  amdgpu_ih_process+0x90/0x110 [amdgpu]
[12341.861900]  process_one_work+0x2b0/0x5e0
[12341.861906]  worker_thread+0x55/0x3c0
[12341.861910]  ? process_one_work+0x5e0/0x5e0
[12341.861915]  kthread+0x13a/0x150
[12341.861918]  ? __kthread_bind_mask+0x60/0x60
[12341.861922]  ret_from_fork+0x22/0x30
[12341.861928] Modules linked in: uinput snd_seq_dummy rfcomm
snd_hrtimer netconsole nft_objref nf_conntrack_netbios_ns
nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
nf_tables nfnetlink cmac bnep sunrpc vfat fat hid_logitech_hidpp
joydev hid_logitech_dj mt76x2u mt76x2_common mt76x02_usb mt76_usb
mt76x02_lib intel_rapl_msr intel_rapl_common mt76 iwlmvm mac80211
snd_hda_codec_realtek edac_mce_amd snd_hda_codec_generic ledtrig_audio
snd_hda_codec_hdmi btusb kvm_amd snd_hda_intel btrtl snd_intel_dspcfg
btbcm snd_intel_sdw_acpi snd_usb_audio uvcvideo btintel snd_hda_codec
videobuf2_vmalloc snd_usbmidi_lib videobuf2_memops iwlwifi kvm
bluetooth snd_rawmidi snd_hda_core snd_seq videobuf2_v4l2 snd_hwdep
videobuf2_common snd_seq_device eeepc_wmi snd_pcm videodev asus_wmi

Re: Unexpected multihop in swaput - likely driver bug.

2021-04-07 Thread Mikhail Gavrilov

On Wed, 7 Apr 2021 at 15:46, Christian König
 wrote:
>
> What hardware are you using

$ inxi -bM
System:Host: fedora Kernel: 5.12.0-0.rc6.184.fc35.x86_64+debug
x86_64 bits: 64 Desktop: GNOME 40.0
   Distro: Fedora release 35 (Rawhide)
Machine:   Type: Desktop Mobo: ASUSTeK model: ROG STRIX X570-I GAMING
v: Rev X.0x serial: 
   UEFI: American Megatrends v: 3603 date: 03/20/2021
Battery:   ID-1: hidpp_battery_0 charge: N/A condition: N/A
CPU:   Info: 16-Core (2-Die) AMD Ryzen 9 3950X [MT MCP MCM] speed:
2365 MHz min/max: 2200/3500 MHz
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Navi 21 [Radeon
RX 6800/6800 XT / 6900 XT] driver: amdgpu v: kernel
   Device-2: AVerMedia Live Streamer CAM 513 type: USB driver:
hid-generic,usbhid,uvcvideo
   Device-3: AVerMedia Live Gamer Ultra-Video type: USB
driver: hid-generic,snd-usb-audio,usbhid,uvcvideo
   Display: wayland server: X.Org 1.21.1 driver: loaded:
amdgpu,ati unloaded: fbdev,modesetting,radeon,vesa
   resolution: 3840x2160~60Hz
   OpenGL: renderer: AMD SIENNA_CICHLID (DRM 3.40.0
5.12.0-0.rc6.184.fc35.x86_64+debug LLVM 12.0.0)
   v: 4.6 Mesa 21.1.0-devel
Network:   Device-1: Intel Wi-Fi 6 AX200 driver: iwlwifi
   Device-2: Intel I211 Gigabit Network driver: igb
Drives:Local Storage: total: 11.35 TiB used: 10.82 TiB (95.3%)
Info:  Processes: 805 Uptime: 12h 56m Memory: 31.18 GiB used:
21.88 GiB (70.2%) Shell: Bash inxi: 3.3.02


> and how do you exactly trigger this?

I am running heavy games like "Zombie Army 4: Dead War" and switching
to Gnome Activities and other applications while the game is running.


-- 
Best Regards,
Mike Gavrilov.
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Unexpected multihop in swaput - likely driver bug.

2021-04-07 Thread Mikhail Gavrilov

Hi!
During the 5.12 testing cycle I observed a reproducible bug when
launching graphics-heavy applications.
The kernel log is flooded with the message "Unexpected multihop in
swaput - likely driver bug.".

Trace:
[ 8707.814899] [ cut here ]
[ 8707.814920] Unexpected multihop in swaput - likely driver bug.
[ 8707.814998] WARNING: CPU: 19 PID: 28231 at
drivers/gpu/drm/ttm/ttm_bo.c:1484 ttm_bo_swapout+0x40b/0x420 [ttm]
[ 8707.815011] Modules linked in: tun uinput snd_seq_dummy rfcomm
snd_hrtimer netconsole nft_objref nf_conntrack_netbios_ns
nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
nf_tables nfnetlink cmac bnep sunrpc vfat fat hid_logitech_hidpp
hid_logitech_dj intel_rapl_msr snd_hda_codec_realtek intel_rapl_common
mt76x2u snd_hda_codec_generic mt76x2_common mt76x02_usb iwlmvm
ledtrig_audio snd_hda_codec_hdmi mt76_usb mt76x02_lib snd_hda_intel
mt76 snd_intel_dspcfg snd_intel_sdw_acpi mac80211 joydev snd_usb_audio
snd_hda_codec uvcvideo edac_mce_amd videobuf2_vmalloc snd_hda_core
snd_usbmidi_lib videobuf2_memops snd_hwdep iwlwifi snd_rawmidi btusb
videobuf2_v4l2 kvm_amd snd_seq videobuf2_common btrtl btbcm videodev
btintel snd_seq_device kvm mc cfg80211 bluetooth snd_pcm libarc4
eeepc_wmi snd_timer asus_wmi irqbypass xpad sp5100_tco
[ 8707.815065]  sparse_keymap ecdh_generic ff_memless video ecc
wmi_bmof i2c_piix4 snd rapl k10temp soundcore rfkill acpi_cpufreq
ip_tables amdgpu drm_ttm_helper ttm iommu_v2 gpu_sched drm_kms_helper
crct10dif_pclmul crc32_pclmul crc32c_intel cec drm ghash_clmulni_intel
igb ccp nvme dca nvme_core i2c_algo_bit wmi pinctrl_amd fuse
[ 8707.815096] CPU: 19 PID: 28231 Comm: kworker/u64:1 Tainted: G
 W- ---  5.12.0-0.rc6.184.fc35.x86_64+debug #1
[ 8707.815101] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 3603 03/20/2021
[ 8707.815106] Workqueue: ttm_swap ttm_shrink_work [ttm]
[ 8707.815114] RIP: 0010:ttm_bo_swapout+0x40b/0x420 [ttm]
[ 8707.815122] Code: 10 00 00 48 c1 e2 0c 48 c1 e6 0c e8 3f 37 fa c8
e9 71 fe ff ff 83 f8 b8 0f 85 a9 fe ff ff 48 c7 c7 28 32 37 c0 e8 02
2b 98 c9 <0f> 0b e9 96 fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
00 0f
[ 8707.815126] RSP: 0018:a306d20e7d58 EFLAGS: 00010292
[ 8707.815130] RAX: 0032 RBX: c0379260 RCX: 0027
[ 8707.815133] RDX: 918c091daae8 RSI: 0001 RDI: 918c091daae0
[ 8707.815136] RBP: 918602210058 R08:  R09: 
[ 8707.815138] R10: a306d20e7b90 R11: 918c2e2fffe8 R12: ffb8
[ 8707.815141] R13: c03792a0 R14: 9186022102c0 R15: 0001
[ 8707.815145] FS:  () GS:918c0900()
knlGS:
[ 8707.815148] CS:  0010 DS:  ES:  CR0: 80050033
[ 8707.815151] CR2: 325c84d12000 CR3: 000776c28000 CR4: 00350ee0
[ 8707.815154] Call Trace:
[ 8707.815164]  ttm_shrink+0xa6/0xe0 [ttm]
[ 8707.815171]  ttm_shrink_work+0x36/0x40 [ttm]
[ 8707.815177]  process_one_work+0x2b0/0x5e0
[ 8707.815185]  worker_thread+0x55/0x3c0
[ 8707.815188]  ? process_one_work+0x5e0/0x5e0
[ 8707.815192]  kthread+0x13a/0x150
[ 8707.815196]  ? __kthread_bind_mask+0x60/0x60
[ 8707.815199]  ret_from_fork+0x22/0x30
[ 8707.815207] irq event stamp: 0
[ 8707.815209] hardirqs last  enabled at (0): [<>] 0x0
[ 8707.815213] hardirqs last disabled at (0): []
copy_process+0x91b/0x1e10
[ 8707.815218] softirqs last  enabled at (0): []
copy_process+0x91b/0x1e10
[ 8707.815222] softirqs last disabled at (0): [<>] 0x0
[ 8707.815224] ---[ end trace 29252aa87289bbaa ]---

Full kernel log: https://pastebin.com/mmAxwBYc

$ /usr/src/kernels/`uname -r`/scripts/faddr2line
/lib/debug/lib/modules/`uname
-r`/kernel/drivers/gpu/drm/ttm/ttm.ko.debug ttm_bo_swapout+0x40b
ttm_bo_swapout+0x40b/0x420:
ttm_bo_swapout at
/usr/src/debug/kernel-5.12-rc6/linux-5.12.0-0.rc6.184.fc35.x86_64/drivers/gpu/drm/ttm/ttm_bo.c:1484
(discriminator 1)
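
For readers unfamiliar with the `symbol+0xOFFSET/0xSIZE` notation passed to
faddr2line above: the offset is the position of the instruction inside the
function and the size is the total length of the function, both in bytes and
in hex. A minimal Python sketch that splits such a token follows; the helper
name `parse_frame` is made up for illustration and is not part of the kernel
scripts.

```python
import re

# Matches tokens such as "ttm_bo_swapout+0x40b/0x420" as printed in kernel traces.
# The symbol part may contain dots (e.g. "amdgpu_dm_init.isra.0").
FRAME_RE = re.compile(
    r"^(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)$"
)

def parse_frame(token):
    """Split 'symbol+0xOFF/0xSIZE' into (symbol, offset, size) with integer values."""
    m = FRAME_RE.match(token)
    if m is None:
        raise ValueError(f"not a symbol+offset/size token: {token!r}")
    return m["symbol"], int(m["offset"], 16), int(m["size"], 16)

symbol, offset, size = parse_frame("ttm_bo_swapout+0x40b/0x420")
print(symbol, offset, size)
```

faddr2line itself accepts the token with or without the trailing `/0xSIZE`
part, as the invocation above shows.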


$ git blame drivers/gpu/drm/ttm/ttm_bo.c -L 1475,1494
Blaming lines:   1% (20/1530), done.
ebdf565169af0 (Dave Airlie  2020-10-29 13:58:52 +1000 1475)
 memset(&hop, 0, sizeof(hop));
ba4e7d973dd09 (Thomas Hellstrom 2009-06-10 15:20:19 +0200 1476)
ba4e7d973dd09 (Thomas Hellstrom 2009-06-10 15:20:19 +0200 1477)
 evict_mem = bo->mem;
ba4e7d973dd09 (Thomas Hellstrom 2009-06-10 15:20:19 +0200 1478)
 evict_mem.mm_node = NULL;
ce65b874001d7 (Christian König  2020-09-30 16:44:16 +0200 1479)
 evict_mem.placement = 0;
ba4e7d973dd09 (Thomas Hellstrom 2009-06-10 15:20:19 +0200 1480)
 evict_mem.mem_type = TTM_PL_SYSTEM;
ba4e7d973dd09 (Thomas Hellstrom 2009-06-10 15:20:19 +0200 1481)
ebdf565169af0 (Dave Airlie  2020-10-29 13:58:52 +1000 1482)
 ret = ttm_bo_handle_move_mem(bo, &evict_mem, true, &ctx,
&hop);

Re: [bug] 5.11-rc5 brought page allocation failure issue [ttm][amdgpu]

2021-02-09 Thread Mikhail Gavrilov

On Mon, 8 Feb 2021 at 14:18, Christian König
 wrote:
>
> Are the other problems gone as well?
>

Yes and no.
The issue with the monitor turning off went away after rc6 (git3aaf0a27ffc2),
but both traces
1) BUG: sleeping function called from invalid context at
include/linux/sched/mm.h:196 (kernel 5.11 specific)
2) WARNING: CPU: 14 PID: 504 at kernel/locking/lockdep.c:4618
lockdep_init_map_waits+0x18b/0x210 (Navi specific)
are still happening on every boot.

1)
[5.806032] BUG: sleeping function called from invalid context at
include/linux/sched/mm.h:196
[5.806048] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid:
504, name: systemd-udevd
[5.806064] 1 lock held by systemd-udevd/504:
[5.806073]  #0: 9c5ac2e4f258 (>mutex){}-{3:3}, at:
device_driver_attach+0x3b/0xb0
[5.806097] CPU: 14 PID: 504 Comm: systemd-udevd Not tainted
5.11.0-0.rc6.20210204git61556703b610.145.fc34.x86_64 #1
[5.806117] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 3402 01/13/2021
[5.806135] Call Trace:
[5.806142]  dump_stack+0x8b/0xb0
[5.806153]  ___might_sleep.cold+0xb6/0xc6
[5.806163]  ? dcn30_clock_source_create+0x34/0xb0 [amdgpu]
[5.806338]  kmem_cache_alloc_trace+0x204/0x230
[5.806353]  dcn30_clock_source_create+0x34/0xb0 [amdgpu]
[5.806516]  dcn30_create_resource_pool+0x1de/0x13b0 [amdgpu]
[5.806678]  ? rcu_read_lock_sched_held+0x3f/0x80
[5.806690]  ? trace_kmalloc+0xb2/0xe0
[5.806699]  ? __kmalloc+0x191/0x280
[5.806710]  ? dc_create_resource_pool+0x110/0x1d0 [amdgpu]
[5.806869]  dc_create_resource_pool+0x110/0x1d0 [amdgpu]
[5.807026]  dc_create+0x205/0x790 [amdgpu]
[5.807181]  ? trace_kmalloc+0xb2/0xe0
[5.807190]  ? kmem_cache_alloc_trace+0x174/0x230
[5.807203]  amdgpu_dm_init.isra.0+0x1b9/0x250 [amdgpu]
[5.807369]  ? dev_vprintk_emit+0x171/0x195
[5.807385]  ? dev_printk_emit+0x3e/0x40
[5.807403]  dm_hw_init+0xe/0x20 [amdgpu]
[5.807563]  amdgpu_device_init.cold+0x179f/0x1afd [amdgpu]
[5.807728]  ? pci_conf1_read+0x9b/0xf0
[5.807741]  amdgpu_driver_load_kms+0x68/0x280 [amdgpu]
[5.807877]  amdgpu_pci_probe+0x129/0x1b0 [amdgpu]
[5.808009]  local_pci_probe+0x42/0x80
[5.808020]  pci_device_probe+0xd9/0x1a0
[5.808031]  really_probe+0xf2/0x440
[5.808042]  driver_probe_device+0xe1/0x150
[5.808053]  device_driver_attach+0xa8/0xb0
[5.808063]  __driver_attach+0x8c/0x150
[5.808071]  ? device_driver_attach+0xb0/0xb0
[5.808080]  ? device_driver_attach+0xb0/0xb0
[5.808090]  bus_for_each_dev+0x67/0x90
[5.808101]  bus_add_driver+0x12e/0x1f0
[5.808111]  driver_register+0x8f/0xe0
[5.808119]  ? 0xc0c02000
[5.808128]  do_one_initcall+0x67/0x320
[5.808138]  ? rcu_read_lock_sched_held+0x3f/0x80
[5.808148]  ? trace_kmalloc+0xb2/0xe0
[5.808157]  ? kmem_cache_alloc_trace+0x174/0x230
[5.808169]  do_init_module+0x5c/0x270
[5.808179]  __do_sys_init_module+0x130/0x190
[5.808196]  do_syscall_64+0x33/0x40
[5.808205]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[5.808216] RIP: 0033:0x7f4d133aa40e
[5.808225] Code: 48 8b 0d 65 1a 0c 00 f7 d8 64 89 01 48 83 c8 ff
c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 32 1a 0c 00 f7 d8 64 89
01 48
[5.808256] RSP: 002b:7ffc81317fb8 EFLAGS: 0246 ORIG_RAX:
00af
[5.808272] RAX: ffda RBX: 563f79509ee0 RCX: 7f4d133aa40e
[5.808285] RDX: 563f7951daa0 RSI: 00b8a85e RDI: 563f79f03db0
[5.808298] RBP: 563f79f03db0 R08: 563f79509fd0 R09: 7ffc813146be
[5.808311] R10: 563a1aa70959 R11: 0246 R12: 563f7951daa0
[5.808324] R13: 563f7950e9c0 R14:  R15: 563f7951f100


2)
[6.064107] BUG: key 9c5adb339148 has not been registered!
[6.064119] [ cut here ]
[6.064121] DEBUG_LOCKS_WARN_ON(1)
[6.064124] WARNING: CPU: 14 PID: 504 at
kernel/locking/lockdep.c:4618 lockdep_init_map_waits+0x18b/0x210
[6.064131] Modules linked in: amdgpu(+) drm_ttm_helper ttm
iommu_v2 gpu_sched drm_kms_helper crct10dif_pclmul crc32_pclmul
crc32c_intel cec igb drm ghash_clmulni_intel ccp nvme dca i2c_algo_bit
nvme_core wmi pinctrl_amd fuse
[6.064147] CPU: 14 PID: 504 Comm: systemd-udevd Tainted: G
W- ---
5.11.0-0.rc6.20210204git61556703b610.145.fc34.x86_64 #1
[6.064152] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 3402 01/13/2021
[6.064156] RIP: 0010:lockdep_init_map_waits+0x18b/0x210
[6.064159] Code: 00 85 c0 0f 84 77 ff ff ff 8b 3d 08 5e f1 01 85
ff 0f 85 69 ff ff ff 48 c7 c6 cc 98 60 9a 48 c7 c7 7d d4 5a 9a e8 51
3a b7 00 <0f> 0b e9 4f ff ff ff e8 c9 82 bd 00 85 c0 74 21 44 8b 15 d6
5d f1
[6.064165] RSP: 0018:bba701be78c8 EFLAGS: 00010292
[6.064168] RAX: 0016 RBX: 9a247b80

Re: [bug] 5.11-rc5 brought page allocation failure issue [ttm][amdgpu]

2021-02-06 Thread Mikhail Gavrilov

On Sun, 31 Jan 2021 at 22:22, Christian König
 wrote:
>
>
> Yeah, known issue. I already pushed Michel's fix to drm-misc-fixes.
> Should land in the next -rc by the weekend.
>
> Regards,
> Christian.

I tested this patch [1] for several days
and can confirm that the reported issue is gone.

[1] https://lore.kernel.org/lkml/20210128095346.2421-1-mic...@daenzer.net/

-- 
Best Regards,
Mike Gavrilov.

[bug] 5.11-rc5 brought page allocation failure issue [ttm][amdgpu]

2021-01-30 Thread Mikhail Gavrilov

The 5.11-rc5 (git 76c057c84d28) brought a new issue.
Now the kernel log is flooded with the message "page allocation failure".

Trace:
msedge:cs0: page allocation failure: order:10,
mode:0x190cc2(GFP_HIGHUSER|__GFP_NORETRY|__GFP_NOMEMALLOC),
nodemask=(null),cpuset=/,mems_allowed=0
CPU: 18 PID: 4540 Comm: msedge:cs0 Tainted: GW
- ---  5.11.0-0.rc5.20210128git76c057c84d28.138.fc34.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 3402 01/13/2021
Call Trace:
 dump_stack+0x8b/0xb0
 warn_alloc.cold+0x72/0xd6
 ? _cond_resched+0x16/0x50
 ? __alloc_pages_direct_compact+0x1a1/0x210
 __alloc_pages_slowpath.constprop.0+0xf64/0xf90
 ? kmem_cache_alloc+0x299/0x310
 ? lock_acquire+0x173/0x380
 ? trace_hardirqs_on+0x1b/0xe0
 ? lock_release+0x1e9/0x400
 __alloc_pages_nodemask+0x37d/0x400
 ttm_pool_alloc+0x2a3/0x630 [ttm]
 ttm_tt_populate+0x37/0xe0 [ttm]
 ttm_bo_handle_move_mem+0x142/0x180 [ttm]
 ttm_bo_evict+0x12e/0x1b0 [ttm]
 ? kfree+0xeb/0x660
 ? amdgpu_vram_mgr_new+0x34d/0x3d0 [amdgpu]
 ttm_mem_evict_first+0x101/0x4d0 [ttm]
 ttm_bo_mem_space+0x2c8/0x330 [ttm]
 ttm_bo_validate+0x163/0x1c0 [ttm]
 amdgpu_cs_bo_validate+0x82/0x190 [amdgpu]
 amdgpu_cs_list_validate+0x105/0x150 [amdgpu]
 amdgpu_cs_ioctl+0x803/0x1ef0 [amdgpu]
 ? trace_hardirqs_off_caller+0x41/0xd0
 ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
 drm_ioctl_kernel+0x8c/0xe0 [drm]
 drm_ioctl+0x20f/0x3c0 [drm]
 ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
 ? selinux_file_ioctl+0x147/0x200
 ? lock_acquired+0x1fa/0x380
 ? lock_release+0x1e9/0x400
 ? trace_hardirqs_on+0x1b/0xe0
 amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
 __x64_sys_ioctl+0x82/0xb0
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f829c36c11b
Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c
c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 8b 0d 25 bd 0c 00 f7 d8 64 89 01 48
RSP: 002b:7f8282c14f38 EFLAGS: 0246 ORIG_RAX: 0010
RAX: ffda RBX: 7f8282c14fa0 RCX: 7f829c36c11b
RDX: 7f8282c14fa0 RSI: c0186444 RDI: 0018
RBP: c0186444 R08: 7f8282c15640 R09: 7f8282c14f80
R10:  R11: 0246 R12: 1f592c0fe088
R13: 0018 R14:  R15: fffd
Mem-Info:
active_anon:24325 inactive_anon:3569299 isolated_anon:0
 active_file:704540 inactive_file:2709725 isolated_file:0
 unevictable:1230 dirty:256317 writeback:7074
 slab_reclaimable:222328 slab_unreclaimable:112852
 mapped:838359 shmem:469422 pagetables:47722 bounce:0
 free:107165 free_pcp:1298 free_cma:0
Node 0 active_anon:97300kB inactive_anon:14277196kB
active_file:2818160kB inactive_file:10838900kB unevictable:4920kB
isolated(anon):0kB isolated(file):0kB mapped:3353436kB dirty:1025268kB
writeback:28296kB shmem:1877688kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 0kB writeback_tmp:0kB kernel_stack:62528kB
pagetables:190888kB all_unreclaimable? no
Node 0 DMA free:11800kB min:32kB low:44kB high:56kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB
present:15992kB managed:15900kB mlocked:0kB bounce:0kB free_pcp:0kB
local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 3056 31787 31787 31787
Node 0 DMA32 free:303044kB min:6492kB low:9620kB high:12748kB
reserved_highatomic:0KB active_anon:20kB inactive_anon:1322808kB
active_file:5136kB inactive_file:483136kB unevictable:0kB
writepending:220876kB present:3314552kB managed:3246620kB mlocked:0kB
bounce:0kB free_pcp:4kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 28731 28731 28731
Node 0 Normal free:113816kB min:61052kB low:90472kB high:119892kB
reserved_highatomic:0KB active_anon:97280kB inactive_anon:12953852kB
active_file:2812656kB inactive_file:10355000kB unevictable:4920kB
writepending:832688kB present:30133248kB managed:29421044kB
mlocked:4920kB bounce:0kB free_pcp:5180kB local_pcp:4kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0 0
Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U)
1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11800kB
Node 0 DMA32: 1009*4kB (UME) 724*8kB (UME) 488*16kB (UME) *32kB
(UME) 950*64kB (UME) 620*128kB (UME) 223*256kB (UME) 74*512kB (M)
11*1024kB (M) 2*2048kB (ME) 0*4096kB = 303684kB
Node 0 Normal: 964*4kB (UME) 719*8kB (ME) 379*16kB (UME) 192*32kB
(UME) 127*64kB (UME) 130*128kB (UME) 122*256kB (UME) 18*512kB (UME)
4*1024kB (UM) 11*2048kB (UM) 0*4096kB = 113656kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
hugepages_size=1048576kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
3881804 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap  = 67108860kB
Total swap = 67108860kB
8365948 pages RAM
0 pages HighMem/MovableOnly
195057 pages reserved
0 pages cma reserved
0 pages hwpoisoned
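
A note on reading the report above: `order:10` means a request for 2^10
physically contiguous pages, which is 4096 kB with 4 kB pages. The per-zone
free lists show `0*4096kB` for both DMA32 and Normal, so those two zones had
no free block of that size even though many smaller blocks were free. A rough
Python sketch of that check, assuming 4 kB pages; the function names are
hypothetical and only illustrate how the free-list line can be read.

```python
import re

PAGE_KB = 4  # assumption: 4 kB base pages

def free_blocks(buddy_line):
    """Map block size in kB -> free block count from a 'N*SIZEkB ...' list."""
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", buddy_line)}

def can_satisfy(buddy_line, order):
    """True if at least one free block of 2**order pages (or larger) is listed."""
    need_kb = PAGE_KB << order
    return any(count > 0 and size >= need_kb
               for size, count in free_blocks(buddy_line).items())

# Normal-zone line copied from the report; the flags in parentheses are ignored.
normal = ("964*4kB (UME) 719*8kB (ME) 379*16kB (UME) 192*32kB (UME) "
          "127*64kB (UME) 130*128kB (UME) 122*256kB (UME) 18*512kB (UME) "
          "4*1024kB (UM) 11*2048kB (UM) 0*4096kB")

print(can_satisfy(normal, 10))  # order 10 needs a 4096 kB block
print(can_satisfy(normal, 9))   # order 9 needs a 2048 kB block
```

The request was made with `__GFP_NORETRY`, so the allocator is allowed to give
up early instead of retrying reclaim and compaction.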

Full kernel log:

Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] ERROR Failed to pin framebuffer with error -12

2021-01-24 Thread Mikhail Gavrilov

On Thu, 21 Jan 2021 at 18:27, Christian König  wrote:
>
> I still have no idea what's going on here.
>
> The KASAN messages from the DC code are completely unrelated.
>
> Please add the full dmesg to your bug report.
>

I did it.
https://gitlab.freedesktop.org/drm/amd/-/issues/1439#note_776267

-- 
Best Regards,
Mike Gavrilov.

Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] ERROR Failed to pin framebuffer with error -12

2021-01-19 Thread Mikhail Gavrilov

On Fri, 15 Jan 2021 at 03:43, Mikhail Gavrilov
 wrote:
>

In rc4, the number of warnings has dropped dramatically.
There are no more "kasan slab-out-of-bounds" errors and no "DMA-API device
driver failed to check map error" warnings.
But "sleeping function called from invalid context at
include/linux/sched/mm.h:196" and "BUG: key 88810b0d9148 has not
been registered!" are still not fixed.
The second issue is Navi specific, because it started to happen on the 5.10
kernel after I replaced the Radeon VII with a 6900XT.
1.
BUG: sleeping function called from invalid context at
include/linux/sched/mm.h:196
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 500, name: systemd-udevd
1 lock held by systemd-udevd/500:
 #0: 888107690258 (>mutex){}-{3:3}, at:
device_driver_attach+0xa3/0x250
CPU: 9 PID: 500 Comm: systemd-udevd Not tainted
5.11.0-0.rc4.129.fc34.x86_64+debug #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 2802 10/21/2020
Call Trace:
 dump_stack+0xae/0xe5
 ___might_sleep.cold+0x150/0x17e
 ? dcn30_clock_source_create+0x53/0x110 [amdgpu]
 kmem_cache_alloc_trace+0x23f/0x270
 dcn30_clock_source_create+0x53/0x110 [amdgpu]
 dcn30_create_resource_pool+0x998/0x4890 [amdgpu]
 ? dcn30_calc_max_scaled_time+0x40/0x40 [amdgpu]
 ? lock_is_held_type+0xb8/0xf0
 ? unpoison_range+0x3a/0x60
 ? kasan_kmalloc.constprop.0+0x84/0xa0
 ? dc_create_resource_pool+0x26e/0x5e0 [amdgpu]
 dc_create_resource_pool+0x26e/0x5e0 [amdgpu]
 dc_create+0x636/0x1bc0 [amdgpu]
 ? lock_acquire+0x2dd/0x7a0
 ? sched_clock+0x5/0x10
 ? sched_clock_cpu+0x18/0x170
 ? find_held_lock+0x33/0x110
 ? dc_create_state+0xa0/0xa0 [amdgpu]
 ? lock_downgrade+0x6b0/0x6b0
 ? module_assert_mutex_or_preempt+0x3e/0x70
 ? lock_is_held_type+0xb8/0xf0
 ? unpoison_range+0x3a/0x60
 ? kasan_kmalloc.constprop.0+0x84/0xa0
 amdgpu_dm_init.isra.0+0x479/0x640 [amdgpu]
 ? vprintk_emit+0x1c0/0x460
 ? dev_vprintk_emit+0x2d8/0x31a
 ? sched_clock+0x5/0x10
 ? dm_resume+0x13b0/0x13b0 [amdgpu]
 ? dev_attr_show.cold+0x35/0x35
 ? lock_downgrade+0x6b0/0x6b0
 ? dev_printk_emit+0x8c/0xa8
 ? dev_vprintk_emit+0x31a/0x31a
 ? wait_for_completion_io+0x240/0x240
 ? __dev_printk+0x71/0xdf
 ? smu_hw_init.cold+0x16b/0x18a [amdgpu]
 ? smu_suspend+0x240/0x240 [amdgpu]
 ? navi10_ih_irq_init+0xea3/0x2420 [amdgpu]
 dm_hw_init+0xe/0x20 [amdgpu]
 amdgpu_device_init.cold+0x3031/0x4940 [amdgpu]
 ? amdgpu_device_cache_pci_state+0xf0/0xf0 [amdgpu]
 ? pci_bus_read_config_byte+0x140/0x140
 ? do_pci_enable_device+0x1f8/0x260
 ? pci_find_saved_ext_cap+0x110/0x110
 ? pci_enable_bridge+0xf9/0x1e0
 ? pci_dev_check_d3cold+0x107/0x250
 ? pci_enable_device_flags+0x201/0x340
 amdgpu_driver_load_kms+0x167/0x8a0 [amdgpu]
 amdgpu_pci_probe+0x235/0x360 [amdgpu]
 ? amdgpu_pci_remove+0xd0/0xd0 [amdgpu]
 local_pci_probe+0xd8/0x170
 pci_device_probe+0x318/0x5c0
 ? kernfs_create_link+0x16c/0x230
 ? pci_device_remove+0x1d0/0x1d0
 really_probe+0x224/0xc40
 driver_probe_device+0x1f2/0x380
 device_driver_attach+0x1df/0x250
 __driver_attach+0xf6/0x260
 ? device_driver_attach+0x250/0x250
 bus_for_each_dev+0x114/0x180
 ? subsys_dev_iter_exit+0x10/0x10
 bus_add_driver+0x352/0x570
 driver_register+0x20f/0x390
 ? __pci_register_driver+0x13a/0x210
 ? 0xc1d8d000
 do_one_initcall+0xfb/0x530
 ? perf_trace_initcall_level+0x3d0/0x3d0
 ? __memset+0x2b/0x30
 ? unpoison_range+0x3a/0x60
 do_init_module+0x1ce/0x7a0
 load_module+0x9841/0xa380
 ? module_frob_arch_sections+0x20/0x20
 ? lockdep_hardirqs_on_prepare+0x3e0/0x3e0
 ? sched_clock_cpu+0x18/0x170
 ? sched_clock+0x5/0x10
 ? lock_acquire+0x2dd/0x7a0
 ? sched_clock+0x5/0x10
 ? lock_is_held_type+0xb8/0xf0
 ? __do_sys_init_module+0x18b/0x220
 __do_sys_init_module+0x18b/0x220
 ? load_module+0xa380/0xa380
 ? ktime_get_coarse_real_ts64+0x12f/0x160
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f2c109da07e
Code: 48 8b 0d f5 1d 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f
84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 8b 0d c2 1d 0c 00 f7 d8 64 89 01 48
RSP: 002b:7ffc84d33f88 EFLAGS: 0246 ORIG_RAX: 00af
RAX: ffda RBX: 55b87f8260a0 RCX: 7f2c109da07e
RDX: 55b87f834060 RSI: 01e2cbf6 RDI: 7f2c0b7e0010
RBP: 7f2c0b7e0010 R08: 55b87f8281e0 R09: 7ffc84d30a26
R10: 55bd2404cc18 R11: 0246 R12: 55b87f834060
R13: 55b87f831ca0 R14:  R15: 55b87f832640
[drm] Display Core initialized with v3.2.116!
[drm] DMUB hardware initialized: version=0x0201
usb 1-3.2: Device not responding to setup address.
usb 1-3.2: device not accepting address 5, error -71
[drm] REG_WAIT timeout 1us * 10 tries - mpc2_assert_idle_mpcc line:480


2.
BUG: key 88810b0d9148 has not been registered!
[ cut here ]
DEBUG_LOCKS_WARN_ON(1)
WARNING: CPU: 25 PID: 500 at kernel/locking/lockdep.c:4618
lockdep_init_map_waits+0x592/0x770
Modules linked

Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] ERROR Failed to pin framebuffer with error -12

2021-01-14 Thread Mikhail Gavrilov

On Thu, 14 Jan 2021 at 18:56, Christian König  wrote:
> Unfortunately not off hand.
>
> I also don't see any bug reports from other people and can't reproduce
> the last backtrace you send out TTM here.

Because only the most desperate will install kernels with debug flags
enabled and then load the system by opening a huge number of
programs and tabs. So you shouldn't be surprised that I'm the only one
here.
This is what my desktop looks like every day: https://imgur.com/a/Kxlmrem

> Do you have any local modifications or special setup in your system?
> Like bpf scripts or something like that?

No, I didn't write any bpf scripts, but it looks like my distribution,
Fedora Rawhide, uses some bpf scripts by default out of the box:

# bpftool prog
20: cgroup_device  tag 40ddf486530245f5  gpl
loaded_at 2021-01-15T01:30:04+0500  uid 0
xlated 504B  jited 309B  memlock 4096B
21: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:04+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
22: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:04+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
23: cgroup_device  tag ca8e50a3c7fb034b  gpl
loaded_at 2021-01-15T01:30:05+0500  uid 0
xlated 496B  jited 307B  memlock 4096B
24: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:05+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
25: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:05+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
26: cgroup_device  tag be31ae23198a0378  gpl
loaded_at 2021-01-15T01:30:13+0500  uid 0
xlated 464B  jited 288B  memlock 4096B
27: cgroup_device  tag ee0e253c78993a24  gpl
loaded_at 2021-01-15T01:30:13+0500  uid 0
xlated 416B  jited 255B  memlock 4096B
28: cgroup_device  tag 438c5618576e5b0c  gpl
loaded_at 2021-01-15T01:30:13+0500  uid 0
xlated 568B  jited 354B  memlock 4096B
29: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:13+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
30: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:13+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
31: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:13+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
32: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:13+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
33: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:14+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
34: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:14+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
35: cgroup_device  tag ee0e253c78993a24  gpl
loaded_at 2021-01-15T01:30:14+0500  uid 0
xlated 416B  jited 255B  memlock 4096B
38: cgroup_device  tag 3a0ef5414c2f6fca  gpl
loaded_at 2021-01-15T01:30:14+0500  uid 0
xlated 744B  jited 447B  memlock 4096B
39: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:14+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
40: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:14+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
41: cgroup_device  tag ee0e253c78993a24  gpl
loaded_at 2021-01-15T01:30:18+0500  uid 0
xlated 416B  jited 255B  memlock 4096B
42: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:18+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
43: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:18+0500  uid 0
xlated 64B  jited 54B  memlock 4096B

I caught another couple of leaks, but nothing new:
https://pastebin.com/2EgvYJdz

[1] do_detailed_mode+0x7c1/0x13d0 [drm]
[2] drm_mode_duplicate+0x45/0x220 [drm]
[3] do_seccomp+0x215/0x2280
[4] __vmalloc_node_range+0x464/0x7b0
[5] bpf_prog_alloc_no_stats+0xa2/0x2b0
[6] bpf_prog_store_orig_filter+0x7b/0x1c0
[7] kmemdup+0x1a/0x40

Did the following trace message confuse anyone?
==
BUG: KASAN: slab-out-of-bounds in
kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu]
Read of size 1 at addr 88812a6b4181 by task systemd-udevd/491

CPU: 20 PID: 491 Comm: systemd-udevd Not tainted
5.11.0-0.rc3.20210114git65f0d2414b70.125.fc34.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 2802 10/21/2020
Call Trace:
 dump_stack+0xae/0xe5
 print_address_description.constprop.0+0x18/0x160
 ? kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu]
 kasan_report.cold+0x7f/0x10e
 ? kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu]
 kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu]
 ? kfd_create_crat_image_acpi+0x340/0x340 [amdgpu]
 ? __raw_spin_lock_init+0x39/0x110
 kfd_topology_init+0x2ac/0x400 [amdgpu]
 ? kfd_create_topology_device+0x320/0x320 [amdgpu]
 ? __class_register+0x2ad/0x430
 ? __class_create+0xc5/0x130
 kgd2kfd_init+0x95/0xf0 [amdgpu]

Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] ERROR Failed to pin framebuffer with error -12

2021-01-13 Thread Mikhail Gavrilov

On Tue, 12 Jan 2021 at 01:45, Christian König  wrote:
>
> But what you have in your logs so far are only unrelated symptoms, the
> root of the problem is that somebody is leaking memory.
>
> What you could do as well is to try to enable kmemleak

I captured some memleaks.
Do they contain any useful information?

[1] https://pastebin.com/n0FE7Hsu
[2] https://pastebin.com/MUX55L1k
[3] https://pastebin.com/a3FT7DVG
[4] https://pastebin.com/1ALvJKz7

--
Best Regards,
Mike Gavrilov.

Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] ERROR Failed to pin framebuffer with error -12

2021-01-11 Thread Mikhail Gavrilov

Hi Christian,

On Tue, 12 Jan 2021 at 01:45, Christian König  wrote:
>
> Hi Mike,
>
> Unfortunately not, that's DC stuff. Easiest is to assign this as a bug
> tracker to our DC team.
Ok

> At least some progress. Any objections that I add your e-mail address as
> tested-by tag?
No objections, feel free to add me.

> I can take a look at this one here. Looks like some missing error
> handling when allocating memory.
> Can you decode to which line number ttm_tt_swapin+0x34 points to?
$ /usr/src/kernels/`uname -r`/scripts/faddr2line
/lib/debug/lib/modules/`uname
-r`/kernel/drivers/gpu/drm/ttm/ttm.ko.debug ttm_tt_swapin+0x34
ttm_tt_swapin+0x34/0xd0:
mapping_gfp_mask at
/usr/src/debug/kernel-20210108gitf5e6c330254a/linux-5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64/./include/linux/pagemap.h:105
(discriminator 2)
(inlined by) ttm_tt_swapin at
/usr/src/debug/kernel-20210108gitf5e6c330254a/linux-5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64/drivers/gpu/drm/ttm/ttm_tt.c:210
(discriminator 2)

$ cat -s -n 
/usr/src/debug/kernel-20210108gitf5e6c330254a/linux-5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64/drivers/gpu/drm/ttm/ttm_tt.c
| head -220 | tail -20
   201  struct page *from_page;
   202  struct page *to_page;
   203  gfp_t gfp_mask;
   204  int i, ret;
   205
   206  swap_storage = ttm->swap_storage;
   207  BUG_ON(swap_storage == NULL);
   208
   209  swap_space = swap_storage->f_mapping;
   210  gfp_mask = mapping_gfp_mask(swap_space);
   211
   212  for (i = 0; i < ttm->num_pages; ++i) {
   213  from_page = shmem_read_mapping_page_gfp(swap_space, i,
   214  gfp_mask);
   215  if (IS_ERR(from_page)) {
   216  ret = PTR_ERR(from_page);
   217  goto out_err;
   218  }
   219  to_page = ttm->pages[i];
   220  if (unlikely(to_page == NULL)) {

> Please use this one here:
> https://gitlab.freedesktop.org/drm/amd/-/issues/new
>
> If you can't find the DC guys of hand in the assignee list just assign
> to me and I will forward.
https://gitlab.freedesktop.org/drm/amd/-/issues/1439
Ok, let's continue there.

--
Best Regards,
Mike Gavrilov.

Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] ERROR Failed to pin framebuffer with error -12

2021-01-11 Thread Mikhail Gavrilov

On Mon, 11 Jan 2021 at 19:01, Christian König  wrote:

> Changing the page table attributes while releasing memory might sleep.
> So we can't use a spinlock here.
>
> Thanks for the report, a patch to fix this is on the mailing list now.

Can you also look at the first trace?
It has the same error message, "sleeping function called from invalid
context", and a lot of [amdgpu] code.

BUG: sleeping function called from invalid context at
include/linux/sched/mm.h:196
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 501, name: systemd-udevd
1 lock held by systemd-udevd/501:
 #0: 978e0278d258 (>mutex){}-{3:3}, at:
device_driver_attach+0x3b/0xb0
CPU: 25 PID: 501 Comm: systemd-udevd Not tainted
5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 2802 10/21/2020
Call Trace:
 dump_stack+0x8b/0xb0
 ___might_sleep.cold+0xb6/0xc6
 ? dcn30_clock_source_create+0x34/0xb0 [amdgpu]
 kmem_cache_alloc_trace+0x204/0x230
 dcn30_clock_source_create+0x34/0xb0 [amdgpu]
 dcn30_create_resource_pool+0x1d9/0x13a0 [amdgpu]
 ? rcu_read_lock_sched_held+0x3f/0x80
 ? trace_kmalloc+0xb2/0xe0
 ? __kmalloc+0x191/0x280
 ? dc_create_resource_pool+0x110/0x1d0 [amdgpu]
 dc_create_resource_pool+0x110/0x1d0 [amdgpu]
 dc_create+0x205/0x790 [amdgpu]
 ? trace_kmalloc+0xb2/0xe0
 ? kmem_cache_alloc_trace+0x174/0x230
 amdgpu_dm_init.isra.0+0x1b9/0x250 [amdgpu]
 ? dev_vprintk_emit+0x171/0x195
 ? dev_printk_emit+0x3e/0x40
 dm_hw_init+0xe/0x20 [amdgpu]
 amdgpu_device_init.cold+0x179f/0x1afd [amdgpu]
 ? pci_conf1_read+0xa4/0x100
 amdgpu_driver_load_kms+0x68/0x280 [amdgpu]
 amdgpu_pci_probe+0x129/0x1b0 [amdgpu]
 local_pci_probe+0x42/0x80
 pci_device_probe+0xd9/0x1a0
 really_probe+0x205/0x460
 driver_probe_device+0xe1/0x150
 device_driver_attach+0xa8/0xb0
 __driver_attach+0x8c/0x150
 ? device_driver_attach+0xb0/0xb0
 ? device_driver_attach+0xb0/0xb0
 bus_for_each_dev+0x67/0x90
 bus_add_driver+0x12e/0x1f0
 driver_register+0x8f/0xe0
 ? 0xc0d9c000
 do_one_initcall+0x67/0x320
 ? rcu_read_lock_sched_held+0x3f/0x80
 ? trace_kmalloc+0xb2/0xe0
 ? kmem_cache_alloc_trace+0x174/0x230
 do_init_module+0x5c/0x270
 __do_sys_init_module+0x130/0x190
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f363661deee
Code: 48 8b 0d 85 1f 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f
84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 8b 0d 52 1f 0c 00 f7 d8 64 89 01 48
RSP: 002b:7ffeb7191588 EFLAGS: 0246 ORIG_RAX: 00af
RAX: ffda RBX: 561b94563170 RCX: 7f363661deee
RDX: 561b94579df0 RSI: 00b8a356 RDI: 7f3633b9e010
RBP: 7f3633b9e010 R08: 561b94565240 R09: 7ffeb718d786
R10: 561ef5ef1595 R11: 0246 R12: 561b94579df0
R13: 561b9457a3e0 R14:  R15: 561b94576530
[drm] Display Core initialized with v3.2.116!
[drm] DMUB hardware initialized: version=0x0201
usb 1-3.2: new high-speed USB device number 5 using xhci_hcd
[drm] REG_WAIT timeout 1us * 10 tries - mpc2_assert_idle_mpcc line:480

> > -12 is just -ENOMEM. Looks like a memory leak to me, maybe caused by
> > the problem above, maybe something completely unrelated.
> >
> > I will take a look.
>
> This looks like a completely unrelated memory leak to me.
>
> Probably best if you open up a bug report for this.
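
For reference, the mapping from -12 to ENOMEM mentioned above can be checked with Python's standard errno module. This is only a minimal sketch; the helper name decode_errno is hypothetical and not part of any kernel or amdgpu tooling.

```python
import errno
import os


def decode_errno(code: int) -> str:
    """Return 'NAME: description' for a kernel-style (possibly negative) error code."""
    number = abs(code)  # kernel functions return errors as negative values
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name}: {os.strerror(number)}"


# The value reported by "Failed to pin framebuffer with error -12".
print(decode_errno(-12))
```

The description text comes from the local C library, so its wording can differ between systems; the symbolic name is the part that matters here.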

Yes, the monitor still turns off after applying the patch "make the pool
shrinker lock a mutex".
Still, the patch fixed the flood of "BUG: sleeping function called from
invalid context at mm/vmalloc.c:1756" messages, so the kernel log is
cleaner now.
With the patch applied, the monitor turning off looks like this in the logs:

DMA-API: cacheline tracking ENOMEM, dma-debug disabled
amdgpu :0b:00.0: amdgpu: 6b791523 pin failed
[drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin
framebuffer with error -12
BUG: kernel NULL pointer dereference, address: 0060
#PF: supervisor read access in kernel mode
#PF: error_code(0x) - not-present page
PGD 0 P4D 0
Oops:  [#1] SMP NOPTI
CPU: 20 PID: 3780 Comm: brave:cs0 Tainted: GW-
---  5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 2802 10/21/2020
RIP: 0010:ttm_tt_swapin+0x34/0x1b0 [ttm]
Code: 55 41 54 55 53 48 83 ec 10 48 8b 47 20 48 89 44 24 08 48 85 c0
0f 84 86 01 00 00 48 8b 44 24 08 49 89 fc 4c 8b a8 e0 01 00 00 <41> 8b
45 60 89 44 24 04 8b 47 0c 85 c0 0f 84 df 00 00 00 31 db 65
RSP: 0018:a7400532b9c0 EFLAGS: 00010286
RAX: 978e2ae25800 RBX: 97910ec12058 RCX: 978e12caac70
RDX: 8010 RSI:  RDI: 97912c3d99c0
RBP: 97912c3d99c0 R08:  R09: 70b3a000
R10: 0002 R11:  R12: 97912c3d99c0
R13:  R14: a7400532ba90 R15: 978e182c6350
FS:  7f070bb1b640()

[drm:dm_plane_helper_prepare_fb [amdgpu]] ERROR Failed to pin framebuffer with error -12

2021-01-10 Thread Mikhail Gavrilov

Hi folks,
today I started testing kernel 5.11 and saw that the kernel log was
flooded with BUG messages:
BUG: sleeping function called from invalid context at mm/vmalloc.c:1756
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 266, name: kswapd0
INFO: lockdep is turned off.
CPU: 15 PID: 266 Comm: kswapd0 Tainted: GW-
---  5.11.0-0.rc2.20210108gitf5e6c330254a.119.fc34.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 2802 10/21/2020
Call Trace:
 dump_stack+0x8b/0xb0
 ___might_sleep.cold+0xb6/0xc6
 vm_unmap_aliases+0x21/0x40
 change_page_attr_set_clr+0x9e/0x190
 set_memory_wb+0x2f/0x80
 ttm_pool_free_page+0x28/0x90 [ttm]
 ttm_pool_shrink+0x45/0xb0 [ttm]
 ttm_pool_shrinker_scan+0xa/0x20 [ttm]
 do_shrink_slab+0x177/0x3a0
 shrink_slab+0x9c/0x290
 shrink_node+0x2e6/0x700
 balance_pgdat+0x2f5/0x650
 kswapd+0x21d/0x4d0
 ? do_wait_intr_irq+0xd0/0xd0
 ? balance_pgdat+0x650/0x650
 kthread+0x13a/0x150
 ? __kthread_bind_mask+0x60/0x60
 ret_from_fork+0x22/0x30
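
To see at a glance which modules show up in a trace like the one above, the frames can be grouped by their [module] suffix. The following is a rough sketch, assuming the trace is available as plain text. The function name, the regex and the "vmlinux" label for built-in frames are illustrative choices, not part of any kernel tool.

```python
import re
from collections import Counter

# Matches lines such as " ttm_pool_shrink+0x45/0xb0 [ttm]", optionally
# prefixed with a "[6.335428]"-style timestamp or a "?" marker.
FRAME_RE = re.compile(
    r"^\s*(?:\[\s*\d+\.\d+\]\s*)?"
    r"(?P<unreliable>\?\s+)?"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?\s*$"
)


def module_frames(trace: str) -> Counter:
    """Count reliable stack frames per module; built-in frames go under 'vmlinux'."""
    counts = Counter()
    for line in trace.splitlines():
        match = FRAME_RE.match(line)
        if match is None or match.group("unreliable"):
            continue  # not a frame, or a "?" frame the unwinder could not verify
        counts[match.group("mod") or "vmlinux"] += 1
    return counts


sample = """\
 dump_stack+0x8b/0xb0
 ___might_sleep.cold+0xb6/0xc6
 vm_unmap_aliases+0x21/0x40
 set_memory_wb+0x2f/0x80
 ttm_pool_free_page+0x28/0x90 [ttm]
 ttm_pool_shrink+0x45/0xb0 [ttm]
 ? balance_pgdat+0x650/0x650
 kthread+0x13a/0x150
"""
print(module_frames(sample))
```

Frames marked with "?" are skipped because the kernel prints them as unreliable stack entries.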

But the most unpleasant thing is that after a while the monitor turns
off and does not come back on until a reboot.
This is accompanied by an entry in the kernel log:

amdgpu :0b:00.0: amdgpu: ff7d8b94 pin failed
[drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin
framebuffer with error -12

$ grep "Failed to pin framebuffer with error" -Rn .
./drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c:5816: DRM_ERROR("Failed to pin framebuffer with error %d\n", r);

$ git blame -L 5811,5821 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
Blaming lines:   0% (11/9167), done.
5d43be0ccbc2f (Christian König 2017-10-26 18:06:23 +0200 5811)  domain = AMDGPU_GEM_DOMAIN_VRAM;
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5812)
7b7c6c81b3a37 (Junwei Zhang    2018-06-25 12:51:14 +0800 5813)  r = amdgpu_bo_pin(rbo, domain);
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5814)  if (unlikely(r != 0)) {
30b7c6147d18d (Harry Wentland  2017-10-26 15:35:14 -0400 5815)      if (r != -ERESTARTSYS)
30b7c6147d18d (Harry Wentland  2017-10-26 15:35:14 -0400 5816)          DRM_ERROR("Failed to pin framebuffer with error %d\n", r);
0f257b09531b4 (Chunming Zhou   2019-05-07 19:45:31 +0800 5817)      ttm_eu_backoff_reservation(, );
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5818)      return r;
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5819)  }
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5820)
bb812f1ea87dd (Junwei Zhang    2018-06-25 13:32:24 +0800 5821)  r = amdgpu_ttm_alloc_gart(>tbo);

Who knows how to fix it?

Full kernel logs are here:
[1] https://pastebin.com/fLasjDHX
[2] https://pastebin.com/g3wR2r9e

--
Best Regards,
Mike Gavrilov.

BUG: key ffff8b521bda9148 has not been registered!

2021-01-09 Thread Mikhail Gavrilov

Hi folks!
I started seeing this message on every boot after replacing my Radeon VII with a 6900XT.

$ journalctl | grep "BUG: key"
Dec 31 05:19:42 localhost.localdomain kernel: BUG: key 98b59ab01148 has not been registered!
Dec 31 05:25:44 localhost.localdomain kernel: BUG: key 8d425ba01148 has not been registered!
Jan 02 17:36:25 localhost.localdomain kernel: BUG: key 935e5a959148 has not been registered!
Jan 03 03:29:08 localhost.localdomain kernel: BUG: key 8d425b0b9148 has not been registered!
Jan 03 03:33:35 localhost.localdomain kernel: BUG: key 8bc35aef9148 has not been registered!
Jan 03 16:47:44 localhost.localdomain kernel: BUG: key 9a3cdb959148 has not been registered!
Jan 06 14:59:58 localhost.localdomain kernel: BUG: key 97b6db9f9148 has not been registered!
Jan 07 14:51:49 localhost.localdomain kernel: BUG: key 8f2dda569148 has not been registered!
Jan 07 15:08:23 localhost.localdomain kernel: BUG: key a0849bd31148 has not been registered!
Jan 08 18:07:28 localhost.localdomain kernel: BUG: key 89721a0e9148 has not been registered!
Jan 08 18:12:51 localhost.localdomain kernel: BUG: key 8b521bda9148 has not been registered!
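
The key values in the journal output above differ on every boot, so it can help to compare only their low-order bits. The sketch below pulls the keys out of journal text and computes each key's offset within a 4 KiB page. The 4 KiB page size is an assumption based on the x86_64 system shown in the logs, and the function name is hypothetical.

```python
import re

# \s+ lets the pattern match whether or not the mail client wrapped the line
# between "key" and the hexadecimal value.
KEY_RE = re.compile(r"BUG: key\s+([0-9a-f]+)\s+has not been registered!")


def page_offsets(journal_text: str) -> dict:
    """Map each reported lockdep key to its offset within a 4 KiB page."""
    return {key: int(key, 16) & 0xFFF for key in KEY_RE.findall(journal_text)}


sample = (
    "Dec 31 05:19:42 localhost.localdomain kernel: BUG: key\n"
    "98b59ab01148 has not been registered!\n"
    "Jan 08 18:12:51 localhost.localdomain kernel: BUG: key\n"
    "8b521bda9148 has not been registered!\n"
)
for key, offset in page_offsets(sample).items():
    print(key, hex(offset))
```

An identical offset across boots would be consistent with the same object field being reported each time, but that is only an inference and is not stated in this thread.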

Here is the trace:
[6.333672] [drm] REG_WAIT timeout 1us * 10 tries -
mpc2_assert_idle_mpcc line:480
[6.335258] BUG: key 8b521bda9148 has not been registered!
[6.335271] [ cut here ]
[6.335273] DEBUG_LOCKS_WARN_ON(1)
[6.335279] WARNING: CPU: 18 PID: 525 at
kernel/locking/lockdep.c:4618 lockdep_init_map_waits+0x18b/0x210
[6.335284] Modules linked in: fjes(-) amdgpu(+) iommu_v2 gpu_sched
ttm drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel cec drm
ghash_clmulni_intel ccp igb nvme nvme_core dca i2c_algo_bit wmi
pinctrl_amd fuse
[6.335298] CPU: 18 PID: 525 Comm: systemd-udevd Not tainted
5.10.0-0.rc6.20201204git34816d20f173.92.fc34.x86_64 #1
[6.335302] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020
[6.335306] RIP: 0010:lockdep_init_map_waits+0x18b/0x210
[6.335309] Code: 00 85 c0 0f 84 75 ff ff ff 8b 3d 18 c4 f1 01 85
ff 0f 85 67 ff ff ff 48 c7 c6 68 43 60 97 48 c7 c7 1d 90 5a 97 e8 70
1f b6 00 <0f> 0b e9 4d ff ff ff e8 19 59 bc 00 85 c0 74 21 44 8b 1d e6
c3 f1
[6.335315] RSP: 0018:9e5a013d3910 EFLAGS: 00010282
[6.335317] RAX: 0016 RBX: 97247d80 RCX: 8b5908fdb238
[6.335320] RDX: ffd8 RSI: 0027 RDI: 8b5908fdb230
[6.335322] RBP: 8b520e2a7978 R08:  R09: 
[6.335325] R10: 9e5a013d3740 R11: 8b592e2fffe8 R12: 8b521bda9148
[6.335327] R13:  R14: 8b521bc30330 R15: 8b521bc30330
[6.335330] FS:  7fe019eb9140() GS:8b5908e0()
knlGS:
[6.335333] CS:  0010 DS:  ES:  CR0: 80050033
[6.335336] CR2: 7fe018f5e000 CR3: 0001142ee000 CR4: 00350ee0
[6.335338] Call Trace:
[6.335342]  __kernfs_create_file+0x7b/0x100
[6.335344]  sysfs_add_file_mode_ns+0xa3/0x190
[6.335347]  sysfs_create_bin_file+0x50/0x70
[6.335428]  hdcp_create_workqueue+0x3bd/0x410 [amdgpu]
[6.335499]  amdgpu_dm_init.isra.0.cold+0x136/0x126d [amdgpu]
[6.335570]  ? psp_set_srm+0xb0/0xb0 [amdgpu]
[6.335637]  ? hdcp_update_display+0x1f0/0x1f0 [amdgpu]
[6.335641]  ? dev_printk_emit+0x3e/0x40
[6.335709]  dm_hw_init+0xe/0x20 [amdgpu]
[6.335776]  amdgpu_device_init.cold+0x18c3/0x1bbc [amdgpu]
[6.335781]  ? pci_bus_read_config_word+0x39/0x50
[6.335831]  amdgpu_driver_load_kms+0x2b/0x1f0 [amdgpu]
[6.335879]  amdgpu_pci_probe+0x129/0x1b0 [amdgpu]
[6.335889]  local_pci_probe+0x42/0x80
[6.335891]  pci_device_probe+0xd9/0x1a0
[6.335896]  really_probe+0x205/0x460
[6.335898]  driver_probe_device+0xe1/0x150
[6.335901]  device_driver_attach+0xa8/0xb0
[6.335904]  __driver_attach+0x8c/0x150
[6.335907]  ? device_driver_attach+0xb0/0xb0
[6.335909]  ? device_driver_attach+0xb0/0xb0
[6.335911]  bus_for_each_dev+0x67/0x90
[6.335914]  bus_add_driver+0x12e/0x1f0
[6.335917]  driver_register+0x8b/0xe0
[6.335919]  ? 0xc0e4c000
[6.335922]  do_one_initcall+0x67/0x320
[6.335925]  ? rcu_read_lock_sched_held+0x3f/0x80
[6.335928]  ? trace_kmalloc+0xb2/0xe0
[6.335930]  ? kmem_cache_alloc_trace+0x157/0x270
[6.335934]  do_init_module+0x5c/0x260
[6.335936]  __do_sys_init_module+0x13d/0x1a0
[6.335940]  do_syscall_64+0x33/0x40
[6.335943]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[6.335945] RIP: 0033:0x7fe01aab2efe
[6.335948] Code: 48 8b 0d 7d 1f 0c 00 f7 d8 64 89 01 48 83 c8 ff
c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4a 1f 0c 00 f7 d8 64 89
01 48
[6.335953] RSP: 002b:7ffdf4879928 EFLAGS: 0246 ORIG_RAX:
00af
[6.335957] RAX:
