https://bugs.kde.org/show_bug.cgi?id=453147

            Bug ID: 453147
           Summary: amdgpu: GPU reset crash loop
           Product: kwin
           Version: 5.24.4
          Platform: Archlinux Packages
                OS: Linux
            Status: REPORTED
          Severity: crash
          Priority: NOR
         Component: core
          Assignee: kwin-bugs-n...@kde.org
          Reporter: matteo....@gmail.com
  Target Milestone: ---

SUMMARY
I have a Vulkan application (a little-known game engine) that sometimes
triggers a GPU reset. The most recent cause seems to be a GPU fence timeout,
but other API misuse may have triggered resets on earlier occasions.
When this happens, the GPU attempts to reset, but kwin seems to be unable to
recover and triggers further GPU resets in an almost endless loop. After the
first reset the mouse input is unresponsive, probably because the video
buffers are no longer updated.


STEPS TO REPRODUCE
1. Run KDE Plasma in Wayland mode on the open-source AMD graphics stack
2. Trigger a GPU reset (it happens accidentally for me; is there a way to
   reset it manually for testing?)
3. Enjoy the corruption loop
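
Regarding step 2, one possible way to trigger a reset deliberately is
sketched below. It has not been verified on this setup. It assumes that the
amdgpu driver exposes its `amdgpu_gpu_recover` debugfs file, that debugfs is
mounted at /sys/kernel/debug, that the GPU is DRI device 0, and that the
script runs as root. The function names are hypothetical helpers.

```python
from pathlib import Path

# Assumption: debugfs is mounted in its usual location.
DEBUGFS_DRI = Path("/sys/kernel/debug/dri")


def gpu_recover_path(card_index=0, debugfs_dri=DEBUGFS_DRI):
    """Build the path of the amdgpu recovery file for one DRI device."""
    return Path(debugfs_dri) / str(card_index) / "amdgpu_gpu_recover"


def trigger_gpu_reset(card_index=0, debugfs_dri=DEBUGFS_DRI):
    """Read the recovery file; on amdgpu the read itself requests a reset.

    Needs root. Expect the same screen corruption described in this report,
    so save your work before calling it.
    """
    return gpu_recover_path(card_index, debugfs_dri).read_text()


# Usage (as root): trigger_gpu_reset(0)
```

The card index may differ on machines with more than one GPU.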

OBSERVED RESULT
The screens display corrupted images, with random pixels in a grid pattern
turning fully green or red. The corruption increases with every GPU reset.
After a number of resets the screen drops to a TTY (I guess kwin has crashed
completely).

dmesg output:
```
[25894.802524] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for
fences timed out!
[25895.015855] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for
fences timed out!
[25899.719088] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0
timeout, signaled seq=2308869, emitted seq=2308871
[25899.719267] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information:
process Tests pid 33837 thread Tests:cs0 pid 33839
[25899.719426] amdgpu 0000:44:00.0: amdgpu: GPU reset begin!
[25900.195085] amdgpu 0000:44:00.0: [drm:amdgpu_ring_test_helper [amdgpu]]
*ERROR* ring kiq_2.1.0 test failed (-110)
[25900.195246] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[25900.419828] amdgpu 0000:44:00.0: [drm:amdgpu_ring_test_helper [amdgpu]]
*ERROR* ring kiq_2.1.0 test failed (-110)
[25900.419986] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[25900.644591] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[25900.651532] [drm] free PSP TMR buffer
[25900.683662] amdgpu 0000:44:00.0: amdgpu: BACO reset
[25902.770356] amdgpu 0000:44:00.0: amdgpu: GPU reset succeeded, trying to
resume
[25902.770506] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[25902.770522] [drm] VRAM is lost due to GPU reset!
[25902.770523] [drm] PSP is resuming...
[25902.816962] [drm] reserve 0x900000 from 0x81fe400000 for PSP TMR
[25902.858907] amdgpu 0000:44:00.0: amdgpu: RAS: optional ras ta ucode is not
available
[25902.864919] amdgpu 0000:44:00.0: amdgpu: RAP: optional rap ta ucode is not
available
[25902.864921] amdgpu 0000:44:00.0: amdgpu: SECUREDISPLAY: securedisplay ta
ucode is not available
[25902.864924] amdgpu 0000:44:00.0: amdgpu: SMU is resuming...
[25902.867331] amdgpu 0000:44:00.0: amdgpu: SMU is resumed successfully!
[25903.171192] [drm] kiq ring mec 2 pipe 1 q 0
[25903.172711] [drm] VCN decode and encode initialized successfully(under DPG
Mode).
[25903.172796] [drm] JPEG decode initialized successfully.
[25903.172812] amdgpu 0000:44:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on
hub 0
[25903.172815] amdgpu 0000:44:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1
on hub 0
[25903.172816] amdgpu 0000:44:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4
on hub 0
[25903.172817] amdgpu 0000:44:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5
on hub 0
[25903.172818] amdgpu 0000:44:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6
on hub 0
[25903.172819] amdgpu 0000:44:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7
on hub 0
[25903.172820] amdgpu 0000:44:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8
on hub 0
[25903.172821] amdgpu 0000:44:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9
on hub 0
[25903.172822] amdgpu 0000:44:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10
on hub 0
[25903.172823] amdgpu 0000:44:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11
on hub 0
[25903.172824] amdgpu 0000:44:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on
hub 0
[25903.172825] amdgpu 0000:44:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on
hub 0
[25903.172826] amdgpu 0000:44:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on
hub 1
[25903.172826] amdgpu 0000:44:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on
hub 1
[25903.172827] amdgpu 0000:44:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on
hub 1
[25903.172828] amdgpu 0000:44:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on
hub 1
[25903.175542] amdgpu 0000:44:00.0: amdgpu: recover vram bo from shadow start
[25903.175638] amdgpu 0000:44:00.0: amdgpu: recover vram bo from shadow done
[25903.175640] [drm] Skip scheduling IBs!
...
[26254.330249] [drm] Skip scheduling IBs!
[26254.330170] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize
parser -125!
[26254.993088] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize
parser -125!
[26254.993864] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize
parser -125!
[26264.998580] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize
parser -125!
[26265.033825] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize
parser -125!
[26265.034217] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize
parser -125!
[26265.036125] amdgpu 0000:44:00.0: amdgpu: [gfxhub] page fault (src_id:0
ring:160 vmid:6 pasid:32770, for process kwin_wayland pid 1105 thread
kwin_wayla:cs0 pid 1119)
[26265.036132] amdgpu 0000:44:00.0: amdgpu:   in page starting at address
0x000080093582b000 from client 0x1b (UTCL2)
[26265.036135] amdgpu 0000:44:00.0: amdgpu:
GCVM_L2_PROTECTION_FAULT_STATUS:0x00640D40
[26265.036136] amdgpu 0000:44:00.0: amdgpu:      Faulty UTCL2 client ID: CPG
(0x6)
[26265.036137] amdgpu 0000:44:00.0: amdgpu:      MORE_FAULTS: 0x0
[26265.036139] amdgpu 0000:44:00.0: amdgpu:      WALKER_ERROR: 0x0
[26265.036139] amdgpu 0000:44:00.0: amdgpu:      PERMISSION_FAULTS: 0x4
[26265.036140] amdgpu 0000:44:00.0: amdgpu:      MAPPING_ERROR: 0x1
[26265.036141] amdgpu 0000:44:00.0: amdgpu:      RW: 0x1
[26265.704684] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize
parser -125!
[26265.705057] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize
parser -125!
[26266.412182] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize
parser -125!
[26266.412557] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize
parser -125!
[26267.192539] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize
parser -125!
[26267.192934] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize
parser -125!
[26271.297885] kauditd_printk_skb: 3 callbacks suppressed
[26271.297889] audit: type=1701 audit(1651156238.938:197): auid=1000 uid=1000
gid=1000 ses=2 pid=1294 comm="kded5" exe="/usr/bin/kded5" sig=11 res=1
[26271.310976] audit: type=1334 audit(1651156238.951:198): prog-id=45 op=LOAD
[26271.311303] audit: type=1334 audit(1651156238.951:199): prog-id=46 op=LOAD
[26271.311561] audit: type=1334 audit(1651156238.951:200): prog-id=47 op=LOAD
[26271.312456] audit: type=1130 audit(1651156238.954:201): pid=1 uid=0
auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@4-35394-0
comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=?
res=success'
[26275.178552] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0
timeout, signaled seq=2360748, emitted seq=2360751
[26275.178734] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information:
process kwin_wayland pid 1105 thread kwin_wayla:cs0 pid 1119
[26275.178900] amdgpu 0000:44:00.0: amdgpu: GPU reset begin!
[26275.664675] amdgpu 0000:44:00.0: [drm:amdgpu_ring_test_helper [amdgpu]]
*ERROR* ring kiq_2.1.0 test failed (-110)
[26275.664810] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[26275.888265] amdgpu 0000:44:00.0: [drm:amdgpu_ring_test_helper [amdgpu]]
*ERROR* ring kiq_2.1.0 test failed (-110)
[26275.888400] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[26276.111834] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[26276.118636] [drm] free PSP TMR buffer
[26276.150717] amdgpu 0000:44:00.0: amdgpu: BACO reset
[26278.236347] amdgpu 0000:44:00.0: amdgpu: GPU reset succeeded, trying to
resume
[26278.236619] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[26278.236635] [drm] VRAM is lost due to GPU reset!
[26278.236636] [drm] PSP is resuming...
[26278.283089] [drm] reserve 0x900000 from 0x81fe400000 for PSP TMR
[26278.324995] amdgpu 0000:44:00.0: amdgpu: RAS: optional ras ta ucode is not
available
[26278.330950] amdgpu 0000:44:00.0: amdgpu: RAP: optional rap ta ucode is not
available
[26278.330952] amdgpu 0000:44:00.0: amdgpu: SECUREDISPLAY: securedisplay ta
ucode is not available
[26278.330954] amdgpu 0000:44:00.0: amdgpu: SMU is resuming...
[26278.333221] amdgpu 0000:44:00.0: amdgpu: SMU is resumed successfully!
[26278.637193] [drm] kiq ring mec 2 pipe 1 q 0
[26278.638522] [drm] VCN decode and encode initialized successfully(under DPG
Mode).
[26278.638609] [drm] JPEG decode initialized successfully.
[26278.638625] amdgpu 0000:44:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on
hub 0
[26278.638627] amdgpu 0000:44:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1
on hub 0
[26278.638629] amdgpu 0000:44:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4
on hub 0
[26278.638630] amdgpu 0000:44:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5
on hub 0
[26278.638631] amdgpu 0000:44:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6
on hub 0
[26278.638632] amdgpu 0000:44:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7
on hub 0
[26278.638634] amdgpu 0000:44:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8
on hub 0
[26278.638634] amdgpu 0000:44:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9
on hub 0
[26278.638635] amdgpu 0000:44:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10
on hub 0
[26278.638636] amdgpu 0000:44:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11
on hub 0
[26278.638637] amdgpu 0000:44:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on
hub 0
[26278.638638] amdgpu 0000:44:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on
hub 0
[26278.638639] amdgpu 0000:44:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on
hub 1
[26278.638640] amdgpu 0000:44:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on
hub 1
[26278.638640] amdgpu 0000:44:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on
hub 1
[26278.638641] amdgpu 0000:44:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on
hub 1
[26278.640920] amdgpu 0000:44:00.0: amdgpu: recover vram bo from shadow start
[26278.641188] amdgpu 0000:44:00.0: amdgpu: recover vram bo from shadow done
[26278.641189] [drm] Skip scheduling IBs!
[26278.641190] [drm] Skip scheduling IBs!
[26278.641260] [drm] Skip scheduling IBs!
[26278.641267] [drm] Skip scheduling IBs!
[26278.641271] [drm] Skip scheduling IBs!
[26278.641275] [drm] Skip scheduling IBs!
[26278.641281] [drm] Skip scheduling IBs!
[26278.641319] amdgpu 0000:44:00.0: amdgpu: GPU reset(23) succeeded!
[26278.641382] [drm] Skip scheduling IBs!
...
```
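
To make the log above easier to read, here is a minimal sketch that lists
which process the driver blames for each ring timeout. It assumes the log is
available as text in the mail-wrapped form shown here (continuation lines
have no timestamp). The function names and regular expressions are
illustrative only and are not part of any tool.

```python
import re

# A dmesg line starts with a bracketed timestamp such as "[25899.719088]".
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")
RING_TIMEOUT = re.compile(r"amdgpu_job_timedout.*ring (\S+) timeout")
PROCESS_INFO = re.compile(
    r"amdgpu_job_timedout.*Process information: process (\S+) pid (\d+)"
)


def unwrap(log_text):
    """Join wrapped continuation lines back onto their timestamped line."""
    entries = []
    for raw in log_text.splitlines():
        line = raw.strip()
        match = TIMESTAMP.match(line)
        if match:
            entries.append([float(match.group(1)), match.group(2)])
        elif line and entries:
            entries[-1][1] += " " + line
    return entries


def blamed_processes(log_text):
    """Return (timestamp, ring, process, pid) for each reported timeout."""
    results = []
    ring = None
    for stamp, message in unwrap(log_text):
        timeout = RING_TIMEOUT.search(message)
        if timeout:
            # Remember the ring; the process line follows in the log.
            ring = timeout.group(1)
            continue
        info = PROCESS_INFO.search(message)
        if info and ring is not None:
            results.append((stamp, ring, info.group(1), int(info.group(2))))
            ring = None
    return results
```

Run on the excerpt above, this should report the first timeout against the
`Tests` process and the second against `kwin_wayland`, which matches the
loop described in the summary.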

EXPECTED RESULT
The screen may flash, but I expect kwin to resume its normal operation
after a GPU reset.

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: Archlinux 5.17.4-arch1-1
KDE Plasma Version: 5.24.4-1
KDE Frameworks Version: 5.93.0-2
Qt Version: 5.15.3+kde+r145-1
Mesa Version: 22.0.2-1

ADDITIONAL INFORMATION
AMD 5700xt GPU
I can attach the full dmesg output if required; I just need to make sure it
contains no sensitive information first.

