[Bug 2153941] [NEW] amdgpu: MES wedges after rapid s2idle suspend/resume cascade on gfx1152 (Krackan Point, Radeon 860M)

Marco Hernandez Thu, 21 May 2026 13:35:42 -0700

Public bug reported:

# amdgpu: MES wedges after rapid s2idle suspend/resume cascade on gfx1152 (Krackan Point, Radeon 860M)


## Summary

After a cluster of rapid s2idle suspend/resume cycles triggered by USB-C
dock hotplug with the laptop lid closed, the AMD MES (Micro Engine
Scheduler) firmware on a Radeon 860M (gfx1152 / Krackan Point) wedges.
The GPU appears functional for some time after the cascade (minutes to
hours), then the kernel begins logging `MES failed to respond to
msg=MISC (WAIT_REG_MEM)`, escalating to `MES ring buffer is full.`
repeating every 2-3 seconds. Eventually a `Fence fallback timer expired`
message appears and the kernel hung-task watchdog fires on TTM kworkers,
ending in a hard system freeze that requires a physical power cycle. No
oops, no panic; the journal stops abruptly mid-stream.

This trigger pathway (s2idle suspend/resume cascade) does not appear to
be documented upstream — existing reports of the same MES error
signature on related silicon cite sustained compute load (vLLM, Ollama,
Chromium GPU acceleration) as the trigger, not suspend/resume.

## Hardware

- **Machine**: Lenovo ThinkPad P16s Gen 4 AMD, machine type 21QRCTO1WW. This 
shares the LVFS system-firmware capsule (`com.lenovo.ThinkPadR2XSG.firmware`, 
GUID `5fd8765e-5bd9-4eaa-bb4f-cc22bb1e764a`) with the entire ThinkPad T14 Gen 6 
/ T16 Gen 4 / P14s Gen 6 / P16s Gen 4 AMD family — machine types 21QJ, 21QK, 
21QL, 21QM, 21QN, 21QQ, 21QR, 21QS, 21RV, 21RW, 21RX, 21RY. Other affected 
users on this report are likely on related machine types within this family.
- **CPU**: AMD Ryzen AI 7 PRO 350 w/ Radeon 860M
- **iGPU**: PCI `1002:1114` (rev d2), Lenovo subsystem `17aa:512f`, identified 
by amdgpu as `mes_v11_0` IP block (Krackan Point, GC 11.5.x, gfx1152)
- **VBIOS**: `113-STRIXEMU-001` (standard retail string for this silicon 
family, confirmed via independent reports)
- **BIOS**: Lenovo `R2XET37W (1.17 )`, release date 2025-11-07
- **Sleep**: `/sys/power/mem_sleep` reports `[s2idle]` only — `deep` (S3) is 
not exposed by platform firmware
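
The sleep-mode check above can be scripted. The following is a minimal sketch, assuming the usual `/sys/power/mem_sleep` layout in which the available modes are space-separated and the active one is wrapped in square brackets; the function name `parse_mem_sleep` is hypothetical.

```python
def parse_mem_sleep(text):
    """Split the contents of /sys/power/mem_sleep into (active, available).

    Assumes the format 's2idle [deep]': space-separated modes with the
    currently selected one in square brackets.
    """
    modes = text.split()
    active = next((m.strip("[]") for m in modes if m.startswith("[")), None)
    available = [m.strip("[]") for m in modes]
    return active, available


if __name__ == "__main__":
    # Reads the live sysfs file; only meaningful on a Linux system.
    with open("/sys/power/mem_sleep") as f:
        active, available = parse_mem_sleep(f.read())
    print(f"active={active} available={available}")
    if "deep" not in available:
        print("S3 (deep) is not exposed by platform firmware")
```

On the machine in this report, the expectation is that `available` contains only `s2idle`.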

## Software

- **OS**: Ubuntu 24.04.4 LTS
- **Kernel**: `6.17.0-1023-oem` (Ubuntu OEM, based on mainline 6.17.13; `Linux 
6.17.0-1023-oem #23-Ubuntu SMP PREEMPT_DYNAMIC Fri May 8 06:02:38 UTC 2026`)
- **linux-firmware**: `20240318.git3b128b60-0ubuntu2.27`
- **Desktop**: GNOME Shell 46.0 on Wayland (`ubuntu:GNOME`)
- **GPU firmware versions** (per 
`/sys/kernel/debug/dri/0000:c4:00.0/amdgpu_firmware_info`):
  - ME feature version: 35, firmware version: 0x0000000d
  - PFP feature version: 35, firmware version: 0x00000011
  - MEC feature version: 35, firmware version: 0x00000011
  - RLC firmware version: 0x1152050b
  - IMU firmware version: 0x0b342000
  - SMC firmware version: 0x0b650e00 (101.14.0)
  - SDMA0 feature version: 60, firmware version: 0x0000000c
  - VCN firmware version: 0x0911800d
  - DMCUB firmware version: 0x09002c01
  - MES_KIQ feature version: 6, firmware version: 0x0000007b
  - **MES feature version: 1, firmware version: 0x00000086** (above the 0x7f 
threshold AMD's lr_compute_wa fix requires)
  - VPE feature version: 60, firmware version: 0x0000000f
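
For other affected users who want to compare their MES firmware against the threshold mentioned above, here is a rough sketch. It assumes the `amdgpu_firmware_info` debugfs file contains a line shaped like the MES entry quoted in this report, and it takes the 0x7f threshold from this report's reading of the upstream commit rather than from the kernel source. The names `mes_fw_version` and `LR_COMPUTE_WA_MIN_FW` are hypothetical.

```python
import re

# Threshold as quoted in this report; not verified against the kernel source.
LR_COMPUTE_WA_MIN_FW = 0x7F

# Anchored at line start so the separate "MES_KIQ feature version" line is skipped.
_MES_LINE = re.compile(
    r"^\s*MES feature version: \d+, firmware version: (0x[0-9a-fA-F]+)",
    re.MULTILINE,
)


def mes_fw_version(firmware_info):
    """Return the MES firmware version as an int, or None if no MES line is found."""
    match = _MES_LINE.search(firmware_info)
    return int(match.group(1), 16) if match else None


sample = """\
MES_KIQ feature version: 6, firmware version: 0x0000007b
MES feature version: 1, firmware version: 0x00000086
"""
version = mes_fw_version(sample)
print(hex(version), version >= LR_COMPUTE_WA_MIN_FW)
```

Reading the real file requires root and a mounted debugfs; the sample string above simply reuses the two MES lines from this report.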

## Reproduction

The trigger is reproducible in the abstract (it has occurred twice in
24 days of normal use):

1. Boot, log in, normal desktop usage; lid open, laptop on AC or battery.
2. At some point, **close the lid and plug in a USB-C dock** (in this case a 
Dell WD19 attached via USB-C, driving external displays).
3. GNOME `gnome-settings-daemon` does not detect external displays before 
responding to the lid-close event, so the lid-close-action fires (`suspend`) 
despite `lid-close-suspend-with-external-monitor = false`.
4. System enters s2idle. USB/Thunderbolt activity from the dock (enumeration, 
display detect) raises a wake event almost immediately.
5. System resumes. Lid is still closed. GNOME re-triggers lid-close action. → 
suspend again.
6. Steps 4-5 repeat, producing 10+ suspend/resume cycles in a few minutes (e.g. 
12 cycles in 5 minutes observed).
7. System then runs apparently normally for ~1-16 hours.
8. amdgpu begins logging the failure sequence below; system becomes 
unresponsive within minutes; hard reset required.
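
To check whether a given boot contained a cascade like the one in steps 4-6, the suspend entries in the journal can be counted with a sliding window. This is a sketch only: it assumes journal lines in the default short format (`May 21 10:20:00 host kernel: ...`) and assumes the kernel marks each entry with the string `PM: suspend entry (s2idle)`, which should be verified against the local journal. The year is not present in that format and is passed in as a parameter.

```python
from datetime import datetime, timedelta

# Assumed kernel message for s2idle entry; confirm it matches your own journal.
ENTRY_MARKER = "PM: suspend entry (s2idle)"


def parse_ts(line, year=2026):
    """Parse the leading 15-character syslog timestamp; the year is supplied by the caller."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def max_entries_in_window(lines, window=timedelta(minutes=5)):
    """Return the largest number of suspend entries that fall inside any one window."""
    times = sorted(parse_ts(l) for l in lines if ENTRY_MARKER in l)
    best, start = 0, 0
    for end, t in enumerate(times):
        # Advance the window start until the span fits inside the window again.
        while t - times[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best
```

The input could be the output of `journalctl -k -b -1` split into lines; a result of 10 or more within five minutes would match the cascade described here.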

## Observed kernel log signature (most recent occurrence)

```
May 21 11:48:29 kernel: amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
May 21 11:48:29 kernel: amdgpu 0000:c4:00.0: amdgpu: failed to reg_write_reg_wait
[repeats every 2-3 seconds, ~15 times]
May 21 11:49:10 kernel: amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
[repeats every 2-3 seconds, dozens of times]
May 21 11:52:24 kernel: amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
[journal cut off — system hard-reset]
```
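
For anyone comparing their own logs against this signature, the escalation stages can be tallied with a small script. This is a sketch that matches only the substrings quoted in this report; the stage names are made up for illustration and other kernels may word the messages differently.

```python
from collections import Counter

# (hypothetical stage name, substring copied from the log excerpts in this report)
STAGES = [
    ("mes_no_response", "MES failed to respond"),
    ("reg_write_failed", "failed to reg_write_reg_wait"),
    ("ring_full", "MES ring buffer is full"),
    ("fence_fallback", "Fence fallback timer expired"),
    ("hung_task", "blocked for more than"),
]


def classify(line):
    """Return the stage name for a log line, or None if it matches no stage."""
    for name, needle in STAGES:
        if needle in line:
            return name
    return None


def summarize(lines):
    """Count how many lines fall into each stage."""
    return Counter(stage for stage in map(classify, lines) if stage)
```

Feeding it the kernel log of the previous boot gives a per-stage count, which makes it easy to see how far the escalation got before the freeze.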

On the prior occurrence (after ~23 days of uptime including many suspend
cycles), the journal also captured:

```
May 19 23:57:34 kernel: INFO: task kworker/R-ttm:1479 blocked for more than 121 seconds.
May 19 23:57:34 kernel: INFO: task kworker/R-ttm:1479 is blocked on a mutex likely owned by task kworker/13:38:1493596.
[10+ other kthreads queued behind the same mutex]
May 20 10:33:52 kernel: amdgpu 0000:c4:00.0: amdgpu: Fence fallback timer expired on ring gfx_0.0.0
```

No kernel panic, no oops, no entries in `/sys/fs/pstore/` or
`/var/crash/`. systemd-journal terminates abruptly (no clean shutdown
trace), consistent with a hard freeze necessitating manual power-off.

The full preceding s2idle cascade is visible in `journalctl -b -1`:

```
[overnight: 1 long s2idle cycle 01:42 → 10:19 next day]
May 21 10:19:32 kernel: amdgpu 0000:c4:00.0: amdgpu: SMU is resumed successfully!
[then dock plug-in at ~10:20]
May 21 10:20:00 → 10:25:47 — 12 s2idle entry/exit cycles, intervals 9-30 seconds
May 21 11:48:29 — first "MES failed to respond" (1h23min after the cascade ends)
```

The first `amdgpu: MES failed to respond` is **always** preceded by a
successful resume sequence and a subsequent quiet period. No errors are
logged during the suspend storm itself.
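
The length of the quiet period can be worked out directly from the two timestamps quoted above. A minimal sketch, assuming both events fall in the same year (the journal's short format does not record it, so 2026 is supplied from the report date):

```python
from datetime import datetime


def parse_ts(stamp, year=2026):
    """Parse a 'May 21 10:25:47' style timestamp; the year is an assumption."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


cascade_end = parse_ts("May 21 10:25:47")    # last s2idle exit in the cascade
first_failure = parse_ts("May 21 11:48:29")  # first "MES failed to respond"
quiet_period = first_failure - cascade_end
print(quiet_period)  # 1:22:42
```

That is the roughly 1h23min gap noted in the excerpt, during which the journal shows no amdgpu errors.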

## What's been tried / current understanding

This bug appears to be a previously-undocumented trigger pathway into
the same upstream MES wedge class tracked at:

- `freedesktop.org/drm/amd/issues/4749` (cited as canonical by NixOS-hardware 
#1801; access bot-protected)
- ROCm/ROCm #5151, #5590, #5724, **#5844** (Krackan Point gfx1152 specifically 
— open / triage)
- Framework community thread *AMD GPU MES Timeouts Causing System Hangs on 
Framework Laptop 13 (AMD AI 300 Series)*
- amd-gfx mailing list, Dec 2025 ("Long-run MES hang leads to global fence 
starvation under sustained compute (gfx11 / Strix gfx1150)")

All of the above describe sustained compute work as the trigger. None
mentions s2idle suspend/resume.

AMD's Mario Limonciello has stated in the Framework thread that a
kernel-side workaround landed in 6.17.2 / 6.18-rc1 (commit `1fb7107`
"drm/amdgpu: Enable MES lr_compute_wa by default", gated on MES FW >=
0x7f). This system is on Ubuntu OEM 6.17.13 (well past 6.17.2) with MES
FW 0x86 (above 0x7f), so that fix is presumably active — but the wedge
still occurs, suggesting either the fix is incomplete or the s2idle-
cascade pathway hits a code path the workaround doesn't cover.
Subsequent commit `6b0d812` "Disable MES LR compute W/A" (Feb 2026)
suggests the lr_compute_wa fix was itself problematic and is being
reverted in the 6.19 cycle.

`amdgpu.cwsr_enable=0` has been a working workaround for some users on
gfx1150 (Strix Point) but is reported to NOT work on gfx1152 (per ROCm
#5844 reporter, who had `cwsr_enable=0` on the cmdline and still hit the
hang during plain desktop use). It has been added to the kernel command
line on this system (via `/etc/default/grub` + `update-grub`) and is
awaiting reboot at the time of filing; outcome pending.
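
After the reboot, whether the parameter actually reached the kernel can be confirmed from `/proc/cmdline`. The sketch below is a simple whitespace tokenizer and does not handle quoted values containing spaces; the function name `cmdline_params` is hypothetical.

```python
def cmdline_params(cmdline):
    """Parse a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Command line captured by apport at filing time (before the reboot).
before = (
    "BOOT_IMAGE=/vmlinuz-6.17.0-1023-oem "
    "root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash vt.handoff=7"
)
after = before + " amdgpu.cwsr_enable=0"

print("amdgpu.cwsr_enable" in cmdline_params(before))   # False
print(cmdline_params(after).get("amdgpu.cwsr_enable"))  # 0
```

On a live system the string would come from reading `/proc/cmdline` rather than from the literals above.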

## Workarounds applied locally

- `gsettings set org.gnome.settings-daemon.plugins.power lid-close-ac-action 'nothing'`
- `gsettings set org.gnome.settings-daemon.plugins.power lid-close-battery-action 'nothing'`
- `gsettings set org.gnome.settings-daemon.plugins.power sleep-inactive-battery-type 'nothing'`
- (`sleep-inactive-ac-type` was already `'nothing'`)

These break the trigger cascade by removing the lid-close-triggered
suspend. After applying, dock + closed lid no longer initiates suspend
cycles. This is a workaround for the GNOME race condition (the
`gnome-settings-daemon` lid-close inhibitor is not installed in time when
the external monitor list hasn't been refreshed yet; separately documented
at LP #2075983), and indirectly avoids the amdgpu MES wedge by avoiding
the rapid suspend/resume cascade entirely.

## What I'd expect to see investigated

1. Whether the s2idle resume path on gfx1152 is leaking MES queue state across 
rapid cycles (e.g. queue IDs not being released cleanly, or KIQ reinit racing 
with subsequent suspend entry).
2. Whether 10+ s2idle cycles within ~5 minutes hits some accumulated-state 
limit that single resumes never trigger in normal testing.
3. Whether the latent "successful for hours, then wedges" pattern is the GPU 
executing on a queue state that became inconsistent during the suspend cascade.

## Diagnostic data available on request

- Full `journalctl -b -1` from both freeze occurrences
- `amd_s2idle.py` capture (not yet run; available on request)
- `/proc/acpi/wakeup`, `/proc/cmdline`, full `amdgpu_firmware_info`

## Tracker cross-references

- ROCm/ROCm#5844 (gfx1152, "just using the desktop" trigger, open) — most 
directly relevant for hardware match
- ROCm/ROCm#5590, #5724 (gfx1150/1151 with related signatures)
- Framework community thread for AI 300 series (sustained compute trigger): 
https://community.frame.work/t/amd-gpu-mes-timeouts-causing-system-hangs-on-framework-laptop-13-amd-ai-300-series/71364
- CachyOS forum: "Computer locking up / freezing with AMD Ryzen AI 7 PRO 350 / 
Radeon 860M in Thinkpad X13 Gen 6" — sibling ThinkPad chassis (X13 G6) with the 
*exact same APU* (Ryzen AI 7 PRO 350 / Radeon 860M / gfx1152). Confirms the bug 
is not chassis-specific within this APU family. 
https://discuss.cachyos.org/t/computer-locking-up-freezing-with-amd-ryzen-ai-7-pro-350-radeon-860m-in-thinkpad-x13-gen-6/28736
- Linux kernel commits: `1fb7107` (Sept 2025, enabled the lr_compute_wa) and 
`6b0d812` (Feb 2026, reverting it) — `drivers/gpu/drm/amd/amdgpu/mes_v11_0.c`
- LP #2075983 (gnome-settings-daemon lid behavior gsetting documented as dead 
code by Canonical engineer Daniel van Vugt)
- GNOME gnome-settings-daemon issue #375 
(https://gitlab.gnome.org/GNOME/gnome-settings-daemon/-/issues/375) — same 
dock-with-closed-lid race against monitor enumeration. Open, unfixed.
- LP #2135274 (STRIXEMU VBIOS string observed on retail ASUS ProArt — 
confirming this is the normal retail VBIOS string for Strix-family silicon, not 
engineering-sample)
- fwupd/firmware-lenovo#566 
(https://github.com/fwupd/firmware-lenovo/issues/566) — *separate* suspend 
regression on Lenovo BIOS 1.17 same machine family. Affected users there report 
`suspend (s2idle) reproducibly enters sleep successfully... but never resumes`. 
This is a *different* failure mode from the one in this report — here suspend 
resumes correctly and the GPU wedges later. Mentioned for completeness so 
triage doesn't conflate the two.
- Fedora discussion 178657 (T14 Gen 6 AMD, BIOS 1.17, suspend-doesn't-resume 
regression — same caveat as above)

ProblemType: Bug
DistroRelease: Ubuntu 24.04
Package: linux-image-6.17.0-1023-oem 6.17.0-1023.23
ProcVersionSignature: Ubuntu 6.17.0-1023.23-oem 6.17.13
Uname: Linux 6.17.0-1023-oem x86_64
ApportVersion: 2.28.1-0ubuntu3.8
Architecture: amd64
CRDA: N/A
CasperMD5CheckResult: pass
CurrentDesktop: ubuntu:GNOME
Date: Thu May 21 13:14:39 2026
InstallationDate: Installed on 2026-02-03 (107 days ago)
InstallationMedia: Ubuntu 24.04.3 LTS "Noble Numbat" - Release amd64 
(20250805.1)
MachineType: LENOVO 21QRCTO1WW
ProcEnviron:
 LANG=en_US.UTF-8
 PATH=(custom, no user)
 SHELL=/bin/bash
 TERM=xterm-256color
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-6.17.0-1023-oem 
root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-6.17.0-1023-oem N/A
 linux-backports-modules-6.17.0-1023-oem  N/A
 linux-firmware                           20240318.git3b128b60-0ubuntu2.27
SourcePackage: linux-oem-6.17
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 11/07/2025
dmi.bios.release: 1.17
dmi.bios.vendor: LENOVO
dmi.bios.version: R2XET37W (1.17 )
dmi.board.asset.tag: Not Available
dmi.board.name: 21QRCTO1WW
dmi.board.vendor: LENOVO
dmi.board.version: Not Defined
dmi.chassis.asset.tag: No Asset Tag
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: None
dmi.ec.firmware.release: 1.9
dmi.modalias: 
dmi:bvnLENOVO:bvrR2XET37W(1.17):bd11/07/2025:br1.17:efr1.9:svnLENOVO:pn21QRCTO1WW:pvrThinkPadP16sGen4AMD:rvnLENOVO:rn21QRCTO1WW:rvrNotDefined:cvnLENOVO:ct10:cvrNone:skuLENOVO_MT_21QR_BU_Think_FM_ThinkPadP16sGen4AMD:
dmi.product.family: ThinkPad P16s Gen 4 AMD
dmi.product.name: 21QRCTO1WW
dmi.product.sku: LENOVO_MT_21QR_BU_Think_FM_ThinkPad P16s Gen 4 AMD
dmi.product.version: ThinkPad P16s Gen 4 AMD
dmi.sys.vendor: LENOVO

** Affects: linux-oem-6.17 (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: amdgpu krackan-point mes oem-priority regression s2idle strix suspend

** Attachment added: "the description from above but as a markdown in case the 
formatting gets messed up"
   
https://bugs.launchpad.net/bugs/2153941/+attachment/5972336/+files/amdgpu-mes-suspend-cascade-bugreport.md

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2153941

Title:
  amdgpu: MES wedges after rapid s2idle suspend/resume cascade on
  gfx1152 (Krackan Point, Radeon 860M)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-oem-6.17/+bug/2153941/+subscriptions


