Public bug reported:

## Summary

Framework Laptop 16 with AMD Phoenix APU (Radeon 780M, gfx1103, device
0x15bf) experiences intermittent MES (Micro Engine Scheduler) firmware
timeouts over a period of hours, eventually culminating in a gfxhub page
fault triggered by Chrome/Chromium GPU workloads. The subsequent ring
reset fails. A MODE2 GPU reset is then attempted and reports success,
but the GPU never actually recovers and instead produces an endless
stream of `wait_for_completion_timeout` errors. The display goes
permanently black, requiring a hard reboot. Audio over USB continues to
work, indicating the CPU is still running.

This has been occurring for approximately 3 months across OEM kernel
versions (6.14 and 6.17 series).

## Hardware

- **Laptop:** Framework Laptop 16
- **APU (iGPU):** AMD Phoenix, Radeon 780M (PCI 0000:c4:00.0, device 0x15bf, DCN 3.1.4, gfx_v11_0)
- **dGPU:** AMD Navi 33, Radeon RX 7700S (PCI 0000:03:00.0, device 0x7480, DCN 3.2.1)
- **Display setup:** Internal panel + 2x external 4K monitors, one via Caldigit TBT4 Element Hub (Thunderbolt 4), one direct USB-C
- **RAM:** Shared VRAM 8192M (iGPU), 8176M GDDR6 (dGPU)

## Software

- **OS:** Ubuntu 24.04 LTS
- **Kernel:** 6.17.0-1017-oem (linux-image-oem-24.04)
- **Mesa:** 25.2.8-0ubuntu0.24.04.1
- **RADV:** Mesa 25.2.8
- **Desktop:** GNOME on Wayland
- **Display Core:** v3.2.340

## Crash Pattern

The crash follows a consistent multi-stage pattern:

### Stage 1: Intermittent MES timeouts (hours before crash)

The iGPU's MES firmware intermittently fails to respond. These appear
throughout the session, hours before the fatal crash, and are not
associated with any specific userspace workload:

```
amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
amdgpu 0000:c4:00.0: amdgpu: failed to reg_write_reg_wait
```

In the attached log, these occur at 14:55, 15:27, 16:30, and 21:02 on
Apr 3, while the fatal crash does not occur until 10:44 on Apr 6. This
suggests a slowly degrading MES state.
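
To make the spacing of these timeouts easier to check against the attached journal, here is a minimal sketch that pulls out the MES timeout lines and prints the gaps between them. It assumes the default `journalctl` short prefix (`Mon DD HH:MM:SS host kernel: ...`). The hostname `fw16` and the `:00` seconds in `SAMPLE` are made up for illustration; only the hour and minute values come from the log.

```python
import re
from datetime import datetime

# Assumed journalctl "short" prefix; the excerpts in this report omit it.
MES_TIMEOUT = re.compile(
    r"^(?P<ts>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r".*MES failed to respond to msg=(?P<msg>.+)$"
)

def mes_timeouts(lines, year=2026):
    """Return (datetime, msg) for every MES timeout line.

    The short journal format carries no year, so one is supplied here.
    """
    hits = []
    for line in lines:
        m = MES_TIMEOUT.match(line.rstrip("\n"))
        if m:
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            hits.append((ts, m["msg"]))
    return hits

def gaps_minutes(hits):
    """Minutes between consecutive timeouts."""
    return [round((b[0] - a[0]).total_seconds() / 60)
            for a, b in zip(hits, hits[1:])]

# Hypothetical sample: hostname and seconds are invented.
PREFIX = "fw16 kernel: amdgpu 0000:c4:00.0: amdgpu: "
SAMPLE = [
    f"Apr 03 14:55:00 {PREFIX}MES failed to respond to msg=MISC (WAIT_REG_MEM)",
    f"Apr 03 14:55:00 {PREFIX}failed to reg_write_reg_wait",
    f"Apr 03 15:27:00 {PREFIX}MES failed to respond to msg=MISC (WAIT_REG_MEM)",
    f"Apr 03 16:30:00 {PREFIX}MES failed to respond to msg=MISC (WAIT_REG_MEM)",
    f"Apr 03 21:02:00 {PREFIX}MES failed to respond to msg=MISC (WAIT_REG_MEM)",
]

print(gaps_minutes(mes_timeouts(SAMPLE)))  # [32, 63, 272]
```

The gaps are irregular (roughly half an hour up to several hours), which fits the observation that the timeouts are not tied to a specific workload.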

### Stage 2: Page fault triggered by Chrome GPU workload

```
amdgpu 0000:c4:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:4 pasid:32773)
amdgpu 0000:c4:00.0: amdgpu:  Process chrome pid 5545 thread chrome:cs0 pid 5579
amdgpu 0000:c4:00.0: amdgpu:   in page starting at address 0x000000003f800000 from client 10
amdgpu 0000:c4:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00401430
amdgpu 0000:c4:00.0: amdgpu:          Faulty UTCL2 client ID: SQC (data) (0xa)
amdgpu 0000:c4:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
amdgpu 0000:c4:00.0: amdgpu:          MAPPING_ERROR: 0x0
```
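
As a cross-check of the raw status word, the sketch below unpacks `0x00401430` into the fields the kernel printed. The bit positions are an assumption on my part (taken to match the GC 11 register layout) and are not stated anywhere in this report. They are only validated here against the decoded values visible in the log: `vmid:4`, client `0xa`, `PERMISSION_FAULTS: 0x3`, `MAPPING_ERROR: 0x0`.

```python
# Assumed field layout for GCVM_L2_PROTECTION_FAULT_STATUS,
# as name -> (shift, width). Treat as a hypothesis, not a spec.
FIELDS = {
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),      # UTCL2 client ID
    "VMID": (20, 4),
}

def decode_fault_status(status):
    """Extract each assumed bitfield from the 32-bit status word."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }

decoded = decode_fault_status(0x00401430)
for name, value in decoded.items():
    print(f"{name}: {value:#x}")
```

With this layout the client ID comes out as `0xa` and the VMID as `4`, matching the `SQC (data) (0xa)` and `vmid:4` values in the excerpt.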

### Stage 3: Ring timeout and failed ring reset

```
amdgpu 0000:c4:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=13348850, emitted seq=13348852
amdgpu 0000:c4:00.0: amdgpu: Starting gfx_0.0.0 ring reset
amdgpu 0000:c4:00.0: amdgpu: Ring gfx_0.0.0 reset failed
```
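
The two sequence numbers in the timeout line can be compared directly. A minimal sketch, assuming the message format shown above and reading `emitted - signaled` as the number of submissions that had not completed when the timeout fired (my interpretation, not something the log states):

```python
import re

RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)

def pending_fences(line):
    """Return (ring name, emitted - signaled), or None if no match."""
    m = RING_TIMEOUT.search(line)
    if m is None:
        return None
    return m["ring"], int(m["emitted"]) - int(m["signaled"])

line = ("amdgpu 0000:c4:00.0: amdgpu: ring gfx_0.0.0 timeout, "
        "signaled seq=13348850, emitted seq=13348852")
print(pending_fences(line))  # ('gfx_0.0.0', 2)
```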

### Stage 4: GPU reset — reports success but fails to recover

```
amdgpu 0000:c4:00.0: amdgpu: GPU reset begin!
amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
amdgpu 0000:c4:00.0: amdgpu: failed to unmap legacy queue
[drm:gfx_v11_0_cp_gfx_enable.isra.0 [amdgpu]] *ERROR* failed to halt cp gfx
amdgpu 0000:c4:00.0: amdgpu: MODE2 reset
amdgpu 0000:c4:00.0: amdgpu: GPU reset succeeded, trying to resume
```

### Stage 5: GPU never actually recovers

Despite reporting success, the GPU enters an unrecoverable state with
repeating errors every ~10 seconds:

```
amdgpu 0000:c4:00.0: amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout!
```

Display is permanently black. System requires hard reboot.
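
To check whether another log follows the same five-stage sequence, the stages can be sketched as a simple substring classifier. The marker strings are copied from the excerpts above. The grouping into stages mirrors this report's headings and is only a reading aid, not something the driver reports.

```python
# (stage, marker) pairs; first match wins.
STAGE_MARKERS = [
    (1, "MES failed to respond to msg=MISC"),
    (1, "failed to reg_write_reg_wait"),
    (2, "[gfxhub] page fault"),
    (3, "timeout, signaled seq="),
    (3, "ring reset"),
    (3, "reset failed"),
    (4, "GPU reset begin!"),
    (4, "MES failed to respond to msg=REMOVE_QUEUE"),
    (4, "failed to unmap legacy queue"),
    (4, "failed to halt cp gfx"),
    (4, "MODE2 reset"),
    (4, "GPU reset succeeded"),
    (5, "wait_for_completion_timeout"),
]

def classify(line):
    """Return the stage number for a log line, or None if unrelated."""
    for stage, marker in STAGE_MARKERS:
        if marker in line:
            return stage
    return None

def stages_in_order(lines):
    """Stages in the order they first appear in the log."""
    seen = []
    for line in lines:
        stage = classify(line)
        if stage is not None and stage not in seen:
            seen.append(stage)
    return seen
```

A log that matches this report should yield `[1, 2, 3, 4, 5]` from `stages_in_order`. A log that stops at stage 4 would indicate the reset actually recovered.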

## Additional observations

- The dGPU (0000:03:00.0) shows `SMU driver if version not matched` on every resume but does not exhibit MES failures.
- Frequent `DMUB HPD IRQ callback: link_index=5` events (~every 2-10 minutes) suggest the Thunderbolt dock's DisplayPort tunnel is intermittently renegotiating. This may be contributing to MES firmware stress.
- `gnome-shell: Cursor update failed: drmModeAtomicCommit: Invalid argument` appears occasionally throughout the session.
- The crash occurs on the iGPU which drives the displays. The dGPU has no CRTC connected.
- Occurs with both ANGLE-on-OpenGL and ANGLE-on-Vulkan Chrome configurations.
- `amdgpu.dcdebugmask=0x410` kernel parameter did not resolve the issue.

## Attached files

1. `fw16-crash-log.txt`: Filtered journal from the crash boot (amdgpu/drm/MES/reset related lines)
2. `fw16-full-journal-crash-boot.txt`: Full unfiltered journal from the crash boot
3. `about-gpu.txt`: chrome://gpu output showing driver/rendering configuration

## Potentially related upstream issues

- https://gitlab.freedesktop.org/drm/amd/-/issues/4296 (amdgpu thread safety / MES issues)
- https://community.frame.work/t/amdgpu-gfxhub-page-fault-display-freezing-in-ubuntu-25-10/80712

ProblemType: Bug
DistroRelease: Ubuntu 24.04
Package: linux-image-6.17.0-1017-oem 6.17.0-1017.17
ProcVersionSignature: Ubuntu 6.17.0-1017.17-oem 6.17.13
Uname: Linux 6.17.0-1017-oem x86_64
ApportVersion: 2.28.1-0ubuntu3.8
Architecture: amd64
AudioDevicesInUse:
 USER        PID ACCESS COMMAND
 /dev/snd/controlC2:  scotty     4339 F.... wireplumber
 /dev/snd/controlC0:  scotty     4339 F.... wireplumber
 /dev/snd/controlC1:  scotty     4339 F.... wireplumber
 /dev/snd/seq:        scotty     4333 F.... pipewire
CasperMD5CheckResult: pass
CurrentDesktop: ubuntu:GNOME
Date: Mon Apr  6 12:07:24 2026
InstallationDate: Installed on 2024-04-17 (720 days ago)
InstallationMedia: Ubuntu 22.04.4 LTS "Jammy Jellyfish" - Release amd64 (20240220)
MachineType: Framework Laptop 16 (AMD Ryzen 7040 Series)
ProcEnviron:
 LANG=en_US.UTF-8
 PATH=(custom, no user)
 SHELL=/usr/bin/zsh
 TERM=xterm-256color
 XDG_RUNTIME_DIR=<set>
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-6.17.0-1017-oem root=/dev/mapper/vgubuntu-root ro quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-6.17.0-1017-oem N/A
 linux-backports-modules-6.17.0-1017-oem  N/A
 linux-firmware                           20240318.git3b128b60-0ubuntu2.25
SourcePackage: linux-oem-6.17
UpgradeStatus: Upgraded to noble on 2024-09-13 (570 days ago)
dmi.bios.date: 12/22/2025
dmi.bios.release: 4.3
dmi.bios.vendor: INSYDE Corp.
dmi.bios.version: 04.03
dmi.board.asset.tag: *
dmi.board.name: FRANMZCP09
dmi.board.vendor: Framework
dmi.board.version: A9
dmi.chassis.asset.tag: FRAGACCPA94083000M
dmi.chassis.type: 10
dmi.chassis.vendor: Framework
dmi.chassis.version: A9
dmi.modalias: dmi:bvnINSYDECorp.:bvr04.03:bd12/22/2025:br4.3:svnFramework:pnLaptop16(AMDRyzen7040Series):pvrA9:rvnFramework:rnFRANMZCP09:rvrA9:cvnFramework:ct10:cvrA9:skuFRAGACCP09:
dmi.product.family: 16in Laptop
dmi.product.name: Laptop 16 (AMD Ryzen 7040 Series)
dmi.product.sku: FRAGACCP09
dmi.product.version: A9
dmi.sys.vendor: Framework

** Affects: linux-oem-6.17 (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: amd64 apport-bug noble wayland-session

** Attachment added: "journalctl -b -2 | grep -E "(amdgpu|gfx|drm|page fault|reset|wedged|MES|timeout)" > ~/fw16-crash-log.txt"
   https://bugs.launchpad.net/bugs/2147367/+attachment/5959054/+files/fw16-crash-log.txt
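
For anyone who wants to reproduce the filter in a script rather than a shell pipeline, a rough Python equivalent of the `grep -E` expression used for the attachment might look like this. The alternation is copied from the command above and matching is case-sensitive like `grep -E`. The script name in the usage line is hypothetical.

```python
import re
import sys

# Same alternation as the grep -E expression used for the attachment.
KEYWORDS = re.compile(
    r"(amdgpu|gfx|drm|page fault|reset|wedged|MES|timeout)"
)

def filter_lines(lines):
    """Yield only the lines that contain one of the keywords."""
    for line in lines:
        if KEYWORDS.search(line):
            yield line

if __name__ == "__main__":
    # Reads journal text on stdin and writes matching lines to stdout.
    sys.stdout.writelines(filter_lines(sys.stdin))
```

Usage would be along the lines of `journalctl -b -2 | python3 filter_journal.py > ~/fw16-crash-log.txt`.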

https://bugs.launchpad.net/bugs/2147367

Title:
  amdgpu: MES firmware intermittently unresponsive on Phoenix APU
  (gfx1103), leading to unrecoverable GPU hang on Framework 16
