** Description changed:
I have a Lenovo ThinkPad P14s Gen 6 AMD with a Ryzen AI 7 PRO 350.
I have installed:
- Ubuntu 24.04 LTS
- kernel 6.17.0-1017-oem
- linux-image-6.17.0-1017-oem 6.17.0-1017.17
Regularly (at this point, about once per day), the system locks up. I
first assumed a kernel lockup, but I just noticed that the key
combination I thought would switch to terminal mode doesn't do that, so
I don't know whether the kernel has locked up or GNOME has frozen. I
haven't been able to identify a particular workload that causes this
behavior. At any given time, I am running Firefox, Brave, Mattermost,
VS Code, multiple terminal sessions, Obsidian, and perhaps some QEMU
VMs (no GUI).
Looking at the journal, after one recent lockup, I saw the following:
amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
workqueue: amdgpu_tlb_fence_work [amdgpu] hogged CPU for >13333us 35 times,
consider switching to WQ_UNBOUND
amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
INFO: task kworker/7:0:37823 blocked for more than 122 seconds.
Tainted: P O 6.17.0-1017-oem #17-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/7:0 state:D stack:0 pid:37823 tgid:37823 ppid:2
task_flags:0x4208060 flags:0x00004000
Workqueue: events amdgpu_tlb_fence_work [amdgpu]
Call Trace:
<TASK>
__schedule+0x30d/0x7a0
schedule+0x27/0x90
schedule_timeout+0x104/0x110
dma_fence_default_wait+0x1f0/0x250
? __pfx_dma_fence_default_wait_cb+0x10/0x10
dma_fence_wait_timeout+0x13a/0x170
amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
process_one_work+0x18e/0x3e0
worker_thread+0x2e3/0x420
? __pfx_worker_thread+0x10/0x10
kthread+0x10a/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x121/0x140
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
INFO: task kworker/15:0:39916 blocked for more than 122 seconds.
Tainted: P O 6.17.0-1017-oem #17-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/15:0 state:D stack:0 pid:39916 tgid:39916 ppid:2
task_flags:0x4208060 flags:0x00004000
Workqueue: events amdgpu_tlb_fence_work [amdgpu]
Call Trace:
<TASK>
__schedule+0x30d/0x7a0
schedule+0x27/0x90
schedule_timeout+0x104/0x110
dma_fence_default_wait+0x1f0/0x250
? __pfx_dma_fence_default_wait_cb+0x10/0x10
dma_fence_wait_timeout+0x13a/0x170
amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
process_one_work+0x18e/0x3e0
worker_thread+0x2e3/0x420
? __pfx_worker_thread+0x10/0x10
kthread+0x10a/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x121/0x140
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
INFO: task kworker/12:4:40454 blocked for more than 122 seconds.
Tainted: P O 6.17.0-1017-oem #17-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/12:4 state:D stack:0 pid:40454 tgid:40454 ppid:2
task_flags:0x4208060 flags:0x00004000
Workqueue: events amdgpu_tlb_fence_work [amdgpu]
Call Trace:
<TASK>
__schedule+0x30d/0x7a0
schedule+0x27/0x90
schedule_timeout+0x104/0x110
dma_fence_default_wait+0x1f0/0x250
? __pfx_dma_fence_default_wait_cb+0x10/0x10
dma_fence_wait_timeout+0x13a/0x170
amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
process_one_work+0x18e/0x3e0
worker_thread+0x2e3/0x420
? __pfx_worker_thread+0x10/0x10
kthread+0x10a/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x121/0x140
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
INFO: task kworker/5:4:40750 blocked for more than 122 seconds.
Tainted: P O 6.17.0-1017-oem #17-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/5:4 state:D stack:0 pid:40750 tgid:40750 ppid:2
task_flags:0x4208060 flags:0x00004000
Workqueue: events amdgpu_tlb_fence_work [amdgpu]
Call Trace:
<TASK>
__schedule+0x30d/0x7a0
schedule+0x27/0x90
schedule_timeout+0x104/0x110
dma_fence_default_wait+0x1f0/0x250
? __pfx_dma_fence_default_wait_cb+0x10/0x10
dma_fence_wait_timeout+0x13a/0x170
amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
process_one_work+0x18e/0x3e0
worker_thread+0x2e3/0x420
? __pfx_worker_thread+0x10/0x10
kthread+0x10a/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x121/0x140
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
INFO: task kworker/15:3:40825 blocked for more than 122 seconds.
Tainted: P O 6.17.0-1017-oem #17-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/15:3 state:D stack:0 pid:40825 tgid:40825 ppid:2
task_flags:0x4208060 flags:0x00004000
Workqueue: events amdgpu_tlb_fence_work [amdgpu]
Call Trace:
<TASK>
__schedule+0x30d/0x7a0
schedule+0x27/0x90
schedule_timeout+0x104/0x110
dma_fence_default_wait+0x1f0/0x250
? __pfx_dma_fence_default_wait_cb+0x10/0x10
dma_fence_wait_timeout+0x13a/0x170
amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
process_one_work+0x18e/0x3e0
worker_thread+0x2e3/0x420
? __pfx_worker_thread+0x10/0x10
kthread+0x10a/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x121/0x140
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
INFO: task kworker/10:8:40844 blocked for more than 122 seconds.
Tainted: P O 6.17.0-1017-oem #17-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/10:8 state:D stack:0 pid:40844 tgid:40844 ppid:2
task_flags:0x4208060 flags:0x00004000
Workqueue: events amdgpu_tlb_fence_work [amdgpu]
Call Trace:
<TASK>
__schedule+0x30d/0x7a0
schedule+0x27/0x90
schedule_timeout+0x104/0x110
dma_fence_default_wait+0x1f0/0x250
? __pfx_dma_fence_default_wait_cb+0x10/0x10
dma_fence_wait_timeout+0x13a/0x170
amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
process_one_work+0x18e/0x3e0
worker_thread+0x2e3/0x420
? __pfx_worker_thread+0x10/0x10
kthread+0x10a/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x121/0x140
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
INFO: task kworker/10:9:40845 blocked for more than 122 seconds.
Tainted: P O 6.17.0-1017-oem #17-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/10:9 state:D stack:0 pid:40845 tgid:40845 ppid:2
task_flags:0x4208060 flags:0x00004000
Workqueue: events amdgpu_tlb_fence_work [amdgpu]
Call Trace:
<TASK>
__schedule+0x30d/0x7a0
schedule+0x27/0x90
schedule_timeout+0x104/0x110
dma_fence_default_wait+0x1f0/0x250
? __pfx_dma_fence_default_wait_cb+0x10/0x10
dma_fence_wait_timeout+0x13a/0x170
amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
process_one_work+0x18e/0x3e0
worker_thread+0x2e3/0x420
? __pfx_worker_thread+0x10/0x10
kthread+0x10a/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x121/0x140
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
INFO: task kworker/4:2:41419 blocked for more than 122 seconds.
Tainted: P O 6.17.0-1017-oem #17-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/4:2 state:D stack:0 pid:41419 tgid:41419 ppid:2
task_flags:0x4208060 flags:0x00004000
Workqueue: events amdgpu_tlb_fence_work [amdgpu]
Call Trace:
<TASK>
__schedule+0x30d/0x7a0
schedule+0x27/0x90
schedule_timeout+0x104/0x110
dma_fence_default_wait+0x1f0/0x250
? __pfx_dma_fence_default_wait_cb+0x10/0x10
dma_fence_wait_timeout+0x13a/0x170
amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
process_one_work+0x18e/0x3e0
worker_thread+0x2e3/0x420
? __pfx_worker_thread+0x10/0x10
kthread+0x10a/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x121/0x140
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
INFO: task kworker/10:0:41524 blocked for more than 122 seconds.
Tainted: P O 6.17.0-1017-oem #17-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/10:0 state:D stack:0 pid:41524 tgid:41524 ppid:2
task_flags:0x4208060 flags:0x00004000
Workqueue: events amdgpu_tlb_fence_work [amdgpu]
Call Trace:
<TASK>
__schedule+0x30d/0x7a0
schedule+0x27/0x90
schedule_timeout+0x104/0x110
dma_fence_default_wait+0x1f0/0x250
? __pfx_dma_fence_default_wait_cb+0x10/0x10
dma_fence_wait_timeout+0x13a/0x170
amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
process_one_work+0x18e/0x3e0
worker_thread+0x2e3/0x420
? _raw_spin_lock_irqsave+0xe/0x20
? __pfx_worker_thread+0x10/0x10
kthread+0x10a/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x121/0x140
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
INFO: task kworker/4:0:42116 blocked for more than 122 seconds.
Tainted: P O 6.17.0-1017-oem #17-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/4:0 state:D stack:0 pid:42116 tgid:42116 ppid:2
task_flags:0x4208060 flags:0x00004000
Workqueue: events amdgpu_tlb_fence_work [amdgpu]
Call Trace:
<TASK>
__schedule+0x30d/0x7a0
schedule+0x27/0x90
schedule_timeout+0x104/0x110
dma_fence_default_wait+0x1f0/0x250
? __pfx_dma_fence_default_wait_cb+0x10/0x10
dma_fence_wait_timeout+0x13a/0x170
amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
process_one_work+0x18e/0x3e0
worker_thread+0x2e3/0x420
? __pfx_worker_thread+0x10/0x10
kthread+0x10a/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x121/0x140
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings
amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
Supervising 9 threads of 6 processes of 1 users.
Supervising 9 threads of 6 processes of 1 users.
amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
kauditd_printk_skb: 4 callbacks suppressed
+
+ SRU Justification:
+
+ [ Impact ]
+ Lenovo ThinkPad P14s Gen 6 AMD systems with Ryzen AI 7 PRO 350 can
+ freeze or become unusable on affected kernels. Logs show repeated
+ amdgpu MES failures, including "MES ring buffer is full" and hung
+ amdgpu_tlb_fence_work workers.
+
+ The regression is caused by f3854e04b708 ("drm/amdgpu: attach tlb
+ fence to the PTs update"), which is present in the affected Ubuntu
+ kernels. It attaches TLB fences too broadly and can flood KIQ/MES TLB
+ invalidation work.
+
+ The linux package in Noble is not affected because the offending
+ TLB-fence change is not present there.
+
+ The linux package in Resolute is not affected because the upstream
+ fixes are already present there.
+
+ [ Fix ]
+ Backport the upstream fixes:
+ - d967509651601
+ ("drm/amdgpu: make sure userqs are enabled in userq IOCTLs")
+ - e9f58ff991dd ("drm/amdgpu: rework how we handle TLB fences")
+
+ The second patch limits TLB fence attachment to VMs that need it: KFD
+ or KGD user queues.
+
+ [ Test Plan ]
+ 1. Boot an affected Lenovo ThinkPad P14s/T14 Gen 6 AMD Ryzen AI 7 PRO
+ 350 system.
+ 2. Run the desktop workload that previously triggered freezes.
+ 3. Verify the system no longer freezes and dmesg/journal no longer
+ shows repeated amdgpu "MES ring buffer is full" messages or
+ amdgpu_tlb_fence_work hung task reports.
+
+ A test kernel containing these two patches was reported in LP #2148538
+ comment #29 to run for 12 hours without issues on an affected system.
+
+ [ Where problems could occur ]
+ The changes affect amdgpu user queue handling and VM TLB fence
+ attachment. Regressions could affect AMDGPU VM update synchronization,
+ user queue IOCTL behavior, or KFD/KGD user queue workloads. The main
+ risk is missing a TLB fence for a VM that requires one, but the upstream
+ fix enables need_tlb_fence for KFD compute VMs and when KGD user queues
+ are enabled.
+
+ [ Other Info ]
+ Target status plan:
+ - linux (Ubuntu Noble): Invalid; offending TLB-fence change is absent.
+ - linux (Ubuntu Questing): In Progress; fix is being submitted.
+ - linux (Ubuntu Resolute): Invalid; fixes are already present.
+ - linux-oem-6.17 (Ubuntu Noble): In Progress; fix is being submitted.
+
+ The conflicting SI-specific TLB fence guard is superseded by
+ need_tlb_fence. SI does not support PASID, KIQ/MES, or user queues, so
+ it will not set need_tlb_fence.
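
Step 3 of the test plan above amounts to checking the kernel log for the
failure signatures quoted in this report. The following is a minimal
sketch of such a check, not part of the SRU itself. The function names
(scan_journal, is_clean) and the SAMPLE lines are assumptions made for
illustration; the patterns are copied from the messages in the original
description.

```python
import re

# Failure signatures, taken from the log lines quoted in this report.
PATTERNS = {
    "mes_ring_full": re.compile(r"amdgpu: MES ring buffer is full"),
    "hung_task": re.compile(r"INFO: task \S+ blocked for more than \d+ seconds"),
    # Capital "W" on purpose: this matches the hung-task detail line, not the
    # lowercase "workqueue: ... hogged CPU" warning.
    "tlb_fence_worker": re.compile(r"Workqueue:\s+\S+\s+amdgpu_tlb_fence_work"),
}


def scan_journal(lines):
    """Return a dict mapping each signature name to its number of matching lines."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def is_clean(counts):
    """True when none of the failure signatures were seen."""
    return all(value == 0 for value in counts.values())


# Hypothetical sample input: three lines from this report plus one unrelated line.
SAMPLE = [
    "amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.",
    "INFO: task kworker/7:0:37823 blocked for more than 122 seconds.",
    "Workqueue: events amdgpu_tlb_fence_work [amdgpu]",
    "kauditd_printk_skb: 4 callbacks suppressed",
]

counts = scan_journal(SAMPLE)
print(counts, "clean" if is_clean(counts) else "failure signatures present")
```

In practice the lines would come from the kernel log of the boot under
test, for example the saved output of "journalctl -k -b", rather than
from SAMPLE. A clean result only shows that these messages are absent
from that log; it does not by itself prove the freeze is fixed.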
** Also affects: linux (Ubuntu Noble)
Importance: Undecided
Status: New
** Also affects: linux-oem-6.17 (Ubuntu Noble)
Importance: Undecided
Status: New
** Also affects: linux (Ubuntu Questing)
Importance: Undecided
Status: New
** Also affects: linux-oem-6.17 (Ubuntu Questing)
Importance: Undecided
Status: New
** Also affects: linux (Ubuntu Resolute)
Importance: Undecided
Status: New
** Also affects: linux-oem-6.17 (Ubuntu Resolute)
Importance: Undecided
Status: New
** Changed in: linux (Ubuntu Noble)
Status: New => Invalid
** Changed in: linux (Ubuntu Questing)
Status: New => In Progress
** Changed in: linux (Ubuntu Resolute)
Status: New => Invalid
** Changed in: linux-oem-6.17 (Ubuntu Noble)
Status: New => In Progress
** Changed in: linux-oem-6.17 (Ubuntu Questing)
Status: New => Invalid
** Changed in: linux-oem-6.17 (Ubuntu Resolute)
Status: New => Invalid
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2148538
Title:
Kernel lockup on 6.17.0-1017-oem Lenovo P14s gen 6 AMD Ryzen AI 7 pro
350
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2148538/+subscriptions