** Description changed:

  I have a Lenovo ThinkPad P14s Gen 6 AMD with a Ryzen AI 7 PRO 350.
  
  I have installed
  - 24.04 LTS
  - 6.17.0-1017-oem
  - linux-image-6.17.0-1017-oem 6.17.0-1017.17
  
  Regularly (currently about once per day), the system locks up. I
  cannot tell whether the kernel has locked up or only GNOME has frozen,
  because the key combination I expected to switch to a text console
  does not do so. I haven't been able to identify a particular workload
  that triggers this. At any given time I am running Firefox, Brave,
  Mattermost, VS Code, multiple terminal sessions, Obsidian, and perhaps
  some QEMU VMs (no GUI).
  
  Looking at the journal, after one recent lockup, I saw the following:
  
   amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
   workqueue: amdgpu_tlb_fence_work [amdgpu] hogged CPU for >13333us 35 times, consider switching to WQ_UNBOUND
   amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
   amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
   amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
   amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
   amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
   amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
   amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
   amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
   amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
   INFO: task kworker/7:0:37823 blocked for more than 122 seconds.
         Tainted: P           O        6.17.0-1017-oem #17-Ubuntu
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:kworker/7:0     state:D stack:0     pid:37823 tgid:37823 ppid:2      task_flags:0x4208060 flags:0x00004000
   Workqueue: events amdgpu_tlb_fence_work [amdgpu]
   Call Trace:
    <TASK>
    __schedule+0x30d/0x7a0
    schedule+0x27/0x90
    schedule_timeout+0x104/0x110
    dma_fence_default_wait+0x1f0/0x250
    ? __pfx_dma_fence_default_wait_cb+0x10/0x10
    dma_fence_wait_timeout+0x13a/0x170
    amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
    process_one_work+0x18e/0x3e0
    worker_thread+0x2e3/0x420
    ? __pfx_worker_thread+0x10/0x10
    kthread+0x10a/0x230
    ? __pfx_kthread+0x10/0x10
    ret_from_fork+0x121/0x140
    ? __pfx_kthread+0x10/0x10
    ret_from_fork_asm+0x1a/0x30
    </TASK>
   INFO: task kworker/15:0:39916 blocked for more than 122 seconds.
         Tainted: P           O        6.17.0-1017-oem #17-Ubuntu
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:kworker/15:0    state:D stack:0     pid:39916 tgid:39916 ppid:2      task_flags:0x4208060 flags:0x00004000
   Workqueue: events amdgpu_tlb_fence_work [amdgpu]
   Call Trace:
    <TASK>
    __schedule+0x30d/0x7a0
    schedule+0x27/0x90
    schedule_timeout+0x104/0x110
    dma_fence_default_wait+0x1f0/0x250
    ? __pfx_dma_fence_default_wait_cb+0x10/0x10
    dma_fence_wait_timeout+0x13a/0x170
    amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
    process_one_work+0x18e/0x3e0
    worker_thread+0x2e3/0x420
    ? __pfx_worker_thread+0x10/0x10
    kthread+0x10a/0x230
    ? __pfx_kthread+0x10/0x10
    ret_from_fork+0x121/0x140
    ? __pfx_kthread+0x10/0x10
    ret_from_fork_asm+0x1a/0x30
    </TASK>
   INFO: task kworker/12:4:40454 blocked for more than 122 seconds.
         Tainted: P           O        6.17.0-1017-oem #17-Ubuntu
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:kworker/12:4    state:D stack:0     pid:40454 tgid:40454 ppid:2      task_flags:0x4208060 flags:0x00004000
   Workqueue: events amdgpu_tlb_fence_work [amdgpu]
   Call Trace:
    <TASK>
    __schedule+0x30d/0x7a0
    schedule+0x27/0x90
    schedule_timeout+0x104/0x110
    dma_fence_default_wait+0x1f0/0x250
    ? __pfx_dma_fence_default_wait_cb+0x10/0x10
    dma_fence_wait_timeout+0x13a/0x170
    amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
    process_one_work+0x18e/0x3e0
    worker_thread+0x2e3/0x420
    ? __pfx_worker_thread+0x10/0x10
    kthread+0x10a/0x230
    ? __pfx_kthread+0x10/0x10
    ret_from_fork+0x121/0x140
    ? __pfx_kthread+0x10/0x10
    ret_from_fork_asm+0x1a/0x30
    </TASK>
   INFO: task kworker/5:4:40750 blocked for more than 122 seconds.
         Tainted: P           O        6.17.0-1017-oem #17-Ubuntu
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:kworker/5:4     state:D stack:0     pid:40750 tgid:40750 ppid:2      task_flags:0x4208060 flags:0x00004000
   Workqueue: events amdgpu_tlb_fence_work [amdgpu]
   Call Trace:
    <TASK>
    __schedule+0x30d/0x7a0
    schedule+0x27/0x90
    schedule_timeout+0x104/0x110
    dma_fence_default_wait+0x1f0/0x250
    ? __pfx_dma_fence_default_wait_cb+0x10/0x10
    dma_fence_wait_timeout+0x13a/0x170
    amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
    process_one_work+0x18e/0x3e0
    worker_thread+0x2e3/0x420
    ? __pfx_worker_thread+0x10/0x10
    kthread+0x10a/0x230
    ? __pfx_kthread+0x10/0x10
    ret_from_fork+0x121/0x140
    ? __pfx_kthread+0x10/0x10
    ret_from_fork_asm+0x1a/0x30
    </TASK>
   INFO: task kworker/15:3:40825 blocked for more than 122 seconds.
         Tainted: P           O        6.17.0-1017-oem #17-Ubuntu
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:kworker/15:3    state:D stack:0     pid:40825 tgid:40825 ppid:2      task_flags:0x4208060 flags:0x00004000
   Workqueue: events amdgpu_tlb_fence_work [amdgpu]
   Call Trace:
    <TASK>
    __schedule+0x30d/0x7a0
    schedule+0x27/0x90
    schedule_timeout+0x104/0x110
    dma_fence_default_wait+0x1f0/0x250
    ? __pfx_dma_fence_default_wait_cb+0x10/0x10
    dma_fence_wait_timeout+0x13a/0x170
    amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
    process_one_work+0x18e/0x3e0
    worker_thread+0x2e3/0x420
    ? __pfx_worker_thread+0x10/0x10
    kthread+0x10a/0x230
    ? __pfx_kthread+0x10/0x10
    ret_from_fork+0x121/0x140
    ? __pfx_kthread+0x10/0x10
    ret_from_fork_asm+0x1a/0x30
    </TASK>
   INFO: task kworker/10:8:40844 blocked for more than 122 seconds.
         Tainted: P           O        6.17.0-1017-oem #17-Ubuntu
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:kworker/10:8    state:D stack:0     pid:40844 tgid:40844 ppid:2      task_flags:0x4208060 flags:0x00004000
   Workqueue: events amdgpu_tlb_fence_work [amdgpu]
   Call Trace:
    <TASK>
    __schedule+0x30d/0x7a0
    schedule+0x27/0x90
    schedule_timeout+0x104/0x110
    dma_fence_default_wait+0x1f0/0x250
    ? __pfx_dma_fence_default_wait_cb+0x10/0x10
    dma_fence_wait_timeout+0x13a/0x170
    amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
    process_one_work+0x18e/0x3e0
    worker_thread+0x2e3/0x420
    ? __pfx_worker_thread+0x10/0x10
    kthread+0x10a/0x230
    ? __pfx_kthread+0x10/0x10
    ret_from_fork+0x121/0x140
    ? __pfx_kthread+0x10/0x10
    ret_from_fork_asm+0x1a/0x30
    </TASK>
   INFO: task kworker/10:9:40845 blocked for more than 122 seconds.
         Tainted: P           O        6.17.0-1017-oem #17-Ubuntu
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:kworker/10:9    state:D stack:0     pid:40845 tgid:40845 ppid:2      task_flags:0x4208060 flags:0x00004000
   Workqueue: events amdgpu_tlb_fence_work [amdgpu]
   Call Trace:
    <TASK>
    __schedule+0x30d/0x7a0
    schedule+0x27/0x90
    schedule_timeout+0x104/0x110
    dma_fence_default_wait+0x1f0/0x250
    ? __pfx_dma_fence_default_wait_cb+0x10/0x10
    dma_fence_wait_timeout+0x13a/0x170
    amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
    process_one_work+0x18e/0x3e0
    worker_thread+0x2e3/0x420
    ? __pfx_worker_thread+0x10/0x10
    kthread+0x10a/0x230
    ? __pfx_kthread+0x10/0x10
    ret_from_fork+0x121/0x140
    ? __pfx_kthread+0x10/0x10
    ret_from_fork_asm+0x1a/0x30
    </TASK>
   INFO: task kworker/4:2:41419 blocked for more than 122 seconds.
         Tainted: P           O        6.17.0-1017-oem #17-Ubuntu
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:kworker/4:2     state:D stack:0     pid:41419 tgid:41419 ppid:2      task_flags:0x4208060 flags:0x00004000
   Workqueue: events amdgpu_tlb_fence_work [amdgpu]
   Call Trace:
    <TASK>
    __schedule+0x30d/0x7a0
    schedule+0x27/0x90
    schedule_timeout+0x104/0x110
    dma_fence_default_wait+0x1f0/0x250
    ? __pfx_dma_fence_default_wait_cb+0x10/0x10
    dma_fence_wait_timeout+0x13a/0x170
    amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
    process_one_work+0x18e/0x3e0
    worker_thread+0x2e3/0x420
    ? __pfx_worker_thread+0x10/0x10
    kthread+0x10a/0x230
    ? __pfx_kthread+0x10/0x10
    ret_from_fork+0x121/0x140
    ? __pfx_kthread+0x10/0x10
    ret_from_fork_asm+0x1a/0x30
    </TASK>
   INFO: task kworker/10:0:41524 blocked for more than 122 seconds.
         Tainted: P           O        6.17.0-1017-oem #17-Ubuntu
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:kworker/10:0    state:D stack:0     pid:41524 tgid:41524 ppid:2      task_flags:0x4208060 flags:0x00004000
   Workqueue: events amdgpu_tlb_fence_work [amdgpu]
   Call Trace:
    <TASK>
    __schedule+0x30d/0x7a0
    schedule+0x27/0x90
    schedule_timeout+0x104/0x110
    dma_fence_default_wait+0x1f0/0x250
    ? __pfx_dma_fence_default_wait_cb+0x10/0x10
    dma_fence_wait_timeout+0x13a/0x170
    amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
    process_one_work+0x18e/0x3e0
    worker_thread+0x2e3/0x420
    ? _raw_spin_lock_irqsave+0xe/0x20
    ? __pfx_worker_thread+0x10/0x10
    kthread+0x10a/0x230
    ? __pfx_kthread+0x10/0x10
    ret_from_fork+0x121/0x140
    ? __pfx_kthread+0x10/0x10
    ret_from_fork_asm+0x1a/0x30
    </TASK>
   INFO: task kworker/4:0:42116 blocked for more than 122 seconds.
         Tainted: P           O        6.17.0-1017-oem #17-Ubuntu
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:kworker/4:0     state:D stack:0     pid:42116 tgid:42116 ppid:2      task_flags:0x4208060 flags:0x00004000
   Workqueue: events amdgpu_tlb_fence_work [amdgpu]
   Call Trace:
    <TASK>
    __schedule+0x30d/0x7a0
    schedule+0x27/0x90
    schedule_timeout+0x104/0x110
    dma_fence_default_wait+0x1f0/0x250
    ? __pfx_dma_fence_default_wait_cb+0x10/0x10
    dma_fence_wait_timeout+0x13a/0x170
    amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
    process_one_work+0x18e/0x3e0
    worker_thread+0x2e3/0x420
    ? __pfx_worker_thread+0x10/0x10
    kthread+0x10a/0x230
    ? __pfx_kthread+0x10/0x10
    ret_from_fork+0x121/0x140
    ? __pfx_kthread+0x10/0x10
    ret_from_fork_asm+0x1a/0x30
    </TASK>
   Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings
   amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
   Supervising 9 threads of 6 processes of 1 users.
   Supervising 9 threads of 6 processes of 1 users.
   amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
   amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.
   kauditd_printk_skb: 4 callbacks suppressed
+ 
+ SRU Justification:
+ 
+ [ Impact ]
+ Lenovo ThinkPad P14s Gen 6 AMD systems with Ryzen AI 7 PRO 350 can
+ freeze or become unusable on affected kernels. Logs show repeated
+ amdgpu MES failures, including "MES ring buffer is full" and hung
+ amdgpu_tlb_fence_work workers.
+ 
+ The regression is caused by f3854e04b708 ("drm/amdgpu: attach tlb
+ fence to the PTs update"), which is present in the affected Ubuntu
+ kernels. It attaches TLB fences too broadly and can flood KIQ/MES TLB
+ invalidation work.
+ 
+ Noble linux is not affected because the offending TLB-fence change is
+ not present there.
+ 
+ Resolute linux is not affected because the upstream fixes are already
+ present there.
+ 
+ [ Fix ]
+ Backport the upstream fixes:
+ - d967509651601
+   ("drm/amdgpu: make sure userqs are enabled in userq IOCTLs")
+ - e9f58ff991dd ("drm/amdgpu: rework how we handle TLB fences")
+ 
+ The second patch limits TLB fence attachment to VMs that need it: KFD
+ or KGD user queues.
+ 
+ [ Test Plan ]
+ 1. Boot an affected Lenovo ThinkPad P14s/T14 Gen 6 AMD Ryzen AI 7 PRO
+    350 system.
+ 2. Run the desktop workload that previously triggered freezes.
+ 3. Verify the system no longer freezes and dmesg/journal no longer
+    shows repeated amdgpu MES ring buffer full or amdgpu_tlb_fence_work
+    hung task messages.
+ 
+ A test kernel containing these two patches was reported in LP #2148538
+ comment #29 to run for 12 hours without issues on an affected system.
+ 
+ [ Where problems could occur ]
+ The changes affect amdgpu user queue handling and VM TLB fence
+ attachment. Regressions could affect AMDGPU VM update synchronization,
+ user queue IOCTL behavior, or KFD/KGD user queue workloads. The main
+ risk is missing a TLB fence for a VM that requires one, but the upstream
+ fix enables need_tlb_fence for KFD compute VMs and when KGD user queues
+ are enabled.
+ 
+ [ Other Info ]
+ Target status plan:
+ - linux (Ubuntu Noble): Invalid; offending TLB-fence change is absent.
+ - linux (Ubuntu Questing): In Progress; fix is being submitted.
+ - linux (Ubuntu Resolute): Invalid; fixes are already present.
+ - linux-oem-6.17 (Ubuntu Noble): In Progress; fix is being submitted.
+ 
+ The conflicting SI-specific TLB fence guard is superseded by
+ need_tlb_fence. SI does not support PASID, KIQ/MES, or user queues, so
+ it will not set need_tlb_fence.
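
For step 3 of the Test Plan above, the log check could be scripted
roughly as follows. This is only a sketch, not part of the SRU: the
function name count_symptoms and the pattern labels are made up for
illustration, and the two patterns are simply copied from the log
excerpt in this report, so they may need adjusting if the message
wording differs on other kernels.

```python
import re

# Patterns copied from the log excerpt in this report (an assumption:
# other kernel versions may word these messages differently).
PATTERNS = {
    "mes_ring_full": re.compile(r"amdgpu: MES ring buffer is full"),
    "tlb_fence_hung": re.compile(r"Workqueue: events amdgpu_tlb_fence_work"),
}


def count_symptoms(lines):
    """Count how many log lines match each known symptom pattern."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Small demonstration on three lines taken from the excerpt above.
sample = [
    "amdgpu 0000:c4:00.0: amdgpu: MES ring buffer is full.",
    "Workqueue: events amdgpu_tlb_fence_work [amdgpu]",
    "kauditd_printk_skb: 4 callbacks suppressed",
]
for name, n in count_symptoms(sample).items():
    print(f"{name}: {n}")
```

To check a real boot, the same function can be given the lines of the
kernel log, for example by saving the output of "journalctl -k -b" to a
file and passing the open file object to count_symptoms. A fixed kernel
would be expected to report zero for both counters after the workload
has run for a while.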

** Also affects: linux (Ubuntu Noble)
   Importance: Undecided
       Status: New

** Also affects: linux-oem-6.17 (Ubuntu Noble)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Questing)
   Importance: Undecided
       Status: New

** Also affects: linux-oem-6.17 (Ubuntu Questing)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Resolute)
   Importance: Undecided
       Status: New

** Also affects: linux-oem-6.17 (Ubuntu Resolute)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu Noble)
       Status: New => Invalid

** Changed in: linux (Ubuntu Questing)
       Status: New => In Progress

** Changed in: linux (Ubuntu Resolute)
       Status: New => Invalid

** Changed in: linux-oem-6.17 (Ubuntu Noble)
       Status: New => In Progress

** Changed in: linux-oem-6.17 (Ubuntu Questing)
       Status: New => Invalid

** Changed in: linux-oem-6.17 (Ubuntu Resolute)
       Status: New => Invalid

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2148538

Title:
  Kernel lockup on 6.17.0-1017-oem Lenovo P14s gen 6 AMD Ryzen AI 7 pro
  350

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2148538/+subscriptions

