Public bug reported:

[Impact]
The system hangs in amdgpu during exec. systemd is blocked in D state for more
than 368 seconds on the mutex_lock in amdgpu_ctx_mgr_entity_flush, reached via
load_elf_binary -> begin_new_exec -> do_close_on_exec -> filp_close ->
filp_flush -> amdgpu_flush. Reproduces on 6.17.0-1020-oem.

May 29 10:47:30 London-AMD-SIT5 kernel: INFO: task systemd:16426 blocked for more than 368 seconds.
May 29 10:47:30 London-AMD-SIT5 kernel:       Tainted: G           O        6.17.0-1020-oem #20-Ubuntu
May 29 10:47:30 London-AMD-SIT5 kernel: task:systemd  state:D  pid:16426
May 29 10:47:30 London-AMD-SIT5 kernel:  mutex_lock
May 29 10:47:30 London-AMD-SIT5 kernel:  amdgpu_ctx_mgr_entity_flush+0x46/0x1f0 [amdgpu]
May 29 10:47:30 London-AMD-SIT5 kernel:  amdgpu_flush+0x26/0x50 [amdgpu]
May 29 10:47:30 London-AMD-SIT5 kernel:  filp_flush+0x6f/0xb0
May 29 10:47:30 London-AMD-SIT5 kernel:  filp_close+0x14/0x30
May 29 10:47:30 London-AMD-SIT5 kernel:  do_close_on_exec+0xe7/0x140
May 29 10:47:30 London-AMD-SIT5 kernel:  begin_new_exec+0x1ab/0x420
May 29 10:47:30 London-AMD-SIT5 kernel:  load_elf_binary+0x32d/0xf40

[Fix]
Cherry-pick upstream commits:
- b18fc0ab837381 "drm/amdgpu: fix sync handling in amdgpu_dma_buf_move_notify"
- 930595df251c "drm/amdgpu: remove check for BO reservation add assert instead"

Pass a NULL ticket to amdgpu_vm_handle_moved in move_notify so that the
clear=true path is used. This avoids a page-table (PT) update while another
process's job is still running, which is the contention that blocks
amdgpu_ctx_mgr_entity_flush.

[Test Plan]
Boot oem-6.17 on dual-GPU AMD system without P2P PCI; glxgears on GPU0,
Xorg on GPU1 sharing a dmabuf. Without fix: systemd hung >368s in
amdgpu_ctx_mgr_entity_flush. With fix: no hang.
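
To check a captured journal for the hang during the test run, the hung-task
message can be matched mechanically. The following is a minimal sketch, not
part of the SRU itself: the names HungTask, find_hung_tasks and stack_mentions
are hypothetical helpers, and the regular expression assumes the message
format shown in the [Impact] log above.

```python
import re
from typing import List, NamedTuple


class HungTask(NamedTuple):
    """One 'task ... blocked for more than N seconds' report (hypothetical helper type)."""
    comm: str
    pid: int
    seconds: int


# Format assumed from the log excerpt in this report. \s+ tolerates a line
# wrap between "blocked for" and "more than".
_HUNG_RE = re.compile(
    r"INFO: task (.+?):(\d+) blocked for\s+more than (\d+) seconds"
)


def find_hung_tasks(log_text: str) -> List[HungTask]:
    """Return every hung-task report found in the given log text."""
    return [
        HungTask(m.group(1), int(m.group(2)), int(m.group(3)))
        for m in _HUNG_RE.finditer(log_text)
    ]


def stack_mentions(log_text: str, symbol: str) -> bool:
    """Return True if the log text mentions the given kernel symbol anywhere."""
    return symbol in log_text
```

A run would count as reproducing the bug if find_hung_tasks returns an entry
for systemd and stack_mentions(log, "amdgpu_ctx_mgr_entity_flush") is true for
the same capture, for example the output of `journalctl -k` saved to a file.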

[Where problems could occur]
- Patch 1 changes only the ticket arg in amdgpu_dma_buf_move_notify; normal VM
  update path untouched. Risk limited to dmabuf move-notify sync on multi-GPU
  shared-BO setups.
- Patch 2 replaces a runtime warning + -EINVAL with a lockdep assert (no-op on
  production kernels with PROVE_LOCKING off). Low risk.
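
Whether the lockdep assert from Patch 2 is active on a given kernel depends on
its build configuration. The sketch below checks a kernel config text for an
enabled option; config_enabled is a hypothetical helper, and the assumption
that the config file lives at /boot/config-<release> follows the usual Ubuntu
layout rather than anything stated in this report.

```python
def config_enabled(config_text: str, option: str) -> bool:
    """Return True if the option is built in (=y) in the given kernel config text."""
    wanted = f"{option}=y"
    # Lines such as "# CONFIG_FOO is not set" do not match and count as disabled.
    return any(line.strip() == wanted for line in config_text.splitlines())
```

For example, reading /boot/config-6.17.0-1020-oem and calling
config_enabled(text, "CONFIG_PROVE_LOCKING") should return False on a
production kernel if the statement above holds.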

** Affects: linux-oem-6.17 (Ubuntu)
     Importance: Undecided
         Status: New

** Affects: linux-oem-6.17 (Ubuntu Noble)
     Importance: Undecided
         Status: New

** Package changed: ubuntu => linux-oem-6.17 (Ubuntu)

** Also affects: linux-oem-6.17 (Ubuntu Noble)
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2158236

Title:
  amdgpu: hang in amdgpu_ctx_mgr_entity_flush during exec

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-oem-6.17/+bug/2158236/+subscriptions
