** Tags added: jira-sutton-5174 oem-priority sutton

** Tags added: jira-sutton-5194

** Also affects: linux-oem-6.17 (Ubuntu Resolute)
   Importance: Undecided
       Status: New

** Also affects: linux-oem-6.17 (Ubuntu Stonking)
   Importance: Undecided
       Status: New

** Changed in: linux-oem-6.17 (Ubuntu Resolute)
       Status: New => Fix Released

** Changed in: linux-oem-6.17 (Ubuntu Stonking)
       Status: New => Fix Released

** Changed in: linux-oem-6.17 (Ubuntu Noble)
       Status: New => In Progress

** Changed in: hwe-next
       Status: New => In Progress

** Changed in: hwe-next
     Assignee: (unassigned) => AaronMa (mapengyu)

** Changed in: hwe-next
   Importance: Undecided => Medium

** Changed in: linux-oem-6.17 (Ubuntu Noble)
   Importance: Undecided => Medium

** Changed in: linux-oem-6.17 (Ubuntu Resolute)
   Importance: Undecided => Medium

** Changed in: linux-oem-6.17 (Ubuntu Stonking)
   Importance: Undecided => Medium

** Description changed:

  [Impact]
  System hangs on amdgpu during exec. systemd blocked for 368s in D state in
  amdgpu_ctx_mgr_entity_flush (mutex_lock) via filp_close -> amdgpu_flush during
  do_close_on_exec -> load_elf_binary. Reproduces on 6.17.0-1020-oem.
  
- May 29 10:47:30 London-AMD-SIT5 kernel: INFO: task systemd:16426 blocked for more than 368 seconds.
- May 29 10:47:30 London-AMD-SIT5 kernel:       Tainted: G           O        6.17.0-1020-oem #20-Ubuntu
- May 29 10:47:30 London-AMD-SIT5 kernel: task:systemd  state:D  pid:16426
- May 29 10:47:30 London-AMD-SIT5 kernel:  mutex_lock
- May 29 10:47:30 London-AMD-SIT5 kernel:  amdgpu_ctx_mgr_entity_flush+0x46/0x1f0 [amdgpu]
- May 29 10:47:30 London-AMD-SIT5 kernel:  amdgpu_flush+0x26/0x50 [amdgpu]
- May 29 10:47:30 London-AMD-SIT5 kernel:  filp_flush+0x6f/0xb0
- May 29 10:47:30 London-AMD-SIT5 kernel:  filp_close+0x14/0x30
- May 29 10:47:30 London-AMD-SIT5 kernel:  do_close_on_exec+0xe7/0x140
- May 29 10:47:30 London-AMD-SIT5 kernel:  begin_new_exec+0x1ab/0x420
- May 29 10:47:30 London-AMD-SIT5 kernel:  load_elf_binary+0x32d/0xf40
+  kernel: INFO: task systemd:16426 blocked for more than 368 seconds.
+  kernel:       Tainted: G           O        6.17.0-1020-oem #20-Ubuntu
+  kernel: task:systemd  state:D  pid:16426
+  kernel:  mutex_lock
+  kernel:  amdgpu_ctx_mgr_entity_flush+0x46/0x1f0 [amdgpu]
+  kernel:  amdgpu_flush+0x26/0x50 [amdgpu]
+  kernel:  filp_flush+0x6f/0xb0
+  kernel:  filp_close+0x14/0x30
+  kernel:  do_close_on_exec+0xe7/0x140
+  kernel:  begin_new_exec+0x1ab/0x420
+  kernel:  load_elf_binary+0x32d/0xf40
  
  [Fix]
  Cherry-pick upstream commits:
  - b18fc0ab837381 "drm/amdgpu: fix sync handling in amdgpu_dma_buf_move_notify"
  - 930595df251c "drm/amdgpu: remove check for BO reservation add assert instead"
  
  Pass NULL ticket to amdgpu_vm_handle_moved in move_notify so the clear=true
  path is used, avoiding a PT update while another process's job is still
  running — the contention that blocks amdgpu_ctx_mgr_entity_flush.
  
  [Test Plan]
  Boot oem-6.17 on dual-GPU AMD system without P2P PCI; glxgears on GPU0,
  Xorg on GPU1 sharing a dmabuf. Without fix: systemd hung >368s in
  amdgpu_ctx_mgr_entity_flush. With fix: no hang.
  
  [Where problems could occur]
  - Patch 1 changes only the ticket arg in amdgpu_dma_buf_move_notify; normal VM
-   update path untouched. Risk limited to dmabuf move-notify sync on multi-GPU
-   shared-BO setups.
+   update path untouched. Risk limited to dmabuf move-notify sync on multi-GPU
+   shared-BO setups.
  - Patch 2 replaces a runtime warning + -EINVAL with a lockdep assert (no-op on
-   production kernels with PROVE_LOCKING off). Low risk.
+   production kernels with PROVE_LOCKING off). Low risk.

** Description changed:

  [Impact]
  System hangs on amdgpu during exec. systemd blocked for 368s in D state in
  amdgpu_ctx_mgr_entity_flush (mutex_lock) via filp_close -> amdgpu_flush during
  do_close_on_exec -> load_elf_binary. Reproduces on 6.17.0-1020-oem.
  
-  kernel: INFO: task systemd:16426 blocked for more than 368 seconds.
-  kernel:       Tainted: G           O        6.17.0-1020-oem #20-Ubuntu
-  kernel: task:systemd  state:D  pid:16426
-  kernel:  mutex_lock
-  kernel:  amdgpu_ctx_mgr_entity_flush+0x46/0x1f0 [amdgpu]
-  kernel:  amdgpu_flush+0x26/0x50 [amdgpu]
-  kernel:  filp_flush+0x6f/0xb0
-  kernel:  filp_close+0x14/0x30
-  kernel:  do_close_on_exec+0xe7/0x140
-  kernel:  begin_new_exec+0x1ab/0x420
-  kernel:  load_elf_binary+0x32d/0xf40
+  kernel: INFO: task systemd:16426 blocked for more than 368 seconds.
+  kernel:       Tainted: G           O        6.17.0-1020-oem #20-Ubuntu
+  kernel: task:systemd  state:D  pid:16426
+  kernel:  mutex_lock
+  kernel:  amdgpu_ctx_mgr_entity_flush+0x46/0x1f0 [amdgpu]
+  kernel:  amdgpu_flush+0x26/0x50 [amdgpu]
+  kernel:  filp_flush+0x6f/0xb0
+  kernel:  filp_close+0x14/0x30
+  kernel:  do_close_on_exec+0xe7/0x140
+  kernel:  begin_new_exec+0x1ab/0x420
+  kernel:  load_elf_binary+0x32d/0xf40
  
  [Fix]
  Cherry-pick upstream commits:
  - b18fc0ab837381 "drm/amdgpu: fix sync handling in amdgpu_dma_buf_move_notify"
  - 930595df251c "drm/amdgpu: remove check for BO reservation add assert instead"
  
  Pass NULL ticket to amdgpu_vm_handle_moved in move_notify so the clear=true
  path is used, avoiding a PT update while another process's job is still
  running — the contention that blocks amdgpu_ctx_mgr_entity_flush.
  
  [Test Plan]
  Boot oem-6.17 on dual-GPU AMD system without P2P PCI; glxgears on GPU0,
  Xorg on GPU1 sharing a dmabuf. Without fix: systemd hung >368s in
  amdgpu_ctx_mgr_entity_flush. With fix: no hang.
  
  [Where problems could occur]
- - Patch 1 changes only the ticket arg in amdgpu_dma_buf_move_notify; normal VM
-   update path untouched. Risk limited to dmabuf move-notify sync on multi-GPU
-   shared-BO setups.
- - Patch 2 replaces a runtime warning + -EINVAL with a lockdep assert (no-op on
-   production kernels with PROVE_LOCKING off). Low risk.
+ Patch 1 changes only the ticket arg in amdgpu_dma_buf_move_notify; normal VM
+ update path untouched. Risk limited to dmabuf move-notify sync on multi-GPU
+ shared-BO setups.
+ It may break the amdgpu driver.

** Summary changed:

- amdgpu: hang in amdgpu_ctx_mgr_entity_flush during exec
+ Fix amdgpu hang in amdgpu_ctx_mgr_entity_flush during exec
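
The [Test Plan] outcome ("Without fix: systemd hung ... With fix: no hang") can be checked without reading the log by hand. The following is a minimal Python sketch, not part of the bug report: the function name find_hung_tasks and the sample text are illustrative assumptions, and the matching relies only on the hung-task message format quoted in [Impact].

```python
import re

# Matches the hung-task header line quoted in [Impact], e.g.
# "INFO: task systemd:16426 blocked for more than 368 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def find_hung_tasks(log_text, symbol="amdgpu_ctx_mgr_entity_flush"):
    """Return (comm, pid, seconds) for each hung-task report whose
    following lines mention `symbol`.

    Hypothetical helper: it pairs each header with the first later line
    that contains the symbol, which is enough for a quick pass/fail check
    but is not a full stack-trace parser.
    """
    hits = []
    current = None  # header of the report currently being scanned
    for line in log_text.splitlines():
        match = HUNG_RE.search(line)
        if match:
            current = (match["comm"], int(match["pid"]), int(match["secs"]))
            continue
        if current is not None and symbol in line:
            hits.append(current)
            current = None  # count each report once
    return hits


# Sample taken from the trace in [Impact].
SAMPLE = """\
kernel: INFO: task systemd:16426 blocked for more than 368 seconds.
kernel:       Tainted: G           O        6.17.0-1020-oem #20-Ubuntu
kernel: task:systemd  state:D  pid:16426
kernel:  mutex_lock
kernel:  amdgpu_ctx_mgr_entity_flush+0x46/0x1f0 [amdgpu]
kernel:  amdgpu_flush+0x26/0x50 [amdgpu]
"""

if __name__ == "__main__":
    print(find_hung_tasks(SAMPLE))
```

Feeding it the kernel log (for example the output of `journalctl -k`) after the reproducer has run should give an empty list on a fixed kernel, assuming the hung-task detector is enabled on the test system.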

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2158236

Title:
  Fix amdgpu hang in amdgpu_ctx_mgr_entity_flush during exec

To manage notifications about this bug go to:
https://bugs.launchpad.net/hwe-next/+bug/2158236/+subscriptions


-- 
ubuntu-bugs mailing list
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs