Hey Daniel,
I got a freeze with !5020 after about 50 minutes, so I'm trying to get this
NVIDIA bug fixed; otherwise I'll have to look for an alternative. NVIDIA is
consuming so much of my lifetime: not only this bug, but many others in the
last year or so.

AI Summary:

  I tested MR !5020 (never-fail-swap) against a compositor freeze I've been
  seeing on my system. Wanted to share the results and explain why I believe
  this is a different issue from what !5020 addresses.

  Setup:

  Single-GPU system: RTX 3060 Ti, driver 595.58.03 proprietary, Ubuntu 26.04,
  Wayland, dual 4K monitors (EIZO EV2785 via DisplayPort/KVM).

  What I built:

  Took the 50.0-0ubuntu4 source package, reversed the two !5008-based patches,
  applied both commits from your never-fail-swap branch (9713674437 +
  4fd35080d1) to meta-onscreen-native.c. Built as 50.0-0ubuntu4+mr5020. Did not
  include the secondary GPU buffer refactor (single-GPU system).

  Result:

  Compositor froze after ~50 minutes. Same deadlock as with the 0ubuntu3
  patches.

  Why !5020 can't fix this specific deadlock:

  Your fix operates on the mutter side — it ensures correct EGL/GBM swap pairing
  and that pending frames are consumed before a new swap. This controls what
  happens before and after eglSwapBuffers is called.

  The deadlock I'm hitting is inside a single eglSwapBuffers call. The backtrace
  shows mutter correctly initiating a swap, but the driver never returns:

  #17 cogl_onscreen_swap_buffers_with_damage()   ← mutter calls swap
  #15 ??? () at libmutter-cogl-18.so.0           ← cogl calls eglSwapBuffers
  #14 ??? () at libEGL_nvidia.so.0               ← driver code from here
  ...
  #7  ??? () at libEGL_nvidia.so.0
  #6  pthread_cond_wait (mutex=0x57119e076570)   ← hangs forever

  8 additional threads are stuck in pthread_cond_wait on another mutex
  (0x7579524ed160).

  The root cause is a TOCTOU race on an unsynchronized needs_signal byte flag
  (offset 0x1f8) inside libEGL_nvidia.so:

  1. Waiter thread is about to call pthread_cond_wait, but hasn't set
     needs_signal = 1 yet
  2. Signaler thread checks needs_signal, sees 0, skips pthread_cond_broadcast
  3. Waiter sets needs_signal = 1 and enters pthread_cond_wait, but the signal
     already came and won't come again
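
  The three steps above can be replayed deterministically. What follows is a
  Python illustration of the general lost-wakeup pattern, not the driver's
  code (which is closed source). The Gate class, the done predicate, and all
  function names are hypothetical, and a timeout stands in for the indefinite
  hang so that the script terminates.

```python
import threading


class Gate:
    """Hypothetical stand-in for the driver's internal wait object."""

    def __init__(self):
        self.cond = threading.Condition()
        self.needs_signal = False  # mirrors the flag named in the report
        self.done = False          # the event the waiter is waiting for


def racy_signal(gate):
    """Signaler that reads needs_signal without holding the lock."""
    gate.done = True
    if gate.needs_signal:          # check-then-act window (step 2)
        with gate.cond:
            gate.cond.notify_all()
        return True
    return False                   # broadcast skipped


def racy_wait(gate, timeout):
    """Waiter that already decided to sleep and does not re-check 'done'."""
    gate.needs_signal = True       # set too late (step 3)
    with gate.cond:
        # Returns False on timeout; with no timeout this would block forever.
        return gate.cond.wait(timeout)


def safe_signal(gate):
    """Signaler that updates the predicate and notifies under the lock."""
    with gate.cond:
        gate.done = True
        gate.cond.notify_all()


def safe_wait(gate, timeout):
    """Waiter that re-checks the predicate under the same lock."""
    with gate.cond:
        return gate.cond.wait_for(lambda: gate.done, timeout)


if __name__ == "__main__":
    # Replay the bad interleaving: the signaler runs before the flag is set.
    racy = Gate()
    print("racy broadcast sent:", racy_signal(racy))
    print("racy waiter woken:  ", racy_wait(racy, 0.1))

    # Same ordering with the predicate protected by the lock.
    safe = Gate()
    safe_signal(safe)
    print("safe waiter woken:  ", safe_wait(safe, 0.1))
```

  The safe variant does not depend on who runs first: the waiter checks done
  while holding the lock, so an event that already happened is never missed.
  Whether the driver could be fixed this way is an assumption, since its
  locking design is not visible from the backtrace.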

  No matter how correctly mutter sequences its swaps, once eglSwapBuffers is
  called, the driver's internal synchronization is solely responsible. I have a
  minimal reproducer (191 lines, no compositor, single output, just
  eglSwapBuffers in a loop) that deadlocks in 2 iterations, confirming the bug
  is entirely inside the driver.

  Bug report: NVIDIA: Developer Forum #366254

  This is a different bug from the swap-pairing issues that !5020 and LP
  #2146782 address. Full backtrace available if useful.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2147648

Title:
   gnome-shell freeze: NVIDIA EGL deadlock in eglSwapBuffers triggered
  by notification damage rects

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/mutter/+bug/2147648/+subscriptions


-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
