Public bug reported:

Summary
-------
A regression introduced between kernel 6.8.0-110-generic and 6.8.0-111-generic
(Ubuntu 24.04 LTS / noble) causes an NVIDIA Blackwell workstation GPU to fall
off the PCIe bus under sustained compute load, in some runs within minutes.
The same userspace driver and identical workload run cleanly on
6.8.0-110-generic, which points to the kernel rather than the NVIDIA driver.

Affected hardware
-----------------
- GPU: NVIDIA RTX PRO 5000 Blackwell  (PCI ID 10de:2bb3, 48 GB, 300 W TDP)
- CPU/board: Intel Alder Lake-S (8086:460d root port at 00:01.0), ASUS
- PSU: 850 W (rated well above the 300 W card TDP)

Software
--------
- Ubuntu 24.04.4 LTS (noble)
- Bad kernel:  linux-image-6.8.0-111-generic   (6.8.0-111.111)
- Good kernel: linux-image-6.8.0-110-generic   (6.8.0-110.110)
- NVIDIA driver: nvidia-driver-580-open 580.159.03 (open kernel modules,
  required for Blackwell). Same driver version was tested against both kernels.
- DKMS-built nvidia kernel modules for both kernels.

Reproduction
------------
1. Boot 6.8.0-111-generic with nvidia-driver-580-open 580.159.03 loaded.
2. Place sustained compute load on the GPU (in this case, llama.cpp
   llama-server inference, ~26 GB model resident, GPU utilisation 90-100%,
   power draw 290-300 W, temperature 60-70 °C).
3. GPU falls off the bus under load; time to failure varies (observed crashes
   at 3 min, 30 min, and ~6 hours after boot).
4. On 6.8.0-110-generic, the *same* userspace driver and the same workload
   run cleanly, reaching 86 °C and 300 W with no errors over extended
   testing.

dmesg signature (truncated; happy to attach a full apport report)
-----------------------------------------------------------------
The crash always begins with PCIe correctable errors on the GPU's root port,
immediately followed by Xid 79 and Xid 154, then ~96 seconds of NVRM
assertion failures while the driver flails before the UVM subsystem reports
a fatal error:

  pcieport 0000:00:01.0: AER: Multiple Correctable error message received from 0000:00:01.0
  pcieport 0000:00:01.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
  pcieport 0000:00:01.0:    [ 0] RxErr                  (First)
  NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
  NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x2 (Node Reboot Required)
  ... ~96s of NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) ...
  NVRM: nvGpuOpsReportFatalError: uvm encountered global fatal error 0x60, requiring os reboot to recover.

Once in this state nvidia-smi returns "Unable to determine the device handle
for GPU0: 0000:01:00.0: Unknown Error" until the machine is rebooted.
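
For anyone trying to confirm the same signature on their own machine, the
following is a minimal sketch that scans kernel log text for the sequence
described above (an AER correctable error followed by Xid 79). The regular
expressions are derived only from the excerpt in this report; the function
names are hypothetical, and real logs may differ in prefix or wording.

```python
import re

# Patterns derived from the log excerpt in this report. They are not
# anchored to the start of the line, so timestamp prefixes are tolerated.
AER_RE = re.compile(r"pcieport (\S+): AER: .*Correctable error")
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def summarize(lines):
    """Return (AER-reporting ports, [(device, xid), ...]) in log order."""
    aer_ports = []
    xids = []
    for line in lines:
        m = AER_RE.search(line)
        if m:
            aer_ports.append(m.group(1))
            continue
        m = XID_RE.search(line)
        if m:
            xids.append((m.group(1), int(m.group(2))))
    return aer_ports, xids


def matches_signature(lines):
    """True if an AER correctable error appears before an Xid 79."""
    seen_aer = False
    for line in lines:
        if AER_RE.search(line):
            seen_aer = True
            continue
        m = XID_RE.search(line)
        if m and int(m.group(2)) == 79 and seen_aer:
            return True
    return False
```

The input can be any iterable of lines, for example the saved output of
dmesg or of "journalctl -k -b -1" (the latter assumes a persistent journal
so that the previous boot is still available).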

What was ruled out
------------------
- Thermal: the card had been running 88-90 °C and 305-307 W under similar
  load every day for the prior 14 days without a single Xid (per a local
  5-second-interval metrics history). The crash on 6.8.0-111 happens at
  significantly lower stress (68 °C, 300 W).
- Power delivery / PSU sag: 850 W PSU vs 300 W TDP card; peak power at crash
  was below historical max. Same PSU, no hardware changes for >1 month prior
  to the regression.
- 12VHPWR / cable contact: untouched for over a month before crashes started;
  workstation card with conventional connector layout.
- NVIDIA driver: identical 580.159.03 driver works on 6.8.0-110.
- PCIe link integrity: AER reports only Correctable errors, on a fresh
  install with no slot/cable changes; the link is healthy on 6.8.0-110.
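
The report does not say how the 5-second-interval metrics history was
collected. As one possible approach, the sketch below polls nvidia-smi and
appends temperature, power draw, and utilisation to a CSV file. The function
names, the output path, and the choice of fields are assumptions made for
illustration, not the reporter's actual tooling.

```python
import csv
import subprocess
import time

# Query fields understood by nvidia-smi --query-gpu.
FIELDS = "timestamp,temperature.gpu,power.draw,utilization.gpu"


def parse_sample(line):
    """Parse one 'csv,noheader,nounits' line into a dict of readings."""
    ts, temp, power, util = [field.strip() for field in line.split(",")]
    return {
        "timestamp": ts,
        "temp_c": float(temp),
        "power_w": float(power),
        "util_pct": float(util),
    }


def sample_once():
    """Run nvidia-smi once and return the readings for the first GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.splitlines()[0])


def log_forever(path, interval_s=5.0):
    """Append one row per interval; record failures instead of exiting."""
    with open(path, "a", newline="") as fh:
        writer = csv.writer(fh)
        while True:
            try:
                s = sample_once()
                writer.writerow([s["timestamp"], s["temp_c"], s["power_w"], s["util_pct"]])
            except (subprocess.CalledProcessError, FileNotFoundError,
                    ValueError, IndexError) as exc:
                # After an Xid 79, nvidia-smi is expected to fail; keep a marker row.
                writer.writerow([time.strftime("%Y/%m/%d %H:%M:%S"), "error", type(exc).__name__, ""])
            fh.flush()
            time.sleep(interval_s)
```

Calling log_forever("gpu-metrics.csv") would then leave a history that can be
compared against the readings at the time of a crash.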

Expected behaviour
------------------
GPU remains functional under sustained load on 6.8.0-111-generic, equivalent
to 6.8.0-110-generic.

Actual behaviour
----------------
Under sustained compute load, the GPU disconnects from PCIe (Xid 79), in some
runs within minutes; recovery requires a reboot.

Workaround
----------
Pin 6.8.0-110-generic as the GRUB default (GRUB_DEFAULT=saved + 
grub-set-default). Future kernels need to be tested individually before 
promotion.
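
As a small guard around this workaround, the sketch below checks after a
reboot which kernel is actually running and classifies it against the two
releases named in this report. The function name and the "untested" label
are hypothetical; the release strings are assumed to match what uname -r
prints for these packages.

```python
import platform

KNOWN_GOOD = "6.8.0-110-generic"    # good kernel per this report
KNOWN_BAD = {"6.8.0-111-generic"}   # bad kernel per this report


def classify_kernel(release):
    """Classify a kernel release string as 'good', 'bad', or 'untested'."""
    if release == KNOWN_GOOD:
        return "good"
    if release in KNOWN_BAD:
        return "bad"
    return "untested"


if __name__ == "__main__":
    running = platform.release()  # same string as `uname -r`
    print(f"{running}: {classify_kernel(running)}")
```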

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/2152037

Title:
  [regression] kernel 6.8.0-111 causes Xid 79 "GPU has fallen off the
  bus" on RTX PRO 5000 Blackwell under load; 6.8.0-110 unaffected
