Public bug reported:
Summary
-------
A regression introduced between kernel 6.8.0-110-generic and 6.8.0-111-generic
(Ubuntu 24.04 LTS / noble) causes an NVIDIA Blackwell workstation GPU to fall
off the PCIe bus under sustained compute load, anywhere from minutes to hours
after boot. The same userspace driver and identical workload run cleanly on
6.8.0-110-generic, which points to the kernel rather than the NVIDIA driver.
Affected hardware
-----------------
- GPU: NVIDIA RTX PRO 5000 Blackwell (PCI ID 10de:2bb3, 48 GB, 300 W TDP)
- CPU/board: Intel Alder Lake-S (8086:460d root port at 00:01.0), ASUS
- PSU: 850 W (rated well above the 300 W card TDP)
Software
--------
- Ubuntu 24.04.4 LTS (noble)
- Bad kernel: linux-image-6.8.0-111-generic (6.8.0-111.111)
- Good kernel: linux-image-6.8.0-110-generic (6.8.0-110.110)
- NVIDIA driver: nvidia-driver-580-open 580.159.03 (open kernel modules,
required for Blackwell). Same driver version was tested against both kernels.
- DKMS-built nvidia kernel modules for both kernels.
Reproduction
------------
1. Boot 6.8.0-111-generic with nvidia-driver-580-open 580.159.03 loaded.
2. Place sustained compute load on the GPU (in this case, llama.cpp
llama-server inference, ~26 GB model resident, GPU utilisation 90-100%,
power draw 290-300 W, temperature 60-70 °C).
3. GPU falls off the bus, sometimes within minutes and sometimes hours later
(crashes observed at 3 min, 30 min, and ~6 hours after boot, all under load).
4. On 6.8.0-110-generic, the *same* userspace driver and the same workload
run cleanly, reaching 86 °C and 300 W with no errors over extended testing.
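To timestamp the failure against temperature and power, the load in step 2 can be watched with a small poller. The following is a minimal sketch, not the reporter's tooling: it assumes nvidia-smi is on PATH, uses its standard --query-gpu fields, and the names parse_sample and watch are hypothetical. The 5-second interval mirrors the metrics history mentioned under "What was ruled out". How nvidia-smi behaves once the GPU is lost is an assumption here; the sketch simply stops at the first row it cannot parse.

```python
import subprocess

# Standard nvidia-smi query fields; "-l 5" repeats the query every 5 seconds.
QUERY = "timestamp,temperature.gpu,power.draw,utilization.gpu"
CMD = ["nvidia-smi", f"--query-gpu={QUERY}",
       "--format=csv,noheader,nounits", "-l", "5"]


def parse_sample(line):
    """Parse one CSV row into a dict; return None if it is not a sample row."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 4:
        return None
    try:
        return {
            "time": parts[0],
            "temp_c": float(parts[1]),
            "power_w": float(parts[2]),
            "util_pct": float(parts[3]),
        }
    except ValueError:
        # Non-numeric fields (for example an error string) are not samples.
        return None


def watch():
    """Print samples until nvidia-smi emits something that is not a sample."""
    with subprocess.Popen(CMD, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            sample = parse_sample(line)
            if sample is None:
                print("non-sample output (possible GPU loss):", line.strip())
                proc.terminate()
                break
            print(sample)
```

Calling watch() in a second terminal while the workload runs leaves a log whose last good row gives the temperature and power just before the Xid 79.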
dmesg signature (truncated; happy to attach a full apport report)
-----------------------------------------------------------------
The crash always begins with PCIe correctable errors on the GPU's root port,
immediately followed by Xid 79 and Xid 154, then ~96 seconds of NVRM
assertion failures while the driver flails before the UVM subsystem reports
a fatal error:
pcieport 0000:00:01.0: AER: Multiple Correctable error message received from 0000:00:01.0
pcieport 0000:00:01.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
pcieport 0000:00:01.0: [ 0] RxErr (First)
NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x2 (Node Reboot Required)
... ~96s of NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) ...
NVRM: nvGpuOpsReportFatalError: uvm encountered global fatal error 0x60, requiring os reboot to recover.
Once in this state nvidia-smi returns "Unable to determine the device handle
for GPU0: 0000:01:00.0: Unknown Error" until the machine is rebooted.
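For anyone checking whether their own logs show the same ordering (AER correctable error on the root port first, then Xid 79), the signature can be matched mechanically. This is a rough sketch: the regular expressions are derived only from the excerpt above, the names scan and matches_signature are hypothetical, and other kernels or driver versions may word these messages differently.

```python
import re

# Patterns derived from the excerpt in this report; wording may vary elsewhere.
AER_RE = re.compile(r"pcieport (?P<port>[0-9a-f:.]+): AER: .*Correctable error")
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<dev>[0-9a-f:.]+)\): (?P<code>\d+),")


def scan(lines):
    """Return ('aer', port) and ('xid', code) events in order of appearance."""
    events = []
    for line in lines:
        m = AER_RE.search(line)
        if m:
            events.append(("aer", m.group("port")))
            continue
        m = XID_RE.search(line)
        if m:
            events.append(("xid", int(m.group("code"))))
    return events


def matches_signature(events):
    """True if an AER correctable error was seen before the first Xid 79."""
    seen_aer = False
    for kind, value in events:
        if kind == "aer":
            seen_aer = True
        elif kind == "xid" and value == 79:
            return seen_aer
    return False
```

Feeding it the unwrapped kernel log, one record per line (for example the output of `journalctl -k`), gives a quick yes/no on whether a given crash follows the same sequence.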
What was ruled out
------------------
- Thermal: the card had been running 88-90 °C and 305-307 W under similar
load every day for the prior 14 days without a single Xid (per a local
5-second-interval metrics history). The crash on 6.8.0-111 happens at
significantly lower stress (68 °C, 300 W).
- Power delivery / PSU sag: 850 W PSU vs 300 W TDP card; peak power at crash
was below historical max. Same PSU, no hardware changes for >1 month prior
to the regression.
- 12VHPWR / cable contact: untouched for over a month before crashes started;
workstation card with conventional connector layout.
- NVIDIA driver: identical 580.159.03 driver works on 6.8.0-110.
- PCIe link integrity: AER reports only Correctable errors, on a fresh
install with no slot/cable changes; the link is healthy on 6.8.0-110.
Expected behaviour
------------------
GPU remains functional under sustained load on 6.8.0-111-generic, equivalent
to 6.8.0-110-generic.
Actual behaviour
----------------
Under sustained compute load, the GPU disconnects from PCIe (Xid 79), in the
observed cases between 3 minutes and ~6 hours after boot; recovery requires a
reboot.
Workaround
----------
Pin 6.8.0-110-generic as the GRUB default (GRUB_DEFAULT=saved +
grub-set-default). Future kernels need to be tested individually before
promotion.
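Since each future kernel has to be vetted before promotion, it can help to record verdicts and check the running kernel against them at boot or before starting a long job. A minimal sketch, assuming only the two verdicts stated in this report; the KNOWN table and the verdict helper are hypothetical names, not part of any Ubuntu tooling.

```python
import platform

# Verdicts taken from this report; extend the table as kernels are tested.
KNOWN = {
    "6.8.0-110-generic": "good",
    "6.8.0-111-generic": "bad",
}


def verdict(release):
    """Classify a kernel release string (as printed by `uname -r`)."""
    return KNOWN.get(release, "untested")
```

For example, `verdict(platform.release())` returns "untested" for any kernel not yet in the table, which is the cue to run the reproduction workload before trusting it.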
** Affects: linux (Ubuntu)
Importance: Undecided
Status: New
https://bugs.launchpad.net/bugs/2152037
Title:
[regression] kernel 6.8.0-111 causes Xid 79 "GPU has fallen off the
bus" on RTX PRO 5000 Blackwell under load; 6.8.0-110 unaffected