[Bug 2062380] Re: Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper
The problem does not persist in newer versions of the driver. As such, it will be fixed once the new drivers are released. Additionally, in investigating the problem it seems that enabling the flag does NOT fix the problem. As such, there is no benefit to adding the flag in any case. Considering this done.

-- 
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2062380

Title:
  Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-535-server/+bug/2062380/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 2062380] Re: Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper
Compiling the Nvidia drivers with -ffixed-x18 on affected versions is also sufficient to prevent this hang/panic:

https://github.com/NVIDIA/open-gpu-kernel-modules

diff --git a/src/nvidia-modeset/Makefile b/src/nvidia-modeset/Makefile
index 66edbf4e..d49a3bfb 100644
--- a/src/nvidia-modeset/Makefile
+++ b/src/nvidia-modeset/Makefile
@@ -95,6 +95,7 @@ endif
 ifeq ($(TARGET_ARCH),aarch64)
   CFLAGS += -mgeneral-regs-only
   CFLAGS += -march=armv8-a
+  CFLAGS += -ffixed-x18
   CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -mno-outline-atomics)
 endif

diff --git a/src/nvidia/Makefile b/src/nvidia/Makefile
index e2f1c672..0f70514b 100644
--- a/src/nvidia/Makefile
+++ b/src/nvidia/Makefile
@@ -90,6 +90,7 @@ ifeq ($(TARGET_ARCH),aarch64)
   CFLAGS += -mgeneral-regs-only
   CFLAGS += -march=armv8-a
   CFLAGS += -mstrict-align
+  CFLAGS += -ffixed-x18
   CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -mno-outline-atomics)
 endif
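For context on why -ffixed-x18 matters here: on arm64, a kernel built with CONFIG_SHADOW_CALL_STACK keeps its shadow call stack pointer in x18, so module code that uses x18 as an ordinary scratch register can corrupt it. As a rough diagnostic sketch (not part of the patch above), something like the following could flag instructions that mention x18 in `objdump -d` output of a built module. The function name and the sample disassembly are illustrative assumptions, and a hit is only a hint to look closer, since not every reference is a clobber.

```python
import re

# Matches x18 or w18 as a whole register name (so "x180" or "foo_x18" do not match).
X18_RE = re.compile(r"\b[xw]18\b")


def find_x18_references(disassembly: str) -> list[str]:
    """Return the instruction text of every disassembly line that mentions x18/w18.

    Assumes `objdump -d` style lines: "address:<TAB>bytes<TAB>mnemonic<TAB>operands".
    Lines without at least three tab-separated fields (headers, labels) are skipped.
    """
    hits = []
    for line in disassembly.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        # Drop the address and raw-bytes columns, keep mnemonic and operands.
        instruction = " ".join(p.strip() for p in parts[2:])
        if X18_RE.search(instruction):
            hits.append(instruction)
    return hits


# Hypothetical three-instruction sample, for illustration only.
SAMPLE = (
    "   0:\td10083ff \tsub\tsp, sp, #0x20\n"
    "   4:\tf9400012 \tldr\tx18, [x0]\n"
    "   8:\td65f03c0 \tret\n"
)

if __name__ == "__main__":
    for hit in find_x18_references(SAMPLE):
        print(hit)
```

In practice the input would be the text produced by running `objdump -d` on the module object; a module rebuilt with -ffixed-x18 would be expected to show far fewer (ideally no) compiler-generated uses of x18.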
[Bug 2062380] Re: Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper
In trying to determine if core count had any effect on this bug, I set maxcpus to 4 and tried loading the driver on the kernel with the shadow stack enabled (aka the standard -generic config). It looks like the same root issue occurred, but this time, I got a panic with a trace that corroborates the claim that this is related to the shadow stack:

[ 391.736417] Internal error: Oops - FPAC: 7200 [#1] SMP
[ 391.744257] Modules linked in: nvidia(OE+) ecdh_generic ecc qrtr cdc_ether cdc_subset usbnet cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core ast i2c_algo_bit nvidia_cspmu arm_spe_pmu arm_smmuv3_pmu arm_cspmu_module uio_pdrv_genirq uio spi_nor acpi_ipmi mtd nls_iso8859_1 ipmi_ssif ipmi_devintf cppc_cpufreq ipmi_msghandler acpi_power_meter dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll i2c_smbus crct10dif_ce polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce sm3 nvme sha3_ce sha2_ce sha256_arm64 sha1_ce mlx5_core nvme_core mlxfw nvme_auth psample xhci_pci tls xhci_pci_renesas pci_hyperv_intf spi_tegra210_quad i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[ 391.826552] CPU: 0 PID: 14412 Comm: insmod Tainted: G OE 6.8.1+ #2
[ 391.834202] Hardware name: /, BIOS 01.02.01 20240207
[ 391.840074] pstate: 6349 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 391.847190] pc : __kmalloc+0x1e4/0x498
[ 391.851025] lr : 0xc040
[ 391.854605] sp : 8000a3ab3620
[ 391.857987] x29: 8000a3ab3620 x28: 0001 x27: 0001
[ 391.865282] x26: 01f8 x25: 00aa1d70 x24: 8feac028
[ 391.872577] x23: c040aab743f0 x22: 80008d4c5020 x21: 8000a3ab37f8
[ 391.879871] x20: 0038 x19: 8000a3ab3658 x18: 8000a3ab3614
[ 391.887165] x17: x16: x15: 0004
[ 391.894459] x14: x13: x12:
[ 391.901753] x11: x10: 8000a3ab36a0 x9 : c040c0af8d48
[ 391.909049] x8 : 8edc3c40 x7 : x6 :
[ 391.916343] x5 : x4 : x3 :
[ 391.923637] x2 : x1 : 8e87c480 x0 : 8edc3c00
[ 391.930931] Call trace:
[ 391.933427] __kmalloc+0x1e4/0x498
[ 391.936899] 0xc0007304e5f6c040
[ 391.940107] Code: a9435bf5 a94463f7 910183ff f85f8e5e (d50323bf)
[ 391.946336] ---[ end trace ]---
[ 391.977579] Kernel panic - not syncing: corrupted shadow stack detected inside scheduler
[ 391.980605] kauditd_printk_skb: 98 callbacks suppressed
[ 391.980607] audit: type=1400 audit(1713999301.128:108): apparmor="DENIED" operation="open" class="file" profile="rsyslogd" name="/run/systemd/sessions/" pid=801 comm=72733A6D61696E20513A526567 requested_mask="r" denied_mask="r" fsuid=103 ouid=0
[ 391.980674] audit: type=1400 audit(1713999301.128:109): apparmor="DENIED" operation="open" class="file" profile="rsyslogd" name="/run/systemd/sessions/" pid=801 comm=72733A6D61696E20513A526567 requested_mask="r" denied_mask="r" fsuid=103 ouid=0
[ 391.980679] audit: type=1400 audit(1713999301.128:110): apparmor="DENIED" operation="open" class="file" profile="rsyslogd" name="/run/systemd/sessions/" pid=801 comm=72733A6D61696E20513A526567 requested_mask="r" denied_mask="r" fsuid=103 ouid=0
[ 392.057603] SMP: stopping secondary CPUs
[ 392.061632] Kernel Offset: 0x40404069 from 0x80008000
[ 392.067859] PHYS_OFFSET: 0x8000
[ 392.071420] CPU features: 0x0,,d002cd4a,2b67fea7
[ 392.076848] Memory Limit: none
[ 392.106695] ---[ end Kernel panic - not syncing: corrupted shadow stack detected inside scheduler ]---
[Bug 2062380] Re: Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper
It looks like this is the relevant option: it is set as follows in the upstream stable 6.8.1 defconfig but not in the 6.8.0-31-generic config, and it is what allows the defconfig kernel to load the Nvidia driver:

CONFIG_SHADOW_CALL_STACK=n

I suspect that the kernel team is not going to want to disable kernel support for the GCC shadow stack to fix this bug, so my guess is that we'll need to explore other potential fixes for this issue.
[Bug 2062380] Re: Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper
** Changed in: nvidia-graphics-drivers-535-server (Ubuntu)
     Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)

** Changed in: nvidia-graphics-drivers-550-server (Ubuntu)
     Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)
[Bug 2062380] Re: Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper
This issue looks to be related to kernel configuration. I tested with upstream stable 6.8.1, which is what the noble kernel currently being tested is rebased on. Built with 'make defconfig', the nvidia module loads successfully. But with the same kernel built using the noble config, the nvidia module experiences the same hang as with the noble kernel. I'm currently working through a config comparison and testing changes.
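A config comparison like the one described above can be sketched in a few lines. This is only an illustration of the approach, not the tooling actually used for this bug: the function names and the two tiny sample configs are hypothetical, and an option that is missing from a file is treated the same as "is not set" (both reported as "n"), which is a simplification.

```python
NOT_SET_SUFFIX = " is not set"


def parse_kconfig(text: str) -> dict[str, str]:
    """Parse kernel .config text into {option: value}.

    Lines of the form "# CONFIG_FOO is not set" are recorded as "n".
    """
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(NOT_SET_SUFFIX):
            name = line[2:-len(NOT_SET_SUFFIX)]
            options[name] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_kconfig(left: str, right: str) -> dict[str, tuple[str, str]]:
    """Return {option: (left_value, right_value)} for options that differ.

    Simplification: an option absent from one side is treated as "n".
    """
    a, b = parse_kconfig(left), parse_kconfig(right)
    changed = {}
    for name in sorted(a.keys() | b.keys()):
        va, vb = a.get(name, "n"), b.get(name, "n")
        if va != vb:
            changed[name] = (va, vb)
    return changed


# Hypothetical excerpts standing in for the defconfig and noble configs.
DEFCONFIG = "# CONFIG_SHADOW_CALL_STACK is not set\nCONFIG_ARM64=y\n"
NOBLE = "CONFIG_SHADOW_CALL_STACK=y\nCONFIG_ARM64=y\n"

if __name__ == "__main__":
    for name, (old, new) in diff_kconfig(DEFCONFIG, NOBLE).items():
        print(f"{name}: {old} -> {new}")
```

With real config files the list of differences will be long, so in practice the candidates still have to be narrowed down by rebuilding and retesting with individual options flipped.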
[Bug 2062380] Re: Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper
** Summary changed:

- Using a 6.8 kernel modprobe nvidia hangs on Grace Hopper
+ Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper

** Also affects: nvidia-graphics-drivers-535-server (Ubuntu)
   Importance: Undecided
       Status: New

** Changed in: nvidia-graphics-drivers-535-server (Ubuntu)
       Status: New => Confirmed

** Changed in: nvidia-graphics-drivers-550-server (Ubuntu)
       Status: New => Confirmed

** Description changed:

  Using both -generic and -nvidia 6.8 kernels I'm seeing a hang when I load the nvidia driver.
+
+ $ sudo dmidecode -t 0
+ # dmidecode 3.5
+ Getting SMBIOS data from sysfs.
+ SMBIOS 3.6.0 present.
+ # SMBIOS implementations newer than version 3.5.0 are not
+ # fully supported by this version of dmidecode.
+
+ Handle 0x0001, DMI type 0, 26 bytes
+ BIOS Information
+   Vendor: NVIDIA
+   Version: 01.02.01
+   Release Date: 20240207
+   ROM Size: 64 MB
+   Characteristics:
+     PCI is supported
+     PNP is supported
+     BIOS is upgradeable
+     BIOS shadowing is allowed
+     Boot from CD is supported
+     Selectable boot is supported
+     Serial services are supported (int 14h)
+     ACPI is supported
+     Targeted content distribution is supported
+     UEFI is supported
+   Firmware Revision: 0.0

  [ 382.938326] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  [ 382.946075] rcu: 53-...0: (4 ticks this GP) idle=1c2c/1/0x4000 softirq=4866/4868 fqs=14124
  [ 382.955683] rcu: hardirqs softirqs csw/system
  [ 382.961378] rcu: number:0 00
  [ 382.967071] rcu: cputime:0 00 ==> 30026(ms)
  [ 382.974189] rcu: (detected by 52, t=60034 jiffies, g=24469, q=1199 ncpus=72)
  [ 392.982095] rcu: rcu_preempt kthread starved for 9994 jiffies! g24469 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31
  [ 392.992769] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior

- After seeing this, I Enabled kdump and set kernel.panic_on_rcu_stall = 1

  KDUMP INFO
  WARNING: cpu 54: cannot find NT_PRSTATUS note
- KERNEL: /usr/lib/debug/boot/vmlinux-6.8.0-1004-nvidia-64k [TAINTED]
- DUMPFILE: /var/crash/202404172139/dump.202404172139 [PARTIAL DUMP]
- CPUS: 72
- DATE: Wed Apr 17 21:39:13 UTC 2024
- UPTIME: 00:06:10
+ KERNEL: /usr/lib/debug/boot/vmlinux-6.8.0-1004-nvidia-64k [TAINTED]
+ DUMPFILE: /var/crash/202404172139/dump.202404172139 [PARTIAL DUMP]
+ CPUS: 72
+ DATE: Wed Apr 17 21:39:13 UTC 2024
+ UPTIME: 00:06:10
  LOAD AVERAGE: 0.68, 0.63, 0.28
- TASKS: 854
- NODENAME: hinyari
- RELEASE: 6.8.0-1005-nvidia-64k
- VERSION: #5-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 17 11:26:46 UTC 2024
- MACHINE: aarch64 (unknown Mhz)
- MEMORY: 479.7 GB
- PANIC: "Kernel panic - not syncing: RCU Stall"
- PID: 0
- COMMAND: "swapper/21"
- TASK: 82026880 (1 of 72) [THREAD_INFO: 82026880]
- CPU: 21
- STATE: TASK_RUNNING (PANIC)
+ TASKS: 854
+ NODENAME: hinyari
+ RELEASE: 6.8.0-1005-nvidia-64k
+ VERSION: #5-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 17 11:26:46 UTC 2024
+ MACHINE: aarch64 (unknown Mhz)
+ MEMORY: 479.7 GB
+ PANIC: "Kernel panic - not syncing: RCU Stall"
+ PID: 0
+ COMMAND: "swapper/21"
+ TASK: 82026880 (1 of 72) [THREAD_INFO: 82026880]
+ CPU: 21
+ STATE: TASK_RUNNING (PANIC)

  [ 300.313144] nvidia: loading out-of-tree module taints kernel.
  [ 300.313153] nvidia: module verification failed: signature and/or required key missing - tainting kernel
  [ 300.316694] nvidia-nvlink: Nvlink Core is being initialized, major device number 506
- [ 300.316699]
+ [ 300.316699]
  [ 360.323454] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  [ 360.331206] rcu: 54-...0: (24 ticks this GP) idle=742c/1/0x4000 softirq=4931/4933 fqs=13148
  [ 360.340903] rcu: hardirqs softirqs csw/system
  [ 360.346597] rcu: number:0 00
  [ 360.352291] rcu: cputime:0 00 ==> 30031(ms)
  [ 360.359408] rcu: (detected by 21, t=60038 jiffies, g=25009, q=1123 ncpus=72)
  [ 360.366704] Sending NMI from CPU 21 to CPUs 54:
  [ 370.367310] rcu: rcu_preempt kthread starved for 9993 jiffies! g25009 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31
  [ 370.377983] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
  [ 370.387322] rcu: RCU grace-period kthread stack dump:
  [ 370.392482] task:rcu_preempt state:I stack:0 pid:17 tgid:17 ppid:2 flags:0x0008
  [ 370.392488] Call trace:
  [