[Bug 2062380] Re: Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper

2024-06-13 Thread Eduard de Vidal Flores
The problem does not persist in newer versions of the driver. As such,
it will be fixed once the new drivers are released.

Additionally, investigation suggests that enabling the flag does NOT
fix the problem, so there is no benefit to adding the flag in any
case.

Considering this done.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2062380

Title:
  Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-535-server/+bug/2062380/+subscriptions


-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 2062380] Re: Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper

2024-04-24 Thread Mitchell Augustin
Compiling the Nvidia drivers with -ffixed-x18 on affected versions is
also sufficient to prevent this hang/panic:

https://github.com/NVIDIA/open-gpu-kernel-modules

diff --git a/src/nvidia-modeset/Makefile b/src/nvidia-modeset/Makefile
index 66edbf4e..d49a3bfb 100644
--- a/src/nvidia-modeset/Makefile
+++ b/src/nvidia-modeset/Makefile
@@ -95,6 +95,7 @@ endif
 ifeq ($(TARGET_ARCH),aarch64)
   CFLAGS += -mgeneral-regs-only
   CFLAGS += -march=armv8-a
+  CFLAGS += -ffixed-x18
   CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -mno-outline-atomics)
 endif
 
diff --git a/src/nvidia/Makefile b/src/nvidia/Makefile
index e2f1c672..0f70514b 100644
--- a/src/nvidia/Makefile
+++ b/src/nvidia/Makefile
@@ -90,6 +90,7 @@ ifeq ($(TARGET_ARCH),aarch64)
   CFLAGS += -mgeneral-regs-only
   CFLAGS += -march=armv8-a
   CFLAGS += -mstrict-align
+  CFLAGS += -ffixed-x18
   CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -mno-outline-atomics)
 endif
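
For context, on arm64 kernels built with CONFIG_SHADOW_CALL_STACK, register
x18 is reserved for the shadow call stack pointer, so module code must not use
it as a general-purpose register. A rough way to check whether a given driver
source tree already carries the change above is sketched below. The helper name
aarch64_blocks_missing_flag is hypothetical, and the sketch assumes the
Makefiles keep the `ifeq ($(TARGET_ARCH),aarch64)` layout shown in the diff; it
is not part of the NVIDIA build system.

```python
import re

# GNU make directives that open a nested conditional block.
_OPENERS = ("ifeq", "ifneq", "ifdef", "ifndef")


def aarch64_blocks_missing_flag(makefile_text, flag="-ffixed-x18"):
    """Return the 1-based line number of every
    `ifeq ($(TARGET_ARCH),aarch64)` block that never adds `flag` to CFLAGS.

    Assumption: the flag is added with a plain `CFLAGS += ...` line somewhere
    inside the block (nested conditionals are counted too).
    """
    missing = []
    lines = makefile_text.splitlines()
    i = 0
    while i < len(lines):
        if lines[i].replace(" ", "") == "ifeq($(TARGET_ARCH),aarch64)":
            start, depth, found = i + 1, 1, False
            i += 1
            # Walk to the matching endif, tracking nested conditionals.
            while i < len(lines) and depth > 0:
                stripped = lines[i].strip()
                if stripped.startswith(_OPENERS):
                    depth += 1
                elif stripped.startswith("endif"):
                    depth -= 1
                elif re.match(r"CFLAGS\s*\+=", stripped) and flag in stripped.split():
                    found = True
                i += 1
            if not found:
                missing.append(start)
        else:
            i += 1
    return missing
```

Feeding it the contents of src/nvidia/Makefile and src/nvidia-modeset/Makefile
should return an empty list once the patch is applied.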

[Bug 2062380] Re: Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper

2024-04-24 Thread Mitchell Augustin
In trying to determine if core count had any effect on this bug, I set
maxcpus to 4 and tried loading the driver on the kernel with the shadow
stack enabled (aka the standard -generic config). It looks like the same
root issue occurred, but this time, I got a panic with a trace that
corroborates the claim that this is related to the shadow stack:

[  391.736417] Internal error: Oops - FPAC: 7200 [#1] SMP
[  391.744257] Modules linked in: nvidia(OE+) ecdh_generic ecc qrtr cdc_ether 
cdc_subset usbnet cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core ast 
i2c_algo_bit nvidia_cspmu arm_spe_pmu arm_smmuv3_pmu arm_cspmu_module 
uio_pdrv_genirq uio spi_nor acpi_ipmi mtd nls_iso8859_1 ipmi_ssif ipmi_devintf 
cppc_cpufreq ipmi_msghandler acpi_power_meter dm_multipath efi_pstore nfnetlink 
dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 
async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon 
raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll 
i2c_smbus crct10dif_ce polyval_ce polyval_generic ghash_ce sm4_ce_gcm 
sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce sm3 nvme sha3_ce sha2_ce 
sha256_arm64 sha1_ce mlx5_core nvme_core mlxfw nvme_auth psample xhci_pci tls 
xhci_pci_renesas pci_hyperv_intf spi_tegra210_quad i2c_tegra aes_neon_bs 
aes_neon_blk aes_ce_blk aes_ce_cipher
[  391.826552] CPU: 0 PID: 14412 Comm: insmod Tainted: G   OE  
6.8.1+ #2
[  391.834202] Hardware name:  /, BIOS 01.02.01 20240207
[  391.840074] pstate: 6349 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[  391.847190] pc : __kmalloc+0x1e4/0x498
[  391.851025] lr : 0xc040
[  391.854605] sp : 8000a3ab3620
[  391.857987] x29: 8000a3ab3620 x28: 0001 x27: 0001
[  391.865282] x26: 01f8 x25: 00aa1d70 x24: 8feac028
[  391.872577] x23: c040aab743f0 x22: 80008d4c5020 x21: 8000a3ab37f8
[  391.879871] x20: 0038 x19: 8000a3ab3658 x18: 8000a3ab3614
[  391.887165] x17:  x16:  x15: 0004
[  391.894459] x14:  x13:  x12: 
[  391.901753] x11:  x10: 8000a3ab36a0 x9 : c040c0af8d48
[  391.909049] x8 : 8edc3c40 x7 :  x6 : 
[  391.916343] x5 :  x4 :  x3 : 
[  391.923637] x2 :  x1 : 8e87c480 x0 : 8edc3c00
[  391.930931] Call trace:
[  391.933427]  __kmalloc+0x1e4/0x498
[  391.936899]  0xc0007304e5f6c040
[  391.940107] Code: a9435bf5 a94463f7 910183ff f85f8e5e (d50323bf) 
[  391.946336] ---[ end trace  ]---
[  391.977579] Kernel panic - not syncing: corrupted shadow stack detected 
inside scheduler
[  391.980605] kauditd_printk_skb: 98 callbacks suppressed
[  391.980607] audit: type=1400 audit(1713999301.128:108): apparmor="DENIED" 
operation="open" class="file" profile="rsyslogd" name="/run/systemd/sessions/" 
pid=801 comm=72733A6D61696E20513A526567 requested_mask="r" denied_mask="r" 
fsuid=103 ouid=0
[  391.980674] audit: type=1400 audit(1713999301.128:109): apparmor="DENIED" 
operation="open" class="file" profile="rsyslogd" name="/run/systemd/sessions/" 
pid=801 comm=72733A6D61696E20513A526567 requested_mask="r" denied_mask="r" 
fsuid=103 ouid=0
[  391.980679] audit: type=1400 audit(1713999301.128:110): apparmor="DENIED" 
operation="open" class="file" profile="rsyslogd" name="/run/systemd/sessions/" 
pid=801 comm=72733A6D61696E20513A526567 requested_mask="r" denied_mask="r" 
fsuid=103 ouid=0
[  392.057603] SMP: stopping secondary CPUs
[  392.061632] Kernel Offset: 0x40404069 from 0x80008000
[  392.067859] PHYS_OFFSET: 0x8000
[  392.071420] CPU features: 0x0,,d002cd4a,2b67fea7
[  392.076848] Memory Limit: none
[  392.106695] ---[ end Kernel panic - not syncing: corrupted shadow stack 
detected inside scheduler ]---

[Bug 2062380] Re: Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper

2024-04-24 Thread Mitchell Augustin
It looks like this is the relevant setting. It is present in the
upstream stable 6.8.1 defconfig but not in the 6.8.0-31-generic config,
and it is what allows the defconfig kernel to load the Nvidia driver:

CONFIG_SHADOW_CALL_STACK=n

I suspect that the kernel team is not going to want to disable kernel
support for the GCC shadow stack to fix this bug, so my guess is that
we'll need to explore other potential fixes for this issue.
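
To confirm how a given kernel was built, the option can be looked up in that
kernel's config file. The following is a minimal sketch, not a tool used in
this investigation; the function name config_option_state is hypothetical.
Note that generated kernel config files normally record a disabled option as
"# CONFIG_... is not set" rather than "=n", so the sketch handles both forms.

```python
import re


def config_option_state(config_text, option):
    """Return the value string if `option` is set (for example 'y', 'm' or 'n'),
    'not set' if it appears as '# <option> is not set',
    or None if the option does not appear at all."""
    set_re = re.compile(r"^%s=(.*)$" % re.escape(option))
    unset_re = re.compile(r"^# %s is not set$" % re.escape(option))
    for line in config_text.splitlines():
        line = line.strip()
        match = set_re.match(line)
        if match:
            return match.group(1)
        if unset_re.match(line):
            return "not set"
    return None
```

On Ubuntu the text would typically come from /boot/config-<release> for the
kernel in question (that path is an assumption about the test system), queried
with option set to "CONFIG_SHADOW_CALL_STACK".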

[Bug 2062380] Re: Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper

2024-04-24 Thread Mitchell Augustin
** Changed in: nvidia-graphics-drivers-535-server (Ubuntu)
 Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)

** Changed in: nvidia-graphics-drivers-550-server (Ubuntu)
 Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)

[Bug 2062380] Re: Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper

2024-04-19 Thread Ian May
This issue looks to be related to kernel configuration. I tested upstream
stable 6.8.1, which is what the noble kernel currently being tested is
rebased on. With 'make defconfig', the nvidia module loads successfully.
But with the same kernel built using the noble config, the nvidia module
experiences the same hang as with the noble kernel.

I'm currently working through config comparison and testing changes.
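
A config comparison of this kind can be sketched as follows. This is only an
illustration of the approach, not the procedure actually used here; the
function names are hypothetical, and the kernel source tree also ships a
scripts/diffconfig helper that serves the same purpose.

```python
def parse_kernel_config(text):
    """Map each CONFIG_ option to its value.
    A '# CONFIG_X is not set' line is recorded as 'n'."""
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_kernel_configs(text_a, text_b):
    """Return {option: (value_in_a, value_in_b)} for every option whose
    value differs. An option missing from one side is reported as None."""
    a, b = parse_kernel_config(text_a), parse_kernel_config(text_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

Passing the defconfig-generated .config as text_a and the noble config as
text_b yields the candidate options to bisect, each with its value on both
sides.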

[Bug 2062380] Re: Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper

2024-04-18 Thread Ian May
** Summary changed:

- Using a 6.8 kernel modprobe nvidia hangs on Grace Hopper
+ Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper

** Also affects: nvidia-graphics-drivers-535-server (Ubuntu)
   Importance: Undecided
   Status: New

** Changed in: nvidia-graphics-drivers-535-server (Ubuntu)
   Status: New => Confirmed

** Changed in: nvidia-graphics-drivers-550-server (Ubuntu)
   Status: New => Confirmed

** Description changed:

  Using both -generic and -nvidia 6.8 kernels I'm seeing a hang when I
  load the nvidia driver.
+ 
+ $ sudo dmidecode -t 0
+ # dmidecode 3.5
+ Getting SMBIOS data from sysfs.
+ SMBIOS 3.6.0 present.
+ # SMBIOS implementations newer than version 3.5.0 are not
+ # fully supported by this version of dmidecode.
+ 
+ Handle 0x0001, DMI type 0, 26 bytes
+ BIOS Information
+   Vendor: NVIDIA
+   Version: 01.02.01
+   Release Date: 20240207
+   ROM Size: 64 MB
+   Characteristics:
+   PCI is supported
+   PNP is supported
+   BIOS is upgradeable
+   BIOS shadowing is allowed
+   Boot from CD is supported
+   Selectable boot is supported
+   Serial services are supported (int 14h)
+   ACPI is supported
+   Targeted content distribution is supported
+   UEFI is supported
+   Firmware Revision: 0.0
  
  [  382.938326] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  [  382.946075] rcu: 53-...0: (4 ticks this GP) 
idle=1c2c/1/0x4000 softirq=4866/4868 fqs=14124
  [  382.955683] rcu:  hardirqs   softirqs   csw/system
  [  382.961378] rcu:  number:0  00
  [  382.967071] rcu: cputime:0  00   ==> 
30026(ms)
  [  382.974189] rcu: (detected by 52, t=60034 jiffies, g=24469, q=1199 
ncpus=72)
  [  392.982095] rcu: rcu_preempt kthread starved for 9994 jiffies! g24469 f0x0 
RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31
  [  392.992769] rcu: Unless rcu_preempt kthread gets sufficient CPU time, 
OOM is now expected behavior
  
- 
  After seeing this, I enabled kdump and set kernel.panic_on_rcu_stall = 1
  
  KDUMP INFO
  WARNING: cpu 54: cannot find NT_PRSTATUS note
-   KERNEL: /usr/lib/debug/boot/vmlinux-6.8.0-1004-nvidia-64k  [TAINTED]
- DUMPFILE: /var/crash/202404172139/dump.202404172139  [PARTIAL DUMP]
- CPUS: 72
- DATE: Wed Apr 17 21:39:13 UTC 2024
-   UPTIME: 00:06:10
+   KERNEL: /usr/lib/debug/boot/vmlinux-6.8.0-1004-nvidia-64k  [TAINTED]
+ DUMPFILE: /var/crash/202404172139/dump.202404172139  [PARTIAL DUMP]
+ CPUS: 72
+ DATE: Wed Apr 17 21:39:13 UTC 2024
+   UPTIME: 00:06:10
  LOAD AVERAGE: 0.68, 0.63, 0.28
-TASKS: 854
- NODENAME: hinyari
-  RELEASE: 6.8.0-1005-nvidia-64k
-  VERSION: #5-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 17 11:26:46 UTC 2024
-  MACHINE: aarch64  (unknown Mhz)
-   MEMORY: 479.7 GB
-PANIC: "Kernel panic - not syncing: RCU Stall"
-  PID: 0
-  COMMAND: "swapper/21"
- TASK: 82026880  (1 of 72)  [THREAD_INFO: 82026880]
-  CPU: 21
-STATE: TASK_RUNNING (PANIC)
+    TASKS: 854
+ NODENAME: hinyari
+  RELEASE: 6.8.0-1005-nvidia-64k
+  VERSION: #5-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 17 11:26:46 UTC 2024
+  MACHINE: aarch64  (unknown Mhz)
+   MEMORY: 479.7 GB
+    PANIC: "Kernel panic - not syncing: RCU Stall"
+  PID: 0
+  COMMAND: "swapper/21"
+ TASK: 82026880  (1 of 72)  [THREAD_INFO: 82026880]
+  CPU: 21
+    STATE: TASK_RUNNING (PANIC)
  
  [  300.313144] nvidia: loading out-of-tree module taints kernel.
  [  300.313153] nvidia: module verification failed: signature and/or required 
key missing - tainting kernel
  [  300.316694] nvidia-nvlink: Nvlink Core is being initialized, major device 
number 506
- [  300.316699] 
+ [  300.316699]
  [  360.323454] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  [  360.331206] rcu: 54-...0: (24 ticks this GP) 
idle=742c/1/0x4000 softirq=4931/4933 fqs=13148
  [  360.340903] rcu:  hardirqs   softirqs   csw/system
  [  360.346597] rcu:  number:0  00
  [  360.352291] rcu: cputime:0  00   ==> 
30031(ms)
  [  360.359408] rcu: (detected by 21, t=60038 jiffies, g=25009, q=1123 
ncpus=72)
  [  360.366704] Sending NMI from CPU 21 to CPUs 54:
  [  370.367310] rcu: rcu_preempt kthread starved for 9993 jiffies! g25009 f0x0 
RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31
  [  370.377983] rcu: Unless rcu_preempt kthread gets sufficient CPU time, 
OOM is now expected behavior.
  [  370.387322] rcu: RCU grace-period kthread stack dump:
  [  370.392482] task:rcu_preempt state:I stack:0 pid:17tgid:17
ppid:2  flags:0x0008
  [  370.392488] Call trace:
  [