Public bug reported:

Using both -generic and -nvidia 6.8 kernels I'm seeing a hang when I
load the nvidia driver.

[  382.938326] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  382.946075] rcu:     53-...0: (4 ticks this GP) idle=1c2c/1/0x4000000000000000 softirq=4866/4868 fqs=14124
[  382.955683] rcu:              hardirqs   softirqs   csw/system
[  382.961378] rcu:      number:        0          0            0
[  382.967071] rcu:     cputime:        0          0            0   ==> 30026(ms)
[  382.974189] rcu:     (detected by 52, t=60034 jiffies, g=24469, q=1199 ncpus=72)
[  392.982095] rcu: rcu_preempt kthread starved for 9994 jiffies! g24469 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31
[  392.992769] rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior


After seeing this, I enabled kdump and set kernel.panic_on_rcu_stall = 1.
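
For anyone reproducing this, here is a minimal sketch of flipping that sysctl from a script. It assumes the usual procfs mapping, where kernel.panic_on_rcu_stall is exposed as /proc/sys/kernel/panic_on_rcu_stall. The helper name set_panic_on_rcu_stall and its proc_sys parameter are hypothetical and not part of the original report. Writing the real file needs root, and the setting does not persist across reboots.

```python
from pathlib import Path


def set_panic_on_rcu_stall(enabled: bool, proc_sys: str = "/proc/sys") -> str:
    """Write 1 or 0 to kernel/panic_on_rcu_stall and return the value read back.

    proc_sys is a hypothetical parameter, exposed only so the helper can be
    pointed at a scratch directory when testing without root.
    """
    path = Path(proc_sys) / "kernel" / "panic_on_rcu_stall"
    path.write_text("1\n" if enabled else "0\n")
    return path.read_text().strip()


# Example (as root): set_panic_on_rcu_stall(True)
```

This has the same effect as running `sysctl -w kernel.panic_on_rcu_stall=1` as root. Capturing a dump still requires kdump to be installed and loaded separately.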

KDUMP INFO
WARNING: cpu 54: cannot find NT_PRSTATUS note
      KERNEL: /usr/lib/debug/boot/vmlinux-6.8.0-1004-nvidia-64k  [TAINTED]
    DUMPFILE: /var/crash/202404172139/dump.202404172139  [PARTIAL DUMP]
        CPUS: 72
        DATE: Wed Apr 17 21:39:13 UTC 2024
      UPTIME: 00:06:10
LOAD AVERAGE: 0.68, 0.63, 0.28
       TASKS: 854
    NODENAME: hinyari
     RELEASE: 6.8.0-1005-nvidia-64k
     VERSION: #5-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 17 11:26:46 UTC 2024
     MACHINE: aarch64  (unknown Mhz)
      MEMORY: 479.7 GB
       PANIC: "Kernel panic - not syncing: RCU Stall"
         PID: 0
     COMMAND: "swapper/21"
        TASK: ffff000082026880  (1 of 72)  [THREAD_INFO: ffff000082026880]
         CPU: 21
       STATE: TASK_RUNNING (PANIC)

[  300.313144] nvidia: loading out-of-tree module taints kernel.
[  300.313153] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[  300.316694] nvidia-nvlink: Nvlink Core is being initialized, major device number 506
[  300.316699] 
[  360.323454] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  360.331206] rcu:     54-...0: (24 ticks this GP) idle=742c/1/0x4000000000000000 softirq=4931/4933 fqs=13148
[  360.340903] rcu:              hardirqs   softirqs   csw/system
[  360.346597] rcu:      number:        0          0            0
[  360.352291] rcu:     cputime:        0          0            0   ==> 30031(ms)
[  360.359408] rcu:     (detected by 21, t=60038 jiffies, g=25009, q=1123 ncpus=72)
[  360.366704] Sending NMI from CPU 21 to CPUs 54:
[  370.367310] rcu: rcu_preempt kthread starved for 9993 jiffies! g25009 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31
[  370.377983] rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[  370.387322] rcu: RCU grace-period kthread stack dump:
[  370.392482] task:rcu_preempt     state:I stack:0     pid:17    tgid:17    ppid:2      flags:0x00000008
[  370.392488] Call trace:
[  370.392489]  __switch_to+0xd0/0x118
[  370.392499]  __schedule+0x2a8/0x7b0
[  370.392501]  schedule+0x40/0x168
[  370.392502]  schedule_timeout+0xac/0x1e0
[  370.392505]  rcu_gp_fqs_loop+0x128/0x508
[  370.392512]  rcu_gp_kthread+0x150/0x188
[  370.392514]  kthread+0xf8/0x110
[  370.392519]  ret_from_fork+0x10/0x20
[  370.392524] rcu: Stack dump where RCU GP kthread last ran:
[  370.398128] Sending NMI from CPU 21 to CPUs 31:
[  370.398131] NMI backtrace for cpu 31
[  370.398136] CPU: 31 PID: 0 Comm: swapper/31 Kdump: loaded Tainted: G         
  OE      6.8.0-1005-nvidia-64k #5-Ubuntu
[  370.398139] Hardware name:  /P3880, BIOS         01.02.01 20240207
[  370.398140] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[  370.398142] pc : cpuidle_enter_state+0xd8/0x790
[  370.398150] lr : cpuidle_enter_state+0xcc/0x790
[  370.398153] sp : ffff800081eefd70
[  370.398154] x29: ffff800081eefd70 x28: 0000000000000000 x27: 0000000000000000
[  370.398157] x26: 0000000000000000 x25: 000000563d67e4e0 x24: 0000000000000000
[  370.398160] x23: ffffa0a1445699f8 x22: 0000000000000000 x21: 000000563d72ece0
[  370.398162] x20: ffffa0a144569a10 x19: ffff00008fa4a800 x18: ffff800081f00030
[  370.398165] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ac8c73b08db0
[  370.398168] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[  370.398170] x11: 0000000000000000 x10: 2da0fbe3d5e8c649 x9 : ffffa0a1424fd244
[  370.398173] x8 : ffff0000820559b8 x7 : 0000000000000000 x6 : 0000000000000000
[  370.398175] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[  370.398178] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
[  370.398181] Call trace:
[  370.398183]  cpuidle_enter_state+0xd8/0x790
[  370.398185]  cpuidle_enter+0x44/0x78
[  370.398195]  cpuidle_idle_call+0x15c/0x210
[  370.398202]  do_idle+0xb0/0x130
[  370.398204]  cpu_startup_entry+0x40/0x50
[  370.398206]  secondary_start_kernel+0xec/0x130
[  370.398211]  __secondary_switched+0xc0/0xc8
[  370.399132] Kernel panic - not syncing: RCU Stall
[  370.403938] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G         
  OE      6.8.0-1005-nvidia-64k #5-Ubuntu
[  370.414876] Hardware name:  /P3880, BIOS         01.02.01 20240207
[  370.421192] Call trace:
[  370.423686]  dump_backtrace+0xa4/0x150
[  370.427514]  show_stack+0x24/0x50
[  370.430896]  dump_stack_lvl+0x78/0xf8
[  370.434640]  dump_stack+0x1c/0x38
[  370.438023]  panic+0x3a4/0x440
[  370.441141]  print_other_cpu_stall+0x578/0x610
[  370.445681]  check_cpu_stall+0x240/0x300
[  370.449686]  rcu_pending+0x44/0x220
[  370.453248]  rcu_sched_clock_irq+0x7c/0x2c8
[  370.457519]  update_process_times+0x7c/0xf8
[  370.461794]  tick_sched_handle+0x3c/0x98
[  370.465803]  tick_nohz_highres_handler+0x5c/0xe8
[  370.470520]  __hrtimer_run_queues+0x164/0x398
[  370.474969]  hrtimer_interrupt+0xf4/0x278
[  370.479063]  arch_timer_handler_phys+0x38/0x80
[  370.483607]  handle_percpu_devid_irq+0x94/0x2b8
[  370.488238]  generic_handle_domain_irq+0x38/0x70
[  370.492954]  __gic_handle_irq_from_irqson.isra.0+0x180/0x310
[  370.498743]  gic_handle_irq+0x2c/0xa0
[  370.502481]  call_on_irq_stack+0x3c/0x50
[  370.506486]  do_interrupt_handler+0xb0/0xc8
[  370.510759]  el1_interrupt+0x48/0xf0
[  370.514409]  el1h_64_irq_handler+0x1c/0x40
[  370.518592]  el1h_64_irq+0x7c/0x80
[  370.522063]  cpuidle_enter_state+0xd8/0x790
[  370.526336]  cpuidle_enter+0x44/0x78
[  370.529986]  cpuidle_idle_call+0x15c/0x210
[  370.534169]  do_idle+0xb0/0x130
[  370.537375]  cpu_startup_entry+0x44/0x50
[  370.541380]  secondary_start_kernel+0xec/0x130
[  370.545919]  __secondary_switched+0xc0/0xc8
[  370.550197] SMP: stopping secondary CPUs
[  371.601076] SMP: failed to stop secondary CPUs 0-20,22-71
[  371.607097] Starting crashdump kernel...
[  371.611103] ------------[ cut here ]------------
[  371.615820] Some CPUs may be stale, kdump will be unreliable.
[  371.621695] WARNING: CPU: 21 PID: 0 at arch/arm64/kernel/machine_kexec.c:174 machine_kexec+0x48/0x1f0
[  371.631124] Modules linked in: nvidia(OE+) ecc qrtr cfg80211 binfmt_misc 
dax_hmem cxl_acpi cxl_core nvidia_cspmu acpi_ipmi ast cdc_ether cdc_subset 
arm_smmuv3_pmu arm_cspmu_module coresight_trbe usbnet arm_spe_pmu ipmi_ssif 
i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler 
nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x 
coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath 
efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic 
raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor 
xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core 
mlx5_dpll crct10dif_ce mlx5_core polyval_ce polyval_generic ghash_ce sm4_ce_gcm 
sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce mlxfw sm3 nvme psample sha3_ce 
i2c_smbus sha2_ce nvme_core tls sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas 
pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk 
aes_ce_cipher
[  371.719810] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G         
  OE      6.8.0-1005-nvidia-64k #5-Ubuntu
[  371.730748] Hardware name:  /P3880, BIOS         01.02.01 20240207
[  371.737064] pstate: 634000c9 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[  371.744180] pc : machine_kexec+0x48/0x1f0
[  371.748275] lr : machine_kexec+0x48/0x1f0
[  371.752369] sp : ffff8000802afa10
[  371.755751] x29: ffff8000802afa10 x28: 0000000000000463 x27: 000000000000003c
[  371.763047] x26: 00000000000000c0 x25: 0000000000000280 x24: ffffa0a144268cb4
[  371.770341] x23: ffffa0a14439f540 x22: ffffa0a1447cf4c0 x21: ffffa0a14481a000
[  371.777636] x20: ffff0000d987e000 x19: ffff0000d987e000 x18: ffff800080ba0088
[  371.784930] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000463
[  371.792225] x14: 0000000000000000 x13: 2e656c6261696c65 x12: 726e75206562206c
[  371.799519] x11: 6c697720706d7564 x10: 0000000000000000 x9 : 0000000000000000
[  371.806814] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[  371.814108] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[  371.821402] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
[  371.828696] Call trace:
[  371.831189]  machine_kexec+0x48/0x1f0
[  371.834928]  __crash_kexec+0x94/0x128
[  371.838668]  panic+0x380/0x440
[  371.841784]  print_other_cpu_stall+0x578/0x610
[  371.846325]  check_cpu_stall+0x240/0x300
[  371.850331]  rcu_pending+0x44/0x220
[  371.853892]  rcu_sched_clock_irq+0x7c/0x2c8
[  371.858163]  update_process_times+0x7c/0xf8
[  371.862434]  tick_sched_handle+0x3c/0x98
[  371.866440]  tick_nohz_highres_handler+0x5c/0xe8
[  371.871156]  __hrtimer_run_queues+0x164/0x398
[  371.875605]  hrtimer_interrupt+0xf4/0x278
[  371.879700]  arch_timer_handler_phys+0x38/0x80
[  371.884240]  handle_percpu_devid_irq+0x94/0x2b8
[  371.888869]  generic_handle_domain_irq+0x38/0x70
[  371.893585]  __gic_handle_irq_from_irqson.isra.0+0x180/0x310
[  371.899368]  gic_handle_irq+0x2c/0xa0
[  371.903105]  call_on_irq_stack+0x3c/0x50
[  371.907110]  do_interrupt_handler+0xb0/0xc8
[  371.911382]  el1_interrupt+0x48/0xf0
[  371.915032]  el1h_64_irq_handler+0x1c/0x40
[  371.919215]  el1h_64_irq+0x7c/0x80
[  371.922686]  cpuidle_enter_state+0xd8/0x790
[  371.926958]  cpuidle_enter+0x44/0x78
[  371.930609]  cpuidle_idle_call+0x15c/0x210
[  371.934793]  do_idle+0xb0/0x130
[  371.937998]  cpu_startup_entry+0x44/0x50
[  371.942003]  secondary_start_kernel+0xec/0x130
[  371.946542]  __secondary_switched+0xc0/0xc8
[  371.950815] ---[ end trace 0000000000000000 ]---


In an attempt to get more debug info, I tried the open driver from GitHub,
using https://github.com/NVIDIA/open-gpu-kernel-modules:
Version 550.76 - loads successfully
Version 550.67 - loads successfully
Version 550.54.15 - crashes. This is the same version as the 550 package that
hangs. Below is the crash info. Interestingly, when I changed the optimization
level in utils.mk from -O2 to -O0 in an attempt to capture more debug info,
the crash went away. It also doesn't happen with -O1.

CRASH INFO
[ 8648.399518] nvidia-nvlink: Nvlink Core is being initialized, major device number 506
[ 8648.399560] 
[ 8648.399718] Internal error: Oops - FPAC: 0000000072000000 [#1] SMP
[ 8648.407556] Modules linked in: nvidia(OE+) ecdh_generic ecc qrtr cfg80211 
binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu arm_smmuv3_pmu 
arm_cspmu_module coresight_trbe arm_spe_pmu acpi_ipmi ast cdc_ether cdc_subset 
ipmi_ssif usbnet i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf 
ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq 
coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight 
dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs 
blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq 
async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib 
ib_uverbs macsec ib_core mlx5_dpll mlx5_core crct10dif_ce polyval_ce 
polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce 
sm3 mlxfw i2c_smbus nvme psample sha3_ce sha2_ce nvme_core tls sha256_arm64 
xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra 
aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher [last unloaded: nvidia(OE)]
[ 8648.407608] 
[ 8648.501397] CPU: 5 PID: 48130 Comm: insmod Kdump: loaded Tainted: G          
 OE      6.8.0-1004-nvidia-64k #4
[ 8648.511625] Hardware name:  /P3880, BIOS         01.02.01 20240207
[ 8648.517941] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 8648.525058] pc : __kmalloc+0x1e0/0x490
[ 8648.528892] lr : 0xffffa00000000000
[ 8648.532482] sp : ffff8000d132f5f0
[ 8648.535864] x29: ffff8000d132f5f0 x28: 0000000000000000 x27: ffffa00084d50484
[ 8648.543159] x26: 00000000000001f8 x25: 0000000000aa1d70 x24: ffff0000c2aba828
[ 8648.550454] x23: ffffa00085026380 x22: ffff80009d3e0020 x21: ffff8000d132f7c8
[ 8648.557749] x20: 0000000000000038 x19: ffff8000d132f628 x18: ffff8000d132f5e4
[ 8648.565043] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000004
[ 8648.572337] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 8648.579632] x11: 0000000000000000 x10: ffff8000d132f670 x9 : ffffa000806f73ec
[ 8648.586926] x8 : ffff0000c2a98240 x7 : 0000000000000000 x6 : 0000000000000000
[ 8648.594221] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 8648.601516] x2 : 0000000000000000 x1 : ffff000100084480 x0 : ffff0000c2a98200
[ 8648.608810] Call trace:
[ 8648.611305]  __kmalloc+0x1e0/0x490
[ 8648.614778]  0x8000604466e4a000
[ 8648.617986] Code: a9435bf5 a94463f7 910183ff f85f8e5e (d50323bf) 
[ 8648.624219] SMP: stopping secondary CPUs
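
As a reading aid for the "Oops - FPAC: 0000000072000000" line, here is a minimal sketch that splits the printed value into its syndrome fields. It assumes the standard arm64 ESR_ELx layout: exception class (EC) in bits 31:26, instruction length (IL) in bit 25, and the instruction-specific syndrome (ISS) in bits 24:0. The function name decode_esr and the one-entry name table are hypothetical and cover only the class seen in this log.

```python
# Hypothetical lookup table: only the exception class that appears in this oops.
EC_NAMES = {0x1C: "FPAC (pointer authentication failure)"}


def decode_esr(esr: int) -> dict:
    """Split an ESR_ELx value into EC, IL and ISS (field layout is an assumption)."""
    ec = (esr >> 26) & 0x3F   # bits 31:26, exception class
    il = (esr >> 25) & 0x1    # bit 25, instruction length (1 = 32-bit instruction)
    iss = esr & 0x1FFFFFF     # bits 24:0, instruction-specific syndrome
    return {"ec": ec, "il": il, "iss": iss, "name": EC_NAMES.get(ec, "unknown")}


if __name__ == "__main__":
    fields = decode_esr(0x72000000)
    print(hex(fields["ec"]), fields["il"], hex(fields["iss"]), fields["name"])
```

For the value in this log, the exception class decodes to 0x1C, which matches the FPAC label the kernel printed. This only decodes the syndrome; it does not by itself say why the pointer authentication check failed.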

** Affects: nvidia-graphics-drivers-550-server (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to nvidia-graphics-drivers-550-server in
Ubuntu.
https://bugs.launchpad.net/bugs/2062380

Title:
  Using a 6.8 kernel modprobe nvidia hangs on Grace Hopper

Status in nvidia-graphics-drivers-550-server package in Ubuntu:
  New
