[Kernel-packages] [Bug 2029934] Re: arm64 AWS host hangs during modprobe nvidia on lunar and mantic
The fix is also available in 535.171.04, available here:
https://www.nvidia.com/Download/driverResults.aspx/223761/en-us/

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-hwe-6.5 in Ubuntu.
https://bugs.launchpad.net/bugs/2029934

Title:
  arm64 AWS host hangs during modprobe nvidia on lunar and mantic

Status in linux-aws package in Ubuntu:
  Incomplete
Status in linux-hwe-6.5 package in Ubuntu:
  New
Status in nvidia-graphics-drivers-525 package in Ubuntu:
  Incomplete
Status in nvidia-graphics-drivers-525-server package in Ubuntu:
  Incomplete
Status in nvidia-graphics-drivers-535 package in Ubuntu:
  Confirmed
Status in nvidia-graphics-drivers-535-server package in Ubuntu:
  Confirmed

Bug description:
  Loading the nvidia driver DKMS modules with "modprobe nvidia" results
  in the host hanging and becoming completely unusable. This was
  reproduced with both the generic and linux-aws kernels on lunar and
  mantic, using an AWS g5g.xlarge instance.

  To reproduce using the generic kernel:

  # Deploy an arm64 host with an NVIDIA GPU, such as an AWS g5g.xlarge.

  # Install the generic kernel from lunar-updates:
  $ sudo DEBIAN_FRONTEND=noninteractive apt-get install -y -o DPkg::Options::=--force-confold linux-generic

  # Boot into the generic kernel (this can be accomplished by removing
  # the existing kernel, in this case the linux-aws 6.2.0-1008-aws kernel):
  $ sudo DEBIAN_FRONTEND=noninteractive apt-get purge -y -o DPkg::Options::=--force-confold linux-aws linux-aws-headers-6.2.0-1008 linux-headers-6.2.0-1008-aws linux-headers-aws linux-image-6.2.0-1008-aws linux-image-aws linux-modules-6.2.0-1008-aws
  $ reboot

  # Install the NVIDIA 535-server driver DKMS package:
  $ sudo DEBIAN_FRONTEND=noninteractive apt-get install -y nvidia-driver-535-server

  # Load the driver:
  $ sudo modprobe nvidia

  At this point the system hangs and never returns. Rebooting instead
  of running modprobe results in a system that never finishes booting.
  I was able to recover the console logs from such a system (the full
  captured log is attached):

  [    1.964942] nvidia: loading out-of-tree module taints kernel.
  [    1.965475] nvidia: module license 'NVIDIA' taints kernel.
  [    1.965905] Disabling lock debugging due to kernel taint
  [    1.980905] nvidia: module verification failed: signature and/or required key missing - tainting kernel
  [    2.012067] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
  [   62.025143] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  [   62.025807] rcu: 3-...0: (14 ticks this GP) idle=c04c/1/0x4000 softirq=653/654 fqs=3301
  [   62.026516] (detected by 0, t=15003 jiffies, g=-699, q=216 ncpus=4)
  [   62.027018] Task dump for CPU 3:
  [   62.027290] task:systemd-udevd state:R running task stack:0 pid:164 ppid:144 flags:0x000e
  [   62.028066] Call trace:
  [   62.028273]  __switch_to+0xbc/0x100
  [   62.028567]  0x228
  Timed out for waiting the udev queue being empty.
  Timed out for waiting the udev queue being empty.
  [  242.045143] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  [  242.045655] rcu: 3-...0: (14 ticks this GP) idle=c04c/1/0x4000 softirq=653/654 fqs=12303
  [  242.046373] (detected by 1, t=60008 jiffies, g=-699, q=937 ncpus=4)
  [  242.046874] Task dump for CPU 3:
  [  242.047146] task:systemd-udevd state:R running task stack:0 pid:164 ppid:144 flags:0x000f
  [  242.047922] Call trace:
  [  242.048128]  __switch_to+0xbc/0x100
  [  242.048417]  0x228
  Timed out for waiting the udev queue being empty.
  Begin: Loading essential drivers ...
  [  384.001142] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [modprobe:215]
  [  384.001738] Modules linked in: nvidia(POE+) crct10dif_ce video polyval_ce polyval_generic drm_kms_helper ghash_ce syscopyarea sm4 sysfillrect sha2_ce sysimgblt sha256_arm64 sha1_ce drm nvme nvme_core ena nvme_common aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
  [  384.003513] CPU: 2 PID: 215 Comm: modprobe Tainted: P OE 6.2.0-26-generic #26-Ubuntu
  [  384.004210] Hardware name: Amazon EC2 g5g.xlarge/, BIOS 1.0 11/1/2018
  [  384.004715] pstate: 8045 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  [  384.005259] pc : smp_call_function_many_cond+0x1b4/0x4b4
  [  384.005683] lr : smp_call_function_many_cond+0x1d0/0x4b4
  [  384.006108] sp : 889a3a70
  [  384.006381] x29: 889a3a70 x28: 0003 x27: 00056d1fafa0
  [  384.006954] x26: 00056d1d76c8 x25: c87cf18bdd10 x24: 0003
  [  384.007527] x23: 0001 x22: 00056d1d76c8 x21: c87cf18c2690
  [  384.008086] x20: 00056d1fafa0 x19: 00056d1d76c0 x18:
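Since a hung host usually has to be power-cycled, the captured console log is the main diagnostic artifact. As a small aid (the helper name and sample path below are ours, not part of the bug report), the stall and soft-lockup signatures quoted above can be checked for mechanically:

```shell
# Hypothetical helper: scan a captured console log for the failure
# signatures seen in this bug (rcu_preempt stalls, watchdog soft lockup).
has_lockup_signature() {
    grep -Eq 'rcu_preempt detected stalls|soft lockup - CPU#[0-9]+ stuck' "$1"
}

# Exercise it against an inline sample mirroring the captured messages:
cat > /tmp/console-sample.log <<'EOF'
[   62.025143] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  384.001142] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [modprobe:215]
EOF

if has_lockup_signature /tmp/console-sample.log; then
    echo "lockup signature found"
fi
```

This only recognizes the two message patterns shown in this report; other hang modes would need their own patterns.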
Hi all, this should be fixed on the latest driver 550.67 -
https://www.nvidia.com/Download/driverResults.aspx/223429/en-us/.
Please help verify if this is resolved on your systems. Thanks!
I identified a similar bug today when installing nvidia-fabricmanager-535
on a noble dev build for arm64 that may be related:
https://bugs.launchpad.net/ubuntu/+source/fabric-manager-535/+bug/2052663
I gave this another spin today with 6.5.0-17-generic #17~22.04.1 and the
LRM modules of the 535 driver (6.5.0-17.17~22.04.1+1 of
linux-modules-nvidia-535-server-generic-hwe-22.04) on our Altra system
with 2x L4 GPUs, and the same problem exists as with the DKMS modules:

[   39.437849] watchdog: BUG: soft lockup - CPU#62 stuck for 26s! [systemd-udevd:850]
[   39.445411] Modules linked in: nvidia(POE+) crct10dif_ce polyval_ce polyval_generic ghash_ce ast mlx5_core video drm_shmem_helper sm4 mlxfw sha2_ce drm_kms_helper nvme psample sha256_arm64 sha1_ce nvme_core igb drm tls xhci_pci nvme_common pci_hyperv_intf xhci_pci_renesas i2c_algo_bit aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[   39.474949] CPU: 62 PID: 850 Comm: systemd-udevd Tainted: P OE 6.5.0-17-generic #17~22.04.1-Ubuntu
[   39.485196] Hardware name: GIGABYTE G242-P30-JG/MP32-AR0-JG, BIOS F07 03/22/2021
[   39.492578] pstate: 8049 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   39.499526] pc : smp_call_function_many_cond+0x19c/0x720
[   39.504830] lr : smp_call_function_many_cond+0x1b8/0x720
[   39.510130] sp : 80008934b920
[   39.513431] x29: 80008934b920 x28: aef99146dd10 x27:
[   39.520554] x26: 004f x25: 085dcfffbb80 x24: 0026
[   39.527677] x23: 0001 x22: 085dcfdd6708 x21: aef9914726e0
[   39.534799] x20: 085dcfadbb80 x19: 085dcfdd6700 x18: 800089341060
[   39.541921] x17: x16: x15: 43535f5f00656c75
[   39.549044] x14: 0c030b111b111303 x13: 0006 x12: 3931413337353339
[   39.556166] x11: 0101010101010101 x10: 004f x9 : aef98ee015b8
[   39.563289] x8 : x7 : x6 : 003e
[   39.570411] x5 : aef99146d000 x4 : x3 : 085dcfadbb88
[   39.577533] x2 : 0026 x1 : 0011 x0 :
[   39.584656] Call trace:
[   39.587090]  smp_call_function_many_cond+0x19c/0x720
[   39.592043]  kick_all_cpus_sync+0x50/0xa8
[   39.596040]  flush_module_icache+0x94/0xf8
[   39.600125]  load_module+0x448/0x8e0
[   39.603688]  init_module_from_file+0x94/0x110
[   39.608033]  idempotent_init_module+0x194/0x2b0
[   39.612551]  __arm64_sys_finit_module+0x74/0x100
[   39.617155]  invoke_syscall+0x7c/0x130
[   39.620892]  el0_svc_common.constprop.0+0x5c/0x170
[   39.625670]  do_el0_svc+0x38/0x68
[   39.628972]  el0_svc+0x30/0xe0
[   39.632016]  el0t_64_sync_handler+0x128/0x158
[   39.636360]  el0t_64_sync+0x1b0/0x1b8
** Changed in: nvidia-graphics-drivers-525 (Ubuntu)
   Status: Confirmed => Incomplete

** Changed in: nvidia-graphics-drivers-525-server (Ubuntu)
   Status: Confirmed => Incomplete

** Changed in: linux-aws (Ubuntu)
   Status: Confirmed => Incomplete

** Also affects: linux-hwe-6.5 (Ubuntu)
   Importance: Undecided
   Status: New
Verified that with linux-aws-edge 6.5.0.1012.12~22.04.1 the DKMS
installation via

$ sudo apt install -y nvidia-driver-535-server

on an AWS g5g.xlarge goes through and the driver comes up fine. Trying
the same with linux-generic-hwe-22.04-edge (6.5.0-17-generic
#17~22.04.1) on an Ampere Altra with 2x NVIDIA L4 still runs into the
same hang with nvidia-headless-535-server (535.154.05-0ubuntu0.22.04.1).
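Since the outcome above differs between the linux-aws and HWE generic kernels, it helps to confirm which flavour is actually booted before re-testing. A minimal sketch, assuming Ubuntu's usual flavour suffixes in the kernel release string:

```shell
# Sketch only: report the booted kernel flavour from the release string.
# The suffix matching is an assumption based on Ubuntu's flavour naming.
kver=$(uname -r)
case "$kver" in
    *-aws)     echo "running linux-aws kernel: $kver" ;;
    *-generic) echo "running generic kernel: $kver" ;;
    *)         echo "running kernel: $kver" ;;
esac
```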
I can reproduce the failure on mantic with both the DKMS and LRM
drivers. Specifically, what I'm doing to install these is:

for DKMS:
$ sudo DEBIAN_FRONTEND=noninteractive apt-get install -y nvidia-driver-535-server

for LRM:
$ sudo DEBIAN_FRONTEND=noninteractive apt-get install -y nvidia-headless-no-dkms-535-server linux-modules-nvidia-535-server-generic nvidia-utils-535-server

I'm intentionally not using `ubuntu-drivers` to isolate this testing to
just the installation and functioning of the drivers.
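When checking which variant is actually loaded, the module's on-disk location is one way to tell the two apart. A hypothetical helper (the path-based classification is our assumption, not something stated in this bug): DKMS builds typically install under updates/dkms/, while the prebuilt LRM packages ship modules under kernel/.

```shell
# Hypothetical helper: classify a module path as DKMS-built or LRM-shipped.
classify_module_path() {
    case "$1" in
        */updates/dkms/*) echo "dkms" ;;
        */kernel/*)       echo "lrm" ;;
        *)                echo "unknown" ;;
    esac
}

# With the driver installed, the live module's path can be fed in:
#   classify_module_path "$(modinfo -n nvidia)"
classify_module_path "/lib/modules/6.5.0-17-generic/updates/dkms/nvidia.ko"
```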
Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: nvidia-graphics-drivers-535-server (Ubuntu)
   Status: New => Confirmed
Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: nvidia-graphics-drivers-535 (Ubuntu)
   Status: New => Confirmed
Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: nvidia-graphics-drivers-525-server (Ubuntu)
   Status: New => Confirmed
Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: nvidia-graphics-drivers-525 (Ubuntu)
   Status: New => Confirmed
Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: linux-aws (Ubuntu)
   Status: New => Confirmed
I am surprised that `ubuntu-drivers list` doesn't provide any drivers to install, when it really should.

To install pre-built drivers I use:

$ sudo apt install linux-modules-nvidia-535-server-aws nvidia-headless-535-server

such that signed nvidia modules provided by Canonical are installed. Similarly, to upgrade to the edge variant I did:

$ sudo apt install linux-aws-edge linux-modules-nvidia-535-server-aws-edge
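As a sanity check after installing the pre-built packages, the module's signature can be inspected; an unsigned DKMS build is what produces the "module verification failed" taint in the captured log. A hedged sketch (assumes kmod's modinfo is available and the nvidia module is installed):

```shell
# Hedged sketch: check whether the nvidia module on disk carries a
# signature (the Canonical pre-built modules are signed; a locally
# DKMS-built module typically is not). Falls back to a message when
# no module or no signature field is found.
modinfo nvidia 2>/dev/null | grep -i '^signer' \
  || echo "nvidia module unsigned or not found"
```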
I wonder if the bug is with trying to install self-built dkms modules instead of pre-built ones, and how come ubuntu-drivers is not offering pre-built ones...
[Kernel-packages] [Bug 2029934] Re: arm64 AWS host hangs during modprobe nvidia on lunar and mantic
and everything seems to work fine.
[Kernel-packages] [Bug 2029934] Re: arm64 AWS host hangs during modprobe nvidia on lunar and mantic
Trying the same with the linux-nvidia-hwe-22.04-edge kernel from proposed (linux-image-6.5.0-1011-nvidia), with the same NVIDIA driver (535.154.05-0ubuntu0.22.04.1 of nvidia-utils-535-server): loading the kernel driver and running nvidia-smi works fine without problems.
[Kernel-packages] [Bug 2029934] Re: arm64 AWS host hangs during modprobe nvidia on lunar and mantic
Verified that the issue does not exist with 535.154.05-0ubuntu0.22.04.1 of nvidia-utils-535-server on 6.2.0-1017-aws or 6.2.0-1018-aws of linux-aws.
[Kernel-packages] [Bug 2029934] Re: arm64 AWS host hangs during modprobe nvidia on lunar and mantic
I can reproduce the same with the latest 535.154.05-0ubuntu0.22.04.1 on jammy with the 6.5 HWE kernel on an arm64 machine. The same happens with the -server driver 535.154.05-0ubuntu0.22.04.1. Reproducing is pretty simple:

1. Boot plain Ubuntu 22.04 with the HWE kernel either already installed or installed manually to switch to it from GA.
2. Install the NVIDIA driver via $ sudo apt install -y nvidia-headless-535-server or $ sudo apt install -y nvidia-headless-535

Running either nvidia-smi (which triggers the modprobe of the nvidia kernel modules) or `modprobe nvidia` makes the system hang entirely. The same works fine on the 5.15 GA kernel.
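Before attempting the repro above on a machine you care about, a couple of read-only checks confirm which kernel is running and whether the nvidia module is already resident. This is a sketch using standard Linux tools; none of these commands come from the bug report itself, and they are safe to run even without an NVIDIA GPU:

```shell
# Read-only sanity checks before attempting `modprobe nvidia`.

# Which kernel is running (e.g. a 6.5 HWE kernel vs. the 5.15 GA kernel)?
uname -r

# Is the nvidia module already loaded? /proc/modules lists resident modules.
if grep -q '^nvidia ' /proc/modules 2>/dev/null; then
    echo "nvidia module: loaded"
else
    echo "nvidia module: not loaded"
fi
```

The hang only occurs once the module actually loads, so these checks are harmless either way.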
[Kernel-packages] [Bug 2029934] Re: arm64 AWS host hangs during modprobe nvidia on lunar and mantic
Since then we have had multiple glibc SRUs, kernel SRUs, and most recently a new release of 535-server. Can I request that this be retested?