[Kernel-packages] [Bug 2052663] Re: fabric-manager-535 setup fails during install on Grace/Hopper arm64 system running noble

2024-04-24 Thread Mitchell Augustin
This bug no longer appears to be reproducible on noble with the 6.8
generic kernels, so I have marked it as resolved.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2052663

Title:
  fabric-manager-535 setup fails during install on Grace/Hopper arm64
  system running noble

Status in fabric-manager-535 package in Ubuntu:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in nvidia-graphics-drivers-535-server package in Ubuntu:
  Fix Released

Bug description:
  This error occurs on both the standard and largemem variants of the
  latest Noble server build of Ubuntu:
  Ubuntu Noble Numbat (development branch) (GNU/Linux 6.6.0-14-generic-64k aarch64)
  (iso link: https://cdimage.ubuntu.com/ubuntu-server/daily-live/20240207.1/noble-live-server-arm64+largemem.iso)
  Ubuntu Noble Numbat (development branch) (GNU/Linux 6.6.0-14-generic aarch64)
  (iso link: https://cdimage.ubuntu.com/ubuntu-server/daily-live/20240207/noble-live-server-arm64.iso)
  CPU/GPU: Nvidia Grace/Hopper

  lsb_release -rd: 
  No LSB modules are available.
  Description:  Ubuntu Noble Numbat (development branch)
  Release:  24.04

  Kernel versions affected:
  GNU/Linux 6.6.0-14-generic-64k aarch64
  GNU/Linux 6.6.0-14-generic aarch64

  Package version: nvidia-fabricmanager-535 (535.154.05-0ubuntu1 arm64)

  Expected behavior: Package starts as expected during post-install
  setup steps

  Actual behavior:
  On our grace/hopper system running noble, when installing
  nvidia-fabricmanager-535, the installation froze at 60% twice, along with
  all ssh processes. I am also unable to ssh back into the system after this
  happens.

  This is the last output I see from my installer shell:
  + apt install -y nvidia-fabricmanager-535
  Reading package lists... Done
  Building dependency tree... Done
  Reading state information... Done
  The following NEW packages will be installed:
    nvidia-fabricmanager-535
  0 upgraded, 1 newly installed, 0 to remove and 1 not upgraded.
  Need to get 1795 kB of archives.
  After this operation, 8679 kB of additional disk space will be used.
  Get:1 http://ports.ubuntu.com/ubuntu-ports noble/multiverse arm64 nvidia-fabricmanager-535 arm64 535.154.05-0ubuntu1 [1795 kB]
  Fetched 1795 kB in 1s (2439 kB/s)
  Selecting previously unselected package nvidia-fabricmanager-535.
  (Reading database ... 103745 files and directories currently installed.)
  Preparing to unpack .../nvidia-fabricmanager-535_535.154.05-0ubuntu1_arm64.deb ...
  Unpacking nvidia-fabricmanager-535 (535.154.05-0ubuntu1) ...
  Setting up nvidia-fabricmanager-535 (535.154.05-0ubuntu1) ...
  Created symlink /etc/systemd/system/multi-user.target.wants/nvidia-fabricmanager.service → /lib/systemd/system/nvidia-fabricmanager.service.

  Progress: [ 60%]
  
[#...]

  
  This does not appear to cause a panic/reboot, as I can still interact with
  the console, and it even appears that the apt process is still running in
  ps aux (although it doesn't seem to progress). However, I observe the
  following output in the console that I believe may be related:
  [ 1453.814597] watchdog: BUG: soft lockup - CPU#16 stuck for 670s! [(udev-worker):33269]
  [ 1477.814602] watchdog: BUG: soft lockup - CPU#16 stuck for 693s! [(udev-worker):33269]
  [ 1501.814606] watchdog: BUG: soft lockup - CPU#16 stuck for 715s! [(udev-worker):33269]
  [ 1525.814611] watchdog: BUG: soft lockup - CPU#16 stuck for 738s! [(udev-worker):33269]
  [ 1579.666718] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 17-...D } 240893 jiffies s: 653 root: 0x2/.
  [ 1579.678114] rcu: blocking rcu_node structures (internal RCU debug): l=1:15-29:0x4/.
  [ 1597.814625] watchdog: BUG: soft lockup - CPU#16 stuck for 805s! [(udev-worker):33269]
  [ 1621.814630] watchdog: BUG: soft lockup - CPU#16 stuck for 827s! [(udev-worker):33269]
  [ 1630.562655] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  [ 1630.568973] rcu:   17-...0: (1 GPs behind) idle=2444/1/0x4000 softirq=13696/13700 fqs=126842
  [ 1630.578665] rcu:hardirqs   softirqs   csw/system
  [ 1630.584381] rcu:number:0  00
  [ 1630.590109] rcu:   cputime:0  00   ==> 1110384(ms)
  [ 1630.597458] rcu:   (detected by 20, t=285099 jiffies, g=74061, q=113266 ncpus=72)
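
  The watchdog messages in this console output follow a regular pattern, so
  the stuck CPU, the stall duration, and the task can be extracted
  mechanically when comparing runs. The following is a minimal Python sketch,
  assuming each kernel message has been rejoined onto a single line. The
  function name parse_soft_lockups and the sample list are illustrative and
  are not part of the original report.

```python
import re

# Matches kernel watchdog messages of the form seen in this report, e.g.:
# [ 1453.814597] watchdog: BUG: soft lockup - CPU#16 stuck for 670s! [(udev-worker):33269]
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!\s+\[(?P<task>.+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(lines):
    """Return (timestamp, cpu, seconds_stuck, task, pid) for each soft-lockup line."""
    events = []
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            events.append(
                (float(m["ts"]), int(m["cpu"]), int(m["secs"]), m["task"], int(m["pid"]))
            )
    return events


# Sample input copied from the console output quoted in this report.
sample = [
    "[ 1453.814597] watchdog: BUG: soft lockup - CPU#16 stuck for 670s! [(udev-worker):33269]",
    "[ 1579.678114] rcu: blocking rcu_node structures (internal RCU debug): l=1:15-29:0x4/.",
    "[ 1621.814630] watchdog: BUG: soft lockup - CPU#16 stuck for 827s! [(udev-worker):33269]",
]
events = parse_soft_lockups(sample)
for ts, cpu, secs, task, pid in events:
    print(f"t={ts:.0f}s cpu={cpu} stuck={secs}s task={task} pid={pid}")
```

  Lines that are not watchdog soft-lockup messages (such as the rcu lines)
  are simply skipped.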

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/fabric-manager-535/+bug/2052663/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 2052663] Re: fabric-manager-535 setup fails during install on Grace/Hopper arm64 system running noble

2024-04-24 Thread Mitchell Augustin
** Changed in: fabric-manager-535 (Ubuntu)
 Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)

** Changed in: linux (Ubuntu)
 Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)

** Changed in: nvidia-graphics-drivers-535-server (Ubuntu)
 Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)

** Changed in: fabric-manager-535 (Ubuntu)
   Status: New => Fix Released

** Changed in: linux (Ubuntu)
   Status: New => Fix Released

** Changed in: nvidia-graphics-drivers-535-server (Ubuntu)
   Status: New => Fix Released


[Kernel-packages] [Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-04-09 Thread Mitchell Augustin
Fix has landed upstream:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/aio.c?h=v6.9-rc3&id=caeb4b0a11b3393e43f7fa8e0a5a18462acc66bd

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2058557

Title:
  Kernel panic during checkbox stress_ng_test on Grace running noble 6.8
  (arm64+largemem) kernel

Status in linux package in Ubuntu:
  Fix Committed

Bug description:
  A kernel oops and panic occurred during 22.04 SoC certification on
  Gunyolk (Grace/Grace) with 6.8 kernel, arm64+largemem variant

  Steps to reproduce:
  Run (as root) the following commands:

  add-apt-repository -y ppa:checkbox-dev/stable
  apt-add-repository -y ppa:firmware-testing-team/ppa-fwts-stable
  apt update
  apt install -y canonical-certification-server
  /usr/lib/checkbox-provider-base/bin/stress_ng_test.py disk --device dm-0 --base-time 240

  stress_ng_test caused a kernel panic after about 5 minutes. I have
  attached dmesg output from my reproducer to this report.

  Initially, this was identified via a panic during the above test,
  which was running as part of a run of certify-soc-22.04.

  Attached is a tarball containing:

  - apport.linux-image-6.8.0-11-generic-64k.kzsondji.apport: The output of
    `ubuntu-bug linux` on the machine (after reboot)
  - reproduced-dmesg.202403201942: The dmesg output captured by kdump when I
    reproduced my original issue by running only the single stress_ng_test.py
    command above (not the entire cert suite)
  - original-dmesg.txt: The dmesg output I captured when the stress_ng_test
    originally failed during the full cert suite run

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2058557/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-04-01 Thread Mitchell Augustin
A fix has been applied to vfs.fixes upstream and should land soon. I
have tested this patch and verified that the panic no longer occurs.

** Changed in: linux (Ubuntu)
   Status: New => Fix Committed



[Kernel-packages] [Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-28 Thread Mitchell Augustin
This issue is still present upstream, so I reported it to the original
committer of the patch.


[Kernel-packages] [Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-28 Thread Mitchell Augustin
I have isolated the cause of this bug to this commit:
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/noble/commit/?h=Ubuntu-6.8.0-20.20&id=71eb6b6b0ba93b1467bccff57b5de746b09113d2

All versions that I tested before this commit during my bisect passed
the aiol test at least 15 times in a row, and all versions after this
commit panic during at least one test. To confirm, I reverted this patch
on the latest 6.8 Ubuntu kernel (which was previously panicking reliably
within 5 tests) and verified that, with that change, it passes the test
at least 15x in a row without any panics.

The contents of the patch also support this conclusion, as the patch is
a change to the Linux AIO interface that introduces new calls to
spin_lock_irqsave() and wake_up_process() inside aio_complete(), which
corresponds with the content of the traces I have observed.
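
The "15 passes in a row" criterion can be sanity-checked with a little
arithmetic. The sketch below assumes that test runs are independent and that
an affected kernel panics with some fixed per-run probability p. Both are
simplifying assumptions, and the p values used are hypothetical, not measured
in this thread. Under those assumptions, the chance that a bad kernel is
wrongly classified as good after n clean runs is (1 - p)^n.

```python
def false_good_probability(p_panic_per_run: float, runs: int) -> float:
    """Chance that an affected kernel survives `runs` independent runs in a row.

    Assumes independent runs with a fixed per-run panic probability,
    which is a modelling assumption rather than something measured here.
    """
    if not 0.0 <= p_panic_per_run <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return (1.0 - p_panic_per_run) ** runs


# Hypothetical per-run panic probabilities, for illustration only.
for p in (0.2, 0.5):
    print(f"p={p}: P(15 clean runs on a bad kernel) = {false_good_probability(p, 15):.6f}")
```

Even with a fairly low hypothetical per-run panic rate of 0.2, fifteen clean
runs on an affected kernel would be expected only about 3.5% of the time, so
the threshold leaves a small but non-zero chance of mislabelling a bisect
step.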



[Kernel-packages] [Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-26 Thread Mitchell Augustin
It turns out that this issue does not appear with *every* run of the
aiol test on affected kernels, so multiple runs of that test may be
necessary for the panic to occur.
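
Because a single run is not conclusive, a small wrapper that repeats the test
and logs its progress can help. This is a rough Python sketch, not the tooling
actually used here: run_repeatedly is a hypothetical helper, and the command
passed to it is a placeholder. Progress is flushed before each run so that the
last attempted run number is still visible on the console if the machine
panics partway through.

```python
import subprocess
import sys


def run_repeatedly(cmd, max_runs):
    """Run cmd up to max_runs times.

    Returns the 1-based index of the first run that exits non-zero,
    or None if every run exits with status 0.
    """
    for i in range(1, max_runs + 1):
        # Flush so the run counter reaches the console before a possible panic.
        print(f"run {i}/{max_runs}: {' '.join(cmd)}", flush=True)
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return i
    return None


# Placeholder command that always succeeds; substitute the real test command.
placeholder = [sys.executable, "-c", "pass"]
first_failure = run_repeatedly(placeholder, 3)
print("first failing run:", first_failure)
```

A kernel panic will not produce a return code at all, so in practice the
evidence of failure is the last "run i/N" line in the console or kdump log.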



[Kernel-packages] [Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-25 Thread Mitchell Augustin
I did some more version testing, and I have not been able to reproduce
this bug with the "aiol" stressor on either Upstream 6.5 or Ubuntu
6.5.0-26-generic-64k, so it was evidently introduced after that version.



[Kernel-packages] [Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-22 Thread Mitchell Augustin
Earlier, I said that the device mapper observation did not seem to be a
hard line - however, further testing now indicates that the situations
where I observed panics when stressing nvme0n1 were due to an unrelated
bug that is present in the latest 6.5 mainline tree, but *not* the
latest 6.5 Ubuntu kernel tree (6.5.0-26-generic-64k).

Therefore, from the perspective of *this* bug report, it once again
*does* appear that this issue is only present when stressing dm-0 and
not present when stressing a non-device-mapper device.



[Kernel-packages] [Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-22 Thread Mitchell Augustin
I did not observe this issue with any other stress_ng disk tests on
linux-image-6.8.0-11-generic-64k after 1 full run of the suite with the
"aiol" test disabled.

(When running the "aiol" test alone, it panicked reliably each time.)



[Kernel-packages] [Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-21 Thread Mitchell Augustin
Upon further investigation, the device mapper observation does not seem
to be a hard line, as I was able to observe panics when stressing both
dm-0 and nvme0n1 under different circumstances.

At the moment, it also seems like the specific part of stress_ng_test
that is the culprit is the "stress-ng aiol stressor". When running only
the "aiol" stressor in isolation on linux-image-6.8.0-11-generic-64k,
the panic reliably happens in under 5 minutes.

Currently investigating to see if any other stress_ng tests cause the
same issue on this kernel version, or if it is only aiol.
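
For anyone wanting to try the aiol stressor on its own, the sketch below
builds a plain stress-ng command line in Python. It is not the exact
invocation that the checkbox stress_ng_test.py wrapper generates, which is not
shown in this thread. The helper name aiol_command and the worker count are
assumptions; --aiol and --timeout are standard stress-ng options.

```python
import shlex


def aiol_command(workers=1, seconds=240):
    """Build a stress-ng command line that runs only the aiol stressor.

    The worker count is an arbitrary choice for illustration; the
    checkbox wrapper may pick a different value.
    """
    if workers < 1 or seconds < 1:
        raise ValueError("workers and seconds must be positive")
    return ["stress-ng", "--aiol", str(workers), "--timeout", f"{seconds}s"]


cmd = aiol_command(workers=4, seconds=240)
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run on the test machine,
which avoids shell quoting issues.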



[Kernel-packages] [Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-21 Thread Mitchell Augustin
I have observed that this panic does not seem to happen when stressing
non-device-mapper devices. For example, it panics when running
  /usr/lib/checkbox-provider-base/bin/stress_ng_test.py disk --device dm-0 --base-time 240
but completes successfully when running
  /usr/lib/checkbox-provider-base/bin/stress_ng_test.py disk --device nvme0n1 --base-time 240

I'm going to investigate this further to confirm.
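
When comparing devices like this, it can help to confirm programmatically
which block devices are device-mapper targets. One way, sketched below in
Python, is to check for the dm subdirectory that the kernel exposes under
/sys/block/<name>/ for device-mapper devices. The helper name
is_device_mapper and the sysfs_root parameter are illustrative; the
demonstration runs against a fake directory tree so that it works on any
machine.

```python
import os
import tempfile


def is_device_mapper(dev_name, sysfs_root="/sys"):
    """Return True if <sysfs_root>/block/<dev_name>/dm exists.

    Device-mapper block devices expose a dm/ directory in sysfs;
    ordinary devices such as NVMe namespaces do not.
    """
    return os.path.isdir(os.path.join(sysfs_root, "block", dev_name, "dm"))


# Demonstration against a fake sysfs tree (hypothetical layout for the example).
with tempfile.TemporaryDirectory() as fake_sys:
    os.makedirs(os.path.join(fake_sys, "block", "dm-0", "dm"))
    os.makedirs(os.path.join(fake_sys, "block", "nvme0n1"))
    results = {name: is_device_mapper(name, fake_sys) for name in ("dm-0", "nvme0n1")}

print(results)
```

On a real system, calling is_device_mapper("dm-0") with the default
sysfs_root checks the live /sys tree instead.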

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2058557

Title:
  Kernel panic during checkbox stress_ng_test on Grace running noble 6.8
  (arm64+largemem) kernel

Status in linux package in Ubuntu:
  New

Bug description:
  A kernel oops and panic occurred during 22.04 SoC certification on
  Gunyolk (Grace/Grace) with 6.8 kernel, arm64+largemem variant

  Steps to reproduce:
  Run (as root) the following commands:

  add-apt-repository -y ppa:checkbox-dev/stable
  apt-add-repository -y ppa:firmware-testing-team/ppa-fwts-stable
  apt update
  apt install -y canonical-certification-server
  /usr/lib/checkbox-provider-base/bin/stress_ng_test.py disk --device dm-0 --base-time 240

  stress_ng_test caused a kernel panic after about 5 minutes. I have
  attached dmesg output from my reproducer to this report.

  Initially, this was identified via a panic during the above test,
  which was running as part of a run of certify-soc-22.04.

  Attached is a tarball containing:

  - apport.linux-image-6.8.0-11-generic-64k.kzsondji.apport: The output of
    `ubuntu-bug linux` on the machine (after reboot)
  - reproduced-dmesg.202403201942: The dmesg output captured by kdump when I
    reproduced my original issue by running only the single stress_ng_test.py
    command above (not the entire cert suite)
  - original-dmesg.txt: The dmesg output I captured when the stress_ng_test
    originally failed during the full cert suite run



[Kernel-packages] [Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-20 Thread Mitchell Augustin
This is also reproducible on the latest mainline version
(https://kernel.ubuntu.com/mainline/v6.8/arm64/, retrieved 20 Mar 2024 @
5 PM):

20 Mar 22:54: Running stress-ng aiol stressor for 240 seconds...
[  354.451450] Unable to handle kernel paging request at virtual address 
17be9b4aa3e187be
[  354.459580] Mem abort info:
[  354.462439]   ESR = 0x9621
[  354.466274]   EC = 0x25: DABT (current EL), IL = 32 bits
[  354.471703]   SET = 0, FnV = 0
[  354.474819]   EA = 0, S1PTW = 0
[  354.478024]   FSC = 0x21: alignment fault
[  354.482118] Data abort info:
[  354.485056]   ISV = 0, ISS = 0x0021, ISS2 = 0x
[  354.490662]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  354.495823]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  354.501251] [17be9b4aa3e187be] address between user and kernel address ranges
[  354.508548] Internal error: Oops: 9621 [#1] SMP
[  354.514245] Modules linked in: qrtr cfg80211 binfmt_misc nls_iso8859_1
input_leds dax_hmem cxl_acpi acpi_ipmi onboard_usb_hub nvidia_cspmu ipmi_ssif
cxl_core ipmi_devintf arm_cspmu_module arm_smmuv3_pmu ipmi_msghandler
uio_pdrv_genirq uio spi_nor cppc_cpufreq joydev mtd acpi_power_meter
dm_multipath nvme_fabrics efi_pstore nfnetlink dmi_sysfs ip_tables x_tables
autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy
async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0
hid_generic rndis_host usbhid cdc_ether hid usbnet uas usb_storage
crct10dif_ce polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce
sm4_ce_cipher sm4 sm3_ce sm3 nvme sha3_ce i2c_smbus ixgbe sha2_ce nvme_core
ast sha256_arm64 xhci_pci sha1_ce xfrm_algo xhci_pci_renesas i2c_algo_bit
nvme_auth mdio spi_tegra210_quad i2c_tegra aes_neon_bs aes_neon_blk
aes_ce_blk aes_ce_cipher
[  354.594676] CPU: 61 PID: 0 Comm: swapper/61 Kdump: loaded Not tainted 
6.8.0-060800-generic-64k #202403131158
[  354.604728] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 1.0c 12/28/2023
[  354.611844] pstate: 034000c9 (nzcv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[  354.618962] pc : _raw_spin_lock_irqsave+0x44/0x100
[  354.623863] lr : try_to_wake_up+0x68/0x758
[  354.628053] sp : 8000807afaf0
[  354.631436] x29: 8000807afaf0 x28: 0004 x27: 
[  354.638731] x26: a06103dc8a98 x25: 8000807afd98 x24: 0002
[  354.646027] x23: f8156840 x22: 17be9b4aa3e187be x21: 
[  354.653323] x20: 0003 x19: 00c0 x18: 8000819a0098
[  354.660619] x17:  x16:  x15: e97dca18
[  354.667914] x14:  x13:  x12: 
[  354.675208] x11:  x10:  x9 : a06100ba6810
[  354.682504] x8 :  x7 : 0040 x6 : 9080
[  354.689800] x5 : c2fb0dc488b0 x4 :  x3 : 894178c0
[  354.697096] x2 : 0001 x1 :  x0 : 17be9b4aa3e187be
[  354.704391] Call trace:
[  354.706886]  _raw_spin_lock_irqsave+0x44/0x100
[  354.711426]  try_to_wake_up+0x68/0x758
[  354.715254]  wake_up_process+0x24/0x50
[  354.719082]  aio_complete+0x1c4/0x2b8
[  354.722825]  aio_complete_rw+0x11c/0x2c8
[  354.726831]  iomap_dio_bio_end_io+0x1f0/0x248
[  354.731282]  bio_endio+0x170/0x270
[  354.734758]  __dm_io_complete+0x180/0x200
[  354.738855]  clone_endio+0xc8/0x288
[  354.742416]  bio_endio+0x170/0x270
[  354.745889]  blk_mq_end_request_batch+0x2e0/0x558
[  354.750696]  nvme_pci_complete_batch+0x94/0x118 [nvme]
[  354.755958]  nvme_irq+0x9c/0xb0 [nvme]
[  354.759788]  __handle_irq_event_percpu+0x68/0x2c0
[  354.764595]  handle_irq_event+0x58/0xe8
[  354.768511]  handle_fasteoi_irq+0xb0/0x218
[  354.772695]  generic_handle_domain_irq+0x38/0x70
[  354.777411]  __gic_handle_irq_from_irqson.isra.0+0x180/0x310
[  354.783195]  gic_handle_irq+0x2c/0xa0
[  354.786935]  call_on_irq_stack+0x3c/0x50
[  354.790941]  do_interrupt_handler+0xb0/0xc8
[  354.795214]  el1_interrupt+0x48/0xf0
[  354.798866]  el1h_64_irq_handler+0x1c/0x40
[  354.803050]  el1h_64_irq+0x7c/0x80
[  354.806523]  cpuidle_enter_state+0xd8/0x790
[  354.810795]  cpuidle_enter+0x44/0x78
[  354.814446]  cpuidle_idle_call+0x15c/0x210
[  354.818631]  do_idle+0xb0/0x130
[  354.821837]  cpu_startup_entry+0x44/0x50
[  354.825845]  secondary_start_kernel+0xec/0x130
[  354.830386]  __secondary_switched+0xc0/0xc8
[  354.834661] Code: b9001041 d503201f 5281 52800022 (88e17c02) 
[  354.840893] SMP: stopping secondary CPUs
[  355.897569] SMP: failed to stop secondary CPUs 0-60,62-143
[  355.904206] Starting crashdump kernel...
[  355.908214] ------------[ cut here ]------------
[  355.912930] Some CPUs may be stale, kdump will be unreliable.
[  355.918807] WARNING: CPU: 61 PID: 0 at arch/arm64/kernel/machine_kexec.c:174 
machine_kexec+0x48/0x1f0
[  355.928236] Modules linked in: qrtr cfg80211 binfmt_misc nls_iso8859_1 
input_leds dax_hmem 
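
To compare this trace against the other dmesg captures, a small Python sketch
like the one below can pull the call-trace frames out of a saved log. In the
trace above, the frames run from nvme_irq through clone_endio and
__dm_io_complete up to aio_complete and try_to_wake_up. The function name and
the regular expression are illustrative assumptions based only on the line
format shown in this report, not an existing tool.

```python
import re

# Matches frame lines such as
#   "[  354.706886]  _raw_spin_lock_irqsave+0x44/0x100"
#   "[  354.750696]  nvme_pci_complete_batch+0x94/0x118 [nvme]"
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<func>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def extract_call_trace(dmesg_text):
    """Return [(function, module_or_None), ...] for the first call trace."""
    frames = []
    in_trace = False
    for line in dmesg_text.splitlines():
        if not in_trace:
            # Skip everything until the "Call trace:" marker line.
            in_trace = line.rstrip().endswith("Call trace:")
            continue
        match = FRAME_RE.match(line)
        if match is None:
            break  # the first non-frame line ends the trace
        frames.append((match.group("func"), match.group("module")))
    return frames
```

Checking whether __dm_io_complete or clone_endio appears in the extracted
frames is one quick way to see if a given capture went through the
device-mapper completion path.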

[Kernel-packages] [Bug 2058557] [NEW] Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-20 Thread Mitchell Augustin
Public bug reported:

A kernel oops and panic occurred during 22.04 SoC certification on
Gunyolk (Grace/Grace) with 6.8 kernel, arm64+largemem variant

Steps to reproduce:
Run (as root) the following commands:

add-apt-repository -y ppa:checkbox-dev/stable
apt-add-repository -y ppa:firmware-testing-team/ppa-fwts-stable
apt update
apt install -y canonical-certification-server
/usr/lib/checkbox-provider-base/bin/stress_ng_test.py disk --device dm-0 --base-time 240

stress_ng_test caused a kernel panic after about 5 minutes. I have
attached dmesg output from my reproducer to this report.

Initially, this was identified via a panic during the above test, which
was running as part of a run of certify-soc-22.04.

Attached is a tarball containing:

- apport.linux-image-6.8.0-11-generic-64k.kzsondji.apport: The output of
  `ubuntu-bug linux` on the machine (after reboot)
- reproduced-dmesg.202403201942: The dmesg output captured by kdump when I
  reproduced my original issue by running only the single stress_ng_test.py
  command above (not the entire cert suite)
- original-dmesg.txt: The dmesg output I captured when the stress_ng_test
  originally failed during the full cert suite run

** Affects: linux (Ubuntu)
 Importance: Undecided
 Status: New

** Attachment added: "dmesg and ubuntu-bug outputs"
   https://bugs.launchpad.net/bugs/2058557/+attachment/5757643/+files/grace-panic.tar.xz

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2058557

Title:
  Kernel panic during checkbox stress_ng_test on Grace running noble 6.8
  (arm64+largemem) kernel

Status in linux package in Ubuntu:
  New

Bug description:
  A kernel oops and panic occurred during 22.04 SoC certification on
  Gunyolk (Grace/Grace) with 6.8 kernel, arm64+largemem variant

  Steps to reproduce:
  Run (as root) the following commands:

  add-apt-repository -y ppa:checkbox-dev/stable
  apt-add-repository -y ppa:firmware-testing-team/ppa-fwts-stable
  apt update
  apt install -y canonical-certification-server
  /usr/lib/checkbox-provider-base/bin/stress_ng_test.py disk --device dm-0 --base-time 240

  stress_ng_test caused a kernel panic after about 5 minutes. I have
  attached dmesg output from my reproducer to this report.

  Initially, this was identified via a panic during the above test,
  which was running as part of a run of certify-soc-22.04.

  Attached is a tarball containing:

  - apport.linux-image-6.8.0-11-generic-64k.kzsondji.apport: The output of
    `ubuntu-bug linux` on the machine (after reboot)
  - reproduced-dmesg.202403201942: The dmesg output captured by kdump when I
    reproduced my original issue by running only the single stress_ng_test.py
    command above (not the entire cert suite)
  - original-dmesg.txt: The dmesg output I captured when the stress_ng_test
    originally failed during the full cert suite run



[Kernel-packages] [Bug 2029934] Re: arm64 AWS host hangs during modprobe nvidia on lunar and mantic

2024-02-07 Thread Mitchell Augustin
I identified a similar bug today when installing
nvidia-fabricmanager-535 on a noble dev build for arm64 that may be
related:

https://bugs.launchpad.net/ubuntu/+source/fabric-manager-535/+bug/2052663

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-hwe-6.5 in Ubuntu.
https://bugs.launchpad.net/bugs/2029934

Title:
  arm64 AWS host hangs during modprobe nvidia on lunar and mantic

Status in linux-aws package in Ubuntu:
  Incomplete
Status in linux-hwe-6.5 package in Ubuntu:
  New
Status in nvidia-graphics-drivers-525 package in Ubuntu:
  Incomplete
Status in nvidia-graphics-drivers-525-server package in Ubuntu:
  Incomplete
Status in nvidia-graphics-drivers-535 package in Ubuntu:
  Confirmed
Status in nvidia-graphics-drivers-535-server package in Ubuntu:
  Confirmed

Bug description:
  Loading the nvidia driver dkms modules with "modprobe nvidia" causes
  the host to hang and become completely unusable. This was reproduced
  using both the linux generic and linux-aws kernels on lunar and mantic
  using an AWS g5g.xlarge instance.

  To reproduce using the generic kernel:
  # Deploy an arm64 host with an nvidia gpu, such as an AWS g5g.xlarge.

  # Install the linux generic kernel from lunar-updates:
  $ sudo DEBIAN_FRONTEND=noninteractive apt-get install -y -o \
      DPkg::Options::=--force-confold linux-generic

  # Boot to the linux-generic kernel (this can be accomplished by removing the
  # existing kernel; in this case it was the linux-aws 6.2.0-1008-aws kernel)
  $ sudo DEBIAN_FRONTEND=noninteractive apt-get purge -y -o \
      DPkg::Options::=--force-confold linux-aws linux-aws-headers-6.2.0-1008 \
      linux-headers-6.2.0-1008-aws linux-headers-aws linux-image-6.2.0-1008-aws \
      linux-image-aws linux-modules-6.2.0-1008-aws linux-headers-6.2.0-1008-aws \
      linux-image-6.2.0-1008-aws linux-modules-6.2.0-1008-aws
  $ reboot

  # Install the Nvidia 535-server driver DKMS package:
  $ sudo DEBIAN_FRONTEND=noninteractive apt-get install -y \
      nvidia-driver-535-server

  # Enable the driver
  $ sudo modprobe nvidia

  # At this point the system will hang and never return.
  # Rebooting instead of running modprobe results in a system that never
  # finishes booting. I was able to recover the console logs from such a
  # system and found (the full captured log is attached):

  [1.964942] nvidia: loading out-of-tree module taints kernel.
  [1.965475] nvidia: module license 'NVIDIA' taints kernel.
  [1.965905] Disabling lock debugging due to kernel taint
  [1.980905] nvidia: module verification failed: signature and/or required 
key missing - tainting kernel
  [2.012067] nvidia-nvlink: Nvlink Core is being initialized, major device 
number 510
  [2.012715] 
  [   62.025143] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  [   62.025807] rcu:   3-...0: (14 ticks this GP) 
idle=c04c/1/0x4000 softirq=653/654 fqs=3301
  [   62.026516](detected by 0, t=15003 jiffies, g=-699, q=216 ncpus=4)
  [   62.027018] Task dump for CPU 3:
  [   62.027290] task:systemd-udevd   state:R  running task stack:0 
pid:164   ppid:144flags:0x000e
  [   62.028066] Call trace:
  [   62.028273]  __switch_to+0xbc/0x100
  [   62.028567]  0x228
  Timed out for waiting the udev queue being empty.
  Timed out for waiting the udev queue being empty.
  [  242.045143] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  [  242.045655] rcu:   3-...0: (14 ticks this GP) 
idle=c04c/1/0x4000 softirq=653/654 fqs=12303
  [  242.046373](detected by 1, t=60008 jiffies, g=-699, q=937 ncpus=4)
  [  242.046874] Task dump for CPU 3:
  [  242.047146] task:systemd-udevd   state:R  running task stack:0 
pid:164   ppid:144flags:0x000f
  [  242.047922] Call trace:
  [  242.048128]  __switch_to+0xbc/0x100
  [  242.048417]  0x228
  Timed out for waiting the udev queue being empty.
  Begin: Loading essential drivers ... [  384.001142] watchdog: BUG: soft 
lockup - CPU#2 stuck for 22s! [modprobe:215]
  [  384.001738] Modules linked in: nvidia(POE+) crct10dif_ce video polyval_ce 
polyval_generic drm_kms_helper ghash_ce syscopyarea sm4 sysfillrect sha2_ce 
sysimgblt sha256_arm64 sha1_ce drm nvme nvme_core ena nvme_common aes_neon_bs 
aes_neon_blk aes_ce_blk aes_ce_cipher
  [  384.003513] CPU: 2 PID: 215 Comm: modprobe Tainted: P   OE  
6.2.0-26-generic #26-Ubuntu
  [  384.004210] Hardware name: Amazon EC2 g5g.xlarge/, BIOS 1.0 11/1/2018
  [  384.004715] pstate: 8045 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  [  384.005259] pc : smp_call_function_many_cond+0x1b4/0x4b4
  [  384.005683] lr : smp_call_function_many_cond+0x1d0/0x4b4
  [  384.006108] sp : 889a3a70
  [  384.006381] x29: 889a3a70 x28: 0003 x27: 
00056d1fafa0
  [  384.006954] x26: 00056d1d76c8 x25: c87cf18bdd10 x24: 
0003
  [  384.007527] x23: 0001 x22: 00056d1d76c8 x21: