[Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-05-08 Thread Mitchell Augustin
** Tags removed: verification-needed-noble-linux
** Tags added: verification-done-noble-linux

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2058557

Title:
  Kernel panic during checkbox stress_ng_test on Grace running noble 6.8
  (arm64+largemem) kernel

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2058557/+subscriptions


-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 2052663] Re: fabric-manager-535 setup fails during install on Grace/Hopper arm64 system running noble

2024-04-24 Thread Mitchell Augustin
This bug no longer appears to be reproducible on noble with the 6.8
generic kernels, so I have marked it as resolved.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2052663

Title:
  fabric-manager-535 setup fails during install on Grace/Hopper arm64
  system running noble

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/fabric-manager-535/+bug/2052663/+subscriptions



[Bug 2062380] Re: Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper

2024-04-24 Thread Mitchell Augustin
Compiling the Nvidia drivers with -ffixed-x18 on affected versions is
also sufficient to prevent this hang/panic:

https://github.com/NVIDIA/open-gpu-kernel-modules

diff --git a/src/nvidia-modeset/Makefile b/src/nvidia-modeset/Makefile
index 66edbf4e..d49a3bfb 100644
--- a/src/nvidia-modeset/Makefile
+++ b/src/nvidia-modeset/Makefile
@@ -95,6 +95,7 @@ endif
 ifeq ($(TARGET_ARCH),aarch64)
   CFLAGS += -mgeneral-regs-only
   CFLAGS += -march=armv8-a
+  CFLAGS += -ffixed-x18
   CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -mno-outline-atomics)
 endif
 
diff --git a/src/nvidia/Makefile b/src/nvidia/Makefile
index e2f1c672..0f70514b 100644
--- a/src/nvidia/Makefile
+++ b/src/nvidia/Makefile
@@ -90,6 +90,7 @@ ifeq ($(TARGET_ARCH),aarch64)
   CFLAGS += -mgeneral-regs-only
   CFLAGS += -march=armv8-a
   CFLAGS += -mstrict-align
+  CFLAGS += -ffixed-x18
   CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -mno-outline-atomics)
 endif
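
For context: on arm64 the kernel's shadow call stack keeps its stack
pointer in register x18, so a driver object that treats x18 as a
general-purpose register can clobber it. One rough way to check whether a
rebuilt object still touches x18 is to scan its disassembly. The sketch
below is illustrative only: it assumes text in the style printed by
`objdump -d`, the helper name find_x18_uses is hypothetical, and the
sample input is hand-written rather than taken from the Nvidia driver.

```python
import re

# Matches x18 or its 32-bit alias w18 as a whole register name.
# Immediates such as 0x18 do not match because there is no word
# boundary between "0" and "x".
X18_PATTERN = re.compile(r"\b[xw]18\b")


def find_x18_uses(disassembly: str) -> list[str]:
    """Return the disassembly lines whose instruction mentions x18/w18."""
    hits = []
    for line in disassembly.splitlines():
        # objdump -d lines are tab-separated: address, bytes, instruction.
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        instruction = "\t".join(parts[2:])
        if X18_PATTERN.search(instruction):
            hits.append(line.strip())
    return hits


# Hand-written sample input (an assumption, not real driver output).
sample = (
    "   0:\ta9bf7bfd\tstp\tx29, x30, [sp, #-16]!\n"
    "   4:\taa0003f2\tmov\tx18, x0\n"
    "   8:\td65f03c0\tret\n"
)
print(find_x18_uses(sample))
```

An empty result for every object in the module would be consistent with
the flag having taken effect, although it is not a proof of correctness.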

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2062380

Title:
  Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-535-server/+bug/2062380/+subscriptions



[Bug 2052663] Re: fabric-manager-535 setup fails during install on Grace/Hopper arm64 system running noble

2024-04-24 Thread Mitchell Augustin
** Changed in: fabric-manager-535 (Ubuntu)
 Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)

** Changed in: linux (Ubuntu)
 Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)

** Changed in: nvidia-graphics-drivers-535-server (Ubuntu)
 Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)

** Changed in: fabric-manager-535 (Ubuntu)
   Status: New => Fix Released

** Changed in: linux (Ubuntu)
   Status: New => Fix Released

** Changed in: nvidia-graphics-drivers-535-server (Ubuntu)
   Status: New => Fix Released


[Bug 2062380] Re: Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper

2024-04-24 Thread Mitchell Augustin
In trying to determine if core count had any effect on this bug, I set
maxcpus to 4 and tried loading the driver on the kernel with the shadow
stack enabled (aka the standard -generic config). It looks like the same
root issue occurred, but this time, I got a panic with a trace that
corroborates the claim that this is related to the shadow stack:

[  391.736417] Internal error: Oops - FPAC: 7200 [#1] SMP
[  391.744257] Modules linked in: nvidia(OE+) ecdh_generic ecc qrtr cdc_ether 
cdc_subset usbnet cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core ast 
i2c_algo_bit nvidia_cspmu arm_spe_pmu arm_smmuv3_pmu arm_cspmu_module 
uio_pdrv_genirq uio spi_nor acpi_ipmi mtd nls_iso8859_1 ipmi_ssif ipmi_devintf 
cppc_cpufreq ipmi_msghandler acpi_power_meter dm_multipath efi_pstore nfnetlink 
dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 
async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon 
raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll 
i2c_smbus crct10dif_ce polyval_ce polyval_generic ghash_ce sm4_ce_gcm 
sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce sm3 nvme sha3_ce sha2_ce 
sha256_arm64 sha1_ce mlx5_core nvme_core mlxfw nvme_auth psample xhci_pci tls 
xhci_pci_renesas pci_hyperv_intf spi_tegra210_quad i2c_tegra aes_neon_bs 
aes_neon_blk aes_ce_blk aes_ce_cipher
[  391.826552] CPU: 0 PID: 14412 Comm: insmod Tainted: G   OE  
6.8.1+ #2
[  391.834202] Hardware name:  /, BIOS 01.02.01 20240207
[  391.840074] pstate: 6349 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[  391.847190] pc : __kmalloc+0x1e4/0x498
[  391.851025] lr : 0xc040
[  391.854605] sp : 8000a3ab3620
[  391.857987] x29: 8000a3ab3620 x28: 0001 x27: 0001
[  391.865282] x26: 01f8 x25: 00aa1d70 x24: 8feac028
[  391.872577] x23: c040aab743f0 x22: 80008d4c5020 x21: 8000a3ab37f8
[  391.879871] x20: 0038 x19: 8000a3ab3658 x18: 8000a3ab3614
[  391.887165] x17:  x16:  x15: 0004
[  391.894459] x14:  x13:  x12: 
[  391.901753] x11:  x10: 8000a3ab36a0 x9 : c040c0af8d48
[  391.909049] x8 : 8edc3c40 x7 :  x6 : 
[  391.916343] x5 :  x4 :  x3 : 
[  391.923637] x2 :  x1 : 8e87c480 x0 : 8edc3c00
[  391.930931] Call trace:
[  391.933427]  __kmalloc+0x1e4/0x498
[  391.936899]  0xc0007304e5f6c040
[  391.940107] Code: a9435bf5 a94463f7 910183ff f85f8e5e (d50323bf) 
[  391.946336] ---[ end trace  ]---
[  391.977579] Kernel panic - not syncing: corrupted shadow stack detected 
inside scheduler
[  391.980605] kauditd_printk_skb: 98 callbacks suppressed
[  391.980607] audit: type=1400 audit(1713999301.128:108): apparmor="DENIED" 
operation="open" class="file" profile="rsyslogd" name="/run/systemd/sessions/" 
pid=801 comm=72733A6D61696E20513A526567 requested_mask="r" denied_mask="r" 
fsuid=103 ouid=0
[  391.980674] audit: type=1400 audit(1713999301.128:109): apparmor="DENIED" 
operation="open" class="file" profile="rsyslogd" name="/run/systemd/sessions/" 
pid=801 comm=72733A6D61696E20513A526567 requested_mask="r" denied_mask="r" 
fsuid=103 ouid=0
[  391.980679] audit: type=1400 audit(1713999301.128:110): apparmor="DENIED" 
operation="open" class="file" profile="rsyslogd" name="/run/systemd/sessions/" 
pid=801 comm=72733A6D61696E20513A526567 requested_mask="r" denied_mask="r" 
fsuid=103 ouid=0
[  392.057603] SMP: stopping secondary CPUs
[  392.061632] Kernel Offset: 0x40404069 from 0x80008000
[  392.067859] PHYS_OFFSET: 0x8000
[  392.071420] CPU features: 0x0,,d002cd4a,2b67fea7
[  392.076848] Memory Limit: none
[  392.106695] ---[ end Kernel panic - not syncing: corrupted shadow stack 
detected inside scheduler ]---


[Bug 2062380] Re: Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper

2024-04-24 Thread Mitchell Augustin
It looks like the relevant difference between the two configs is the
option below: the upstream stable 6.8.1 defconfig has it disabled, while
the 6.8.0-31-generic config has it enabled. Disabling it is what allows
the defconfig kernel to load the Nvidia driver:

CONFIG_SHADOW_CALL_STACK=n

I suspect that the kernel team is not going to want to disable kernel
support for the GCC shadow stack to fix this bug, so my guess is that
we'll need to explore other potential fixes for this issue.
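
A minimal sketch of how the two configs could be compared for this
option. It assumes standard Kconfig output, where a disabled option is
normally recorded as "# CONFIG_FOO is not set" rather than "=n". The
helper name kconfig_state and both sample strings are hypothetical; they
are not the real 6.8.1 defconfig or 6.8.0-31-generic config.

```python
def kconfig_state(config_text: str, option: str) -> str:
    """Return the option's value ('y', 'm', ...) or 'n' if unset or absent."""
    for line in config_text.splitlines():
        line = line.strip()
        # Kconfig writes disabled options as a comment.
        if line == f"# {option} is not set":
            return "n"
        if line.startswith(f"{option}="):
            return line.split("=", 1)[1]
    return "n"


# Hand-written stand-ins for the two configs (assumptions).
generic_like = "CONFIG_ARM64=y\nCONFIG_SHADOW_CALL_STACK=y\n"
defconfig_like = "CONFIG_ARM64=y\n# CONFIG_SHADOW_CALL_STACK is not set\n"

for name, text in (("generic", generic_like), ("defconfig", defconfig_like)):
    print(name, kconfig_state(text, "CONFIG_SHADOW_CALL_STACK"))
```

On an installed Ubuntu kernel the text to pass in would typically come
from /boot/config-<kernel release>.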


[Bug 2062380] Re: Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper

2024-04-24 Thread Mitchell Augustin
** Changed in: nvidia-graphics-drivers-535-server (Ubuntu)
 Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)

** Changed in: nvidia-graphics-drivers-550-server (Ubuntu)
 Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)


[Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-04-09 Thread Mitchell Augustin
Fix has landed upstream:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/aio.c?h=v6.9-rc3&id=caeb4b0a11b3393e43f7fa8e0a5a18462acc66bd


[Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-04-01 Thread Mitchell Augustin
A fix has been applied to vfs.fixes upstream and should land soon. I
have tested this patch and verified that the panic no longer occurs.

** Changed in: linux (Ubuntu)
   Status: New => Fix Committed


[Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-28 Thread Mitchell Augustin
This issue is still present upstream, so I reported it to the original
committer of the patch.


[Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-28 Thread Mitchell Augustin
I have isolated the cause of this bug to this commit:
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/noble/commit/?h=Ubuntu-6.8.0-20.20&id=71eb6b6b0ba93b1467bccff57b5de746b09113d2

All versions that I tested before this commit during my bisect passed
the aiol test at least 15 times in a row, and all versions after this
commit panic during at least one test. To confirm, I reverted this patch
on the latest 6.8 Ubuntu kernel (which was previously panicking reliably
within 5 tests) and verified that, with that change, it passes the test
at least 15x in a row without any panics.

The contents of the patch also support this conclusion, as the patch is
a change to the Linux AIO interface that introduces new calls to
spin_lock_irqsave() and wake_up_process() inside aio_complete(), which
corresponds with the content of the traces I have observed.
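
A rough way to reason about the 15-passes-in-a-row criterion used in the
bisect: if an affected kernel panics in any single run with probability
p, and runs are independent, the chance that it survives n runs and is
mislabelled as good is (1 - p)^n. The value p = 0.2 below is purely an
assumed figure for illustration; it was not measured in this bug. The
function names are hypothetical.

```python
import math


def prob_all_pass(p_panic: float, runs: int) -> float:
    """Probability that `runs` independent runs all pass on an affected kernel."""
    return (1.0 - p_panic) ** runs


def runs_needed(p_panic: float, max_false_good: float) -> int:
    """Smallest run count at which an affected kernel is mislabelled
    as good with probability at most `max_false_good`."""
    return math.ceil(math.log(max_false_good) / math.log(1.0 - p_panic))


P_PANIC = 0.2  # assumed per-run panic rate, for illustration only

print(round(prob_all_pass(P_PANIC, 15), 4))
print(runs_needed(P_PANIC, 0.05))
```

Under that assumption, 15 clean runs leave roughly a 3.5% chance of
mislabelling an affected kernel, and 14 runs are enough to get under 5%.
A lower real panic rate would call for more runs per bisect step.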


[Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-26 Thread Mitchell Augustin
It turns out that this issue does not appear with *every* run of the
aiol test on affected kernels, so multiple runs of that test may be
necessary for the panic to occur.
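
Since several runs may be needed, the test can be wrapped in a loop. The
sketch below is one possible shape for such a loop, not the procedure
actually used here. Because a panic takes the whole machine down, it
writes the run number to a file and syncs it to disk before each run, so
the count survives a crash. The function name, the progress file, and
the harmless stand-in command are all assumptions.

```python
import os
import subprocess
import sys
import tempfile


def run_repeatedly(cmd, attempts, progress_path):
    """Run cmd up to `attempts` times.

    Returns the 1-based index of the first run that exits non-zero,
    or None if every run exits with status 0.
    """
    for attempt in range(1, attempts + 1):
        # Persist the attempt number *before* starting the run, so it is
        # still on disk if the machine panics partway through.
        with open(progress_path, "a") as log:
            log.write(f"starting run {attempt}\n")
            log.flush()
            os.fsync(log.fileno())
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return attempt
    return None


# Demo with a harmless stand-in for the real stress test command.
progress_file = os.path.join(tempfile.mkdtemp(), "aiol-progress.log")
harmless_cmd = [sys.executable, "-c", "pass"]
first_failure = run_repeatedly(harmless_cmd, 3, progress_file)
print("first failing run:", first_failure)
```

After a panic and reboot, the last "starting run" line in the progress
file shows which run triggered it.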


[Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-25 Thread Mitchell Augustin
I did some more version testing, and I have not been able to reproduce
this bug with the "aiol" stressor on either Upstream 6.5 or Ubuntu
6.5.0-26-generic-64k, so it was evidently introduced after that version.


[Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-22 Thread Mitchell Augustin
Earlier, I said that the device mapper observation did not seem to be a
hard line - however, further testing now indicates that the situations
where I observed panics when stressing nvme0n1 were due to an unrelated
bug that is present in the latest 6.5 mainline tree, but *not* the
latest 6.5 Ubuntu kernel tree (6.5.0-26-generic-64k).

Therefore, from the perspective of *this* bug report, it once again
*does* appear that this issue is only present when stressing dm-0 and
not present when stressing a non-device-mapper device.
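
When repeating this comparison on other machines, it can help to confirm
programmatically which block devices are device-mapper devices. A
minimal sketch, assuming the usual sysfs layout in which device-mapper
devices expose a "dm" subdirectory under /sys/block/<device>. The helper
name is hypothetical, and the demo uses a fake directory tree instead of
the real sysfs.

```python
import os
import tempfile


def is_device_mapper(device: str, sysfs_block: str = "/sys/block") -> bool:
    """Return True if sysfs exposes a 'dm' directory for this block device."""
    return os.path.isdir(os.path.join(sysfs_block, device, "dm"))


# Build a fake sysfs tree for the demo (an assumption, not real hardware).
fake_sysfs = tempfile.mkdtemp()
os.makedirs(os.path.join(fake_sysfs, "dm-0", "dm"))
os.makedirs(os.path.join(fake_sysfs, "nvme0n1"))

for dev in ("dm-0", "nvme0n1"):
    print(dev, is_device_mapper(dev, fake_sysfs))
```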


[Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-22 Thread Mitchell Augustin
I did not observe this issue with any other stress_ng disk tests on
linux-image-6.8.0-11-generic-64k after 1 full run of the suite with the
"aiol" test disabled.

(When running the "aiol" test alone, it panicked reliably each time.)


[Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-21 Thread Mitchell Augustin
Upon further investigation, the device mapper observation does not seem
to be a hard line, as I was able to observe panics when stressing both
dm-0 and nvme0n1 under different circumstances.

At the moment, it also seems like the specific part of stress_ng_test
that is the culprit is the "stress-ng aiol stressor". When running only
the "aiol" stressor in isolation on linux-image-6.8.0-11-generic-64k,
the panic reliably happens in under 5 minutes.

Currently investigating to see if any other stress_ng tests cause the
same issue on this kernel version, or if it is only aiol.
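
For anyone wanting to run the stressor on its own, a sketch of building
a direct stress-ng invocation follows. The exact arguments that
stress_ng_test.py passes are not shown in this report, so the worker
count and the choice of flags here are assumptions; only the --aiol and
--timeout options of stress-ng itself are relied on. The helper name is
hypothetical.

```python
import shlex


def aiol_command(workers: int, seconds: int) -> list[str]:
    """Build an argv list that runs only the stress-ng aiol stressor."""
    return ["stress-ng", "--aiol", str(workers), "--timeout", f"{seconds}s"]


# 240 seconds matches the duration seen in the log; 4 workers is an
# arbitrary illustrative value.
argv = aiol_command(workers=4, seconds=240)
print(shlex.join(argv))
```

The list can be passed to subprocess.run() from a directory on the
filesystem under test.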


[Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-21 Thread Mitchell Augustin
I have observed that this panic does not seem to happen when stressing
non-device-mapper devices. For example, this command panics:

/usr/lib/checkbox-provider-base/bin/stress_ng_test.py disk --device dm-0 --base-time 240

but this one completes successfully:

/usr/lib/checkbox-provider-base/bin/stress_ng_test.py disk --device nvme0n1 --base-time 240

I'm going to investigate this further to confirm.


[Bug 2058557] Re: Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-20 Thread Mitchell Augustin
This is also reproducible on the latest mainline version
(https://kernel.ubuntu.com/mainline/v6.8/arm64/, retrieved 20 Mar 2024 @
5 PM):

20 Mar 22:54: Running stress-ng aiol stressor for 240 seconds...
[  354.451450] Unable to handle kernel paging request at virtual address 
17be9b4aa3e187be
[  354.459580] Mem abort info:
[  354.462439]   ESR = 0x9621
[  354.466274]   EC = 0x25: DABT (current EL), IL = 32 bits
[  354.471703]   SET = 0, FnV = 0
[  354.474819]   EA = 0, S1PTW = 0
[  354.478024]   FSC = 0x21: alignment fault
[  354.482118] Data abort info:
[  354.485056]   ISV = 0, ISS = 0x0021, ISS2 = 0x
[  354.490662]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  354.495823]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  354.501251] [17be9b4aa3e187be] address between user and kernel address ranges
[  354.508548] Internal error: Oops: 9621 [#1] SMP
[  354.514245] Modules linked in: qrtr cfg80211 binfmt_misc nls_iso8859_1
input_leds dax_hmem cxl_acpi acpi_ipmi onboard_usb_hub nvidia_cspmu ipmi_ssif
cxl_core ipmi_devintf arm_cspmu_module arm_smmuv3_pmu ipmi_msghandler
uio_pdrv_genirq uio spi_nor cppc_cpufreq joydev mtd acpi_power_meter
dm_multipath nvme_fabrics efi_pstore nfnetlink dmi_sysfs ip_tables x_tables
autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy
async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0
hid_generic rndis_host usbhid cdc_ether hid usbnet uas usb_storage crct10dif_ce
polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher
sm4 sm3_ce sm3 nvme sha3_ce i2c_smbus ixgbe sha2_ce nvme_core ast sha256_arm64
xhci_pci sha1_ce xfrm_algo xhci_pci_renesas i2c_algo_bit nvme_auth mdio
spi_tegra210_quad i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[  354.594676] CPU: 61 PID: 0 Comm: swapper/61 Kdump: loaded Not tainted 
6.8.0-060800-generic-64k #202403131158
[  354.604728] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 1.0c 12/28/2023
[  354.611844] pstate: 034000c9 (nzcv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[  354.618962] pc : _raw_spin_lock_irqsave+0x44/0x100
[  354.623863] lr : try_to_wake_up+0x68/0x758
[  354.628053] sp : 8000807afaf0
[  354.631436] x29: 8000807afaf0 x28: 0004 x27: 
[  354.638731] x26: a06103dc8a98 x25: 8000807afd98 x24: 0002
[  354.646027] x23: f8156840 x22: 17be9b4aa3e187be x21: 
[  354.653323] x20: 0003 x19: 00c0 x18: 8000819a0098
[  354.660619] x17:  x16:  x15: e97dca18
[  354.667914] x14:  x13:  x12: 
[  354.675208] x11:  x10:  x9 : a06100ba6810
[  354.682504] x8 :  x7 : 0040 x6 : 9080
[  354.689800] x5 : c2fb0dc488b0 x4 :  x3 : 894178c0
[  354.697096] x2 : 0001 x1 :  x0 : 17be9b4aa3e187be
[  354.704391] Call trace:
[  354.706886]  _raw_spin_lock_irqsave+0x44/0x100
[  354.711426]  try_to_wake_up+0x68/0x758
[  354.715254]  wake_up_process+0x24/0x50
[  354.719082]  aio_complete+0x1c4/0x2b8
[  354.722825]  aio_complete_rw+0x11c/0x2c8
[  354.726831]  iomap_dio_bio_end_io+0x1f0/0x248
[  354.731282]  bio_endio+0x170/0x270
[  354.734758]  __dm_io_complete+0x180/0x200
[  354.738855]  clone_endio+0xc8/0x288
[  354.742416]  bio_endio+0x170/0x270
[  354.745889]  blk_mq_end_request_batch+0x2e0/0x558
[  354.750696]  nvme_pci_complete_batch+0x94/0x118 [nvme]
[  354.755958]  nvme_irq+0x9c/0xb0 [nvme]
[  354.759788]  __handle_irq_event_percpu+0x68/0x2c0
[  354.764595]  handle_irq_event+0x58/0xe8
[  354.768511]  handle_fasteoi_irq+0xb0/0x218
[  354.772695]  generic_handle_domain_irq+0x38/0x70
[  354.777411]  __gic_handle_irq_from_irqson.isra.0+0x180/0x310
[  354.783195]  gic_handle_irq+0x2c/0xa0
[  354.786935]  call_on_irq_stack+0x3c/0x50
[  354.790941]  do_interrupt_handler+0xb0/0xc8
[  354.795214]  el1_interrupt+0x48/0xf0
[  354.798866]  el1h_64_irq_handler+0x1c/0x40
[  354.803050]  el1h_64_irq+0x7c/0x80
[  354.806523]  cpuidle_enter_state+0xd8/0x790
[  354.810795]  cpuidle_enter+0x44/0x78
[  354.814446]  cpuidle_idle_call+0x15c/0x210
[  354.818631]  do_idle+0xb0/0x130
[  354.821837]  cpu_startup_entry+0x44/0x50
[  354.825845]  secondary_start_kernel+0xec/0x130
[  354.830386]  __secondary_switched+0xc0/0xc8
[  354.834661] Code: b9001041 d503201f 5281 52800022 (88e17c02) 
[  354.840893] SMP: stopping secondary CPUs
[  355.897569] SMP: failed to stop secondary CPUs 0-60,62-143
[  355.904206] Starting crashdump kernel...
[  355.908214] [ cut here ]
[  355.912930] Some CPUs may be stale, kdump will be unreliable.
[  355.918807] WARNING: CPU: 61 PID: 0 at arch/arm64/kernel/machine_kexec.c:174 
machine_kexec+0x48/0x1f0
[  355.928236] Modules linked in: qrtr cfg80211 binfmt_misc nls_iso8859_1 
input_leds dax_hmem 
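
To compare traces like the one above across runs, the symbol names can
be pulled out of the "Call trace:" section. The sketch below assumes the
dmesg line format seen in this report (timestamp in brackets, then
symbol+offset/size). The function name is hypothetical and the sample is
a shortened copy of the trace above.

```python
import re

# A frame line looks like "[  354.706886]  symbol+0x44/0x100 [module]".
FRAME = re.compile(
    r"^\[\s*\d+\.\d+\]\s+([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def call_trace_symbols(dmesg_text: str) -> list[str]:
    """Return the symbols of the first call trace, innermost frame first."""
    symbols = []
    in_trace = False
    for line in dmesg_text.splitlines():
        if not in_trace:
            in_trace = "Call trace:" in line
            continue
        match = FRAME.match(line)
        if match is None:
            break  # the first non-frame line ends the trace
        symbols.append(match.group(1))
    return symbols


sample = """\
[  354.704391] Call trace:
[  354.706886]  _raw_spin_lock_irqsave+0x44/0x100
[  354.711426]  try_to_wake_up+0x68/0x758
[  354.715254]  wake_up_process+0x24/0x50
[  354.719082]  aio_complete+0x1c4/0x2b8
[  354.755958]  nvme_irq+0x9c/0xb0 [nvme]
[  354.834661] Code: b9001041 d503201f 5281 52800022 (88e17c02)
"""
print(call_trace_symbols(sample))
```

In this trace the innermost frames run from aio_complete() through
wake_up_process() and try_to_wake_up() into _raw_spin_lock_irqsave().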

[Bug 2058557] [NEW] Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

2024-03-20 Thread Mitchell Augustin
Public bug reported:

A kernel oops and panic occurred during 22.04 SoC certification on
Gunyolk (Grace/Grace) with the 6.8 kernel, arm64+largemem variant.

Steps to reproduce:
Run (as root) the following commands:

add-apt-repository -y ppa:checkbox-dev/stable
apt-add-repository -y ppa:firmware-testing-team/ppa-fwts-stable
apt update
apt install -y canonical-certification-server
/usr/lib/checkbox-provider-base/bin/stress_ng_test.py disk --device dm-0 --base-time 240

stress_ng_test caused a kernel panic after about 5 minutes. I have
attached dmesg output from my reproducer to this report.

Initially, this was identified via a panic during the above test, which
was running as part of a run of certify-soc-22.04.

Attached is a tarball containing:

- apport.linux-image-6.8.0-11-generic-64k.kzsondji.apport: The output of 
`ubuntu-bug linux` on the machine (after reboot)
- reproduced-dmesg.202403201942: The dmesg output captured by kdump when I 
reproduced my original issue by running only the single stress_ng_test.py 
command above (not the entire cert suite)
- original-dmesg.txt: The dmesg output I captured when the stress_ng_test 
originally failed during the full cert suite run

** Affects: linux (Ubuntu)
 Importance: Undecided
 Status: New

** Attachment added: "dmesg and ubuntu-bug outputs"
   
https://bugs.launchpad.net/bugs/2058557/+attachment/5757643/+files/grace-panic.tar.xz
