Re: Linux kernel stability fixes for older SPARCs
Hi Gregor,

On Mon, 2024-09-23 at 00:20 +0200, Gregor Riepl wrote:
> > It's probably due to CONFIG_DEBUG_INFO. Try turning that off. Debian builds the kernel with debug
> > symbols enabled and then runs the strip command afterwards. This way both a debug and a standard
> > kernel package can be provided from the same build.
>
> Ah thanks, that did the trick.
>
> I built a 6.10.0 kernel using the Debian 6.10.7-sparc64-smp config, with module signing turned off.
> This kernel crashed instantly at boot, just after checking the rootfs. The fsck output was
> intermingled with the kernel log, but it did complete with a "done."
>
> Begin: Will now check root file system ... fsck from util-linux 2.38.1
> [ 68.420534] \|/ \|/
> [ 68.420534] "@'/ .. \`@"
> [ 68.420534] /_| \__/ |_\
> [ 68.420534] \__U_/
> [ 68.630552] mount(192): Kernel illegal instruction [#1]
> [ 68.715911] CPU: 0 PID: 192 Comm: mount Tainted: GE 6.10.0 #28
> [ 68.828841] TSTATE: 11001605 TPC: 10320158 TNPC: 1032015c Y: Tainted: GE
> [ 68.994452] TPC:
> [ 69.078729] g0: 0001 g1: 0001 g2: g3:
> [ 69.209968] g4: fff000100210ac00 g5: fff000103e442000 g6: fff000598000 g7:
> [ 69.341210] o0: 0200 o1: fff0029cc3e8 o2: 0001 o3: 103cc1b0
> [ 69.472457] o4: 1678 o5: b000 sp: fff00059a8d1 ret_pc: 00ea309c
> [ 69.608287] RPC: <__cond_resched+0x1c/0x60>
> [ 69.679947] l0: fff000d06416 l1: fff0029cc128 l2: 00010001 l3: 00ff
> [ 69.811203] l4: l5: 0005 l6: 00010001 l7: 0002
> [ 69.942448] i0: 00010001 i1: i2: i3:
> [ 70.070570] i4: i5: 0001 i6: fff00059a9a1 i7: 10325748
> [ 70.185151] I7:
> [ 70.255147] Call Trace:
> [ 70.287221] [<10325748>] ext4_ext_map_blocks+0x68/0x2060 [ext4]
> [ 70.374417] [<1033ee38>] ext4_map_blocks+0x98/0x6c0 [ext4]
> [ 70.455872] [<1033fd34>] ext4_iomap_begin+0x254/0x2e0 [ext4]
> [ 70.539621] [<007f494c>] iomap_iter+0x14c/0x420
> [ 70.608470] [<007fa5f0>] iomap_bmap+0x70/0xe0
> [ 70.674929] [<1033bd3c>] ext4_bmap+0x9c/0xe0 [ext4]
> [ 70.748366] [<00789404>] bmap+0x24/0x40
> [ 70.808051] [<102d7e54>] jbd2_journal_init_inode+0x14/0x120 [jbd2]
> [ 70.898675] [<10385c2c>] ext4_load_and_init_journal+0xec/0xd20 [ext4]
> [ 70.992740] [<1038bd78>] ext4_fill_super+0x2638/0x2aa0 [ext4]
> [ 71.077631] [<0076ae5c>] get_tree_bdev+0xfc/0x1c0
> [ 71.148774] [<10377c34>] ext4_get_tree+0x14/0x40 [ext4]
> [ 71.226799] [<0076b4c0>] vfs_get_tree+0x20/0x120
> [ 71.296792] [<00796f0c>] path_mount+0x40c/0xa60
> [ 71.365539] [<00797a94>] sys_mount+0xf4/0x1c0
> [ 71.431997] [<00406274>] linux_sparc_syscall+0x34/0x44
>
> I'll try a bisect with that config now, perhaps I can find something.

Could you try whether reverting the following commit fixes it?

commit 223b5e57d0d50b0c07b933350dbcde92018d3080
Author: Mike Rapoport (IBM)
Date:   Sun May 5 19:06:20 2024 +0300

    mm/execmem, arch: convert remaining overrides of module_alloc to execmem

Alternatively, please try the following change:

diff --git a/mm/execmem.c b/mm/execmem.c
index 0c4b36bc6d10..8232f9767c8c 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -17,7 +17,11 @@ static struct execmem_info default_execmem_info __ro_after_init;
 static void *__execmem_alloc(struct execmem_range *range, size_t size)
 {
 	bool kasan = range->flags & EXECMEM_KASAN_SHADOW;
+#ifndef __sparc__
 	unsigned long vm_flags = VM_FLUSH_RESET_PERMS;
+#else
+	unsigned long vm_flags = 0;
+#endif
 	gfp_t gfp_flags = GFP_KERNEL | __GFP_NOWARN;
 	unsigned long start = range->start;
 	unsigned long end = range->end;

Thanks,
Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: Linux kernel stability fixes for older SPARCs
On Mon, 2024-09-23 at 09:56 +0200, Ignacio Soriano Hernandez wrote:
> We are at a very stable working configuration with Radeon support on T2 SDE.

From what I have heard, Rene did not apply any local SPARC-specific patches to the kernel,
so the fact that your machine runs stable with T2 SDE is more likely a result of disabled
kernel features.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: Linux kernel stability fixes for older SPARCs
Good to hear, Gregor.

We are at a very stable working configuration with Radeon support on T2 SDE.
I think combining the efforts will be beneficial for the whole Linux/SPARC64 community.

Cheers
Iggi

Gregor Riepl wrote on Wed, 18 Sep 2024 at 20:17:
> Small update:
>
> I managed to build a 6.10 kernel with gcc 14 now.
> I'll do some more stress testing with it, but looks very stable so far.
>
> My kernel config is attached.
> It's much smaller than the Debian kernel config config-6.10.7-sparc64-smp,
> and now I'm quite convinced that it's really some options that cause the
> stability issues.
> Finding them is big challenge, though...
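Narrowing down which options differ between the small config and the Debian one is mostly a text-processing job. Below is a minimal sketch, assuming both files use the usual `CONFIG_FOO=y` / `# CONFIG_FOO is not set` layout. The function names are made up for illustration and are not part of any kernel tooling; the kernel tree's own scripts/diffconfig covers similar ground.

```shell
#!/bin/sh
# Sketch only: list options that are enabled (=y or =m) in one kernel
# config but not in another. Function names are hypothetical.

# Print the sorted names of all options set to y or m in a config file.
enabled_options() {
    grep -E '^CONFIG_[A-Za-z0-9_]+=[ym]$' "$1" | cut -d= -f1 | LC_ALL=C sort -u
}

# Print the options that are enabled in $1 but not enabled in $2.
only_enabled_in_first() {
    first=$(mktemp) || return 1
    second=$(mktemp) || { rm -f "$first"; return 1; }
    enabled_options "$1" > "$first"
    enabled_options "$2" > "$second"
    # comm -23 keeps the lines that appear only in the first sorted file.
    LC_ALL=C comm -23 "$first" "$second"
    rm -f "$first" "$second"
}

# Example call (file names taken from the thread, paths are assumptions):
#   only_enabled_in_first config-6.10.7-sparc64-smp .config
```

The resulting list is a set of candidates that can be turned off in batches in the crashing config, which is usually quicker than toggling options one at a time.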
Re: Linux kernel stability fixes for older SPARCs
Hi Gregor,

On Mon, 2024-09-23 at 00:20 +0200, Gregor Riepl wrote:
> > It's probably due to CONFIG_DEBUG_INFO. Try turning that off. Debian builds the kernel with debug
> > symbols enabled and then runs the strip command afterwards. This way both a debug and a standard
> > kernel package can be provided from the same build.
>
> Ah thanks, that did the trick.
>
> I built a 6.10.0 kernel using the Debian 6.10.7-sparc64-smp config, with module signing turned off.
> This kernel crashed instantly at boot, just after checking the rootfs. The fsck output was
> intermingled with the kernel log, but it did complete with a "done."
>
> Begin: Will now check root file system ... fsck from util-linux 2.38.1
> [ 68.420534] \|/ \|/
> [ 68.420534] "@'/ .. \`@"
> [ 68.420534] /_| \__/ |_\
> [ 68.420534] \__U_/
> [ 68.630552] mount(192): Kernel illegal instruction [#1]
> [ 68.715911] CPU: 0 PID: 192 Comm: mount Tainted: GE 6.10.0 #28
> [ 68.828841] TSTATE: 11001605 TPC: 10320158 TNPC: 1032015c Y: Tainted: GE
> [ 68.994452] TPC:
> [ 69.078729] g0: 0001 g1: 0001 g2: g3:
> [ 69.209968] g4: fff000100210ac00 g5: fff000103e442000 g6: fff000598000 g7:
> [ 69.341210] o0: 0200 o1: fff0029cc3e8 o2: 0001 o3: 103cc1b0
> [ 69.472457] o4: 1678 o5: b000 sp: fff00059a8d1 ret_pc: 00ea309c
> [ 69.608287] RPC: <__cond_resched+0x1c/0x60>
> [ 69.679947] l0: fff000d06416 l1: fff0029cc128 l2: 00010001 l3: 00ff
> [ 69.811203] l4: l5: 0005 l6: 00010001 l7: 0002
> [ 69.942448] i0: 00010001 i1: i2: i3:
> [ 70.070570] i4: i5: 0001 i6: fff00059a9a1 i7: 10325748
> [ 70.185151] I7:
> [ 70.255147] Call Trace:
> [ 70.287221] [<10325748>] ext4_ext_map_blocks+0x68/0x2060 [ext4]
> [ 70.374417] [<1033ee38>] ext4_map_blocks+0x98/0x6c0 [ext4]
> [ 70.455872] [<1033fd34>] ext4_iomap_begin+0x254/0x2e0 [ext4]
> [ 70.539621] [<007f494c>] iomap_iter+0x14c/0x420
> [ 70.608470] [<007fa5f0>] iomap_bmap+0x70/0xe0
> [ 70.674929] [<1033bd3c>] ext4_bmap+0x9c/0xe0 [ext4]
> [ 70.748366] [<00789404>] bmap+0x24/0x40
> [ 70.808051] [<102d7e54>] jbd2_journal_init_inode+0x14/0x120 [jbd2]
> [ 70.898675] [<10385c2c>] ext4_load_and_init_journal+0xec/0xd20 [ext4]
> [ 70.992740] [<1038bd78>] ext4_fill_super+0x2638/0x2aa0 [ext4]
> [ 71.077631] [<0076ae5c>] get_tree_bdev+0xfc/0x1c0
> [ 71.148774] [<10377c34>] ext4_get_tree+0x14/0x40 [ext4]
> [ 71.226799] [<0076b4c0>] vfs_get_tree+0x20/0x120
> [ 71.296792] [<00796f0c>] path_mount+0x40c/0xa60
> [ 71.365539] [<00797a94>] sys_mount+0xf4/0x1c0
> [ 71.431997] [<00406274>] linux_sparc_syscall+0x34/0x44
>
> I'll try a bisect with that config now, perhaps I can find something.

Great! Now you finally have something to work with. Crossing my fingers that you'll find something.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: Linux kernel stability fixes for older SPARCs
> > It looks very much like it isn't specifically a kernel bug at all, but either something
> > wrong with the Debian kernel config, or with newer gcc versions.
>
> I still think it's a kernel bug.

Very well. Tracing through all possible kernel config parameters is probably not feasible anyway.

> It's probably due to CONFIG_DEBUG_INFO. Try turning that off. Debian builds the kernel with debug
> symbols enabled and then runs the strip command afterwards. This way both a debug and a standard
> kernel package can be provided from the same build.

Ah thanks, that did the trick.

I built a 6.10.0 kernel using the Debian 6.10.7-sparc64-smp config, with module signing turned off.
This kernel crashed instantly at boot, just after checking the rootfs. The fsck output was
intermingled with the kernel log, but it did complete with a "done."

Begin: Will now check root file system ... fsck from util-linux 2.38.1
[ 68.420534] \|/ \|/
[ 68.420534] "@'/ .. \`@"
[ 68.420534] /_| \__/ |_\
[ 68.420534] \__U_/
[ 68.630552] mount(192): Kernel illegal instruction [#1]
[ 68.715911] CPU: 0 PID: 192 Comm: mount Tainted: GE 6.10.0 #28
[ 68.828841] TSTATE: 11001605 TPC: 10320158 TNPC: 1032015c Y: Tainted: GE
[ 68.994452] TPC:
[ 69.078729] g0: 0001 g1: 0001 g2: g3:
[ 69.209968] g4: fff000100210ac00 g5: fff000103e442000 g6: fff000598000 g7:
[ 69.341210] o0: 0200 o1: fff0029cc3e8 o2: 0001 o3: 103cc1b0
[ 69.472457] o4: 1678 o5: b000 sp: fff00059a8d1 ret_pc: 00ea309c
[ 69.608287] RPC: <__cond_resched+0x1c/0x60>
[ 69.679947] l0: fff000d06416 l1: fff0029cc128 l2: 00010001 l3: 00ff
[ 69.811203] l4: l5: 0005 l6: 00010001 l7: 0002
[ 69.942448] i0: 00010001 i1: i2: i3:
[ 70.070570] i4: i5: 0001 i6: fff00059a9a1 i7: 10325748
[ 70.185151] I7:
[ 70.255147] Call Trace:
[ 70.287221] [<10325748>] ext4_ext_map_blocks+0x68/0x2060 [ext4]
[ 70.374417] [<1033ee38>] ext4_map_blocks+0x98/0x6c0 [ext4]
[ 70.455872] [<1033fd34>] ext4_iomap_begin+0x254/0x2e0 [ext4]
[ 70.539621] [<007f494c>] iomap_iter+0x14c/0x420
[ 70.608470] [<007fa5f0>] iomap_bmap+0x70/0xe0
[ 70.674929] [<1033bd3c>] ext4_bmap+0x9c/0xe0 [ext4]
[ 70.748366] [<00789404>] bmap+0x24/0x40
[ 70.808051] [<102d7e54>] jbd2_journal_init_inode+0x14/0x120 [jbd2]
[ 70.898675] [<10385c2c>] ext4_load_and_init_journal+0xec/0xd20 [ext4]
[ 70.992740] [<1038bd78>] ext4_fill_super+0x2638/0x2aa0 [ext4]
[ 71.077631] [<0076ae5c>] get_tree_bdev+0xfc/0x1c0
[ 71.148774] [<10377c34>] ext4_get_tree+0x14/0x40 [ext4]
[ 71.226799] [<0076b4c0>] vfs_get_tree+0x20/0x120
[ 71.296792] [<00796f0c>] path_mount+0x40c/0xa60
[ 71.365539] [<00797a94>] sys_mount+0xf4/0x1c0
[ 71.431997] [<00406274>] linux_sparc_syscall+0x34/0x44

I'll try a bisect with that config now, perhaps I can find something.
Re: Linux kernel stability fixes for older SPARCs
On Wed, 2024-09-18 at 18:17 +0200, Gregor Riepl wrote:
> My first attempt at bisecting ran into lots of compilation issues with the default config of each
> version and gcc 14.
> All the 4.x and 5.x kernels fail with the following errors (at least, some versions have more):
>
> arch/sparc/kernel/mdesc.c: In function 'mdesc_node_by_name':
> arch/sparc/kernel/mdesc.c:646:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
>   646 | if (!strcmp(names + ep[ret].name_offset, name))
>       | ^
> arch/sparc/kernel/mdesc.c:78:33: note: at offset [32, 8589934606] into source object 'mdesc' of size 16
>    78 | struct mdesc_hdr mdesc;
>       | ^
> ...
> In function 'kernel_lds_init',
>     inlined from 'report_memory' at arch/sparc/mm/init_64.c:3112:2:
> arch/sparc/mm/init_64.c:3102:31: error: array subscript -1 is outside array bounds of 'char[]' [-Werror=array-bounds=]
>  3102 | data_resource.end = compute_kern_paddr(_edata - 1);
>       | ^~
> ./include/asm-generic/sections.h: In function 'report_memory':
> ./include/asm-generic/sections.h:36:32: note: at offset -1 into object '_edata' of size [0, 9223372036854775807]
>    36 | extern char _data[], _sdata[], _edata[];
>       | ^~
> ...

Yeah, a lot of these warnings have since been fixed in the kernel; they are handled as errors if
CONFIG_WERROR is set.

> Next issue: The default kernel config lacks some essential drivers to make my system bootable.
> For my Fire V215, at least CONFIG_FUSIONMPT and CONFIG_CGROUPS are needed, plus a few other
> things. systemd requires cgroups v2 support theses days.

The default configs for 32-bit and 64-bit SPARC could probably see an update here.

> I started off with a default config in the first bisect step (corresponding with 5.14), added
> the required options, and then did a make oldconfig in each subsequent step, answering all
> questions with the default.

"make localmodconfig" is probably easier in this case.

> Building with make bindeb-pkg produces an almost usable kernel package. For some reason,
> grub-ieee1275 requires an unpacked kernel, so the installed vmlinuz needed to be gunzipped
> afterwards.

That's not an arbitrary reason, but simply a requirement for GRUB on SPARC due to size limitations.
It's documented in the GRUB manual.

> Now for the actual testing... triggering a panic/oops reliably was difficult. The Debian 6.10
> kernel usually crashes relatively quickly on disk I/O, and enabling swap accelerates the effect.
> bonnie++ should therefore make for a good stress test.

I haven't found a good reproducer yet either, unfortunately.

> I don't have the exact commit IDs of each bisection step, but it was (roughly) 5.14-rc6,
> 6.6-rc7, 6.8-rc3, 6.9, 6.10.
>
> There were a few odd non-critical issues, such as this I/O error with 5.14 (but nothing in dmesg):
>
> $ /usr/sbin/bonnie++
> Writing a byte at a time...done
> Writing intelligently...done
> Rewriting...Can't write block.: Unknown error 2560
> Bonnie: drastic I/O error (re write(2)): Unknown error 2560

Just use "git bisect skip" in this case to skip unrelated regressions.
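
When a bisect is driven by "git bisect run", the same skipping can be expressed through the exit code of the step script: 0 marks the commit good, 125 asks git to skip it, and other codes from 1 to 127 mark it bad. Below is a minimal sketch of such a step, assuming the build and test are available as shell commands. The function name and the example commands are placeholders, and on real SPARC hardware the test step would still need some way to boot the kernel and report back.

```shell
#!/bin/sh
# Sketch only: map "build failed" to skip (125) and "test failed" to bad (1),
# following the exit-code convention of "git bisect run".
# bisect_step and the example commands below are hypothetical.

bisect_step() {
    build_cmd=$1
    test_cmd=$2

    # A commit that does not build says nothing about the crash: skip it.
    sh -c "$build_cmd" || return 125

    # The build worked, so the test result decides between good and bad.
    if sh -c "$test_cmd"; then
        return 0
    fi
    return 1
}

# Example for a script passed to "git bisect run" (commands are assumptions):
#   bisect_step 'make -j4 vmlinux' './boot-and-stress.sh'
#   exit $?
```

Keeping build failures and test failures apart this way avoids blaming a commit for an unrelated compile error.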

> 6.2 produces this warning at boot:
>
> [ +21.090317] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [ +1.422401] rcu: 0-...!: (1 GPs behind) idle=a29c/0/0x1 softirq=18/19 fqs=44
> [ +0.093960] (detected by 0, t=2246 jiffies, g=-1175, q=989 ncpus=2)
> [ +0.083646] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-rc7+ #18
> [ +0.083641] TSTATE: 004411001605 TPC: 0042beac TNPC: 0042beb0 Y: Not tainted
> [ +0.129479] TPC:
> [ +0.053848] g0: 004209d0 g1: 015282c0 g2: 015105c8 g3: 0001
> [ +0.114585] g4: fff000390ba0 g5: fff27e2f g6: fff000398000 g7: 173aa294
> [ +0.114582] o0: fff000390ba0 o1: 0001 o2: 0130ae78 o3: 015105c8
> [ +0.114580] o4: 015280c0 o5: 0130b580 sp: fff00039b3d1 ret_pc: 0042bea0
> [ +0.119164] RPC:
> [ +0.053850] l0: 01407f20 l1: 00022c05 l2: l3: 0130b538
> [ +0.114585] l4: 0130b400 l5: 0040 l6: l7: 01408140
> [ +0.114581] i0: 173aa299 i1: fff27f814990 i2: 0001 i3: 0001
> [ +0.114580] i4: fff27f814990 i5: 01524990 i6: fff00039b481 i7: 00b22f68
> [ +0.114582] I7:
> [ +0.058433] Call Trace:
> [ +0.032082] [<00b22f68>] default_idle_call+0x48/0x100
> [ +0.075624] [<004adc28>] do_idle+0x108/0x180
> [ +0.065311] [<004adf34>] cpu_startup_entry+0x14/0x40
> [ +0.074477] [<0
Re: Linux kernel stability fixes for older SPARCs
Small update:

I managed to build a 6.10 kernel with gcc 14 now.
I'll do some more stress testing with it, but it looks very stable so far.

My kernel config is attached.
It's much smaller than the Debian kernel config config-6.10.7-sparc64-smp, and now I'm quite
convinced that it's really some options that cause the stability issues.
Finding them is a big challenge, though...

#
# Automatically generated file; DO NOT EDIT.
# Linux/sparc 6.10.0 Kernel Configuration
#
CONFIG_CC_VERSION_TEXT="sparc64-linux-gcc (GCC) 14.2.0"
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=140200
CONFIG_CLANG_VERSION=0
CONFIG_AS_IS_GNU=y
CONFIG_AS_VERSION=24200
CONFIG_LD_IS_BFD=y
CONFIG_LD_VERSION=24200
CONFIG_LLD_VERSION=0
CONFIG_CC_HAS_ASM_GOTO_OUTPUT=y
CONFIG_CC_HAS_ASM_GOTO_TIED_OUTPUT=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_CC_HAS_NO_PROFILE_FN_ATTR=y
CONFIG_PAHOLE_VERSION=0
CONFIG_IRQ_WORK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
# CONFIG_WERROR is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_BUILD_SALT=""
CONFIG_DEFAULT_INIT=""
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_WATCH_QUEUE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_USELIB=y
# CONFIG_AUDIT is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
# end of IRQ subsystem

CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_CONTEXT_TRACKING=y
CONFIG_CONTEXT_TRACKING_IDLE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem

CONFIG_BPF=y
CONFIG_HAVE_EBPF_JIT=y

#
# BPF subsystem
#
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
# CONFIG_BPF_JIT_ALWAYS_ON is not set
CONFIG_BPF_UNPRIV_DEFAULT_OFF=y
CONFIG_USERMODE_DRIVER=y
CONFIG_BPF_PRELOAD=y
CONFIG_BPF_PRELOAD_UMD=m
# end of BPF subsystem

CONFIG_PREEMPT_VOLUNTARY_BUILD=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
# CONFIG_SCHED_CORE is not set

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_PSI is not set
# end of CPU/Task time and stats accounting

CONFIG_CPU_ISOLATION=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_TREE_SRCU=y
CONFIG_NEED_SRCU_NMI_SAFE=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_NEED_TASKS_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# end of RCU Subsystem

# CONFIG_IKCONFIG is not set
# CONFIG_IKHEADERS is not set
CONFIG_LOG_BUF_SHIFT=18
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
# CONFIG_PRINTK_INDEX is not set

#
# Scheduler features
#
# end of Scheduler features

CONFIG_CC_HAS_INT128=y
CONFIG_CC_IMPLICIT_FALLTHROUGH="-Wimplicit-fallthrough=5"
CONFIG_GCC10_NO_ARRAY_BOUNDS=y
CONFIG_CC_NO_ARRAY_BOUNDS=y
CONFIG_GCC_NO_STRINGOP_OVERFLOW=y
CONFIG_CC_NO_STRINGOP_OVERFLOW=y
CONFIG_SLAB_OBJ_EXT=y
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
# CONFIG_CGROUP_FAVOR_DYNMODS is not set
CONFIG_MEMCG=y
CONFIG_MEMCG_KMEM=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_BPF=y
CONFIG_CGROUP_MISC=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
# CONFIG_CHECKPOINT_RESTORE is not set
CONFIG_SCHED_AUTOGROUP=y
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
CONFIG_RD_LZ4=y
CONFIG_RD_ZSTD=y
# CONFIG_BOOT_CONFIG is not set
CONFIG_INITRAMFS_PRESERVE_MTIME=y
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_HAVE_UID16=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
# CONFIG_EXPERT is not set
CONFIG_UID16=y
CONFIG_MULTIUSER=y
CONFIG_SGETMASK_SYSCALL=y
CONFIG_SYSFS_SYSCALL=y
CONFIG_FHANDLE=y
CONFIG_POSIX_TIMERS=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_FUTEX=y
CONFIG_FUTEX_PI=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_IO_URING=y
CONFIG_ADVISE_SYSCALLS=y
CONFIG_MEMBARRIER=y
CONFIG_CACHESTAT_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_SELFTEST is not set
# CONFIG_KALLSYMS_ALL is not set
CONFIG_KALLSYMS_BASE_RELATIVE=y
CONFIG_HAVE_PERF_
Re: Linux kernel stability fixes for older SPARCs
So, here's a status report, sorry for the long mail (summary at the end):

I did all tests on my Sun Fire V215 for now, because the machine is a bit faster than the Ultra 10
and the ALOM makes remote testing a little bit easier. It also has two CPUs, helping to uncover
SMP-related issues.

Kernel 4.19 indeed seems to run more stably than later Debian kernels. That at least gives me a
stable system to fix things when they break. Thanks for that hint.

My first attempt at bisecting ran into lots of compilation issues with the default config of each
version and gcc 14.
All the 4.x and 5.x kernels fail with the following errors (at least, some versions have more):

arch/sparc/kernel/mdesc.c: In function 'mdesc_node_by_name':
arch/sparc/kernel/mdesc.c:646:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
  646 | if (!strcmp(names + ep[ret].name_offset, name))
      | ^
arch/sparc/kernel/mdesc.c:78:33: note: at offset [32, 8589934606] into source object 'mdesc' of size 16
   78 | struct mdesc_hdr mdesc;
      | ^
...
In function 'kernel_lds_init',
    inlined from 'report_memory' at arch/sparc/mm/init_64.c:3112:2:
arch/sparc/mm/init_64.c:3102:31: error: array subscript -1 is outside array bounds of 'char[]' [-Werror=array-bounds=]
 3102 | data_resource.end = compute_kern_paddr(_edata - 1);
      | ^~
./include/asm-generic/sections.h: In function 'report_memory':
./include/asm-generic/sections.h:36:32: note: at offset -1 into object '_edata' of size [0, 9223372036854775807]
   36 | extern char _data[], _sdata[], _edata[];
      | ^~
...

I then tried gcc 8.1, which roughly matches 4.19 in terms of release date. Older kernels compile
well with this version, but 6.10 failed with these errors. I couldn't reproduce this error later
on, so it may have been a fluke:

`.exit.text' referenced in section `__jump_table' of fs/fuse/inode.o: defined in discarded section `.exit.text' of fs/fuse/inode.o
`.exit.text' referenced in section `__jump_table' of fs/fuse/inode.o: defined in discarded section `.exit.text' of fs/fuse/inode.o

I decided to ignore the error for now and start bisecting from 4.19 to 6.10 with gcc 8.1.

Next issue: The default kernel config lacks some essential drivers to make my system bootable.
For my Fire V215, at least CONFIG_FUSIONMPT and CONFIG_CGROUPS are needed, plus a few other
things. systemd requires cgroups v2 support these days.

I started off with a default config in the first bisect step (corresponding with 5.14), added the
required options, and then did a make oldconfig in each subsequent step, answering all questions
with the default.

Building with make bindeb-pkg produces an almost usable kernel package. For some reason,
grub-ieee1275 requires an unpacked kernel, so the installed vmlinuz needed to be gunzipped
afterwards.

Now for the actual testing... triggering a panic/oops reliably was difficult. The Debian 6.10
kernel usually crashes relatively quickly on disk I/O, and enabling swap accelerates the effect.
bonnie++ should therefore make for a good stress test.

I don't have the exact commit IDs of each bisection step, but it was (roughly) 5.14-rc6, 6.6-rc7,
6.8-rc3, 6.9, 6.10.

There were a few odd non-critical issues, such as this I/O error with 5.14 (but nothing in dmesg):

$ /usr/sbin/bonnie++
Writing a byte at a time...done
Writing intelligently...done
Rewriting...Can't write block.: Unknown error 2560
Bonnie: drastic I/O error (re write(2)): Unknown error 2560

6.2 produces this warning at boot:

[ +21.090317] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ +1.422401] rcu: 0-...!: (1 GPs behind) idle=a29c/0/0x1 softirq=18/19 fqs=44
[ +0.093960] (detected by 0, t=2246 jiffies, g=-1175, q=989 ncpus=2)
[ +0.083646] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-rc7+ #18
[ +0.083641] TSTATE: 004411001605 TPC: 0042beac TNPC: 0042beb0 Y: Not tainted
[ +0.129479] TPC:
[ +0.053848] g0: 004209d0 g1: 015282c0 g2: 015105c8 g3: 0001
[ +0.114585] g4: fff000390ba0 g5: fff27e2f g6: fff000398000 g7: 173aa294
[ +0.114582] o0: fff000390ba0 o1: 0001 o2: 0130ae78 o3: 015105c8
[ +0.114580] o4: 015280c0 o5: 0130b580 sp: fff00039b3d1 ret_pc: 0042bea0
[ +0.119164] RPC:
[ +0.053850] l0: 01407f20 l1: 00022c05 l2: l3: 0130b538
[ +0.114585] l4: 0130b400 l5: 0040 l6: l7: 01408140
[ +0.114581] i0: 173aa299 i1: fff27f814990 i2: 0001 i3: 0001
[ +0.114580] i4: fff27f814990 i5: 01524990 i6: fff000
Re: Linux kernel stability fixes for older SPARCs
Hi Gregor,

On Wed, 2024-09-04 at 01:22 +0200, Gregor Riepl wrote:
> > It's actually pretty simple these days as the kernel.org mirrors provide binary
> > distributions of freestanding toolchains for all major supported architectures
> > of the Linux kernel.
> >
> > To set up on any x86_64 machine, do the following:
> >
> > # wget https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/14.2.0/x86_64-gcc-14.2.0-nolibc-sparc64-linux.tar.gz
> > # tar xf x86_64-gcc-14.2.0-nolibc-sparc64-linux.tar.gz
> > # export PATH=$PATH:$PWD/gcc-14.2.0-nolibc/sparc64-linux/bin/
> > # git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> > # cd linux
> > # export ARCH=sparc
> > # export CROSS_COMPILE=sparc64-linux-
> > # make sparc64_defconfig
> > # make -j
> >
> > The cross-compiled kernel will be available as "vmlinux".
>
> Very good, thanks!

Thanks for looking into it! I'm especially interested in finding a proper reproducer which would
make the bisecting process much easier. So far, the crashes seem to be rather random although they
mainly occur with newer kernels.

FWIW, I found a very handy patch yesterday which could help debugging these crashes once it's been
merged into the upstream kernel [1]. What it does is that it dumps the back of the stack after a
stack corruption has occurred, which should in theory help find what part of the kernel is
responsible for the stack corruption.

It looks like this particular crash we have been seeing on the older SPARCs was always due to
stack corruption, which could mean that it's related to a driver or arch-specific code that is
used on the older SPARCs but not on the newer machines.

Adrian

> [1] https://lore.kernel.org/lkml/20231219032254.96685-1-feng.t...@intel.com/

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: Linux kernel stability fixes for older SPARCs
> > I may have some time to do test runs next week.
> > Could you give me some quick starters for setting up a kernel cross build env on
> > an amd64 machine, or maybe access to a Sun box I could use?
>
> It's actually pretty simple these days as the kernel.org mirrors provide binary
> distributions of freestanding toolchains for all major supported architectures
> of the Linux kernel.
>
> To set up on any x86_64 machine, do the following:
>
> # wget https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/14.2.0/x86_64-gcc-14.2.0-nolibc-sparc64-linux.tar.gz
> # tar xf x86_64-gcc-14.2.0-nolibc-sparc64-linux.tar.gz
> # export PATH=$PATH:$PWD/gcc-14.2.0-nolibc/sparc64-linux/bin/
> # git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> # cd linux
> # export ARCH=sparc
> # export CROSS_COMPILE=sparc64-linux-
> # make sparc64_defconfig
> # make -j
>
> The cross-compiled kernel will be available as "vmlinux".

Very good, thanks!
Re: Linux kernel stability fixes for older SPARCs
Hi Gregor,

On Tue, 2024-09-03 at 19:19 +0200, Gregor Riepl wrote:
> > > If you have issues to bi-sect just let us know for any arch. Given T2’s cross-compile
> > > support and I have most hardware in my museum now, I can usually bisect issues
> > > within a day or two.
> >
> > I don't have issues with bisecting, I'm just rather time-constrained at the moment, so
> > I'm always happy when someone else can step in and help. Would be great to get this issue
> > fixed upstream.
>
> My Ultra 10 and Fire V215 are desperately waiting for a more stable kernel.
> I actually wanted to offer help with bisecting, but kept back due to a lack
> of time and also suitable build system (compiling kernels is so time-consuming).

Any help is welcome ;-).

> I may have some time to do test runs next week.
> Could you give me some quick starters for setting up a kernel cross build env on
> an amd64 machine, or maybe access to a Sun box I could use?

It's actually pretty simple these days as the kernel.org mirrors provide binary
distributions of freestanding toolchains for all major supported architectures
of the Linux kernel.

To set up on any x86_64 machine, do the following:

# wget https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/14.2.0/x86_64-gcc-14.2.0-nolibc-sparc64-linux.tar.gz
# tar xf x86_64-gcc-14.2.0-nolibc-sparc64-linux.tar.gz
# export PATH=$PATH:$PWD/gcc-14.2.0-nolibc/sparc64-linux/bin/
# git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
# cd linux
# export ARCH=sparc
# export CROSS_COMPILE=sparc64-linux-
# make sparc64_defconfig
# make -j

The cross-compiled kernel will be available as "vmlinux".

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
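
For repeated builds, the steps above can be wrapped in two small helpers. This is only a sketch: the URL pattern is generalised from the single 14.2.0 sparc64 URL quoted in this thread, so whether other versions or targets follow it is an assumption, and both function names are made up. The job-count helper is there because a bare "make -j" places no limit on the number of parallel jobs, which can exhaust memory on smaller build hosts.

```shell
#!/bin/sh
# Sketch only: helpers around the cross-build steps from the mail.
# crosstool_url and make_jobs are hypothetical names.

# Print the download URL of a kernel.org crosstool tarball for an x86_64 host.
# The pattern is inferred from one known URL; treat it as an assumption.
crosstool_url() {
    gcc_version=$1   # for example 14.2.0
    target=$2        # for example sparc64-linux
    printf '%s/%s/x86_64-gcc-%s-nolibc-%s.tar.gz\n' \
        'https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64' \
        "$gcc_version" "$gcc_version" "$target"
}

# Print a bounded job count for make: the CPU count if nproc is
# available, otherwise 1.
make_jobs() {
    nproc 2>/dev/null || echo 1
}

# Example (commented out so that sourcing this file has no side effects):
#   wget "$(crosstool_url 14.2.0 sparc64-linux)"
#   make ARCH=sparc CROSS_COMPILE=sparc64-linux- sparc64_defconfig
#   make ARCH=sparc CROSS_COMPILE=sparc64-linux- -j"$(make_jobs)"
```

Passing ARCH and CROSS_COMPILE on the make command line is equivalent to exporting them first and keeps the shell environment clean between builds for different architectures.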
Re: Linux kernel stability fixes for older SPARCs
> > If you have issues to bi-sect just let us know for any arch. Given T2’s cross-compile
> > support and I have most hardware in my museum now, I can usually bisect issues
> > within a day or two.
>
> I don't have issues with bisecting, I'm just rather time-constrained at the moment, so
> I'm always happy when someone else can step in and help. Would be great to get this issue
> fixed upstream.

My Ultra 10 and Fire V215 are desperately waiting for a more stable kernel.
I actually wanted to offer help with bisecting, but held back due to a lack of time and of a
suitable build system (compiling kernels is so time-consuming).

I may have some time to do test runs next week.
Could you give me some quick starters for setting up a kernel cross build env on an amd64 machine,
or maybe access to a Sun box I could use?
Re: Linux kernel stability fixes for older SPARCs
Hello Rene,

On Tue, 2024-09-03 at 11:09 +0200, René Rebe wrote:
> > according to these posts [1][2] by Iggi, you figured out the stability problem
>
> No, we are just sometimes lucky it run that long stable. I was only made aware
> recently that sun4u was not 100% and my fasted UltraSPARC until some year ago
> was only a 360MHz Ultra5 until I was donated a Sun Blade 1000 recently. I see
> some MM corruption that I wanted to hunt next.

Hmm, ok. I was under the impression that you made some changes that made the kernel on Iggi's
machine stable. Currently, the kernel crashes randomly on older SPARCs such as reported by Iggi:

> https://x.com/Iggi76123640/status/1827658841581896152

> > with newer kernels on older SPARC machines. There has been a regression on older
> > SPARCs since around kernel 4.19.x which I haven't gotten around to bisecting yet.
>
> Happy to bi-sect. I guess you mean random memory corruption I see or anything else?

Not sure what the underlying issue is, but the kernel just crashes completely.

> If you have issues to bi-sect just let us know for any arch. Given T2’s cross-compile
> support and I have most hardware in my museum now, I can usually bisect issues
> within a day or two.

I don't have issues with bisecting, I'm just rather time-constrained at the moment, so
I'm always happy when someone else can step in and help. Would be great to get this issue
fixed upstream.

> > If you've found and fixed the bug in question, it would be great if you could share
> > your fix with the community and maybe whip up a kernel patch to fix the bug upstream.
>
> Of course - all patches are always nicely sorted in our public and nicely readable
> SVN tree in any case.
>
> https://t2linux.com

Is there a web view available? I'm not really a big fan of SVN, to be honest.

> > Newer SPARCs are not affected by this bug, although there are other issues.
>
> You mean sun4v? I found a cheap T4-1 some month ago, and T2/Linux appears
> to run stable on that. Any list of issues w/ sun4v I should be aware of?

Linux runs mostly stable on sun4v, but there are filesystem corruption issues when you run Linux
inside an LDOM on Solaris 11.3 and 11.4, even with the latest SRU of Solaris.

These happen rarely, but they do occur and they are quite annoying, as they mandate rebooting the
LDOM: the root filesystem is then mounted read-only and the filesystems have errors afterwards.

It seems to be a bug in the LDOM vdisk driver (drivers/block/sunvdc.c).

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913