Re: Linux kernel stability fixes for older SPARCs

2024-11-01 Thread John Paul Adrian Glaubitz
Hi Gregor,

On Mon, 2024-09-23 at 00:20 +0200, Gregor Riepl wrote:
> > 
> > It's probably due to CONFIG_DEBUG_INFO. Try turning that off. Debian builds 
> > the kernel with debug
> > symbols enabled and then runs the strip command afterwards. This way both a 
> > debug and a standard
> > kernel package can be provided from the same build.
> 
> Ah thanks, that did the trick.
> 
> I built a 6.10.0 kernel using the Debian 6.10.7-sparc64-smp config, with 
> module signing turned off.
> This kernel crashed instantly at boot, just after checking the rootfs. The 
> fsck output was intermingled with the kernel log, but it did complete with a 
> "done."
> 
> Begin: Will now check root file system ... fsck from util-linux 2.38.1
> [   68.420534]   \|/  \|/
> [   68.420534]   "@'/ .. \`@"
> [   68.420534]   /_| \__/ |_\
> [   68.420534]  \__U_/
> [   68.630552] mount(192): Kernel illegal instruction [#1]
> [   68.715911] CPU: 0 PID: 192 Comm: mount Tainted: GE  
> 6.10.0 #28
> [   68.828841] TSTATE: 11001605 TPC: 10320158 TNPC: 
> 1032015c Y: Tainted: GE
> [   68.994452] TPC: 
> [   69.078729] g0: 0001 g1: 0001 g2:  
> g3: 
> [   69.209968] g4: fff000100210ac00 g5: fff000103e442000 g6: fff000598000 
> g7: 
> [   69.341210] o0: 0200 o1: fff0029cc3e8 o2: 0001 
> o3: 103cc1b0
> [   69.472457] o4: 1678 o5: b000 sp: fff00059a8d1 
> ret_pc: 00ea309c
> [   69.608287] RPC: <__cond_resched+0x1c/0x60>
> [   69.679947] l0: fff000d06416 l1: fff0029cc128 l2: 00010001 
> l3: 00ff
> [   69.811203] l4:  l5: 0005 l6: 00010001 
> l7: 0002
> [   69.942448] i0: 00010001 i1:  i2:  
> i3: 
> [   70.070570] i4:  i5: 0001 i6: fff00059a9a1 
> i7: 10325748
> [   70.185151] I7: 
> [   70.255147] Call Trace:
> [   70.287221] [<10325748>] ext4_ext_map_blocks+0x68/0x2060 [ext4]
> [   70.374417] [<1033ee38>] ext4_map_blocks+0x98/0x6c0 [ext4]
> [   70.455872] [<1033fd34>] ext4_iomap_begin+0x254/0x2e0 [ext4]
> [   70.539621] [<007f494c>] iomap_iter+0x14c/0x420
> [   70.608470] [<007fa5f0>] iomap_bmap+0x70/0xe0
> [   70.674929] [<1033bd3c>] ext4_bmap+0x9c/0xe0 [ext4]
> [   70.748366] [<00789404>] bmap+0x24/0x40
> [   70.808051] [<102d7e54>] jbd2_journal_init_inode+0x14/0x120 [jbd2]
> [   70.898675] [<10385c2c>] ext4_load_and_init_journal+0xec/0xd20 
> [ext4]
> [   70.992740] [<1038bd78>] ext4_fill_super+0x2638/0x2aa0 [ext4]
> [   71.077631] [<0076ae5c>] get_tree_bdev+0xfc/0x1c0
> [   71.148774] [<10377c34>] ext4_get_tree+0x14/0x40 [ext4]
> [   71.226799] [<0076b4c0>] vfs_get_tree+0x20/0x120
> [   71.296792] [<00796f0c>] path_mount+0x40c/0xa60
> [   71.365539] [<00797a94>] sys_mount+0xf4/0x1c0
> [   71.431997] [<00406274>] linux_sparc_syscall+0x34/0x44
> 
> I'll try a bisect with that config now, perhaps I can find something.

Could you try whether reverting the following commit fixes it?

commit 223b5e57d0d50b0c07b933350dbcde92018d3080
Author: Mike Rapoport (IBM) 
Date:   Sun May 5 19:06:20 2024 +0300

 mm/execmem, arch: convert remaining overrides of module_alloc to execmem

Alternatively, please try the following change:

diff --git a/mm/execmem.c b/mm/execmem.c
index 0c4b36bc6d10..8232f9767c8c 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -17,7 +17,11 @@ static struct execmem_info default_execmem_info __ro_after_init;
 static void *__execmem_alloc(struct execmem_range *range, size_t size)
 {
 	bool kasan = range->flags & EXECMEM_KASAN_SHADOW;
+#ifndef __sparc__
 	unsigned long vm_flags  = VM_FLUSH_RESET_PERMS;
+#else
+	unsigned long vm_flags  = 0;
+#endif
 	gfp_t gfp_flags = GFP_KERNEL | __GFP_NOWARN;
 	unsigned long start = range->start;
 	unsigned long end = range->end;
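
Either suggestion can be tried from the top of a kernel git checkout. The following is only a minimal sketch: the helper names try_revert and try_patch and the patch file name are made up for illustration, and the revert may need manual conflict resolution depending on the tree it is applied to.

```shell
#!/bin/sh
# Hypothetical helpers; run them from the top of a kernel git checkout.

# Revert one commit on top of the current HEAD without opening an editor.
try_revert() {
    git revert --no-edit "$1"
}

# Check that a patch file applies cleanly, then apply it to the working tree.
try_patch() {
    git apply --check "$1" && git apply "$1"
}

# Usage (not executed here):
#   try_revert 223b5e57d0d50b0c07b933350dbcde92018d3080
#   try_patch execmem-sparc.diff    # file name is an assumption
```

Only one of the two should be applied at a time, so that a successful boot can be attributed to a single change.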

Thanks,
Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: Linux kernel stability fixes for older SPARCs

2024-09-23 Thread John Paul Adrian Glaubitz
On Mon, 2024-09-23 at 09:56 +0200, Ignacio Soriano Hernandez wrote:
> We are at  a very stable working configuration with Radeon support on T2 SDE. 

From what I have heard, Rene did not apply any local SPARC-specific patches to
the kernel, so the fact that your machine runs stable with T2 SDE is more likely
a result of disabled kernel features.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: Linux kernel stability fixes for older SPARCs

2024-09-23 Thread Ignacio Soriano Hernandez
Good to hear, Gregor.

We are at a very stable working configuration with Radeon support on T2 SDE.


I think combining the efforts will be beneficial for the whole
Linux/SPARC64 community.

Cheers

Iggi

Gregor Riepl wrote on Wed, 18 Sept 2024 at 20:17:

> Small update:
>
> I managed to build a 6.10 kernel with gcc 14 now.
> I'll do some more stress testing with it, but looks very stable so far.
>
> My kernel config is attached.
> It's much smaller than the Debian kernel config config-6.10.7-sparc64-smp,
> and now I'm quite convinced that it's really some options that cause the
> stability issues.
> Finding them is a big challenge, though...


Re: Linux kernel stability fixes for older SPARCs

2024-09-22 Thread John Paul Adrian Glaubitz
Hi Gregor,

On Mon, 2024-09-23 at 00:20 +0200, Gregor Riepl wrote:
> > 
> > It's probably due to CONFIG_DEBUG_INFO. Try turning that off. Debian builds 
> > the kernel with debug
> > symbols enabled and then runs the strip command afterwards. This way both a 
> > debug and a standard
> > kernel package can be provided from the same build.
> 
> Ah thanks, that did the trick.
> 
> I built a 6.10.0 kernel using the Debian 6.10.7-sparc64-smp config, with 
> module signing turned off.
> This kernel crashed instantly at boot, just after checking the rootfs. The 
> fsck output was intermingled with the kernel log, but it did complete with a 
> "done."
> 
> Begin: Will now check root file system ... fsck from util-linux 2.38.1
> [   68.420534]   \|/  \|/
> [   68.420534]   "@'/ .. \`@"
> [   68.420534]   /_| \__/ |_\
> [   68.420534]  \__U_/
> [   68.630552] mount(192): Kernel illegal instruction [#1]
> [   68.715911] CPU: 0 PID: 192 Comm: mount Tainted: GE  
> 6.10.0 #28
> [   68.828841] TSTATE: 11001605 TPC: 10320158 TNPC: 
> 1032015c Y: Tainted: GE
> [   68.994452] TPC: 
> [   69.078729] g0: 0001 g1: 0001 g2:  
> g3: 
> [   69.209968] g4: fff000100210ac00 g5: fff000103e442000 g6: fff000598000 
> g7: 
> [   69.341210] o0: 0200 o1: fff0029cc3e8 o2: 0001 
> o3: 103cc1b0
> [   69.472457] o4: 1678 o5: b000 sp: fff00059a8d1 
> ret_pc: 00ea309c
> [   69.608287] RPC: <__cond_resched+0x1c/0x60>
> [   69.679947] l0: fff000d06416 l1: fff0029cc128 l2: 00010001 
> l3: 00ff
> [   69.811203] l4:  l5: 0005 l6: 00010001 
> l7: 0002
> [   69.942448] i0: 00010001 i1:  i2:  
> i3: 
> [   70.070570] i4:  i5: 0001 i6: fff00059a9a1 
> i7: 10325748
> [   70.185151] I7: 
> [   70.255147] Call Trace:
> [   70.287221] [<10325748>] ext4_ext_map_blocks+0x68/0x2060 [ext4]
> [   70.374417] [<1033ee38>] ext4_map_blocks+0x98/0x6c0 [ext4]
> [   70.455872] [<1033fd34>] ext4_iomap_begin+0x254/0x2e0 [ext4]
> [   70.539621] [<007f494c>] iomap_iter+0x14c/0x420
> [   70.608470] [<007fa5f0>] iomap_bmap+0x70/0xe0
> [   70.674929] [<1033bd3c>] ext4_bmap+0x9c/0xe0 [ext4]
> [   70.748366] [<00789404>] bmap+0x24/0x40
> [   70.808051] [<102d7e54>] jbd2_journal_init_inode+0x14/0x120 [jbd2]
> [   70.898675] [<10385c2c>] ext4_load_and_init_journal+0xec/0xd20 
> [ext4]
> [   70.992740] [<1038bd78>] ext4_fill_super+0x2638/0x2aa0 [ext4]
> [   71.077631] [<0076ae5c>] get_tree_bdev+0xfc/0x1c0
> [   71.148774] [<10377c34>] ext4_get_tree+0x14/0x40 [ext4]
> [   71.226799] [<0076b4c0>] vfs_get_tree+0x20/0x120
> [   71.296792] [<00796f0c>] path_mount+0x40c/0xa60
> [   71.365539] [<00797a94>] sys_mount+0xf4/0x1c0
> [   71.431997] [<00406274>] linux_sparc_syscall+0x34/0x44
> 
> I'll try a bisect with that config now, perhaps I can find something.

Great! Now you finally have something to work with. Crossing my fingers that 
you'll find something.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: Linux kernel stability fixes for older SPARCs

2024-09-22 Thread Gregor Riepl

> > > > It looks very much like it isn't specifically a kernel bug at all, but
> > > > either something wrong with the Debian kernel config, or with newer
> > > > gcc versions.
> > >
> > > I still think it's a kernel bug.
> >
> > Very well.
> > Tracing through all possible kernel config parameters is probably not
> > feasible anyway.
>
> It's probably due to CONFIG_DEBUG_INFO. Try turning that off. Debian builds
> the kernel with debug symbols enabled and then runs the strip command
> afterwards. This way both a debug and a standard kernel package can be
> provided from the same build.
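
Before rebuilding, it can help to confirm what a given config file actually sets. The following is only a rough sketch, assuming the usual .config line format ("CONFIG_FOO=y", "# CONFIG_FOO is not set"); the function name kconfig_state is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the state of a Kconfig symbol in a config file.
# The symbol is given without the CONFIG_ prefix.
# Prints y or m for enabled symbols, n if the symbol is unset or absent,
# and the raw value for string or numeric options.
kconfig_state() {
    local file=$1 sym=$2 line
    line=$(grep -E "^CONFIG_${sym}=" "$file" | tail -n 1)
    case "$line" in
        *=y) echo y ;;
        *=m) echo m ;;
        "")  echo n ;;
        *)   echo "${line#*=}" ;;
    esac
}

# Usage (not executed here; the path is an assumption):
#   kconfig_state /boot/config-6.10.7-sparc64-smp DEBUG_INFO
#   kconfig_state .config MODULE_SIG
```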


Ah thanks, that did the trick.

I built a 6.10.0 kernel using the Debian 6.10.7-sparc64-smp config, with module 
signing turned off.
This kernel crashed instantly at boot, just after checking the rootfs. The fsck output 
was intermingled with the kernel log, but it did complete with a "done."

Begin: Will now check root file system ... fsck from util-linux 2.38.1
[   68.420534]   \|/  \|/
[   68.420534]   "@'/ .. \`@"
[   68.420534]   /_| \__/ |_\
[   68.420534]  \__U_/
[   68.630552] mount(192): Kernel illegal instruction [#1]
[   68.715911] CPU: 0 PID: 192 Comm: mount Tainted: GE  6.10.0 
#28
[   68.828841] TSTATE: 11001605 TPC: 10320158 TNPC: 
1032015c Y: Tainted: GE
[   68.994452] TPC: 
[   69.078729] g0: 0001 g1: 0001 g2:  
g3: 
[   69.209968] g4: fff000100210ac00 g5: fff000103e442000 g6: fff000598000 
g7: 
[   69.341210] o0: 0200 o1: fff0029cc3e8 o2: 0001 
o3: 103cc1b0
[   69.472457] o4: 1678 o5: b000 sp: fff00059a8d1 
ret_pc: 00ea309c
[   69.608287] RPC: <__cond_resched+0x1c/0x60>
[   69.679947] l0: fff000d06416 l1: fff0029cc128 l2: 00010001 
l3: 00ff
[   69.811203] l4:  l5: 0005 l6: 00010001 
l7: 0002
[   69.942448] i0: 00010001 i1:  i2:  
i3: 
[   70.070570] i4:  i5: 0001 i6: fff00059a9a1 
i7: 10325748
[   70.185151] I7: 
[   70.255147] Call Trace:
[   70.287221] [<10325748>] ext4_ext_map_blocks+0x68/0x2060 [ext4]
[   70.374417] [<1033ee38>] ext4_map_blocks+0x98/0x6c0 [ext4]
[   70.455872] [<1033fd34>] ext4_iomap_begin+0x254/0x2e0 [ext4]
[   70.539621] [<007f494c>] iomap_iter+0x14c/0x420
[   70.608470] [<007fa5f0>] iomap_bmap+0x70/0xe0
[   70.674929] [<1033bd3c>] ext4_bmap+0x9c/0xe0 [ext4]
[   70.748366] [<00789404>] bmap+0x24/0x40
[   70.808051] [<102d7e54>] jbd2_journal_init_inode+0x14/0x120 [jbd2]
[   70.898675] [<10385c2c>] ext4_load_and_init_journal+0xec/0xd20 [ext4]
[   70.992740] [<1038bd78>] ext4_fill_super+0x2638/0x2aa0 [ext4]
[   71.077631] [<0076ae5c>] get_tree_bdev+0xfc/0x1c0
[   71.148774] [<10377c34>] ext4_get_tree+0x14/0x40 [ext4]
[   71.226799] [<0076b4c0>] vfs_get_tree+0x20/0x120
[   71.296792] [<00796f0c>] path_mount+0x40c/0xa60
[   71.365539] [<00797a94>] sys_mount+0xf4/0x1c0
[   71.431997] [<00406274>] linux_sparc_syscall+0x34/0x44

I'll try a bisect with that config now, perhaps I can find something.



Re: Linux kernel stability fixes for older SPARCs

2024-09-18 Thread John Paul Adrian Glaubitz
On Wed, 2024-09-18 at 18:17 +0200, Gregor Riepl wrote:
> My first attempt at bisecting ran into lots of compilation issues with the 
> default config of each version and gcc 14.
> All the 4.x and 5.x kernels fail with the following errors (at least, some 
> versions have more):
> 
> arch/sparc/kernel/mdesc.c: In function 'mdesc_node_by_name':
> arch/sparc/kernel/mdesc.c:646:22: error: 'strcmp' reading 1 or more bytes 
> from a region of size 0 [-Werror=stringop-overread]
>646 | if (!strcmp(names + ep[ret].name_offset, name))
>|  ^
> arch/sparc/kernel/mdesc.c:78:33: note: at offset [32, 8589934606] into source 
> object 'mdesc' of size 16
> 78 | struct mdesc_hdr mdesc;
>| ^
> ...
> In function 'kernel_lds_init',
>  inlined from 'report_memory' at arch/sparc/mm/init_64.c:3112:2:
> arch/sparc/mm/init_64.c:3102:31: error: array subscript -1 is outside array 
> bounds of 'char[]' [-Werror=array-bounds=]
>   3102 | data_resource.end   = compute_kern_paddr(_edata - 1);
>|   ^~
> ./include/asm-generic/sections.h: In function 'report_memory':
> ./include/asm-generic/sections.h:36:32: note: at offset -1 into object 
> '_edata' of size [0, 9223372036854775807]
> 36 | extern char _data[], _sdata[], _edata[];
>|^~
> ...

Yeah, a lot of these warnings have since been fixed in the kernel. They are
treated as errors if CONFIG_WERROR is set.
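
For old revisions that still trip over these warnings, one option is to switch CONFIG_WERROR off in the .config before building. The kernel tree ships scripts/config for this kind of edit (e.g. "scripts/config --disable WERROR"); the stand-alone sketch below only illustrates the idea, and the function name kconfig_disable is made up.

```shell
#!/bin/sh
# Hypothetical helper: mark a Kconfig symbol as "not set" in a config file.
# The symbol is given without the CONFIG_ prefix. Lines for other symbols
# are left untouched; a symbol that is absent stays absent.
kconfig_disable() {
    local file=$1 sym=$2 tmp
    tmp=$(mktemp) || return 1
    sed "s/^CONFIG_${sym}=.*/# CONFIG_${sym} is not set/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Usage (not executed here):
#   kconfig_disable .config WERROR
#   make olddefconfig    # let Kconfig re-resolve dependent options
```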

> Next issue: The default kernel config lacks some essential drivers to make my 
> system bootable. For my Fire V215,
> at least CONFIG_FUSIONMPT and CONFIG_CGROUPS are needed, plus a few other 
> things. systemd requires cgroups v2
> support these days.

The default configs for 32-bit and 64-bit SPARC could probably see an update 
here.

> I started off with a default config in the first bisect step (corresponding 
> with 5.14), added the required options,
> and then did a make oldconfig in each subsequent step, answering all 
> questions with the default.

"make localmodconfig" is probably easier in this case.

> Building with make bindeb-pkg produces an almost usable kernel package. For 
> some reason, grub-ieee1275 requires an
> unpacked kernel, so the installed vmlinuz needed to be gunzipped afterwards.

That's not an arbitrary reason, but simply a requirement for GRUB on SPARC due 
to size limitations. It's documented
in the GRUB manual.
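
A rough sketch of that post-install step, assuming the installed image is a plain gzip stream; the function name unpack_kernel is made up, and the path in the usage comment is only an example:

```shell
#!/bin/sh
# Hypothetical helper: decompress a kernel image in place if it is
# gzip-compressed, and leave it alone otherwise.
unpack_kernel() {
    local img=$1 tmp
    tmp=$(mktemp) || return 1
    # Reading from stdin avoids gzip's file-suffix check; without -f,
    # gzip refuses input that is not in gzip format.
    if gzip -dc < "$img" > "$tmp" 2>/dev/null; then
        cat "$tmp" > "$img"    # keep the original file's mode and owner
        echo "unpacked $img"
    else
        echo "left $img unchanged (not gzip-compressed)"
    fi
    rm -f "$tmp"
}

# Usage (not executed here; the path is an assumption):
#   unpack_kernel /boot/vmlinuz-6.10.0
```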

> Now for the actual testing... triggering a panic/oops reliably was difficult. 
> The Debian 6.10 kernel usually crashes
> relatively quickly on disk I/O, and enabling swap accelerates the effect. 
> bonnie++ should therefore make for a good
> stress test.

Unfortunately, I haven't found a good reproducer yet either.

> I don't have the exact commit IDs of each bisection step, but it was 
> (roughly) 5.14-rc6, 6.6-rc7, 6.8-rc3, 6.9, 6.10.
> 
> There were a few odd non-critical issues, such as this I/O error with 5.14 
> (but nothing in dmesg):
> 
> $ /usr/sbin/bonnie++
> Writing a byte at a time...done
> Writing intelligently...done
> Rewriting...Can't write block.: Unknown error 2560
> Bonnie: drastic I/O error (re write(2)): Unknown error 2560

Just use "git bisect skip" in this case to skip unrelated regressions.
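
When a bisect step is scripted, the same decision maps onto the exit codes that "git bisect run" understands: 0 marks the revision good, 125 means it cannot be tested and should be skipped, and other codes from 1 to 127 mark it bad. A minimal sketch; the function name bisect_verdict is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: translate the exit statuses of a build and a test
# into the exit code "git bisect run" expects.
#   0   = good
#   1   = bad
#   125 = this revision cannot be tested (skip)
bisect_verdict() {
    local build_status=$1 test_status=$2
    if [ "$build_status" -ne 0 ]; then
        echo 125    # does not build: unrelated breakage, skip
    elif [ "$test_status" -eq 0 ]; then
        echo 0
    else
        echo 1
    fi
}

# Usage inside a bisect script (not executed here):
#   exit "$(bisect_verdict "$build_rc" "$test_rc")"
```

Since each step here needs a boot on the target machine, marking revisions by hand with "git bisect good", "git bisect bad" and "git bisect skip" may be the more practical route.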

> 6.2 produces this warning at boot:
> 
> [ +21.090317] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [  +1.422401] rcu:  0-...!: (1 GPs behind) idle=a29c/0/0x1 softirq=18/19 
> fqs=44
> [  +0.093960]   (detected by 0, t=2246 jiffies, g=-1175, q=989 ncpus=2)
> [  +0.083646] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-rc7+ #18
> [  +0.083641] TSTATE: 004411001605 TPC: 0042beac TNPC: 
> 0042beb0 Y: Not tainted
> [  +0.129479] TPC: 
> [  +0.053848] g0: 004209d0 g1: 015282c0 g2: 015105c8 
> g3: 0001
> [  +0.114585] g4: fff000390ba0 g5: fff27e2f g6: fff000398000 
> g7: 173aa294
> [  +0.114582] o0: fff000390ba0 o1: 0001 o2: 0130ae78 
> o3: 015105c8
> [  +0.114580] o4: 015280c0 o5: 0130b580 sp: fff00039b3d1 
> ret_pc: 0042bea0
> [  +0.119164] RPC: 
> [  +0.053850] l0: 01407f20 l1: 00022c05 l2:  
> l3: 0130b538
> [  +0.114585] l4: 0130b400 l5: 0040 l6:  
> l7: 01408140
> [  +0.114581] i0: 173aa299 i1: fff27f814990 i2: 0001 
> i3: 0001
> [  +0.114580] i4: fff27f814990 i5: 01524990 i6: fff00039b481 
> i7: 00b22f68
> [  +0.114582] I7: 
> [  +0.058433] Call Trace:
> [  +0.032082] [<00b22f68>] default_idle_call+0x48/0x100
> [  +0.075624] [<004adc28>] do_idle+0x108/0x180
> [  +0.065311] [<004adf34>] cpu_startup_entry+0x14/0x40
> [  +0.074477] [<0

Re: Linux kernel stability fixes for older SPARCs

2024-09-18 Thread Gregor Riepl

Small update:

I managed to build a 6.10 kernel with gcc 14 now.
I'll do some more stress testing with it, but looks very stable so far.

My kernel config is attached.
It's much smaller than the Debian kernel config config-6.10.7-sparc64-smp, and 
now I'm quite convinced that it's really some options that cause the stability 
issues.
Finding them is a big challenge, though...

#
# Automatically generated file; DO NOT EDIT.
# Linux/sparc 6.10.0 Kernel Configuration
#
CONFIG_CC_VERSION_TEXT="sparc64-linux-gcc (GCC) 14.2.0"
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=140200
CONFIG_CLANG_VERSION=0
CONFIG_AS_IS_GNU=y
CONFIG_AS_VERSION=24200
CONFIG_LD_IS_BFD=y
CONFIG_LD_VERSION=24200
CONFIG_LLD_VERSION=0
CONFIG_CC_HAS_ASM_GOTO_OUTPUT=y
CONFIG_CC_HAS_ASM_GOTO_TIED_OUTPUT=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_CC_HAS_NO_PROFILE_FN_ATTR=y
CONFIG_PAHOLE_VERSION=0
CONFIG_IRQ_WORK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
# CONFIG_WERROR is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_BUILD_SALT=""
CONFIG_DEFAULT_INIT=""
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_WATCH_QUEUE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_USELIB=y
# CONFIG_AUDIT is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
# end of IRQ subsystem

CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_CONTEXT_TRACKING=y
CONFIG_CONTEXT_TRACKING_IDLE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem

CONFIG_BPF=y
CONFIG_HAVE_EBPF_JIT=y

#
# BPF subsystem
#
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
# CONFIG_BPF_JIT_ALWAYS_ON is not set
CONFIG_BPF_UNPRIV_DEFAULT_OFF=y
CONFIG_USERMODE_DRIVER=y
CONFIG_BPF_PRELOAD=y
CONFIG_BPF_PRELOAD_UMD=m
# end of BPF subsystem

CONFIG_PREEMPT_VOLUNTARY_BUILD=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
# CONFIG_SCHED_CORE is not set

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_PSI is not set
# end of CPU/Task time and stats accounting

CONFIG_CPU_ISOLATION=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_TREE_SRCU=y
CONFIG_NEED_SRCU_NMI_SAFE=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_NEED_TASKS_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# end of RCU Subsystem

# CONFIG_IKCONFIG is not set
# CONFIG_IKHEADERS is not set
CONFIG_LOG_BUF_SHIFT=18
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
# CONFIG_PRINTK_INDEX is not set

#
# Scheduler features
#
# end of Scheduler features

CONFIG_CC_HAS_INT128=y
CONFIG_CC_IMPLICIT_FALLTHROUGH="-Wimplicit-fallthrough=5"
CONFIG_GCC10_NO_ARRAY_BOUNDS=y
CONFIG_CC_NO_ARRAY_BOUNDS=y
CONFIG_GCC_NO_STRINGOP_OVERFLOW=y
CONFIG_CC_NO_STRINGOP_OVERFLOW=y
CONFIG_SLAB_OBJ_EXT=y
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
# CONFIG_CGROUP_FAVOR_DYNMODS is not set
CONFIG_MEMCG=y
CONFIG_MEMCG_KMEM=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_BPF=y
CONFIG_CGROUP_MISC=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
# CONFIG_CHECKPOINT_RESTORE is not set
CONFIG_SCHED_AUTOGROUP=y
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
CONFIG_RD_LZ4=y
CONFIG_RD_ZSTD=y
# CONFIG_BOOT_CONFIG is not set
CONFIG_INITRAMFS_PRESERVE_MTIME=y
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_HAVE_UID16=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
# CONFIG_EXPERT is not set
CONFIG_UID16=y
CONFIG_MULTIUSER=y
CONFIG_SGETMASK_SYSCALL=y
CONFIG_SYSFS_SYSCALL=y
CONFIG_FHANDLE=y
CONFIG_POSIX_TIMERS=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_FUTEX=y
CONFIG_FUTEX_PI=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_IO_URING=y
CONFIG_ADVISE_SYSCALLS=y
CONFIG_MEMBARRIER=y
CONFIG_CACHESTAT_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_SELFTEST is not set
# CONFIG_KALLSYMS_ALL is not set
CONFIG_KALLSYMS_BASE_RELATIVE=y
CONFIG_HAVE_PERF_

Re: Linux kernel stability fixes for older SPARCs

2024-09-18 Thread Gregor Riepl

So, here's a status report, sorry for the long mail (summary at the end):

I did all tests on my Sun Fire V215 for now, because the machine is a bit 
faster than the Ultra 10 and the ALOM makes remote testing a little bit easier. 
It also has two CPUs, helping to uncover SMP-related issues.

Kernel 4.19 indeed seems to run more stably than later Debian kernels. That at 
least gives me a stable system to fix things when they break. Thanks for that 
hint.

My first attempt at bisecting ran into lots of compilation issues with the 
default config of each version and gcc 14. All the 4.x and 5.x kernels fail 
with the following errors (at least, some versions have more):

arch/sparc/kernel/mdesc.c: In function 'mdesc_node_by_name':
arch/sparc/kernel/mdesc.c:646:22: error: 'strcmp' reading 1 or more bytes from 
a region of size 0 [-Werror=stringop-overread]
  646 | if (!strcmp(names + ep[ret].name_offset, name))
  |  ^
arch/sparc/kernel/mdesc.c:78:33: note: at offset [32, 8589934606] into source 
object 'mdesc' of size 16
   78 | struct mdesc_hdr mdesc;
  | ^
...
In function 'kernel_lds_init',
inlined from 'report_memory' at arch/sparc/mm/init_64.c:3112:2:
arch/sparc/mm/init_64.c:3102:31: error: array subscript -1 is outside array 
bounds of 'char[]' [-Werror=array-bounds=]
 3102 | data_resource.end   = compute_kern_paddr(_edata - 1);
  |   ^~
./include/asm-generic/sections.h: In function 'report_memory':
./include/asm-generic/sections.h:36:32: note: at offset -1 into object '_edata' 
of size [0, 9223372036854775807]
   36 | extern char _data[], _sdata[], _edata[];
  |^~
...

I then tried gcc 8.1, which roughly matches 4.19 in release date. Older 
kernels compile well with this version, but 6.10 failed with the errors below. I 
couldn't reproduce them later on, so it may have been a fluke:

`.exit.text' referenced in section `__jump_table' of fs/fuse/inode.o: defined 
in discarded section `.exit.text' of fs/fuse/inode.o
`.exit.text' referenced in section `__jump_table' of fs/fuse/inode.o: defined 
in discarded section `.exit.text' of fs/fuse/inode.o

I decided to ignore the error for now and start bisecting from 4.19 to 6.10 
with gcc 8.1.

Next issue: The default kernel config lacks some essential drivers to make my 
system bootable. For my Fire V215, at least CONFIG_FUSIONMPT and CONFIG_CGROUPS 
are needed, plus a few other things. systemd requires cgroups v2 support these 
days. I started off with a default config in the first bisect step 
(corresponding with 5.14), added the required options, and then did a make 
oldconfig in each subsequent step, answering all questions with the default.

Building with make bindeb-pkg produces an almost usable kernel package. For 
some reason, grub-ieee1275 requires an unpacked kernel, so the installed 
vmlinuz needed to be gunzipped afterwards.

Now for the actual testing... triggering a panic/oops reliably was difficult. 
The Debian 6.10 kernel usually crashes relatively quickly on disk I/O, and 
enabling swap accelerates the effect.
bonnie++ should therefore make for a good stress test.
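
Because a single run does not always trigger the crash, the stress test can be repeated in a loop. This is only a sketch; the function name run_until_failure is made up, and a kernel panic would of course end the loop by taking the machine down rather than by returning an error.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to MAX times and stop at the
# first failure. Prints the number of the failing run, or "ok" if every
# run passed. Output of the command itself is passed through unchanged.
run_until_failure() {
    local max=$1 i=1
    shift
    while [ "$i" -le "$max" ]; do
        if ! "$@"; then
            echo "$i"
            return 1
        fi
        i=$((i + 1))
    done
    echo ok
    return 0
}

# Usage (not executed here):
#   run_until_failure 20 /usr/sbin/bonnie++
```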

I don't have the exact commit IDs of each bisection step, but it was (roughly) 
5.14-rc6, 6.6-rc7, 6.8-rc3, 6.9, 6.10.

There were a few odd non-critical issues, such as this I/O error with 5.14 (but 
nothing in dmesg):

$ /usr/sbin/bonnie++
Writing a byte at a time...done
Writing intelligently...done
Rewriting...Can't write block.: Unknown error 2560
Bonnie: drastic I/O error (re write(2)): Unknown error 2560

6.2 produces this warning at boot:

[ +21.090317] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  +1.422401] rcu:  0-...!: (1 GPs behind) idle=a29c/0/0x1 softirq=18/19 
fqs=44
[  +0.093960]   (detected by 0, t=2246 jiffies, g=-1175, q=989 ncpus=2)
[  +0.083646] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-rc7+ #18
[  +0.083641] TSTATE: 004411001605 TPC: 0042beac TNPC: 
0042beb0 Y: Not tainted
[  +0.129479] TPC: 
[  +0.053848] g0: 004209d0 g1: 015282c0 g2: 015105c8 
g3: 0001
[  +0.114585] g4: fff000390ba0 g5: fff27e2f g6: fff000398000 
g7: 173aa294
[  +0.114582] o0: fff000390ba0 o1: 0001 o2: 0130ae78 
o3: 015105c8
[  +0.114580] o4: 015280c0 o5: 0130b580 sp: fff00039b3d1 
ret_pc: 0042bea0
[  +0.119164] RPC: 
[  +0.053850] l0: 01407f20 l1: 00022c05 l2:  
l3: 0130b538
[  +0.114585] l4: 0130b400 l5: 0040 l6:  
l7: 01408140
[  +0.114581] i0: 173aa299 i1: fff27f814990 i2: 0001 
i3: 0001
[  +0.114580] i4: fff27f814990 i5: 01524990 i6: fff000

Re: Linux kernel stability fixes for older SPARCs

2024-09-03 Thread John Paul Adrian Glaubitz
Hi Gregor,

On Wed, 2024-09-04 at 01:22 +0200, Gregor Riepl wrote:
> > 
> > It's actually pretty simple these days as the kernel.org mirrors provide 
> > binary
> > distributions of freestanding toolchains for all major supported 
> > architectures
> > of the Linux kernel.
> > 
> > To set up on any x86_64 machine, do the following:
> > 
> > # wget 
> > https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/14.2.0/x86_64-gcc-14.2.0-nolibc-sparc64-linux.tar.gz
> > # tar xf x86_64-gcc-14.2.0-nolibc-sparc64-linux.tar.gz
> > # export PATH=$PATH:$PWD/gcc-14.2.0-nolibc/sparc64-linux/bin/
> > # git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> > # cd linux
> > # export ARCH=sparc
> > # export CROSS_COMPILE=sparc64-linux-
> > # make sparc64_defconfig
> > # make -j
> > 
> > The cross-compiled kernel will be available as "vmlinux".
> 
> Very good, thanks!

Thanks for looking into it!

I'm especially interested in finding a proper reproducer which would make
the bisecting process much easier. So far, the crashes seem to be rather
random although they mainly occur with newer kernels.

FWIW, I found a very handy patch yesterday which could help debugging these
crashes once it's been merged into the upstream kernel [1]. What it does is
that it dumps the back of the stack after a stack corruption has occurred
which should in theory help find what part of the kernel is responsible for
the stack corruption.

It looks like this particular crash we have been seeing on the older SPARCs
was always due to stack corruption which could mean that it's related to
a driver or arch-specific code that is used on the older SPARCs but not on
the newer machines.

Adrian

[1] https://lore.kernel.org/lkml/20231219032254.96685-1-feng.t...@intel.com/

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: Linux kernel stability fixes for older SPARCs

2024-09-03 Thread Gregor Riepl

> > I may have some time to do test runs next week.
> > Could you give me some quick starters for setting up a kernel cross build env on
> > an amd64 machine, or maybe access to a Sun box I could use?
>
> It's actually pretty simple these days as the kernel.org mirrors provide binary
> distributions of freestanding toolchains for all major supported architectures
> of the Linux kernel.
>
> To set up on any x86_64 machine, do the following:
>
> # wget https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/14.2.0/x86_64-gcc-14.2.0-nolibc-sparc64-linux.tar.gz
> # tar xf x86_64-gcc-14.2.0-nolibc-sparc64-linux.tar.gz
> # export PATH=$PATH:$PWD/gcc-14.2.0-nolibc/sparc64-linux/bin/
> # git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> # cd linux
> # export ARCH=sparc
> # export CROSS_COMPILE=sparc64-linux-
> # make sparc64_defconfig
> # make -j
>
> The cross-compiled kernel will be available as "vmlinux".

Very good, thanks!



Re: Linux kernel stability fixes for older SPARCs

2024-09-03 Thread John Paul Adrian Glaubitz
Hi Gregor,

On Tue, 2024-09-03 at 19:19 +0200, Gregor Riepl wrote:
> > > If you have issues to bi-sect just let us know for any arch. Given T2’s 
> > > cross-compile
> > > support and I have most hardware in my museum now, I can usually bisect 
> > > issues
> > > within a day or two.
> > 
> > I don't have issues with bisecting, I'm just rather time-constrained at the 
> > moment, so
> > I'm always happy when someone else can step in and help. Would be great to 
> > get this issue
> > fixed upstream.
> 
> My Ultra 10 and Fire V215 are desperately waiting for a more stable kernel.
> I actually wanted to offer help with bisecting, but kept back due to a lack
> of time and also suitable build system (compiling kernels is so 
> time-consuming).

Any help is welcome ;-).

> I may have some time to do test runs next week.
> Could you give me some quick starters for setting up a kernel cross build env 
> on
> an amd64 machine, or maybe access to a Sun box I could use?

It's actually pretty simple these days as the kernel.org mirrors provide binary
distributions of freestanding toolchains for all major supported architectures
of the Linux kernel.

To set up on any x86_64 machine, do the following:

# wget https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/14.2.0/x86_64-gcc-14.2.0-nolibc-sparc64-linux.tar.gz
# tar xf x86_64-gcc-14.2.0-nolibc-sparc64-linux.tar.gz
# export PATH=$PATH:$PWD/gcc-14.2.0-nolibc/sparc64-linux/bin/
# git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
# cd linux
# export ARCH=sparc
# export CROSS_COMPILE=sparc64-linux-
# make sparc64_defconfig
# make -j

The cross-compiled kernel will be available as "vmlinux".
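
For bisecting with a different compiler version, the download URL can be assembled from its parts. This is only a sketch: the naming pattern is inferred from the single URL above, the function name crosstool_url is made up, and not every host/version/target combination necessarily exists on the mirror.

```shell
#!/bin/sh
# Hypothetical helper: build a crosstool tarball URL from host
# architecture, gcc version and target triplet. The pattern is an
# assumption derived from the one URL quoted in this mail.
crosstool_url() {
    local host=$1 gcc=$2 target=$3
    echo "https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/${host}/${gcc}/${host}-gcc-${gcc}-nolibc-${target}.tar.gz"
}

# Usage (not executed here):
#   wget "$(crosstool_url x86_64 14.2.0 sparc64-linux)"
```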

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: Linux kernel stability fixes for older SPARCs

2024-09-03 Thread Gregor Riepl




> > If you have issues to bi-sect just let us know for any arch. Given T2’s
> > cross-compile support and I have most hardware in my museum now, I can
> > usually bisect issues within a day or two.
>
> I don't have issues with bisecting, I'm just rather time-constrained at the
> moment, so I'm always happy when someone else can step in and help. Would be
> great to get this issue fixed upstream.

My Ultra 10 and Fire V215 are desperately waiting for a more stable kernel.
I actually wanted to offer help with bisecting, but kept back due to a lack of
time and also suitable build system (compiling kernels is so time-consuming).

I may have some time to do test runs next week.
Could you give me some quick starters for setting up a kernel cross build env
on an amd64 machine, or maybe access to a Sun box I could use?



Re: Linux kernel stability fixes for older SPARCs

2024-09-03 Thread John Paul Adrian Glaubitz
Hello Rene,

On Tue, 2024-09-03 at 11:09 +0200, René Rebe wrote:
> > according to these posts [1][2] by Iggi, you figured out the stability 
> > problem
> 
> No, we are just sometimes lucky that it runs stable that long. I was only made aware
> recently that sun4u was not 100% and my fastest UltraSPARC until some years ago
> was only a 360MHz Ultra5 until I was donated a Sun Blade 1000 recently. I see
> some MM corruption that I wanted to hunt next.

Hmm, ok. I was under the impression that you made some changes that made the
kernel on Iggi's machine stable. Currently, the kernel crashes randomly on older
SPARCs, as reported by Iggi:

https://x.com/Iggi76123640/status/1827658841581896152

> > with newer kernels on older SPARC machines. There has been a regression on 
> > older
> > SPARCs since around kernel 4.19.x which I haven't gotten around to 
> > bisecting yet.
> 
> Happy to bi-sect. I guess you mean random memory corruption I see or anything
> else?

Not sure what the underlying issue is, but the kernel just crashes completely.

> If you have issues to bi-sect just let us know for any arch. Given T2’s 
> cross-compile
> support and I have most hardware in my museum now, I can usually bisect issues
> within a day or two.

I don't have issues with bisecting, I'm just rather time-constrained at the 
moment, so
I'm always happy when someone else can step in and help. Would be great to get 
this issue
fixed upstream.

> > If you've found and fixed the bug in question, it would be great if you 
> > could share
> > your fix with the community and maybe whip up a kernel patch to fix the bug 
> > upstream.
> 
> 
> Of course - all patches are always nicely sorted in our public and nicely 
> readable
> SVN tree in any case.
> 
>   https://t2linux.com

Is there a web view available? I'm not really a big fan of SVN, to be honest.

> > Newer SPARCs are not affected by this bug, although there are other issues.
> 
> You mean sun4v? I found a cheap T4-1 some months ago, and T2/Linux appears
> to run stable on that. Any list of issues w/ sun4v I should be aware of?

Linux runs mostly stable on sun4v, but there are filesystem corruption issues 
when you
run Linux inside an LDOM on Solaris 11.3 and 11.4 even with the latest SRU of 
Solaris.

These happen rarely, but they do occur, and they are quite annoying: the root
filesystem gets remounted read-only, the LDOM has to be rebooted, and the
filesystem has errors afterwards.

It seems to be a bug in the LDOM vdisk driver (drivers/block/sunvdc.c).

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913