Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Riccardo, all, On 17.01.22 21:35, Riccardo Mottola wrote: Hi, Riccardo Mottola wrote: John Paul Adrian Glaubitz wrote: Not nice. I started compiling some stuff and the box froze, I connected serial console and could not resume due to Fast Data Access MMU miss" So, this crash occurs with the latest 5.15 kernel on your T2000? exactly latest kernel. I will retest it with stress-ng as soon as I finish this email and copy the dmesg errors. wow, running the test suite once or twice, I am able to have the system power-cycle... wow Frank test latest kernel on yours :) I yesterday found the time to give Linux 5.15.0-3 a try on my T1000 (UltraSPARC T1) and V210 (US IIIi), but the boot issue is still there - at least for my use case: The klibc based tools inside of the initramfs are not able to mount the root FS over NFS (details further below). But it's still good to see that mounting an on-disk root FS seems to work now for your T2000, though the instabilities during runtime are not reassuring. For me the last good Debian kernel - at least for booting, more on that shortly - is 5.9.0-5. Both T1000 and V210 boot fine with it (incl. mounting the root FS via NFS(v3 BTW)). But during operation (tested with `apt upgrade` on a root FS replicated multiple times for testing from the same tarball) the V210 crashes (=> kernel panic), the T1000 does not. For the V210 I also see that for 5.8.0-3. Doing the same with kernel 4.19.0-5 running on the V210, no problems are seen, not even the messages below. The crash when running 5.9.0-5 or 5.8.0-3 is usually "announced" (or at least accompanied) by one or more occurrence(s) of the following messages: ``` [...] [ 360.489852] CPU[0]: Cheetah+ D-cache parity error at TPC[005b28c8] [ 360.580300] TPC [...] ``` ...which should be familiar for UltraSPARC IIIi users with newer kernels (see for example [1] which shows it for 4.16.x). 
According to [2] this error should be recoverable (otherwise it would be followed by a panic and "Irrecoverable Cheetah+ parity error."), which seems to happen, until it is no longer, but I don't see that second message, so something else must happen. [1]: https://www.spinics.net/lists/sparclinux/msg21019.html [2]: https://github.com/torvalds/linux/blob/master/arch/sparc/kernel/traps_64.c#L1767..L1799 Of course our CPU's caches don't go pop magically. There is something broken in the newer kernels (> 4.19.x) for UltraSPARC IIIi (and most likely all the other related processors, too), apart from the mounting issues for NFS (see [3] for processors affected by this, update to that: US II is not affected, too). [3]: https://lists.debian.org/debian-sparc/2021/12/msg4.html If I find the time and mood I'll try to bisect this US IIIi specific issue in the hope that we will eventually get a fix for it, also still hoping for a fix for [4]. [4]: https://lists.debian.org/debian-sparc/2021/03/msg00045.html Cheers, Frank ## T1000 ## ``` [...] [0.000116] Linux version 5.15.0-3-sparc64-smp (debian-ker...@lists.debian.org) (gcc-11 (Debian 11.2.0-14) 11.2.0, GNU ld (GNU Binutils for Debian) 2.37.90.20220123) #1 SMP Debian 5.15.15-2 (2022-01-30) [...] [ 12.484314] tg3 0001:03:04.0 enP1p3s4f0: Link is up at 1000 Mbps, full duplex [ 12.484520] tg3 0001:03:04.0 enP1p3s4f0: Flow control is on for TX and on for RX [ 12.484689] IPv6: ADDRCONF(NETDEV_CHANGE): enP1p3s4f0: link becomes ready [ 16.765173] Unable to handle kernel paging request at virtual address 6120 [ 16.765384] tsk->{mm,active_mm}->context = 006e [ 16.765493] tsk->{mm,active_mm}->pgd = 800014af [ 16.765650] \|/ \|/ [ 16.765650] "@'/ .. 
\`@" [ 16.765650] /_| \__/ |_\ [ 16.765650] \__U_/ [ 16.765975] nfsmount(374): Oops [#1] [ 16.766167] CPU: 2 PID: 374 Comm: nfsmount Tainted: GE 5.15.0-3-sparc64-smp #1 Debian 5.15.15-2 [ 16.766345] TSTATE: 11001607 TPC: 006a5fe8 TNPC: 006a5fec Y: Tainted: GE [ 16.766642] TPC: [ 16.766704] g0: 8f2e7451 g1: 0004 g2: 6000 g3: 8001fd786000 [ 16.766802] g4: 800014245e80 g5: 8001fd786000 g6: 8f2e4000 g7: 8f2e7c30 [ 16.766983] o0: fffe o1: 006fd714 o2: 2000 o3: 8f2cbaf8 [ 16.767209] o4: 0008 o5: 0cc0 sp: 8f2e7491 ret_pc: 006fd6d4 [ 16.767292] RPC: [ 16.767456] l0: 800014398408 l1: 8001fedeaa00 l2: 00422db4 l3: 00201e00 [ 16.767591] l4: 029c l5: 8001c1a0 l6: 8f2e4000 l7: 006fd660 [ 16.767771] i0: 0cc0 i1: 00201ff0 i2: 0001 i3: 8f2e7dd0 [ 16.767996] i4: i5: 6120 i6: 8f2e7561 i7: 006fd714 [ 16.768079] I7: [ 16.768189] Call Trace: [ 16.768326] [<
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi,

Riccardo Mottola wrote:
> John Paul Adrian Glaubitz wrote:
>>> Not nice. I started compiling some stuff and the box froze, I connected
>>> serial console and could not resume due to Fast Data Access MMU miss"
>> So, this crash occurs with the latest 5.15 kernel on your T2000?
> exactly latest kernel.
>
> I will retest it with stress-ng as soon as I finish this email and copy
> the dmesg errors.
>

wow, running the test suite once or twice, I am able to have the system power-cycle... wow

Frank test latest kernel on yours :)

Riccardo
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi, John Paul Adrian Glaubitz wrote: >> Not nice. I started compiling some stuff and the box froze, I connected >> serial console and could not resume due to Fast Data Access MMU miss" > So, this crash occurs with the latest 5.15 kernel on your T2000? exactly latest kernel. I will retest it with stress-ng as soon as I finish this email and copy the dmesg errors. > In my experience, the most stable kernels on the older SPARCs are still the > 4.19 kernels. Thus, we should start bisecting to find out what commit actually > made the kernel unreliable on these older SPARCs. We must find a good way to test though. I stress-tested the 5.9 kernel further. The system sometimes seemed unresponsive, but eventually recovered (some errors to know more pasted below). Thus I would consider it "stable". I did run several small burst of tests and then a session given of 30m minutes but that due to hiccups lasted more like 2 hours, but afterwards, the machine was still up. sudo stress-ng --all 10 --timeout 30m 10 times means more than physical CPUs, but less than logical cores (32). The system has 16GB of ram, I see some OOMs in dmesg, I wonder if this is due to certain stress tests specifically going against any limit. [16195.300448] Unable to handle kernel NULL pointer dereference in mna handler [16195.741725] 40014fef [16195.741793] at virtual address 00e7 [16195.767936] b416801c [16195.767945] c2592468 [16195.767990] current->{active_,}mm->context = 0bb8 [16195.768848] b818 [16195.768857] 920126c8 [16195.769673] current->{active_,}mm->pgd = 800089cac000 [16195.770413] \|/ \|/ "@'/ .. \`@" /_| \__/ |_\ \__U_/ [16196.30] systemd-journald[219777]: /dev/kmsg buffer overrun, some messages lost. 
[16196.304235] stress-ng(234874): Oops [#864] [16196.304262] CPU: 8 PID: 234874 Comm: stress-ng Tainted: G D E X 5.9.0-5-sparc64-smp #1 Debian 5.9.15-1 [16196.304281] TSTATE: 008811001605 TPC: 0042d8e0 TNPC: 0042d8e4 Y: Tainted: G D E X [16196.304311] TPC: [16196.304327] g0: 0040770c g1: 032f g2: g3: 80010007c000 [16196.304341] g4: 8003f13f9240 g5: 8003fdaa4000 g6: 800087df8000 g7: 4000 [16196.304355] o0: 01ef o1: 032f o2: 800087df8000 o3: 0007 [16196.304368] o4: 0007 o5: fff2 sp: 800087dfb451 ret_pc: 0042d8c4 [16196.304390] RPC: [16196.304404] l0: 030800010304 l1: 0044f0001201 l2: 0040770c l3: [16196.304418] l4: l5: 80010007c000 l6: 800087df8000 l7: 11001002 [16196.304432] i0: 0077 i1: 020f i2: fff2 i3: 800187dfff70 [16196.304445] i4: fff2 i5: 0007 i6: 800087dfb4d1 i7: 0042d6fc [16196.304472] I7: [16205.284863] aes_sparc64: sparc64 aes opcodes not available. [16205.753417] Call Trace: [16205.753453] [<0042d6fc>] do_signal+0x25c/0x560 [16205.753478] [<0042e218>] do_notify_resume+0x58/0xa0 [16205.753500] [<00404b48>] __handle_signal+0xc/0x30 [16205.753525] Caller[0042d6fc]: do_signal+0x25c/0x560 [16205.753546] Caller[0042e218]: do_notify_resume+0x58/0xa0 [16205.753562] Caller[00404b48]: __handle_signal+0xc/0x30 [16205.753575] Caller[0107294c]: 0x107294c [16205.753580] Instruction DUMP: [16205.753587] c029a00d [16205.753595] b4168008 [16205.753602] 900761e8 [16205.753610] [16205.753616] 40014fef [16205.753623] b416801c [16205.753629] c2592468 [16205.753636] b818 [16205.753644] 920126c8 then also these messages. 
I think they explain the "slowness" and apparent freeze of the system - I was about to power-cycle but waited and it recovered: [16253.233924] ata1.00: qc timeout (cmd 0xa0) [16335.213786] PM: hibernation: Basic memory bitmaps created [16830.619976] rcu: INFO: rcu_sched detected stalls on CPUs/tasks: [16830.620193] (detected by 18, t=5252 jiffies, g=711181, q=6) [16830.620215] rcu: All QSes seen, last rcu_sched kthread activity 1191 (4299098242-4299097051), jiffies_till_next_fqs=1, root ->qsmask 0x0 [16830.620491] rcu: rcu_sched kthread starved for 1191 jiffies! g711181 f0x2 RCU_GP_CLEANUP(7) ->state=0x0 ->cpu=30 [16830.620749] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior. [16830.620844] rcu: RCU grace-period kthread stack dump: [16830.621069] task:rcu_sched state:R running task stack: 0 pid: 10 ppid: 2 flags:0x0500 [16830.621095] Call Trace: [16830.621128] [<00bda560>] _cond_resched+0x40/0x60 [16830.621153] [<004ee1d0>] rcu_gp_kthread+0x9b0/0xe40 [16830.621175] [<00491c48>] kthread+0x108/0x120 [16830.621205] [<004060c8>] ret_from_fork+0x1c/0x2c [16830.621224] [<0
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi!

On 1/17/22 14:41, Riccardo Mottola wrote:
>>> The good news is that latest kernel installed seems to boot and takes
>>> all CPUs online. How stable it is I don't know, it needs to be tested.
>>
>> Please run some stress tests such as stress-ng and report back.
>
> Not nice. I started compiling some stuff and the box froze, I connected
> serial console and could not resume due to Fast Data Access MMU miss"

So, this crash occurs with the latest 5.15 kernel on your T2000?

In my experience, the most stable kernels on the older SPARCs are still the 4.19 kernels. Thus, we should start bisecting to find out what commit actually made the kernel unreliable on these older SPARCs.

Adrian

--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-    GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
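Since each bisection step on these machines means building, booting and stress-testing a kernel by hand, it can help to decide "good" or "bad" from the saved serial-console log in a consistent way. The following is only a minimal sketch: the function name `bisect_verdict` is made up, and the two patterns are simply the Oops banners that appear in the logs posted in this thread. The result is meant to be fed manually to `git bisect good`, `git bisect bad` or `git bisect skip`.

```shell
#!/bin/sh
# Classify a saved serial-console log for a manual git bisect step.
# Hypothetical helper: the name and the patterns are assumptions based on
# the Oops output quoted in this thread, not an established tool.
bisect_verdict() {
    log=$1
    # Empty or missing log: nothing to judge, so suggest skipping the commit.
    if [ ! -s "$log" ]; then
        echo skip
        return 0
    fi
    # Any kernel Oops or unhandled fault marks the tested commit as bad.
    if grep -q -e 'Oops \[#' -e 'Unable to handle kernel' "$log"; then
        echo bad
    else
        echo good
    fi
}

# Usage (the log path is an example):
#   git bisect "$(bisect_verdict /tmp/console-5.10.log)"
```

A clean log after a short run does not prove a commit is good, so in practice the stress run would have to be long enough to trigger the failure reliably before trusting a "good" verdict.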
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
I reply to myself. I did run the old 5.9 kernel from debian - which has proven quite stable. I did run the same tests... and I found once error in the console indeed. [ 380.918996] Unable to handle kernel NULL pointer dereference [ 380.919198] tsk->{mm,active_mm}->context = 057d [ 380.919326] tsk->{mm,active_mm}->pgd = 8003f1fd4000 [ 380.919496] \|/ \|/ "@'/ .. \`@" /_| \__/ |_\ \__U_/ [ 380.919510] stress-ng(1529): Oops [#287] [ 380.919536] CPU: 24 PID: 1529 Comm: stress-ng Tainted: G D E X 5.9.0-5-sparc64-smp #1 Debian 5.9.15-1 [ 380.919557] TSTATE: 008811001602 TPC: 0042d8e0 TNPC: 0042d8e4 Y: Tainted: G D E X [ 380.919587] TPC: [ 380.919604] g0: 800100ef7194 g1: 0328 g2: g3: 80010002c000 [ 380.919620] g4: 8003cf6f6b40 g5: 8003fdea4000 g6: 8003cf9cc000 g7: 4000 [ 380.919634] o0: 01e8 o1: 0328 o2: 8003cf9cc000 o3: 0007 [ 380.919650] o4: 0007 o5: fff2 sp: 8003cf9cf451 ret_pc: 0042d8c4 [ 380.919673] RPC: [ 380.919690] l0: 020800010404 l1: 0044f226 l2: 800100ef7194 l3: [ 380.919705] l4: l5: 0005 l6: 8003cf9cc000 l7: 00698c20 [ 380.919719] i0: 0070 i1: 0208 i2: fff2 i3: 8003cf9eff70 [ 380.919732] i4: fff2 i5: i6: 8003cf9cf4d1 i7: 0042d6fc [ 380.919752] I7: [ 380.919760] Call Trace: [ 380.919783] [<0042d6fc>] do_signal+0x25c/0x560 [ 380.919806] [<0042e218>] do_notify_resume+0x58/0xa0 [ 380.919828] [<00404b48>] __handle_signal+0xc/0x30 [ 380.919852] Caller[0042d6fc]: do_signal+0x25c/0x560 [ 380.919874] Caller[0042e218]: do_notify_resume+0x58/0xa0 [ 380.919893] Caller[00404b48]: __handle_signal+0xc/0x30 [ 380.919910] Caller[800100ef716c]: 0x800100ef716c [ 380.919916] Instruction DUMP: [ 380.919923] c029a00d [ 380.919930] b4168008 [ 380.919938] 900761e8 [ 380.919945] [ 380.919952] 40014fef [ 380.919959] b416801c [ 380.919965] c2592468 [ 380.919972] b818 [ 380.919979] 920126c8 [ 380.972358] systemd-journald[66048]: File /var/log/journal/bdb2a41ce825489ba567bea53add247e/system.journal corrupted or uncleanly shut down, renaming and replacing. 
[ 407.494981] systemd[1]: Started Journal Service.

as well as error in the stressors:

stress-ng: info: [12989] stress-ng-fanotify: 148 open, 41 close write, 128 close nowrite, 96 access, 27 modify
stress-ng: info: [12908] stress-ng-fanotify: 159 open, 66 close write, 108 close nowrite, 88 access, 43 modify
stress-ng: info: [12911] stress-ng-fanotify: 147 open, 43 close write, 122 close nowrite, 99 access, 20 modify
stress-ng: info: [13079] stress-ng-fanotify: 159 open, 60 close write, 112 close nowrite, 97 access, 32 modify
stress-ng: info: [12820] stress-ng-fanotify: 155 open, 46 close write, 123 close nowrite, 87 access, 27 modify
stress-ng: info: [913] unsuccessful run completed in 282.58s (4 mins, 42.58 secs)
stress-ng: fail: [913] chattr instance 2 corrupted bogo-ops counter, 48 vs 0
stress-ng: fail: [913] chattr instance 2 hash error in bogo-ops counter and run flag, 1918819509 vs 0
stress-ng: fail: [913] chattr instance 6 corrupted bogo-ops counter, 50 vs 0
stress-ng: fail: [913] chattr instance 6 hash error in bogo-ops counter and run flag, 506138270 vs 0
stress-ng: fail: [913] dnotify instance 4 corrupted bogo-ops counter, 224 vs 0
info: 5 failures reached, aborting stress process
stress-ng: fail: [913] dnotify instance 4 hash error in bogo-ops counter and run flag, 1503783545 vs 0
stress-ng: fail: [913] dnotify instance 6 corrupted bogo-ops counter, 222 vs 0
stress-ng: fail: [913] dnotify instance 6 hash error in bogo-ops counter and run flag, 4199465241 vs 0
stress-ng: fail: [913] metrics-check: stressor metrics corrupted, data is compromised

However the machine did not crash. I did run exactly the same stress command again... and the failures are reproducible, so I suppose maybe the tests are not 64bit big endian safe or such.
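To check whether the same stressors fail on every run, the `fail:` lines can be reduced to a list of stressor names and compared between runs. This is a rough sketch that assumes the log format shown above (one message per line, stressor name in the fourth field, followed by the word "instance"); the function name `failed_stressors` is made up for the example.

```shell
#!/bin/sh
# Print the unique stressor names that reported per-instance failures in a
# saved stress-ng log. Assumes lines of the form
#   stress-ng: fail: [PID] NAME instance N ...
# as seen in this thread; other stress-ng versions may format this differently.
failed_stressors() {
    awk '$1 == "stress-ng:" && $2 == "fail:" && $5 == "instance" { print $4 }' "$1" | sort -u
}

# Usage (file names are examples):
#   failed_stressors run1.log > run1.failed
#   failed_stressors run2.log > run2.failed
#   diff run1.failed run2.failed && echo "same stressors failed in both runs"
```

For the output quoted above this would list `chattr` and `dnotify`, which narrows the follow-up to those two stressors rather than the whole `--all` set.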
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Adrian,

John Paul Adrian Glaubitz wrote:
> Did you forget to create an initrd? After installing the kernel, run:
>
> $ update-initramfs -k KERNEL_VERSION -c

I did not run it this way, will do. I did have one, however, of a very big size:

316M Jan 14 17:15 initrd.img-5.9.0-rc1+

which filled up my /boot. I removed it and regenerated it with your command, but I get dropped into initramfs with no modules found. Hmm..

>> The good news is that latest kernel installed seems to boot and takes
>> all CPUs online. How stable it is I don't know, it needs to be tested.
> Please run some stress tests such as stress-ng and report back.

Not nice. I started compiling some stuff and the box froze, I connected serial console and could not resume due to Fast Data Access MMU miss"

I will now stress things again, but keeping the serial console attached to another computer, and see. Up to last week with the old 5.9 kernel I had no issues compiling even large things such as the Gecko-based ArcticFox or the Linux kernel itself. So if the Fire didn't fail over the weekend, it smells like kernel instability.

What should I use in stress-ng?
I just tried "--all 8 --timeout 120s" and the machine locked up after a little and in the serial console I see: [ 8563.833509] current->{active_,}mm->context = 0fcb [ 8563.833523] current->{active_,}mm->pgd = 8000d35c8000 [ 8563.846347] Unable to handle kernel NULL pointer dereference in mna handler [ 8563.846365] at virtual address 00e7 [ 8563.846380] current->{active_,}mm->context = 0fcc [ 8563.846395] current->{active_,}mm->pgd = 8000d2d3c000 [ 8563.856171] Unable to handle kernel NULL pointer dereference [ 8563.863274] tsk->{mm,active_mm}->context = 0fd2 [ 8563.863294] tsk->{mm,active_mm}->pgd = 8000d3fc [ 8563.928911] Unable to handle kernel NULL pointer dereference in mna handler [ 8563.928935] at virtual address 00e7 [ 8563.928955] current->{active_,}mm->context = 0fde [ 8563.928972] current->{active_,}mm->pgd = 8000d32e8000 [ 8563.952221] Unable to handle kernel NULL pointer dereference in mna handler [ 8563.952244] at virtual address 00e7 [ 8563.952261] current->{active_,}mm->context = 0fe3 [ 8563.952278] current->{active_,}mm->pgd = 8000d2f54000 [ 8563.954004] Unable to handle kernel NULL pointer dereference in mna handler [ 8563.954022] at virtual address 00e7 [ 8563.954037] current->{active_,}mm->context = 0fe5 [ 8563.954053] current->{active_,}mm->pgd = 8000d2d5c000 [ 8563.972643] Unable to handle kernel NULL pointer dereference [ 8563.972660] tsk->{mm,active_mm}->context = 0fea [ 8563.972677] tsk->{mm,active_mm}->pgd = 8000d31300 These are kernel messages, not OF, so it looks like a kernel problem Riccardo
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi!

On 1/14/22 17:58, Riccardo Mottola wrote:
> as Frank asked, I compiled myself a kernel using his latest commit
> identified as good:
> 67e306c6906137020267eb9bbdbc127034da3627
>
> and this kernel works, but then fails to load initramfs.

Did you forget to create an initrd? After installing the kernel, run:

$ update-initramfs -k KERNEL_VERSION -c

> The good news is that latest kernel installed seems to boot and takes
> all CPUs online. How stable it is I don't know, it needs to be tested.

Please run some stress tests such as stress-ng and report back.

Adrian
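Before rebooting into a self-built kernel it may be worth confirming that an initrd for exactly that version exists and has a plausible size. A minimal sketch follows, assuming the `initrd.img-<version>` naming that shows up elsewhere in this thread; the function name `initrd_status` and its arguments are made up for illustration.

```shell
#!/bin/sh
# Report whether an initrd exists for a given kernel version, and its size.
# Hypothetical helper; the second argument (boot directory) defaults to /boot.
initrd_status() {
    version=$1
    boot_dir=${2:-/boot}
    img="$boot_dir/initrd.img-$version"
    if [ -f "$img" ]; then
        # wc -c prints the size in bytes; tr removes padding some wc builds add.
        size=$(wc -c < "$img" | tr -d ' ')
        echo "found $img ($size bytes)"
    else
        echo "missing $img"
    fi
}

# Usage:
#   initrd_status 5.9.0-rc1+
# If it reports "missing", create one as suggested above:
#   update-initramfs -k 5.9.0-rc1+ -c
```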
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi all, as Frank asked, I compiled myself a kernel using his latest commit identified as good: 67e306c6906137020267eb9bbdbc127034da3627 and this kernel works, but then fails to load initramfs. I don't know if the crash was before or after, so if it is a "proof" that it is good or it is not conclusive? The good news is that latest kernel installed seems to boot and takes all CPUs online. How stable it is I don't know, it needs to be tested. Riccardo 5.15.0-2-sparc64-smp #1 SMP Debian 5.15.5-2 (2021-12-18) sparc64 GNU/Linux multix@narya:~$ cat /proc/cpuinfo cpu : UltraSparc T1 (Niagara) fpu : UltraSparc T1 integrated FPU pmu : niagara prom: OBP 4.30.4.d 2011/07/06 14:29 type: sun4v ncpus probed: 32 ncpus active: 32 D$ parity tl1 : 0 I$ parity tl1 : 0 cpucaps : flush,stbar,swap,muldiv,v9,blkinit,mul32,div32,v8plus,ASIBlkInit Cpu0ClkTck : 3b9aca00 Cpu1ClkTck : 3b9aca00 Cpu2ClkTck : 3b9aca00 Cpu3ClkTck : 3b9aca00 Cpu4ClkTck : 3b9aca00 Cpu5ClkTck : 3b9aca00 Cpu6ClkTck : 3b9aca00 Cpu7ClkTck : 3b9aca00 Cpu8ClkTck : 3b9aca00 Cpu9ClkTck : 3b9aca00 Cpu10ClkTck : 3b9aca00 Cpu11ClkTck : 3b9aca00 Cpu12ClkTck : 3b9aca00 Cpu13ClkTck : 3b9aca00 Cpu14ClkTck : 3b9aca00 Cpu15ClkTck : 3b9aca00 Cpu16ClkTck : 3b9aca00 Cpu17ClkTck : 3b9aca00 Cpu18ClkTck : 3b9aca00 Cpu19ClkTck : 3b9aca00 Cpu20ClkTck : 3b9aca00 Cpu21ClkTck : 3b9aca00 Cpu22ClkTck : 3b9aca00 Cpu23ClkTck : 3b9aca00 Cpu24ClkTck : 3b9aca00 Cpu25ClkTck : 3b9aca00 Cpu26ClkTck : 3b9aca00 Cpu27ClkTck : 3b9aca00 Cpu28ClkTck : 3b9aca00 Cpu29ClkTck : 3b9aca00 Cpu30ClkTck : 3b9aca00 Cpu31ClkTck : 3b9aca00 MMU Type: Hypervisor (sun4v) MMU PGSZs : 8K,64K,4MB,256MB State: CPU0: online CPU1: online CPU2: online CPU3: online CPU4: online CPU5: online CPU6: online CPU7: online CPU8: online CPU9: online CPU10: online CPU11: online CPU12: online CPU13: online CPU14: online CPU15: online CPU16: online CPU17: online CPU18: online CPU19: online CPU20: online CPU21: online CPU22: online CPU23: online CPU24: online CPU25: online CPU26: online CPU27: 
online CPU28: online CPU29: online CPU30: online CPU31: online
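Rather than reading 32 "online" lines by eye, the two numbers of interest can be pulled out of the file directly. This is a small sketch that assumes the sparc64 `/proc/cpuinfo` layout shown above (`ncpus active` and one `CPUn: online` line per hardware thread); both function names are made up.

```shell
#!/bin/sh
# Count "CPUn: online" lines in sparc64-style cpuinfo output.
# The file argument defaults to /proc/cpuinfo.
online_cpus() {
    grep -c -E '^CPU[0-9]+:[[:space:]]*online' "${1:-/proc/cpuinfo}"
}

# Print the value of the "ncpus active" field.
active_cpus() {
    awk -F: '/^ncpus active/ { gsub(/[ \t]/, "", $2); print $2 }' "${1:-/proc/cpuinfo}"
}

# Usage:
#   [ "$(online_cpus)" = "$(active_cpus)" ] && echo "all active CPUs are online"
```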
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi guys, On 11.12.21 18:59, John Paul Adrian Glaubitz wrote: On 12/11/21 18:40, Riccardo Mottola wrote: I remember you bisected about the breaking commits. Has there been any progress? A better place where to report this issue other than this mailing list? The proper place is to send an email to the author of the breaking commit and CC the sparclinux Linux kernel mailing list. Most kernel developers don't read the debian-sparc mailing list. We actually did discuss this in late March 2021 starting here: https://lists.debian.org/debian-sparc/2021/03/msg00045.html ...with Christoph Hellwig and CCed to sparcli...@vger.kernel.org and this list, but no solution back then. Back in October I did some testing on various UltraSPARC machines to sort out which processor( generation)s are affected but didn't found the time to make something out of it apart from notes and a conclusion. I couldn't get my Ultra 80 to netboot, so no result for UltraSPARC II. My Ultra 10 with US IIi worked though with kernel 5.14.0-3. My 280r with US III worked with kernel 5.9.0-5 and with 5.14.0-3 gives: ``` Begin: Retrying nfs mount ... mount: Invalid argument done. ``` ...when trying to mount the root FS. My v480 crashes with 5.14.0-3 but it crashed with every kernel version I tried since I own it, so perfectly normal. I don't know what the issue is, because hardware-wise, the - working with 5.9.0-5 - 280r seems to be very similar though with only 2 processors instead of 4 for the V480. My T5220 with T2 crashed once with 5.14.0-3 but worked with 5.14.0-4. It later also worked with 5.14.0-3. And the crash happened way before a mount of the root FS was tried, so possibly unrelated. My T1000 with T1 panics with 5.14.0-3 because it can't mount the root FS. 
Using `break=premount` in the kernel command line and issueing the mount command manually gives; ``` (initramfs) nfsmount -o nolock "172.16.0.2:/srv/nfs/t1000/root" "$rootmnt" [ 641.272949] Unable to handle kernel paging request at virtual address 6120 [ 641.273138] tsk->{mm,active_mm}->context = 038f [ 641.273248] tsk->{mm,active_mm}->pgd = 800016c1c000 [ 641.273310] \|/ \|/ [ 641.273310] "@'/ .. \`@" [ 641.273310] /_| \__/ |_\ [ 641.273310] \__U_/ [ 641.273444] nfsmount(750): Oops [#182] [ 641.273497] CPU: 12 PID: 750 Comm: nfsmount Tainted: G D E 5.14.0-3-sparc64-smp #1 Debian 5.14.12-1 [ 641.273603] TSTATE: 11001607 TPC: 0069ce48 TNPC: 0069ce4c Y: Tainted: G D E [ 641.273705] TPC: [ 641.273775] g0: 0006 g1: 0004 g2: 6000 g3: 8001fda18000 [ 641.273858] g4: 800013b13340 g5: 8001fda18000 g6: 800016bd g7: 800016bd3c30 [ 641.273942] o0: fffe o1: 006f4c94 o2: 2000 o3: 8000146d3aa8 [ 641.274024] o4: 0008 o5: 0cc0 sp: 800016bd34a1 ret_pc: 006f4c54 [ 641.274107] RPC: [ 641.274165] l0: 00f1a000 l1: 0111f000 l2: 00422db4 l3: 00201db0 [ 641.274292] l4: 029c l5: 8001c1a0 l6: 800016bd l7: 006f4be0 [ 641.274377] i0: 0cc0 i1: 00201fe0 i2: 0001 i3: 800016bd3dd0 [ 641.274460] i4: i5: 6120 i6: 800016bd3561 i7: 006f4c94 [ 641.274542] I7: [ 641.274599] Call Trace: [ 641.274640] [<006f4c94>] sys_mount+0xb4/0x1a0 [ 641.274712] [<006f4c54>] sys_mount+0x74/0x1a0 [ 641.274783] [<00406274>] linux_sparc_syscall+0x34/0x44 [ 641.274866] Caller[006f4c94]: sys_mount+0xb4/0x1a0 [ 641.274939] Caller[006f4c54]: sys_mount+0x74/0x1a0 [ 641.275011] Caller[00406274]: linux_sparc_syscall+0x34/0x44 [ 641.275090] Caller[00100aa8]: 0x100aa8 [ 641.275143] Instruction DUMP: [ 641.275150] ba074001 [ 641.275192] bb2f7003 [ 641.275233] ba074002 [ 641.275274] [ 641.275314] 84086001 [ 641.275355] 82007fff [ 641.275395] 8378841d [ 641.275436] ba11 [ 641.275525] c2586008 [ 641.275614] Killed ``` Doing the same on a V210 with US IIIi gives: ``` (initramfs) nfsmount -o nolock "172.16.0.2:/srv/nfs/v210/root" 
"$rootmnt" mount: Invalid argument (initramfs) echo $? 1 ``` ...so similar to 280r with US III. From all that, I assume UltraSPARC IIi driven machines (and most likely also older ones with US II) are not affected by this, as are UltraSPARC T2 driven ones and possibly machines with newer processors (I didn't have time to try one of my T5240s with T2+). UltraSPARC III, IIIi and T1 driven machines are affected and to me it now looks more like some of the klibc programs from the initramfs are at fault. I also tested my V210 with an on-disk root FS and although the mounting seemed to work for that method with 5.14.0-3 I faced multiple problems later on that crashed the machine. M
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
On 12/11/21 18:40, Riccardo Mottola wrote:
> I remember you bisected about the breaking commits. Has there been any
> progress?
> A better place where to report this issue other than this mailing list?

The proper place is to send an email to the author of the breaking commit and CC the sparclinux Linux kernel mailing list. Most kernel developers don't read the debian-sparc mailing list.

Adrian
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Frank,

several months have passed… new kernels came into Debian and they still do not work for me, so let me dig up this matter again. I can continue using 5.9 for now, but for how long?

On 2021-03-11 23:43:10 +0100 Frank Scheiner wrote:
> From [1] I assume T2 CPUs are not affected, but yeah, the issue could
> be that selective that it only affects the very first generation.
>
> [1]: https://lists.debian.org/debian-sparc/2021/03/msg00010.html

Did more people report this issue, perhaps on other systems?

I remember you bisected the breaking commits. Has there been any progress? Is there a better place to report this issue than this mailing list?

Thank you,
Riccardo

--
Sent with GNUMail running on MacOS 10.7
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Anatoly!

Anatoly Pugachev wrote:
> current grub2 version does not support compressed image kernels, do
> the following:
>
> gzip -dc /boot/vmlinuz-5.12.0-rc5+ > /boot/vmlinux-5.12.0-rc5+
> rm /boot/vmlinuz-5.12.0-rc5+
> update-grub
>
> and reboot

oh yes, that was it. Finally, I could boot my own built kernel. Which, of course, crashes as expected. At least I can confirm Frank's findings.

Riccardo
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
On Thu, Apr 1, 2021 at 2:40 PM Riccardo Mottola wrote:
> multix@narya:~/code/linux-stable$ time sudo make install
> sh ./arch/sparc/boot/install.sh 5.12.0-rc5+ arch/sparc/boot/zImage \
> System.map "/boot"
> run-parts: executing /etc/kernel/postinst.d/apt-auto-removal 5.12.0-rc5+
> /boot/vmlinuz-5.12.0-rc5+
> run-parts: executing /etc/kernel/postinst.d/initramfs-tools 5.12.0-rc5+
> /boot/vmlinuz-5.12.0-rc5+
> update-initramfs: Generating /boot/initrd.img-5.12.0-rc5+
> run-parts: executing /etc/kernel/postinst.d/zz-update-grub 5.12.0-rc5+
> /boot/vmlinuz-5.12.0-rc5+
> Generating grub configuration file ...
> Found linux image: /boot/vmlinuz-5.12.0-rc5+
> Found initrd image: /boot/initrd.img-5.12.0-rc5+
> Found linux image: /boot/vmlinuz-5.12.0-rc5+.old
> Found initrd image: /boot/initrd.img-5.12.0-rc5+
> Found linux image: /boot/vmlinux-5.10.0-4-sparc64-smp
> Found initrd image: /boot/initrd.img-5.10.0-4-sparc64-smp
> Found linux image: /boot/vmlinux-5.10.0-trunk-sparc64-smp
> Found initrd image: /boot/initrd.img-5.10.0-trunk-sparc64-smp
> Found linux image: /boot/vmlinux-5.9.0-5-sparc64-smp
> Found initrd image: /boot/initrd.img-5.9.0-5-sparc64-smp
> done
>
> At boot:
>
> Loading Linux 5.12.0-rc5+ ...
> error: premature end of file /vmlinuz-5.12.0-rc5+.
> Loading initial ramdisk ...
> error: you need to load the kernel first.

current grub2 version does not support compressed image kernels, do the following:

gzip -dc /boot/vmlinuz-5.12.0-rc5+ > /boot/vmlinux-5.12.0-rc5+
rm /boot/vmlinuz-5.12.0-rc5+
update-grub

and reboot
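The three commands above can be wrapped so that the compressed image is only removed once decompression has succeeded, which matters on a box where `/boot` has already filled up once. This is a sketch rather than a recommended procedure: the function name `decompress_kernel` and its argument order are made up, and it assumes the image really is plain gzip, as the `gzip -dc` call above implies.

```shell
#!/bin/sh
# Decompress a gzip-compressed kernel image to a new file, leaving the
# original in place. Hypothetical helper, not part of any package.
decompress_kernel() {
    src=$1
    dst=$2
    # gzip -t verifies the archive without writing anything.
    gzip -t "$src" || return 1
    # On failure, remove the partial output so grub never sees a short image.
    gzip -dc "$src" > "$dst" || { rm -f "$dst"; return 1; }
}

# Usage, mirroring the commands quoted above:
#   decompress_kernel /boot/vmlinuz-5.12.0-rc5+ /boot/vmlinux-5.12.0-rc5+ \
#       && rm /boot/vmlinuz-5.12.0-rc5+ \
#       && update-grub
```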
Re: Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Riccardo,

On Thu, Apr 01, 2021 at 01:43:29PM +0200, Riccardo Mottola wrote:
> > Yep, in your kernel config set:
> > CONFIG_SYSTEM_TRUSTED_KEYS=""
>
> thanks, that was it! Now the kernel build great!
> Do I need to do somethings special?
>
> make install
> make modules_install

sorry, don't know. I'm always doing:

make -j bindeb-pkg
dpkg -i ../linux-image*.deb

But that is even slower on weak hardware (e.g. BananaUltra) and the above SHOULD work. The advantage comes when deleting kernels.

> Loading Linux 5.12.0-rc5+ ...
> error: premature end of file /vmlinuz-5.12.0-rc5+.

Somehow your vmlinuz is too short or the loader is not able to put it in memory.

Good luck and greetings
Hermann

--
Administration/Zentrale Dienste, Interdiziplinaeres Zentrum fuer wissenschaftliches Rechnen der Universitaet Heidelberg
IWR; INF 205; 69120 Heidelberg; Tel: (06221)54-14405 Fax: -14427
Email: hermann.la...@iwr.uni-heidelberg.de
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Hermann,

hermann.la...@uni-heidelberg.de wrote:
> Yep, in your kernel config set:
> CONFIG_SYSTEM_TRUSTED_KEYS=""

thanks, that was it! Now the kernel builds great!

Do I need to do something special?

make install
make modules_install

Which shows:

multix@narya:~/code/linux-stable$ time sudo make install
sh ./arch/sparc/boot/install.sh 5.12.0-rc5+ arch/sparc/boot/zImage \
System.map "/boot"
run-parts: executing /etc/kernel/postinst.d/apt-auto-removal 5.12.0-rc5+ /boot/vmlinuz-5.12.0-rc5+
run-parts: executing /etc/kernel/postinst.d/initramfs-tools 5.12.0-rc5+ /boot/vmlinuz-5.12.0-rc5+
update-initramfs: Generating /boot/initrd.img-5.12.0-rc5+
run-parts: executing /etc/kernel/postinst.d/zz-update-grub 5.12.0-rc5+ /boot/vmlinuz-5.12.0-rc5+
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.12.0-rc5+
Found initrd image: /boot/initrd.img-5.12.0-rc5+
Found linux image: /boot/vmlinuz-5.12.0-rc5+.old
Found initrd image: /boot/initrd.img-5.12.0-rc5+
Found linux image: /boot/vmlinux-5.10.0-4-sparc64-smp
Found initrd image: /boot/initrd.img-5.10.0-4-sparc64-smp
Found linux image: /boot/vmlinux-5.10.0-trunk-sparc64-smp
Found initrd image: /boot/initrd.img-5.10.0-trunk-sparc64-smp
Found linux image: /boot/vmlinux-5.9.0-5-sparc64-smp
Found initrd image: /boot/initrd.img-5.9.0-5-sparc64-smp
done

real 33m3.954s
user 28m18.936s
sys 4m36.889s

At boot:

Loading Linux 5.12.0-rc5+ ...
error: premature end of file /vmlinuz-5.12.0-rc5+.
Loading initial ramdisk ...
error: you need to load the kernel first.

It is interesting how certain operations are very slow on this system, since a "single" core is slow, so installing takes longer than on a ... Celeron laptop! It took... 33 minutes?!

Riccardo
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
On Thu, Apr 1, 2021 at 12:59 PM Riccardo Mottola wrote:
> > This seems to only happen when the machines do a long run with high
> > workload and seemingly not when i just power them off again for night
> > with no high workload.
>
> I have a limited experience and can only share that the kernel I
> currently am running on this Fire T2000
>
> Linux narya 5.9.0-5-sparc64-smp #1 SMP Debian 5.9.15-1 (2020-12-17)
> sparc64 GNU/Linux
>
> Is quite stable for me.
> However, i did not try to run for several days compiling, so I don't
> know if it is stable for a long time.

Riccardo, if you would like to check sparc64 kernel stability, you might want to run stress-ng tests, like:

$ ./stress-ng --sequential 4 -v --timeout 3m --metrics-brief

It still successfully kills the latest (git) kernel (5.12.0-rc5) on my sparc64 test LDOM running on a T5-2 hardware server.

But please take stress-ng from the git repo [1], since it has a few recent fixes for sparc that are not yet packaged in Debian.

Thanks.

1. https://github.com/ColinIanKing/stress-ng/
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Connor,

Connor McLaughlan wrote:
> can anyone possible give a list of known stable kernel versions for
> SPARC machines? (is there a difference necessary between
> architectures/old vs. newer machines? sun4u/sun4v)?
>
> Also this instability manifests such that the machine is crashing
> during high workload? (halting? rebooting?)
>
> I ask, because on three different SPARC machines i have been
> experiencing a weird effect when using debian:
> I would start a high compiling load for several days (7-10) where the
> machines are running fine without any apparent error visible in dmesg
> or somewhere else.
> Then when i power off and on again, the filesystem would be corrupt
> and sometimes impossible to repair without reinstallation.
>
> This seems to only happen when the machines do a long run with high
> workload and seemingly not when i just power them off again for night
> with no high workload.

I have limited experience and can only share that the kernel I am currently running on this Fire T2000

Linux narya 5.9.0-5-sparc64-smp #1 SMP Debian 5.9.15-1 (2020-12-17) sparc64 GNU/Linux

is quite stable for me: I compiled under high load (e.g. compiling the Linux kernel on all 32 cores) and synced the git repositories of the Linux kernel and the ArcticFox browser. A git sync of such repositories is, in my experience, a good stress test: I have had disk drivers crash and networks freeze on different architectures and systems. But not in this case.

However, I did not try compiling for several days, so I don't know if it is stable over a long time.

Riccardo
Re: Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Riccardo,

On Sat, Mar 27, 2021 at 01:16:11PM -0600, Stan Johnson wrote:
> > I took the config out of /boot/config of a good kernel, updated it with
> > "make oldconfig"
> >
> > During compilation I see:
> >
> > CC init/init_task.o
> > make[1]: *** No rule to make target
> > 'debian/certs/debian-uefi-certs.pem', needed by
> > 'certs/x509_certificate_list'. Stop.
> > make[1]: *** Waiting for unfinished jobs
> > ...
>
> I think you need to remove all references to debian certs to compile a
> custom kernel.

Yep, in your kernel config set:

CONFIG_SYSTEM_TRUSTED_KEYS=""

Greetings
Hermann

-- 
Administration/Zentrale Dienste, Interdiziplinaeres Zentrum fuer wissenschaftliches Rechnen der Universitaet Heidelberg
IWR; INF 205; 69120 Heidelberg; Tel: (06221)54-14405 Fax: -14427
Email: hermann.la...@iwr.uni-heidelberg.de
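The config change suggested above can also be applied without opening an editor. The following is only a minimal sketch, assuming GNU sed and a `.config` copied from a Debian kernel; the function name `clear_trusted_keys` is made up for illustration and is not part of the kernel tree:

```shell
# Hypothetical helper: blank out CONFIG_SYSTEM_TRUSTED_KEYS in a kernel config
# so the build no longer looks for Debian's certificate file.
clear_trusted_keys() {
    config_file=$1
    # Replace whatever value is set with an empty string (GNU sed in-place edit).
    # All other lines of the config are left untouched.
    sed -i 's|^CONFIG_SYSTEM_TRUSTED_KEYS=.*|CONFIG_SYSTEM_TRUSTED_KEYS=""|' "$config_file"
}

# Example, run from the top of the kernel source tree:
# clear_trusted_keys .config
```

Editing the file by hand gives the same result; the function only saves a step when the config is refreshed repeatedly, e.g. during a bisect.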
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Riccardo,

On 3/26/21 6:21 PM, Riccardo Mottola wrote:
> Hi,
> ...
>
> I cloned linux stable. It took 60 minutes...
>
> I took the config out of /boot/config of a good kernel, updated it with
> "make oldconfig"
>
> During compilation I see:
>
> CC init/init_task.o
> make[1]: *** No rule to make target
> 'debian/certs/debian-uefi-certs.pem', needed by
> 'certs/x509_certificate_list'. Stop.
> make[1]: *** Waiting for unfinished jobs
> ...

I think you need to remove all references to debian certs to compile a custom kernel.

-Stan Johnson
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi,

I was unable to "hack" for some days due to my day job. I have seen that Frank and others have done a great deal. Still, I wanted to try my own compilation, as a first attempt and also to be able to build and check any patches myself.

On 3/11/21 11:56 PM, Gregor Riepl wrote:
> You should clone the upstream Git repo, otherwise bisecting will be much
> more difficult.
> I think these instructions are still valid:
> https://wiki.debian.org/DebianKernel/GitBisect
>
> You can also skip the Debian-specific stuff and simply do
> make -j8 && make modules_install && make install
>
> It's better to use at least a compatible kernel config, though.

I cloned linux stable. It took 60 minutes...

I took the config out of /boot/config of a good kernel and updated it with "make oldconfig".

During compilation I see:

CC init/init_task.o
make[1]: *** No rule to make target 'debian/certs/debian-uefi-certs.pem', needed by 'certs/x509_certificate_list'. Stop.
make[1]: *** Waiting for unfinished jobs

It took 134 minutes to build with -j32. So, compiling is not the strongest point of this CPU, but it is not so bad either.

real 134m55.288s
user 4111m46.186s
sys 145m12.479s

I actually wonder if the kernel is not "overconfigured"? Does building things like nouveau make sense on SPARC? I wonder... maybe sticking a PCI-e card into a Netra or Fire would work?

But I can't install:

multix@narya:~/code/linux-stable$ sudo make modules_install
sed: can't read modules.order: No such file or directory

I wonder if it is related to the error above?

Thanks,
Riccardo
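As a side note on the timing figures above: dividing CPU time (user + sys) by wall-clock time (real) gives a rough idea of how many hardware threads the `-j32` build kept busy on average. A small sketch of that arithmetic, assuming `awk` is available; the helper names `to_seconds` and `busy_threads` are made up for illustration:

```shell
# Convert a `time`-style value such as 134m55.288s into seconds.
to_seconds() {
    # Split on the letters m and s: field 1 is minutes, field 2 is seconds.
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Average number of busy hardware threads: (user + sys) / real.
# Arguments: real user sys, each in the NNmSS.sss format printed by `time`.
busy_threads() {
    awk -v r="$(to_seconds "$1")" -v u="$(to_seconds "$2")" -v s="$(to_seconds "$3")" \
        'BEGIN { printf "%.1f\n", (u + s) / r }'
}

# The figures reported for the T2000 build.
busy_threads 134m55.288s 4111m46.186s 145m12.479s
```

For the reported numbers the ratio comes out a little over 31, which suggests the build ran close to fully parallel on the 32 hardware threads for most of its duration.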
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi,

On 23.03.21 17:30, Connor McLaughlan wrote:
> Hi,
>
> can anyone possible give a list of known stable kernel versions for
> SPARC machines? (is there a difference necessary between
> architectures/old vs. newer machines? sun4u/sun4v)?
>
> Also this instability manifests such that the machine is crashing
> during high workload? (halting? rebooting?)
>
> I ask, because on three different SPARC machines i have been
> experiencing a weird effect when using debian:
> I would start a high compiling load for several days (7-10) where the
> machines are running fine without any apparent error visible in dmesg
> or somewhere else.
> Then when i power off and on again, the filesystem would be corrupt
> and sometimes impossible to repair without reinstallation.

Can you be sure that the disks you use are in full working order? Maybe they have bad sectors and are nearing their end of life, which shows up as these FS errors? I assume that the more accesses your disks see, the more likely a problem is to show up, and the accesses during compile runs could already be too much for your disks. If you have enough RAM, you could try to run your compile jobs in a RAM disk and check if this makes a difference.

> This seems to only happen when the machines do a long run with high
> workload and seemingly not when i just power them off again for night
> with no high workload.

I believe the error this thread is about is unrelated to what you experience on your machines, because the problem discussed here happens early on, when the root FS is to be mounted.

Cheers,
Frank
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi,

can anyone possibly give a list of known stable kernel versions for SPARC machines? (Is a distinction necessary between architectures or between old and newer machines, i.e. sun4u/sun4v?)

Also, does this instability manifest as the machine crashing during high workload? (halting? rebooting?)

I ask because on three different SPARC machines I have been experiencing a weird effect when using Debian: I would start a heavy compile load for several days (7-10), during which the machines run fine without any apparent error visible in dmesg or anywhere else. Then, when I power off and on again, the filesystem would be corrupt and sometimes impossible to repair without reinstallation.

This seems to only happen when the machines do a long run with high workload, and seemingly not when I just power them off for the night after no high workload.

Regards,
Connor

On Tue, Mar 23, 2021 at 4:46 PM Frank Scheiner wrote:
> Hi Jan,
>
> On 23.03.21 16:36, Jan Engelhardt wrote:
> > On Tuesday 2021-03-23 16:29, Frank Scheiner wrote:
> >> ```
> >> [...]
> >> Begin: Retrying nfs mount ... [ 41.753937] NFS: mount program didn't
> >> pass remote address
> >> mount: Invalid argument
> >
> > I seem to recall that NFS is one of those filesystems that (a) makes use of
> > filesystem-specific data, i.e. mount(2)'s 5th argument, and (b) a mount helper,
> > /usr/sbin/mount.nfs.
> >
> > Now, with the change in Linux kernel 028abd9222df0cf5855dab5014a5ebaf06f90565,
> > I am postulating the hypothesis that the fs/nfs/ code for parsing this
> > binary blob is no longer aware that it is being invoked in a compat32 context.
>
> That sounds interesting. Can you perhaps post your hypothesis also in
> this thread:
>
> https://marc.info/?t=16164490063&r=1&w=2
>
> Maybe this gives the kernel developers some ideas.
>
> > Since T2 systems were said to be fine and T1 not, perhaps the T1 systems in
> > question were all on NFS mounts and the T2 one wasn't?
>
> No, the T5220 was also running diskless, actually using the same root FS
> as the T1000 (in form of a btrfs subvolume snapshot) plus identical
> kernel and initramfs:
>
> ```
> root@nfs:/srv/tftp# ls -la $( host2hex t5220 )*
> lrwxrwxrwx 1 root root 35 Feb 28 2018 AC10026E -> boot/grub/sparc64-ieee1275/core.img
> lrwxrwxrwx 1 root root 38 Mar 15 18:16 AC10026E.initrd.img -> initrd.img.5.10.0-4.debian.sid.sparc64
> lrwxrwxrwx 1 root root 36 Mar 15 18:16 AC10026E.vmlinuz -> linux.mp.5.10.0-4.debian.sid.sparc64
> ```
>
> Cheers,
> Frank
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Jan,

On 23.03.21 16:36, Jan Engelhardt wrote:
> On Tuesday 2021-03-23 16:29, Frank Scheiner wrote:
>> ```
>> [...]
>> Begin: Retrying nfs mount ... [ 41.753937] NFS: mount program didn't
>> pass remote address
>> mount: Invalid argument
>
> I seem to recall that NFS is one of those filesystems that (a) makes use of
> filesystem-specific data, i.e. mount(2)'s 5th argument, and (b) a mount helper,
> /usr/sbin/mount.nfs.
>
> Now, with the change in Linux kernel 028abd9222df0cf5855dab5014a5ebaf06f90565,
> I am postulating the hypothesis that the fs/nfs/ code for parsing this
> binary blob is no longer aware that it is being invoked in a compat32 context.

That sounds interesting. Can you perhaps post your hypothesis also in this thread:

https://marc.info/?t=16164490063&r=1&w=2

Maybe this gives the kernel developers some ideas.

> Since T2 systems were said to be fine and T1 not, perhaps the T1 systems in
> question were all on NFS mounts and the T2 one wasn't?

No, the T5220 was also running diskless, actually using the same root FS as the T1000 (in form of a btrfs subvolume snapshot) plus identical kernel and initramfs:

```
root@nfs:/srv/tftp# ls -la $( host2hex t5220 )*
lrwxrwxrwx 1 root root 35 Feb 28 2018 AC10026E -> boot/grub/sparc64-ieee1275/core.img
lrwxrwxrwx 1 root root 38 Mar 15 18:16 AC10026E.initrd.img -> initrd.img.5.10.0-4.debian.sid.sparc64
lrwxrwxrwx 1 root root 36 Mar 15 18:16 AC10026E.vmlinuz -> linux.mp.5.10.0-4.debian.sid.sparc64
```

Cheers,
Frank
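For readers wondering where file names like `AC10026E` come from: they are the machine's IPv4 address written as eight uppercase hex digits, which is the name the firmware asks for over TFTP. The `host2hex` helper used in the listing above is Frank's own and is not shown in the thread; the sketch below is only a guess at the conversion step. It takes a dotted IPv4 address instead of a host name, and the function name `ip_to_hex` is made up for illustration:

```shell
# Hypothetical stand-in for the conversion step of host2hex: turn a dotted
# IPv4 address into the 8-digit uppercase hex name used for the TFTP files.
ip_to_hex() {
    # Split the argument on dots into four octets.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    # Print each octet as two uppercase hex digits.
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}

ip_to_hex 172.16.2.110   # prints AC10026E
```

The same mapping can be checked against the V245 elsewhere in this thread: its address 172.16.2.122 corresponds to the `AC10027A` prefix seen in its kernel command line.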
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
On Tuesday 2021-03-23 16:29, Frank Scheiner wrote:
>> while I was able to "install" correctly using a slightly older ISO, I
>> get not a bootable system. The kernel appears to crash very early during
>> boot.
>
> From my current testing it looks like "UltraSPARC IIIi"s are also
> affected by this problem with UltraSPARC T1s in some way:
>
> With the latest Linux 5.10.x (from Debian) the root FS can't be
> successfully mounted, with the latest Linux 5.9.x (also from Debian) it
> just works fine. Unfortunately the V245 doesn't fail/work for the exact
> same kernels that I tested during the bisecting for the T1000, e.g. the
> first bad commit version that didn't work on the T1000 seems to work on
> the V245 but some good versions don't with:
>
> ```
> [...]
> Begin: Retrying nfs mount ... [ 41.753937] NFS: mount program didn't
> pass remote address
> mount: Invalid argument

I seem to recall that NFS is one of those filesystems that (a) makes use of filesystem-specific data, i.e. mount(2)'s 5th argument, and (b) a mount helper, /usr/sbin/mount.nfs.

Now, with the change in Linux kernel 028abd9222df0cf5855dab5014a5ebaf06f90565, I am postulating the hypothesis that the fs/nfs/ code for parsing this binary blob is no longer aware that it is being invoked in a compat32 context.

Since T2 systems were said to be fine and T1 not, perhaps the T1 systems in question were all on NFS mounts and the T2 one wasn't?
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi all,

On 09.03.21 13:23, Riccardo Mottola wrote:
> Hi all,
>
> while I was able to "install" correctly using a slightly older ISO, I
> get not a bootable system. The kernel appears to crash very early during
> boot.
>
> Anybody else has this issue?
>
> Booting `Debian GNU/Linux'
>
> Loading Linux 5.10.0-4-sparc64-smp ...
> Loading initial ramdisk ...

From my current testing it looks like "UltraSPARC IIIi"s are also affected in some way by this UltraSPARC T1 problem:

With the latest Linux 5.10.x (from Debian) the root FS can't be successfully mounted, with the latest Linux 5.9.x (also from Debian) it just works fine. Unfortunately the V245 doesn't fail/work for the exact same kernels that I tested during the bisecting for the T1000, e.g. the first bad commit version that didn't work on the T1000 seems to work on the V245, but some good versions don't, failing with:

```
[...]
Begin: Retrying nfs mount ... [ 41.753937] NFS: mount program didn't pass remote address
mount: Invalid argument
done.
[...]
```

I'm unsure what could go wrong here, as I always pass the remote address via the kernel command line:

```
[...]
[2.928512] Kernel command line: BOOT_IMAGE=(tftp)/AC10027A.vmlinux root=/dev/nfs ip=172.16.2.122:172.16.0.2:172.16.0.1:255.255.0.0:v245-2:enp9s4f0:off nfsroot=172.16.0.2:/srv/nfs/v245-2/root nfsrootdebug rw
[...]
```

Maybe there is some breakage in the klibc based programs in the initramfs, but why it doesn't affect both UltraSPARC IIIi and T1 in the same way is somewhat strange.

Cheers,
Frank
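To make the command line above easier to read: the `ip=` parameter is a colon-separated list whose field order, per the kernel's nfsroot documentation, is client address, server address, gateway, netmask, host name, network device and autoconfiguration method. The sketch below merely splits the value into labelled fields; the function name `parse_ip_param` is made up for illustration and this is not how the kernel or the initramfs scripts parse it:

```shell
# Split a kernel "ip=" parameter into named fields, one per line.
# Field order as in the kernel's nfsroot documentation:
# client:server:gateway:netmask:hostname:device:autoconf
parse_ip_param() {
    # Strip the leading "ip=" and split the rest on colons.
    echo "${1#ip=}" | {
        IFS=: read -r client server gateway netmask hostname device autoconf
        printf 'client=%s\nserver=%s\ngateway=%s\nnetmask=%s\nhostname=%s\ndevice=%s\nautoconf=%s\n' \
            "$client" "$server" "$gateway" "$netmask" "$hostname" "$device" "$autoconf"
    }
}

parse_ip_param 'ip=172.16.2.122:172.16.0.2:172.16.0.1:255.255.0.0:v245-2:enp9s4f0:off'
```

Laid out like this it is easy to see that the server field and the address in front of the path in `nfsroot=` are both 172.16.0.2, so the remote address is indeed present on the command line.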
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Adrian,

On 17.03.21 13:39, John Paul Adrian Glaubitz wrote:
> On 3/17/21 1:22 PM, Frank Scheiner wrote:
>> ```
>> johndoe@x4270:~/git-projects/torvalds/linux$ git bisect bad
>> 028abd9222df0cf5855dab5014a5ebaf06f90565 is the first bad commit
>> [...]
>
> Did you verify this by reverting the commit or, if reverting is not
> possible, by testing the revision just before it?

I did not yet revert the bad commit in a current kernel and test it, but from my understanding the parent commit of the first bad one must have been a good one and indeed, [67e306c6906137020267eb9bbdbc127034da3627] is the parent of [028abd9222df0cf5855dab5014a5ebaf06f90565] and was working for me on my T1000:

```
johndoe@x4270:~/git-projects/torvalds/linux$ git bisect log
[...]
# good: [67e306c6906137020267eb9bbdbc127034da3627] fs,nfs: lift compat nfs4 mount data handling into the nfs code
git bisect good 67e306c6906137020267eb9bbdbc127034da3627
# bad: [028abd9222df0cf5855dab5014a5ebaf06f90565] fs: remove compat_sys_mount
git bisect bad 028abd9222df0cf5855dab5014a5ebaf06f90565
# first bad commit: [028abd9222df0cf5855dab5014a5ebaf06f90565] fs: remove compat_sys_mount
```

[67e306c6906137020267eb9bbdbc127034da3627]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=67e306c6906137020267eb9bbdbc127034da3627
[028abd9222df0cf5855dab5014a5ebaf06f90565]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=028abd9222df0cf5855dab5014a5ebaf06f90565

> Just to be sure you found the correct commit.
>
> If that has been verified, please report the issue to the sparclinux LKML
> and CC Christoph.

Will do that soon-ish but maybe also try to revert that commit in Debian's 5.10.0-4 and test it for additional assurance (then not so soon-ish - maybe this weekend). I'll put you and Riccardo in CC, too.

Hopefully this will be easier to fix than the kernel breakage on the rx2800 i2 - assuming that problem is still there ([1], [2]).

[1]: https://marc.info/?l=linux-ia64&m=156114769908890&w=2
[2]: https://marc.info/?l=linux-ia64&m=156144480821712&w=2

Cheers and thanks for the pointers,
Frank
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Frank!

On 3/17/21 1:22 PM, Frank Scheiner wrote:
> Hi Adrian, Riccardo
>
> so I'm finished with bisecting and it points to the following commit as
> first bad commit:
>
> ```
> johndoe@x4270:~/git-projects/torvalds/linux$ git bisect bad
> 028abd9222df0cf5855dab5014a5ebaf06f90565 is the first bad commit
> commit 028abd9222df0cf5855dab5014a5ebaf06f90565
> Author: Christoph Hellwig
> Date: Thu Sep 17 10:22:34 2020 +0200
>
> fs: remove compat_sys_mount
>
> compat_sys_mount is identical to the regular sys_mount now, so
> remove it
> and use the native version everywhere.

Did you verify this by reverting the commit or, if reverting is not possible, by testing the revision just before it? Just to be sure you found the correct commit.

If that has been verified, please report the issue to the sparclinux LKML and CC Christoph.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-    GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Adrian, Riccardo so I'm finished with bisecting and it points to the following commit as first bad commit: ``` johndoe@x4270:~/git-projects/torvalds/linux$ git bisect bad 028abd9222df0cf5855dab5014a5ebaf06f90565 is the first bad commit commit 028abd9222df0cf5855dab5014a5ebaf06f90565 Author: Christoph Hellwig Date: Thu Sep 17 10:22:34 2020 +0200 fs: remove compat_sys_mount compat_sys_mount is identical to the regular sys_mount now, so remove it and use the native version everywhere. Signed-off-by: Christoph Hellwig Signed-off-by: Al Viro arch/arm64/include/asm/unistd32.h | 2 +- arch/mips/kernel/syscalls/syscall_n32.tbl | 2 +- arch/mips/kernel/syscalls/syscall_o32.tbl | 2 +- arch/parisc/kernel/syscalls/syscall.tbl| 2 +- arch/powerpc/kernel/syscalls/syscall.tbl | 2 +- arch/s390/kernel/syscalls/syscall.tbl | 2 +- arch/sparc/kernel/syscalls/syscall.tbl | 2 +- arch/x86/entry/syscalls/syscall_32.tbl | 2 +- fs/Makefile| 1 - fs/compat.c| 57 -- fs/internal.h | 3 -- fs/namespace.c | 4 +- include/linux/compat.h | 6 --- include/uapi/asm-generic/unistd.h | 2 +- tools/include/uapi/asm-generic/unistd.h| 2 +- tools/perf/arch/powerpc/entry/syscalls/syscall.tbl | 2 +- tools/perf/arch/s390/entry/syscalls/syscall.tbl| 2 +- 17 files changed, 14 insertions(+), 81 deletions(-) delete mode 100644 fs/compat.c ``` Seems to be indeed related to mounting (the root FS). Why it only affects UltraSPARC T1 CPUs is another question. I don't have any other UltraSPARC II, IIi, IIe, III and IIIi driven machines at hand now for checking those. So what now? Cheers, Frank P.S. 
Here's the log for reference: ``` johndoe@x4270:~/git-projects/torvalds/linux$ git bisect log git bisect start # good: [bbf5c979011a099af5dc76498918ed7df445635b] Linux 5.9 git bisect good bbf5c979011a099af5dc76498918ed7df445635b # bad: [3650b228f83adda7e5ee532e2b90429c03f7b9ec] Linux 5.10-rc1 git bisect bad 3650b228f83adda7e5ee532e2b90429c03f7b9ec # bad: [c48b75b7271db23c1b2d1204d6e8496d91f27711] Merge tag 'sound-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound git bisect bad c48b75b7271db23c1b2d1204d6e8496d91f27711 # bad: [7fafb54c7d390e9b273a1d7d377e38d9c408046e] Merge tag 'leds-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/pavel/linux-leds git bisect bad 7fafb54c7d390e9b273a1d7d377e38d9c408046e # bad: [fd5c32d80884268a381ed0e67cccef0b3d37750b] Merge tag 'media/v5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media git bisect bad fd5c32d80884268a381ed0e67cccef0b3d37750b # bad: [865c50e1d279671728c2936cb7680eb89355eeea] x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT git bisect bad 865c50e1d279671728c2936cb7680eb89355eeea # good: [13cb73490f475f8e7669f9288be0bcfa85399b1f] Merge tag 'x86-entry-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect good 13cb73490f475f8e7669f9288be0bcfa85399b1f # good: [dd502a81077a5f3b3e19fa9a1accffdcab5ad5bc] Merge tag 'core-static_call-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect good dd502a81077a5f3b3e19fa9a1accffdcab5ad5bc # good: [ced3a9eb3cd0d07462cdbaa8a0f3d46e5aaeadec] Merge tag 'ia64_for_5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux git bisect good ced3a9eb3cd0d07462cdbaa8a0f3d46e5aaeadec # good: [fc67d5bc876b6b224538c8848fc02e70f269ec99] Documentation/admin-guide: README & svga: remove use of "rdev" git bisect good fc67d5bc876b6b224538c8848fc02e70f269ec99 # good: [c90578360c92c71189308ebc71087197080e94c3] Merge branch 'work.csum_and_copy' of 
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs git bisect good c90578360c92c71189308ebc71087197080e94c3 # good: [85ed13e78dbedf9433115a62c85429922bc5035c] Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs git bisect good 85ed13e78dbedf9433115a62c85429922bc5035c # bad: [22230cd2c55bd27ee2c3a3def97c0d5577a75b82] Merge branch 'compat.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs git bisect bad 22230cd2c55bd27ee2c3a3def97c0d5577a75b82 # good: [e18afa5bfa4a2f0e07b0864370485df701dacbc1] Merge branch 'work.quota-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs git bisect good e18afa5bfa4a2f0e07b0864370485df701dacbc1 # good: [67e306c6906137020267eb9bbdbc127034da3627] fs,nfs: lift compat nfs4 mount data handling into the nfs code git bisect good 67e306c6906137020267eb9bbdbc127034da3627 # bad: [028abd9222df0cf5855dab5014a5ebaf06f90565] fs: remove compat_sys_mount git bisect bad 028abd9222df0cf5855dab5014a5ebaf06f90565 # first bad commit: [028abd9222df0cf5855dab5014a5ebaf06f90565] fs: remov
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Adrian,

On 16.03.21 14:27, John Paul Adrian Glaubitz wrote:
> Hello Frank!
>
> On 3/16/21 2:07 PM, Frank Scheiner wrote:
>> After a first cross compile run, I can confirm that 5.10-rc1 is also
>> broken on my T1000. I'll take this version (parent commit:
>> 33def8498fdde180023444b08e12b72a9efed41d) as "bad". But taking v5.9 as
>> good means more than 5000 commits in between. Linus's tree doesn't
>> contain v5.9.16 or at least I didn't find it there. How can I get "good"
>> closer to "bad"? I don't want to check too many good versions if I know
>> that v5.9.16 most likely will be good, as v5.9.15 (5.9.0-5 on Debian) is
>> good? Should I switch to the stable kernel sources from GKH?
>
> I'm not sure I understand your problem here. The bisecting algorithm
> has a runtime O(ln(n)), so even with 5000 commits, it will converge
> quite quickly.

Yeah, you're right, I think I make this error every time I try to bisect the kernel - i.e. once every two years... ;-)

> Just make sure you are using a fast machine when compiling the kernel as
> otherwise it won't be fun.

Other topic: As the compile runs actually take less time than the preparation of the test boot (copy over modules to T1000 root FS, boot T1000 with working kernel, create initramfs, reboot with kernel in question and that initramfs), is there a way to create the initramfs (for sparc64) on the cross compile host (amd64)?

Cheers,
Frank
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hello Frank!

On 3/16/21 2:07 PM, Frank Scheiner wrote:
> After a first cross compile run, I can confirm that 5.10-rc1 is also
> broken on my T1000. I'll take this version (parent commit:
> 33def8498fdde180023444b08e12b72a9efed41d) as "bad". But taking v5.9 as
> good means more than 5000 commits in between. Linus's tree doesn't
> contain v5.9.16 or at least I didn't find it there. How can I get "good"
> closer to "bad"? I don't want to check too many good versions if I know
> that v5.9.16 most likely will be good, as v5.9.15 (5.9.0-5 on Debian) is
> good? Should I switch to the stable kernel sources from GKH?

I'm not sure I understand your problem here. The bisecting algorithm has a runtime O(ln(n)), so even with 5000 commits, it will converge quite quickly.

Just make sure you are using a fast machine when compiling the kernel as otherwise it won't be fun.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-    GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
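To put a number on "converge quite quickly": each bisect step roughly halves the range of candidate commits, so the number of kernels to build and boot grows with the base-2 logarithm of the range. The sketch below counts those halvings for a linear range; the function name `bisect_steps` is made up for illustration, and a real `git bisect` over a history with merges may need a step or two more or less:

```shell
# Estimate how many test builds a bisect needs for a given number of commits.
bisect_steps() {
    commits=$1
    steps=0
    # Each test halves the remaining range (rounding up) until one commit is left.
    while [ "$commits" -gt 1 ]; do
        commits=$(( (commits + 1) / 2 ))
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

bisect_steps 5000   # prints 13
```

So a range of about 5000 commits needs on the order of 13 builds, and doubling the range adds only one more.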
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi again,

On 16.03.21 14:07, Frank Scheiner wrote:
> @Adrian:
> After a first cross compile run, I can confirm that 5.10-rc1 is also
> broken on my T1000. I'll take this version (parent commit:
> 33def8498fdde180023444b08e12b72a9efed41d) as "bad". But taking v5.9 as
> good means more than 5000 commits in between. Linus's tree doesn't
> contain v5.9.16 or at least I didn't find it there. How can I get "good"
> closer to "bad"? I don't want to check too many good versions if I know
> that v5.9.16 most likely will be good, as v5.9.15 (5.9.0-5 on Debian) is
> good? Should I switch to the stable kernel sources from GKH?

Forget about that, [1] shows 5000+ commits between v5.9.16 and v5.10-rc1, too. So no difference.

[1]: https://github.com/gregkh/linux/compare/v5.9.16...v5.10-rc1

Cheers,
Frank
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Riccardo, Adrian,

so I did some testing yesterday and also see your problem on my T1000. Because of some kernel command line misconfiguration, my machine at first couldn't find its root FS as it tried to use a non-existent NIC. This led to a lot of kernel oopses (I assume at least one per hardware thread) that looked very similar to the ones you see. And this happens even with "working" kernels (tested 4.19.x and 5.9.x). So the actual result of that problem in 5.10.x seems to be that the kernel can't find its root FS.

On 11.03.21 23:43, Frank Scheiner wrote:
> On 11.03.21 23:03, Riccardo Mottola wrote:
>> I suppose the Niagara CPU gives the kernel issue
>
> From [1] I assume T2 CPUs are not affected, but yeah, the issue could be
> that selective that it only affects the very first generation.
>
> [1]: https://lists.debian.org/debian-sparc/2021/03/msg00010.html

I can also indeed confirm that this problem only affects the T1 CPU, as my T5220 with T2 CPU works w/o problems with kernel 5.10.x.

I didn't get any further yesterday as it took a lot of time to update the root FSes of my T1000 and my X4270 - my intended machine for cross compilation, not sure if it will be "fast" enough*. In addition cloning Linus's linux tree alone took a lot of time (about an hour).

* it will:

```
## with config of Debian's 5.9.0-5 kernel as `.config`
$ make ARCH=sparc64 CROSS_COMPILE=sparc64-linux-gnu- olddefconfig
[...]
## with lsmod output from T1000
$ make ARCH=sparc64 CROSS_COMPILE=sparc64-linux-gnu- LSMOD=$HOME/t1000-lsmod localmodconfig
[...]
$ time make -j16 ARCH=sparc64 CROSS_COMPILE=sparc64-linux-gnu- all
[...]
kernel: arch/sparc/boot/zImage is ready

real    3m12.264s
user    42m5.325s
sys     3m27.843s
```

@Adrian:
After a first cross compile run, I can confirm that 5.10-rc1 is also broken on my T1000. I'll take this version (parent commit: 33def8498fdde180023444b08e12b72a9efed41d) as "bad". But taking v5.9 as good means more than 5000 commits in between. Linus's tree doesn't contain v5.9.16 or at least I didn't find it there. How can I get "good" closer to "bad"? I don't want to check too many good versions if I know that v5.9.16 most likely will be good, as v5.9.15 (5.9.0-5 on Debian) is good? Should I switch to the stable kernel sources from GKH?

Cheers,
Frank
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
On Thursday 2021-03-11 23:43, Frank Scheiner wrote:
>> Do you know if I can via serial-console reset the system?
>
> Reset from the serial console might work via the kernel with the [magic
> system request] functionality.
>
> [magic system request]:
> https://www.kernel.org/doc/html/v4.11/admin-guide/sysrq.html
>
> But you can always reset the system using the SC. The T1000 (and the
> T2000, too) has both serial (on T2000 right of the DB-9 ttya port,
> should work with a blue Cisco serial cable) and network port (on T2000
> above the two USB ports). The serial port of the SC automatically
> switches to the system console after some (configurable) time

SER MGT is an RS232-ish serial line, just with an RJ-45 connector for size. Once the SC has finished booting, system console is the default mode. Since SER has no notion of connections, it should stay in whatever mode it was left in. Maybe there is an autoswitch, but I never observed it (though I would not want to wait many minutes just to observe it either).

For NET MGT, when you start a new SSH connection, it always starts out in system console mode and #. is needed.

>> I tried sending a break on the serial console, but the errors just keep
>> running.
>> Break is received, since I see it as SC Alert, but I am not put into the
>> console, maybe there is some further trick on these newer machine?
>
> So you already got access to the SC. Then you can reset the machine from
> there, too.

Because NET does not have an equivalent of the serial pin traditionally used to signal "break", a synthetic break can be issued from the SC. But it's a bit awkward, because you immediately need to go back into system console mode to type the desired sysrq character.

sc> break
confirm (y/n)y
sc> console
confirm (y/n)y
type <>
Linux kernel: ah yes I received SYSRQ-s

>> I am
>> used to old SparcStations and UltraSparc Netras, where it was sufficient.
>> It is inconvenient at every hang to power-cycle, since at every turn on,
>> it runs a self-test which lasts minutes :)
>
> I think depending on the SC configuration, these machines also run a
> self-test for every X resets, but this should be configurable.

It's the first thing you want to turn off as a private user.

diag_trigger none
and probably
diag_mode off
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
> How should I proceed? Which kernel sources?
>
> https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official
>
> is 4.3 correct for me? 4.6 ?

You should clone the upstream Git repo, otherwise bisecting will be much more difficult.
I think these instructions are still valid:
https://wiki.debian.org/DebianKernel/GitBisect

You can also skip the Debian-specific stuff and simply do
make -j8 && make modules_install && make install

It's better to use at least a compatible kernel config, though.
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Riccardo,

On 11.03.21 23:03, Riccardo Mottola wrote:
> Hi Frank!
>
> I suppose the Niagara CPU gives the kernel issue

From [1] I assume T2 CPUs are not affected, but yeah, the issue could be that selective that it only affects the very first generation.

[1]: https://lists.debian.org/debian-sparc/2021/03/msg00010.html

> Frank Scheiner wrote:
>>> If I remember there was a repository with many snapshots of different
>>> versions, already as package, which one can test quickly. That way we
>>> can restrict breakage range without git bisect.
>>
>> Do you have a link? I assume you mean "http://snapshot.debian.org".
>
> Exactly. With this I did some more tests.
>
> Still Works:
> 5.9.0-4-sparc64-smp #1 SMP Debian 5.9.11-1 (2020-11-27)
> 5.9.0-5-sparc64-smp #1 SMP Debian 5.9.15-1 (2020-12-17)
>
> Broken:
> linux-image-5.10.0-trunk-sparc64-smp_5.10.2-1~exp1_sparc64.deb
>
> So later series 5.9 series continue to work and even very early 5.10 do not
>
> Do you know if I can via serial-console reset the system?

Reset from the serial console might work via the kernel with the [magic system request] functionality.

[magic system request]: https://www.kernel.org/doc/html/v4.11/admin-guide/sysrq.html

But you can always reset the system using the SC. The T1000 (and the T2000, too) has both serial (on T2000 right of the DB-9 ttya port, should work with a blue Cisco serial cable) and network port (on T2000 above the two USB ports). The serial port of the SC automatically switches to the system console after some (configurable) time and you need to escape to the SC login prompt with a configurable key sequence (`#.` by default, see [2]).

[2]: https://docs.oracle.com/cd/E19076-01/t2k.srvr/819-2549-12/ontario-consoleConfig.html#28277

> I tried sending a break on the serial console, but the errors just keep
> running.
> Break is received, since I see it as SC Alert, but I am not put into the
> console, maybe there is some further trick on these newer machine?

So you already got access to the SC. Then you can reset the machine from there, too.

> I am used to old SparcStations and UltraSparc Netras, where it was
> sufficient.
> It is inconvenient at every hang to power-cycle, since at every turn on,
> it runs a self-test which lasts minutes :)

I think depending on the SC configuration, these machines also run a self-test for every X resets, but this should be configurable.

Hope that helps

Cheers,
Frank
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
> Do you know if I can via serial-console reset the system?
> I tried sending a break on the serial console, but the errors just keep
> running.
> Break is received, since I see it as SC Alert, but I am not put into the
> console, maybe there is some further trick on these newer machine? I am
> used to old SparcStations and UltraSparc Netras, where it was sufficient.
> It is inconvenient at every hang to power-cycle, since at every turn on,
> it runs a self-test which lasts minutes :)

According to this, you should be able to reach the system console through the SER MGT port:
https://unixed.com/index.php/2013/06/16/accessing-the-sparc-system-console/

NET MGT is probably easier, but you'll have to set it up first.

Perhaps you can also attach a USB keyboard and press the break key to get into the system console, then type "reset" to boot the machine? Not sure if this works without a monitor though. And you might need to enter the system password first, if it's set.
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Adrian

John Paul Adrian Glaubitz wrote:
> Well, that doesn't really help you though. You want to find the commit
> in question, just the range isn't enough to solve the issue.

Well, it helped a little bit: it is something early in the 5.10 series. Also, I now have an apparently working kernel from the 5.9 series (who knows how stable under load?).

> If you have a fast second machine available, bisecting the problem
> shouldn't take too long.

Well, this machine has plenty of RAM, disk space and a good connection. How fast the CPU is at compiling a kernel I don't know, but we can try.

Power consumption is not so much worse than a PC, but it is darn loud! Like a vacuum cleaner... I need to stay out of the room, but I found an acceptable setup: I use a workstation with a serial console connected to it, then connect through ssh to the workstation and through that into the management.

Although I have been compiling kernels on Gentoo Linux for 15 years, I never did on Debian. Here we have init images.

How should I proceed? Which kernel sources?

https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official

is 4.3 correct for me? 4.6 ?

Please guide me

Riccardo
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Frank!

I suppose the Niagara CPU gives the kernel issue.

Frank Scheiner wrote:
>> If I remember there was a repository with many snapshots of different
>> versions, already as package, which one can test quickly. That way we can
>> restrict breakage range without git bisect. Do you have a link?
>
> I assume you mean "http://snapshot.debian.org".

Exactly. With this I did some more tests.

Still works:
5.9.0-4-sparc64-smp #1 SMP Debian 5.9.11-1 (2020-11-27)
5.9.0-5-sparc64-smp #1 SMP Debian 5.9.15-1 (2020-12-17)

Broken:
linux-image-5.10.0-trunk-sparc64-smp_5.10.2-1~exp1_sparc64.deb

So the later 5.9 series kernels continue to work, and even a very early 5.10 does not.

Do you know if I can via serial-console reset the system? I tried sending a break on the serial console, but the errors just keep running. Break is received, since I see it as SC Alert, but I am not put into the console, maybe there is some further trick on these newer machine? I am used to old SparcStations and UltraSparc Netras, where it was sufficient. It is inconvenient at every hang to power-cycle, since at every turn on, it runs a self-test which lasts minutes :)

Riccardo
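
The narrowing-down described in this message (testing prebuilt packages between a known-good and a known-bad version) amounts to a binary search over an ordered list of package versions. The following is only a minimal sketch in Python: the function name `narrow_breakage` and the `boots_ok` callback are hypothetical, the version list is taken from the results reported in this thread, and the lookup table merely stands in for actually installing and booting each package. It also assumes that once a version is broken, all later ones are broken too.

```python
from typing import Callable, Sequence, Tuple


def narrow_breakage(
    versions: Sequence[str], boots_ok: Callable[[str], bool]
) -> Tuple[str, str]:
    """Return (last_good, first_bad) from an oldest-to-newest version list.

    Assumes versions[0] is known good, versions[-1] is known bad, and that
    every version after the first bad one is bad as well.
    """
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if boots_ok(versions[mid]):
            good = mid  # breakage was introduced after this version
        else:
            bad = mid  # breakage was introduced at or before this version
    return versions[good], versions[bad]


# Debian package versions mentioned in this thread, oldest to newest.
VERSIONS = ["5.9.6-1", "5.9.11-1", "5.9.15-1", "5.10.2-1~exp1", "5.10.19-1"]

# Results as reported in the thread; a stand-in for a real boot test.
REPORTED = {
    "5.9.6-1": True,
    "5.9.11-1": True,
    "5.9.15-1": True,
    "5.10.2-1~exp1": False,
    "5.10.19-1": False,
}

last_good, first_bad = narrow_breakage(VERSIONS, lambda v: REPORTED[v])
print(f"last good: {last_good}, first bad: {first_bad}")
```

With the results reported so far this only confirms the 5.9.15-1 / 5.10.2-1~exp1 boundary. As noted elsewhere in the thread, identifying the offending commit would still need a git bisect within that range.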
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
On 3/10/21 10:17 AM, Riccardo Mottola wrote:
> If I remember there was a repository with many snapshots of different
> versions, already as package, which one can test quickly. That way we can
> restrict breakage range without git bisect.

Well, that doesn't really help you though. You want to find the commit in question, just the range isn't enough to solve the issue.

If you have a fast second machine available, bisecting the problem shouldn't take too long.

Adrian

--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913
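
For a rough feel of how long such a bisect might take: a bisection over N commits needs roughly log2(N) build-and-boot cycles. The sketch below is only a back-of-the-envelope estimate in Python. Both the commit count and the minutes per cycle are hypothetical placeholders, not figures from this thread, and would need to be replaced with the real size of the commit range and the measured build time on the machine used.

```python
import math


def bisect_steps(n_commits: int) -> int:
    """Approximate number of test cycles a bisection over n_commits needs."""
    if n_commits <= 1:
        return 0
    return math.ceil(math.log2(n_commits))


def estimated_hours(n_commits: int, minutes_per_step: float) -> float:
    """Total time if every cycle (build, install, boot, test) takes the same."""
    return bisect_steps(n_commits) * minutes_per_step / 60


# Hypothetical placeholders, NOT measured values from this thread.
HYPOTHETICAL_COMMITS = 16000
HYPOTHETICAL_MINUTES_PER_STEP = 45

steps = bisect_steps(HYPOTHETICAL_COMMITS)
hours = estimated_hours(HYPOTHETICAL_COMMITS, HYPOTHETICAL_MINUTES_PER_STEP)
print(f"about {steps} steps, about {hours:.1f} hours")
```

The step count grows only logarithmically with the size of the range, so the time per build dominates the total. That is why a fast second machine for building helps.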
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Riccardo,

On 10.03.21 10:17, Riccardo Mottola wrote:
> Frank Scheiner wrote:
>>> We have an older UltraSPARC IIIi that has issues with newer kernels, but
>>> usually only after longer operation and the issue might be related to the
>>> bug that was just fixed recently by Rob Gardner.
>>
>> Which kernel version will have this bug (which one?) fixed, 5.11.x? I
>> can also check with one of my UltraSPARC IIIi powered systems, too, next
>> week.
>
> as written in the title, I have issues with:
> 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1

I know.

> If I remember there was a repository with many snapshots of different
> versions, already as package, which one can test quickly. That way we can
> restrict breakage range without git bisect. Do you have a link?

I assume you mean "http://snapshot.debian.org".

Cheers,
Frank
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi Frank,

Frank Scheiner wrote:
>> We have an older UltraSPARC IIIi that has issues with newer kernels, but
>> usually only after longer operation and the issue might be related to the
>> bug that was just fixed recently by Rob Gardner.
>
> Which kernel version will have this bug (which one?) fixed, 5.11.x? I
> can also check with one of my UltraSPARC IIIi powered systems, too, next
> week.

As written in the title, I have issues with:
5.10.0-4-sparc64-smp #1 Debian 5.10.19-1

If I remember, there was a repository with many snapshots of different versions, already as package, which one can test quickly. That way we can restrict breakage range without git bisect. Do you have a link?

Riccardo
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
On 3/9/21 11:20 PM, John Paul Adrian Glaubitz wrote:
>> Which kernel version will have this bug (which one?) fixed, 5.11.x? I
>> can also check with one of my UltraSPARC IIIi powered systems, too, next
>> week.
>
> I have not uploaded that kernel yet, I have it built locally, PR here [1].

The patch is now in Linus' tree so it will be part of 5.12 [1].

Adrian

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e5e8b80d352ec999d2bba3ea584f541c83f4ca3f

--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
On 3/9/21 10:18 PM, Frank Scheiner wrote:
>> The oldest buildd we are running is a T5120 and that's a T2.
>
> And these don't show the problems Riccardo's T1 powered T2000 has?

No, the machine runs stable.

>> We have an older UltraSPARC IIIi that has issues with newer kernels, but
>> usually only after longer operation and the issue might be related to the
>> bug that was just fixed recently by Rob Gardner.
>
> Which kernel version will have this bug (which one?) fixed, 5.11.x? I
> can also check with one of my UltraSPARC IIIi powered systems, too, next
> week.

I have not uploaded that kernel yet, I have it built locally, PR here [1].

Adrian

[1] https://salsa.debian.org/kernel-team/linux/-/merge_requests/339

--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
On 09.03.21 22:09, John Paul Adrian Glaubitz wrote:
> On 3/9/21 9:38 PM, Frank Scheiner wrote:
>> I have a T1000 with which I could try to reproduce Riccardo's issues.
>> Hardware wise they should be pretty similar. As the T1000 doesn't have a
>> CDROM, I'll try to netboot a few newer kernels and report my findings.
>> Will take me until next week though, as the machine is in (cold) storage
>> now.
>>
>> @Adrian:
>> Aren't there some build servers using UltraSPARC T2 or T2+? Do they run
>> with the latest kernels?
>
> The oldest buildd we are running is a T5120 and that's a T2.

And these don't show the problems Riccardo's T1 powered T2000 has?

> We have an older UltraSPARC IIIi that has issues with newer kernels, but
> usually only after longer operation and the issue might be related to the
> bug that was just fixed recently by Rob Gardner.

Which kernel version will have this bug (which one?) fixed, 5.11.x? I can also check with one of my UltraSPARC IIIi powered systems, too, next week.

Cheers,
Frank
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
On 3/9/21 9:38 PM, Frank Scheiner wrote:
> I have a T1000 with which I could try to reproduce Riccardo's issues.
> Hardware wise they should be pretty similar. As the T1000 doesn't have a
> CDROM, I'll try to netboot a few newer kernels and report my findings.
> Will take me until next week though, as the machine is in (cold) storage
> now.
>
> @Adrian:
> Aren't there some build servers using UltraSPARC T2 or T2+? Do they run
> with the latest kernels?

The oldest buildd we are running is a T5120 and that's a T2.

We have an older UltraSPARC IIIi that has issues with newer kernels, but usually only after longer operation and the issue might be related to the bug that was just fixed recently by Rob Gardner.

Adrian

--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi guys,

On 09.03.21 18:31, John Paul Adrian Glaubitz wrote:
> Hi!
>
> On 3/9/21 6:26 PM, Riccardo Mottola wrote:
>> John Paul Adrian Glaubitz wrote:
>>>> while I was able to "install" correctly using a slightly older ISO, I
>>>> get not a bootable system. The kernel appears to crash very early
>>>> during boot.
>>> I think this is more likely a hardware issue. We haven't seen any
>>> machines crashing that early. Please make sure the RAM modules in this
>>> machine are working properly.
>>
>> I don't think so... I think it is a Kernel issue, since with kernel
>> 5.9.0-2-sparc64-smp #1 SMP Debian 5.9.6-1 (2020-11-08) sparc64 GNU/Linux
>>
>> the machine is performing fine with network, disk and compiler usage on
>> all 32 CPUs.
>
> Then you need to bisect the kernel as I don't have any means to reproduce
> the issue.

I have a T1000 with which I could try to reproduce Riccardo's issues. Hardware wise they should be pretty similar. As the T1000 doesn't have a CDROM, I'll try to netboot a few newer kernels and report my findings. Will take me until next week though, as the machine is in (cold) storage now.

@Adrian:
Aren't there some build servers using UltraSPARC T2 or T2+? Do they run with the latest kernels?

Cheers,
Frank
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi!

On 3/9/21 6:26 PM, Riccardo Mottola wrote:
> John Paul Adrian Glaubitz wrote:
>>> while I was able to "install" correctly using a slightly older ISO, I get
>>> not a bootable system. The kernel appears to crash very early during boot.
>> I think this is more likely a hardware issue. We haven't seen any machines
>> crashing that early. Please make sure the RAM modules in this machine are
>> working properly.
>
> I don't think so... I think it is a Kernel issue, since with kernel
> 5.9.0-2-sparc64-smp #1 SMP Debian 5.9.6-1 (2020-11-08) sparc64 GNU/Linux
>
> the machine is performing fine with network, disk and compiler usage on all
> 32 CPUs.

Then you need to bisect the kernel as I don't have any means to reproduce the issue.

Adrian

--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hi,

John Paul Adrian Glaubitz wrote:
>> while I was able to "install" correctly using a slightly older ISO, I get
>> not a bootable system. The kernel appears to crash very early during boot.
> I think this is more likely a hardware issue. We haven't seen any machines
> crashing that early. Please make sure the RAM modules in this machine are
> working properly.

I don't think so... I think it is a Kernel issue, since with kernel

5.9.0-2-sparc64-smp #1 SMP Debian 5.9.6-1 (2020-11-08) sparc64 GNU/Linux

the machine is performing fine with network, disk and compiler usage on all 32 CPUs.

I tried a heavy load of parallel compilations, using git on large repositories as well as using remote X applications at the same time, a combination I know tends to show issues on systems, without problems! Not a single error in syslog. Machine power-up and self-tests are fine too.

If I remember, there is a repository of various pre-compiled kernel versions: maybe there are some releases between the two kernels I can try and do some easy rough bisecting.

So I'd say RAM, CPUs, disk and Ethernet are working quite fine.

Riccardo
Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000
Hello Riccardo!

On 3/9/21 1:23 PM, Riccardo Mottola wrote:
> while I was able to "install" correctly using a slightly older ISO, I get not
> a bootable system. The kernel appears to crash very early during boot.

I think this is more likely a hardware issue. We haven't seen any machines crashing that early. Please make sure the RAM modules in this machine are working properly.

Adrian

--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913