Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2022-02-02 Thread Frank Scheiner

Hi Riccardo, all,

On 17.01.22 21:35, Riccardo Mottola wrote:
> Hi,
>
> Riccardo Mottola wrote:
>> John Paul Adrian Glaubitz wrote:
>>>> Not nice. I started compiling some stuff and the box froze, I connected
>>>> serial console and could not resume due to Fast Data Access MMU miss"
>>> So, this crash occurs with the latest 5.15 kernel on your T2000?
>> exactly latest kernel.
>>
>> I will retest it with stress-ng as soon as I finish this email and copy
>> the dmesg errors.
>
> wow, running the test suite once or twice, I am able to have the system
> power-cycle... wow
>
> Frank test latest kernel on yours :)


Yesterday I found the time to give Linux 5.15.0-3 a try on my T1000
(UltraSPARC T1) and V210 (US IIIi), but the boot issue is still there -
at least for my use case: the klibc-based tools inside the initramfs
are not able to mount the root FS over NFS (details further below).

But it's still good to see that mounting an on-disk root FS seems to
work now for your T2000, though the instabilities during runtime are not
reassuring.

For me the last good Debian kernel - at least for booting, more on that
shortly - is 5.9.0-5. Both T1000 and V210 boot fine with it (incl.
mounting the root FS via NFS(v3 BTW)). But during operation (tested with
`apt upgrade` on a root FS replicated multiple times for testing from
the same tarball) the V210 crashes (=> kernel panic), the T1000 does
not. For the V210 I also see that for 5.8.0-3. Doing the same with
kernel 4.19.0-5 running on the V210, no problems are seen, not even the
messages below.

The crash when running 5.9.0-5 or 5.8.0-3 is usually "announced" (or at
least accompanied) by one or more occurrence(s) of the following messages:
```
[...]
[  360.489852] CPU[0]: Cheetah+ D-cache parity error at
TPC[005b28c8]
[  360.580300] TPC
[...]
```
...which should be familiar to UltraSPARC IIIi users running newer kernels
(see for example [1], which shows it for 4.16.x). According to [2] this
error should be recoverable (otherwise it would be followed by a panic
and "Irrecoverable Cheetah+ parity error."). Recovery seems to work for
a while, until it no longer does. But since I never see that second
message, something else must be going on.

[1]: https://www.spinics.net/lists/sparclinux/msg21019.html

[2]:
https://github.com/torvalds/linux/blob/master/arch/sparc/kernel/traps_64.c#L1767..L1799

Of course our CPUs' caches don't magically go pop. There is something
broken in the newer kernels (> 4.19.x) for UltraSPARC IIIi (and most
likely all the other related processors, too), apart from the mounting
issues for NFS (see [3] for the processors affected by this; one update
to that: US II is not affected either).

[3]: https://lists.debian.org/debian-sparc/2021/12/msg4.html

If I find the time and mood I'll try to bisect this US IIIi specific
issue in the hope that we will eventually get a fix for it, also still
hoping for a fix for [4].

[4]: https://lists.debian.org/debian-sparc/2021/03/msg00045.html

Cheers,
Frank



## T1000 ##

```
[...]
[0.000116] Linux version 5.15.0-3-sparc64-smp
(debian-ker...@lists.debian.org) (gcc-11 (Debian 11.2.0-14) 11.2.0, GNU
ld (GNU Binutils for Debian) 2.37.90.20220123) #1 SMP Debian 5.15.15-2
(2022-01-30)
[...]
[   12.484314] tg3 0001:03:04.0 enP1p3s4f0: Link is up at 1000 Mbps,
full duplex
[   12.484520] tg3 0001:03:04.0 enP1p3s4f0: Flow control is on for TX
and on for RX
[   12.484689] IPv6: ADDRCONF(NETDEV_CHANGE): enP1p3s4f0: link becomes ready
[   16.765173] Unable to handle kernel paging request at virtual address
6120
[   16.765384] tsk->{mm,active_mm}->context = 006e
[   16.765493] tsk->{mm,active_mm}->pgd = 800014af
[   16.765650]   \|/  \|/
[   16.765650]   "@'/ .. \`@"
[   16.765650]   /_| \__/ |_\
[   16.765650]  \__U_/
[   16.765975] nfsmount(374): Oops [#1]
[   16.766167] CPU: 2 PID: 374 Comm: nfsmount Tainted: GE
  5.15.0-3-sparc64-smp #1  Debian 5.15.15-2
[   16.766345] TSTATE: 11001607 TPC: 006a5fe8 TNPC:
006a5fec Y: Tainted: GE
[   16.766642] TPC: 
[   16.766704] g0: 8f2e7451 g1: 0004 g2:
6000 g3: 8001fd786000
[   16.766802] g4: 800014245e80 g5: 8001fd786000 g6:
8f2e4000 g7: 8f2e7c30
[   16.766983] o0: fffe o1: 006fd714 o2:
2000 o3: 8f2cbaf8
[   16.767209] o4: 0008 o5: 0cc0 sp:
8f2e7491 ret_pc: 006fd6d4
[   16.767292] RPC: 
[   16.767456] l0: 800014398408 l1: 8001fedeaa00 l2:
00422db4 l3: 00201e00
[   16.767591] l4: 029c l5: 8001c1a0 l6:
8f2e4000 l7: 006fd660
[   16.767771] i0: 0cc0 i1: 00201ff0 i2:
0001 i3: 8f2e7dd0
[   16.767996] i4:  i5: 6120 i6:
8f2e7561 i7: 006fd714
[   16.768079] I7: 
[   16.768189] Call Trace:
[   16.768326] [<
[...]
```

Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2022-01-17 Thread Riccardo Mottola
Hi,


Riccardo Mottola wrote:
> John Paul Adrian Glaubitz wrote:
>>> Not nice. I started compiling some stuff and the box froze, I connected
>>> serial console and could not resume due to Fast Data Access MMU miss"
>> So, this crash occurs with the latest 5.15 kernel on your T2000?
> exactly latest kernel.
> 
> I will retest it with stress-ng as soon as I finish this email and copy
> the dmesg errors.
> 


wow, running the test suite once or twice, I am able to have the system
power-cycle... wow

Frank test latest kernel on yours :)

Riccardo



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2022-01-17 Thread Riccardo Mottola
Hi,

John Paul Adrian Glaubitz wrote:
>> Not nice. I started compiling some stuff and the box froze, I connected
>> serial console and could not resume due to Fast Data Access MMU miss"
> So, this crash occurs with the latest 5.15 kernel on your T2000?

exactly latest kernel.

I will retest it with stress-ng as soon as I finish this email and copy
the dmesg errors.

> In my experience, the most stable kernels on the older SPARCs are still the
> 4.19 kernels. Thus, we should start bisecting to find out what commit actually
> made the kernel unreliable on these older SPARCs.


We must find a good way to test, though. I stress-tested the 5.9 kernel
further. The system sometimes seemed unresponsive, but eventually
recovered (some of the errors are pasted below for reference). Thus I
would consider it "stable". I ran several short bursts of tests and then
a session configured for 30 minutes which, due to hiccups, lasted more
like 2 hours; afterwards, the machine was still up.

 sudo stress-ng --all 10 --timeout 30m

10 instances is more than the number of physical cores, but fewer than
the logical CPUs (32). The system has 16 GB of RAM. I see some OOMs in
dmesg; I wonder if this is due to certain stress tests deliberately
pushing against whatever limits exist.
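To relate the stress-ng instance count to the number of CPUs the kernel actually has online, something like the following could be used. This is only a sketch: the function name `count_online` is made up for illustration, and it assumes the sparc64 `/proc/cpuinfo` layout with one `CPUn: online` line per logical CPU, as in the cpuinfo dump quoted elsewhere in this thread.

```sh
#!/bin/sh
# count_online: count the "CPUn: online" lines in sparc64-style
# /proc/cpuinfo text read from stdin (hypothetical helper name).
count_online() {
    grep -E -c '^CPU[0-9]+:[[:space:]]*online'
}

# Possible usage on the machine itself:
#   count_online < /proc/cpuinfo
#   sudo stress-ng --all 10 --timeout 30m
```

On a machine with 32 logical CPUs online, an instance count of 10 stays well below that number, which matches the reasoning above.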

[16195.300448] Unable to handle kernel NULL pointer dereference in mna
handler
[16195.741725]  40014fef
[16195.741793]  at virtual address 00e7
[16195.767936]  b416801c
[16195.767945]  c2592468
[16195.767990] current->{active_,}mm->context = 0bb8
[16195.768848]  b818
[16195.768857]  920126c8
[16195.769673] current->{active_,}mm->pgd = 800089cac000

[16195.770413]   \|/  \|/
 "@'/ .. \`@"
 /_| \__/ |_\
\__U_/
[16196.30] systemd-journald[219777]: /dev/kmsg buffer overrun, some
messages lost.
[16196.304235] stress-ng(234874): Oops [#864]
[16196.304262] CPU: 8 PID: 234874 Comm: stress-ng Tainted: G  D
E  X  5.9.0-5-sparc64-smp #1 Debian 5.9.15-1
[16196.304281] TSTATE: 008811001605 TPC: 0042d8e0 TNPC:
0042d8e4 Y: Tainted: G  D E  X
[16196.304311] TPC: 
[16196.304327] g0: 0040770c g1: 032f g2:
 g3: 80010007c000
[16196.304341] g4: 8003f13f9240 g5: 8003fdaa4000 g6:
800087df8000 g7: 4000
[16196.304355] o0: 01ef o1: 032f o2:
800087df8000 o3: 0007
[16196.304368] o4: 0007 o5: fff2 sp:
800087dfb451 ret_pc: 0042d8c4
[16196.304390] RPC: 
[16196.304404] l0: 030800010304 l1: 0044f0001201 l2:
0040770c l3: 
[16196.304418] l4:  l5: 80010007c000 l6:
800087df8000 l7: 11001002
[16196.304432] i0: 0077 i1: 020f i2:
fff2 i3: 800187dfff70
[16196.304445] i4: fff2 i5: 0007 i6:
800087dfb4d1 i7: 0042d6fc
[16196.304472] I7: 
[16205.284863] aes_sparc64: sparc64 aes opcodes not available.
[16205.753417] Call Trace:
[16205.753453] [<0042d6fc>] do_signal+0x25c/0x560
[16205.753478] [<0042e218>] do_notify_resume+0x58/0xa0
[16205.753500] [<00404b48>] __handle_signal+0xc/0x30
[16205.753525] Caller[0042d6fc]: do_signal+0x25c/0x560
[16205.753546] Caller[0042e218]: do_notify_resume+0x58/0xa0
[16205.753562] Caller[00404b48]: __handle_signal+0xc/0x30
[16205.753575] Caller[0107294c]: 0x107294c
[16205.753580] Instruction DUMP:
[16205.753587]  c029a00d
[16205.753595]  b4168008
[16205.753602]  900761e8
[16205.753610] 
[16205.753616]  40014fef
[16205.753623]  b416801c
[16205.753629]  c2592468
[16205.753636]  b818
[16205.753644]  920126c8


Then there were also these messages. I think they explain the "slowness"
and apparent freeze of the system. I was about to power-cycle, but waited,
and it recovered:

[16253.233924] ata1.00: qc timeout (cmd 0xa0)
[16335.213786] PM: hibernation: Basic memory bitmaps created
[16830.619976] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16830.620193]  (detected by 18, t=5252 jiffies, g=711181, q=6)
[16830.620215] rcu: All QSes seen, last rcu_sched kthread activity 1191
(4299098242-4299097051), jiffies_till_next_fqs=1, root ->qsmask 0x0
[16830.620491] rcu: rcu_sched kthread starved for 1191 jiffies! g711181
f0x2 RCU_GP_CLEANUP(7) ->state=0x0 ->cpu=30
[16830.620749] rcu: Unless rcu_sched kthread gets sufficient CPU
time, OOM is now expected behavior.
[16830.620844] rcu: RCU grace-period kthread stack dump:
[16830.621069] task:rcu_sched   state:R  running task stack:
0 pid:   10 ppid: 2 flags:0x0500
[16830.621095] Call Trace:
[16830.621128] [<00bda560>] _cond_resched+0x40/0x60
[16830.621153] [<004ee1d0>] rcu_gp_kthread+0x9b0/0xe40
[16830.621175] [<00491c48>] kthread+0x108/0x120
[16830.621205] [<004060c8>] ret_from_fork+0x1c/0x2c
[16830.621224] [<0
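The RCU stall messages above report durations in jiffies (`t=5252 jiffies`, `starved for 1191 jiffies`). Converting them to wall-clock time needs the kernel tick rate HZ, which is not stated anywhere in this thread. The sketch below assumes HZ=250 purely for illustration; the real value can be looked up as `CONFIG_HZ` in the kernel config. The function name `jiffies_to_ms` is made up.

```sh
#!/bin/sh
# jiffies_to_ms JIFFIES [HZ]: convert a jiffies count to milliseconds.
# HZ defaults to 250, which is an ASSUMPTION, not a value taken from
# the logs; check CONFIG_HZ in /boot/config-$(uname -r).
jiffies_to_ms() {
    echo $(( $1 * 1000 / ${2:-250} ))
}

jiffies_to_ms 5252   # stall duration reported by rcu_sched
jiffies_to_ms 1191   # time the rcu_sched kthread was starved
```

Under that assumption the stall lasted about 21 seconds, which would line up with the usual 21-second RCU CPU stall timeout, and the grace-period kthread was starved for just under 5 seconds.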

Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2022-01-17 Thread John Paul Adrian Glaubitz
Hi!

On 1/17/22 14:41, Riccardo Mottola wrote:
>>> The good news is that latest kernel installed seems to boot and takes
>>> all CPUs online. How stable it is I don't know, it needs to be tested.
>>
>> Please run some stress tests such as stress-ng and report back.
> 
> Not nice. I started compiling some stuff and the box froze, I connected
> serial console and could not resume due to Fast Data Access MMU miss"

So, this crash occurs with the latest 5.15 kernel on your T2000?

In my experience, the most stable kernels on the older SPARCs are still the
4.19 kernels. Thus, we should start bisecting to find out what commit actually
made the kernel unreliable on these older SPARCs.
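A bisect on these machines is necessarily a manual loop (build, install, reboot, stress, judge), but the "judge" step can be made repeatable. The following is only a sketch: the helper name `classify_log`, the log file name, and the choice of `v4.19` and `v5.15` as endpoints are assumptions drawn from this discussion, not a tested procedure. What counts as "bad" is also a judgment call, since some kernels oops and still recover.

```sh
#!/bin/sh
# classify_log FILE: print "bad" if the saved console/dmesg log contains
# one of the failure signatures quoted in this thread, else "good".
classify_log() {
    if grep -E -q 'Oops \[#[0-9]+]|Unable to handle kernel|Kernel panic' "$1"; then
        echo bad
    else
        echo good
    fi
}

# Manual bisect loop, run inside a kernel git tree (endpoints assumed):
#   git bisect start
#   git bisect bad v5.15
#   git bisect good v4.19
#   ... build, install, boot, run the stress test, save dmesg to log.txt ...
#   git bisect "$(classify_log log.txt)"
#   ... repeat until git names the first bad commit, then: git bisect reset
```

Using the same stress command and the same signatures for every step keeps the good/bad decisions comparable across the whole bisect.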

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2022-01-17 Thread Riccardo Mottola
Replying to myself.

I ran the old 5.9 kernel from Debian, which has proven quite stable.
I ran the same tests... and I did indeed find one error on the console.


[  380.918996] Unable to handle kernel NULL pointer dereference
[  380.919198] tsk->{mm,active_mm}->context = 057d
[  380.919326] tsk->{mm,active_mm}->pgd = 8003f1fd4000
[  380.919496]   \|/  \|/
 "@'/ .. \`@"
 /_| \__/ |_\
\__U_/
[  380.919510] stress-ng(1529): Oops [#287]
[  380.919536] CPU: 24 PID: 1529 Comm: stress-ng Tainted: G  D E
 X  5.9.0-5-sparc64-smp #1 Debian 5.9.15-1
[  380.919557] TSTATE: 008811001602 TPC: 0042d8e0 TNPC:
0042d8e4 Y: Tainted: G  D E  X
[  380.919587] TPC: 
[  380.919604] g0: 800100ef7194 g1: 0328 g2:
 g3: 80010002c000
[  380.919620] g4: 8003cf6f6b40 g5: 8003fdea4000 g6:
8003cf9cc000 g7: 4000
[  380.919634] o0: 01e8 o1: 0328 o2:
8003cf9cc000 o3: 0007
[  380.919650] o4: 0007 o5: fff2 sp:
8003cf9cf451 ret_pc: 0042d8c4
[  380.919673] RPC: 
[  380.919690] l0: 020800010404 l1: 0044f226 l2:
800100ef7194 l3: 
[  380.919705] l4:  l5: 0005 l6:
8003cf9cc000 l7: 00698c20
[  380.919719] i0: 0070 i1: 0208 i2:
fff2 i3: 8003cf9eff70
[  380.919732] i4: fff2 i5:  i6:
8003cf9cf4d1 i7: 0042d6fc
[  380.919752] I7: 
[  380.919760] Call Trace:
[  380.919783] [<0042d6fc>] do_signal+0x25c/0x560
[  380.919806] [<0042e218>] do_notify_resume+0x58/0xa0
[  380.919828] [<00404b48>] __handle_signal+0xc/0x30
[  380.919852] Caller[0042d6fc]: do_signal+0x25c/0x560
[  380.919874] Caller[0042e218]: do_notify_resume+0x58/0xa0
[  380.919893] Caller[00404b48]: __handle_signal+0xc/0x30
[  380.919910] Caller[800100ef716c]: 0x800100ef716c
[  380.919916] Instruction DUMP:
[  380.919923]  c029a00d
[  380.919930]  b4168008
[  380.919938]  900761e8
[  380.919945] 
[  380.919952]  40014fef
[  380.919959]  b416801c
[  380.919965]  c2592468
[  380.919972]  b818
[  380.919979]  920126c8

[  380.972358] systemd-journald[66048]: File
/var/log/journal/bdb2a41ce825489ba567bea53add247e/system.journal
corrupted or uncleanly shut down, renaming and replacing.
[  407.494981] systemd[1]: Started Journal Service.


as well as errors in the stressors:
stress-ng: info:  [12989] stress-ng-fanotify: 148 open, 41 close write,
128 close nowrite, 96 access, 27 modify
stress-ng: info:  [12908] stress-ng-fanotify: 159 open, 66 close write,
108 close nowrite, 88 access, 43 modify
stress-ng: info:  [12911] stress-ng-fanotify: 147 open, 43 close write,
122 close nowrite, 99 access, 20 modify
stress-ng: info:  [13079] stress-ng-fanotify: 159 open, 60 close write,
112 close nowrite, 97 access, 32 modify
stress-ng: info:  [12820] stress-ng-fanotify: 155 open, 46 close write,
123 close nowrite, 87 access, 27 modify
stress-ng: info:  [913] unsuccessful run completed in 282.58s (4 mins,
42.58 secs)
stress-ng: fail:  [913] chattr instance 2 corrupted bogo-ops counter, 48
vs 0
stress-ng: fail:  [913] chattr instance 2 hash error in bogo-ops counter
and run flag, 1918819509 vs 0
stress-ng: fail:  [913] chattr instance 6 corrupted bogo-ops counter, 50
vs 0
stress-ng: fail:  [913] chattr instance 6 hash error in bogo-ops counter
and run flag, 506138270 vs 0
stress-ng: fail:  [913] dnotify instance 4 corrupted bogo-ops counter,
224 vs 0
info: 5 failures reached, aborting stress process
stress-ng: fail:  [913] dnotify instance 4 hash error in bogo-ops
counter and run flag, 1503783545 vs 0
stress-ng: fail:  [913] dnotify instance 6 corrupted bogo-ops counter,
222 vs 0
stress-ng: fail:  [913] dnotify instance 6 hash error in bogo-ops
counter and run flag, 4199465241 vs 0
stress-ng: fail:  [913] metrics-check: stressor metrics corrupted, data
is compromised


However, the machine did not crash.
I ran exactly the same stress command again... and the failures are
reproducible, so I suppose the tests may not be 64-bit big-endian safe,
or something along those lines.
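To check whether the same stressors fail on every run, the `fail:` lines can be tallied per stressor and compared between runs. A minimal sketch, assuming the stress-ng output was saved to a file with unwrapped lines; the function name `count_failures` is hypothetical:

```sh
#!/bin/sh
# count_failures: read stress-ng output on stdin and print
# "<stressor> <number of fail lines>", sorted by stressor name.
# The stressor name is the 4th whitespace-separated field of lines
# such as "stress-ng: fail:  [913] chattr instance 2 corrupted ...".
count_failures() {
    awk '/^stress-ng: fail:/ { n[$4]++ } END { for (k in n) print k, n[k] }' | sort
}

# Possible usage:
#   sudo stress-ng --all 10 --timeout 30m 2>&1 | tee run1.log
#   count_failures < run1.log
```

If two runs produce the same tally, that supports the idea that the failures come from the tests themselves rather than from random corruption.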



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2022-01-17 Thread Riccardo Mottola
Hi Adrian,

John Paul Adrian Glaubitz wrote:
> Did you forget to create an initrd? After installing the kernel, run:
> 
> $ update-initramfs -k KERNEL_VERSION -c

I did not run it this way, will do.

I did have one, however, and a very big one:
316M Jan 14 17:15 initrd.img-5.9.0-rc1+

which filled up my /boot.

I removed it and regenerated it with your command, but I get dropped into
the initramfs with no modules found. Hmm...

> 
>> The good news is that latest kernel installed seems to boot and takes
>> all CPUs online. How stable it is I don't know, it needs to be tested.
> Please run some stress tests such as stress-ng and report back.

Not nice. I started compiling some stuff and the box froze. I connected
the serial console and could not resume due to a "Fast Data Access MMU
miss".

I will now stress things again, but with the serial console attached to
another computer, and see.

Up to last week, with the old 5.9 kernel, I had no issues compiling even
large things such as the Gecko-based ArcticFox or the Linux kernel itself.
So unless the Fire's hardware failed over the weekend, it smells like
kernel instability.

What should I use in stress-ng? I just tried "--all 8 --timeout 120s"

and the machine locked up after a little while; on the serial console I see:

[ 8563.833509] current->{active_,}mm->context = 0fcb

[ 8563.833523] current->{active_,}mm->pgd = 8000d35c8000

[ 8563.846347] Unable to handle kernel NULL pointer dereference in mna
handler
[ 8563.846365]  at virtual address 00e7

[ 8563.846380] current->{active_,}mm->context = 0fcc

[ 8563.846395] current->{active_,}mm->pgd = 8000d2d3c000

[ 8563.856171] Unable to handle kernel NULL pointer dereference

[ 8563.863274] tsk->{mm,active_mm}->context = 0fd2

[ 8563.863294] tsk->{mm,active_mm}->pgd = 8000d3fc

[ 8563.928911] Unable to handle kernel NULL pointer dereference in mna
handler
[ 8563.928935]  at virtual address 00e7

[ 8563.928955] current->{active_,}mm->context = 0fde

[ 8563.928972] current->{active_,}mm->pgd = 8000d32e8000

[ 8563.952221] Unable to handle kernel NULL pointer dereference in mna
handler
[ 8563.952244]  at virtual address 00e7

[ 8563.952261] current->{active_,}mm->context = 0fe3

[ 8563.952278] current->{active_,}mm->pgd = 8000d2f54000

[ 8563.954004] Unable to handle kernel NULL pointer dereference in mna
handler
[ 8563.954022]  at virtual address 00e7

[ 8563.954037] current->{active_,}mm->context = 0fe5

[ 8563.954053] current->{active_,}mm->pgd = 8000d2d5c000

[ 8563.972643] Unable to handle kernel NULL pointer dereference

[ 8563.972660] tsk->{mm,active_mm}->context = 0fea

[ 8563.972677] tsk->{mm,active_mm}->pgd = 8000d31300

These are kernel messages, not Open Firmware (OF) ones, so it looks like
a kernel problem.

Riccardo



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2022-01-14 Thread John Paul Adrian Glaubitz
Hi!

On 1/14/22 17:58, Riccardo Mottola wrote:
> as Frank asked, I compiled myself a kernel using his latest commit
> identified as good:
> 67e306c6906137020267eb9bbdbc127034da3627
> 
> and this kernel works, but then fails to load initramfs.

Did you forget to create an initrd? After installing the kernel, run:

$ update-initramfs -k KERNEL_VERSION -c

> The good news is that latest kernel installed seems to boot and takes
> all CPUs online. How stable it is I don't know, it needs to be tested.

Please run some stress tests such as stress-ng and report back.

Adrian




Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2022-01-14 Thread Riccardo Mottola
Hi all,

as Frank asked, I compiled a kernel myself at the latest commit he
identified as good:
67e306c6906137020267eb9bbdbc127034da3627

This kernel works, but then fails to load the initramfs.

I don't know whether the crash would have occurred before or after that
point, so is this "proof" that the commit is good, or is it inconclusive?

The good news is that the latest installed kernel seems to boot and brings
all CPUs online. How stable it is I don't know; it needs to be tested.

Riccardo

5.15.0-2-sparc64-smp #1 SMP Debian 5.15.5-2 (2021-12-18) sparc64 GNU/Linux

multix@narya:~$ cat /proc/cpuinfo
cpu : UltraSparc T1 (Niagara)
fpu : UltraSparc T1 integrated FPU
pmu : niagara
prom: OBP 4.30.4.d 2011/07/06 14:29
type: sun4v
ncpus probed: 32
ncpus active: 32
D$ parity tl1   : 0
I$ parity tl1   : 0
cpucaps :
flush,stbar,swap,muldiv,v9,blkinit,mul32,div32,v8plus,ASIBlkInit
Cpu0ClkTck  : 3b9aca00
Cpu1ClkTck  : 3b9aca00
Cpu2ClkTck  : 3b9aca00
Cpu3ClkTck  : 3b9aca00
Cpu4ClkTck  : 3b9aca00
Cpu5ClkTck  : 3b9aca00
Cpu6ClkTck  : 3b9aca00
Cpu7ClkTck  : 3b9aca00
Cpu8ClkTck  : 3b9aca00
Cpu9ClkTck  : 3b9aca00
Cpu10ClkTck : 3b9aca00
Cpu11ClkTck : 3b9aca00
Cpu12ClkTck : 3b9aca00
Cpu13ClkTck : 3b9aca00
Cpu14ClkTck : 3b9aca00
Cpu15ClkTck : 3b9aca00
Cpu16ClkTck : 3b9aca00
Cpu17ClkTck : 3b9aca00
Cpu18ClkTck : 3b9aca00
Cpu19ClkTck : 3b9aca00
Cpu20ClkTck : 3b9aca00
Cpu21ClkTck : 3b9aca00
Cpu22ClkTck : 3b9aca00
Cpu23ClkTck : 3b9aca00
Cpu24ClkTck : 3b9aca00
Cpu25ClkTck : 3b9aca00
Cpu26ClkTck : 3b9aca00
Cpu27ClkTck : 3b9aca00
Cpu28ClkTck : 3b9aca00
Cpu29ClkTck : 3b9aca00
Cpu30ClkTck : 3b9aca00
Cpu31ClkTck : 3b9aca00
MMU Type: Hypervisor (sun4v)
MMU PGSZs   : 8K,64K,4MB,256MB
State:
CPU0:   online
CPU1:   online
CPU2:   online
CPU3:   online
CPU4:   online
CPU5:   online
CPU6:   online
CPU7:   online
CPU8:   online
CPU9:   online
CPU10:  online
CPU11:  online
CPU12:  online
CPU13:  online
CPU14:  online
CPU15:  online
CPU16:  online
CPU17:  online
CPU18:  online
CPU19:  online
CPU20:  online
CPU21:  online
CPU22:  online
CPU23:  online
CPU24:  online
CPU25:  online
CPU26:  online
CPU27:  online
CPU28:  online
CPU29:  online
CPU30:  online
CPU31:  online
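The `CpuNClkTck` values in this output are hexadecimal tick rates in Hz. A small sketch for turning them into MHz; the function name `clktck_to_mhz` is made up for illustration:

```sh
#!/bin/sh
# clktck_to_mhz HEX: convert a CpuNClkTck value from sparc64
# /proc/cpuinfo (hexadecimal Hz, without 0x prefix) to whole MHz.
clktck_to_mhz() {
    echo $(( 0x$1 / 1000000 ))
}

clktck_to_mhz 3b9aca00
```

For the value shown here, 0x3b9aca00 is 1000000000 Hz, so each of the 32 logical CPUs is reported as running at 1000 MHz.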



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-12-11 Thread Frank Scheiner

Hi guys,

On 11.12.21 18:59, John Paul Adrian Glaubitz wrote:
> On 12/11/21 18:40, Riccardo Mottola wrote:
>> I remember you bisected about the breaking commits. Has there been any progress?
>> A better place where to report this issue other than this mailing list?
>
> The proper place is to send an email to the author of the breaking commit and
> CC the sparclinux Linux kernel mailing list. Most kernel developers don't read
> the debian-sparc mailing list.


We actually did discuss this in late March 2021 starting here:

https://lists.debian.org/debian-sparc/2021/03/msg00045.html

...with Christoph Hellwig and CCed to sparcli...@vger.kernel.org and
this list, but no solution back then.



Back in October I did some testing on various UltraSPARC machines to
sort out which processors (or processor generations) are affected, but
didn't find the time to make something out of it apart from notes and a
conclusion.

I couldn't get my Ultra 80 to netboot, so no result for UltraSPARC II.

My Ultra 10 with US IIi worked though with kernel 5.14.0-3.

My 280r with US III worked with kernel 5.9.0-5 and with 5.14.0-3 gives:

```
Begin: Retrying nfs mount ... mount: Invalid argument
done.
```

...when trying to mount the root FS.

My V480 crashes with 5.14.0-3, but it has crashed with every kernel
version I have tried since I got it, so that is perfectly normal. I don't
know what the issue is, because hardware-wise the 280r (which works with
5.9.0-5) seems to be very similar, though with only 2 processors instead
of the V480's 4.

My T5220 with T2 crashed once with 5.14.0-3 but worked with 5.14.0-4. It
later also worked with 5.14.0-3. And the crash happened way before a
mount of the root FS was tried, so possibly unrelated.

My T1000 with T1 panics with 5.14.0-3 because it can't mount the root
FS. Using `break=premount` on the kernel command line and issuing the
mount command manually gives:

```
(initramfs) nfsmount -o nolock "172.16.0.2:/srv/nfs/t1000/root" "$rootmnt"
[  641.272949] Unable to handle kernel paging request at virtual address
6120
[  641.273138] tsk->{mm,active_mm}->context = 038f
[  641.273248] tsk->{mm,active_mm}->pgd = 800016c1c000
[  641.273310]   \|/  \|/
[  641.273310]   "@'/ .. \`@"
[  641.273310]   /_| \__/ |_\
[  641.273310]  \__U_/
[  641.273444] nfsmount(750): Oops [#182]
[  641.273497] CPU: 12 PID: 750 Comm: nfsmount Tainted: G  D E
   5.14.0-3-sparc64-smp #1  Debian 5.14.12-1
[  641.273603] TSTATE: 11001607 TPC: 0069ce48 TNPC:
0069ce4c Y: Tainted: G  D E
[  641.273705] TPC: 
[  641.273775] g0: 0006 g1: 0004 g2:
6000 g3: 8001fda18000
[  641.273858] g4: 800013b13340 g5: 8001fda18000 g6:
800016bd g7: 800016bd3c30
[  641.273942] o0: fffe o1: 006f4c94 o2:
2000 o3: 8000146d3aa8
[  641.274024] o4: 0008 o5: 0cc0 sp:
800016bd34a1 ret_pc: 006f4c54
[  641.274107] RPC: 
[  641.274165] l0: 00f1a000 l1: 0111f000 l2:
00422db4 l3: 00201db0
[  641.274292] l4: 029c l5: 8001c1a0 l6:
800016bd l7: 006f4be0
[  641.274377] i0: 0cc0 i1: 00201fe0 i2:
0001 i3: 800016bd3dd0
[  641.274460] i4:  i5: 6120 i6:
800016bd3561 i7: 006f4c94
[  641.274542] I7: 
[  641.274599] Call Trace:
[  641.274640] [<006f4c94>] sys_mount+0xb4/0x1a0
[  641.274712] [<006f4c54>] sys_mount+0x74/0x1a0
[  641.274783] [<00406274>] linux_sparc_syscall+0x34/0x44
[  641.274866] Caller[006f4c94]: sys_mount+0xb4/0x1a0
[  641.274939] Caller[006f4c54]: sys_mount+0x74/0x1a0
[  641.275011] Caller[00406274]: linux_sparc_syscall+0x34/0x44
[  641.275090] Caller[00100aa8]: 0x100aa8
[  641.275143] Instruction DUMP:
[  641.275150]  ba074001
[  641.275192]  bb2f7003
[  641.275233]  ba074002
[  641.275274] 
[  641.275314]  84086001
[  641.275355]  82007fff
[  641.275395]  8378841d
[  641.275436]  ba11
[  641.275525]  c2586008
[  641.275614]
Killed
```

Doing the same on a V210 with US IIIi gives:

```
(initramfs) nfsmount -o nolock "172.16.0.2:/srv/nfs/v210/root" "$rootmnt"
mount: Invalid argument
(initramfs) echo $?
1
```

...so similar to 280r with US III.

From all that, I assume UltraSPARC IIi driven machines (and most likely
also older ones with US II) are not affected by this, as are UltraSPARC
T2 driven ones and possibly machines with newer processors (I didn't
have time to try one of my T5240s with T2+).

UltraSPARC III, IIIi and T1 driven machines are affected and to me it
now looks more like some of the klibc programs from the initramfs are at
fault.

I also tested my V210 with an on-disk root FS and although the mounting
seemed to work for that method with 5.14.0-3 I faced multiple problems
later on that crashed the machine.

M

Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-12-11 Thread John Paul Adrian Glaubitz
On 12/11/21 18:40, Riccardo Mottola wrote:
> I remember you bisected about the breaking commits. Has there been any 
> progress?
> A better place where to report this issue other than this mailing list?

The proper place is to send an email to the author of the breaking commit and
CC the sparclinux Linux kernel mailing list. Most kernel developers don't read
the debian-sparc mailing list.

Adrian




Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-12-11 Thread Riccardo Mottola


Hi Frank,

several months have passed… new kernels have come into Debian and they still
do not work for me, so let me dig up this matter again.
I can continue using 5.9 for now, but for how long?

On 2021-03-11 23:43:10 +0100 Frank Scheiner  wrote:

>  From [1] I assume T2 CPUs are not affected, but yeah, the issue could
> be that selective that it only affects the very first generation.
> 
> [1]: https://lists.debian.org/debian-sparc/2021/03/msg00010.html

Have more people reported this issue, perhaps on other systems?

I remember you bisected to find the breaking commits. Has there been any
progress? Is there a better place to report this issue than this mailing
list?

Thank you,
Riccardo 




Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-04-01 Thread Riccardo Mottola
Hi Anatoly!

Anatoly Pugachev wrote:
> current grub2 version does not support compressed image kernels, do
> the following:
>
> gzip -dc /boot/vmlinuz-5.12.0-rc5+ > /boot/vmlinux-5.12.0-rc5+
> rm /boot/vmlinuz-5.12.0-rc5+
> update-grub
>
> and reboot

oh yes, that was it. Finally, I could boot my own built kernel. Which,
of course, crashes as expected.
At least I can confirm Frank's findings.


Riccardo



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-04-01 Thread Anatoly Pugachev
On Thu, Apr 1, 2021 at 2:40 PM Riccardo Mottola
 wrote:
> multix@narya:~/code/linux-stable$ time sudo make install
> sh ./arch/sparc/boot/install.sh 5.12.0-rc5+ arch/sparc/boot/zImage \
> System.map "/boot"
> run-parts: executing /etc/kernel/postinst.d/apt-auto-removal 5.12.0-rc5+
> /boot/vmlinuz-5.12.0-rc5+
> run-parts: executing /etc/kernel/postinst.d/initramfs-tools 5.12.0-rc5+
> /boot/vmlinuz-5.12.0-rc5+
> update-initramfs: Generating /boot/initrd.img-5.12.0-rc5+
> run-parts: executing /etc/kernel/postinst.d/zz-update-grub 5.12.0-rc5+
> /boot/vmlinuz-5.12.0-rc5+
> Generating grub configuration file ...
> Found linux image: /boot/vmlinuz-5.12.0-rc5+
> Found initrd image: /boot/initrd.img-5.12.0-rc5+
> Found linux image: /boot/vmlinuz-5.12.0-rc5+.old
> Found initrd image: /boot/initrd.img-5.12.0-rc5+
> Found linux image: /boot/vmlinux-5.10.0-4-sparc64-smp
> Found initrd image: /boot/initrd.img-5.10.0-4-sparc64-smp
> Found linux image: /boot/vmlinux-5.10.0-trunk-sparc64-smp
> Found initrd image: /boot/initrd.img-5.10.0-trunk-sparc64-smp
> Found linux image: /boot/vmlinux-5.9.0-5-sparc64-smp
> Found initrd image: /boot/initrd.img-5.9.0-5-sparc64-smp
> done
>
> At boot:
>
> Loading Linux 5.12.0-rc5+ ...
> error: premature end of file /vmlinuz-5.12.0-rc5+.
> Loading initial ramdisk ...
> error: you need to load the kernel first.

current grub2 version does not support compressed image kernels, do
the following:

gzip -dc /boot/vmlinuz-5.12.0-rc5+ > /boot/vmlinux-5.12.0-rc5+
rm /boot/vmlinuz-5.12.0-rc5+
update-grub

and reboot
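Before decompressing, it can be worth confirming that the installed image really is gzip-compressed. The sketch below checks for the gzip magic bytes (0x1f 0x8b) at the start of the file; the helper name `is_gzip` is made up, and the paths in the usage comment are simply the ones from this thread.

```sh
#!/bin/sh
# is_gzip FILE: succeed if FILE starts with the gzip magic bytes 1f 8b.
is_gzip() {
    [ "$(od -An -tx1 -N2 "$1" | tr -d ' \n')" = "1f8b" ]
}

# Possible usage with the image from this thread:
#   if is_gzip /boot/vmlinuz-5.12.0-rc5+; then
#       gzip -dc /boot/vmlinuz-5.12.0-rc5+ > /boot/vmlinux-5.12.0-rc5+
#       rm /boot/vmlinuz-5.12.0-rc5+
#       update-grub
#   fi
```

Checking first avoids running `gzip -dc` on an image that is already uncompressed, which would fail and leave a broken or empty `vmlinux` file behind.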



Re: Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-04-01 Thread Hermann Lauer
Hi Riccardo,

On Thu, Apr 01, 2021 at 01:43:29PM +0200, Riccardo Mottola wrote:
> > Yep, in your kernel config set:
> > CONFIG_SYSTEM_TRUSTED_KEYS=""
> 
> thanks, that was it! Now the kernel build

great!

> Do I need to do somethings special?
> 
> make install
> make modules_install

sorry, don't know. I always do:

make -j bindeb-pkg
dpkg -i ../linux-image*.deb

But that is even slower on weak hardware (e.g. BananaUltra), and the above
SHOULD work. The advantage comes when deleting kernels.

> Loading Linux 5.12.0-rc5+ ...
> error: premature end of file /vmlinuz-5.12.0-rc5+.

Somehow your vmlinuz is too short, or the loader is not able to put it
in memory.

Good luck and greetings
  Hermann

-- 
Administration/Zentrale Dienste, Interdiziplinaeres 
Zentrum fuer wissenschaftliches Rechnen der Universitaet Heidelberg
IWR; INF 205; 69120 Heidelberg; Tel: (06221)54-14405 Fax: -14427
Email: hermann.la...@iwr.uni-heidelberg.de



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-04-01 Thread Riccardo Mottola
Hi Hermann,


hermann.la...@uni-heidelberg.de wrote:
> Yep, in your kernel config set:
> CONFIG_SYSTEM_TRUSTED_KEYS=""

thanks, that was it! Now the kernel builds.

Do I need to do something special?

make install
make modules_install

Which shows:

multix@narya:~/code/linux-stable$ time sudo make install
sh ./arch/sparc/boot/install.sh 5.12.0-rc5+ arch/sparc/boot/zImage \
    System.map "/boot"
run-parts: executing /etc/kernel/postinst.d/apt-auto-removal 5.12.0-rc5+
/boot/vmlinuz-5.12.0-rc5+
run-parts: executing /etc/kernel/postinst.d/initramfs-tools 5.12.0-rc5+
/boot/vmlinuz-5.12.0-rc5+
update-initramfs: Generating /boot/initrd.img-5.12.0-rc5+
run-parts: executing /etc/kernel/postinst.d/zz-update-grub 5.12.0-rc5+
/boot/vmlinuz-5.12.0-rc5+
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.12.0-rc5+
Found initrd image: /boot/initrd.img-5.12.0-rc5+
Found linux image: /boot/vmlinuz-5.12.0-rc5+.old
Found initrd image: /boot/initrd.img-5.12.0-rc5+
Found linux image: /boot/vmlinux-5.10.0-4-sparc64-smp
Found initrd image: /boot/initrd.img-5.10.0-4-sparc64-smp
Found linux image: /boot/vmlinux-5.10.0-trunk-sparc64-smp
Found initrd image: /boot/initrd.img-5.10.0-trunk-sparc64-smp
Found linux image: /boot/vmlinux-5.9.0-5-sparc64-smp
Found initrd image: /boot/initrd.img-5.9.0-5-sparc64-smp
done

real    33m3.954s
user    28m18.936s
sys 4m36.889s



At boot:

Loading Linux 5.12.0-rc5+ ...
error: premature end of file /vmlinuz-5.12.0-rc5+.
Loading initial ramdisk ...
error: you need to load the kernel first.


It is interesting how certain operations are very slow on this system,
since a "single" core is slow... so installing takes longer than on a ...
Celeron laptop!
It took... 33 minutes?!


Riccardo



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-04-01 Thread Anatoly Pugachev
On Thu, Apr 1, 2021 at 12:59 PM Riccardo Mottola
 wrote:
> > This seems to only happen when the machines do a long run with high
> > workload and seemingly not when i just power them off again for night
> > with no high workload.
>
> I have a limited experience and can only share that the kernel I
> currently am running on this Fire T2000
>
> Linux narya 5.9.0-5-sparc64-smp #1 SMP Debian 5.9.15-1 (2020-12-17)
> sparc64 GNU/Linux
>
> Is quite stable for me.
> However, i did not try to run for several days compiling, so I don't
> know if it is stable for a long time.

Riccardo,

if you would like to check sparc64 kernel stability, you might want to run
stress-ng tests, like:

$ ./stress-ng --sequential 4 -v --timeout 3m --metrics-brief

it still successfully kills the latest (git) kernel (5.12.0-rc5) on my
sparc64 test LDOM running on a T5-2 hardware server.
But please take stress-ng from the git repo [1], since it has a few
recent fixes for sparc that are not yet packaged in Debian.

Thanks.

1. https://github.com/ColinIanKing/stress-ng/



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-04-01 Thread Riccardo Mottola
Hi Connor,

Connor McLaughlan wrote:
> can anyone possibly give a list of known stable kernel versions for
> SPARC machines? (is there a difference necessary between
> architectures/old vs. newer machines? sun4u/sun4v)?
>
> Also this instability manifests such that the machine is crashing
> during high workload? (halting? rebooting?)
>
> I ask, because on three different SPARC machines i have been
> experiencing a weird effect when using debian:
> I would start a high compiling load for several days (7-10) where the
> machines are running fine without any apparent error visible in dmesg
> or somewhere else.
> Then when i power off and on again, the filesystem would be corrupt
> and sometimes impossible to repair without reinstallation.
>
> This seems to only happen when the machines do a long run with high
> workload and seemingly not when i just power them off again for night
> with no high workload.

I have a limited experience and can only share that the kernel I
currently am running on this Fire T2000

Linux narya 5.9.0-5-sparc64-smp #1 SMP Debian 5.9.15-1 (2020-12-17)
sparc64 GNU/Linux

is quite stable for me: I compiled under high load (e.g. compiling the
Linux kernel on all 32 cores) and synced the git repositories of the Linux
kernel and the ArcticFox browser. In my experience, a git sync of such
repositories is a good stress test; I have had disk drivers crash and
networks freeze on different architectures and systems, but not in this case.
However, I did not try compiling for several days, so I don't
know if it is stable over a long time.

Riccardo



Re: Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-29 Thread Hermann . Lauer
Hi Riccardo,

On Sat, Mar 27, 2021 at 01:16:11PM -0600, Stan Johnson wrote:
> > I took the config out of /boot/config of a good kernel, updated it with
> > "make oldconfig"
> > 
> > During compilation I see:
> > 
> >   CC  init/init_task.o
> > make[1]: *** No rule to make target
> > 'debian/certs/debian-uefi-certs.pem', needed by
> > 'certs/x509_certificate_list'.  Stop.
> > make[1]: *** Waiting for unfinished jobs
> > ...
> 
> I think you need to remove all references to debian certs to compile a
> custom kernel.

Yep, in your kernel config set:
CONFIG_SYSTEM_TRUSTED_KEYS=""
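
A minimal sketch of applying that setting from a script rather than by hand, assuming the option is already present in the copied Debian config. The helper name `clear_trusted_keys` is made up for illustration and is not a kernel tool:

```shell
#!/bin/sh
# Sketch: blank out CONFIG_SYSTEM_TRUSTED_KEYS in a kernel .config so the
# build no longer looks for debian/certs/debian-uefi-certs.pem.
# clear_trusted_keys is a hypothetical helper name.
clear_trusted_keys() {
    config_file="$1"
    tmp_file="${config_file}.tmp.$$"
    # Rewrite the option in a temp file (portable alternative to `sed -i`),
    # leaving every other line untouched.
    sed 's|^CONFIG_SYSTEM_TRUSTED_KEYS=.*|CONFIG_SYSTEM_TRUSTED_KEYS=""|' \
        "$config_file" > "$tmp_file" && mv "$tmp_file" "$config_file"
}
```

Typical use would be `clear_trusted_keys .config` in the top of the source tree, followed by `make oldconfig` again before rebuilding.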

Greetings
  Hermann

-- 
Administration/Zentrale Dienste, Interdiziplinaeres 
Zentrum fuer wissenschaftliches Rechnen der Universitaet Heidelberg
IWR; INF 205; 69120 Heidelberg; Tel: (06221)54-14405 Fax: -14427
Email: hermann.la...@iwr.uni-heidelberg.de



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-27 Thread Stan Johnson
Hi Riccardo,

On 3/26/21 6:21 PM, Riccardo Mottola wrote:
> Hi,
> ...
> 
> I cloned linux stable. It took 60 minutes...
> 
> I took the config out of /boot/config of a good kernel, updated it with
> "make oldconfig"
> 
> During compilation I see:
> 
>   CC  init/init_task.o
> make[1]: *** No rule to make target
> 'debian/certs/debian-uefi-certs.pem', needed by
> 'certs/x509_certificate_list'.  Stop.
> make[1]: *** Waiting for unfinished jobs
> ...

I think you need to remove all references to debian certs to compile a
custom kernel.

-Stan Johnson



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-26 Thread Riccardo Mottola

Hi,


I was unable to "hack" for some days due to day-job. I have seen Frank 
and others have done a great deal.


Still, I wanted to try my own compilation, as a first attempt and also
to be able to build and test any patches myself.



On 3/11/21 11:56 PM, Gregor Riepl wrote:

You should clone the upstream Git repo, otherwise bisecting will be much
more difficult.

I think these instructions are still valid:
https://wiki.debian.org/DebianKernel/GitBisect

You can also skip the Debian-specific stuff and simply do
make -j8 && make modules_install && make install

It's better to use at least a compatible kernel config, though.



I cloned linux stable. It took 60 minutes...

I took the config out of /boot/config of a good kernel, updated it with 
"make oldconfig"


During compilation I see:

  CC  init/init_task.o
make[1]: *** No rule to make target 
'debian/certs/debian-uefi-certs.pem', needed by 
'certs/x509_certificate_list'.  Stop.

make[1]: *** Waiting for unfinished jobs


It took 134 minutes to build with -j32. So well, compiling is not the 
strongest point of this CPU, but not so bad either.


real    134m55.288s
user    4111m46.186s
sys 145m12.479s
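
As a rough sanity check on these numbers, (user + sys) / real gives the average number of hardware threads kept busy during the build. A minimal sketch, assuming `time`-style values of the form `NmS.SSSs`; both helper names are made up for illustration:

```shell
#!/bin/sh
# Sketch: estimate how many hardware threads a build kept busy on average.
# to_seconds and cpu_parallelism are hypothetical helper names.
to_seconds() {
    # Convert a `time`-style value such as 134m55.288s to seconds.
    echo "$1" | LC_ALL=C awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

cpu_parallelism() {
    # Arguments: real, user, sys (in that order), as printed by `time`.
    real=$(to_seconds "$1")
    user=$(to_seconds "$2")
    sys=$(to_seconds "$3")
    # (user + sys) / real, rounded to one decimal place.
    LC_ALL=C awk -v r="$real" -v u="$user" -v s="$sys" \
        'BEGIN { printf "%.1f\n", (u + s) / r }'
}
```

Called as `cpu_parallelism 134m55.288s 4111m46.186s 145m12.479s`, this comes out at roughly 31.6, so `-j32` kept nearly all 32 hardware threads busy; the long wall-clock time comes from slow individual threads rather than idle ones.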

I actually wonder whether the kernel is "overconfigured". Does building
things like nouveau make sense on SPARC? I wonder... maybe sticking a
PCIe card in a Netra or Fire would work?



but I can't install:


multix@narya:~/code/linux-stable$ sudo make modules_install
sed: can't read modules.order: No such file or directory

I wonder if it is related to the error above?


Thanks,

Riccardo



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-23 Thread Frank Scheiner

Hi,

On 23.03.21 17:30, Connor McLaughlan wrote:

Hi,

can anyone possibly give a list of known stable kernel versions for
SPARC machines? (is there a difference necessary between
architectures/old vs. newer machines? sun4u/sun4v)?

Also this instability manifests such that the machine is crashing during
high workload? (halting? rebooting?)

I ask, because on three different SPARC machines i have been
experiencing a weird effect when using debian:
I would start a high compiling load for several days (7-10) where the
machines are running fine without any apparent error visible in dmesg or
somewhere else.
Then when i power off and on again, the filesystem would be corrupt and
sometimes impossible to repair without reinstallation.


Can you be sure that your used disks are in full working order? Maybe
you have bad sectors on them and their EOL is nearing, manifesting in
these FS errors? I assume the more accesses you have on your disks the
more a problem is prone to show up. And the accesses happening during
compile runs could be already too much for your disks. If you have
enough RAM, you could try to run your compile jobs in a RAM disk and
check if this makes a difference.


This seems to only happen when the machines do a long run with high
workload and seemingly not when i just power them off again for night
with no high workload.


I believe the error this thread is about is unrelated to what you
experience on your machines, because the problem here happens early on,
when the root FS is to be mounted.

Cheers,
Frank



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-23 Thread Connor McLaughlan
Hi,

can anyone possibly give a list of known stable kernel versions for SPARC
machines? (is there a difference necessary between architectures/old vs.
newer machines? sun4u/sun4v)?

Also this instability manifests such that the machine is crashing during
high workload? (halting? rebooting?)

I ask, because on three different SPARC machines i have been experiencing a
weird effect when using debian:
I would start a high compiling load for several days (7-10) where the
machines are running fine without any apparent error visible in dmesg or
somewhere else.
Then when i power off and on again, the filesystem would be corrupt and
sometimes impossible to repair without reinstallation.

This seems to only happen when the machines do a long run with high
workload and seemingly not when i just power them off again for night with
no high workload.

Regards,
Connor


On Tue, Mar 23, 2021 at 4:46 PM Frank Scheiner 
wrote:

> Hi Jan,
>
> On 23.03.21 16:36, Jan Engelhardt wrote:
> > On Tuesday 2021-03-23 16:29, Frank Scheiner wrote:
> >> ```
> >> [...]
> >> Begin: Retrying nfs mount ... [   41.753937] NFS: mount program didn't
> >> pass remote address
> >> mount: Invalid argument
> >
> > I seem to recall that NFS is one of those filesystems that (a) makes use
> of
> > filesystem-specific data, i.e. mount(2)'s 5th argument, and (b) a mount
> helper,
> > /usr/sbin/mount.nfs.
> >
> > Now, with the change in Linux kernel
> 028abd9222df0cf5855dab5014a5ebaf06f90565,
> > I am postulating the hypothesis that the fs/nfs/ code for parsing
> this
> > binary blob is no longer aware that it is being invoked in a compat32
> context.
>
> That sounds interesting. Can you perhaps post your hypothesis also in
> this thread:
>
> https://marc.info/?t=16164490063&r=1&w=2
>
> Maybe this gives the kernel developers some ideas.
>
> > Since T2 systems were said to be fine and T1 not, perhaps the T1 systems
> in
> > question were all on NFS mounts and the T2 one wasn't?
>
> No, the T5220 was also running diskless, actually using the same root FS
> as the T1000 (in form of a btrfs subvolume snapshot) plus identical
> kernel and initramfs:
>
> ```
> root@nfs:/srv/tftp# ls -la $( host2hex t5220 )*
> lrwxrwxrwx 1 root root 35 Feb 28  2018 AC10026E ->
> boot/grub/sparc64-ieee1275/core.img
> lrwxrwxrwx 1 root root 38 Mar 15 18:16 AC10026E.initrd.img ->
> initrd.img.5.10.0-4.debian.sid.sparc64
> lrwxrwxrwx 1 root root 36 Mar 15 18:16 AC10026E.vmlinuz ->
> linux.mp.5.10.0-4.debian.sid.sparc64
> ```
>
> Cheers,
> Frank
>
>


Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-23 Thread Frank Scheiner

Hi Jan,

On 23.03.21 16:36, Jan Engelhardt wrote:

On Tuesday 2021-03-23 16:29, Frank Scheiner wrote:

```
[...]
Begin: Retrying nfs mount ... [   41.753937] NFS: mount program didn't
pass remote address
mount: Invalid argument


I seem to recall that NFS is one of those filesystems that (a) makes use of
filesystem-specific data, i.e. mount(2)'s 5th argument, and (b) a mount helper,
/usr/sbin/mount.nfs.

Now, with the change in Linux kernel 028abd9222df0cf5855dab5014a5ebaf06f90565,
I am postulating the hypothesis that the fs/nfs/ code for parsing this
binary blob is no longer aware that it is being invoked in a compat32 context.


That sounds interesting. Can you perhaps post your hypothesis also in
this thread:

https://marc.info/?t=16164490063&r=1&w=2

Maybe this gives the kernel developers some ideas.


Since T2 systems were said to be fine and T1 not, perhaps the T1 systems in
question were all on NFS mounts and the T2 one wasn't?


No, the T5220 was also running diskless, actually using the same root FS
as the T1000 (in form of a btrfs subvolume snapshot) plus identical
kernel and initramfs:

```
root@nfs:/srv/tftp# ls -la $( host2hex t5220 )*
lrwxrwxrwx 1 root root 35 Feb 28  2018 AC10026E ->
boot/grub/sparc64-ieee1275/core.img
lrwxrwxrwx 1 root root 38 Mar 15 18:16 AC10026E.initrd.img ->
initrd.img.5.10.0-4.debian.sid.sparc64
lrwxrwxrwx 1 root root 36 Mar 15 18:16 AC10026E.vmlinuz ->
linux.mp.5.10.0-4.debian.sid.sparc64
```
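
The file names in this listing look like the client's IPv4 address written as eight uppercase hex digits, which is what the firmware asks for over TFTP. A minimal sketch of that conversion; `ip_to_hex` is a made-up stand-in, since the real `host2hex` helper is not shown in this thread:

```shell
#!/bin/sh
# Sketch: convert a dotted-quad IPv4 address to the 8-digit uppercase hex
# form used for TFTP boot file names. ip_to_hex is a hypothetical helper,
# not the host2hex script used above.
ip_to_hex() {
    old_ifs="$IFS"
    IFS='.'
    # Split the address into its four octets as positional parameters.
    set -- $1
    IFS="$old_ifs"
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}
```

For example, `ip_to_hex 172.16.2.122` prints `AC10027A`, which matches the `BOOT_IMAGE=(tftp)/AC10027A.vmlinux` and `ip=172.16.2.122:...` pair on the V245 command line quoted elsewhere in this thread. By the same rule `AC10026E` would correspond to 172.16.2.110.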

Cheers,
Frank



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-23 Thread Jan Engelhardt


On Tuesday 2021-03-23 16:29, Frank Scheiner wrote:
>>
>> while I was able to "install" correctly using a slightly older ISO, I
>> do not get a bootable system. The kernel appears to crash very early during
>> boot.
>
> From my current testing it looks like "UltraSPARC IIIi"s are also
> affected by this problem with UltraSPARC T1s in some way:
>
> With the latest Linux 5.10.x (from Debian) the root FS can't be
> successfully mounted, with the latest Linux 5.9.x (also from Debian) it
> just works fine. Unfortunately the V245 doesn't fail/work for the exact
> same kernels that I tested during the bisecting for the T1000, e.g. the
> first bad commit version that didn't work on the T1000 seems to work on
> the V245 but some good versions don't with:
>
> ```
> [...]
> Begin: Retrying nfs mount ... [   41.753937] NFS: mount program didn't
> pass remote address
> mount: Invalid argument

I seem to recall that NFS is one of those filesystems that (a) makes use of
filesystem-specific data, i.e. mount(2)'s 5th argument, and (b) a mount helper,
/usr/sbin/mount.nfs.

Now, with the change in Linux kernel 028abd9222df0cf5855dab5014a5ebaf06f90565,
I am postulating the hypothesis that the fs/nfs/ code for parsing this
binary blob is no longer aware that it is being invoked in a compat32 context.

Since T2 systems were said to be fine and T1 not, perhaps the T1 systems in
question were all on NFS mounts and the T2 one wasn't?



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-23 Thread Frank Scheiner

Hi all,

On 09.03.21 13:23, Riccardo Mottola wrote:

Hi all,

while I was able to "install" correctly using a slightly older ISO, I
do not get a bootable system. The kernel appears to crash very early during
boot.

Does anybody else have this issue?

   Booting `Debian GNU/Linux'

Loading Linux 5.10.0-4-sparc64-smp ...
Loading initial ramdisk ...



From my current testing it looks like "UltraSPARC IIIi"s are also
affected by this problem with UltraSPARC T1s in some way:

With the latest Linux 5.10.x (from Debian) the root FS can't be
successfully mounted, with the latest Linux 5.9.x (also from Debian) it
just works fine. Unfortunately the V245 doesn't fail/work for the exact
same kernels that I tested during the bisecting for the T1000, e.g. the
first bad commit version that didn't work on the T1000 seems to work on
the V245 but some good versions don't with:

```
[...]
Begin: Retrying nfs mount ... [   41.753937] NFS: mount program didn't
pass remote address
mount: Invalid argument
done.
[...]
```

I'm unsure what could go wrong here, as I always pass the remote address
via the kernel commandline:

```
[...]
[2.928512] Kernel command line: BOOT_IMAGE=(tftp)/AC10027A.vmlinux
root=/dev/nfs
ip=172.16.2.122:172.16.0.2:172.16.0.1:255.255.0.0:v245-2:enp9s4f0:off
nfsroot=172.16.0.2:/srv/nfs/v245-2/root nfsrootdebug rw
[...]
```
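
To make the command line above easier to read, the `ip=` value can be split into its colon-separated fields. This is a minimal sketch, assuming the field order given in the kernel's nfsroot documentation (client, server, gateway, netmask, hostname, device, autoconf); `parse_ip_param` is a made-up helper and not part of klibc or the kernel:

```shell
#!/bin/sh
# Sketch: split an ip= kernel parameter into named fields.
# Assumed field order:
#   client-ip:server-ip:gw-ip:netmask:hostname:device:autoconf
# parse_ip_param is a hypothetical helper name.
parse_ip_param() {
    # Strip a leading "ip=" if present, then split on ':'.
    value="${1#ip=}"
    old_ifs="$IFS"
    IFS=':'
    set -- $value
    IFS="$old_ifs"
    echo "client=$1 server=$2 gateway=$3 netmask=$4 hostname=$5 device=$6 autoconf=$7"
}
```

Run on the value above, this labels 172.16.2.122 as the client, 172.16.0.2 as the server (the same host named in `nfsroot=`), `v245-2` as the hostname and `enp9s4f0` as the device, with autoconfiguration off.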

Maybe there is some breakage in the klibc based programs in the
initramfs, but why they don't affect both UltraSPARC IIIi and T1 in the
same way is somewhat strange.

Cheers,
Frank



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-17 Thread Frank Scheiner

Hi Adrian,

On 17.03.21 13:39, John Paul Adrian Glaubitz wrote:

On 3/17/21 1:22 PM, Frank Scheiner wrote:

```
johndoe@x4270:~/git-projects/torvalds/linux$ git bisect bad
028abd9222df0cf5855dab5014a5ebaf06f90565 is the first bad commit
[...]

Did you verify this by reverting the commit or - if reverting is not possible -
by testing the revision just before the commit?


I did not yet revert the bad commit in a current kernel and test it, but
from my understanding the parent commit of the first bad one must have
been a good one and indeed, [67e306c6906137020267eb9bbdbc127034da3627]
is the parent of [028abd9222df0cf5855dab5014a5ebaf06f90565] and was
working for me on my T1000:


```
johndoe@x4270:~/git-projects/torvalds/linux$ git bisect log
[...]
# good: [67e306c6906137020267eb9bbdbc127034da3627] fs,nfs: lift compat
nfs4 mount data handling into the nfs code
git bisect good 67e306c6906137020267eb9bbdbc127034da3627
# bad: [028abd9222df0cf5855dab5014a5ebaf06f90565] fs: remove
compat_sys_mount
git bisect bad 028abd9222df0cf5855dab5014a5ebaf06f90565
# first bad commit: [028abd9222df0cf5855dab5014a5ebaf06f90565] fs:
remove compat_sys_mount
```


[67e306c6906137020267eb9bbdbc127034da3627]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=67e306c6906137020267eb9bbdbc127034da3627

[028abd9222df0cf5855dab5014a5ebaf06f90565]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=028abd9222df0cf5855dab5014a5ebaf06f90565


Just to be sure you found the correct commit.

If that has been verified, please report the issue to the sparclinux LKML and 
CC Christoph.


Will do that soon-ish but maybe also try to revert that commit in
Debian's 5.10.0-4 and test it for additional assurance (then not so
soon-ish - maybe this weekend). I'll put you and Riccardo in CC, too.

Hopefully this will be easier to fix than the kernel breakage on the
rx2800 i2 - assuming that problem is still there ([1], [2]).

[1]: https://marc.info/?l=linux-ia64&m=156114769908890&w=2
[2]: https://marc.info/?l=linux-ia64&m=156144480821712&w=2

Cheers and thanks for the pointers,
Frank



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-17 Thread John Paul Adrian Glaubitz
Hi Frank!

On 3/17/21 1:22 PM, Frank Scheiner wrote:
> Hi Adrian, Riccardo
> 
> so I'm finished with bisecting and it points to the following commit as
> first bad commit:
> 
> ```
> johndoe@x4270:~/git-projects/torvalds/linux$ git bisect bad
> 028abd9222df0cf5855dab5014a5ebaf06f90565 is the first bad commit
> commit 028abd9222df0cf5855dab5014a5ebaf06f90565
> Author: Christoph Hellwig 
> Date:   Thu Sep 17 10:22:34 2020 +0200
> 
> fs: remove compat_sys_mount
> 
> compat_sys_mount is identical to the regular sys_mount now, so remove it
> and use the native version everywhere.

Did you verify this by reverting the commit or - if reverting is not possible -
by testing the revision just before the commit? Just to be sure you found the
correct commit.

If that has been verified, please report the issue to the sparclinux LKML and 
CC Christoph.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-17 Thread Frank Scheiner

Hi Adrian, Riccardo

so I'm finished with bisecting and it points to the following commit as
first bad commit:

```
johndoe@x4270:~/git-projects/torvalds/linux$ git bisect bad
028abd9222df0cf5855dab5014a5ebaf06f90565 is the first bad commit
commit 028abd9222df0cf5855dab5014a5ebaf06f90565
Author: Christoph Hellwig 
Date:   Thu Sep 17 10:22:34 2020 +0200

    fs: remove compat_sys_mount

    compat_sys_mount is identical to the regular sys_mount now, so remove it
    and use the native version everywhere.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

 arch/arm64/include/asm/unistd32.h                  |  2 +-
 arch/mips/kernel/syscalls/syscall_n32.tbl          |  2 +-
 arch/mips/kernel/syscalls/syscall_o32.tbl          |  2 +-
 arch/parisc/kernel/syscalls/syscall.tbl            |  2 +-
 arch/powerpc/kernel/syscalls/syscall.tbl           |  2 +-
 arch/s390/kernel/syscalls/syscall.tbl              |  2 +-
 arch/sparc/kernel/syscalls/syscall.tbl             |  2 +-
 arch/x86/entry/syscalls/syscall_32.tbl             |  2 +-
 fs/Makefile                                        |  1 -
 fs/compat.c                                        | 57 --
 fs/internal.h                                      |  3 --
 fs/namespace.c                                     |  4 +-
 include/linux/compat.h                             |  6 ---
 include/uapi/asm-generic/unistd.h                  |  2 +-
 tools/include/uapi/asm-generic/unistd.h            |  2 +-
 tools/perf/arch/powerpc/entry/syscalls/syscall.tbl |  2 +-
 tools/perf/arch/s390/entry/syscalls/syscall.tbl    |  2 +-
 17 files changed, 14 insertions(+), 81 deletions(-)
 delete mode 100644 fs/compat.c
```

Seems to be indeed related to mounting (the root FS). Why it only
affects UltraSPARC T1 CPUs is another question. I don't have any other
UltraSPARC II, IIi, IIe, III and IIIi driven machines at hand now for
checking those.

So what now?

Cheers,
Frank

P.S.

Here's the log for reference:

```
johndoe@x4270:~/git-projects/torvalds/linux$ git bisect log
git bisect start
# good: [bbf5c979011a099af5dc76498918ed7df445635b] Linux 5.9
git bisect good bbf5c979011a099af5dc76498918ed7df445635b
# bad: [3650b228f83adda7e5ee532e2b90429c03f7b9ec] Linux 5.10-rc1
git bisect bad 3650b228f83adda7e5ee532e2b90429c03f7b9ec
# bad: [c48b75b7271db23c1b2d1204d6e8496d91f27711] Merge tag
'sound-5.10-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect bad c48b75b7271db23c1b2d1204d6e8496d91f27711
# bad: [7fafb54c7d390e9b273a1d7d377e38d9c408046e] Merge tag
'leds-5.10-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/pavel/linux-leds
git bisect bad 7fafb54c7d390e9b273a1d7d377e38d9c408046e
# bad: [fd5c32d80884268a381ed0e67cccef0b3d37750b] Merge tag
'media/v5.10-1' of
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect bad fd5c32d80884268a381ed0e67cccef0b3d37750b
# bad: [865c50e1d279671728c2936cb7680eb89355eeea] x86/uaccess: utilize
CONFIG_CC_HAS_ASM_GOTO_OUTPUT
git bisect bad 865c50e1d279671728c2936cb7680eb89355eeea
# good: [13cb73490f475f8e7669f9288be0bcfa85399b1f] Merge tag
'x86-entry-2020-10-12' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 13cb73490f475f8e7669f9288be0bcfa85399b1f
# good: [dd502a81077a5f3b3e19fa9a1accffdcab5ad5bc] Merge tag
'core-static_call-2020-10-12' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good dd502a81077a5f3b3e19fa9a1accffdcab5ad5bc
# good: [ced3a9eb3cd0d07462cdbaa8a0f3d46e5aaeadec] Merge tag
'ia64_for_5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
git bisect good ced3a9eb3cd0d07462cdbaa8a0f3d46e5aaeadec
# good: [fc67d5bc876b6b224538c8848fc02e70f269ec99]
Documentation/admin-guide: README & svga: remove use of "rdev"
git bisect good fc67d5bc876b6b224538c8848fc02e70f269ec99
# good: [c90578360c92c71189308ebc71087197080e94c3] Merge branch
'work.csum_and_copy' of
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
git bisect good c90578360c92c71189308ebc71087197080e94c3
# good: [85ed13e78dbedf9433115a62c85429922bc5035c] Merge branch
'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
git bisect good 85ed13e78dbedf9433115a62c85429922bc5035c
# bad: [22230cd2c55bd27ee2c3a3def97c0d5577a75b82] Merge branch
'compat.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
git bisect bad 22230cd2c55bd27ee2c3a3def97c0d5577a75b82
# good: [e18afa5bfa4a2f0e07b0864370485df701dacbc1] Merge branch
'work.quota-compat' of
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
git bisect good e18afa5bfa4a2f0e07b0864370485df701dacbc1
# good: [67e306c6906137020267eb9bbdbc127034da3627] fs,nfs: lift compat
nfs4 mount data handling into the nfs code
git bisect good 67e306c6906137020267eb9bbdbc127034da3627
# bad: [028abd9222df0cf5855dab5014a5ebaf06f90565] fs: remove
compat_sys_mount
git bisect bad 028abd9222df0cf5855dab5014a5ebaf06f90565
# first bad commit: [028abd9222df0cf5855dab5014a5ebaf06f90565] fs:
remove compat_sys_mount
```



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-16 Thread Frank Scheiner

Hi Adrian,

On 16.03.21 14:27, John Paul Adrian Glaubitz wrote:

Hello Frank!

On 3/16/21 2:07 PM, Frank Scheiner wrote:

After a first cross compile run, I can confirm that 5.10-rc1 is also
broken on my T1000. I'll take this version (parent commit:
33def8498fdde180023444b08e12b72a9efed41d) as "bad". But taking v5.9 as
good means more than 5000 commits in between. Linus's tree doesn't
contain v5.9.16 or at least I didn't find it there. How can I get "good"
closer to "bad"? I don't want to check too many good versions if I know
that v5.9.16 most likely will be good, as v5.9.15 (5.9.0-5 on Debian) is
good? Should I switch to the stable kernel sources from GKH?


I'm not sure I understand your problem here. The bisecting algorithm
has a runtime O(ln(n)), so even with 5000 commits, it will converge quite
quickly.


Yeah, you're right, I think I make this error every time I try to bisect
the kernel - i.e. once every two years... ;-)


Just make sure you are using a fast machine when compiling the kernel
as otherwise it won't be fun.


Other topic: As the compile times are actually taking less time than the
preparation of the test boot (copy over modules to T1000 root FS, boot
T1000 with working kernel, create initramfs, reboot with kernel in
question and that initramfs), is there a way to create the initramfs
(for sparc64) on the cross compile host (amd64)?

Cheers,
Frank



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-16 Thread John Paul Adrian Glaubitz
Hello Frank!

On 3/16/21 2:07 PM, Frank Scheiner wrote:
> After a first cross compile run, I can confirm that 5.10-rc1 is also
> broken on my T1000. I'll take this version (parent commit:
> 33def8498fdde180023444b08e12b72a9efed41d) as "bad". But taking v5.9 as
> good means more than 5000 commits in between. Linus's tree doesn't
> contain v5.9.16 or at least I didn't find it there. How can I get "good"
> closer to "bad"? I don't want to check too many good versions if I know
> that v5.9.16 most likely will be good, as v5.9.15 (5.9.0-5 on Debian) is
> good? Should I switch to the stable kernel sources from GKH?

I'm not sure I understand your problem here. The bisecting algorithm
has a runtime O(ln(n)), so even with 5000 commits, it will converge quite
quickly.
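
To put a number on that: for a straight line of n candidate commits, a binary search needs about ceil(log2(n)) test builds. A minimal sketch of the arithmetic, with `bisect_steps` as a made-up helper name:

```shell
#!/bin/sh
# Sketch: worst-case number of test builds when bisecting n linear commits,
# i.e. ceil(log2(n)). bisect_steps is a hypothetical helper name.
bisect_steps() {
    remaining="$1"
    steps=0
    while [ "$remaining" -gt 1 ]; do
        # Each test halves the candidate range (rounding up).
        remaining=$(( (remaining + 1) / 2 ))
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}
```

For the roughly 5000 commits mentioned above, `bisect_steps 5000` gives 13 test builds. The count git reports can differ slightly, because kernel history is a graph of merges rather than a straight line.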

Just make sure you are using a fast machine when compiling the kernel
as otherwise it won't be fun.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-16 Thread Frank Scheiner

Hi again,

On 16.03.21 14:07, Frank Scheiner wrote:

@Adrian:
After a first cross compile run, I can confirm that 5.10-rc1 is also
broken on my T1000. I'll take this version (parent commit:
33def8498fdde180023444b08e12b72a9efed41d) as "bad". But taking v5.9 as
good means more than 5000 commits in between. Linus's tree doesn't
contain v5.9.16 or at least I didn't find it there. How can I get "good"
closer to "bad"? I don't want to check too many good versions if I know
that v5.9.16 most likely will be good, as v5.9.15 (5.9.0-5 on Debian) is
good? Should I switch to the stable kernel sources from GKH?


Forget about that, [1] shows 5000+ commits between v5.9.16 and
v5.10-rc1, too. So no difference.

[1]: https://github.com/gregkh/linux/compare/v5.9.16...v5.10-rc1

Cheers,
Frank



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-16 Thread Frank Scheiner

Hi Riccardo, Adrian,

so I did some testing yesterday and also see your problem on my T1000.
Because of some kernel command line misconfiguration, my machine at
first couldn't find its root FS as it tried to use a non-existent NIC.
This led to a lot of kernel oopses (I assume at least one per hardware
thread) that looked very similar to the ones you see. And this happens
even with "working" kernels (tested 4.19.x and 5.9.x). So the actual
result of that problem in 5.10.x seems to be that the kernel can't find
its root FS.

On 11.03.21 23:43, Frank Scheiner wrote:

On 11.03.21 23:03, Riccardo Mottola wrote:

I suppose the Niagara CPU gives the kernel issue


From [1] I assume T2 CPUs are not affected, but yeah, the issue could
be so selective that it only affects the very first generation.

[1]: https://lists.debian.org/debian-sparc/2021/03/msg00010.html


I can also indeed confirm that this problem only affects the T1 CPU, as
my T5220 with T2 CPU works w/o problems with kernel 5.10.x.

I didn't get any further yesterday as it took a lot of time to update
the root FSes of my T1000 and my X4270 - my intended machine for cross
compilation, not sure if it will be "fast" enough*. In addition cloning
Linus's linux tree alone took a lot of time (about an hour).

* it will:

```
## with config of Debian's 5.9.0-5 kernel as `.config`
$ make ARCH=sparc64 CROSS_COMPILE=sparc64-linux-gnu- olddefconfig
[...]
## with lsmod output from T1000
$ make ARCH=sparc64 CROSS_COMPILE=sparc64-linux-gnu- \
    LSMOD=$HOME/t1000-lsmod localmodconfig
[...]
$ time make -j16 ARCH=sparc64 CROSS_COMPILE=sparc64-linux-gnu- all
[...]
  kernel: arch/sparc/boot/zImage is ready

real    3m12.264s
user    42m5.325s
sys     3m27.843s
```

@Adrian:
After a first cross compile run, I can confirm that 5.10-rc1 is also
broken on my T1000. I'll take this version (parent commit:
33def8498fdde180023444b08e12b72a9efed41d) as "bad". But taking v5.9 as
good means more than 5000 commits in between. Linus's tree doesn't
contain v5.9.16 or at least I didn't find it there. How can I get "good"
closer to "bad"? I don't want to check too many good versions if I know
that v5.9.16 most likely will be good, as v5.9.15 (5.9.0-5 on Debian) is
good? Should I switch to the stable kernel sources from GKH?

Cheers,
Frank




Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-12 Thread Jan Engelhardt


On Thursday 2021-03-11 23:43, Frank Scheiner wrote:
>>
>> Do you know if I can reset the system via the serial console?
>
> Reset from the serial console might work via the kernel with the [magic
> system request] functionality.
>
> [magic system request]:
> https://www.kernel.org/doc/html/v4.11/admin-guide/sysrq.html
>
> But you can always reset the system using the SC. The T1000 (and the
> T2000, too) has both serial (on T2000 right of the DB-9 ttya port,
> should work with a blue Cisco serial cable) and network port (on T2000
> above the two USB ports). The serial port of the SC automatically
> switches to the system console after some (configurable) time

SER MGT is an RS232-ish serial line, just with an RJ-45 connector for size.
Once the SC has finished booting, system console is the default mode.
Since SER has no notion of connections, it should stay in whatever mode
it was left in. Maybe there is an autoswitch, but I never observed it (and I
would not want to wait many minutes just to observe it either).

For NET MGT, when you start a new SSH connection, it always starts
out in system console mode and #. is needed.

>> I tried sending a break on the serial console, but the errors just keep
>> running.
>> Break is received, since I see it as SC Alert, but I am not put into the
>> console, maybe there is some further trick on these newer machine?
>
> So you already got access to the SC. Then you can reset the machine from
> there, too.

Because NET does not have an equivalent of the serial pin used to traditionally
signal "break", a synthetic break can be issued from SC. But it's a bit
awkward, because you immediately need to go back into system console mode to
type the desired sysrq character.

sc> break
confirm (y/n)y
sc> console
confirm (y/n)y
type <>
Linux kernel: ah yes I received SYSRQ-s

>> I am
>> used to old SparcStations and UltraSparc Netras, where it was sufficient.
>> It is inconvenient at every hang to power-cycle, since at every turn on,
>> it runs a self-test which lasts minutes :)
>
> I think depending on the SC configuration, these machines also run a
> self-test for every X resets, but this should be configurable.

It's the first thing you want to turn off as a private user.

diag_trigger none

and probably

diag_mode off



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-11 Thread Gregor Riepl
> How should I proceed? Which kernel sources?
> 
> https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official
> 
> 
> is 4.3 correct for me? 4.6 ?

You should clone the upstream Git repo, otherwise bisecting will be much
more difficult.

I think these instructions are still valid:
https://wiki.debian.org/DebianKernel/GitBisect

You can also skip the Debian-specific stuff and simply do
make -j8 && make modules_install && make install

It's better to use at least a compatible kernel config, though.
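
To give a feeling for how many kernel builds a bisect costs, here is a minimal Python sketch of the binary search that git bisect performs. The commit list and the breakage position are hypothetical stand-ins, and `is_bad` represents "build this commit, boot it, see if it crashes"; it is not how git itself is implemented.

```python
def find_first_bad(commits, is_bad):
    """Return (first_bad_commit, builds_needed).

    Assumes commits[0] is known good, commits[-1] is known bad, and the
    breakage stays once it appears (the same assumption git bisect makes).
    """
    lo, hi = 0, len(commits) - 1   # lo: last known good, hi: first known bad
    builds = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        builds += 1                # one kernel build + boot test per step
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], builds


# Hypothetical history: 10000 commits, breakage introduced at index 6789.
commits = list(range(10000))
first_bad, builds = find_first_bad(commits, lambda c: c >= 6789)
print(first_bad, builds)
```

The number of builds grows only with the logarithm of the range, so even a range of several thousand commits needs on the order of a dozen test boots.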



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-11 Thread Frank Scheiner

Hi Riccardo,

On 11.03.21 23:03, Riccardo Mottola wrote:

Hi Frank!

I suppose the Niagara CPU gives the kernel issue


From [1] I assume T2 CPUs are not affected, but yeah, the issue could
be so selective that it only affects the very first generation.

[1]: https://lists.debian.org/debian-sparc/2021/03/msg00010.html



Frank Scheiner wrote:

If I remember there was a repository with many snapshots of different
versions, already as package, which one can test quickly. That way we
can restrict breakage range without git bisect.

Do you have a link?


I assume you mean "http://snapshot.debian.org".


Exactly. With this I did some more tests.

Still Works:
5.9.0-4-sparc64-smp #1 SMP Debian 5.9.11-1 (2020-11-27)
5.9.0-5-sparc64-smp #1 SMP Debian 5.9.15-1 (2020-12-17)

Broken:

linux-image-5.10.0-trunk-sparc64-smp_5.10.2-1~exp1_sparc64.deb

So later 5.9 series kernels continue to work, and even very early 5.10
kernels do not.

Do you know if I can reset the system via the serial console?


Reset from the serial console might work via the kernel with the [magic
system request] functionality.

[magic system request]:
https://www.kernel.org/doc/html/v4.11/admin-guide/sysrq.html
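
As long as the kernel still runs commands (so not for a hard hang), the same requests can also be sent through /proc/sysrq-trigger instead of a break on the console. A minimal Python sketch, assuming root privileges and a kernel built with magic SysRq support; the function names are only for illustration:

```python
def send_sysrq(key, trigger_path="/proc/sysrq-trigger"):
    """Write a single SysRq command character to the kernel trigger file."""
    if len(key) != 1:
        raise ValueError("SysRq expects exactly one command character")
    with open(trigger_path, "w") as trigger:
        trigger.write(key)


def emergency_reboot(trigger_path="/proc/sysrq-trigger"):
    # s = sync disks, u = remount filesystems read-only, b = reboot immediately
    for key in "sub":
        send_sysrq(key, trigger_path)
```

Calling `emergency_reboot()` as root reboots the machine right away, so try it with `send_sysrq("s")` first.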

But you can always reset the system using the SC. The T1000 (and the
T2000, too) has both serial (on T2000 right of the DB-9 ttya port,
should work with a blue Cisco serial cable) and network port (on T2000
above the two USB ports). The serial port of the SC automatically
switches to the system console after some (configurable) time and you
need to escape to the SC login prompt with a configurable key sequence
(`#.` by default, see [2]).

[2]:
https://docs.oracle.com/cd/E19076-01/t2k.srvr/819-2549-12/ontario-consoleConfig.html#28277


I tried sending a break on the serial console, but the errors just keep
running.
The break is received, since I see it as an SC Alert, but I am not put into
the console. Maybe there is some further trick on these newer machines?


So you already got access to the SC. Then you can reset the machine from
there, too.


I am
used to old SparcStations and UltraSparc Netras, where it was sufficient.
It is inconvenient at every hang to power-cycle, since at every turn on,
it runs a self-test which lasts minutes :)


I think depending on the SC configuration, these machines also run a
self-test for every X resets, but this should be configurable.

Hope that helps
Cheers,
Frank



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-11 Thread Gregor Riepl


> Do you know if I can via serial-console reset the system?
> I tried sending a break on the serial console, but the errors just keep
> running.
> Break is received, since I see it as SC Alert, but I am not put into the
> console, maybe there is some further trick on these newer machine? I am
> used to old SparcStations and UltraSparc Netras, where it was sufficient.
> It is inconvenient at every hang to power-cycle, since at every turn on,
> it runs a self-test which lasts minutes :)

According to this, you should be able to reach the system console
through the SER MGT port:
https://unixed.com/index.php/2013/06/16/accessing-the-sparc-system-console/
NET MGT is probably easier, but you'll have to set it up first.

Perhaps you can also attach a USB keyboard and press the break key to
get into the system console, then type "reset" to boot the machine? Not
sure if this works without a monitor though. And you might need to enter
the system password first, if it's set.



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-11 Thread Riccardo Mottola

Hi Adrian

John Paul Adrian Glaubitz wrote:

Well, that doesn't really help you though. You want to find the commit in
question, just the range isn't enough to solve the issue.


Well, it helped a little: it is something early in the 5.10 series.
Also, I now have an apparently working kernel (who knows how stable under
load?) from the 5.9 series.



If you have a fast second machine available, bisecting the problem
shouldn't take too long.


Well, this machine has plenty of RAM, disk space and a good connection;
how fast the CPU is at compiling a kernel I don't know, but we can try.
Power consumption is not much worse than a PC, but it is darn loud!
Like a vacuum cleaner... I need to stay out of the room, but I found an
acceptable setup. I use a workstation with a serial console connected to
it, then connect through ssh to the workstation and through that into the
management.


Although I have been compiling kernels on Gentoo Linux for 15 years, I
never did so on Debian. Here we have init images



How should I proceed? Which kernel sources?

https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official

is 4.3 correct for me? 4.6 ?

Please guide me

Riccardo



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-11 Thread Riccardo Mottola

Hi Frank!

I suppose the Niagara CPU gives the kernel issue

Frank Scheiner wrote:

If I remember there was a repository with many snapshots of different
versions, already as package, which one can test quickly. That way we
can restrict breakage range without git bisect.

Do you have a link?


I assume you mean "http://snapshot.debian.org".


Exactly. With this I did some more tests.

Still Works:
5.9.0-4-sparc64-smp #1 SMP Debian 5.9.11-1 (2020-11-27)
5.9.0-5-sparc64-smp #1 SMP Debian 5.9.15-1 (2020-12-17)

Broken:

linux-image-5.10.0-trunk-sparc64-smp_5.10.2-1~exp1_sparc64.deb

So later 5.9 series kernels continue to work, and even very early 5.10
kernels do not.
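
To keep track of the results, the tested packages can be sorted and the boundary read off. A minimal Python sketch using the versions reported in this thread; it compares only the upstream part of the version and ignores the Debian revision, which is a simplification and not the full Debian version ordering:

```python
import re


def upstream_key(debian_version):
    """Turn '5.10.2-1~exp1' into (5, 10, 2), ignoring the Debian revision."""
    upstream = debian_version.split("-", 1)[0]
    return tuple(int(part) for part in re.findall(r"\d+", upstream))


def breakage_range(results):
    """results maps Debian package version -> True if the kernel worked."""
    last_good = max((v for v in results if results[v]), key=upstream_key)
    first_bad = min((v for v in results if not results[v]), key=upstream_key)
    return last_good, first_bad


# Results reported so far in this thread (True = works, False = crashes).
results = {
    "5.9.6-1": True,
    "5.9.11-1": True,
    "5.9.15-1": True,
    "5.10.2-1~exp1": False,
    "5.10.19-1": False,
}
print(breakage_range(results))  # -> ('5.9.15-1', '5.10.2-1~exp1')
```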

Do you know if I can reset the system via the serial console?
I tried sending a break on the serial console, but the errors just keep
running.
The break is received, since I see it as an SC Alert, but I am not put into
the console. Maybe there is some further trick on these newer machines? I am
used to old SparcStations and UltraSparc Netras, where it was sufficient.
It is inconvenient to power-cycle at every hang, since at every turn-on
it runs a self-test which lasts minutes :)


Riccardo



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-10 Thread John Paul Adrian Glaubitz
On 3/10/21 10:17 AM, Riccardo Mottola wrote:
> If I remember there was a repository with many snapshots of different 
> versions,
> already as package, which one can test quickly. That way we can restrict 
> breakage
> range without git bisect.

Well, that doesn't really help you though. You want to find the commit in
question, just the range isn't enough to solve the issue.

If you have a fast second machine available, bisecting the problem
shouldn't take too long.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-10 Thread Frank Scheiner

Hi Riccardo,

On 10.03.21 10:17, Riccardo Mottola wrote:

Frank Scheiner wrote:

We have an older UltraSPARC IIIi that has issues with newer kernels, but
usually only after longer operation and the issue might be related to the
bug that was just fixed recently by Rob Gardner.


Which kernel version will have this bug (which one?) fixed, 5.11.x? I
can also check with one of my UltraSPARC IIIi powered systems, too, next
week.


as written in the title, I have issues with:
5.10.0-4-sparc64-smp #1 Debian 5.10.19-1


I know.


If I remember there was a repository with many snapshots of different
versions, already as package, which one can test quickly. That way we
can restrict breakage range without git bisect.

Do you have a link?


I assume you mean "http://snapshot.debian.org".

Cheers,
Frank



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-10 Thread Riccardo Mottola

Hi Frank,


Frank Scheiner wrote:

We have an older UltraSPARC IIIi that has issues with newer kernels, but
usually only after longer operation and the issue might be related to the
bug that was just fixed recently by Rob Gardner.


Which kernel version will have this bug (which one?) fixed, 5.11.x? I
can also check with one of my UltraSPARC IIIi powered systems, too, next
week.


as written in the title, I have issues with:
5.10.0-4-sparc64-smp #1 Debian 5.10.19-1

If I remember there was a repository with many snapshots of different 
versions, already as package, which one can test quickly. That way we 
can restrict breakage range without git bisect.


Do you have a link?

Riccardo



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-09 Thread John Paul Adrian Glaubitz
On 3/9/21 11:20 PM, John Paul Adrian Glaubitz wrote:
>> Which kernel version will have this bug (which one?) fixed, 5.11.x? I
>> can also check with one of my UltraSPARC IIIi powered systems, too, next
>> week.
> 
> I have not uploaded that kernel yet, I have it built locally, PR here [1].

The patch is now in Linus' tree so it will be part of 5.12 [1].

Adrian

> [1] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e5e8b80d352ec999d2bba3ea584f541c83f4ca3f

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-09 Thread John Paul Adrian Glaubitz
On 3/9/21 10:18 PM, Frank Scheiner wrote:
>> The oldest buildd we are running is a T5120 and that's a T2.
> 
> And these don't show the problems Riccardo's T1 powered T2000 has?

No, the machine runs stable.

>> We have an older UltraSPARC IIIi that has issues with newer kernels, but
>> usually only after longer operation and the issue might be related to the
>> bug that was just fixed recently by Rob Gardner.
> 
> Which kernel version will have this bug (which one?) fixed, 5.11.x? I
> can also check with one of my UltraSPARC IIIi powered systems, too, next
> week.

I have not uploaded that kernel yet, I have it built locally, PR here [1].

Adrian

> [1] https://salsa.debian.org/kernel-team/linux/-/merge_requests/339

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-09 Thread Frank Scheiner

On 09.03.21 22:09, John Paul Adrian Glaubitz wrote:

On 3/9/21 9:38 PM, Frank Scheiner wrote:

I have a T1000 with which I could try to reproduce Riccardo's issues.
Hardware wise they should be pretty similar. As the T1000 doesn't have a
CDROM, I'll try to netboot a few newer kernels and report my findings.
Will take me until next week though, as the machine is in (cold) storage
now.

@Adrian:
Aren't there some build servers using UltraSPARC T2 or T2+? Do they run
with the latest kernels?


The oldest buildd we are running is a T5120 and that's a T2.


And these don't show the problems Riccardo's T1 powered T2000 has?


We have an older UltraSPARC IIIi that has issues with newer kernels, but
usually only after longer operation and the issue might be related to the
bug that was just fixed recently by Rob Gardner.


Which kernel version will have this bug (which one?) fixed, 5.11.x? I
can also check with one of my UltraSPARC IIIi powered systems, too, next
week.

Cheers,
Frank



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-09 Thread John Paul Adrian Glaubitz
On 3/9/21 9:38 PM, Frank Scheiner wrote:
> I have a T1000 with which I could try to reproduce Riccardo's issues.
> Hardware wise they should be pretty similar. As the T1000 doesn't have a
> CDROM, I'll try to netboot a few newer kernels and report my findings.
> Will take me until next week though, as the machine is in (cold) storage
> now.
> 
> @Adrian:
> Aren't there some build servers using UltraSPARC T2 or T2+? Do they run
> with the latest kernels?

The oldest buildd we are running is a T5120 and that's a T2.

We have an older UltraSPARC IIIi that has issues with newer kernels, but
usually only after longer operation and the issue might be related to the
bug that was just fixed recently by Rob Gardner.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-09 Thread Frank Scheiner

Hi guys,

On 09.03.21 18:31, John Paul Adrian Glaubitz wrote:

Hi!

On 3/9/21 6:26 PM, Riccardo Mottola wrote:

John Paul Adrian Glaubitz wrote:

while I was able to "install" correctly using a slightly older ISO, I do not
get a bootable system. The kernel appears to crash very early during boot.

I think this is more likely a hardware issue. We haven't seen any machines
crashing that early. Please make sure the RAM modules in this machine are
working properly.


I don't think so... I think it is a Kernel issue, since with kernel
5.9.0-2-sparc64-smp #1 SMP Debian 5.9.6-1 (2020-11-08) sparc64 GNU/Linux

the machine is performing fine with network, disk and compiler usage on all
32 CPUs.


Then you need to bisect the kernel as I don't have any means to reproduce
the issue.


I have a T1000 with which I could try to reproduce Riccardo's issues.
Hardware wise they should be pretty similar. As the T1000 doesn't have a
CDROM, I'll try to netboot a few newer kernels and report my findings.
Will take me until next week though, as the machine is in (cold) storage
now.

@Adrian:
Aren't there some build servers using UltraSPARC T2 or T2+? Do they run
with the latest kernels?

Cheers,
Frank



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-09 Thread John Paul Adrian Glaubitz
Hi!

On 3/9/21 6:26 PM, Riccardo Mottola wrote:
> John Paul Adrian Glaubitz wrote:
>>> while I was able to "install" correctly using a slightly older ISO, I get 
>>> not a bootable
>>> system. The kernel appears to crash very early during boot.
>> I think this is more likely a hardware issue. We haven't seen any machines 
>> crashing that
>> early. Please make sure the RAM modules in this machine are working properly.
> 
> I don't think so... I think it is a Kernel issue, since with kernel
> 5.9.0-2-sparc64-smp #1 SMP Debian 5.9.6-1 (2020-11-08) sparc64 GNU/Linux
> 
> the machine is performing fine with network, disk and compiler usage on all 
> 32 CPUs.

Then you need to bisect the kernel as I don't have any means to reproduce
the issue.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-09 Thread Riccardo Mottola

Hi,

John Paul Adrian Glaubitz wrote:

while I was able to "install" correctly using a slightly older ISO, I do not
get a bootable system. The kernel appears to crash very early during boot.

I think this is more likely a hardware issue. We haven't seen any machines
crashing that early. Please make sure the RAM modules in this machine are
working properly.


I don't think so... I think it is a Kernel issue, since with kernel
5.9.0-2-sparc64-smp #1 SMP Debian 5.9.6-1 (2020-11-08) sparc64 GNU/Linux

the machine is performing fine with network, disk and compiler usage on
all 32 CPUs. I tried heavy loads of parallel compilations, git on large
repositories, and remote X applications at the same time, a combination
I know tends to expose issues, all without problems! Not a single error
in syslog.

Machine power-up and self-tests are fine too.

If I remember, there is a repository of various pre-compiled kernel 
versions: maybe there are some releases between the two kernels I can 
try and do some easy rough bisecting.


So I'd say RAM, CPUs, disk and Ethernet are working fine.

Riccardo



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2021-03-09 Thread John Paul Adrian Glaubitz
Hello Riccardo!

On 3/9/21 1:23 PM, Riccardo Mottola wrote:
> while I was able to "install" correctly using a slightly older ISO, I get not 
> a bootable
> system. The kernel appears to crash very early during boot.

I think this is more likely a hardware issue. We haven't seen any machines 
crashing that
early. Please make sure the RAM modules in this machine are working properly.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913