Re: nasty bug in /usr/sbin/grub-probe

2022-04-04 Thread Mike Tremaine
I have a working Ultra 5, if it is of any help...

root@xray:/boot/grub# uname -a
Linux xray 5.10.0-8-sparc64 #1 Debian 5.10.46-4 (2021-08-03) sparc64 GNU/Linux

> On Apr 3, 2022, at 8:28 PM, Stan Johnson  wrote:
> 
> On 4/3/22 11:04 AM, John Paul Adrian Glaubitz wrote:
>> Hi Stan!
>> 
>> On 4/3/22 16:39, Stan Johnson wrote:
>>> If this problem is expected to occur on an Ultra 5 or an Ultra 30,
>>> please let me know and I'll be happy to help with a git bisect, using a
>>> spare 9 GB disk for the installation.
>> 
>> I think you should see the issue on both the Ultra 5 and Ultra 30.
>> ...
> 
> I wasn't able to get my Ultra 5 working; the video signal kept cycling
> on and off for some reason, and the CD drive wasn't seen, though it was
> seen well enough to boot the installation and get to the point where it
> said no CD drive was found.
> 
> But I was able to confirm that the "grub-probe" bug doesn't seem to
> affect the Ultra 30.


root@xray:/boot/grub# /sbin/grub-probe --target=drive --device /dev/sda1
(hostdisk//dev/sda,sun1)

> 
> 
> -
> 
> There were a few oddities, but only #6 is serious (apparently a libc
> bug, not a kernel bug).
> 
> 1) I see that /dev/sda1 is mounted as /boot, not /boot/grub. So all the
> kernels will end up in /dev/sda1. I haven't tested how (or whether) that
> will affect kernels for other operating systems (e.g. Gentoo).
> 
> 2) Please confirm that grub-install never needs to be run. It appears
> not to be needed, since update-grub updates /boot/grub/grub.cfg directly.
> 
> 3) At system boot, when GRUB runs, it complains that it is out of
> memory, but it seems to work anyway.
> 
> 4) During installation, the disk partitioner said "The disk has 562253
> cylinders which is greater than the maximum of 65536.", but that error
> didn't seem to affect anything.


All 4 of the above have also been witnessed on my machine.

> 
> 6) In Xfce, a login at the console worked once, but it is now failing
> consistently (even after a reboot), with this error message in dmesg:
> 
> xfce4-session[3980]: segfault at 0 ip f8010263c9b4 (rpc f801020efbb8) sp 07feff8dc451 error 1 in libc-2.33.so[f801025b+164000]

I’m using lightdm and it seems to be fine, but support for the graphics card is
lacking, as explained in other emails to this list.

mgt@xray:~$ inxi -G
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] RV100 [Radeon 7000 / Radeon VE] driver: radeonfb v: N/A
   Display: server: X.org 1.20.11 driver: loaded: ati,fbdev unloaded: modesetting,radeon tty: 138x43
   Message: Advanced graphics data unavailable in console. Try -G --display








Re: nasty bug in /usr/sbin/grub-probe

2022-04-03 Thread John Paul Adrian Glaubitz
On 4/3/22 23:54, Dennis Clarke wrote:
> No, I am not. I am going with whatever is in the Makefile.
> 
> https://github.com/torvalds/linux/commit/fc7c028dcdbfe981bca75d2a7b95f363eb691ef3
> 
> So this was seen before regardless.

Please just follow my advice: use Linus' tree and don't use any LTS kernels,
which may have some changes from the latest kernels backported.

I know that cross-compiling and bisecting works 100%, so we really don't need to
argue about this. You didn't find some hidden bug that the daily CI hasn't caught.

The kernel developers are regularly rebuilding the kernel for most targets with
all the various standard kernel configuration presets, so it's extremely unlikely
that you'll run into a kernel that won't compile due to a bug.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: nasty bug in /usr/sbin/grub-probe

2022-04-03 Thread Dennis Clarke

On 4/3/22 12:13, John Paul Adrian Glaubitz wrote:

> On 4/3/22 17:19, Dennis Clarke wrote:
>> I am curious if you can get the linux-4.19.114 kernel to compile.  For me it
>> just blows up with :
>>
>> .
>> .
>> .
>> arch/sparc/kernel/mdesc.c: In function 'mdesc_node_by_name':
>> arch/sparc/kernel/mdesc.c:648:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
>>   648 | if (!strcmp(names + ep[ret].name_offset, name))
>>       |  ^
>> arch/sparc/kernel/mdesc.c:78:33: note: at offset 16 into source object 'mdesc' of size 16
>>    78 | struct mdesc_hdr    mdesc;
>>       | ^
>> arch/sparc/kernel/mdesc.c: In function 'mdesc_get_property':
>> arch/sparc/kernel/mdesc.c:693:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
>>   693 | if (!strcmp(names + ep->name_offset, name)) {
>>       |  ^
>> arch/sparc/kernel/mdesc.c:78:33: note: at offset 16 into source object 'mdesc' of size 16
>>    78 | struct mdesc_hdr    mdesc;
>>       | ^
>> arch/sparc/kernel/mdesc.c: In function 'mdesc_next_arc':
>> arch/sparc/kernel/mdesc.c:720:21: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
>>   720 | if (strcmp(names + ep->name_offset, arc_type))
>>       | ^
>> arch/sparc/kernel/mdesc.c:78:33: note: at offset 16 into source object 'mdesc' of size 16
>>    78 | struct mdesc_hdr    mdesc;
>>       | ^
>> cc1: all warnings being treated as errors
>> make[2]: *** [scripts/Makefile.build:304: arch/sparc/kernel/mdesc.o] Error 1
>> make[1]: *** [scripts/Makefile.build:544: arch/sparc/kernel] Error 2
>> make: *** [Makefile:1053: arch/sparc] Error 2
>>
>> Not sure what to make of that.
>
> Well, it's up right there, you are building with -Werror enabled. You have to
> disable that.



No, I am not. I am going with whatever is in the Makefile.

https://github.com/torvalds/linux/commit/fc7c028dcdbfe981bca75d2a7b95f363eb691ef3

So this was seen before regardless.


--
Dennis Clarke
RISC-V/SPARC/PPC/ARM/CISC
UNIX and Linux spoken
GreyBeard and suspenders optional



Re: nasty bug in /usr/sbin/grub-probe

2022-04-03 Thread John Paul Adrian Glaubitz
Hi Stan!

On 4/3/22 16:39, Stan Johnson wrote:
> If this problem is expected to occur on an Ultra 5 or an Ultra 30,
> please let me know and I'll be happy to help with a git bisect, using a
> spare 9 GB disk for the installation.

I think you should see the issue on both the Ultra 5 and Ultra 30.

Adrian




Re: nasty bug in /usr/sbin/grub-probe

2022-04-03 Thread John Paul Adrian Glaubitz
On 4/3/22 14:23, Dennis Clarke wrote:
> Are you sure of 4.19 ?  I see that 4.19.237 exists but I will guess the
> same bug exists there also. I was going to begin with 4.19.114 which was
> released 02-Apr-2020. A solid two years ago seems like as good a place
> to start as any. However building the kernel will require that I create
> an initrd and also update grub etc etc. I can do that manually and then
> bypass the "update-grub" process entirely.

Of course, there is a kernel 4.19 tag:

> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/refs/tags

4.19.114 is an LTS release of the 4.19.x series; you don't need it. Just use Linus'
tree and start bisecting between 4.19 and HEAD.

# git bisect start
# git bisect good v4.19
# git bisect bad HEAD

then compile and test. Mark good commits with "git bisect good", bad ones with
"git bisect bad".
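The good/bad loop above can be rehearsed on a throwaway repository before committing hours to kernel builds. The sketch below is a hypothetical toy, not part of the kernel workflow: the repository, the `state.txt` file and the `grep` check stand in for "cross-build, boot, run update-grub". It uses `git bisect run`, which lets a command's exit status give the verdict; on the real machine the verdict most likely has to be given by hand with `git bisect good`/`git bisect bad`, since a hang like the soft lockup in this thread is hard to detect from a script running on the same box.

```shell
# Toy rehearsal of "git bisect" on a throwaway repository (hypothetical data).
# Commit 1 plays the role of the known-good v4.19 tag, HEAD the known-bad kernel.
demo_bisect() (
    repo=$(mktemp -d) && cd "$repo" || exit 1
    git init -q 2>/dev/null
    git config user.email demo@example.invalid
    git config user.name demo
    git config commit.gpgsign false
    # Ten linear commits; the simulated regression lands in commit 7.
    for i in 1 2 3 4 5 6 7 8 9 10; do
        if [ "$i" -ge 7 ]; then echo bad > state.txt; else echo good > state.txt; fi
        echo "$i" > n.txt                # make sure every commit changes something
        git add state.txt n.txt
        git commit -q -m "commit $i"
    done
    git tag good-base HEAD~9             # commit 1
    git bisect start HEAD good-base >/dev/null 2>&1
    # Exit status 0 marks a commit good, non-zero (here 1) marks it bad.
    git bisect run grep -qx good state.txt >/dev/null 2>&1
    git log -1 --format=%s refs/bisect/bad   # subject of the first bad commit
    git bisect reset >/dev/null 2>&1
    cd / && rm -rf "$repo"
)

demo_bisect   # prints: commit 7
```

On the kernel tree the same skeleton is `git bisect start`, `git bisect bad HEAD`, `git bisect good v4.19`, with a manual verdict after each test boot.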

>> Not really. You cross-build the kernel, transfer it to the machine and see if
>> update-grub works.
> 
> Hold on.  This sounds like a chicken and egg scenario. The update-grub
> will fail every time. I will need to do the process by hand with an edit
> to grub.cfg and with the files needed dropped into /boot with the few
> kernel modules needed in /lib/modules/foo. That should be enough to at
> least boot.

Well, that depends on where update-grub fails. If it leaves a broken GRUB configuration
behind, it will be a bit tricky. But if it already fails before writing anything to
disk, you should be safe.

> I have already started the process but I am starting with 4.19.114.

That makes no sense. You're not on Linus' tree but on the stable tree.

Stable: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git

Linus' tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/

It makes no sense to test the stable tree since the different stable versions
lie on different branches, so you will never see the regression in the first
place.

You must use Linus' tree since that's the only tree with a linear history.
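The stable-versus-mainline point can be checked directly: a stable release such as v4.19.114 is tagged on its own branch (linux-4.19.y), so it is not an ancestor of mainline HEAD, while v4.19 is. In a clone that has both sets of tags fetched, `git merge-base --is-ancestor v4.19.114 HEAD` answers this. The toy below uses a throwaway repository and hypothetical tag names to show the same shape without a kernel checkout.

```shell
# Toy model: the default branch is "mainline", "stable" is forked at v4.19.
demo_ancestry() (
    repo=$(mktemp -d) && cd "$repo" || exit 1
    git init -q 2>/dev/null
    git config user.email demo@example.invalid
    git config user.name demo
    git config commit.gpgsign false
    git commit -q --allow-empty -m "v4.19 release"
    git tag v4.19
    git checkout -q -b stable
    git commit -q --allow-empty -m "backported fix"
    git tag v4.19.1                      # hypothetical stable tag
    git checkout -q -                    # back to the mainline branch
    git commit -q --allow-empty -m "v4.20 development"
    for t in v4.19 v4.19.1; do
        # --is-ancestor exits 0 if $t is reachable from HEAD
        if git merge-base --is-ancestor "$t" HEAD; then
            echo "$t: ancestor of HEAD"
        else
            echo "$t: not an ancestor of HEAD"
        fi
    done
    cd / && rm -rf "$repo"
)

demo_ancestry
```

The stable tag comes out as "not an ancestor of HEAD", which is why bisecting from it does not walk the mainline history where the regression was introduced.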

Adrian




Re: nasty bug in /usr/sbin/grub-probe

2022-04-03 Thread John Paul Adrian Glaubitz
On 4/3/22 17:19, Dennis Clarke wrote:
> I am curious if you can get the linux-4.19.114 kernel to compile.  For me it
> just blows up with :
>
> .
> .
> .
> arch/sparc/kernel/mdesc.c: In function 'mdesc_node_by_name':
> arch/sparc/kernel/mdesc.c:648:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
>   648 | if (!strcmp(names + ep[ret].name_offset, name))
>       |  ^
> arch/sparc/kernel/mdesc.c:78:33: note: at offset 16 into source object 'mdesc' of size 16
>    78 | struct mdesc_hdr    mdesc;
>       | ^
> arch/sparc/kernel/mdesc.c: In function 'mdesc_get_property':
> arch/sparc/kernel/mdesc.c:693:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
>   693 | if (!strcmp(names + ep->name_offset, name)) {
>       |  ^
> arch/sparc/kernel/mdesc.c:78:33: note: at offset 16 into source object 'mdesc' of size 16
>    78 | struct mdesc_hdr    mdesc;
>       | ^
> arch/sparc/kernel/mdesc.c: In function 'mdesc_next_arc':
> arch/sparc/kernel/mdesc.c:720:21: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
>   720 | if (strcmp(names + ep->name_offset, arc_type))
>       | ^
> arch/sparc/kernel/mdesc.c:78:33: note: at offset 16 into source object 'mdesc' of size 16
>    78 | struct mdesc_hdr    mdesc;
>       | ^
> cc1: all warnings being treated as errors
> make[2]: *** [scripts/Makefile.build:304: arch/sparc/kernel/mdesc.o] Error 1
> make[1]: *** [scripts/Makefile.build:544: arch/sparc/kernel] Error 2
> make: *** [Makefile:1053: arch/sparc] Error 2
>
>
> Not sure what to make of that.

Well, it's up right there, you are building with -Werror enabled. You have to 
disable that.

> My intuition here tells me the bug is likely in arch/sparc/kernel/syscalls.S
> which changed slightly since the 4.19.114 days. Looking
> previous I see no change in that source file. Regardless, this is just a
> hunch without a shred of proof. Yet.

There is no bug, just your compiler set to treat warnings as errors, as can be seen
from the error message above. You have to disable CONFIG_WERROR.
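Which switch applies depends on the tree. CONFIG_WERROR only exists from v5.15 on; there, something like `./scripts/config --disable WERROR` followed by `make olddefconfig` turns it off. A 4.19 tree predates that option, which fits Dennis's reply that he is building with "whatever is in the Makefile": the flag would then come from per-directory Makefiles. The sketch below assumes, without having checked every file, that it appears there as a standalone `-Werror` token (for example `ccflags-y := -Werror`) and strips that token with GNU sed; the function name and the default directory are hypothetical.

```shell
# Strip a standalone "-Werror" from every Makefile under a directory.
# Assumption: the flag appears as a whitespace-separated token, e.g.
#   ccflags-y := -Werror
# Variants such as "-Werror=foo" are deliberately left alone.
strip_werror() {
    dir=${1:-arch/sparc}
    grep -rlE --include=Makefile '[[:space:]]-Werror([[:space:]]|$)' "$dir" |
    while IFS= read -r mk; do
        # Remove the token and the whitespace in front of it (GNU sed -i).
        sed -E -i 's/[[:space:]]+-Werror([[:space:]]|$)/\1/g' "$mk"
        echo "edited: $mk"
    done
}
```

Run it from the top of the kernel tree before building, and undo it with `git checkout -- arch/sparc` before the next `git bisect good`/`git bisect bad`, so the working tree is clean when git checks out the next commit.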

Adrian




Re: nasty bug in /usr/sbin/grub-probe

2022-04-03 Thread Dennis Clarke



> I am curious if you can get the linux-4.19.114 kernel to compile.  For
> me it just blows up with :
>
> .
> .
> .
> arch/sparc/kernel/mdesc.c: In function 'mdesc_node_by_name':
> arch/sparc/kernel/mdesc.c:648:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
>   648 | if (!strcmp(names + ep[ret].name_offset, name))
>       |  ^
> arch/sparc/kernel/mdesc.c:78:33: note: at offset 16 into source object 'mdesc' of size 16
>    78 | struct mdesc_hdr    mdesc;
>       | ^
> arch/sparc/kernel/mdesc.c: In function 'mdesc_get_property':
> arch/sparc/kernel/mdesc.c:693:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
>   693 | if (!strcmp(names + ep->name_offset, name)) {
>       |  ^
> arch/sparc/kernel/mdesc.c:78:33: note: at offset 16 into source object 'mdesc' of size 16
>    78 | struct mdesc_hdr    mdesc;
>       | ^
> arch/sparc/kernel/mdesc.c: In function 'mdesc_next_arc':
> arch/sparc/kernel/mdesc.c:720:21: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
>   720 | if (strcmp(names + ep->name_offset, arc_type))
>       | ^
> arch/sparc/kernel/mdesc.c:78:33: note: at offset 16 into source object 'mdesc' of size 16
>    78 | struct mdesc_hdr    mdesc;
>       | ^
> cc1: all warnings being treated as errors
> make[2]: *** [scripts/Makefile.build:304: arch/sparc/kernel/mdesc.o] Error 1
> make[1]: *** [scripts/Makefile.build:544: arch/sparc/kernel] Error 2
> make: *** [Makefile:1053: arch/sparc] Error 2
>
> Not sure what to make of that.



Drat :

https://github.com/torvalds/linux/commit/fc7c028dcdbfe981bca75d2a7b95f363eb691ef3


So something after 4.19.114 may work.






Re: nasty bug in /usr/sbin/grub-probe

2022-04-03 Thread Dennis Clarke

On 4/3/22 10:39, Stan Johnson wrote:

> Hello Adrian and Dennis,
>
> If this problem is expected to occur on an Ultra 5 or an Ultra 30,
> please let me know and I'll be happy to help with a git bisect, using a
> spare 9 GB disk for the installation.



The Ultra 5 is even older. At least I think so. There were two flavours
of those bizarre PC style machines where one was a tower looking thing
called the Ultra 10 and it could have a Creator3D OpenGL hardware frame
buffer whereas the Ultra 5 was a modified weird pizza box thing. Both
are from well before the UltraSparc III existed.

Please feel free to jump in.

I am curious if you can get the linux-4.19.114 kernel to compile.  For 
me it just blows up with :


.
.
.
arch/sparc/kernel/mdesc.c: In function 'mdesc_node_by_name':
arch/sparc/kernel/mdesc.c:648:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
  648 | if (!strcmp(names + ep[ret].name_offset, name))
      |  ^
arch/sparc/kernel/mdesc.c:78:33: note: at offset 16 into source object 'mdesc' of size 16
   78 | struct mdesc_hdr    mdesc;
      | ^
arch/sparc/kernel/mdesc.c: In function 'mdesc_get_property':
arch/sparc/kernel/mdesc.c:693:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
  693 | if (!strcmp(names + ep->name_offset, name)) {
      |  ^
arch/sparc/kernel/mdesc.c:78:33: note: at offset 16 into source object 'mdesc' of size 16
   78 | struct mdesc_hdr    mdesc;
      | ^
arch/sparc/kernel/mdesc.c: In function 'mdesc_next_arc':
arch/sparc/kernel/mdesc.c:720:21: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
  720 | if (strcmp(names + ep->name_offset, arc_type))
      | ^
arch/sparc/kernel/mdesc.c:78:33: note: at offset 16 into source object 'mdesc' of size 16
   78 | struct mdesc_hdr    mdesc;
      | ^
cc1: all warnings being treated as errors
make[2]: *** [scripts/Makefile.build:304: arch/sparc/kernel/mdesc.o] Error 1
make[1]: *** [scripts/Makefile.build:544: arch/sparc/kernel] Error 2
make: *** [Makefile:1053: arch/sparc] Error 2


Not sure what to make of that.

My intuition here tells me the bug is likely in
arch/sparc/kernel/syscalls.S which changed slightly since the 4.19.114
days. Looking
previous I see no change in that source file. Regardless, this is just a
hunch without a shred of proof. Yet.





Re: nasty bug in /usr/sbin/grub-probe

2022-04-03 Thread Stan Johnson
Hello Adrian and Dennis,

If this problem is expected to occur on an Ultra 5 or an Ultra 30,
please let me know and I'll be happy to help with a git bisect, using a
spare 9 GB disk for the installation.

-Stan

-

On 4/3/22 5:57 AM, John Paul Adrian Glaubitz wrote:
> Hello!
> 
> On 4/3/22 13:42, Dennis Clarke wrote:
>>> But since you seem to have a reliable reproducer, you can start trying to bisect
>>> the kernel to find the commit that introduced this regression.
>>
>> That will be nearly impossible. I can not even recall when the bug first
>> appeared or when was the last time that I could run update-grub without
>> the machine locking up. At least two years now. Maybe three.
> 
> What do you mean is impossible? Bisecting the bug or the fact that it is
> a kernel bug? I know very well it's a kernel bug because it does not occur
> when using the 4.19 kernel on any of the affected SPARCs and it does not
> occur on any of the newer SPARCs with a current kernel.
> 
> The SPARC T2 and T5 we are using don't have the problem at all, for example.
> 
>> Also this is an even older UltraSparc IIi type machine. Really I should
>> have tossed it out long ago but the next machine I have handy is a
>> Fujitsu M3000 unit and I thought I had heard it was impossible to get
>> Linux on such a beast for unknown reasons. Could be myth or rumour but I
>> thought the M3000 was somehow "special". The larger M4000 seems to be
>> fine but those are just nasty large beasts to run in a home lab.
>>
>> Dragging the deep waters looking for that kernel bug will take a lot of
>> time. Possibly even some luck.
> 
> Not really. You cross-build the kernel, transfer it to the machine and see if
> update-grub works. If it doesn't, you mark the commit as bad. If it does, you
> mark the commit as good. You start from a good known working kernel such as
> 4.19.
> 
> But I can do it myself if I find the time, I have an Ultra 45 that can be used
> for that. Thought it would just be nice if I can get a helping hand, especially
> since cross-compiling and bisecting the kernel isn't really hard, it just takes
> time.
> 
> Adrian
> 



Re: nasty bug in /usr/sbin/grub-probe

2022-04-03 Thread Dennis Clarke

On 4/3/22 07:57, John Paul Adrian Glaubitz wrote:

> Hello!
>
> On 4/3/22 13:42, Dennis Clarke wrote:
>>> But since you seem to have a reliable reproducer, you can start trying to bisect
>>> the kernel to find the commit that introduced this regression.
>>
>> That will be nearly impossible. I can not even recall when the bug first
>> appeared or when was the last time that I could run update-grub without
>> the machine locking up. At least two years now. Maybe three.
>
> What do you mean is impossible? Bisecting the bug or the fact that it is
> a kernel bug? I know very well it's a kernel bug because it does not occur
> when using the 4.19 kernel on any of the affected SPARCs and it does not
> occur on any of the newer SPARCs with a current kernel.


Are you sure of 4.19 ?  I see that 4.19.237 exists but I will guess the
same bug exists there also. I was going to begin with 4.19.114 which was
released 02-Apr-2020. A solid two years ago seems like as good a place
to start as any. However building the kernel will require that I create
an initrd and also update grub etc etc. I can do that manually and then
bypass the "update-grub" process entirely.



> The SPARC T2 and T5 we are using don't have the problem at all, for example.


>> Also this is an even older UltraSparc IIi type machine. Really I should
>> have tossed it out long ago but the next machine I have handy is a
>> Fujitsu M3000 unit and I thought I had heard it was impossible to get
>> Linux on such a beast for unknown reasons. Could be myth or rumour but I
>> thought the M3000 was somehow "special". The larger M4000 seems to be
>> fine but those are just nasty large beasts to run in a home lab.
>>
>> Dragging the deep waters looking for that kernel bug will take a lot of
>> time. Possibly even some luck.


> Not really. You cross-build the kernel, transfer it to the machine and see if
> update-grub works.


Hold on.  This sounds like a chicken and egg scenario. The update-grub
will fail every time. I will need to do the process by hand with an edit
to grub.cfg and with the files needed dropped into /boot with the few
kernel modules needed in /lib/modules/foo. That should be enough to at
least boot.
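For the manual route described above, a hand-written entry appended to grub.cfg could look like the output of this sketch. Everything in it is an assumption to adapt: the helper name, Debian-style file names (`vmlinuz-<version>`, `initrd.img-<version>`), and that GRUB's `root` variable already points at the partition holding the kernels (if it does not, a `set root=...` or `search` line would be needed). /dev/sda4 in the usage line is taken from the shutdown log in this thread, where it holds /.

```shell
# Print a minimal GRUB menu entry for a hand-installed test kernel.
# Hypothetical helper; assumes Debian-style names vmlinuz-<ver>/initrd.img-<ver>
# and that GRUB's $root is already the partition holding them.
# Arguments: kernel version, root device, optional path prefix
# ("" when /boot is a separate partition, "/boot" when it is not).
grub_entry() {
    version=$1
    rootdev=$2
    prefix=${3:-}
    cat <<EOF
menuentry 'Linux $version (bisect test)' {
    linux $prefix/vmlinuz-$version root=$rootdev ro
    initrd $prefix/initrd.img-$version
}
EOF
}

# Usage (hypothetical version string):
#   grub_entry 5.17.0-bisect /dev/sda4 >> /boot/grub/grub.cfg
```

A later update-grub that does succeed regenerates grub.cfg from scratch, so a hand-added entry like this will not survive it.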



> But I can do it myself if I find the time, I have an Ultra 45 that can be used
> for that. Thought it would just be nice if I can get a helping hand, especially
> since cross-compiling and bisecting the kernel isn't really hard, it just takes
> time.


Right. The one thing that no one can save or store or get more.  Time.

I have already started the process but I am starting with 4.19.114.






Re: nasty bug in /usr/sbin/grub-probe

2022-04-03 Thread John Paul Adrian Glaubitz
Hello!

On 4/3/22 13:42, Dennis Clarke wrote:
>> But since you seem to have a reliable reproducer, you can start trying to bisect
>> the kernel to find the commit that introduced this regression.
> 
> That will be nearly impossible. I can not even recall when the bug first
> appeared or when was the last time that I could run update-grub without
> the machine locking up. At least two years now. Maybe three.

What do you mean is impossible? Bisecting the bug or the fact that it is
a kernel bug? I know very well it's a kernel bug because it does not occur
when using the 4.19 kernel on any of the affected SPARCs and it does not
occur on any of the newer SPARCs with a current kernel.

The SPARC T2 and T5 we are using don't have the problem at all, for example.

> Also this is an even older UltraSparc IIi type machine. Really I should
> have tossed it out long ago but the next machine I have handy is a
> Fujitsu M3000 unit and I thought I had heard it was impossible to get
> Linux on such a beast for unknown reasons. Could be myth or rumour but I
> thought the M3000 was somehow "special". The larger M4000 seems to be
> fine but those are just nasty large beasts to run in a home lab.
> 
> Dragging the deep waters looking for that kernel bug will take a lot of
> time. Possibly even some luck.

Not really. You cross-build the kernel, transfer it to the machine and see if
update-grub works. If it doesn't, you mark the commit as bad. If it does, you
mark the commit as good. You start from a good known working kernel such as
4.19.
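The build half of one bisect step could be scripted roughly as follows on a faster host. This is a sketch under assumptions: Debian's sparc64 cross compiler (`sparc64-linux-gnu-` prefix) is installed, a `.config` copied from the SPARC machine is already in the tree, and the kernel's `bindeb-pkg` target is used to get .deb packages to copy over. The function name and the `MAKE`/`JOBS` overrides are hypothetical conveniences, with `MAKE=echo` giving a dry run. On Debian, installing a kernel package normally runs update-grub through a kernel post-install hook, which is the very step that hangs while an affected kernel is running.

```shell
# Cross-build one bisect step for sparc64 (run at the top of the kernel tree).
# Assumptions: Debian cross toolchain prefix sparc64-linux-gnu-, and a .config
# taken from the target machine already present in the tree.
# MAKE and JOBS are optional overrides; MAKE=echo prints the commands instead.
# Step 1 answers any new Kconfig questions with their defaults; step 2 builds
# the kernel and modules and packages them as .deb files in the parent directory.
cross_build_step() {
    make_cmd=${MAKE:-make}
    jobs=${JOBS:-$(nproc)}
    $make_cmd ARCH=sparc64 CROSS_COMPILE=sparc64-linux-gnu- olddefconfig &&
    $make_cmd ARCH=sparc64 CROSS_COMPILE=sparc64-linux-gnu- -j"$jobs" bindeb-pkg
}

# Dry run:
#   MAKE=echo JOBS=4 cross_build_step
```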

But I can do it myself if I find the time, I have an Ultra 45 that can be used
for that. Thought it would just be nice if I can get a helping hand, especially
since cross-compiling and bisecting the kernel isn't really hard, it just takes
time.

Adrian




Re: nasty bug in /usr/sbin/grub-probe

2022-04-03 Thread Dennis Clarke

On 4/2/22 03:30, John Paul Adrian Glaubitz wrote:

> Hello Dennis!
>
> On 4/2/22 03:34, Dennis Clarke wrote:
>> I am not so sure about this yet until I can rebuild the required grub
>> binaries with full debug info. For at least a year ( or more ) I have
>> seen "really bad things"(tm) happen when I try to make a new initrd on
>> sparc64. Generally the machine seems to pack up and go away with nary a
>> single packet out to the world. To look into this problem I use a serial
>> attached good old 9600 baud console and watch what happens when I try to
>> do a make install from within the Linux source tree :
>> (...)
>> So therefore I think that there is a bug in /usr/sbin/grub-probe and it
>> really kills the whole "make install" process from within the Linux
>> kernel source tree or any other way you choose to run it.
>>
>> Has anyone else seen this ?
>
> This isn't a bug in GRUB but a kernel bug that affects older SPARC machines
> like your UltraSPARC IIIi. Unfortunately, no one has had the time yet to bisect
> this issue.
>
> But since you seem to have a reliable reproducer, you can start trying to bisect
> the kernel to find the commit that introduced this regression.



That will be nearly impossible. I can not even recall when the bug first
appeared or when was the last time that I could run update-grub without
the machine locking up. At least two years now. Maybe three.

Also this is an even older UltraSparc IIi type machine. Really I should
have tossed it out long ago but the next machine I have handy is a
Fujitsu M3000 unit and I thought I had heard it was impossible to get
Linux on such a beast for unknown reasons. Could be myth or rumour but I
thought the M3000 was somehow "special". The larger M4000 seems to be
fine but those are just nasty large beasts to run in a home lab.

Dragging the deep waters looking for that kernel bug will take a lot of
time. Possibly even some luck.





Re: nasty bug in /usr/sbin/grub-probe

2022-04-02 Thread John Paul Adrian Glaubitz
Hello Dennis!

On 4/2/22 03:34, Dennis Clarke wrote:
> I am not so sure about this yet until I can rebuild the required grub
> binaries with full debug info. For at least a year ( or more ) I have
> seen "really bad things"(tm) happen when I try to make a new initrd on
> sparc64. Generally the machine seems to pack up and go away with nary a
> single packet out to the world. To look into this problem I use a serial
> attached good old 9600 baud console and watch what happens when I try to
> do a make install from within the Linux source tree :
> (...)
> So therefore I think that there is a bug in /usr/sbin/grub-probe and it
> really kills the whole "make install" process from within the Linux
> kernel source tree or any other way you choose to run it.
> 
> Has anyone else seen this ?

This isn't a bug in GRUB but a kernel bug that affects older SPARC machines
like your UltraSPARC IIIi. Unfortunately, no one has had the time yet to bisect
this issue.

But since you seem to have a reliable reproducer, you can start trying to bisect
the kernel to find the commit that introduced this regression.

Adrian




Re: nasty bug in /usr/sbin/grub-probe

2022-04-01 Thread Dennis Clarke

On 4/1/22 22:03, Stan Johnson wrote:

> Hi Dennis,
>
> Unless you already know that your system's memory is ok...



Sparc machines generally have ECC memory and the diagnostics are quite
well trusted.

However ... just for giggles ( yes the battery is crap ) :


root@hades:~#
root@hades:~# shutdown -h 'now'
root@hades:~# [  OK  ] Removed slic Stopping Rescue Shell...
 Stopping Load/Save Random Seed...
[  OK  ] Stopped Rescue Shell.
[  OK  ] Stopped target System Initialization.
[  OK  ] Unset automount Arbitrary … File System Automount Point.
[  OK  ] Stopped target Local Encrypted Volumes.
[  OK  ] Stopped Dispatch Password … to Console Directory Watch.
[  OK  ] Stopped target Local Integrity Protected Volumes.
[  OK  ] Stopped target Swaps.
[  OK  ] Stopped target Local Verity Protected Volumes.
 Deactivating swap /dev/dis…_3CD0ZHE27120K5BC-part2...
 Stopping Record System Boot/Shutdown in UTMP...
[  OK  ] Deactivated swap /dev/disk…00:01:02.0-scsi-0:0:0:0-part2.
[  OK  ] Deactivated swap /dev/disk…6G_3CD0ZHE27120K5BC-part2.
[  OK  ] Deactivated swap /dev/sda2.
[  OK  ] Stopped ifup for eth0.
[  OK  ] Stopped Load/Save Random Seed.
[  OK  ] Deactivated swap /dev/disk…2-e35a-4ff4-b7f5-f7d028c4ca2c.
[  OK  ] Stopped Record System Boot/Shutdown in UTMP.
[  OK  ] Stopped Apply Kernel Variables.
[  OK  ] Stopped Load Kernel Modules.
[  OK  ] Stopped Create Volatile Files and Directories.
[  OK  ] Stopped target Local File Systems.
 Unmounting /boot...
 Unmounting /home...
 Unmounting /run/credentials/systemd-sysusers.service...
 Unmounting /usr/local...
[  OK  ] Unmounted /boot.
[  OK  ] Unmounted /home.
[  OK  ] Unmounted /run/credentials/systemd-sysusers.service.
[  OK  ] Unmounted /usr/local.
[  OK  ] Reached target Unmount All Filesystems.
[  OK  ] Stopped File System Check …6-88fb-4134-88c5-485c34d4614c.
[  OK  ] Stopped File System Check …
[  OK  ] Stopped File System Check …9-00f1-497b-8b83-d726e914d044.
[  OK  ] Removed slice Slice /system/systemd-fsck.
[  OK  ] Stopped target Preparation for Local File Systems.
[  OK  ] Stopped Create Static Device Nodes in /dev.
[  OK  ] Stopped Create System Users.
[  OK  ] Stopped Remount Root and Kernel File Systems.
[  OK  ] Reached target System Shutdown.
[  OK  ] Reached target Late Shutdown Services.
[  OK  ] Finished System Power Off.
[  OK  ] Reached target System Power Off.
[ 4732.049137] systemd-shutdown[1]: Syncing filesystems and block devices.
[ 4732.541119] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[ 4732.650924] systemd-journald[183]: Received SIGTERM from PID 1 (systemd-shutdow).
[ 4732.886493] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[ 4732.995664] systemd-shutdown[1]: Unmounting file systems.
[ 4733.075318] [308]: Remounting '/' read-only with options 'errors=remount-ro'.
[ 4733.223777] EXT4-fs (sda4): re-mounted. Opts: errors=remount-ro. Quota mode: none.
[ 4733.335759] systemd-shutdown[1]: All filesystems unmounted.
[ 4733.409207] systemd-shutdown[1]: Deactivating swaps.
[ 4733.474932] systemd-shutdown[1]: All swaps deactivated.
[ 4733.543741] systemd-shutdown[1]: Detaching loop devices.
[ 4733.614459] systemd-shutdown[1]: All loop devices detached.
[ 4733.687858] systemd-shutdown[1]: Stopping MD devices.
[ 4733.755436] systemd-shutdown[1]: All MD devices stopped.
[ 4733.825318] systemd-shutdown[1]: Detaching DM devices.
[ 4733.893652] systemd-shutdown[1]: All DM devices detached.
[ 4733.964778] systemd-shutdown[1]: All filesystems, swaps, loop devices, MD devices and DM devices detached.
[ 4734.136742] systemd-shutdown[1]: Syncing filesystems and block devices.
[ 4734.227936] systemd-shutdown[1]: Powering off.
[ 4734.286653] sd 0:0:1:0: [sdb] Synchronizing SCSI cache
[ 4734.354622] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 4734.423032] reboot: Power down
lom>
LOM event: power off
lom>

Get a coffee or whiskey or both and let the electrons settle...


lom>
LOM event: power on

ps/2 kbd check: ...00fe
Checking Sun KB Done
%o0 = ..0055.4001

Executing Power On SelfTest


SPARCengine(tm)Ultra CP 1500 POST 1.17 ME created 03/06/00
 WARRNING: NVRAM battery is either bad or just replaced!
Time Stamp [hour:min:sec] 33:30:02

Init POST BSS
Init System BSS

Probing system keyboard : Done
DMMU TLB Tags
DMMU TLB Tag Access Test
DMMU TLB RAM
DMMU TLB RAM Access Test
Ecache Tests
Probe Ecache
ecache_size = 0x0020
Ecache RAM Addr Test
Ecache Tag Addr Test
Ecache RAM Test
Ecache Tag Test
Invalidate Ecache Tags
All CPU Basic Tests
V9 Instruction Test
CPU Tick and Tick Compare Reg Test
CPU Soft Trap Test
CPU Softint Reg and Int Test
All Basic MMU Tests
DMMU Primary Context Reg Test
DMMU Secondary Context Reg Test
DMMU TSB Reg Test
DMMU Tag Access Reg Test
DMMU 

nasty bug in /usr/sbin/grub-probe

2022-04-01 Thread Dennis Clarke



I am not so sure about this yet until I can rebuild the required grub
binaries with full debug info. For at least a year ( or more ) I have
seen "really bad things"(tm) happen when I try to make a new initrd on
sparc64. Generally the machine seems to pack up and go away with nary a
single packet out to the world. To look into this problem I use a serial
attached good old 9600 baud console and watch what happens when I try to
do a make install from within the Linux source tree :


root@hades:~# [80684.783560] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [grub-probe:47798]
[80684.880888] Modules linked in: sg(E) envctrl(E) display7seg(E) flash(E) fuse(E) drm(E) drm_panel_orientation_quirks(E) i2c_core(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) crc32c_generic(E) sd_mod(E) t10_pi(E) crc_t10dif(E) crct10dif_generic(E) crct10dif_common(E) sym53c8xx(E) scsi_transport_spi(E) scsi_mod(E) scsi_common(E) sunhme(E)
[80685.320414] CPU: 0 PID: 47798 Comm: grub-probe Tainted: G E 5.16.0-6-sparc64 #1  Debian 5.16.18-1
[80685.454308] TSTATE: 11001607 TPC: 009555d0 TNPC: 009555d4 Y: Tainted: G E
[80685.601952] TPC: 
[80685.652373] g0:  g1: 0098 g2:  g3: 0714ebe0
[80685.766856] g4: f8000137ae40 g5: 62462570 g6: f800097a8000 g7: 00a88958
[80685.881335] o0: 00fa7a08 o1: f800097ab8ec o2: f82f72d0 o3: 0001
[80685.995814] o4: f887f968 o5:  sp: f800097aaf81 ret_pc: 009555a0
[80686.114867] RPC: 
[80686.165292] l0: f81b9800 l1: 00fa7800 l2: 00685d20 l3: 0006dc08077e
[80686.279793] l4: 0470 l5: ff9c l6: f800097a8000 l7: 0067e1a0
[80686.394261] i0: f887f968 i1: f80001121960 i2: 00fa7800 i3: 00fa7a20
[80686.508740] i4: 00ec i5: 10074830 i6: f800097ab031 i7: 00686958
[80686.623218] I7: 
[80686.674783] Call Trace:
[80686.706910] [<00686958>] chrdev_open+0x98/0x1c0
[80686.775642] [<0067bef0>] do_dentry_open+0x170/0x420
[80686.848946] [<0067da08>] vfs_open+0x28/0x40
[80686.913099] [<00693500>] path_openat+0xb20/0x10e0
[80686.984115] [<006948c0>] do_filp_open+0x60/0x100
[80687.053986] [<0067dcf0>] do_sys_openat2+0x70/0x180
[80687.126146] [<0067e1e8>] sys_openat+0x48/0xc0
[80687.192588] [<00406174>] linux_sparc_syscall+0x34/0x44
[80708.777890] watchdog: BUG: soft lockup - CPU#0 stuck for 48s! [grub-probe:47798]
[80708.875209] Modules linked in: sg(E) envctrl(E) display7seg(E) flash(E) fuse(E) drm(E) drm_panel_orientation_quirks(E) i2c_core(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) crc32c_generic(E) sd_mod(E) t10_pi(E) crc_t10dif(E) crct10dif_generic(E) crct10dif_common(E) sym53c8xx(E) scsi_transport_spi(E) scsi_mod(E) scsi_common(E) sunhme(E)
[80709.314756] CPU: 0 PID: 47798 Comm: grub-probe Tainted: G EL 5.16.0-6-sparc64 #1  Debian 5.16.18-1
[80709.448637] TSTATE: 11001607 TPC: 009555d0 TNPC: 009555d4 Y: Tainted: G EL
[80709.596283] TPC: 
[80709.646701] g0:  g1: 0098 g2:  g3: 0714ebe0
[80709.761187] g4: f8000137ae40 g5: 62462570 g6: f800097a8000 g7: 00a88958
[80709.875665] o0: 00fa7a08 o1: f800097ab8ec o2: f82f72d0 o3: 0001
[80709.990145] o4: f887f968 o5:  sp: f800097aaf81 ret_pc: 009555a0
[80710.109199] RPC: 
[80710.159618] l0: f81b9800 l1: 00fa7800 l2: 00685d20 l3: 0006dc08077e
[80710.274105] l4: 0470 l5: ff9c l6: f800097a8000 l7: 0067e1a0
[80710.388583] i0: f887f968 i1: f80001121960 i2: 00fa7800 i3: 00fa7a20
[80710.503061] i4: 00ec i5: 10074830 i6: f800097ab031 i7: 00686958
[80710.617539] I7: 
[80710.669104] Call Trace:
[80710.701126] [<00686958>] chrdev_open+0x98/0x1c0
[80710.769859] [<0067bef0>] do_dentry_open+0x170/0x420
[80710.843164] [<0067da08>] vfs_open+0x28/0x40
[80710.907316] [<00693500>] path_openat+0xb20/0x10e0
[80710.978333] [<006948c0>] do_filp_open+0x60/0x100
[80711.048204] [<0067dcf0>] do_sys_openat2+0x70/0x180
[80711.120364] [<0067e1e8>] sys_openat+0x48/0xc0
[80711.186805] [<00406174>] linux_sparc_syscall+0x34/0x44
[80732.772223] watchdog: BUG: soft lockup - CPU#0 stuck for 71s! [grub-probe:47798]
[80732.869531] Modules linked in: sg(E) envctrl(E) display7seg(E) 
flash(E) fuse(E) drm(E) drm_panel_orientation_quirks(E) i2c_core(E) 
configfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) 
mbcache(E) jbd2(E) crc32c_generic(E)