Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!

2023-05-24 Thread Salvatore Bonaccorso
Hi Paul,

On Sun, Jul 03, 2022 at 09:57:59PM +0200, Paul Gevers wrote:
> Hi all,
> 
> Just a minor follow-up. I just had to restart one of my arm64 workers again.
> 
> root@ci-worker-arm64-05:~# uname -a
> Linux ci-worker-arm64-05 5.10.0-15-arm64 #1 SMP Debian 5.10.120-1
> (2022-06-09) aarch64 GNU/Linux
> 
> Anything you want me to extract from the current logs?

Following up on our short discussion this morning: assuming you have not
seen the issue in recent updates and runs, can we close this bug?
(Still sad that we could not isolate the cause ...)

Regards,
Salvatore



Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!

2022-07-03 Thread Paul Gevers

Hi all,

Just a minor follow-up. I just had to restart one of my arm64 workers again.

root@ci-worker-arm64-05:~# uname -a
Linux ci-worker-arm64-05 5.10.0-15-arm64 #1 SMP Debian 5.10.120-1 (2022-06-09) aarch64 GNU/Linux


Anything you want me to extract from the current logs?

Paul




Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!

2022-06-23 Thread Diederik de Haas
Hi Paul,

On Thursday, 23 June 2022 10:44:49 CEST Paul Gevers wrote:
> Hi Diederik,
> 
> On 22-06-2022 23:15, Diederik de Haas wrote:
> > Hmm ...interesting. AFAIK that is a watchdog's task.
> > 
> > On Saturday, 4 December 2021 22:44:38 CEST Paul Gevers wrote:
> >> I noticed in the logs that *after* the reported kernel bug but before
> >> the actual hang, I see multiple instances of:
> >> watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [apt-get:2204621]
> >> and
> >> watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kcompactd0:40]
> >> on ci-worker-arm64-07.
> > 
> > And here is where I saw it. (My watchdog issue doesn't cause a hang btw)
> 
> That might be, but this doesn't result in a successful reboot (of the 
> system, maybe you meant a reboot of the core?).

That was actually my point :-)
AFAIK (which is limited), the whole point of the watchdog is to reboot (the
system, I'd guess) when things get stuck.
That this didn't happen is worth noting.
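
A note on the watchdog behaviour: as far as I know, the kernel's soft lockup
detector only logs the "BUG: soft lockup" message by default. It panics only
if kernel.softlockup_panic is set, and a panic leads to a reboot only if
kernel.panic is non-zero. The sketch below reads those sysctls from
/proc/sys/kernel and reports whether a soft lockup or an oops/BUG would end
in an automatic reboot. The helper names are made up for illustration, and
nothing in this thread says which values the CI workers actually use.

```python
from pathlib import Path

# Location of the kernel sysctls on a running Linux system.
SYSCTL_DIR = Path("/proc/sys/kernel")


def read_sysctl(name, base=SYSCTL_DIR):
    """Return a kernel sysctl as an int, or None if it cannot be read."""
    try:
        return int((Path(base) / name).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def reboots_after(event_panics, panic_timeout):
    """True if an event that panics the kernel also leads to a reboot.

    kernel.panic is a timeout in seconds: 0 means stay halted, any
    other value means reboot (negative values reboot immediately).
    """
    return bool(event_panics) and panic_timeout not in (None, 0)


def report(base=SYSCTL_DIR):
    """Summarise what happens on a soft lockup and on an oops/BUG."""
    timeout = read_sysctl("panic", base)
    return {
        "soft lockup reboots": reboots_after(
            read_sysctl("softlockup_panic", base), timeout),
        "oops/BUG reboots": reboots_after(
            read_sysctl("panic_on_oops", base), timeout),
    }


if __name__ == "__main__":
    for what, answer in report().items():
        print(f"{what}: {answer}")
```

If both answers are False, a hang after the BUG without any reboot would be
the expected outcome rather than a second failure.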

> > If you have access to the host, APT should be able to tell you.
> 
> Depends on what you mean with "the host". Our VM (our host) is 
> provisioned by Huawei (their host). I have access to our host.

To talk in Xen terms, I meant dom0 as host. I'd guess that qemu would create a 
VM from that host. (and in Xen terms, the created VM would be a domU).

> root@ci-worker-arm64-02:~# apt list *qemu* --installed
> qemu-utils/stable-security,now 1:5.2+dfsg-11+deb11u2 arm64 
> [installed,automatic]

Maybe things work differently wrt qemu, but that's the version I was looking for.

> > Via sources.list.erb I found that "<%= node['debian_release']
> > %>-backports" gets enabled, which I assume results in Stable-backports.
> 
> Correct, but currently we don't install anything from there.

Ack. It is what I thought (but didn't know).

> > It appears that various tools get installed (but I don't see qemu
> > mentioned (explicitly), but I do see 'virt-what' and the package
> > description seems to indicate it may be useful to figure out detail of
> > the VM.
> 
> root@ci-worker-arm64-02:~# virt-what
> qemu
> root@ci-worker-arm64-02:~# virt-what --version
> 1.19

Less useful than I'd hoped, but you had already found the qemu version earlier :-)



Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!

2022-06-23 Thread Paul Gevers

Hi Diederik,

On 22-06-2022 23:15, Diederik de Haas wrote:
> Hmm ...interesting. AFAIK that is a watchdog's task.
> And I was certain I saw sth about it as I've seen (a yet to be reported) an
> issue related to watchdog myself, hence why I remembered it.
>
> On Saturday, 4 December 2021 22:44:38 CEST Paul Gevers wrote:
> > I noticed in the logs that *after* the reported kernel bug but before
> > the actual hang, I see multiple instances of:
> > watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [apt-get:2204621]
> > and
> > watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kcompactd0:40]
> > on ci-worker-arm64-07.
>
> And here is where I saw it. (My watchdog issue doesn't cause a hang btw)

That might be, but this doesn't result in a successful reboot (of the
system, maybe you meant a reboot of the core?).

> If you have access to the host, APT should be able to tell you.

Depends on what you mean with "the host". Our VM (our host) is
provisioned by Huawei (their host). I have access to our host.

root@ci-worker-arm64-02:~# apt list *qemu* --installed
Listing... Done
qemu-utils/stable-security,now 1:5.2+dfsg-11+deb11u2 arm64 [installed,automatic]

N: There is 1 additional version. Please use the '-a' switch to see it

> Via sources.list.erb I found that "<%= node['debian_release'] %>-backports"
> gets enabled, which I assume results in Stable-backports.

Correct, but currently we don't install anything from there.

> It appears that various tools get installed (but I don't see qemu mentioned
> (explicitly), but I do see 'virt-what' and the package description seems to
> indicate it may be useful to figure out detail of the VM.

root@ci-worker-arm64-02:~# virt-what
qemu
root@ci-worker-arm64-02:~# virt-what --version
1.19

Paul




Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!

2022-06-22 Thread Diederik de Haas
On Wednesday, 22 June 2022 23:15:46 CEST Diederik de Haas wrote:
> Via sources.list.erb I found that "<%= node['debian_release'] %>-backports"
> gets enabled, which I assume results in Stable-backports.
> It appears that various tools get installed (but I don't see qemu mentioned
> (explicitly)), but I do see 'virt-what' and the package description seems to
> indicate it may be useful to figure out detail of the VM.

Forgot to add: Backports seems to be available, but installing packages from
it has to be requested explicitly, which _I_ didn't see happening, but I'm
not familiar with your (build) systems.
I don't know if it's an option, but stable-bpo has qemu 1:7.0+dfsg-2~bpo11+2



Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!

2022-06-22 Thread Diederik de Haas
Hi Paul,

On Wednesday, 22 June 2022 21:57:06 CEST Paul Gevers wrote:
> On 21-06-2022 23:19, Diederik de Haas wrote:
> 
> > I think that the install logs aren't that important (anymore) as the
> > issue/symptoms appear to be the same:
> > - some swap action resulting in some failure
> > - CPU gets stuck
> > - watchdog triggers a reboot
> 
> If the reboot would actually happen/finish, I wouldn't have problems of 
> the hanging host. The issues I spotted required a manual reboot (and 
> that's why I spotted them).

Hmm ...interesting. AFAIK that is a watchdog's task.
And I was certain I had seen something about it, as I've run into a (yet to
be reported) watchdog-related issue myself, which is why I remembered it.

On Saturday, 4 December 2021 22:44:38 CEST Paul Gevers wrote:
> I noticed in the logs that *after* the reported kernel bug but before
> the actual hang, I see multiple instances of:
> watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [apt-get:2204621]
> and
> watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kcompactd0:40]
> on ci-worker-arm64-07.

And here is where I saw it. (My watchdog issue doesn't cause a hang btw)

> > How is swap configured on these devices?
> 
> https://salsa.debian.org/ci-team/debian-ci-config/-/blob/master/cookbooks/basics/default.rb#L3
> until line 11

Not familiar with Ruby, but IIUC a swap file gets created at half the size of
RAM. I _think_ the swapon command isn't technically needed as it will be done
on bootup through fstab, but it shouldn't hurt either. Seems fine :)
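
To make "half the size of RAM" concrete, here is a minimal sketch of that
sizing logic. It is not taken from the cookbook: the /swapfile path, the
function names and the exact commands are assumptions for illustration. It
only builds the command strings and the fstab line, it does not run anything.

```python
import re


def mem_total_kib(meminfo_text):
    """Extract MemTotal (in KiB) from the contents of /proc/meminfo."""
    m = re.search(r"^MemTotal:\s+(\d+)\s+kB$", meminfo_text, re.MULTILINE)
    if m is None:
        raise ValueError("MemTotal not found")
    return int(m.group(1))


def swap_plan(meminfo_text, path="/swapfile"):
    """Return (size in KiB, commands, fstab line) for a swap file of RAM/2.

    `path` is a hypothetical location, not the one used on the CI workers.
    """
    size_kib = mem_total_kib(meminfo_text) // 2
    commands = [
        f"dd if=/dev/zero of={path} bs=1024 count={size_kib}",
        f"chmod 600 {path}",
        f"mkswap {path}",
        f"swapon {path}",  # optional: the fstab entry covers later boots
    ]
    fstab_line = f"{path} none swap sw 0 0"
    return size_kib, commands, fstab_line


if __name__ == "__main__":
    with open("/proc/meminfo") as fh:
        size, cmds, fstab = swap_plan(fh.read())
    print(f"swap size: {size} KiB")
    print("\n".join(cmds))
    print(f"fstab: {fstab}")
```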

> > I *assumed* it was running on arm64 (native) hardware and was about to
> > ask specifics about it and then I noticed this:
> > Host bridge [0600]: Red Hat, Inc. QEMU PCIe Host bridge [1b36:0008]
> > 
> > Qemu. Quite likely unrelated, but a while back I had an issue with qemu
> > in building arm64 images: https://bugs.debian.org/988174
> 
> hmm, OK, right (I forgot that I knew this).
> 
> > I think it would be useful to know which qemu version(s) were used.
> 
> Is there any way to know from inside the VM?

If you have access to the host, APT should be able to tell you.

Via sources.list.erb I found that "<%= node['debian_release'] %>-backports"
gets enabled, which I assume results in Stable-backports.
It appears that various tools get installed. I don't see qemu mentioned
explicitly, but I do see 'virt-what', and its package description seems to
indicate it may be useful for figuring out details of the VM.

@mjt, I have two questions for you:
1) Do you know if/how the qemu version can be queried from within the VM?
2) Are you aware of potential issues wrt hangs in arm64 VMs created with Qemu?
Or IOW, could you take a look at this bug and give tips that could help in
tracking down the cause and subsequently the solution?

TIA,
  Diederik



Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!

2022-06-22 Thread Paul Gevers

Hi Diederik,

On 21-06-2022 23:19, Diederik de Haas wrote:
> I think that the install logs aren't that important (anymore) as the issue/
> symptoms appear to be the same:
> - some swap action resulting in some failure
> - CPU gets stuck
> - watchdog triggers a reboot

If the reboot would actually happen/finish, I wouldn't have problems of
the hanging host. The issues I spotted required a manual reboot (and
that's why I spotted them).

> How is swap configured on these devices?

https://salsa.debian.org/ci-team/debian-ci-config/-/blob/master/cookbooks/basics/default.rb#L3
until line 11

> Yeah, I _assumed_ as such, but assumptions can be dangerous ;-)

Total ACK.

> Normally I scroll (hard) by the hardware listings as that rarely says anything
> to me. And I did that before too, but just now I made an important discovery.
>
> I *assumed* it was running on arm64 (native) hardware and was about to ask
> specifics about it and then I noticed this:
> Host bridge [0600]: Red Hat, Inc. QEMU PCIe Host bridge [1b36:0008]
>
> Qemu. Quite likely unrelated, but a while back I had an issue with qemu in
> building arm64 images: https://bugs.debian.org/988174

hmm, OK, right (I forgot that I knew this).

> I think it would be useful to know which qemu version(s) were used.

Is there any way to know from inside the VM?

> If the issue does occur again, I think it would be useful to bring 'upstream'
> into the conversation. They likely can bring much more useful input into this
> then (f.e.) I could. Also, if upstream is made aware there is an issue (even
> infrequent), then they can make the most informed choice what to do with it.

Ack.

Paul




Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!

2022-06-21 Thread Diederik de Haas
Hi,

On Tuesday, 21 June 2022 22:31:45 CEST Paul Gevers wrote:
> On 21-06-2022 22:07, Diederik de Haas wrote:
> 
> > Do these errors still occur? Still with 5.10.103-1 or a later one?
> 
> The last occurrence of a machine hang I had is from 5 May 2022, but I'm 
> not sure if I checked if it was this same issue. Normally our kernels 
> are up-to-date, but I don't recall what we had at the time. We have 
> recommissioned our arm64 hosts, so the install logs are lost by now.

It's good for ci.debian.net that there are such large gaps between failures, 
but it makes debugging a bit harder.
I think that the install logs aren't that important (anymore) as the issue/
symptoms appear to be the same:
- some swap action resulting in some failure
- CPU gets stuck
- watchdog triggers a reboot

How is swap configured on these devices?

> > Is it only on arm64 machines? Or is this just an example which also
> > occurs on other arches?
> 
> I'm pretty sure I haven't seen this on other arches, otherwise I'm sure 
> I would have reported it to this bug.

Yeah, I _assumed_ as such, but assumptions can be dangerous ;-)

Normally I scroll straight past the hardware listings as they rarely tell me
anything. And I did that before too, but just now I made an important discovery.

I *assumed* it was running on arm64 (native) hardware and was about to ask 
specifics about it and then I noticed this:
Host bridge [0600]: Red Hat, Inc. QEMU PCIe Host bridge [1b36:0008]

Qemu. Quite likely unrelated, but a while back I had an issue with qemu in 
building arm64 images: https://bugs.debian.org/988174
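
For anyone who wants to check other workers the same way, here is a small
sketch that looks for QEMU-style PCI IDs in `lspci -nn` output. It assumes
that vendor IDs 1b36 (Red Hat, Inc. QEMU devices) and 1af4 (virtio) are good
enough hints; the function names are made up, and this only says that the
machine looks like a QEMU guest, not which qemu version is in use.

```python
import re

# Trailing "[vendor:device]" pair of an `lspci -nn` line, optionally
# followed by a "(rev xx)" suffix.
PCI_ID = re.compile(
    r"\[([0-9a-f]{4}):([0-9a-f]{4})\](?: \(rev [0-9a-f]+\))?\s*$"
)

# Assumption: these vendor IDs are typical for QEMU guests.
QEMU_VENDORS = {"1b36", "1af4"}


def pci_id(line):
    """Return (vendor, device) from one lspci -nn line, or None."""
    m = PCI_ID.search(line.strip().lower())
    return (m.group(1), m.group(2)) if m else None


def looks_like_qemu(lspci_lines):
    """True if any listed device carries a vendor ID from QEMU_VENDORS."""
    for line in lspci_lines:
        ids = pci_id(line)
        if ids is not None and ids[0] in QEMU_VENDORS:
            return True
    return False


if __name__ == "__main__":
    sample = [
        "Host bridge [0600]: Red Hat, Inc. QEMU PCIe Host bridge [1b36:0008]",
    ]
    print(looks_like_qemu(sample))
```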

I think it would be useful to know which qemu version(s) were used.
(It's unlikely I'll be able to help find the cause/solution; I'm mostly
gathering hopefully useful bits of information for people who could.)

> > If it still occurs, then the likely only way to get a possible resolve is
> > reporting it to upstream.
> 
> 1.5 months is quite long for it to be gone, although, before that it was 
> 2.5 months.

If the issue does occur again, I think it would be useful to bring 'upstream'
into the conversation. They can likely bring much more useful input into this
than I (for example) could. Also, if upstream is made aware that there is an
issue (even an infrequent one), they can make an informed choice about what
to do with it.

Cheers,
  Diederik



Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!

2022-06-21 Thread Paul Gevers

Hi Diederik,

On 21-06-2022 22:07, Diederik de Haas wrote:
> Do these errors still occur? Still with 5.10.103-1 or a later one?

The last occurrence of a machine hang I had is from 5 May 2022, but I'm
not sure if I checked if it was this same issue. Normally our kernels
are up-to-date, but I don't recall what we had at the time. We have
recommissioned our arm64 hosts, so the install logs are lost by now.

> Is it only on arm64 machines? Or is this just an example which also occurs
> on other arches?

I'm pretty sure I haven't seen this on other arches, otherwise I'm sure
I would have reported it to this bug.

> If it still occurs, then the likely only way to get a possible resolve is
> reporting it to upstream.

1.5 months is quite long for it to be gone, although, before that it was
2.5 months.


Paul




Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!

2022-06-21 Thread Diederik de Haas
Control: found -1 linux/5.10.103-1

Hi Paul,

On Tuesday, 29 March 2022 20:58:59 CEST Paul Gevers wrote:
> On 20-02-2022 13:44, Paul Gevers wrote:
> 
> > Sad to say, but this week we had two hangs again.
> 
> And this week another two.
> 
> == ci-worker-arm64-07 ==
> 
> Mar 26 10:15:55 ci-worker-arm64-07 kernel: kernel BUG at 
> include/linux/swapops.h:204!
> Mar 26 10:15:55 ci-worker-arm64-07 kernel: Internal error: Oops - BUG: 0 
> [#1] SMP
> 
> Linux kernel from before the last point release:
> Linux version 5.10.0-12-arm64 (debian-ker...@lists.debian.org) (gcc-10 
> (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2>
> 
> == ci-worker-arm64-08 ==
> Mar 25 22:13:44 ci-worker-arm64-08 kernel: kernel BUG at 
> include/linux/swapops.h:204!
> Mar 25 22:13:44 ci-worker-arm64-08 kernel: Internal error: Oops - BUG: 0 
> [#1] SMP

Do these errors still occur? Still with 5.10.103-1 or a later one?
Is it only on arm64 machines? Or is this just an example which also occurs
on other arches?
Is it possible to try newer kernel versions from Stable-backports to see
whether the issue occurs there too?

If it still occurs, then the only likely way to get it resolved is to
report it upstream. For 'swapops.h' that should be this:

~/dev/kernel.org/linux$ scripts/get_maintainer.pl include/linux/swapops.h
Andrew Morton 
Peter Xu 
David Hildenbrand 
Alistair Popple 
Miaohe Lin 
Naoya Horiguchi 
linux-ker...@vger.kernel.org (open list)

But I'm not sure that's the right list as it is from the include directory,
so the actual problem may be somewhere else.
But I guess it would be a good start?

Cheers,
  Diederik



Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!

2022-01-26 Thread Paul Gevers

Hi all,

On 04-12-2021 22:44, Paul Gevers wrote:
> On Thu, 02 Dec 2021 13:44:15 +0100 Paul Gevers wrote:
> > The last couple of days, two of the ci.debian.net arm64 workers became
> > unresponsive. The systems were rebooted and I found the message in
> > the journal pasted below.

Of course the absence of these failures doesn't prove the bug is gone,
but since upgrading our systems to 5.10.84-1 (on 20 December 2021), I
have not seen this failure again. Maybe it's about time we close this
bug and assume it's fixed in version 5.10.84-1?


Paul




Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!

2021-12-04 Thread Paul Gevers

Hi,

On Thu, 02 Dec 2021 13:44:15 +0100 Paul Gevers wrote:
> The last couple of days, two of the ci.debian.net arm64 workers became
> unresponsive. The systems were rebooted and I found the message in
> the journal pasted below.
>
> Please let me know if you need more info about these systems.

As requested by carnil on IRC, let me try to add some things I checked.

In contrast to the previous kernel bug I reported, this time the two
machines that hung were testing different packages (syslog-ng being one
of them) that often succeed on arm64.


I noticed in the logs that *after* the reported kernel bug but before 
the actual hang, I see multiple instances of:

watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [apt-get:2204621]
and
watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kcompactd0:40]
on ci-worker-arm64-07.

The other system (ci-worker-arm64-02) has
watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [khugepaged:42]
and
watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [apt-get:4191233]

I found a third system that had to be rebooted recently 
(ci-worker-arm64-08 on 18 November):

watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [apt-get:3325970]
and
watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [python3:3275229]
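
To compare the soft lockup messages across workers, a small sketch like the
one below can pull the CPU, duration, task name and PID out of journal lines
and count how often each task shows up. The regular expression is written
against the six messages quoted here and is an assumption about the message
format in general; the function names are made up.

```python
import re
from collections import Counter

# Matches lines such as:
#   watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [apt-get:2204621]
SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)

# The messages quoted in this report.
SAMPLE = [
    "watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [apt-get:2204621]",
    "watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kcompactd0:40]",
    "watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [khugepaged:42]",
    "watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [apt-get:4191233]",
    "watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [apt-get:3325970]",
    "watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [python3:3275229]",
]


def parse_soft_lockups(lines):
    """Return one dict per soft lockup message found in `lines`."""
    events = []
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            events.append({
                "cpu": int(m["cpu"]),
                "secs": int(m["secs"]),
                "comm": m["comm"],
                "pid": int(m["pid"]),
            })
    return events


def count_by_task(events):
    """Count soft lockup events per task name."""
    return Counter(e["comm"] for e in events)


if __name__ == "__main__":
    counts = count_by_task(parse_soft_lockups(SAMPLE))
    for comm, n in counts.most_common():
        print(comm, n)
```

Feeding it the output of `journalctl -k` from an affected boot instead of
SAMPLE should work the same way, as long as the journal still has that boot.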

Although the journal is lost by now, we had more arm64 VMs hang:
ci-worker-arm64-03 on 6 November 2021.

Probably worth mentioning, albeit hopefully unrelated: we had issues in
the recent past (ci-worker-arm64-06 on 29 October 2021) with virtio_gpu,
so we blocked that module from loading on all our workers, as we believe
we don't need it.
 [drm:virtio_gpu_dequeue_ctrl_func [virtio_gpu]] *ERROR* response 0x1202 (command 0x103)


Paul

