Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!
Hi Paul,

On Sun, Jul 03, 2022 at 09:57:59PM +0200, Paul Gevers wrote:
> Hi all,
>
> Just a minor follow-up. I just had to restart one of my arm64 workers again.
>
> root@ci-worker-arm64-05:~# uname -a
> Linux ci-worker-arm64-05 5.10.0-15-arm64 #1 SMP Debian 5.10.120-1 (2022-06-09) aarch64 GNU/Linux
>
> Anything you want me to extract from the current logs?

Replicating our short discussion this morning, assuming you have not seen the issue anymore in recent updates and runs, can we close this issue?

(Still sad, that we cannot isolate the cause ...)

Regards,
Salvatore
Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!
Hi all,

Just a minor follow-up. I just had to restart one of my arm64 workers again.

root@ci-worker-arm64-05:~# uname -a
Linux ci-worker-arm64-05 5.10.0-15-arm64 #1 SMP Debian 5.10.120-1 (2022-06-09) aarch64 GNU/Linux

Anything you want me to extract from the current logs?

Paul
Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!
Hi Paul,

On Thursday, 23 June 2022 10:44:49 CEST Paul Gevers wrote:
> Hi Diederik,
>
> On 22-06-2022 23:15, Diederik de Haas wrote:
> > Hmm ...interesting. AFAIK that is a watchdog's task.
> >
> > On Saturday, 4 December 2021 22:44:38 CEST Paul Gevers wrote:
> >> I noticed in the logs that *after* the reported kernel bug but before
> >> the actual hang, I see multiple instances of:
> >> watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [apt-get:2204621]
> >> and
> >> watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kcompactd0:40]
> >> on ci-worker-arm64-07.
> >
> > And here is where I saw it. (My watchdog issue doesn't cause a hang btw)
>
> That might be, but this doesn't result in a successful reboot (of the
> system, maybe you meant a reboot of the core?).

That was actually my point :-)
AFAIK (which is limited), the whole point of the watchdog is to reboot (the system, I'd guess) when things get stuck. That that didn't happen is worth noting.

> > If you have access to the host, APT should be able to tell you.
>
> Depends on what you mean with "the host". Our VM (our host) is
> provisioned by Huawei (their host). I have access to our host.

To talk in Xen terms, I meant dom0 as host. I'd guess that qemu would create a VM from that host (and in Xen terms, the created VM would be a domU).

> root@ci-worker-arm64-02:~# apt list *qemu* --installed
> qemu-utils/stable-security,now 1:5.2+dfsg-11+deb11u2 arm64 [installed,automatic]

Maybe things work differently wrt qemu, but that's the version I was looking for.

> > Via sources.list.erb I found that "<%= node['debian_release'] %>-backports"
> > gets enabled, which I assume results in Stable-backports.
>
> Correct, but currently we don't install anything from there.

Ack. It is what I thought (but didn't know).

> > It appears that various tools get installed (but I don't see qemu
> > mentioned (explicitly)), but I do see 'virt-what' and the package
> > description seems to indicate it may be useful to figure out detail of
> > the VM.
>
> root@ci-worker-arm64-02:~# virt-what
> qemu
> root@ci-worker-arm64-02:~# virt-what --version
> 1.19

Less useful than I'd hoped, but you earlier already found the qemu version :-)
Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!
Hi Diederik,

On 22-06-2022 23:15, Diederik de Haas wrote:
> Hmm ...interesting. AFAIK that is a watchdog's task.
> And I was certain I saw sth about it as I've seen a (yet to be reported) issue related to the watchdog myself, hence why I remembered it.
>
> On Saturday, 4 December 2021 22:44:38 CEST Paul Gevers wrote:
>> I noticed in the logs that *after* the reported kernel bug but before the actual hang, I see multiple instances of:
>> watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [apt-get:2204621]
>> and
>> watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kcompactd0:40]
>> on ci-worker-arm64-07.
>
> And here is where I saw it. (My watchdog issue doesn't cause a hang btw)

That might be, but this doesn't result in a successful reboot (of the system, maybe you meant a reboot of the core?).

> If you have access to the host, APT should be able to tell you.

Depends on what you mean with "the host". Our VM (our host) is provisioned by Huawei (their host). I have access to our host.

root@ci-worker-arm64-02:~# apt list *qemu* --installed
Listing... Done
qemu-utils/stable-security,now 1:5.2+dfsg-11+deb11u2 arm64 [installed,automatic]
N: There is 1 additional version. Please use the '-a' switch to see it

> Via sources.list.erb I found that "<%= node['debian_release'] %>-backports" gets enabled, which I assume results in Stable-backports.

Correct, but currently we don't install anything from there.

> It appears that various tools get installed (but I don't see qemu mentioned (explicitly)), but I do see 'virt-what' and the package description seems to indicate it may be useful to figure out detail of the VM.

root@ci-worker-arm64-02:~# virt-what
qemu
root@ci-worker-arm64-02:~# virt-what --version
1.19

Paul
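For anyone wanting to collect this across several workers, here is a minimal Python sketch that parses `apt list --installed` style output like the one pasted in this message and splits the Debian version into epoch, upstream version and revision. The function names are made up for illustration. Note that this only reports packages installed inside the guest, which need not match the QEMU running on the hypervisor.

```python
# Sample copied from the message above.
SAMPLE = """\
Listing... Done
qemu-utils/stable-security,now 1:5.2+dfsg-11+deb11u2 arm64 [installed,automatic]
N: There is 1 additional version. Please use the '-a' switch to see it
"""


def parse_apt_list(output):
    """Map package name -> details from `apt list --installed` style output."""
    packages = {}
    for line in output.splitlines():
        parts = line.split()
        # Package lines look like "name/suite[,suite...] version arch [flags]";
        # "Listing... Done" and "N: ..." notices do not have that shape.
        if len(parts) < 3 or "/" not in parts[0]:
            continue
        name, _, suites = parts[0].partition("/")
        packages[name] = {
            "suites": suites.split(","),
            "version": parts[1],
            "arch": parts[2],
        }
    return packages


def split_debian_version(version):
    """Split "[epoch:]upstream[-revision]" into (epoch, upstream, revision)."""
    epoch, sep, rest = version.partition(":")
    if not sep:
        # No epoch given: it defaults to 0.
        epoch, rest = "0", version
    # The revision is whatever follows the last hyphen, if there is one.
    upstream, dash, revision = rest.rpartition("-")
    if not dash:
        upstream, revision = rest, ""
    return int(epoch), upstream, revision


if __name__ == "__main__":
    for name, info in parse_apt_list(SAMPLE).items():
        print(name, info["arch"], split_debian_version(info["version"]))
```

The same splitter can be applied to the backports version string mentioned elsewhere in the thread.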
Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!
On Wednesday, 22 June 2022 23:15:46 CEST Diederik de Haas wrote:
> Via sources.list.erb I found that "<%= node['debian_release'] %>-backports"
> gets enabled, which I assume results in Stable-backports.
> It appears that various tools get installed (but I don't see qemu mentioned
> (explicitly)), but I do see 'virt-what' and the package description seems to
> indicate it may be useful to figure out detail of the VM.

Forgot to add: Backports seems available but it needs to be explicitly specified to install packages from it, which _I_ didn't see, but I'm not familiar with your (build) systems.
I don't know if it's an option, but stable-bpo has 1:7.0+dfsg-2~bpo11+2
Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!
Hi Paul,

On Wednesday, 22 June 2022 21:57:06 CEST Paul Gevers wrote:
> On 21-06-2022 23:19, Diederik de Haas wrote:
>
> > I think that the install logs aren't that important (anymore) as the
> > issue/symptoms appear to be the same:
> > - some swap action resulting in some failure
> > - CPU gets stuck
> > - watchdog triggers a reboot
>
> If the reboot would actually happen/finish, I wouldn't have problems of
> the hanging host. The issues I spotted required a manual reboot (and
> that's why I spotted them).

Hmm ...interesting. AFAIK that is a watchdog's task.
And I was certain I saw sth about it as I've seen a (yet to be reported) issue related to the watchdog myself, hence why I remembered it.

On Saturday, 4 December 2021 22:44:38 CEST Paul Gevers wrote:
> I noticed in the logs that *after* the reported kernel bug but before
> the actual hang, I see multiple instances of:
> watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [apt-get:2204621]
> and
> watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kcompactd0:40]
> on ci-worker-arm64-07.

And here is where I saw it. (My watchdog issue doesn't cause a hang btw)

> > How is swap configured on these devices?
>
> https://salsa.debian.org/ci-team/debian-ci-config/-/blob/master/cookbooks/basics/default.rb#L3
> until line 11

Not familiar with Ruby, but IIUC a swap file gets created half the size of RAM. I _think_ the swapon command isn't technically needed as it will be done on bootup through fstab, but it shouldn't hurt either. Seems fine :)

> > I *assumed* it was running on arm64 (native) hardware and was about to
> > ask specifics about it and then I noticed this:
> > Host bridge [0600]: Red Hat, Inc. QEMU PCIe Host bridge [1b36:0008]
> >
> > Qemu. Quite likely unrelated, but a while back I had an issue with qemu
> > in building arm64 images: https://bugs.debian.org/988174
>
> hmm, OK, right (I forgot that I knew this).
>
> > I think it would be useful to know which qemu version(s) were used.
>
> Is there any way to know from inside the VM?

If you have access to the host, APT should be able to tell you.

Via sources.list.erb I found that "<%= node['debian_release'] %>-backports" gets enabled, which I assume results in Stable-backports.
It appears that various tools get installed (but I don't see qemu mentioned (explicitly)), but I do see 'virt-what' and the package description seems to indicate it may be useful to figure out detail of the VM.

@mjt, I have two questions for you:
1) Do you know if/how the qemu version can be queried from within the VM?
2) Are you aware of potential issues wrt hangs in arm64 VMs created with Qemu?

Or IOW, could you take a look at this bug and give tips which could help in tracking down the cause and subsequently the solution?

TIA,
Diederik
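As a rough illustration of how the hardware listing gives away the virtualisation, the Python sketch below scans `lspci -nn` style lines for the vendor ID seen in the quoted host bridge entry (1b36) or the string "QEMU". It only detects that QEMU devices are present; it cannot tell the QEMU version. The function name and the second sample line are illustrative assumptions, not taken from the affected systems.

```python
import re

# Trailing "[vendor:device]" pair of an `lspci -nn` style line,
# optionally followed by "(rev xx)".
PCI_ID = re.compile(
    r"\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
    r"\s*(?:\(rev [0-9a-f]+\))?\s*$",
    re.IGNORECASE,
)

# Vendor ID from the host bridge line quoted in the thread.
QEMU_VENDOR = "1b36"

SAMPLE = """\
Host bridge [0600]: Red Hat, Inc. QEMU PCIe Host bridge [1b36:0008]
Ethernet controller [0200]: Red Hat, Inc. Virtio network device [1af4:1000]
"""  # second line is an illustrative example, not from the bug report


def qemu_devices(lspci_text):
    """Return the lines that look like QEMU-provided PCI devices."""
    hits = []
    for line in lspci_text.splitlines():
        match = PCI_ID.search(line)
        if match is None:
            continue
        if match["vendor"].lower() == QEMU_VENDOR or "QEMU" in line:
            hits.append(line.strip())
    return hits


if __name__ == "__main__":
    for device in qemu_devices(SAMPLE):
        print(device)
```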
Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!
Hi Diederik,

On 21-06-2022 23:19, Diederik de Haas wrote:
> I think that the install logs aren't that important (anymore) as the issue/symptoms appear to be the same:
> - some swap action resulting in some failure
> - CPU gets stuck
> - watchdog triggers a reboot

If the reboot would actually happen/finish, I wouldn't have problems of the hanging host. The issues I spotted required a manual reboot (and that's why I spotted them).

> How is swap configured on these devices?

https://salsa.debian.org/ci-team/debian-ci-config/-/blob/master/cookbooks/basics/default.rb#L3 until line 11

> Yeah, I _assumed_ as such, but assumptions can be dangerous ;-)

Total ACK.

> Normally I scroll (hard) by the hardware listings as that rarely says anything to me. And I did that before too, but just now I made an important discovery.
> I *assumed* it was running on arm64 (native) hardware and was about to ask specifics about it and then I noticed this:
> Host bridge [0600]: Red Hat, Inc. QEMU PCIe Host bridge [1b36:0008]
>
> Qemu. Quite likely unrelated, but a while back I had an issue with qemu in building arm64 images: https://bugs.debian.org/988174

hmm, OK, right (I forgot that I knew this).

> I think it would be useful to know which qemu version(s) were used.

Is there any way to know from inside the VM?

> If the issue does occur again, I think it would be useful to bring 'upstream' into the conversation. They likely can bring much more useful input into this than (f.e.) I could.
> Also, if upstream is made aware there is an issue (even infrequent), then they can make the most informed choice what to do with it.

Ack.

Paul
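On the swap question: the reply earlier in this archive reads the linked cookbook as creating a swap file half the size of RAM that is then enabled through fstab. The Python sketch below only mirrors that reading and is not the cookbook's code. It derives the size from `/proc/meminfo` style text and builds a conventional fstab line; the `/swapfile` path, the sample value and the function names are assumptions for illustration.

```python
# Hypothetical /proc/meminfo excerpt; real values differ per worker.
SAMPLE_MEMINFO = """\
MemTotal:        8000000 kB
MemFree:         1234567 kB
"""


def mem_total_kib(meminfo_text):
    """Return the MemTotal value (in KiB) from /proc/meminfo style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Line shape: "MemTotal:   <number> kB"
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def swap_size_kib(meminfo_text):
    """Half of RAM, following the reading of the cookbook in this thread."""
    return mem_total_kib(meminfo_text) // 2


def fstab_entry(path):
    """fstab line that enables a swap file at boot."""
    return f"{path} none swap sw 0 0"


if __name__ == "__main__":
    print(swap_size_kib(SAMPLE_MEMINFO), "KiB of swap")
    print(fstab_entry("/swapfile"))  # path is a placeholder
```

With an fstab entry like this in place, an explicit swapon is only needed to activate the file before the next boot.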
Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!
Hi,

On Tuesday, 21 June 2022 22:31:45 CEST Paul Gevers wrote:
> On 21-06-2022 22:07, Diederik de Haas wrote:
>
> > Do these errors still occur? Still with 5.10.103-1 or a later one?
>
> The last occurrence of a machine hang I had is from 5 May 2022, but I'm
> not sure if I checked if it was this same issue. Normally our kernels
> are up-to-date, but I don't recall what we had at the time. We have
> recommissioned our arm64 hosts, so the install logs are lost by now.

It's good for ci.debian.net that there are such large gaps between failures, but it makes debugging a bit harder.

I think that the install logs aren't that important (anymore) as the issue/symptoms appear to be the same:
- some swap action resulting in some failure
- CPU gets stuck
- watchdog triggers a reboot

How is swap configured on these devices?

> > Is it only on arm64 machines? Or is this just an example which also
> > occurs on other arches?
>
> I'm pretty sure I haven't seen this on other arches, otherwise I'm sure
> I would have reported it to this bug.

Yeah, I _assumed_ as such, but assumptions can be dangerous ;-)

Normally I scroll (hard) by the hardware listings as that rarely says anything to me. And I did that before too, but just now I made an important discovery.
I *assumed* it was running on arm64 (native) hardware and was about to ask specifics about it and then I noticed this:
Host bridge [0600]: Red Hat, Inc. QEMU PCIe Host bridge [1b36:0008]

Qemu. Quite likely unrelated, but a while back I had an issue with qemu in building arm64 images: https://bugs.debian.org/988174

I think it would be useful to know which qemu version(s) were used.
(It's unlikely I'll be able to help find the cause/solution, mostly gathering hopefully useful bits of information for people who could)

> > If it still occurs, then the likely only way to get a possible resolve is
> > reporting it to upstream.
>
> 1.5 months is quite long for it to be gone, although, before that it was
> 2.5 months.

If the issue does occur again, I think it would be useful to bring 'upstream' into the conversation. They likely can bring much more useful input into this than (f.e.) I could.
Also, if upstream is made aware there is an issue (even infrequent), then they can make the most informed choice what to do with it.

Cheers,
Diederik
Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!
Hi Diederik,

On 21-06-2022 22:07, Diederik de Haas wrote:
> Do these errors still occur? Still with 5.10.103-1 or a later one?

The last occurrence of a machine hang I had is from 5 May 2022, but I'm not sure if I checked if it was this same issue. Normally our kernels are up-to-date, but I don't recall what we had at the time. We have recommissioned our arm64 hosts, so the install logs are lost by now.

> Is it only on arm64 machines? Or is this just an example which also occurs on other arches?

I'm pretty sure I haven't seen this on other arches, otherwise I'm sure I would have reported it to this bug.

> If it still occurs, then the likely only way to get a possible resolve is reporting it to upstream.

1.5 months is quite long for it to be gone, although, before that it was 2.5 months.

Paul
Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!
Control: found -1 linux/5.10.103-1

Hi Paul,

On Tuesday, 29 March 2022 20:58:59 CEST Paul Gevers wrote:
> On 20-02-2022 13:44, Paul Gevers wrote:
>
> > Sad to say, but this week we had two hangs again.
>
> And this week another two.
>
> ci-worker-arm64-07 ==
>
> Mar 26 10:15:55 ci-worker-arm64-07 kernel: kernel BUG at include/linux/swapops.h:204!
> Mar 26 10:15:55 ci-worker-arm64-07 kernel: Internal error: Oops - BUG: 0 [#1] SMP
>
> Linux kernel from before the last point release:
> Linux version 5.10.0-12-arm64 (debian-ker...@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2>
>
> ci-worker-arm64-08 ==
> Mar 25 22:13:44 ci-worker-arm64-08 kernel: kernel BUG at include/linux/swapops.h:204!
> Mar 25 22:13:44 ci-worker-arm64-08 kernel: Internal error: Oops - BUG: 0 [#1] SMP

Do these errors still occur? Still with 5.10.103-1 or a later one?
Is it only on arm64 machines? Or is this just an example which also occurs on other arches?

Is it possible to try newer kernel versions from Stable-backports to see whether the issue occurs there too?

If it still occurs, then the likely only way to get a possible resolve is reporting it to upstream. For 'swapops.h' that should be this:

~/dev/kernel.org/linux$ scripts/get_maintainer.pl include/linux/swapops.h
Andrew Morton
Peter Xu
David Hildenbrand
Alistair Popple
Miaohe Lin
Naoya Horiguchi
linux-ker...@vger.kernel.org (open list)

But I'm not sure that's the right list as it is from the include directory, so the actual problem may be somewhere else. But I guess it would be a good start?

Cheers,
Diederik
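To keep track of which workers hit the BUG and when, the journal lines quoted in this message can be matched mechanically. Below is a small Python sketch, assuming the syslog-style "Mon DD HH:MM:SS host kernel: ..." layout shown in the quote; the function name is made up for illustration.

```python
import re

# Matches lines such as
#   Mar 26 10:15:55 ci-worker-arm64-07 kernel: kernel BUG at include/linux/swapops.h:204!
KERNEL_BUG = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"kernel BUG at (?P<path>\S+):(?P<line>\d+)!",
    re.MULTILINE,
)

# Journal lines as quoted in the message above.
SAMPLE = """\
Mar 26 10:15:55 ci-worker-arm64-07 kernel: kernel BUG at include/linux/swapops.h:204!
Mar 26 10:15:55 ci-worker-arm64-07 kernel: Internal error: Oops - BUG: 0 [#1] SMP
Mar 25 22:13:44 ci-worker-arm64-08 kernel: kernel BUG at include/linux/swapops.h:204!
Mar 25 22:13:44 ci-worker-arm64-08 kernel: Internal error: Oops - BUG: 0 [#1] SMP
"""


def find_kernel_bugs(journal_text):
    """Return (host, timestamp, source file, line number) per BUG report."""
    return [
        (m["host"], m["stamp"], m["path"], int(m["line"]))
        for m in KERNEL_BUG.finditer(journal_text)
    ]


if __name__ == "__main__":
    for host, stamp, path, line in find_kernel_bugs(SAMPLE):
        print(f"{host}: {stamp} {path}:{line}")
```

The timestamps carry no year, so logs spanning more than one year would need that added from elsewhere.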
Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!
Hi all,

On 04-12-2021 22:44, Paul Gevers wrote:
> On Thu, 02 Dec 2021 13:44:15 +0100 Paul Gevers wrote:
>> The last couple of days, two of the ci.debian.net arm64 workers became unresponsive. The systems were rebooted and I found the message in the journal pasted below.

Of course the absence of these failures doesn't prove the bug is gone, but since upgrading our systems to 5.10.84-1 (on 20 December 2021), I have not seen this failure again.

Maybe it's about time we close this bug and assume it's fixed in version 5.10.84-1?

Paul
Bug#1001001: linux-image-5.10.0-9-arm64: kernel BUG at include/linux/swapops.h:204!
Hi,

On Thu, 02 Dec 2021 13:44:15 +0100 Paul Gevers wrote:
> The last couple of days, two of the ci.debian.net arm64 workers became unresponsive. The systems were rebooted and I found the message in the journal pasted below.
>
> Please let me know if you need more info about these systems.

As requested by carnil on IRC, let me try to add some things I checked.

In contrast to the previous kernel bug I reported, this time the two machines that hang were testing different packages (syslog-ng being one of them) that succeed often on arm64.

I noticed in the logs that *after* the reported kernel bug but before the actual hang, I see multiple instances of:
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [apt-get:2204621]
and
watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kcompactd0:40]
on ci-worker-arm64-07.

The other system (ci-worker-arm64-02) has
watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [khugepaged:42]
and
watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [apt-get:4191233]

I found a third system that had to be rebooted recently (ci-worker-arm64-08 on 18 November):
watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [apt-get:3325970]
and
watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [python3:3275229]

Although the journal is lost by now, we had more arm64 VMs hang: ci-worker-arm64-03 on 6 November 2021.

Probably worth mentioning, albeit hopefully unrelated: we had issues in the recent past (ci-worker-arm64-06 on 29 October 2021) with virtio_gpu, so we blocked that module from loading on all our workers as we believe we don't need it.
[drm:virtio_gpu_dequeue_ctrl_func [virtio_gpu]] *ERROR* response 0x1202 (command 0x103)

Paul
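The soft lockup lines listed in this message follow a fixed pattern, so tallying which tasks were stuck is easy to script. A minimal Python sketch follows, using the six lines quoted above as sample input; the function names are illustrative.

```python
import re
from collections import Counter

# Matches lines such as
#   watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [apt-get:2204621]
SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+?):(?P<pid>\d+)\]"
)

# The six soft lockup lines quoted in the message above.
SAMPLE = """\
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [apt-get:2204621]
watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kcompactd0:40]
watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [khugepaged:42]
watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [apt-get:4191233]
watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [apt-get:3325970]
watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [python3:3275229]
"""


def parse_soft_lockups(text):
    """Return one dict per soft lockup report found in the text."""
    return [
        {
            "cpu": int(m["cpu"]),
            "secs": int(m["secs"]),
            "comm": m["comm"],
            "pid": int(m["pid"]),
        }
        for m in SOFT_LOCKUP.finditer(text)
    ]


def count_by_command(lockups):
    """Count how often each task name shows up as the stuck task."""
    return Counter(entry["comm"] for entry in lockups)


if __name__ == "__main__":
    for comm, count in count_by_command(parse_soft_lockups(SAMPLE)).most_common():
        print(f"{comm}: {count}")
```

In this sample apt-get accounts for half of the reports, with the memory-management kernel threads kcompactd0 and khugepaged also present.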