Hi!

Just a heads-up: shortly after midnight, processes on one of our SLES15 SP3 cluster nodes started to receive SIGSEGV, eventually including pacemaker, which resulted in the node being fenced. I suspect a kernel problem. However, the configuration had been running for about a week (since the last reboot).

(On SP2 we once had a kernel lock-up, and the suspicion was that it might be related to BtrFS balancing or snapshotting, but that was only a suspicion. Snapper was also activated at midnight, so who knows...)

Kernel: 5.3.18-150300.59.43-default

Summary of events:

Feb 10 00:00:02 h16 dbus-daemon[5905]: [system] Successfully activated service 'org.opensuse.Snapper'
Feb 10 00:00:02 h16 systemd[1]: Started DBus interface for snapper.
Feb 10 00:00:02 h16 systemd[1]: snapper-timeline.service: Succeeded.
Feb 10 00:00:02 h16 kernel: traps: mandb[4484] general protection fault ip:7f4f21876160 sp:7ffe25a71ff8 error:0 in libc-2.31.so[7f4f217fa000+1cb000]
Feb 10 00:00:03 h16 systemd-coredump[4488]: Process 4484 (mandb) of user 13 dumped core.
Feb 10 00:00:03 h16 kernel: BUG: Bad rss-counter state mm:00000000d1a9d1f5 idx:1 val:4
Feb 10 00:00:03 h16 kernel: mandb[4547]: segfault at 8b86 ip 0000000000008b86 sp 00007ffe25a73058 error 14 in mandb[55dcc3a12000+20000]
Feb 10 00:00:03 h16 kernel: Code: Bad RIP value.
Feb 10 00:00:04 h16 systemd-coredump[4549]: Process 4547 (mandb) of user 13 dumped core.
Feb 10 00:00:04 h16 kernel: BUG: Bad rss-counter state mm:00000000c4f00529 idx:1 val:5
Feb 10 00:00:05 h16 kernel: BUG: Bad rss-counter state mm:00000000aae27ee5 idx:1 val:59
Feb 10 00:00:06 h16 systemd-coredump[4610]: Process 4606 (mandb) of user 13 dumped core.
Feb 10 00:00:06 h16 kernel: traps: mandb[4640] general protection fault ip:7f4f218c6caf sp:7ffe25a73110 error:0 in libc-2.31.so[7f4f217fa000+1cb000]
Feb 10 00:00:06 h16 kernel: BUG: Bad rss-counter state mm:00000000babee882 idx:1 val:2
Feb 10 00:00:08 h16 systemd-coredump[4645]: Process 4643 (systemd) of user 0 dumped core.

That doesn't sound good, does it?

Feb 10 00:00:08 h16 systemd[4642]: Caught <SEGV>, dumped core as pid 4643.
Feb 10 00:00:08 h16 systemd[4642]: Freezing execution.
Feb 10 00:00:29 h16 kernel: pacemaker-execd[4704]: segfault at 3a46 ip 0000000000003a46 sp 00007ffe2c700508 error 14 in pacemaker-execd[55e474755000+b000]
Feb 10 00:00:29 h16 kernel: Code: Bad RIP value.
Feb 10 00:00:30 h16 kernel: BUG: Bad rss-counter state mm:00000000b1203e21 idx:1 val:2
Feb 10 00:00:34 h16 kernel: libvirtd[5685]: segfault at 0 ip 00007f745c487e73 sp 00007ffc70e95a58 error 6 in libc-2.31.so[7f745c3fe000+1cb000]
Feb 10 00:00:34 h16 kernel: Code: Bad RIP value.
Feb 10 00:00:34 h16 kernel: BUG: Bad rss-counter state mm:00000000d755caae idx:1 val:69691
Feb 10 00:00:34 h16 kernel: VirtualDomain[5781]: segfault at 0 ip 0000000000000000 sp 00007ffdc5c98660 error 14 in bash[55669b8cb000+f1000]
Feb 10 00:00:34 h16 kernel: Code: Bad RIP value.
Feb 10 00:00:35 h16 systemd-coredump[5742]: Process 5689 (Filesystem) of user 0 dumped core.
Feb 10 00:00:35 h16 kernel: BUG: Bad rss-counter state mm:0000000042171789 idx:1 val:2
Feb 10 00:00:36 h16 systemd-coredump[5803]: Process 5781 (VirtualDomain) of user 0 dumped core.
Feb 10 00:00:36 h16 kernel: BUG: Bad rss-counter state mm:00000000713058ae idx:1 val:6
...many more...
Feb 10 00:03:33 h16 systemd-coredump[13479]: Process 13400 (systemd) of user 0 dumped core.
-- Reboot --
Feb 10 00:06:59 h16 kernel: Linux version 5.3.18-150300.59.43-default (geeko@buildhost) (gcc version 7.5.0 (SUSE Linux)) #1 SMP Sun Jan 23 19:27:23 UTC 2022 (c76af22)

(eventually) Another reboot:

Feb 10 00:08:18 h16 sbd[7067]: emerg: do_exit: Rebooting system: reboot
-- Reboot --
Feb 10 00:11:43 h16 kernel: Linux version 5.3.18-150300.59.43-default (geeko@buildhost) (gcc version 7.5.0 (SUSE Linux)) #1 SMP Sun Jan 23 19:27:23 UTC 2022 (c76af22)

Since then the node (Dell PowerEdge R7415) is running normally again.

Regards,
Ulrich