Update for the Update: I had installed SLES updates in one VM and rebooted it via the cluster. While installing the updates in the VM, the Xen host got RAM corruption (it seems any disk I/O on the host, whether local or through a VM image, causes RAM corruption):
Apr 27 10:56:44 h19 kernel: pacemaker-execd[39797]: segfault at 3a46 ip 0000000000003a46 sp 00007ffd1c92e8e8 error 14 in pacemaker-execd[5565921cc000+b000]

Fortunately that wasn't fatal, and my rescue script kicked in before things got really bad (see the P.S. at the very end for a sketch of the idea):

Apr 27 11:00:01 h19 reboot-before-panic[40630]: RAM corruption detected, starting pro-active reboot

All VMs could be live-migrated away before the reboot, but this SLES release is completely unusable!

Regards,
Ulrich

>>> Ulrich Windl wrote on 27.04.2022 at 08:02 in message <6268DC91.C1D : 161 : 60728>:
> Hi!
>
> I want to give a non-update on the issue:
> The kernel still segfaults random processes, and within two months support
> has provided really nothing that could improve the situation.
> The cluster is logging all kinds of non-funny messages like these:
>
> Apr 27 02:20:49 h18 systemd-coredump[22319]: [🡕] Process 22317 (controld)
> of user 0 dumped core.
> Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000246ea08b
> idx:1 val:3
> Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000259b58a0
> idx:1 val:7
> Apr 27 02:20:49 h18 controld(prm_DLM)[22330]: ERROR: Uncontrolled lockspace
> exists, system must reboot. Executing suicide fencing
>
> For a hypervisor host this means that many VMs are reset the hard way!
> Other resources weren't stopped properly either, of course.
>
> There are also two NULL-pointer outputs in the messages on the DC:
> Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Found 18 entries
> for 118/(null): 0 in progress, 17 completed
> Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Node 118/(null)
> last kicked at: 1650418762
>
> I guess that "(null)" should have been the host name (h18) in reality
> (glibc's printf prints "(null)" when it is handed a NULL pointer for %s).
>
> Also it seems h18 fenced itself, and the DC h16, seeing that, wants to
> fence it again (to make sure, maybe), but there is some odd problem:
>
> Apr 27 02:21:07 h16 pacemaker-controld[7453]: notice: Requesting fencing
> (reboot) of node h18
> Apr 27 02:21:07 h16 pacemaker-fenced[7443]: notice: Client
> pacemaker-controld.7453.a9d67c8b wants to fence (reboot) 'h18' with device
> '(any)'
> Apr 27 02:21:07 h16 pacemaker-fenced[7443]: notice: Merging stonith action
> 'reboot' targeting h18 originating from client
> pacemaker-controld.7453.73d8bbd6 with identical request from
> [email protected] (360>
>
> Apr 27 02:22:52 h16 pacemaker-fenced[7443]: warning: fence_legacy_reboot_1
> process (PID 39749) timed out
> Apr 27 02:22:52 h16 pacemaker-fenced[7443]: warning:
> fence_legacy_reboot_1[39749] timed out after 120000ms
> Apr 27 02:22:52 h16 pacemaker-fenced[7443]: error: Operation 'reboot'
> [39749] (call 2 from stonith_admin.controld.22336) for host 'h18' with device
> 'prm_stonith_sbd' returned: -62 (Timer expired)
>
> I never saw such a message before. Eventually:
>
> Apr 27 02:24:53 h16 pacemaker-controld[7453]: notice: Stonith operation
> 31/1:3347:0:48bafcab-fecf-4ea0-84a8-c31ab1694b3a: OK (0)
> Apr 27 02:24:53 h16 pacemaker-controld[7453]: notice: Peer h18 was
> terminated (reboot) by h16 on behalf of pacemaker-controld.7453: OK
>
> The only thing I found out was that the same kernel running without Xen
> does not show the RAM corruption.
>
> Regards,
> Ulrich
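P.S. For those curious about the rescue script: the real one is site-specific, but the basic idea fits in a few lines of Python. The corruption patterns, the timeout, and the standby/reboot commands below are simplified assumptions for illustration, not the actual script: watch the kernel log, and on the first sign of corruption put the node into standby so pacemaker can live-migrate the VMs away, then reboot.

#!/usr/bin/env python3
# Minimal sketch of a "reboot-before-panic" watcher. The trigger patterns,
# the timeout, and the commands below are illustrative assumptions only.
import re
import subprocess
import time

# Hypothetical indicators of RAM corruption, taken from the symptoms above.
PATTERNS = re.compile(r"Bad rss-counter state|segfault at")

def evacuate_and_reboot():
    # Put the node into standby so pacemaker (live-)migrates the VMs away.
    subprocess.run(["crm", "node", "standby"], check=False)
    # Crude wait until only Domain-0 is left, i.e. all guests have moved
    # ("xl list" prints one header line plus one line per domain).
    deadline = time.time() + 600
    while time.time() < deadline:
        out = subprocess.run(["xl", "list"], capture_output=True,
                             text=True).stdout
        if len(out.strip().splitlines()) <= 2:
            break
        time.sleep(5)
    subprocess.run(["systemctl", "reboot"], check=False)

def main():
    # Follow new kernel messages only (-n 0 skips the backlog).
    with subprocess.Popen(["journalctl", "-k", "-f", "-n", "0"],
                          stdout=subprocess.PIPE, text=True) as journal:
        for line in journal.stdout:
            if PATTERNS.search(line):
                print("RAM corruption detected, starting pro-active reboot")
                evacuate_and_reboot()
                break

if __name__ == "__main__":
    main()

A real version should of course be more careful, e.g. give up on the wait and fence/reboot anyway once the deadline passes, since a corrupting host is not a safe place to keep VMs running.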
