Hi!

I want to give a non-update on the issue: the kernel still segfaults random processes, and in two months there has been nothing from support that could improve the situation. The cluster is logging all kinds of non-funny messages like these:
Apr 27 02:20:49 h18 systemd-coredump[22319]: [🡕] Process 22317 (controld) of user 0 dumped core.
Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000246ea08b idx:1 val:3
Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000259b58a0 idx:1 val:7
Apr 27 02:20:49 h18 controld(prm_DLM)[22330]: ERROR: Uncontrolled lockspace exists, system must reboot.  Executing suicide fencing

For a hypervisor host this means that many VMs are reset the hard way! Other resources weren't stopped properly either, of course.

There are also two NULL-pointer outputs in the messages on the DC:

Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Found 18 entries for 118/(null): 0 in progress, 17 completed
Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Node 118/(null) last kicked at: 1650418762

I guess that NULL pointer should really have been the host name (h18). It also seems h18 fenced itself, and the DC h16, seeing that, wants to fence it again (to make sure, maybe), but there is some odd problem:

Apr 27 02:21:07 h16 pacemaker-controld[7453]: notice: Requesting fencing (reboot) of node h18
Apr 27 02:21:07 h16 pacemaker-fenced[7443]: notice: Client pacemaker-controld.7453.a9d67c8b wants to fence (reboot) 'h18' with device '(any)'
Apr 27 02:21:07 h16 pacemaker-fenced[7443]: notice: Merging stonith action 'reboot' targeting h18 originating from client pacemaker-controld.7453.73d8bbd6 with identical request from [email protected] (360>
Apr 27 02:22:52 h16 pacemaker-fenced[7443]: warning: fence_legacy_reboot_1 process (PID 39749) timed out
Apr 27 02:22:52 h16 pacemaker-fenced[7443]: warning: fence_legacy_reboot_1[39749] timed out after 120000ms
Apr 27 02:22:52 h16 pacemaker-fenced[7443]: error: Operation 'reboot' [39749] (call 2 from stonith_admin.controld.22336) for host 'h18' with device 'prm_stonith_sbd' returned: -62 (Timer expired)

I never saw such a message before.
Eventually:

Apr 27 02:24:53 h16 pacemaker-controld[7453]: notice: Stonith operation 31/1:3347:0:48bafcab-fecf-4ea0-84a8-c31ab1694b3a: OK (0)
Apr 27 02:24:53 h16 pacemaker-controld[7453]: notice: Peer h18 was terminated (reboot) by h16 on behalf of pacemaker-controld.7453: OK

The only thing I found out was that the kernel running without Xen does not show the RAM corruption.

Regards,
Ulrich

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
