>>> "Gao,Yan" <[email protected]> schrieb am 27.04.2022 um 14:31 in Nachricht <[email protected]>: > Hi Ulrich, > > On 2022/4/27 11:13, Ulrich Windl wrote: >> Update for the Update: >> >> I had installed SLES Updates in one VM and rebooted it via cluster. While >> installing the updates in the VM the Xen host got RAM corruption (it seems > any >> disk I/O on the host, either locally or via a VM image causes RAM > corruption): > > I totally understand your frustrations on this, but I don't really see > how much the potential kernel issue is relevant to this mailing list.
Well, you use an HA solution based on pacemaker, and that solution fails
miserably. I guess users don't want to have the same experience while relying
on their services to run. The other thing is a kind of "product warning": don't
use SLES15 SP3 with Xen and a cluster right now if you really want HA. That was
my idea. I understand that SUSE does not like to see such messages in public,
but maybe "24x7" support should try a bit harder to solve the issue, or at
least provide a work-around, than they have so far.

>
> I believe SUSE support has been working and trying to address it and
> they will update you once there's further progress.

Well, it's been more than two months since reporting... No need to say more.

>
> About the topics related to the cluster, please find the comments below.

OK.

>
>>
>> Apr 27 10:56:44 h19 kernel: pacemaker-execd[39797]: segfault at 3a46 ip
>> 0000000000003a46 sp 00007ffd1c92e8e8 error 14 in
>> pacemaker-execd[5565921cc000+b000]
>>
>> Fortunately that wasn't fatal and my rescue script kicked in before things
>> get really bad:
>> Apr 27 11:00:01 h19 reboot-before-panic[40630]: RAM corruption detected,
>> starting pro-active reboot
>>
>> All VMs could be live-migrated away before the reboot, but this SLES
>> release is completely unusable!
>>
>> Regards,
>> Ulrich
>>
>>
>>>>> Ulrich Windl wrote on 27.04.2022 at 08:02 in message
>>>>> <6268DC91.C1D : 161 : 60728>:
>>> Hi!
>>>
>>> I want to give a non-update on the issue:
>>> The kernel still segfaults random processes, and there is really nothing
>>> from support within two months that could help improve the situation.
>>> The cluster is logging all kinds of non-funny messages like these:
>>>
>>> Apr 27 02:20:49 h18 systemd-coredump[22319]: [] Process 22317 (controld)
>>> of user 0 dumped core.
>>> Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000246ea08b
>>> idx:1 val:3
>>> Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000259b58a0
>>> idx:1 val:7
>>> Apr 27 02:20:49 h18 controld(prm_DLM)[22330]: ERROR: Uncontrolled lockspace
>>> exists, system must reboot. Executing suicide fencing
>>>
>>> For a hypervisor host this means that many VMs are reset the hard way!
>>> Other resources weren't stopped properly either, of course.
>>>
>>>
>>> There are also two NULL-pointer outputs in the messages on the DC:
>>> Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Found 18 entries
>>> for 118/(null): 0 in progress, 17 completed
>>> Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Node 118/(null)
>>> last kicked at: 1650418762
>>>
>>> I guess that NULL pointer should have been the host name (h18) in reality.
>
> It's expected to be NULL here. DLM requests fencing through
> pacemaker's stonith API, targeting a node by its corosync nodeid (118
> here), which is what it knows, rather than by the node name.
> Pacemaker will do the interpretation and eventually issue the fencing.
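
If I read that right, the helper only ever hands pacemaker the corosync
nodeid, which would explain the "118/(null)" in the log. For the archives, a
minimal sketch of such a by-nodeid request, assuming pacemaker's
stonith_api_kick()/stonith_api_time() helpers from <crm/stonith-ng.h>
(illustrative only, not dlm_controld's actual code):

/* fence_by_nodeid.c - hedged sketch of fencing by corosync nodeid only.
 * Because no node name is passed, the fencer logs the target as "118/(null)".
 * Build roughly like: gcc fence_by_nodeid.c -o fence_by_nodeid -lstonithd
 * (exact include/link flags depend on the installed pacemaker devel packages).
 */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <time.h>
#include <crm/stonith-ng.h>

int main(void)
{
    uint32_t nodeid = 118;   /* corosync nodeid of the node to fence (h18 here) */

    /* Ask the fencer to reboot the node; uname is NULL because only the
     * nodeid is known - pacemaker maps it to a node name internally. */
    int rc = stonith_api_kick(nodeid, NULL, 120 /* timeout, s */, false /* reboot, not off */);
    printf("kick returned %d\n", rc);

    /* Check when (if ever) that node was last successfully fenced. */
    time_t last = stonith_api_time(nodeid, NULL, false /* completed, not in-progress */);
    printf("node %u last fenced at %lld\n", nodeid, (long long) last);
    return 0;
}
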
>
>>>
>>> Also it seems h18 fenced itself, and the DC h16, seeing that, wants to
>>> fence again (to make sure, maybe), but there is some odd problem:
>>>
>>> Apr 27 02:21:07 h16 pacemaker-controld[7453]: notice: Requesting fencing
>>> (reboot) of node h18
>>> Apr 27 02:21:07 h16 pacemaker-fenced[7443]: notice: Client
>>> pacemaker-controld.7453.a9d67c8b wants to fence (reboot) 'h18' with device
>>> '(any)'
>>> Apr 27 02:21:07 h16 pacemaker-fenced[7443]: notice: Merging stonith action
>>> 'reboot' targeting h18 originating from client
>>> pacemaker-controld.7453.73d8bbd6 with identical request from
>>> [email protected] (360
>
> This is also as expected when DLM is used. Despite the fencing previously
> proactively requested by DLM, pacemaker also has its own reason to issue a
> fencing targeting the node. And the fenced daemon is aware there's already a
> pending/on-going fencing targeting the same node, so it doesn't really need
> to issue it once again.
>
>>>
>>> Apr 27 02:22:52 h16 pacemaker-fenced[7443]: warning: fence_legacy_reboot_1
>>> process (PID 39749) timed out
>>> Apr 27 02:22:52 h16 pacemaker-fenced[7443]: warning:
>>> fence_legacy_reboot_1[39749] timed out after 120000ms
>>> Apr 27 02:22:52 h16 pacemaker-fenced[7443]: error: Operation 'reboot'
>>> [39749] (call 2 from stonith_admin.controld.22336) for host 'h18' with
>>> device 'prm_stonith_sbd' returned: -62 (Timer expired)
>
> Please make sure:
> stonith-timeout > sbd_msgwait + pcmk_delay_max

Checked that; it's true. (A small illustration of that check is at the very
end of this mail.)

>
> If it was already the case, probably sbd was encountering certain
> difficulties writing the poison pill at that time ...

Yes, as said before: with that kernel, most HA mechanisms just fail.

Regards,
Ulrich

>
> Regards,
> Yan
>
>>>
>>> I never saw such a message before. Eventually:
>>>
>>> Apr 27 02:24:53 h16 pacemaker-controld[7453]: notice: Stonith operation
>>> 31/1:3347:0:48bafcab-fecf-4ea0-84a8-c31ab1694b3a: OK (0)
>>> Apr 27 02:24:53 h16 pacemaker-controld[7453]: notice: Peer h18 was
>>> terminated (reboot) by h16 on behalf of pacemaker-controld.7453: OK
>>>
>>> The only thing I found out was that the kernel running without Xen does
>>> not show RAM corruption.
>>>
>>> Regards,
>>> Ulrich
>>>
>>>
>>>
>>>
>>
>>
>>
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
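
P.S. The illustration of Yan's timeout rule mentioned above, spelled out with
made-up numbers (a toy sketch only; on a real cluster the values come from the
stonith-timeout cluster property, the msgwait shown by "sbd -d <device> dump",
and the pcmk_delay_max parameter of the SBD stonith resource):

/* timeout_check.c - toy illustration of the rule
 *   stonith-timeout > sbd msgwait + pcmk_delay_max
 * All numbers below are made up; they are NOT the values from this cluster.
 */
#include <stdio.h>

int main(void)
{
    int stonith_timeout = 120; /* s, cluster property (made-up value) */
    int sbd_msgwait     = 90;  /* s, from the SBD device header (made-up value) */
    int pcmk_delay_max  = 30;  /* s, stonith resource parameter (made-up value) */

    if (stonith_timeout > sbd_msgwait + pcmk_delay_max) {
        printf("OK: the fencer waits long enough for the poison pill\n");
    } else {
        printf("Too tight: the reboot operation can time out, as in the log above\n");
    }
    return 0;
}
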
