Hi Ulrich,
On 2022/4/27 11:13, Ulrich Windl wrote:
Update for the Update:
I had installed SLES updates in one VM and rebooted it via the cluster. While
installing the updates in the VM, the Xen host got RAM corruption (it seems any
disk I/O on the host, either locally or via a VM image, causes RAM corruption):
I totally understand your frustration about this, but I don't really see
how the potential kernel issue is relevant to this mailing list.
I believe SUSE support has been working on it and trying to address it, and
they will update you once there's further progress.
Regarding the cluster-related topics, please find my comments below.
Apr 27 10:56:44 h19 kernel: pacemaker-execd[39797]: segfault at 3a46 ip
0000000000003a46 sp 00007ffd1c92e8e8 error 14 in
pacemaker-execd[5565921cc000+b000]
Fortunately that wasn't fatal, and my rescue script kicked in before things got
really bad:
Apr 27 11:00:01 h19 reboot-before-panic[40630]: RAM corruption detected,
starting pro-active reboot
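The idea behind that script is roughly the following (a purely illustrative
Python sketch, not the actual reboot-before-panic script; the symptom patterns
and the migrate/reboot step are just placeholders):

import subprocess

# log patterns that, on this host, have preceded the RAM corruption
SYMPTOMS = ("Bad rss-counter state", "segfault at")

def watch_journal():
    # follow the kernel log only, starting from "now"
    proc = subprocess.Popen(["journalctl", "-k", "-f", "-n", "0"],
                            stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        if any(s in line for s in SYMPTOMS):
            print("RAM corruption detected, starting pro-active reboot")
            # here: live-migrate the VMs away, then reboot the host
            break

watch_journal()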
All VMs could be live-migrated away before reboot, but this SLES release is
completely unusable!
Regards,
Ulrich
Ulrich Windl wrote on 27.04.2022 at 08:02 in message <6268DC91.C1D : 161 : 60728>:
Hi!
I want to give a non-update on the issue:
The kernel still segfaults random processes, and within two months support has
provided really nothing that could help improve the situation.
The cluster is logging all kinds of non-funny messages like these:
Apr 27 02:20:49 h18 systemd-coredump[22319]: Process 22317 (controld)
of user 0 dumped core.
Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000246ea08b
idx:1 val:3
Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000259b58a0
idx:1 val:7
Apr 27 02:20:49 h18 controld(prm_DLM)[22330]: ERROR: Uncontrolled lockspace
exists, system must reboot. Executing suicide fencing
For a hypervisor host this means that many VMs are reset the hard way!
Other resources weren't stopped properly either, of course.
There are also two NULL-pointer outputs in messages on the DC:
Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Found 18 entries
for 118/(null): 0 in progress, 17 completed
Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Node 118/(null)
last kicked at: 1650418762
I guess that NULL pointer should have been the host name (h18) in reality.
It's expected to be NULL here. DLM requests fencing through
Pacemaker's stonith API, targeting a node by its corosync nodeid (118
here), which it knows, rather than by the node name.
Pacemaker will do the interpretation and eventually issue the fencing.
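To illustrate the flow (just a rough Python sketch, not Pacemaker's actual
API; the membership table and function names are made up): DLM's request
carries only the nodeid, so the name field is unset until Pacemaker resolves it.

# hypothetical membership table as Pacemaker would know it
corosync_members = {116: "h16", 118: "h18", 119: "h19"}

def dlm_fence_request(nodeid):
    # DLM only knows the corosync nodeid of the failed node, so the request
    # handed to the stonith layer has no node name yet -> logged as "(null)"
    return {"target_nodeid": nodeid, "target_name": None, "action": "reboot"}

def pacemaker_interpret(request):
    # Pacemaker maps the nodeid to a node name and issues the actual fencing
    name = corosync_members.get(request["target_nodeid"])
    print(f"fencing ({request['action']}) of node {name} "
          f"(nodeid {request['target_nodeid']})")
    return name

pacemaker_interpret(dlm_fence_request(118))   # -> fencing (reboot) of node h18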
Also it seems h18 fenced itself, and the DC h16, seeing that, wants to fence it
again (to make sure, maybe), but there is some odd problem:
Apr 27 02:21:07 h16 pacemaker-controld[7453]: notice: Requesting fencing
(reboot) of node h18
Apr 27 02:21:07 h16 pacemaker-fenced[7443]: notice: Client
pacemaker-controld.7453.a9d67c8b wants to fence (reboot) 'h18' with device
'(any)'
Apr 27 02:21:07 h16 pacemaker-fenced[7443]: notice: Merging stonith action
'reboot' targeting h18 originating from client
pacemaker-controld.7453.73d8bbd6 with identical request from
[email protected] (360>
This is also expected when DLM is used. Besides the fencing
previously and proactively requested by DLM, Pacemaker has its own reason
to issue fencing targeting the node. The fenced daemon is aware that
there's already a pending/ongoing fencing action targeting the same node, so
it doesn't really need to issue it once again.
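Conceptually the merging works like this rough sketch (hypothetical Python,
not the real pacemaker-fenced code; the client names are only loosely based on
your log):

in_flight = {}   # (target, action) -> list of requesting clients

def request_fencing(target, action, client):
    key = (target, action)
    if key in in_flight:
        # identical request already pending: attach instead of fencing twice
        in_flight[key].append(client)
        print(f"Merging '{action}' targeting {target} from {client} "
              f"with identical request from {in_flight[key][0]}")
    else:
        in_flight[key] = [client]
        print(f"Executing '{action}' targeting {target} for {client}")

def fencing_finished(target, action, result):
    # every merged requester gets the single result
    for client in in_flight.pop((target, action), []):
        print(f"Reporting {result} to {client}")

request_fencing("h18", "reboot", "stonith-api.39796")        # from DLM
request_fencing("h18", "reboot", "pacemaker-controld.7453")  # merged
fencing_finished("h18", "reboot", "OK")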
Apr 27 02:22:52 h16 pacemaker-fenced[7443]: warning: fence_legacy_reboot_1
process (PID 39749) timed out
Apr 27 02:22:52 h16 pacemaker-fenced[7443]: warning:
fence_legacy_reboot_1[39749] timed out after 120000ms
Apr 27 02:22:52 h16 pacemaker-fenced[7443]: error: Operation 'reboot'
[39749] (call 2 from stonith_admin.controld.22336) for host 'h18' with
device
'prm_stonith_sbd' returned: -62 (Timer expired)
Please make sure:
stonith-timeout > sbd_msgwait + pcmk_delay_max
If that was already the case, sbd was probably encountering certain
difficulties writing the poison pill at that time ...
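For example, a trivial check of that rule with made-up numbers (the real
values come from your SBD device header and the cluster properties):

sbd_msgwait = 120        # seconds, e.g. "Timeout (msgwait)" of the SBD device
pcmk_delay_max = 30      # seconds, random delay configured on the fence device
stonith_timeout = 120    # seconds, cluster property

required = sbd_msgwait + pcmk_delay_max
if stonith_timeout <= required:
    print(f"stonith-timeout={stonith_timeout}s is too small; "
          f"it needs to be > {required}s (msgwait + pcmk_delay_max)")
else:
    print("stonith-timeout satisfies the rule")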
Regards,
Yan
I never saw such a message before. Eventually:
Apr 27 02:24:53 h16 pacemaker-controld[7453]: notice: Stonith operation
31/1:3347:0:48bafcab-fecf-4ea0-84a8-c31ab1694b3a: OK (0)
Apr 27 02:24:53 h16 pacemaker-controld[7453]: notice: Peer h18 was
terminated (reboot) by h16 on behalf of pacemaker-controld.7453: OK
The only thing I found out was that the kernel running without Xen does not
show RAM corruption.
Regards,
Ulrich
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/