Hi Ulrich,

On 2022/4/27 11:13, Ulrich Windl wrote:
Update for the Update:

I had installed SLES updates in one VM and rebooted it via the cluster. While
installing the updates in the VM, the Xen host got RAM corruption (it seems any
disk I/O on the host, either locally or via a VM image, causes RAM corruption):

I totally understand your frustration about this, but I don't really see how the potential kernel issue is relevant to this mailing list.

I believe SUSE support has been working on it and trying to address it, and they will update you once there's further progress.

Regarding the cluster-related topics, please find my comments below.


Apr 27 10:56:44 h19 kernel: pacemaker-execd[39797]: segfault at 3a46 ip
0000000000003a46 sp 00007ffd1c92e8e8 error 14 in
pacemaker-execd[5565921cc000+b000]

Fortunately that wasn't fatal and my rescue script kicked in before things got
really bad:
Apr 27 11:00:01 h19 reboot-before-panic[40630]: RAM corruption detected,
starting pro-active reboot

All VMs could be live-migrated away before reboot, but this SLES release is
completely unusable!

Regards,
Ulrich



Ulrich Windl wrote on 27.04.2022 at 08:02 in message <6268DC91.C1D : 161 : 60728>:
Hi!

I want to give a non-update on the issue:
The kernel still segfaults random processes, and in two months nothing has
come from support that could help improve the situation.
The cluster is logging all kinds of non-funny messages like these:

Apr 27 02:20:49 h18 systemd-coredump[22319]: Process 22317 (controld)
of user 0 dumped core.
Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000246ea08b
idx:1 val:3
Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000259b58a0
idx:1 val:7
Apr 27 02:20:49 h18 controld(prm_DLM)[22330]: ERROR: Uncontrolled lockspace
exists, system must reboot. Executing suicide fencing

For a hypervisor host this means that many VMs are reset the hard way!
Other resources weren't stopped properly either, of course.


There are also two NULL-pointer outputs in the messages on the DC:
Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Found 18 entries
for 118/(null): 0 in progress, 17 completed
Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Node 118/(null)
last kicked at: 1650418762

I guess that NULL pointer should have been the host name (h18) in reality.

It's expected to be NULL here. DLM requests fencing through pacemaker's stonith API, targeting the node by its corosync nodeid (118 here), which is what DLM knows, rather than by the node name. Pacemaker does the translation and eventually issues the fencing.
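
To illustrate (just a rough sketch, not dlm_controld's actual code; the function signatures are paraphrased from pacemaker's crm/stonith-ng.h, so please double-check there), this is roughly how DLM's fence helper ends up asking pacemaker to fence a peer by nodeid only, which is why the name shows up as "(null)":

/*
 * Rough sketch only: DLM knows the corosync nodeid (118) but not the
 * node name, so the uname argument is NULL -- hence "118/(null)" in
 * the log. Signatures paraphrased from crm/stonith-ng.h.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <time.h>
#include <crm/stonith-ng.h>

int main(void)
{
    uint32_t nodeid = 118;    /* corosync nodeid of the node to fence */

    /* Request fencing by nodeid only; pacemaker resolves the nodeid to
     * the node name (h18) and carries out the actual fencing. */
    int rc = stonith_api_kick(nodeid, NULL /* node name unknown to DLM */,
                              120  /* timeout in seconds, example value */,
                              true /* "off" rather than reboot, my assumption */);

    /* Ask when this node was last fenced -- the call behind the
     * "stonith_api_time: Node 118/(null) last kicked at: ..." message. */
    time_t last = stonith_api_time(nodeid, NULL, false /* in_progress */);

    printf("kick rc=%d, last kicked at %lld\n", rc, (long long) last);
    return 0;
}

(The sketch would have to be linked against pacemaker's fencing client library, libstonithd if I remember correctly.)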


Also it seems h18 fenced itself, and the DC h16, seeing that, wants to fence it
again (to make sure, maybe), but there is some odd problem:

Apr 27 02:21:07 h16 pacemaker-controld[7453]:  notice: Requesting fencing
(reboot) of node h18
Apr 27 02:21:07 h16 pacemaker-fenced[7443]:  notice: Client
pacemaker-controld.7453.a9d67c8b wants to fence (reboot) 'h18' with device
'(any)'
Apr 27 02:21:07 h16 pacemaker-fenced[7443]:  notice: Merging stonith action
'reboot' targeting h18 originating from client
pacemaker-controld.7453.73d8bbd6 with identical request from
[email protected] (360>

This is also expected when DLM is used. Besides the fencing that DLM proactively requested, pacemaker has its own reason to issue fencing targeting the node. The fencer daemon is aware that there's already a pending/ongoing fencing action targeting the same node, so it doesn't need to issue it again.


Apr 27 02:22:52 h16 pacemaker-fenced[7443]:  warning: fence_legacy_reboot_1
process (PID 39749) timed out
Apr 27 02:22:52 h16 pacemaker-fenced[7443]:  warning:
fence_legacy_reboot_1[39749] timed out after 120000ms
Apr 27 02:22:52 h16 pacemaker-fenced[7443]:  error: Operation 'reboot'
[39749] (call 2 from stonith_admin.controld.22336) for host 'h18' with device
'prm_stonith_sbd' returned: -62 (Timer expired)

Please make sure:
stonith-timeout > sbd_msgwait + pcmk_delay_max

If that was already the case, sbd was probably having difficulties writing the poison pill at that time ...
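
To make that rule concrete, here is a toy sketch (just an illustration, not part of any cluster tool; the numbers are made-up example values, not taken from your cluster). stonith-timeout is the cluster property, msgwait is what "sbd -d <device> dump" reports as Timeout (msgwait), and pcmk_delay_max is a parameter of the fencing resource:

/* Toy sanity check for: stonith-timeout > sbd_msgwait + pcmk_delay_max.
 * Example values only; substitute the real ones from your configuration. */
#include <stdio.h>
#include <stdbool.h>

static bool stonith_timeout_sufficient(int stonith_timeout_s, int msgwait_s,
                                       int pcmk_delay_max_s)
{
    /* The fencer must be allowed to wait at least as long as sbd needs to
     * deliver the poison pill plus the worst-case random delay. */
    return stonith_timeout_s > msgwait_s + pcmk_delay_max_s;
}

int main(void)
{
    int stonith_timeout_s = 120;  /* matches the 120000ms timeout in the log */
    int msgwait_s = 90;           /* example sbd msgwait */
    int pcmk_delay_max_s = 30;    /* example random fencing delay */

    if (stonith_timeout_sufficient(stonith_timeout_s, msgwait_s, pcmk_delay_max_s))
        printf("stonith-timeout looks sufficient\n");
    else
        printf("stonith-timeout too small: %ds <= %ds + %ds\n",
               stonith_timeout_s, msgwait_s, pcmk_delay_max_s);
    return 0;
}

With those example numbers the check fails (120 is not greater than 90 + 30), which is exactly the kind of configuration where a reboot request can run into the 'Timer expired' timeout seen above.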

Regards,
  Yan


I never saw such a message before. Eventually:

Apr 27 02:24:53 h16 pacemaker-controld[7453]:  notice: Stonith operation
31/1:3347:0:48bafcab-fecf-4ea0-84a8-c31ab1694b3a: OK (0)
Apr 27 02:24:53 h16 pacemaker-controld[7453]:  notice: Peer h18 was
terminated (reboot) by h16 on behalf of pacemaker-controld.7453: OK

The only thing I found out was that the kernel running without Xen does not
show RAM corruption.

Regards,
Ulrich

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
