Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Re: Why not retry a monitor (pacemaker-execd) that got a segmentation fault?
On Wed, Jun 15, 2022 at 2:10 PM Ulrich Windl wrote:
>
> >>> Klaus Wenninger schrieb am 15.06.2022 um 13:22 in Nachricht:
> > On Wed, Jun 15, 2022 at 10:33 AM Ulrich Windl wrote:
> >>
> ...
> >> (As said above it may be some RAM corruption where SMI (system management
> >> interrupts, or so) play a role, but Dell says the hardware is OK, and using
> >> SLES we don't have software support with Dell, so they won't even consider
> >> that fact.)
> >
> > That happens inside of VMs right? I mean nodes being VMs.
>
> No, it happens on the hypervisor nodes that are part of the cluster.
> What I described below as well froze the whole machine - till it was taken
> down by the hardware watchdog.
>
> > A couple of years back I had an issue running protected mode inside
> > of kvm-virtual machines on Lenovo laptops.
> > That was really an SMI issue (obviously issues when an SMI interrupt
> > was invoked during the CPU being in protected mode) that went away when
> > disabling SMI interrupts.
> > I have no idea if that is still possible with current chipsets. And I'm not
> > telling you to do that in production, but it might be interesting to narrow
> > the issue down still. One might run into thermal issues and such that
> > SMI is taking care of on that hardware.
>
> Well, as I have no better idea, I'd probably even give "kick it hard with
> the foot" a chance ;-)

I don't know if it is of much use, but this is what I was using, iirc:
https://github.com/zultron/smictrl. Jan back then wrote it for his laptop;
mine showed the same behavior, and being close enough chipset-wise it did
the trick on mine as well.
Obviously, reading UEFI variables from the OS triggers some SMI action as
well. So booting with a legacy BIOS - if possible - might be an interesting
test case.

> Regards,
> Ulrich
>
> > Klaus
> >>
> >> But actually I start believing such a system is a good playground for
> >> any HA solution ;-)
> >> Unfortunately here it's much more production than playground...
> >>
> >> Regards,
> >> Ulrich
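One way to test the SMI hypothesis without disabling anything is to watch the CPU's SMI counter around the freezes. Below is a minimal sketch, under the assumptions that the host has an Intel CPU and the msr-tools package installed; MSR 0x34 is MSR_SMI_COUNT on Intel parts only, and the script name and the `measure` argument are illustrative, not anything from this thread.

```shell
#!/bin/sh
# smi_watch.sh (hypothetical name): check whether SMIs fire at all.
# Assumes an Intel CPU and msr-tools; run as root after `modprobe msr`.

# Pure helper: delta between two hex readings of the SMI counter
# (rdmsr prints hex without a 0x prefix by default).
smi_delta() {
    # $1 = earlier reading, $2 = later reading
    printf '%d\n' "$(( 0x$2 - 0x$1 ))"
}

if [ "${1:-}" = "measure" ] && command -v rdmsr >/dev/null 2>&1; then
    before=$(rdmsr -p 0 0x34)   # MSR_SMI_COUNT on CPU 0 (Intel only)
    sleep 60
    after=$(rdmsr -p 0 0x34)
    echo "SMIs on CPU 0 in the last minute: $(smi_delta "$before" "$after")"
fi
```

Run as `sh smi_watch.sh measure` (as root). A counter that jumps right before a freeze would make the SMI theory more than a guess; smictrl-style suppression would then be the risky next step.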
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] Pacemaker 2.1.4 final release now available
Hi all,

The final release of Pacemaker 2.1.4 is now available at:
https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-2.1.4

This is a bug fix release, fixing regressions in the recent 2.1.3 release.

Many thanks to all contributors of source code to this release, including
Chris Lumens, Ken Gaillot, Petr Pavlu, and Reid Wahl.

--
Ken Gaillot
[ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Re: Why not retry a monitor (pacemaker-execd) that got a segmentation fault?
>>> Klaus Wenninger schrieb am 15.06.2022 um 13:22 in Nachricht:
> On Wed, Jun 15, 2022 at 10:33 AM Ulrich Windl wrote:
>>
>> ...
>> (As said above it may be some RAM corruption where SMI (system management
>> interrupts, or so) play a role, but Dell says the hardware is OK, and using
>> SLES we don't have software support with Dell, so they won't even consider
>> that fact.)
>
> That happens inside of VMs right? I mean nodes being VMs.

No, it happens on the hypervisor nodes that are part of the cluster.

> A couple of years back I had an issue running protected mode inside
> of kvm-virtual machines on Lenovo laptops.
> That was really an SMI issue (obviously issues when an SMI interrupt
> was invoked during the CPU being in protected mode) that went away when
> disabling SMI interrupts.
> I have no idea if that is still possible with current chipsets. And I'm not
> telling you to do that in production, but it might be interesting to narrow
> the issue down still. One might run into thermal issues and such that
> SMI is taking care of on that hardware.

Well, as I have no better idea, I'd probably even give "kick it hard with
the foot" a chance ;-)

Regards,
Ulrich

>
> Klaus
>>
>> But actually I start believing such a system is a good playground for
>> any HA solution ;-)
>> Unfortunately here it's much more production than playground...
>>
>> Regards,
>> Ulrich
Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Why not retry a monitor (pacemaker-execd) that got a segmentation fault?
On Wed, Jun 15, 2022 at 10:33 AM Ulrich Windl wrote:
>
> >>> Klaus Wenninger schrieb am 15.06.2022 um 10:00 in Nachricht:
> > On Wed, Jun 15, 2022 at 8:32 AM Ulrich Windl wrote:
> >>
> >> >>> Ulrich Windl schrieb am 14.06.2022 um 15:53 in Nachricht
> >> <62A892F0.174 : 161 : 60728>:
> >>
> >> ...
> >> > Yes it's odd, but isn't the cluster just to protect us from odd
> >> > situations? ;-)
> >>
> >> I have more odd stuff:
> >> Jun 14 20:40:09 rksaph18 pacemaker-execd[7020]: warning: prm_lockspace_ocfs2_monitor_12 process (PID 30234) timed out
> >> ...
> >> Jun 14 20:40:14 h18 pacemaker-execd[7020]: crit: prm_lockspace_ocfs2_monitor_12 process (PID 30234) will not die!
> >> ...
> >> Jun 14 20:40:53 h18 pacemaker-controld[7026]: warning: lrmd IPC request 525 failed: Connection timed out after 5000ms
> >> Jun 14 20:40:53 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -110: Connection timed out (110)
> >> ...
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_exec operation (timeout=9): -114: Connection timed out (110)
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Operation stop on prm_lockspace_ocfs2 failed: -70
> >> ...
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]: warning: Input I_FAIL received in state S_NOT_DC from do_lrm_rsc_op
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]: notice: State transition S_NOT_DC -> S_RECOVERY
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]: warning: Fast-tracking shutdown in response to errors
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Input I_TERMINATE received in state S_RECOVERY from do_recover
> >> Jun 14 20:42:28 h18 pacemaker-controld[7026]: warning: Sending IPC to lrmd disabled until pending reply received
> >> Jun 14 20:42:28 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]: warning: Sending IPC to lrmd disabled until pending reply received
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Stopped 2 recurring operations at shutdown (0 remaining)
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: 3 resources were active at shutdown
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from the executor
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from Corosync
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from the CIB manager
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: Could not recover from internal error
> >> Jun 14 20:42:33 h18 pacemakerd[7003]: error: pacemaker-controld[7026] exited with status 1 (Error occurred)
> >> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-schedulerd
> >> Jun 14 20:42:33 h18 pacemaker-schedulerd[7024]: notice: Caught 'Terminated' signal
> >> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-attrd
> >> Jun 14 20:42:33 h18 pacemaker-attrd[7022]: notice: Caught 'Terminated' signal
> >> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-execd
> >> Jun 14 20:42:34 h18 sbd[6856]: warning: inquisitor_child: pcmk health check: UNHEALTHY
> >> Jun 14 20:42:34 h18 sbd[6856]: warning: inquisitor_child: Servant pcmk is outdated (age: 41877)
> >> (SBD Fencing)
> >>
> >
> > Rolling it up from the back I guess the reaction to self-fence in case
> > pacemaker is telling it doesn't know - and isn't able to find out - about
> > the state of the resources is basically correct.
> >
> > Seeing the issue with the fake-age being printed - possibly causing
> > confusion - it reminds me that this should be addressed. Thought we had
> > already but obviously a false memory.
>
> Hi Klaus and others!
>
> Well that is the current update state of SLES15 SP3; maybe upstream updates
> did not make it into SLES yet; I don't know.
>
> > Would be interesting if pacemaker would recover the sub-processes without
> > sbd around and other ways of fencing - that should kick in in a similar
> > way - would need a significant time.
> > As pacemakerd recently started to ping the sub-daemons via ipc - instead
> > of just listening for signals - it would be interesting if logs we are
> > seeing are already from that code.
>
> The "code" probably is:
> pacemaker-2.0.5+20201202.ba59be712-150300.4.21.1.x86_64
>
> > That what is happening with the monitor-process kicked off by execd seems
> > to hog the ipc for a significant time might be an issue to look after.
>
> I don't know the details (even support at SUSE doesn't know what's going
[ClusterLabs] Antw: Re: Antw: [EXT] Re: Why not retry a monitor (pacemaker-execd) that got a segmentation fault?
>>> Klaus Wenninger schrieb am 15.06.2022 um 10:00 in Nachricht:
> On Wed, Jun 15, 2022 at 8:32 AM Ulrich Windl wrote:
>>
>> >>> Ulrich Windl schrieb am 14.06.2022 um 15:53 in Nachricht
>> <62A892F0.174 : 161 : 60728>:
>>
>> ...
>> > Yes it's odd, but isn't the cluster just to protect us from odd
>> > situations? ;-)
>>
>> I have more odd stuff:
>> Jun 14 20:40:09 rksaph18 pacemaker-execd[7020]: warning: prm_lockspace_ocfs2_monitor_12 process (PID 30234) timed out
>> ...
>> Jun 14 20:40:14 h18 pacemaker-execd[7020]: crit: prm_lockspace_ocfs2_monitor_12 process (PID 30234) will not die!
>> ...
>> Jun 14 20:40:53 h18 pacemaker-controld[7026]: warning: lrmd IPC request 525 failed: Connection timed out after 5000ms
>> Jun 14 20:40:53 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -110: Connection timed out (110)
>> ...
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_exec operation (timeout=9): -114: Connection timed out (110)
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Operation stop on prm_lockspace_ocfs2 failed: -70
>> ...
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]: warning: Input I_FAIL received in state S_NOT_DC from do_lrm_rsc_op
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]: notice: State transition S_NOT_DC -> S_RECOVERY
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]: warning: Fast-tracking shutdown in response to errors
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Input I_TERMINATE received in state S_RECOVERY from do_recover
>> Jun 14 20:42:28 h18 pacemaker-controld[7026]: warning: Sending IPC to lrmd disabled until pending reply received
>> Jun 14 20:42:28 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]: warning: Sending IPC to lrmd disabled until pending reply received
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Stopped 2 recurring operations at shutdown (0 remaining)
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: 3 resources were active at shutdown
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from the executor
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from Corosync
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from the CIB manager
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: Could not recover from internal error
>> Jun 14 20:42:33 h18 pacemakerd[7003]: error: pacemaker-controld[7026] exited with status 1 (Error occurred)
>> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-schedulerd
>> Jun 14 20:42:33 h18 pacemaker-schedulerd[7024]: notice: Caught 'Terminated' signal
>> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-attrd
>> Jun 14 20:42:33 h18 pacemaker-attrd[7022]: notice: Caught 'Terminated' signal
>> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-execd
>> Jun 14 20:42:34 h18 sbd[6856]: warning: inquisitor_child: pcmk health check: UNHEALTHY
>> Jun 14 20:42:34 h18 sbd[6856]: warning: inquisitor_child: Servant pcmk is outdated (age: 41877)
>> (SBD Fencing)
>>
>
> Rolling it up from the back I guess the reaction to self-fence in case
> pacemaker is telling it doesn't know - and isn't able to find out - about
> the state of the resources is basically correct.
>
> Seeing the issue with the fake-age being printed - possibly causing
> confusion - it reminds me that this should be addressed. Thought we had
> already but obviously a false memory.

Hi Klaus and others!

Well that is the current update state of SLES15 SP3; maybe upstream updates
did not make it into SLES yet; I don't know.

> Would be interesting if pacemaker would recover the sub-processes without
> sbd around and other ways of fencing - that should kick in in a similar
> way - would need a significant time.
> As pacemakerd recently started to ping the sub-daemons via ipc - instead
> of just listening for signals - it would be interesting if logs we are
> seeing are already from that code.

The "code" probably is:
pacemaker-2.0.5+20201202.ba59be712-150300.4.21.1.x86_64

> That what is happening with the monitor-process kicked off by execd seems
> to hog the ipc for a significant time might be an issue to look after.

I don't know the details (even support at SUSE doesn't know what's going on
in the kernel, it seems), but it looks as if one "stalled" monitor process
can cause the node to be fenced.

I had been considering this extreme paranoid idea: What if you could
configure three (different) monitor operations for a resource, and an
action will be triggered
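The "multiple monitors" idea above is not a pacemaker feature, but the voting half of it is easy to sketch. A hypothetical fragment such a custom resource agent's monitor action might contain (the function name and all checks are invented for illustration):

```shell
#!/bin/sh
# Hypothetical sketch of the idea above: report failure only when a
# majority of independent checks fail. Not a pacemaker feature; this
# would have to live inside a custom RA's monitor action.

OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# majority_monitor CHECK1 CHECK2 CHECK3 ...
# Each argument is a shell command; succeed if more than half succeed.
majority_monitor() {
    total=$#
    ok=0
    for check in "$@"; do
        if sh -c "$check" >/dev/null 2>&1; then
            ok=$((ok + 1))
        fi
    done
    # Strict majority, done in integer arithmetic: ok/total > 1/2.
    [ $((ok * 2)) -gt "$total" ] && return $OCF_SUCCESS
    return $OCF_ERR_GENERIC
}
```

With, say, `majority_monitor "kill -0 $pid" "test -S $socket" "$client_ping"` (all placeholder checks), one lying check no longer decides alone. It does not help with the failure mode in this thread, though: a check stuck in the kernel still blocks the whole function until something times it out.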
Re: [ClusterLabs] Antw: [EXT] Re: Why not retry a monitor (pacemaker-execd) that got a segmentation fault?
On Wed, Jun 15, 2022 at 8:32 AM Ulrich Windl wrote:
>
> >>> Ulrich Windl schrieb am 14.06.2022 um 15:53 in Nachricht
> <62A892F0.174 : 161 : 60728>:
>
> ...
> > Yes it's odd, but isn't the cluster just to protect us from odd
> > situations? ;-)
>
> I have more odd stuff:
> Jun 14 20:40:09 rksaph18 pacemaker-execd[7020]: warning: prm_lockspace_ocfs2_monitor_12 process (PID 30234) timed out
> ...
> Jun 14 20:40:14 h18 pacemaker-execd[7020]: crit: prm_lockspace_ocfs2_monitor_12 process (PID 30234) will not die!
> ...
> Jun 14 20:40:53 h18 pacemaker-controld[7026]: warning: lrmd IPC request 525 failed: Connection timed out after 5000ms
> Jun 14 20:40:53 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -110: Connection timed out (110)
> ...
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_exec operation (timeout=9): -114: Connection timed out (110)
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Operation stop on prm_lockspace_ocfs2 failed: -70
> ...
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: warning: Input I_FAIL received in state S_NOT_DC from do_lrm_rsc_op
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: notice: State transition S_NOT_DC -> S_RECOVERY
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: warning: Fast-tracking shutdown in response to errors
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Input I_TERMINATE received in state S_RECOVERY from do_recover
> Jun 14 20:42:28 h18 pacemaker-controld[7026]: warning: Sending IPC to lrmd disabled until pending reply received
> Jun 14 20:42:28 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: warning: Sending IPC to lrmd disabled until pending reply received
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Stopped 2 recurring operations at shutdown (0 remaining)
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: 3 resources were active at shutdown
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from the executor
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from Corosync
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from the CIB manager
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: Could not recover from internal error
> Jun 14 20:42:33 h18 pacemakerd[7003]: error: pacemaker-controld[7026] exited with status 1 (Error occurred)
> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-schedulerd
> Jun 14 20:42:33 h18 pacemaker-schedulerd[7024]: notice: Caught 'Terminated' signal
> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-attrd
> Jun 14 20:42:33 h18 pacemaker-attrd[7022]: notice: Caught 'Terminated' signal
> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-execd
> Jun 14 20:42:34 h18 sbd[6856]: warning: inquisitor_child: pcmk health check: UNHEALTHY
> Jun 14 20:42:34 h18 sbd[6856]: warning: inquisitor_child: Servant pcmk is outdated (age: 41877)
> (SBD Fencing)
>

Rolling it up from the back I guess the reaction to self-fence in case
pacemaker is telling it doesn't know - and isn't able to find out - about
the state of the resources is basically correct.

Seeing the issue with the fake-age being printed - possibly causing
confusion - it reminds me that this should be addressed. Thought we had
already but obviously a false memory.

Would be interesting if pacemaker would recover the sub-processes without
sbd around and other ways of fencing - that should kick in in a similar
way - would need a significant time.
As pacemakerd recently started to ping the sub-daemons via ipc - instead of
just listening for signals - it would be interesting if logs we are seeing
are already from that code.

That what is happening with the monitor-process kicked off by execd seems
to hog the ipc for a significant time might be an issue to look after.
Although the new implementation in pacemakerd might kick in and recover
execd - for what that is worth in the end.

This all seems to be kicked off by an RA that might not be robust enough,
or the node is in a state that just doesn't allow a better answer.
Guess timeouts and retries required to give a timely answer about the state
of a resource should be taken care of inside the RA.

Guess the last 2 are at least something totally different than fork
segfaulting, although that might as well be a sign that there is something
really wrong with the node.

Klaus

> Regards,
> Ulrich
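The point about handling timeouts inside the RA can be sketched with coreutils timeout(1). This is an illustration, not how any shipped resource agent does it; the function name is invented, and the 124/137 mapping follows GNU timeout's documented exit statuses:

```shell
#!/bin/sh
# Sketch: bound a possibly-hanging check inside the RA itself, so the
# monitor gives a timely answer instead of leaving execd to escalate
# to "process ... will not die!". Assumes GNU coreutils timeout(1).

OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# bounded_check SECONDS COMMAND [ARGS...]
bounded_check() {
    secs=$1; shift
    # SIGTERM at the deadline, SIGKILL 5s later if the check ignores it.
    timeout --kill-after=5 "$secs" "$@"
    rc=$?
    # GNU timeout reports 124 (TERM worked) or 137 (KILL was needed).
    if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
        return $OCF_ERR_GENERIC
    fi
    return "$rc"
}
```

The limitation matters here: if the check is stuck in uninterruptible sleep (D state), as a wedged OCFS2 access likely was, even SIGKILL does nothing and the wrapper itself hangs waiting for the child - at which point only fencing gives a reliable answer.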
[ClusterLabs] Antw: [EXT] Re: Why not retry a monitor (pacemaker-execd) that got a segmentation fault?
>>> Ulrich Windl schrieb am 14.06.2022 um 15:53 in Nachricht
<62A892F0.174 : 161 : 60728>:
...
> Yes it's odd, but isn't the cluster just to protect us from odd situations?
> ;-)

I have more odd stuff:

Jun 14 20:40:09 rksaph18 pacemaker-execd[7020]: warning: prm_lockspace_ocfs2_monitor_12 process (PID 30234) timed out
...
Jun 14 20:40:14 h18 pacemaker-execd[7020]: crit: prm_lockspace_ocfs2_monitor_12 process (PID 30234) will not die!
...
Jun 14 20:40:53 h18 pacemaker-controld[7026]: warning: lrmd IPC request 525 failed: Connection timed out after 5000ms
Jun 14 20:40:53 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -110: Connection timed out (110)
...
Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_exec operation (timeout=9): -114: Connection timed out (110)
Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Operation stop on prm_lockspace_ocfs2 failed: -70
...
Jun 14 20:42:23 h18 pacemaker-controld[7026]: warning: Input I_FAIL received in state S_NOT_DC from do_lrm_rsc_op
Jun 14 20:42:23 h18 pacemaker-controld[7026]: notice: State transition S_NOT_DC -> S_RECOVERY
Jun 14 20:42:23 h18 pacemaker-controld[7026]: warning: Fast-tracking shutdown in response to errors
Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Input I_TERMINATE received in state S_RECOVERY from do_recover
Jun 14 20:42:28 h18 pacemaker-controld[7026]: warning: Sending IPC to lrmd disabled until pending reply received
Jun 14 20:42:28 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
Jun 14 20:42:33 h18 pacemaker-controld[7026]: warning: Sending IPC to lrmd disabled until pending reply received
Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Stopped 2 recurring operations at shutdown (0 remaining)
Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: 3 resources were active at shutdown
Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from the executor
Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from Corosync
Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from the CIB manager
Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: Could not recover from internal error
Jun 14 20:42:33 h18 pacemakerd[7003]: error: pacemaker-controld[7026] exited with status 1 (Error occurred)
Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-schedulerd
Jun 14 20:42:33 h18 pacemaker-schedulerd[7024]: notice: Caught 'Terminated' signal
Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-attrd
Jun 14 20:42:33 h18 pacemaker-attrd[7022]: notice: Caught 'Terminated' signal
Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-execd
Jun 14 20:42:34 h18 sbd[6856]: warning: inquisitor_child: pcmk health check: UNHEALTHY
Jun 14 20:42:34 h18 sbd[6856]: warning: inquisitor_child: Servant pcmk is outdated (age: 41877)
(SBD Fencing)

Regards,
Ulrich