Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Re: Why not retry a monitor (pacemaker-execd) that got a segmentation fault?

2022-06-15 Thread Klaus Wenninger
On Wed, Jun 15, 2022 at 2:10 PM Ulrich Windl
 wrote:
>
> >>> Klaus Wenninger wrote on 15.06.2022 at 13:22 in message:
> > On Wed, Jun 15, 2022 at 10:33 AM Ulrich Windl
> >  wrote:
> >>
>
> ...
>
> >> (As said above it may be some RAM corruption where SMI (system management
> >> interrupts, or so) play a role, but Dell says the hardware is OK, and using
> >> SLES we don't have software support with Dell, so they won't even consider that
> >> fact.)
> >
> > That happens inside of VMs right? I mean nodes being VMs.
>
> No, it happens on the hypervisor nodes that are part of the cluster.
>

What I describe below also froze the whole machine, until it was taken
down by the hardware watchdog.

> > A couple of years back I had an issue running protected mode inside
> > of kvm-virtual machines on Lenovo laptops.
> > That was really an SMI issue (obviously issues when an SMI interrupt
> > was invoked during the CPU being in protected mode) that went away
> > disabling SMI interrupts.
> > I have no idea if that is still possible with current chipsets. And I'm not
> > telling you to do that in production but it might be interesting to narrow
> > the issue down still. One might run into thermal issues and such
> > SMI is taking care of on that hardware.
>
> Well, as I have no better idea, I'd probably even give "kick it hard with the 
> foot" a chance ;-)

I don't know if it is of much use, but this is what I was using, IIRC:
https://github.com/zultron/smictrl.
Jan wrote it back then for his laptop; mine showed the same behavior, and
since the chipsets were close enough, it did the trick on mine as well.

Apparently reading UEFI variables from the OS also triggers some SMI activity,
so booting with a legacy BIOS - if possible - might be an interesting test case.
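
As a side note, a minimal sketch (assuming Linux; only standard sysfs paths,
nothing cluster-specific) for checking which firmware interface the node was
actually booted with - /sys/firmware/efi only exists on UEFI boots:

import os

def boot_mode():
    # /sys/firmware/efi is populated by the kernel only when it was started
    # via UEFI; on a legacy BIOS boot the directory is absent.
    return "UEFI" if os.path.isdir("/sys/firmware/efi") else "legacy BIOS"

if __name__ == "__main__":
    print("Booted via:", boot_mode())

Running that before and after flipping the firmware setting makes it easy to
confirm the comparison really happened in the other boot mode.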

>
> Regards,
> Ulrich
>
> >
> > Klaus
> >>
> >> But actually I start believing such a system is a good playground for any 
> >> HA
> >> solution ;-)
> >> Unfortunately here it's much more production than playground...
> >>
> >> Regards,
> >> Ulrich
> >>

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Pacemaker 2.1.4 final release now available

2022-06-15 Thread Ken Gaillot
Hi all,

The final release of Pacemaker 2.1.4 is now available at:

  https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-2.1.4

This is a bug fix release, fixing regressions in the recent 2.1.3
release.

Many thanks to all contributors of source code to this release,
including Chris Lumens, Ken Gaillot, Petr Pavlu, and Reid Wahl.
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Re: Why not retry a monitor (pacemaker-execd) that got a segmentation fault?

2022-06-15 Thread Ulrich Windl
>>> Klaus Wenninger wrote on 15.06.2022 at 13:22 in message:
> On Wed, Jun 15, 2022 at 10:33 AM Ulrich Windl
>  wrote:
>>

...

>> (As said above it may be some RAM corruption where SMI (system management
>> interrupts, or so) play a role, but Dell says the hardware is OK, and using
>> SLES we don't have software support with Dell, so they won't even consider that
>> fact.)
> 
> That happens inside of VMs right? I mean nodes being VMs.

No, it happens on the hypervisor nodes that are part of the cluster.

> A couple of years back I had an issue running protected mode inside
> of kvm-virtual machines on Lenovo laptops.
> That was really an SMI issue (obviously issues when an SMI interrupt
> was invoked during the CPU being in protected mode) that went away
> disabling SMI interrupts.
> I have no idea if that is still possible with current chipsets. And I'm not
> telling you to do that in production but it might be interesting to narrow
> the issue down still. One might run into thermal issues and such
> SMI is taking care of on that hardware.

Well, as I have no better idea, I'd probably even give "kick it hard with the 
foot" a chance ;-)

Regards,
Ulrich

> 
> Klaus
>>
>> But actually I start believing such a system is a good playground for any HA
>> solution ;-)
>> Unfortunately here it's much more production than playground...
>>
>> Regards,
>> Ulrich

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Why not retry a monitor (pacemaker-execd) that got a segmentation fault?

2022-06-15 Thread Klaus Wenninger
On Wed, Jun 15, 2022 at 10:33 AM Ulrich Windl
 wrote:
>
> >>> Klaus Wenninger wrote on 15.06.2022 at 10:00 in message:
> > On Wed, Jun 15, 2022 at 8:32 AM Ulrich Windl
> >  wrote:
> >>
> >> >>> Ulrich Windl wrote on 14.06.2022 at 15:53 in message <62A892F0.174 : 161 : 60728>:
> >>
> >> ...
> >> > Yes it's odd, but isn't the cluster just to protect us from odd situations?
> >> > ;-)
> >>
> >> I have more odd stuff:
> >> Jun 14 20:40:09 rksaph18 pacemaker-execd[7020]:  warning: prm_lockspace_ocfs2_monitor_12 process (PID 30234) timed out
> >> ...
> >> Jun 14 20:40:14 h18 pacemaker-execd[7020]:  crit: prm_lockspace_ocfs2_monitor_12 process (PID 30234) will not die!
> >> ...
> >> Jun 14 20:40:53 h18 pacemaker-controld[7026]:  warning: lrmd IPC request 525 failed: Connection timed out after 5000ms
> >> Jun 14 20:40:53 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -110: Connection timed out (110)
> >> ...
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_exec operation (timeout=9): -114: Connection timed out (110)
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Operation stop on prm_lockspace_ocfs2 failed: -70
> >> ...
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  warning: Input I_FAIL received in state S_NOT_DC from do_lrm_rsc_op
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  notice: State transition S_NOT_DC -> S_RECOVERY
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  warning: Fast-tracking shutdown in response to errors
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Input I_TERMINATE received in state S_RECOVERY from do_recover
> >> Jun 14 20:42:28 h18 pacemaker-controld[7026]:  warning: Sending IPC to lrmd disabled until pending reply received
> >> Jun 14 20:42:28 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  warning: Sending IPC to lrmd disabled until pending reply received
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Stopped 2 recurring operations at shutdown (0 remaining)
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: 3 resources were active at shutdown
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from the executor
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from Corosync
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from the CIB manager
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: Could not recover from internal error
> >> Jun 14 20:42:33 h18 pacemakerd[7003]:  error: pacemaker-controld[7026] exited with status 1 (Error occurred)
> >> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-schedulerd
> >> Jun 14 20:42:33 h18 pacemaker-schedulerd[7024]:  notice: Caught 'Terminated' signal
> >> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-attrd
> >> Jun 14 20:42:33 h18 pacemaker-attrd[7022]:  notice: Caught 'Terminated' signal
> >> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-execd
> >> Jun 14 20:42:34 h18 sbd[6856]:  warning: inquisitor_child: pcmk health check: UNHEALTHY
> >> Jun 14 20:42:34 h18 sbd[6856]:  warning: inquisitor_child: Servant pcmk is outdated (age: 41877)
> >> (SBD Fencing)
> >>
> >
> > Rolling it up from the back I guess the reaction to self‑fence in case
> > pacemaker
> > is telling it doesn't know ‑ and isn't able to find out ‑ about the
> > state of the resources
> > is basically correct.
> >
> > Seeing the issue with the fake‑age being printed ‑ possibly causing
> > confusion ‑ it reminds
> > me that this should be addressed. Thought we had already but obviously
> > a false memory.
>
> Hi Klaus and others!
>
> Well that is the current update state of SLES15 SP3; maybe upstream updates
> did not make it into SLES yet; I don't know.
>
> >
> > Would be interesting if pacemaker would recover the sub‑processes
> > without sbd around
> > and other ways of fencing ‑ that should kick in in a similar way ‑
> > would need a significant
> > time.
> > As pacemakerd recently started to ping the sub‑daemons via ipc ‑
> > instead of just listening
> > for signals ‑ it would be interesting if logs we are seeing are
> > already from that code.
>
> The "code" probably is:
> pacemaker-2.0.5+20201202.ba59be712-150300.4.21.1.x86_64
>
> >
> > That what is happening with the monitor-process kicked off by execd seems to hog
> > the ipc for a significant time might be an issue to look after.
>
> I don't know the details (even support at SUSE doesn't know what's going 

[ClusterLabs] Antw: Re: Antw: [EXT] Re: Why not retry a monitor (pacemaker-execd) that got a segmentation fault?

2022-06-15 Thread Ulrich Windl
>>> Klaus Wenninger wrote on 15.06.2022 at 10:00 in message:
> On Wed, Jun 15, 2022 at 8:32 AM Ulrich Windl
>  wrote:
>>
>> >>> Ulrich Windl wrote on 14.06.2022 at 15:53 in message <62A892F0.174 : 161 : 60728>:
>>
>> ...
>> > Yes it's odd, but isn't the cluster just to protect us from odd situations?
>> > ;-)
>>
>> I have more odd stuff:
>> Jun 14 20:40:09 rksaph18 pacemaker-execd[7020]:  warning: prm_lockspace_ocfs2_monitor_12 process (PID 30234) timed out
>> ...
>> Jun 14 20:40:14 h18 pacemaker-execd[7020]:  crit: prm_lockspace_ocfs2_monitor_12 process (PID 30234) will not die!
>> ...
>> Jun 14 20:40:53 h18 pacemaker-controld[7026]:  warning: lrmd IPC request 525 failed: Connection timed out after 5000ms
>> Jun 14 20:40:53 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -110: Connection timed out (110)
>> ...
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_exec operation (timeout=9): -114: Connection timed out (110)
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Operation stop on prm_lockspace_ocfs2 failed: -70
>> ...
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  warning: Input I_FAIL received in state S_NOT_DC from do_lrm_rsc_op
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  notice: State transition S_NOT_DC -> S_RECOVERY
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  warning: Fast-tracking shutdown in response to errors
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Input I_TERMINATE received in state S_RECOVERY from do_recover
>> Jun 14 20:42:28 h18 pacemaker-controld[7026]:  warning: Sending IPC to lrmd disabled until pending reply received
>> Jun 14 20:42:28 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  warning: Sending IPC to lrmd disabled until pending reply received
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Stopped 2 recurring operations at shutdown (0 remaining)
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: 3 resources were active at shutdown
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from the executor
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from Corosync
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from the CIB manager
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: Could not recover from internal error
>> Jun 14 20:42:33 h18 pacemakerd[7003]:  error: pacemaker-controld[7026] exited with status 1 (Error occurred)
>> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-schedulerd
>> Jun 14 20:42:33 h18 pacemaker-schedulerd[7024]:  notice: Caught 'Terminated' signal
>> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-attrd
>> Jun 14 20:42:33 h18 pacemaker-attrd[7022]:  notice: Caught 'Terminated' signal
>> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-execd
>> Jun 14 20:42:34 h18 sbd[6856]:  warning: inquisitor_child: pcmk health check: UNHEALTHY
>> Jun 14 20:42:34 h18 sbd[6856]:  warning: inquisitor_child: Servant pcmk is outdated (age: 41877)
>> (SBD Fencing)
>>
> 
> Rolling it up from the back I guess the reaction to self‑fence in case 
> pacemaker
> is telling it doesn't know ‑ and isn't able to find out ‑ about the
> state of the resources
> is basically correct.
> 
> Seeing the issue with the fake‑age being printed ‑ possibly causing
> confusion ‑ it reminds
> me that this should be addressed. Thought we had already but obviously
> a false memory.

Hi Klaus and others!

Well, that is the current update state of SLES 15 SP3; maybe the upstream
updates have not made it into SLES yet; I don't know.

> 
> Would be interesting if pacemaker would recover the sub‑processes
> without sbd around
> and other ways of fencing ‑ that should kick in in a similar way ‑
> would need a significant
> time.
> As pacemakerd recently started to ping the sub‑daemons via ipc ‑
> instead of just listening
> for signals ‑ it would be interesting if logs we are seeing are
> already from that code.

The "code" probably is:
pacemaker-2.0.5+20201202.ba59be712-150300.4.21.1.x86_64

> 
> That what is happening with the monitor-process kicked off by execd seems to hog
> the ipc for a significant time might be an issue to look after.

I don't know the details (even support at SUSE doesn't seem to know what's
going on in the kernel), but it looks as if one "stalled" monitor process can
cause the node to be fenced.

I had been considering this extremely paranoid idea:
What if you could configure three (different) monitor operations for a
resource, and an action would be triggered
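
(The archive cuts the message off here. Purely as a hypothetical sketch of the
part that is stated - three different monitor probes for one resource combined
into a single verdict - and not an existing Pacemaker feature; the probe
commands and the majority rule below are made up:)

import subprocess

def probe(cmd, timeout=5):
    # One independent health probe; a hung probe counts as unhealthy.
    try:
        return subprocess.run(cmd, timeout=timeout).returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Three *different* ways of checking the same resource (placeholders).
checks = [["true"], ["true"], ["false"]]
healthy = sum(probe(c) for c in checks)

if healthy < 2:
    # What exactly gets triggered is lost in the truncation above;
    # a 2-out-of-3 majority is just one plausible reading.
    print("only %d/3 probes healthy: trigger the configured action" % healthy)
else:
    print("%d/3 probes healthy: resource considered fine" % healthy)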

Re: [ClusterLabs] Antw: [EXT] Re: Why not retry a monitor (pacemaker-execd) that got a segmentation fault?

2022-06-15 Thread Klaus Wenninger
On Wed, Jun 15, 2022 at 8:32 AM Ulrich Windl
 wrote:
>
> >>> Ulrich Windl wrote on 14.06.2022 at 15:53 in message <62A892F0.174 : 161 : 60728>:
>
> ...
> > Yes it's odd, but isn't the cluster just to protect us from odd situations?
> > ;-)
>
> I have more odd stuff:
> Jun 14 20:40:09 rksaph18 pacemaker-execd[7020]:  warning: 
> prm_lockspace_ocfs2_monitor_12 process (PID 30234) timed out
> ...
> Jun 14 20:40:14 h18 pacemaker-execd[7020]:  crit: 
> prm_lockspace_ocfs2_monitor_12 process (PID 30234) will not die!
> ...
> Jun 14 20:40:53 h18 pacemaker-controld[7026]:  warning: lrmd IPC request 525 
> failed: Connection timed out after 5000ms
> Jun 14 20:40:53 h18 pacemaker-controld[7026]:  error: Couldn't perform 
> lrmd_rsc_cancel operation (timeout=0): -110: Connection timed out (110)
> ...
> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Couldn't perform 
> lrmd_rsc_exec operation (timeout=9): -114: Connection timed out (110)
> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Operation stop on 
> prm_lockspace_ocfs2 failed: -70
> ...
> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  warning: Input I_FAIL received 
> in state S_NOT_DC from do_lrm_rsc_op
> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  notice: State transition 
> S_NOT_DC -> S_RECOVERY
> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  warning: Fast-tracking 
> shutdown in response to errors
> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Input I_TERMINATE 
> received in state S_RECOVERY from do_recover
> Jun 14 20:42:28 h18 pacemaker-controld[7026]:  warning: Sending IPC to lrmd 
> disabled until pending reply received
> Jun 14 20:42:28 h18 pacemaker-controld[7026]:  error: Couldn't perform 
> lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  warning: Sending IPC to lrmd 
> disabled until pending reply received
> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: Couldn't perform 
> lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Stopped 2 recurring 
> operations at shutdown (0 remaining)
> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: 3 resources were active 
> at shutdown
> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from the 
> executor
> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from 
> Corosync
> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from the 
> CIB manager
> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: Could not recover from 
> internal error
> Jun 14 20:42:33 h18 pacemakerd[7003]:  error: pacemaker-controld[7026] exited 
> with status 1 (Error occurred)
> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-schedulerd
> Jun 14 20:42:33 h18 pacemaker-schedulerd[7024]:  notice: Caught 'Terminated' 
> signal
> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-attrd
> Jun 14 20:42:33 h18 pacemaker-attrd[7022]:  notice: Caught 'Terminated' signal
> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-execd
> Jun 14 20:42:34 h18 sbd[6856]:  warning: inquisitor_child: pcmk health check: 
> UNHEALTHY
> Jun 14 20:42:34 h18 sbd[6856]:  warning: inquisitor_child: Servant pcmk is 
> outdated (age: 41877)
> (SBD Fencing)
>

Rolling it up from the back: I guess the reaction of self-fencing when
pacemaker says it doesn't know - and isn't able to find out - the state of
the resources is basically correct.

Seeing the issue with the fake age being printed - possibly causing
confusion - reminds me that this should be addressed. I thought we had
already, but that is obviously a false memory.

It would be interesting whether pacemaker would recover the sub-processes
without sbd around, and whether other ways of fencing - which should kick in
in a similar way - would need a significant amount of time.
As pacemakerd recently started to ping the sub-daemons via IPC - instead of
just listening for signals - it would also be interesting whether the logs we
are seeing are already from that code.
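
Roughly, that kind of liveness check - a parent pinging a child over IPC with
a deadline instead of only waiting for exit signals - could look like the
following minimal sketch (this is not Pacemaker's actual code; the socket,
message format and timeouts are made up for illustration):

import os
import signal
import socket
import time

def child_loop(sock):
    # A well-behaved "daemon": answer every ping promptly.
    while True:
        msg = sock.recv(16)
        if not msg:
            break
        sock.sendall(b"pong")

parent_sock, child_sock = socket.socketpair()
pid = os.fork()
if pid == 0:                         # child process
    parent_sock.close()
    child_loop(child_sock)
    os._exit(0)

child_sock.close()
parent_sock.settimeout(2.0)          # per-ping liveness deadline
for _ in range(3):                   # periodic health checks
    try:
        parent_sock.sendall(b"ping")
        parent_sock.recv(16)
        print("child answers - considered healthy")
    except socket.timeout:
        # A wedged child is caught here even though it never exited, so the
        # parent can restart it (or let a watchdog take over).
        print("child did not answer in time - escalate")
        break
    time.sleep(1)

os.kill(pid, signal.SIGTERM)
os.waitpid(pid, 0)

The point is only the mechanism: a ping left unanswered within the deadline is
treated as a failure even though the child process still exists.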

That the monitor process kicked off by execd seems to hog the IPC for a
significant time might be an issue to look after - although the new
implementation in pacemakerd might kick in and recover execd, for what that
is worth in the end.

This all seems to be kicked off by an RA that might not be robust enough, or
by the node being in a state that just doesn't allow a better answer.
I guess the timeouts and retries needed to give a timely answer about the
state of a resource should be taken care of inside the RA.
And I guess the last two points are something totally different from the fork
segfaulting, although that might just as well be a sign that something is
really wrong with the node.
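
A minimal sketch of that last point - keeping the per-attempt timeout and a
bounded number of retries inside the agent, so the monitor always returns some
answer well before pacemaker-execd's own operation timeout - purely
illustrative, not an actual OCF agent (the probe command is a placeholder):

import subprocess

# Standard OCF exit codes (these values are defined by the OCF spec).
OCF_SUCCESS, OCF_ERR_GENERIC, OCF_NOT_RUNNING = 0, 1, 7

def monitor(check_cmd, attempts=3, per_attempt_timeout=5):
    # attempts * per_attempt_timeout must stay comfortably below the
    # monitor operation timeout configured in the CIB.
    for _ in range(attempts):
        try:
            result = subprocess.run(check_cmd, timeout=per_attempt_timeout)
        except subprocess.TimeoutExpired:
            continue                  # probe hung; retry instead of hanging
        return OCF_SUCCESS if result.returncode == 0 else OCF_NOT_RUNNING
    return OCF_ERR_GENERIC            # never got an answer: report a failure

if __name__ == "__main__":
    # Placeholder probe; a real agent would check its own service here.
    raise SystemExit(monitor(["true"]))

Whether to retry internally or fail fast obviously depends on the resource,
but either way the agent - not execd - is the place that knows how long a
probe may reasonably take.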

Klaus

> Regards,
> Ulrich
>

[ClusterLabs] Antw: [EXT] Re: Why not retry a monitor (pacemaker-execd) that got a segmentation fault?

2022-06-15 Thread Ulrich Windl
>>> Ulrich Windl wrote on 14.06.2022 at 15:53 in message <62A892F0.174 : 161 : 60728>:

...
> Yes it's odd, but isn't the cluster just to protect us from odd situations? 
> ;-)

I have more odd stuff:
Jun 14 20:40:09 rksaph18 pacemaker-execd[7020]:  warning: 
prm_lockspace_ocfs2_monitor_12 process (PID 30234) timed out
...
Jun 14 20:40:14 h18 pacemaker-execd[7020]:  crit: 
prm_lockspace_ocfs2_monitor_12 process (PID 30234) will not die!
...
Jun 14 20:40:53 h18 pacemaker-controld[7026]:  warning: lrmd IPC request 525 
failed: Connection timed out after 5000ms
Jun 14 20:40:53 h18 pacemaker-controld[7026]:  error: Couldn't perform 
lrmd_rsc_cancel operation (timeout=0): -110: Connection timed out (110)
...
Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Couldn't perform 
lrmd_rsc_exec operation (timeout=9): -114: Connection timed out (110)
Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Operation stop on 
prm_lockspace_ocfs2 failed: -70
...
Jun 14 20:42:23 h18 pacemaker-controld[7026]:  warning: Input I_FAIL received 
in state S_NOT_DC from do_lrm_rsc_op
Jun 14 20:42:23 h18 pacemaker-controld[7026]:  notice: State transition 
S_NOT_DC -> S_RECOVERY
Jun 14 20:42:23 h18 pacemaker-controld[7026]:  warning: Fast-tracking shutdown 
in response to errors
Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Input I_TERMINATE 
received in state S_RECOVERY from do_recover
Jun 14 20:42:28 h18 pacemaker-controld[7026]:  warning: Sending IPC to lrmd 
disabled until pending reply received
Jun 14 20:42:28 h18 pacemaker-controld[7026]:  error: Couldn't perform 
lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
Jun 14 20:42:33 h18 pacemaker-controld[7026]:  warning: Sending IPC to lrmd 
disabled until pending reply received
Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: Couldn't perform 
lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Stopped 2 recurring 
operations at shutdown (0 remaining)
Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: 3 resources were active 
at shutdown
Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from the 
executor
Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from 
Corosync
Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from the 
CIB manager
Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: Could not recover from 
internal error
Jun 14 20:42:33 h18 pacemakerd[7003]:  error: pacemaker-controld[7026] exited 
with status 1 (Error occurred)
Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-schedulerd
Jun 14 20:42:33 h18 pacemaker-schedulerd[7024]:  notice: Caught 'Terminated' 
signal
Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-attrd
Jun 14 20:42:33 h18 pacemaker-attrd[7022]:  notice: Caught 'Terminated' signal
Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-execd
Jun 14 20:42:34 h18 sbd[6856]:  warning: inquisitor_child: pcmk health check: 
UNHEALTHY
Jun 14 20:42:34 h18 sbd[6856]:  warning: inquisitor_child: Servant pcmk is 
outdated (age: 41877)
(SBD Fencing)

Regards,
Ulrich



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/