Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Why not retry a monitor (pacemaker-execd) that got a segmentation fault?

2022-06-15 Thread Klaus Wenninger
On Wed, Jun 15, 2022 at 10:33 AM Ulrich Windl wrote:
>
> >>> Klaus Wenninger wrote on 15.06.2022 at 10:00 in message:
> > On Wed, Jun 15, 2022 at 8:32 AM Ulrich Windl wrote:
> >>
> >> >>> Ulrich Windl wrote on 14.06.2022 at 15:53 in message <62A892F0.174 : 161 : 60728>:
> >>
> >> ...
> >> > Yes it's odd, but isn't the cluster just to protect us from odd situations?
> >> > ;-)
> >>
> >> I have more odd stuff:
> >> Jun 14 20:40:09 rksaph18 pacemaker-execd[7020]:  warning: prm_lockspace_ocfs2_monitor_12 process (PID 30234) timed out
> >> ...
> >> Jun 14 20:40:14 h18 pacemaker-execd[7020]:  crit: prm_lockspace_ocfs2_monitor_12 process (PID 30234) will not die!
> >> ...
> >> Jun 14 20:40:53 h18 pacemaker-controld[7026]:  warning: lrmd IPC request 525 failed: Connection timed out after 5000ms
> >> Jun 14 20:40:53 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -110: Connection timed out (110)
> >> ...
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_exec operation (timeout=9): -114: Connection timed out (110)
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Operation stop on prm_lockspace_ocfs2 failed: -70
> >> ...
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  warning: Input I_FAIL received in state S_NOT_DC from do_lrm_rsc_op
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  notice: State transition S_NOT_DC -> S_RECOVERY
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  warning: Fast-tracking shutdown in response to errors
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Input I_TERMINATE received in state S_RECOVERY from do_recover
> >> Jun 14 20:42:28 h18 pacemaker-controld[7026]:  warning: Sending IPC to lrmd disabled until pending reply received
> >> Jun 14 20:42:28 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  warning: Sending IPC to lrmd disabled until pending reply received
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Stopped 2 recurring operations at shutdown (0 remaining)
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: 3 resources were active at shutdown
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from the executor
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from Corosync
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from the CIB manager
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: Could not recover from internal error
> >> Jun 14 20:42:33 h18 pacemakerd[7003]:  error: pacemaker-controld[7026] exited with status 1 (Error occurred)
> >> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-schedulerd
> >> Jun 14 20:42:33 h18 pacemaker-schedulerd[7024]:  notice: Caught 'Terminated' signal
> >> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-attrd
> >> Jun 14 20:42:33 h18 pacemaker-attrd[7022]:  notice: Caught 'Terminated' signal
> >> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-execd
> >> Jun 14 20:42:34 h18 sbd[6856]:  warning: inquisitor_child: pcmk health check: UNHEALTHY
> >> Jun 14 20:42:34 h18 sbd[6856]:  warning: inquisitor_child: Servant pcmk is outdated (age: 41877)
> >> (SBD Fencing)
> >>
> >
> > Rolling it up from the back, I guess the reaction to self-fence in case
> > pacemaker is telling it doesn't know - and isn't able to find out - about
> > the state of the resources is basically correct.
> >
> > Seeing the issue with the fake-age being printed - possibly causing
> > confusion - reminds me that this should be addressed. Thought we had
> > already, but obviously a false memory.
>
> Hi Klaus and others!
>
> Well that is the current update state of SLES15 SP3; maybe upstream updates
> did not make it into SLES yet; I don't know.
>
> >
> > It would be interesting whether pacemaker would recover the sub-processes
> > without sbd around, and whether other ways of fencing - which should kick
> > in in a similar way - would need a significant time.
> > As pacemakerd recently started to ping the sub-daemons via IPC - instead
> > of just listening for signals - it would be interesting whether the logs
> > we are seeing are already from that code.
>
> The "code" probably is:
> pacemaker-2.0.5+20201202.ba59be712-150300.4.21.1.x86_64
>
> >
> > That the monitor process kicked off by execd seems to hog the IPC for a
> > significant time might be an issue to look into.
>
> I don't know the details (even support at SUSE doesn't know what's going on
> in the kernel, it seems), but it looks as if one "stalled" monitor process
> can cause the node to be fenced.

[ClusterLabs] Antw: Re: Antw: [EXT] Re: Why not retry a monitor (pacemaker-execd) that got a segmentation fault?

2022-06-15 Thread Ulrich Windl
>>> Klaus Wenninger wrote on 15.06.2022 at 10:00 in message:
> On Wed, Jun 15, 2022 at 8:32 AM Ulrich Windl wrote:
>>
>> >>> Ulrich Windl wrote on 14.06.2022 at 15:53 in message <62A892F0.174 : 161 : 60728>:
>>
>> ...
>> > Yes it's odd, but isn't the cluster just to protect us from odd situations?
>> > ;-)
>>
>> I have more odd stuff:
>> Jun 14 20:40:09 rksaph18 pacemaker-execd[7020]:  warning: prm_lockspace_ocfs2_monitor_12 process (PID 30234) timed out
>> ...
>> Jun 14 20:40:14 h18 pacemaker-execd[7020]:  crit: prm_lockspace_ocfs2_monitor_12 process (PID 30234) will not die!
>> ...
>> Jun 14 20:40:53 h18 pacemaker-controld[7026]:  warning: lrmd IPC request 525 failed: Connection timed out after 5000ms
>> Jun 14 20:40:53 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -110: Connection timed out (110)
>> ...
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_exec operation (timeout=9): -114: Connection timed out (110)
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Operation stop on prm_lockspace_ocfs2 failed: -70
>> ...
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  warning: Input I_FAIL received in state S_NOT_DC from do_lrm_rsc_op
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  notice: State transition S_NOT_DC -> S_RECOVERY
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  warning: Fast-tracking shutdown in response to errors
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Input I_TERMINATE received in state S_RECOVERY from do_recover
>> Jun 14 20:42:28 h18 pacemaker-controld[7026]:  warning: Sending IPC to lrmd disabled until pending reply received
>> Jun 14 20:42:28 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  warning: Sending IPC to lrmd disabled until pending reply received
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Stopped 2 recurring operations at shutdown (0 remaining)
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: 3 resources were active at shutdown
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from the executor
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from Corosync
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from the CIB manager
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: Could not recover from internal error
>> Jun 14 20:42:33 h18 pacemakerd[7003]:  error: pacemaker-controld[7026] exited with status 1 (Error occurred)
>> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-schedulerd
>> Jun 14 20:42:33 h18 pacemaker-schedulerd[7024]:  notice: Caught 'Terminated' signal
>> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-attrd
>> Jun 14 20:42:33 h18 pacemaker-attrd[7022]:  notice: Caught 'Terminated' signal
>> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-execd
>> Jun 14 20:42:34 h18 sbd[6856]:  warning: inquisitor_child: pcmk health check: UNHEALTHY
>> Jun 14 20:42:34 h18 sbd[6856]:  warning: inquisitor_child: Servant pcmk is outdated (age: 41877)
>> (SBD Fencing)
>>
> 
> Rolling it up from the back, I guess the reaction to self-fence in case
> pacemaker is telling it doesn't know - and isn't able to find out - about
> the state of the resources is basically correct.
>
> Seeing the issue with the fake-age being printed - possibly causing
> confusion - reminds me that this should be addressed. Thought we had
> already, but obviously a false memory.

Hi Klaus and others!

Well that is the current update state of SLES15 SP3; maybe upstream updates
did not make it into SLES yet; I don't know.

> 
> It would be interesting whether pacemaker would recover the sub-processes
> without sbd around, and whether other ways of fencing - which should kick in
> in a similar way - would need a significant time.
> As pacemakerd recently started to ping the sub-daemons via IPC - instead of
> just listening for signals - it would be interesting whether the logs we are
> seeing are already from that code.

The "code" probably is:
pacemaker-2.0.5+20201202.ba59be712-150300.4.21.1.x86_64
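
As background to the liveness-check discussion above: the general pattern is
that the parent daemon actively pings each sub-daemon over its IPC connection
and escalates if no reply arrives in time, instead of only reacting to exit
signals. A minimal sketch of that pattern follows; it is not Pacemaker's
actual code, and all names, timeouts, and the toy "ping/pong" protocol are
invented purely for illustration.

    /*
     * Sketch only: parent checks a child's liveness over a socketpair
     * instead of waiting for SIGCHLD. Not Pacemaker source.
     */
    #include <poll.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int sv[2];

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
            perror("socketpair");
            return 1;
        }

        pid_t child = fork();

        if (child == 0) {               /* "sub-daemon": answer liveness pings */
            char buf[8];

            close(sv[0]);
            while (read(sv[1], buf, sizeof(buf)) > 0) {
                (void) write(sv[1], "pong", 4);
            }
            _exit(0);
        }

        close(sv[1]);
        for (int i = 0; i < 3; i++) {   /* "parent daemon": periodic health check */
            struct pollfd pfd = { .fd = sv[0], .events = POLLIN };
            char buf[8] = "";

            (void) write(sv[0], "ping", 4);
            if (poll(&pfd, 1, 2000) <= 0) {
                /* No answer within 2s: the child still exists as a process but
                 * is not responsive on IPC, so escalate instead of waiting. */
                fprintf(stderr, "child %d did not answer, escalating\n", (int) child);
                kill(child, SIGTERM);
                break;
            }
            (void) read(sv[0], buf, sizeof(buf) - 1);
            printf("child %d answered: %s\n", (int) child, buf);
            sleep(1);
        }
        close(sv[0]);
        waitpid(child, NULL, 0);
        return 0;
    }

The point of such a check is exactly the case seen in the logs: a daemon that
is still running but no longer answering on IPC looks healthy to a purely
signal-based supervisor.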

> 
> That the monitor process kicked off by execd seems to hog the IPC for a
> significant time might be an issue to look into.

I don't know the details (even support at SUSE doesn't know what's going on in
the kernel, it seems),
but it looks as if one "stalled" monitor process can cause the node to be
fenced.
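
To illustrate how messages like the "timed out" and "will not die!" lines
above can come about, here is a rough sketch of a supervisor that forks a
monitor, times it out, and escalates SIGTERM to SIGKILL. This is not
pacemaker-execd source; the timeouts and wording are made up. A child that is
stuck in uninterruptible sleep (D state) inside the kernel does not react even
to SIGKILL until the kernel lets it return to user space, which is the one
situation a supervisor cannot resolve on its own.

    /*
     * Sketch only: timeout/kill escalation for a forked "monitor" child.
     * Not pacemaker-execd source.
     */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();

        if (pid == 0) {
            /* Stand-in for a long-running monitor action. */
            execlp("sleep", "sleep", "60", (char *) NULL);
            _exit(127);
        }

        sleep(5);   /* pretend the configured operation timeout has expired */
        fprintf(stderr, "warning: monitor process (PID %d) timed out\n", (int) pid);
        kill(pid, SIGTERM);

        for (int tries = 0; tries < 5; tries++) {
            sleep(1);
            if (waitpid(pid, NULL, WNOHANG) == pid) {
                fprintf(stderr, "notice: monitor process (PID %d) terminated\n", (int) pid);
                return 0;
            }
            kill(pid, SIGKILL);         /* escalate if SIGTERM was not enough */
        }

        /* Still running after repeated SIGKILL: typically blocked in the kernel. */
        fprintf(stderr, "crit: monitor process (PID %d) will not die!\n", (int) pid);
        return 1;
    }

In this toy example the child dies on SIGTERM; the final branch is only
reached when the child is blocked inside the kernel, e.g. on a hung cluster
filesystem, which matches the "stalled" monitor scenario described above.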

I had been considering this extreme paranoid idea:
What if you could configure three (different) monitor operations for a
resource, and an action will be triggered