On Wed, Jun 15, 2022 at 8:32 AM Ulrich Windl
<ulrich.wi...@rz.uni-regensburg.de> wrote:
>
> >>> Ulrich Windl wrote on 14.06.2022 at 15:53 in message <62A892F0.174 : 161 : 60728>:
> > ...
> > Yes it's odd, but isn't the cluster just there to protect us from odd situations?
> > ;-)
>
> I have more odd stuff:
> Jun 14 20:40:09 rksaph18 pacemaker-execd[7020]: warning: prm_lockspace_ocfs2_monitor_120000 process (PID 30234) timed out
> ...
> Jun 14 20:40:14 h18 pacemaker-execd[7020]: crit: prm_lockspace_ocfs2_monitor_120000 process (PID 30234) will not die!
> ...
> Jun 14 20:40:53 h18 pacemaker-controld[7026]: warning: lrmd IPC request 525 failed: Connection timed out after 5000ms
> Jun 14 20:40:53 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -110: Connection timed out (110)
> ...
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_exec operation (timeout=90000): -114: Connection timed out (110)
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Operation stop on prm_lockspace_ocfs2 failed: -70
> ...
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: warning: Input I_FAIL received in state S_NOT_DC from do_lrm_rsc_op
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: notice: State transition S_NOT_DC -> S_RECOVERY
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: warning: Fast-tracking shutdown in response to errors
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Input I_TERMINATE received in state S_RECOVERY from do_recover
> Jun 14 20:42:28 h18 pacemaker-controld[7026]: warning: Sending IPC to lrmd disabled until pending reply received
> Jun 14 20:42:28 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: warning: Sending IPC to lrmd disabled until pending reply received
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Stopped 2 recurring operations at shutdown (0 remaining)
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: 3 resources were active at shutdown
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from the executor
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from Corosync
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from the CIB manager
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: Could not recover from internal error
> Jun 14 20:42:33 h18 pacemakerd[7003]: error: pacemaker-controld[7026] exited with status 1 (Error occurred)
> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-schedulerd
> Jun 14 20:42:33 h18 pacemaker-schedulerd[7024]: notice: Caught 'Terminated' signal
> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-attrd
> Jun 14 20:42:33 h18 pacemaker-attrd[7022]: notice: Caught 'Terminated' signal
> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-execd
> Jun 14 20:42:34 h18 sbd[6856]: warning: inquisitor_child: pcmk health check: UNHEALTHY
> Jun 14 20:42:34 h18 sbd[6856]: warning: inquisitor_child: Servant pcmk is outdated (age: 41877)
> (SBD Fencing)

Rolling it up from the back: I'd say the reaction to self-fence when pacemaker reports that it doesn't know - and isn't able to find out - the state of the resources is basically correct.

Seeing the issue with the fake age being printed (age: 41877) - possibly causing confusion - reminds me that this should be addressed. I thought we already had, but that is obviously a false memory.

It would be interesting to know whether pacemaker would recover the sub-processes without sbd around, and whether other ways of fencing - which should kick in in a similar way - would need a significant time.

Since pacemakerd recently started to ping the sub-daemons via IPC - instead of just listening for signals - it would be interesting to know whether the logs we are seeing already come from that code.

That the monitor process kicked off by execd seems to hog the IPC for a significant time might be an issue to look into - although the new implementation in pacemakerd might kick in and recover execd, for what that is worth in the end.

This all seems to be kicked off by an RA that might not be robust enough, or by a node in a state that just doesn't allow a better answer. (A monitor process that "will not die" even after KILL is usually stuck in uninterruptible I/O, e.g. on a hung filesystem, and nothing in user space can reap it.) I guess the timeouts and retries required to give a timely answer about the state of a resource should be taken care of inside the RA - roughly along the lines of the sketch below.
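Purely as an illustration - a minimal, untested sketch of a monitor action that bounds and retries its probe internally. This is not the agent from the logs above; the probe command, the "directory" parameter and the 10s/3-tries numbers are all made up:

    #!/bin/sh
    # Hypothetical OCF-style monitor - illustration only.
    # ocf-shellfuncs provides $OCF_SUCCESS / $OCF_ERR_GENERIC.
    : "${OCF_ROOT:=/usr/lib/ocf}"
    . "${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs"

    probe_once() {
        # Bound a single probe with coreutils' timeout so a hang
        # (e.g. on a stuck filesystem) turns into a failure instead
        # of blocking the whole monitor. Caveat: a probe stuck in
        # uninterruptible I/O won't die even on KILL - then we are
        # back to exactly the situation in the logs above.
        timeout -s KILL 10 stat "${OCF_RESKEY_directory}" >/dev/null 2>&1
    }

    my_monitor() {
        tries=3
        while [ "$tries" -gt 0 ]; do
            probe_once && return "$OCF_SUCCESS"
            tries=$((tries - 1))
            sleep 2
        done
        # Definite failure, returned well within the cluster-side
        # monitor timeout, so execd never has to try to KILL us.
        return "$OCF_ERR_GENERIC"
    }

Worst case that is 3 x (10 + 2)s = 36s until a definite answer - well within any sane monitor timeout. Of course none of this helps if the node itself is wedged.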
I guess the last two are at least something entirely different from fork segfaulting - although that might just as well be a sign that there is something really wrong with the node.

Klaus

> Regards,
> Ulrich

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/