On Wed, Jun 15, 2022 at 8:32 AM Ulrich Windl
<ulrich.wi...@rz.uni-regensburg.de> wrote:
>
> >>> Ulrich Windl wrote on 14.06.2022 at 15:53 in message <62A892F0.174 : 161 : 60728>:
> > ...
> > Yes it's odd, but isn't the cluster just there to protect us from odd situations?
> > ;-)
>
> I have more odd stuff:
> Jun 14 20:40:09 rksaph18 pacemaker-execd[7020]: warning: prm_lockspace_ocfs2_monitor_120000 process (PID 30234) timed out
> ...
> Jun 14 20:40:14 h18 pacemaker-execd[7020]: crit: prm_lockspace_ocfs2_monitor_120000 process (PID 30234) will not die!
> ...
> Jun 14 20:40:53 h18 pacemaker-controld[7026]: warning: lrmd IPC request 525 failed: Connection timed out after 5000ms
> Jun 14 20:40:53 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -110: Connection timed out (110)
> ...
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_exec operation (timeout=90000): -114: Connection timed out (110)
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Operation stop on prm_lockspace_ocfs2 failed: -70
> ...
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: warning: Input I_FAIL received in state S_NOT_DC from do_lrm_rsc_op
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: notice: State transition S_NOT_DC -> S_RECOVERY
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: warning: Fast-tracking shutdown in response to errors
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Input I_TERMINATE received in state S_RECOVERY from do_recover
> Jun 14 20:42:28 h18 pacemaker-controld[7026]: warning: Sending IPC to lrmd disabled until pending reply received
> Jun 14 20:42:28 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: warning: Sending IPC to lrmd disabled until pending reply received
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Stopped 2 recurring operations at shutdown (0 remaining)
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: 3 resources were active at shutdown
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from the executor
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from Corosync
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from the CIB manager
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: Could not recover from internal error
> Jun 14 20:42:33 h18 pacemakerd[7003]: error: pacemaker-controld[7026] exited with status 1 (Error occurred)
> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-schedulerd
> Jun 14 20:42:33 h18 pacemaker-schedulerd[7024]: notice: Caught 'Terminated' signal
> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-attrd
> Jun 14 20:42:33 h18 pacemaker-attrd[7022]: notice: Caught 'Terminated' signal
> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-execd
> Jun 14 20:42:34 h18 sbd[6856]: warning: inquisitor_child: pcmk health check: UNHEALTHY
> Jun 14 20:42:34 h18 sbd[6856]: warning: inquisitor_child: Servant pcmk is outdated (age: 41877)
> (SBD Fencing)

Rolling it up from the back: I'd say the reaction to self-fence when pacemaker reports that it doesn't know - and isn't able to find out - the state of the resources is basically correct.

Seeing the issue with the fake age being printed (age: 41877) - possibly causing confusion - reminds me that this should be addressed. I thought we already had, but that is obviously a false memory.

It would be interesting to know whether pacemaker would recover the sub-processes without sbd around, and whether other ways of fencing - which should kick in in a similar way - would need a significant time.

Since pacemakerd recently started to ping the sub-daemons via IPC - instead of just listening for signals - it would be interesting to know whether the logs we are seeing already come from that code.

That the monitor process kicked off by execd seems to hog the IPC for a significant time might be an issue to look into - although the new implementation in pacemakerd might kick in and recover execd, for what that is worth in the end.

This all seems to be kicked off by an RA that might not be robust enough, or by a node in a state that just doesn't allow a better answer. (A monitor process that "will not die" even after KILL is usually stuck in uninterruptible I/O, e.g. on a hung filesystem, and nothing in user space can reap it.) I guess the timeouts and retries required to give a timely answer about the state of a resource should be taken care of inside the RA - roughly along the lines of the sketch below.
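Purely as an illustration - a minimal, untested sketch of a monitor action that bounds and retries its probe internally. This is not the agent from the logs above; the probe command, the "directory" parameter and the 10s/3-tries numbers are all made up:

    #!/bin/sh
    # Hypothetical OCF-style monitor - illustration only.
    # ocf-shellfuncs provides $OCF_SUCCESS / $OCF_ERR_GENERIC.
    : "${OCF_ROOT:=/usr/lib/ocf}"
    . "${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs"

    probe_once() {
        # Bound a single probe with coreutils' timeout so a hang
        # (e.g. on a stuck filesystem) turns into a failure instead
        # of blocking the whole monitor. Caveat: a probe stuck in
        # uninterruptible I/O won't die even on KILL - then we are
        # back to exactly the situation in the logs above.
        timeout -s KILL 10 stat "${OCF_RESKEY_directory}" >/dev/null 2>&1
    }

    my_monitor() {
        tries=3
        while [ "$tries" -gt 0 ]; do
            probe_once && return "$OCF_SUCCESS"
            tries=$((tries - 1))
            sleep 2
        done
        # Definite failure, returned well within the cluster-side
        # monitor timeout, so execd never has to try to KILL us.
        return "$OCF_ERR_GENERIC"
    }

Worst case that is 3 x (10 + 2)s = 36s until a definite answer - well within any sane monitor timeout. Of course none of this helps if the node itself is wedged.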
I guess the last two are at least something entirely different from fork segfaulting - although that might just as well be a sign that there is something really wrong with the node.

Klaus

> Regards,
> Ulrich

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/