Re: [ClusterLabs] Antw: [EXT] Re: Why not retry a monitor (pacemaker‑execd) that got a segmentation fault?
On Wed, Jun 15, 2022 at 8:32 AM Ulrich Windl wrote:
>
> >>> Ulrich Windl wrote on 14.06.2022 at 15:53 in message <62A892F0.174 : 161 : 60728>:
> > ...
> > Yes it's odd, but isn't the cluster there just to protect us from odd situations?
> > ;-)
>
> I have more odd stuff:
> Jun 14 20:40:09 rksaph18 pacemaker-execd[7020]: warning: prm_lockspace_ocfs2_monitor_12 process (PID 30234) timed out
> ...
> Jun 14 20:40:14 h18 pacemaker-execd[7020]: crit: prm_lockspace_ocfs2_monitor_12 process (PID 30234) will not die!
> ...
> Jun 14 20:40:53 h18 pacemaker-controld[7026]: warning: lrmd IPC request 525 failed: Connection timed out after 5000ms
> Jun 14 20:40:53 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -110: Connection timed out (110)
> ...
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_exec operation (timeout=9): -114: Connection timed out (110)
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Operation stop on prm_lockspace_ocfs2 failed: -70
> ...
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: warning: Input I_FAIL received in state S_NOT_DC from do_lrm_rsc_op
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: notice: State transition S_NOT_DC -> S_RECOVERY
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: warning: Fast-tracking shutdown in response to errors
> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Input I_TERMINATE received in state S_RECOVERY from do_recover
> Jun 14 20:42:28 h18 pacemaker-controld[7026]: warning: Sending IPC to lrmd disabled until pending reply received
> Jun 14 20:42:28 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: warning: Sending IPC to lrmd disabled until pending reply received
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Stopped 2 recurring operations at shutdown (0 remaining)
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: 3 resources were active at shutdown
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from the executor
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from Corosync
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from the CIB manager
> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: Could not recover from internal error
> Jun 14 20:42:33 h18 pacemakerd[7003]: error: pacemaker-controld[7026] exited with status 1 (Error occurred)
> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-schedulerd
> Jun 14 20:42:33 h18 pacemaker-schedulerd[7024]: notice: Caught 'Terminated' signal
> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-attrd
> Jun 14 20:42:33 h18 pacemaker-attrd[7022]: notice: Caught 'Terminated' signal
> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-execd
> Jun 14 20:42:34 h18 sbd[6856]: warning: inquisitor_child: pcmk health check: UNHEALTHY
> Jun 14 20:42:34 h18 sbd[6856]: warning: inquisitor_child: Servant pcmk is outdated (age: 41877)
> (SBD Fencing)

Rolling it up from the back: I guess the reaction - self-fencing when pacemaker says it doesn't know, and isn't able to find out, the state of the resources - is basically correct.
Seeing the bogus age being printed - possibly causing confusion - reminds me that this should be addressed. I thought we already had, but that is obviously a false memory.

It would be interesting whether pacemaker would recover the sub-processes without sbd around; other ways of fencing - which should kick in in a similar way - would need significantly more time. Since pacemakerd recently started to ping the sub-daemons via IPC - instead of just listening for signals - it would also be interesting whether the logs we are seeing already come from that code. That the monitor process kicked off by execd seems to hog the IPC for a significant time might be an issue to look into, although the new implementation in pacemakerd might kick in and recover execd - for what that is worth in the end.

This all seems to be triggered by an RA that might not be robust enough, or by a node that is in a state that just doesn't allow a better answer. The timeouts and retries needed to give a timely answer about the state of a resource should be taken care of inside the RA.

The last two points are at least something entirely different from fork() segfaulting, although that too may be a sign that something is really wrong with the node.

Klaus

> Regards,
> Ulrich
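Klaus's point about handling timeouts and retries inside the RA can be made concrete. The following is a minimal sketch, not taken from any shipped agent: the domain name, probe command, retry count and timeouts are illustrative assumptions, and it assumes bash, the coreutils "timeout" utility and libvirt's "virsh". It shows a monitor action that bounds its own probe and retries once before reporting a hard failure to pacemaker-execd.

#!/bin/bash
# Illustrative OCF-style monitor fragment (hypothetical, not a real agent).

OCF_SUCCESS=0
OCF_ERR_GENERIC=1

DOMAIN_NAME="${DOMAIN_NAME:-v04}"   # placeholder, not a real agent parameter

probe_once() {
    # Bound the probe itself so a hung hypervisor call cannot consume the
    # whole operation timeout that pacemaker enforces on the monitor.
    timeout 10s virsh domstate "$DOMAIN_NAME" 2>/dev/null | grep -q '^running$'
}

ra_monitor() {
    local attempt
    for attempt in 1 2; do
        probe_once && return $OCF_SUCCESS
        # Transient glitch? Retry once inside the RA before the cluster
        # escalates to a full resource recovery.
        sleep 2
    done
    return $OCF_ERR_GENERIC
}

ra_monitor

Whether such internal retries are a good idea depends on the agent; in any case the retries plus probe time must still fit within the operation timeout configured in the CIB.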
[ClusterLabs] Antw: [EXT] Re: Why not retry a monitor (pacemaker‑execd) that got a segmentation fault?
>>> Ulrich Windl wrote on 14.06.2022 at 15:53 in message <62A892F0.174 : 161 : 60728>:
...
> Yes it's odd, but isn't the cluster there just to protect us from odd situations?
> ;-)

I have more odd stuff:

Jun 14 20:40:09 rksaph18 pacemaker-execd[7020]: warning: prm_lockspace_ocfs2_monitor_12 process (PID 30234) timed out
...
Jun 14 20:40:14 h18 pacemaker-execd[7020]: crit: prm_lockspace_ocfs2_monitor_12 process (PID 30234) will not die!
...
Jun 14 20:40:53 h18 pacemaker-controld[7026]: warning: lrmd IPC request 525 failed: Connection timed out after 5000ms
Jun 14 20:40:53 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -110: Connection timed out (110)
...
Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_exec operation (timeout=9): -114: Connection timed out (110)
Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Operation stop on prm_lockspace_ocfs2 failed: -70
...
Jun 14 20:42:23 h18 pacemaker-controld[7026]: warning: Input I_FAIL received in state S_NOT_DC from do_lrm_rsc_op
Jun 14 20:42:23 h18 pacemaker-controld[7026]: notice: State transition S_NOT_DC -> S_RECOVERY
Jun 14 20:42:23 h18 pacemaker-controld[7026]: warning: Fast-tracking shutdown in response to errors
Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Input I_TERMINATE received in state S_RECOVERY from do_recover
Jun 14 20:42:28 h18 pacemaker-controld[7026]: warning: Sending IPC to lrmd disabled until pending reply received
Jun 14 20:42:28 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
Jun 14 20:42:33 h18 pacemaker-controld[7026]: warning: Sending IPC to lrmd disabled until pending reply received
Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Stopped 2 recurring operations at shutdown (0 remaining)
Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: 3 resources were active at shutdown
Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from the executor
Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from Corosync
Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from the CIB manager
Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: Could not recover from internal error
Jun 14 20:42:33 h18 pacemakerd[7003]: error: pacemaker-controld[7026] exited with status 1 (Error occurred)
Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-schedulerd
Jun 14 20:42:33 h18 pacemaker-schedulerd[7024]: notice: Caught 'Terminated' signal
Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-attrd
Jun 14 20:42:33 h18 pacemaker-attrd[7022]: notice: Caught 'Terminated' signal
Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-execd
Jun 14 20:42:34 h18 sbd[6856]: warning: inquisitor_child: pcmk health check: UNHEALTHY
Jun 14 20:42:34 h18 sbd[6856]: warning: inquisitor_child: Servant pcmk is outdated (age: 41877)
(SBD Fencing)

Regards,
Ulrich
Re: [ClusterLabs] Antw: [EXT] Re: Why not retry a monitor (pacemaker‑execd) that got a segmentation fault?
On Tue, 2022-06-14 at 15:53 +0200, Ulrich Windl wrote:
> >>> Ken Gaillot wrote on 14.06.2022 at 15:49 in message:
> > On Tue, 2022-06-14 at 14:36 +0200, Ulrich Windl wrote:
> > > Hi!
> > >
> > > I had a case where a VirtualDomain monitor operation ended in a core
> > > dump (actually it was pacemaker-execd, but it counted as a "monitor"
> > > operation), and the cluster decided to restart the VM. Wouldn't it be
> > > worth retrying the monitor operation first?
> >
> > It counts like any other monitor failure.
> >
> > > Chances are that a re-tried monitor operation returns a better status
> > > than segmentation fault.
> > > Or does the logic just ignore processes dying on signals?
> > >
> > > 20201202.ba59be712-150300.4.21.1.x86_64 (SLES15 SP3)
> > >
> > > Jun 14 14:09:16 h19 systemd-coredump[28788]: Process 28786 (pacemaker-execd) of user 0 dumped core.
> > > Jun 14 14:09:16 h19 pacemaker-execd[7440]: warning: prm_xen_v04_monitor_60[28786] terminated with signal: Segmentation fault
> >
> > This means that the child process forked to execute the resource agent
> > segfaulted, which is odd.
>
> Yes it's odd, but isn't the cluster there just to protect us from odd
> situations? ;-)
>
> > Is the agent a compiled program? If not, it's possible the tiny amount
> > of pacemaker code that executes the agent is what segfaulted. Do you
> > have the actual core, and can you do a backtrace?
>
> Believe me, it's just "odd":
>
> Stack trace of thread 28786:
> #0  0x7f85589e0bcf __libc_fork (/lib64/libc-2.31.so + 0xe1bcf)
> #1  0x7f855949b85d n/a (/usr/lib64/libcrmservice.so.28.2.2 + 0x785d)
> #2  0x7f855949a5e3 n/a (/usr/lib64/libcrmservice.so.28.2.2 + 0x65e3)
> #3  0x7f8558d470ed n/a (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x530ed)
> #4  0x7f8558d46624 g_main_context_dispatch (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x52624)
> #5  0x7f8558d469c0 n/a (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x529c0)
> #6  0x7f8558d46c82 g_main_loop_run (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x52c82)
> #7  0x558c0765930b n/a (/usr/lib/pacemaker/pacemaker-execd + 0x330b)
> #8  0x7f85589342bd __libc_start_main (/lib64/libc-2.31.so + 0x352bd)
> #9  0x558c076593da n/a (/usr/lib/pacemaker/pacemaker-execd + 0x33da)
>
> Rumors say it's Dell's dcdbas module combined with Xen and an AMD CPU plus
> some software bugs ;-)

From the above, it looks like the C library's fork() is what's segfaulting, so
yeah, probably not much to do about it ...

> Regards,
> Ulrich
>
> > > Jun 14 14:09:16 h19 pacemaker-controld[7443]: error: Result of monitor operation for prm_xen_v04 on h19: Error
> > > Jun 14 14:09:16 h19 pacemaker-controld[7443]: notice: Transition 9 action 107 (prm_xen_v04_monitor_60 on h19): expected 'ok' but got 'error'
> > > ...
> > > Jun 14 14:09:16 h19 pacemaker-schedulerd[7442]: notice:  * Recover    prm_xen_v04    ( h19 )
> > >
> > > Regards,
> > > ulrich
> >
> > --
> > Ken Gaillot
--
Ken Gaillot
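Since a segfaulting monitor child is treated like any other monitor failure, the settings that govern the cluster's reaction are the usual per-resource failure-handling knobs rather than anything signal-specific. A hedged crm shell sketch follows; the resource name is reused from the thread, but the config path, thresholds and timeouts are purely illustrative assumptions, not recommendations. With settings like these, a single monitor failure only restarts the VM in place, and the resource is moved away only after several unexpired failures.

# Hypothetical crmsh configuration (SLES/crmsh syntax); values are examples only.
crm configure primitive prm_xen_v04 ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/libxl/v04.xml" \
    op monitor interval=60s timeout=90s on-fail=restart \
    meta migration-threshold=3 failure-timeout=600s

Here migration-threshold bounds how many failures are tolerated before the node is avoided, and failure-timeout lets old failures expire so an occasional glitch never accumulates into a ban.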
[ClusterLabs] Antw: [EXT] Re: Why not retry a monitor (pacemaker‑execd) that got a segmentation fault?
>>> Ken Gaillot wrote on 14.06.2022 at 15:49 in message:
> On Tue, 2022-06-14 at 14:36 +0200, Ulrich Windl wrote:
>> Hi!
>>
>> I had a case where a VirtualDomain monitor operation ended in a core
>> dump (actually it was pacemaker-execd, but it counted as a "monitor"
>> operation), and the cluster decided to restart the VM. Wouldn't it be
>> worth retrying the monitor operation first?
>
> It counts like any other monitor failure.
>
>> Chances are that a re-tried monitor operation returns a better status
>> than segmentation fault.
>> Or does the logic just ignore processes dying on signals?
>>
>> 20201202.ba59be712-150300.4.21.1.x86_64 (SLES15 SP3)
>>
>> Jun 14 14:09:16 h19 systemd-coredump[28788]: Process 28786 (pacemaker-execd) of user 0 dumped core.
>> Jun 14 14:09:16 h19 pacemaker-execd[7440]: warning: prm_xen_v04_monitor_60[28786] terminated with signal: Segmentation fault
>
> This means that the child process forked to execute the resource agent
> segfaulted, which is odd.

Yes it's odd, but isn't the cluster there just to protect us from odd situations? ;-)

> Is the agent a compiled program? If not, it's possible the tiny amount
> of pacemaker code that executes the agent is what segfaulted. Do you
> have the actual core, and can you do a backtrace?

Believe me, it's just "odd":

Stack trace of thread 28786:
#0  0x7f85589e0bcf __libc_fork (/lib64/libc-2.31.so + 0xe1bcf)
#1  0x7f855949b85d n/a (/usr/lib64/libcrmservice.so.28.2.2 + 0x785d)
#2  0x7f855949a5e3 n/a (/usr/lib64/libcrmservice.so.28.2.2 + 0x65e3)
#3  0x7f8558d470ed n/a (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x530ed)
#4  0x7f8558d46624 g_main_context_dispatch (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x52624)
#5  0x7f8558d469c0 n/a (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x529c0)
#6  0x7f8558d46c82 g_main_loop_run (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x52c82)
#7  0x558c0765930b n/a (/usr/lib/pacemaker/pacemaker-execd + 0x330b)
#8  0x7f85589342bd __libc_start_main (/lib64/libc-2.31.so + 0x352bd)
#9  0x558c076593da n/a (/usr/lib/pacemaker/pacemaker-execd + 0x33da)

Rumors say it's Dell's dcdbas module combined with Xen and an AMD CPU plus some software bugs ;-)

Regards,
Ulrich

>> Jun 14 14:09:16 h19 pacemaker-controld[7443]: error: Result of monitor operation for prm_xen_v04 on h19: Error
>> Jun 14 14:09:16 h19 pacemaker-controld[7443]: notice: Transition 9 action 107 (prm_xen_v04_monitor_60 on h19): expected 'ok' but got 'error'
>> ...
>> Jun 14 14:09:16 h19 pacemaker-schedulerd[7442]: notice:  * Recover    prm_xen_v04    ( h19 )
>>
>> Regards,
>> ulrich
>
> --
> Ken Gaillot
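On the "processes dying on signals" question: at the POSIX level a child killed by a signal is clearly distinguishable from one that exits with an error code, and the execd log above shows pacemaker does record the signal ("terminated with signal: Segmentation fault"); it simply maps that death to a failed operation rather than retrying it. A quick shell illustration of how such a death surfaces to a parent process (purely illustrative, not pacemaker code):

# The inner shell kills itself with SIGSEGV; the outer shell reports the death
# as exit status 128 + 11 = 139, which a naive supervisor sees as "just failed".
sh -c 'kill -SEGV $$'
echo "exit status: $?"    # prints 139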