Re: [ClusterLabs] Antw: [EXT] Re: Why not retry a monitor (pacemaker‑execd) that got a segmentation fault?

2022-06-15 Thread Klaus Wenninger
On Wed, Jun 15, 2022 at 8:32 AM Ulrich Windl wrote:
>
> >>> Ulrich Windl wrote on 14.06.2022 at 15:53 in message <62A892F0.174 : 161 : 60728>:
>
> ...
> > Yes it's odd, but isn't the cluster there just to protect us from odd situations?
> > ;-)
>
> I have more odd stuff:
> Jun 14 20:40:09 rksaph18 pacemaker-execd[7020]:  warning: 
> prm_lockspace_ocfs2_monitor_12 process (PID 30234) timed out
> ...
> Jun 14 20:40:14 h18 pacemaker-execd[7020]:  crit: 
> prm_lockspace_ocfs2_monitor_12 process (PID 30234) will not die!
> ...
> Jun 14 20:40:53 h18 pacemaker-controld[7026]:  warning: lrmd IPC request 525 
> failed: Connection timed out after 5000ms
> Jun 14 20:40:53 h18 pacemaker-controld[7026]:  error: Couldn't perform 
> lrmd_rsc_cancel operation (timeout=0): -110: Connection timed out (110)
> ...
> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Couldn't perform 
> lrmd_rsc_exec operation (timeout=9): -114: Connection timed out (110)
> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Operation stop on 
> prm_lockspace_ocfs2 failed: -70
> ...
> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  warning: Input I_FAIL received 
> in state S_NOT_DC from do_lrm_rsc_op
> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  notice: State transition 
> S_NOT_DC -> S_RECOVERY
> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  warning: Fast-tracking 
> shutdown in response to errors
> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Input I_TERMINATE 
> received in state S_RECOVERY from do_recover
> Jun 14 20:42:28 h18 pacemaker-controld[7026]:  warning: Sending IPC to lrmd 
> disabled until pending reply received
> Jun 14 20:42:28 h18 pacemaker-controld[7026]:  error: Couldn't perform 
> lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  warning: Sending IPC to lrmd 
> disabled until pending reply received
> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: Couldn't perform 
> lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Stopped 2 recurring 
> operations at shutdown (0 remaining)
> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: 3 resources were active 
> at shutdown
> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from the 
> executor
> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from 
> Corosync
> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from the 
> CIB manager
> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: Could not recover from 
> internal error
> Jun 14 20:42:33 h18 pacemakerd[7003]:  error: pacemaker-controld[7026] exited 
> with status 1 (Error occurred)
> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-schedulerd
> Jun 14 20:42:33 h18 pacemaker-schedulerd[7024]:  notice: Caught 'Terminated' 
> signal
> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-attrd
> Jun 14 20:42:33 h18 pacemaker-attrd[7022]:  notice: Caught 'Terminated' signal
> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-execd
> Jun 14 20:42:34 h18 sbd[6856]:  warning: inquisitor_child: pcmk health check: 
> UNHEALTHY
> Jun 14 20:42:34 h18 sbd[6856]:  warning: inquisitor_child: Servant pcmk is 
> outdated (age: 41877)
> (SBD Fencing)
>

Rolling it up from the back: I guess the reaction to self-fence, in case pacemaker
says it doesn't know - and isn't able to find out - the state of the resources,
is basically correct.

Seeing the issue with the fake age being printed - possibly causing confusion -
reminds me that this should be addressed. I thought we already had, but that is
obviously a false memory.

It would be interesting whether pacemaker would recover the sub-processes without
sbd around, and whether other ways of fencing - which should kick in in a similar
way - would need a significant amount of time.
Since pacemakerd recently started to ping the sub-daemons via IPC - instead of
just listening for signals - it would be interesting whether the logs we are
seeing are already from that code.

That the monitor process kicked off by execd seems to hog the IPC for a
significant time might be an issue to look into, although the new implementation
in pacemakerd might kick in and recover execd - for what that is worth in the end.

This all seems to have been kicked off by an RA that might not be robust enough,
or the node is in a state that just doesn't allow a better answer. I guess the
timeouts and retries needed to give a timely answer about the state of a resource
should be handled inside the RA (see the sketch below). And I guess the last two
points are at least something totally different from fork() segfaulting, although
that might as well be a sign that something is really wrong with the node.
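A minimal sketch of what such RA-internal handling could look like - the OCF exit
codes are the standard ones, but the probe command, the retry budget and the
per-attempt timeout are made up for illustration:

#!/usr/bin/env python3
# Hypothetical OCF-style monitor: retry a flaky probe with a per-attempt
# timeout so the agent always returns a timely, definite answer instead of
# running into pacemaker's operation timeout.
import subprocess
import sys

OCF_SUCCESS = 0       # resource is running
OCF_ERR_GENERIC = 1   # something went wrong / state unknown
OCF_NOT_RUNNING = 7   # resource is cleanly stopped

ATTEMPTS = 3          # made-up retry budget
PER_TRY_TIMEOUT = 10  # seconds per probe attempt (made-up value)

def probe():
    """One health-check attempt; /usr/local/bin/check_health is a placeholder."""
    return subprocess.run(["/usr/local/bin/check_health"],
                          timeout=PER_TRY_TIMEOUT).returncode

def monitor():
    for _ in range(ATTEMPTS):
        try:
            rc = probe()
        except subprocess.TimeoutExpired:
            continue                   # this attempt hung; try again
        if rc == 0:
            return OCF_SUCCESS
        if rc == 7:
            return OCF_NOT_RUNNING     # a definite "stopped" needs no retry
        # unclear or transient result: retry
    return OCF_ERR_GENERIC             # still no definite answer

if __name__ == "__main__" and sys.argv[1:] == ["monitor"]:
    sys.exit(monitor())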

Klaus

> Regards,
> Ulrich
>
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] Antw: [EXT] Re: Why not retry a monitor (pacemaker‑execd) that got a segmentation fault?

2022-06-15 Thread Ulrich Windl
>>> Ulrich Windl wrote on 14.06.2022 at 15:53 in message <62A892F0.174 : 161 : 60728>:

...
> Yes it's odd, but isn't the cluster there just to protect us from odd situations?
> ;-)

I have more odd stuff:
Jun 14 20:40:09 rksaph18 pacemaker-execd[7020]:  warning: 
prm_lockspace_ocfs2_monitor_12 process (PID 30234) timed out
...
Jun 14 20:40:14 h18 pacemaker-execd[7020]:  crit: 
prm_lockspace_ocfs2_monitor_12 process (PID 30234) will not die!
...
Jun 14 20:40:53 h18 pacemaker-controld[7026]:  warning: lrmd IPC request 525 
failed: Connection timed out after 5000ms
Jun 14 20:40:53 h18 pacemaker-controld[7026]:  error: Couldn't perform 
lrmd_rsc_cancel operation (timeout=0): -110: Connection timed out (110)
...
Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Couldn't perform 
lrmd_rsc_exec operation (timeout=9): -114: Connection timed out (110)
Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Operation stop on 
prm_lockspace_ocfs2 failed: -70
...
Jun 14 20:42:23 h18 pacemaker-controld[7026]:  warning: Input I_FAIL received 
in state S_NOT_DC from do_lrm_rsc_op
Jun 14 20:42:23 h18 pacemaker-controld[7026]:  notice: State transition 
S_NOT_DC -> S_RECOVERY
Jun 14 20:42:23 h18 pacemaker-controld[7026]:  warning: Fast-tracking shutdown 
in response to errors
Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Input I_TERMINATE 
received in state S_RECOVERY from do_recover
Jun 14 20:42:28 h18 pacemaker-controld[7026]:  warning: Sending IPC to lrmd 
disabled until pending reply received
Jun 14 20:42:28 h18 pacemaker-controld[7026]:  error: Couldn't perform 
lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
Jun 14 20:42:33 h18 pacemaker-controld[7026]:  warning: Sending IPC to lrmd 
disabled until pending reply received
Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: Couldn't perform 
lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Stopped 2 recurring 
operations at shutdown (0 remaining)
Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: 3 resources were active 
at shutdown
Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from the 
executor
Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from 
Corosync
Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from the 
CIB manager
Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: Could not recover from 
internal error
Jun 14 20:42:33 h18 pacemakerd[7003]:  error: pacemaker-controld[7026] exited 
with status 1 (Error occurred)
Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-schedulerd
Jun 14 20:42:33 h18 pacemaker-schedulerd[7024]:  notice: Caught 'Terminated' 
signal
Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-attrd
Jun 14 20:42:33 h18 pacemaker-attrd[7022]:  notice: Caught 'Terminated' signal
Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-execd
Jun 14 20:42:34 h18 sbd[6856]:  warning: inquisitor_child: pcmk health check: 
UNHEALTHY
Jun 14 20:42:34 h18 sbd[6856]:  warning: inquisitor_child: Servant pcmk is 
outdated (age: 41877)
(SBD Fencing)

Regards,
Ulrich



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Why not retry a monitor (pacemaker‑execd) that got a segmentation fault?

2022-06-14 Thread Ken Gaillot
On Tue, 2022-06-14 at 15:53 +0200, Ulrich Windl wrote:
> > > > Ken Gaillot wrote on 14.06.2022 at 15:49 in message:
> > On Tue, 2022‑06‑14 at 14:36 +0200, Ulrich Windl wrote:
> > > Hi!
> > > 
> > > I had a case where a VirtualDomain monitor operation ended in a
> > > core
> > > dump (actually it was pacemaker‑execd, but it counted as
> > > "monitor"
> > > operation), and the cluster decided to restart the VM. Wouldn't it be
> > > worth retrying the monitor operation first?
> > 
> > It counts like any other monitor failure
> > 
> > > Chances are that a re‑tried monitor operation returns a better
> > > status
> > > than segmentation fault.
> > > Or does the logic just ignore processes dying on signals?
> > > 
> > > 20201202.ba59be712‑150300.4.21.1.x86_64 (SLES15 SP3)
> > > 
> > > Jun 14 14:09:16 h19 systemd‑coredump[28788]: Process 28786
> > > (pacemaker‑execd) of user 0 dumped core.
> > > Jun 14 14:09:16 h19 pacemaker‑execd[7440]:  warning:
> > > prm_xen_v04_monitor_60[28786] terminated with signal:
> > > Segmentation fault
> > 
> > This means that the child process forked to execute the resource
> > agent
> > segfaulted, which is odd.
> 
> Yes it's odd, but isn't the cluster there just to protect us from odd situations?
> ;-)
> 
> > Is the agent a compiled program? If not, it's possible the tiny
> > amount
> > of pacemaker code that executes the agent is what segfaulted. Do
> > you
> > have the actual core, and can you do a backtrace?
> 
> Believe me, it's just "odd":
>   Stack trace of thread 28786:
>   #0  0x7f85589e0bcf __libc_fork (/lib64/libc-2.31.so + 0xe1bcf)
>   #1  0x7f855949b85d n/a (/usr/lib64/libcrmservice.so.28.2.2 + 0x785d)
>   #2  0x7f855949a5e3 n/a (/usr/lib64/libcrmservice.so.28.2.2 + 0x65e3)
>   #3  0x7f8558d470ed n/a (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x530ed)
>   #4  0x7f8558d46624 g_main_context_dispatch (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x52624)
>   #5  0x7f8558d469c0 n/a (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x529c0)
>   #6  0x7f8558d46c82 g_main_loop_run (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x52c82)
>   #7  0x558c0765930b n/a (/usr/lib/pacemaker/pacemaker-execd + 0x330b)
>   #8  0x7f85589342bd __libc_start_main (/lib64/libc-2.31.so + 0x352bd)
>   #9  0x558c076593da n/a (/usr/lib/pacemaker/pacemaker-execd + 0x33da)
> 
> Rumors say it's Dell's dcdbas module combined with Xen and an AMD CPU
> plus
> some software bugs ;-)

From the above, it looks like the C library's fork() is what's
segfaulting, so yeah probably not much to do about it ...
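As a side note, the distinction the executor log makes is easy to reproduce: a
child that exits with a nonzero code still reports an agent result, while a child
killed by a signal (the "terminated with signal: Segmentation fault" case) never
gets to report anything - the parent only sees the terminating signal. A small,
self-contained demonstration, not pacemaker code:

# Demonstration only: a process killed by SIGSEGV has no exit code to report;
# the parent just sees the terminating signal, which is why the executor logs
# "terminated with signal" instead of an agent return code.
import signal
import subprocess
import sys

# Child 1: a normal agent-style failure, exit code 1.
normal = subprocess.run([sys.executable, "-c", "raise SystemExit(1)"])
print("normal failure: returncode =", normal.returncode)      # 1

# Child 2: kills itself with SIGSEGV before it can exit.
crashed = subprocess.run(
    [sys.executable, "-c",
     "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"])
print("segfaulted child: returncode =", crashed.returncode)   # -11 on Linux
print("terminated by SIGSEGV:",
      crashed.returncode == -signal.SIGSEGV)                  # True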

> 
> Regards,
> Ulrich
> 
> > > Jun 14 14:09:16 h19 pacemaker‑controld[7443]:  error: Result of monitor operation for prm_xen_v04 on h19: Error
> > > Jun 14 14:09:16 h19 pacemaker‑controld[7443]:  notice: Transition 9 action 107 (prm_xen_v04_monitor_60 on h19): expected 'ok' but got 'error'
> > > ...
> > > Jun 14 14:09:16 h19 pacemaker‑schedulerd[7442]:  notice:  * Recover    prm_xen_v04  ( h19 )
> > > 
> > > Regards,
> > > ulrich
> > > 
> > > 
> > > 
> > > ___
> > > Manage your subscription:
> > > https://lists.clusterlabs.org/mailman/listinfo/users 
> > > 
> > > ClusterLabs home: https://www.clusterlabs.org/ 
> > > 
> > ‑‑ 
> > Ken Gaillot 
> > 
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users 
> > 
> > ClusterLabs home: https://www.clusterlabs.org/ 
> 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: [EXT] Re: Why not retry a monitor (pacemaker‑execd) that got a segmentation fault?

2022-06-14 Thread Ulrich Windl
>>> Ken Gaillot wrote on 14.06.2022 at 15:49 in message:
> On Tue, 2022‑06‑14 at 14:36 +0200, Ulrich Windl wrote:
>> Hi!
>> 
>> I had a case where a VirtualDomain monitor operation ended in a core
>> dump (actually it was pacemaker‑execd, but it counted as "monitor"
>> operation), and the cluster decided to restart the VM. Wouldn't it be
>> worth retrying the monitor operation first?
> 
> It counts like any other monitor failure
> 
>> Chances are that a re‑tried monitor operation returns a better status
>> than segmentation fault.
>> Or does the logic just ignore processes dying on signals?
>> 
>> 20201202.ba59be712‑150300.4.21.1.x86_64 (SLES15 SP3)
>> 
>> Jun 14 14:09:16 h19 systemd‑coredump[28788]: Process 28786
>> (pacemaker‑execd) of user 0 dumped core.
>> Jun 14 14:09:16 h19 pacemaker‑execd[7440]:  warning:
>> prm_xen_v04_monitor_60[28786] terminated with signal:
>> Segmentation fault
> 
> This means that the child process forked to execute the resource agent
> segfaulted, which is odd.

Yes it's odd, but isn't the cluster there just to protect us from odd situations?
;-)

> 
> Is the agent a compiled program? If not, it's possible the tiny amount
> of pacemaker code that executes the agent is what segfaulted. Do you
> have the actual core, and can you do a backtrace?

Believe me, it's just "odd":
  Stack trace of thread 28786:
  #0  0x7f85589e0bcf __libc_fork (/lib64/libc-2.31.so + 0xe1bcf)
  #1  0x7f855949b85d n/a (/usr/lib64/libcrmservice.so.28.2.2 + 0x785d)
  #2  0x7f855949a5e3 n/a (/usr/lib64/libcrmservice.so.28.2.2 + 0x65e3)
  #3  0x7f8558d470ed n/a (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x530ed)
  #4  0x7f8558d46624 g_main_context_dispatch (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x52624)
  #5  0x7f8558d469c0 n/a (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x529c0)
  #6  0x7f8558d46c82 g_main_loop_run (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x52c82)
  #7  0x558c0765930b n/a (/usr/lib/pacemaker/pacemaker-execd + 0x330b)
  #8  0x7f85589342bd __libc_start_main (/lib64/libc-2.31.so + 0x352bd)
  #9  0x558c076593da n/a (/usr/lib/pacemaker/pacemaker-execd + 0x33da)

Rumors say it's Dell's dcdbas module combined with Xen and an AMD CPU plus
some software bugs ;-)

Regards,
Ulrich

> 
>> Jun 14 14:09:16 h19 pacemaker‑controld[7443]:  error: Result of
>> monitor operation for prm_xen_v04 on h19: Error
>> Jun 14 14:09:16 h19 pacemaker‑controld[7443]:  notice: Transition 9
>> action 107 (prm_xen_v04_monitor_60 on h19): expected 'ok' but got
>> 'error'
>> ...
>> Jun 14 14:09:16 h19 pacemaker‑schedulerd[7442]:  notice:  * Recover    prm_xen_v04  ( h19 )
>> 
>> Regards,
>> ulrich
>> 
>> 
>> 
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> ClusterLabs home: https://www.clusterlabs.org/ 
>> 
> ‑‑ 
> Ken Gaillot 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/