On 5/29/21 12:05 AM, Strahil Nikolov wrote:
Most RA scripts are written in bash.
Usually you can change the shebang to '#!/usr/bin/bash -x', or you can
set trace_ra=1 via 'pcs resource update RESOURCE trace_ra=1
trace_file=/somepath'.
If you don't define trace_file, the trace files should be created under
/var/lib/heartbeat/trace_ra (that path is from memory, so use find/locate
to confirm).
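To illustrate what such a trace contains, here is a minimal sketch that runs a stand-in monitor function under 'set -x' and captures the trace in a file. The function name demo_monitor and the trace file path are hypothetical and only for illustration; this is not how the resource-agents tracing is implemented internally, just the same shell mechanism.

```shell
#!/bin/sh
# Sketch only: capture a 'set -x' trace of a function in a file.
# demo_monitor and the trace file path are hypothetical names.
trace_file="${TMPDIR:-/tmp}/ra_trace_demo.$$.log"

demo_monitor() {
    # Stand-in for a resource agent's monitor action: pretend the
    # service is down, so the test below fails with exit status 1.
    state="down"
    [ "$state" = "up" ]
}

# Run the function in a subshell with tracing on; the shell writes
# xtrace output to stderr, which is redirected to the trace file.
monitor_rc=0
( set -x; demo_monitor ) 2>"$trace_file" || monitor_rc=$?

echo "monitor exit status: $monitor_rc"
echo "trace written to: $trace_file"
```

Reading the trace file afterwards shows each command the function executed, which is usually enough to see which branch of the monitor was taken.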
Best Regards,
Strahil Nikolov
On Fri, May 28, 2021 at 22:10, Abithan Kumarasamy
<abithan.kumaras...@ibm.com> wrote:
Hello Team,
We have been recently running some tests on our Pacemaker clusters
that involve two Pacemaker resources on two nodes respectively.
The test case in which we are experiencing intermittent problems
is one in which we bring down the Pacemaker resources on both
nodes simultaneously. Our expected behaviour is that the
monitor function in our resource agent script detects the
downtime, and that Pacemaker then issues a start command. This
happens on most iterations of our test case. However, on some
iterations (approximately 1 out of 30 simulations) we notice that
Pacemaker is issuing the start command on only one of the hosts.
On the troubled host the monitor function is logging that the
resource is down as expected and is exiting with the OCF_ERR_GENERIC
return code (1). According to the documentation, this should
perform a soft disaster recovery, but when scanning the Pacemaker
logs, there is no indication of the start command being issued or
invoked. However, it works as expected on the other host.
To summarize the issue:
1. The resource’s monitor is running and returning OCF_ERR_GENERIC.
2. The constraints we have for the resources are satisfied.
3. There are no visible differences in the Pacemaker logs between
the test iteration that failed and the multiple successful
iterations, other than the fact that Pacemaker does not start
the resource after the monitor returns OCF_ERR_GENERIC.
In general, Pacemaker won't start a resource after receiving
OCF_ERR_GENERIC from the monitor. As you already mentioned,
it will try to recover the resource to a known state by first
stopping it, and the state has to be reported as stopped
after that. Only then will it try to restart the resource, if the
rules say so.
Which resource agent are you using? If you brought down
the resource manually, the monitor shouldn't report OCF_ERR_GENERIC
but stopped (OCF_NOT_RUNNING).
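As a rough sketch of that distinction, a monitor function could look something like the following. The return code values are the standard OCF ones; the function name myservice_monitor and the pidfile path are hypothetical, and a pidfile check is only one assumed way of detecting the service state, so adapt it to whatever your agent really checks.

```shell
#!/bin/sh
# Sketch only: a monitor that distinguishes "cleanly stopped" from "failed".
# Return code values as defined by the OCF resource agent API.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical pidfile location; a real agent would take it from a parameter.
PIDFILE="${PIDFILE:-/tmp/myservice.pid}"

myservice_monitor() {
    # No pidfile at all: the service is stopped, not failed.
    if [ ! -f "$PIDFILE" ]; then
        return "$OCF_NOT_RUNNING"
    fi
    # Pidfile exists but cannot be read: the state cannot be determined.
    if [ ! -r "$PIDFILE" ]; then
        return "$OCF_ERR_GENERIC"
    fi
    pid=$(cat "$PIDFILE")
    # Process is alive: the service is running.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        return "$OCF_SUCCESS"
    fi
    # Pidfile left behind but no such process: report not running.
    return "$OCF_NOT_RUNNING"
}
```

With a monitor along these lines, bringing the service down by hand makes the monitor report "not running" rather than a generic error.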
Regards,
Klaus
Could you provide some more insight into why this may be happening
and how we can further debug this issue? We are currently relying
on Pacemaker logs, but are there additional diagnostics to further
debug?
Thanks,
Abithan
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/