On 5/29/21 12:05 AM, Strahil Nikolov wrote:
Most RA scripts are written in bash.
Usually you can change the shebang to '#!/usr/bin/bash -x', or you can set trace_ra=1 via 'pcs resource update RESOURCE trace_ra=1 trace_file=/somepath'.

If you don't define trace_file, the trace files should be created in /var/lib/heartbeat/trace_ra (this is from memory, so use find/locate to confirm).
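
A minimal sketch of that workflow, under some assumptions: the resource ID is passed as the first argument, recent_traces is a hypothetical helper name, and the default directory is the one recalled from memory above, so it may differ on your system.

```shell
#!/usr/bin/env bash
# Sketch only: the resource ID comes from $1, and the default trace
# directory is the one recalled (from memory) above; verify it locally.

TRACE_DIR="${TRACE_DIR:-/var/lib/heartbeat/trace_ra}"

# Hypothetical helper: list trace files modified within the last N minutes
# (default 10) under the given directory.
recent_traces() {
    local dir="${1:-$TRACE_DIR}" minutes="${2:-10}"
    [ -d "$dir" ] || { echo "no trace directory at $dir" >&2; return 1; }
    find "$dir" -type f -mmin "-${minutes}" | sort
}

# Only touch the cluster when pcs is present and a resource ID was given.
if command -v pcs >/dev/null 2>&1 && [ -n "${1:-}" ]; then
    pcs resource update "$1" trace_ra=1    # enable tracing for this resource
    echo "Reproduce the failure, then call recent_traces to list new files."
fi
```

Once the failing iteration has been captured, remember to switch tracing off again, since every operation writes a new trace file.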

Best Regards,
Strahil Nikolov

    On Fri, May 28, 2021 at 22:10, Abithan Kumarasamy
    <abithan.kumaras...@ibm.com> wrote:
    Hello Team,
    We have been recently running some tests on our Pacemaker clusters
    that involve two Pacemaker resources on two nodes respectively.
    The test case in which we are experiencing intermittent problems
    is one in which we bring down the Pacemaker resources on both
    nodes simultaneously. Now our expected behaviour is that our
    monitor function in our resource agent script detects the
    downtime, and then should issue a start command. This happens on
    most successful iterations of our test case. However, on some
    iterations (approximately 1 out of 30 simulations) we notice that
    Pacemaker is issuing the start command on only one of the hosts.
    On the troubled host the monitor function is logging that the
    resource is down as expected and is exiting with OCF_ERR_GENERIC
    return code (1). According to the documentation, this should
    perform a soft disaster recovery, but when scanning the Pacemaker
    logs, there is no indication of the start command being issued or
    invoked. However, it works as expected on the other host.
    To summarize the issue:

     1. The resource’s monitor is running and returning OCF_ERR_GENERIC
     2. The constraints we have for the resources are satisfied.
     3. There are no visible differences in the Pacemaker logs between
        the test iteration that failed, and the multiple successful
        iterations, other than the fact that Pacemaker does not start
        the resource after the monitor returns OCF_ERR_GENERIC

In general, Pacemaker won't directly start a resource after receiving
OCF_ERR_GENERIC from the monitor. As you already mentioned,
it will try to recover the resource to a known state by first
trying to stop it, and the state has to be reported as stopped
after that. Only then will it try to restart, if the rules say so.
Which resource agent are you using? If you brought down
the resource manually, it shouldn't report OCF_ERR_GENERIC
but stopped (OCF_NOT_RUNNING).
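
To illustrate the distinction, here is a rough sketch of a monitor action that reports a cleanly stopped service as OCF_NOT_RUNNING and reserves OCF_ERR_GENERIC for a genuinely broken state. The PID-file check, the PIDFILE path and the my_monitor name are assumptions for illustration, not taken from your agent.

```shell
#!/usr/bin/env bash
# Sketch of a monitor action; the PID-file check is a hypothetical stand-in
# for whatever health check the real agent performs.

# A real agent gets these from ocf-shellfuncs; they are defined here so the
# sketch runs on its own. The values are the standard OCF exit codes.
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_NOT_RUNNING:=7}"

# PIDFILE is an assumed location, not taken from the original agent.
PIDFILE="${PIDFILE:-/var/run/my_service.pid}"

my_monitor() {
    # No PID file: treat the service as cleanly stopped.
    [ -f "$PIDFILE" ] || return "$OCF_NOT_RUNNING"

    local pid
    pid=$(cat "$PIDFILE" 2>/dev/null)
    # PID file present but empty or malformed: a genuine error state.
    case "$pid" in
        ''|*[!0-9]*) return "$OCF_ERR_GENERIC" ;;
    esac

    # Process alive: the resource is healthy.
    if kill -0 "$pid" 2>/dev/null; then
        return "$OCF_SUCCESS"
    fi
    # Stale PID file, process gone: report "not running", not an error.
    return "$OCF_NOT_RUNNING"
}
```

With return codes like these, a resource that was brought down by hand is seen as stopped rather than failed.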

Regards,
Klaus

    Could you provide some more insight into why this may be happening
    and how we can further debug this issue? We are currently relying
    on Pacemaker logs, but are there additional diagnostics to further
    debug?
    Thanks,
    Abithan

    _______________________________________________
    Manage your subscription:
    https://lists.clusterlabs.org/mailman/listinfo/users
    <https://lists.clusterlabs.org/mailman/listinfo/users>

    ClusterLabs home: https://www.clusterlabs.org/
    <https://www.clusterlabs.org/>

