On Fri, Aug 28, 2015 at 12:14 AM, Andrew Beekhof <[email protected]> wrote:
>
>> On 21 Aug 2015, at 1:32 pm, Andrei Borzenkov <[email protected]> wrote:
>>
>> 21.08.2015 00:35, Brian Campbell wrote:
>>> I have a master/slave resource (with a custom resource agent) which,
>>> if it uncleanly shut down, will return OCF_FAILED_MASTER on the next
>>> "monitor" operation. This seems to be what
>>> http://www.linux-ha.org/doc/dev-guides/_literal_ocf_failed_master_literal_9.html
>>> suggests that exit code should be used for.
>>>
>>> After the node is fenced, and comes up again, Pacemaker probes all of
>>> the resources. It gets the OCF_FAILED_MASTER exit code, and decides
>>> that it needs to demote the resource. So it executes the demote
>>> action. My resource agent returns an error on a demote action if it is
>>> not running, which seems to be the suggested behavior according to
>>> http://www.linux-ha.org/doc/dev-guides/_literal_demote_literal_action.html
>>>
>>> This then causes Pacemaker to log a failure for the "demote" action,
>>> and then try to recover by stopping (which succeeds cleanly because
>>> the resource is stopped) followed by starting it again (which again
>>> succeeds, as we can start in slave mode from a failed state). So the
>>> end state is correct, but crm_mon shows a failed action that you need
>>> to clear out:
>>>
>>> Failed actions:
>>>
>>> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_demote_0
>>> (node=es-efs-master2, call=73, rc=1, status=complete,
>>> last-rc-change=Thu Aug 20 12:52:21 2015, queued=54ms, exec=1ms):
>>> unknown error
>>>
>>> I'm curious about whether the behavior of my resource agent is
>>> correct. Should I not be returning OCF_FAILED_MASTER upon the
>>> "monitor" operation if the resource isn't started?
>>
>> Correct. If resource is not started it cannot be master or slave; it can
>> become master only after pacemaker requested it. Unexpected master would be
>> just the same error as well.
>>
>> If you can determine that one resource instance is more suitable to become
>> master than another one, you should set master score respectively so
>> pacemaker will promote correct instance.
>>
>>> Or should the
>>> "demote" operation do something different in this state, like actually
>>> starting up the slave?
>>>
>>
>> In general, if current resource state is the same as would be after
>> operation is completed, there is absolutely no reason to return error - just
>> pretend operation succeeded.
>
> Always return the actual state, i.e. OCF_NOT_RUNNING in these two cases.
>
> Only return OCF_FAILED_MASTER if you know enough to say that it's in the
> master state (i.e. lock file, or similar mechanism) but not able to handle
> requests.

Thanks for the clarifications! So it sounds like I should return
OCF_NOT_RUNNING from the monitor operation even if I detect that the
resource was uncleanly shut down in the master state earlier, and only
return OCF_FAILED_MASTER if it is running in the master state but has
failed for some reason, so that it needs a demote or stop.

-- Brian
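
P.S. For anyone who finds this thread later, here is a rough sketch of the
monitor/demote decision logic as I understand it from the replies above.
It is only an illustration, not my actual agent: the function names
(monitor_rc, demote_rc) and the three boolean inputs are hypothetical
stand-ins for whatever checks a real agent performs (pid check, lock file,
health probe), and the demote branch for an already-demoted instance
follows Andrei's "just pretend operation succeeded" suggestion. The numeric
values are the standard OCF exit codes.

```python
# Standard OCF resource agent exit codes.
OCF_SUCCESS = 0          # running (as slave, for a master/slave resource)
OCF_ERR_GENERIC = 1      # generic failure
OCF_NOT_RUNNING = 7      # cleanly stopped / not running
OCF_RUNNING_MASTER = 8   # running in the master role
OCF_FAILED_MASTER = 9    # in the master role, but failed


def monitor_rc(process_running, is_master, healthy):
    """Map the observed state to a monitor exit code.

    All three arguments are hypothetical inputs; a real agent would
    derive them from its own checks.
    """
    if not process_running:
        # Report the actual state. Evidence that the instance *was* master
        # before an unclean shutdown does not make it a failed master now.
        return OCF_NOT_RUNNING
    if is_master:
        # Only a running master that cannot serve requests is FAILED_MASTER.
        return OCF_RUNNING_MASTER if healthy else OCF_FAILED_MASTER
    return OCF_SUCCESS if healthy else OCF_ERR_GENERIC


def demote_rc(process_running, is_master, do_demote):
    """Exit code for a demote action.

    do_demote is a hypothetical callable that performs the real demotion
    and returns True on success.
    """
    if not process_running:
        # Nothing to demote; report the actual state.
        return OCF_NOT_RUNNING
    if not is_master:
        # Already in the state a demote would leave behind.
        return OCF_SUCCESS
    return OCF_SUCCESS if do_demote() else OCF_ERR_GENERIC
```

With this mapping, the probe after fencing sees OCF_NOT_RUNNING instead of
OCF_FAILED_MASTER, so Pacemaker should have no reason to schedule the
demote that was being logged as a failed action.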
