Re: [ClusterLabs] Pacemaker tries to demote resource that isn't running and returns OCF_FAILED_MASTER

Andrew Beekhof Sun, 30 Aug 2015 18:23:04 -0700

> On 29 Aug 2015, at 1:24 am, Brian Campbell <[email protected]> wrote:
> 
> On Fri, Aug 28, 2015 at 12:14 AM, Andrew Beekhof <[email protected]> wrote:
>> 
>>> On 21 Aug 2015, at 1:32 pm, Andrei Borzenkov <[email protected]> wrote:
>>> 
>>> 21.08.2015 00:35, Brian Campbell wrote:
>>>> I have a master/slave resource (with a custom resource agent) which,
>>>> if it uncleanly shut down, will return OCF_FAILED_MASTER on the next
>>>> "monitor" operation. This seems to be what
>>>> http://www.linux-ha.org/doc/dev-guides/_literal_ocf_failed_master_literal_9.html
>>>> suggests that exit code should be used for.
>>>> 
>>>> After the node is fenced, and comes up again, Pacemaker probes all of
>>>> the resources. It gets the OCF_FAILED_MASTER exit code, and decides
>>>> that it needs to demote the resource. So it executes the demote
>>>> action. My resource agent returns an error on a demote action if it is
>>>> not running, which seems to be the suggested behavior according to
>>>> http://www.linux-ha.org/doc/dev-guides/_literal_demote_literal_action.html
>>>> 
>>>> This then causes Pacemaker to log a failure for the "demote" action,
>>>> and then try to recover by stopping (which succeeds cleanly because
>>>> the resource is stopped) followed by starting it again (which again
>>>> succeeds, as we can start in slave mode from a failed state). So the
>>>> end state is correct, but crm_mon shows a failed action that you need
>>>> to clear out:
>>>> 
>>>> Failed actions:
>>>>    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_demote_0
>>>>    (node=es-efs-master2, call=73, rc=1, status=complete,
>>>>    last-rc-change=Thu Aug 20 12:52:21 2015, queued=54ms, exec=1ms): unknown error
>>>> 
>>>> I'm curious about whether the behavior of my resource agent is
>>>> correct. Should I not be returning OCF_FAILED_MASTER upon the
>>>> "monitor" operation if the resource isn't started?
>>> 
>>> Correct. If the resource is not started, it cannot be master or slave; it
>>> can become master only after Pacemaker has requested it. An unexpected
>>> master would be an error just the same.
>>> 
>>> If you can determine that one resource instance is more suitable to become 
>>> master than another, you should set the master scores accordingly so that
>>> Pacemaker promotes the correct instance.
>>> 
>>>> Or should the "demote" operation do something different in this
>>>> state, like actually starting up the slave?
>>>> 
>>> 
>>> In general, if the current resource state is the same as it would be after
>>> the operation completed, there is absolutely no reason to return an error;
>>> just pretend the operation succeeded.
>> 
>> Always return the actual state, i.e. OCF_NOT_RUNNING in these two cases.
>> 
>> Only return OCF_FAILED_MASTER if you know enough to say that it's in the 
>> master state (i.e. lock file, or similar mechanism) but not able to handle 
>> requests.
> 
> Thanks for the clarifications!
> 
> So it sounds like I should be returning OCF_NOT_RUNNING from the
> monitor operation even if I detect that it was uncleanly shut down in
> the master state earlier,


It really depends on whether you need any cleanup to happen.
Need cleanup:     OCF_FAILED_MASTER
_Safely_ stopped: OCF_NOT_RUNNING
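
As a rough illustration of that rule, here is a minimal Python sketch of how a
monitor action might pick its exit code. The numeric values are the standard
OCF return codes; everything else (the ProbeResult fields, the
monitor_exit_code name, and how the agent would detect each condition) is
hypothetical and not taken from any real agent. The unhealthy-slave branch is
my own assumption and was not discussed in this thread.

```python
from dataclasses import dataclass

# Standard OCF return codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7
OCF_RUNNING_MASTER = 8
OCF_FAILED_MASTER = 9


@dataclass
class ProbeResult:
    """Hypothetical summary of what the agent found on this node."""
    process_alive: bool   # is the service process running at all?
    is_master: bool       # does it hold the master role (lock file or similar)?
    healthy: bool         # can it handle requests?
    needs_cleanup: bool   # did an unclean shutdown leave state to clean up?


def monitor_exit_code(state: ProbeResult) -> int:
    """Map the observed state to an OCF monitor exit code."""
    if not state.process_alive:
        # Not running: report failure only if cleanup is required,
        # otherwise report the actual (safely stopped) state.
        return OCF_FAILED_MASTER if state.needs_cleanup else OCF_NOT_RUNNING
    if state.is_master:
        # Known to be master: failed only if it cannot serve requests.
        return OCF_RUNNING_MASTER if state.healthy else OCF_FAILED_MASTER
    # Running as slave (assumption: an unhealthy slave is a generic error).
    return OCF_SUCCESS if state.healthy else OCF_ERR_GENERIC
```

With this shape, a probe after fencing that finds the service cleanly stopped
reports OCF_NOT_RUNNING, so there is nothing for Pacemaker to demote.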

> and only return OCF_FAILED_MASTER if it is
> running in the master state but failed for some reason, so it needs a
> demote or stop.
> 
> -- Brian
> 
> _______________________________________________
> Users mailing list: [email protected]
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


