Re: [ClusterLabs] Pacemaker tries to demote resource that isn't running and returns OCF_FAILED_MASTER

Brian Campbell Fri, 28 Aug 2015 08:26:43 -0700

On Fri, Aug 28, 2015 at 12:14 AM, Andrew Beekhof <[email protected]> wrote:
>
>> On 21 Aug 2015, at 1:32 pm, Andrei Borzenkov <[email protected]> wrote:
>>
>> 21.08.2015 00:35, Brian Campbell wrote:
>>> I have a master/slave resource (with a custom resource agent) which,
>>> if it uncleanly shut down, will return OCF_FAILED_MASTER on the next
>>> "monitor" operation. This seems to be what
>>> http://www.linux-ha.org/doc/dev-guides/_literal_ocf_failed_master_literal_9.html
>>> suggests that exit code should be used for.
>>>
>>> After the node is fenced, and comes up again, Pacemaker probes all of
>>> the resources. It gets the OCF_FAILED_MASTER exit code, and decides
>>> that it needs to demote the resource. So it executes the demote
>>> action. My resource agent returns an error on a demote action if it is
>>> not running, which seems to be the suggested behavior according to
>>> http://www.linux-ha.org/doc/dev-guides/_literal_demote_literal_action.html
>>>
>>> This then causes Pacemaker to log a failure for the "demote" action,
>>> and then try to recover by stopping (which succeeds cleanly because
>>> the resource is stopped) followed by starting it again (which again
>>> succeeds, as we can start in slave mode from a failed state). So the
>>> end state is correct, but crm_mon shows a failed action that you need
>>> to clear out:
>>>
>>> Failed actions:
>>>     editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_demote_0
>>>     (node=es-efs-master2, call=73, rc=1, status=complete,
>>>     last-rc-change=Thu Aug 20 12:52:21 2015, queued=54ms, exec=1ms): unknown error
>>>
>>> I'm curious about whether the behavior of my resource agent is
>>> correct. Should I not be returning OCF_FAILED_MASTER upon the
>>> "monitor" operation if the resource isn't started?
>>
>> Correct. If the resource is not started, it cannot be master or slave; it
>> can become master only after Pacemaker has requested it. An unexpected
>> master would be an error just the same.
>>
>> If you can determine that one resource instance is more suitable to become
>> master than another, you should set the master score accordingly so that
>> Pacemaker promotes the correct instance.
>>
>>>                                                   Or should the
>>> "demote" operation do something different in this state, like actually
>>> starting up the slave?
>>>
>>
>> In general, if the current resource state is the same as it would be after
>> the operation completed, there is absolutely no reason to return an error;
>> just pretend the operation succeeded.
>
> Always return the actual state, i.e. OCF_NOT_RUNNING in these two cases.
>
> Only return OCF_FAILED_MASTER if you know enough to say that it's in the
> master state (i.e. lock file, or similar mechanism) but not able to handle
> requests.


Thanks for the clarifications!

So it sounds like I should be returning OCF_NOT_RUNNING from the
monitor operation even if I detect that it was uncleanly shut down in
the master state earlier, and only return OCF_FAILED_MASTER if it is
running in the master state but failed for some reason, so it needs a
demote or stop.
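
A minimal sketch of that monitor decision, written in Python for brevity
rather than as a real resource agent. The State fields are hypothetical
stand-ins for whatever checks an agent actually performs; only the numeric
exit codes are the standard OCF values.

```python
from dataclasses import dataclass

# Standard OCF exit codes.
OCF_SUCCESS = 0          # running (as slave, for a master/slave resource)
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7
OCF_RUNNING_MASTER = 8
OCF_FAILED_MASTER = 9


@dataclass
class State:
    """Hypothetical snapshot of what the agent can observe on this node."""
    process_running: bool
    is_master: bool = False
    healthy: bool = True
    # Assumed marker left behind by an unclean shutdown while master.
    unclean_master_marker: bool = False


def monitor(state: State) -> int:
    """Report the actual current state, not the state before a crash."""
    if not state.process_running:
        # A stale marker does not make the resource a master: it is stopped.
        return OCF_NOT_RUNNING
    if state.is_master:
        # Known to be master; failed only if it cannot handle requests.
        return OCF_RUNNING_MASTER if state.healthy else OCF_FAILED_MASTER
    return OCF_SUCCESS if state.healthy else OCF_ERR_GENERIC


if __name__ == "__main__":
    # The probe after fencing: process is gone, stale marker is present.
    print(monitor(State(process_running=False, unclean_master_marker=True)))
```

With this shape, the probe after fencing reports OCF_NOT_RUNNING, so
Pacemaker has no reason to schedule a demote for a resource that is
already stopped.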

-- Brian

_______________________________________________
Users mailing list: [email protected]
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
