[ClusterLabs] Pacemaker crash and fencing failure

2015-11-20 Thread Brian Campbell
I've been trying to debug and do a root cause analysis for a cascading series of failures that a customer hit a couple of days ago, that caused their filesystem to be unavailable for a couple of hours. The original failure was in our own distributed filesystem backend, a fork of LizardFS, which

[ClusterLabs] Pacemaker tries to demote resource that isn't running and returns OCF_FAILED_MASTER

2015-08-20 Thread Brian Campbell
I have a master/slave resource (with a custom resource agent) which, if it uncleanly shut down, will return OCF_FAILED_MASTER on the next monitor operation. This seems to be what http://www.linux-ha.org/doc/dev-guides/_literal_ocf_failed_master_literal_9.html suggests that exit code should be used