[ClusterLabs] Using fence_scsi agent and watchdog

2017-08-21 Thread Luca Maranzano
Hello all,

I've set up a 2-node PCS lab to test the fence_scsi agent and how it works.
The lab consists of the following VMs, all CentOS 7.3 under VMware
Workstation:

pcs1 - 192.168.199.101
pcs2 - 192.168.199.102
iscsi - 192.168.199.200  iSCSI server

The iSCSI server provides three block volumes to both PCS nodes:

/dev/sdb 200 MB fence volume with working SCSI-3 persistent reservation
/dev/sdc 1GB data volume XFS
/dev/sdd 2GB data volume XFS
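
To make sure the fence volume really supports SCSI-3 persistent reservations,
I checked it from both nodes roughly like this (a rough sketch; sg3_utils is
assumed to be installed and the output is trimmed):

# report the PR capabilities of the LUN and list any registered keys
sg_persist --in --report-capabilities -d /dev/sdb
sg_persist --in --read-keys -d /dev/sdb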

The Fencing agent is configured like this:
pcs stonith create FenceSCSI fence_scsi pcmk_host_list="pcs1 pcs2"
devices=/dev/sdb meta provides=unfencing

Then I created two resource groups, each with an LVM volume mounted under
/cluster/fs1 and /cluster/fs2 respectively.
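
The groups were created roughly like this (a sketch; the volume group, LV and
resource names here are placeholders, not the exact ones I used):

pcs resource create fs1_lvm ocf:heartbeat:LVM volgrpname=vg_fs1 --group grp_fs1
pcs resource create fs1_fs ocf:heartbeat:Filesystem device=/dev/vg_fs1/lv_fs1 \
    directory=/cluster/fs1 fstype=xfs --group grp_fs1
# and the same again with vg_fs2 and /cluster/fs2 for the second group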

PCS manages the resources as expected.

Coming to fence_scsi, it seems the only way to be sure that a fenced node
actually reboots is to install the watchdog rpm and to link the
/usr/share/cluster/fence_scsi_check script into the /etc/watchdog.d directory.
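
On both nodes that amounts to something like this (a rough sketch; package and
service names as on CentOS 7, the script path as shipped by fence-agents):

yum install -y watchdog
ln -s /usr/share/cluster/fence_scsi_check /etc/watchdog.d/fence_scsi_check
systemctl enable watchdog
systemctl start watchdog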

But I've noticed that there is a significant lag between the resource takeover
on the surviving node and the actual reboot of the fenced node, which could
lead to a dangerous situation, for example:
1. stonith_admin -F pcs1
2. PCS stops on pcs1 and the resources are switched over to node pcs2 within
a few moments
3. only some time later does the watchdog trigger the reboot of node pcs1
(see the sketch below).
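
A way to observe this window (a sketch, assuming sg3_utils is installed) is to
watch the keys registered on the fence device from the surviving node:

# run on pcs2: the key of pcs1 disappears almost immediately after the
# fence request, while pcs1 itself is only rebooted later, once its
# watchdog check notices the missing key
sg_persist --in --read-keys -d /dev/sdb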

I have the following questions:

A. Is this the only possible configuration that ensures a node fenced with the
fence_scsi agent is actually rebooted? If so, I think the documentation should
be updated accordingly, because it is not very clear.
B. Is there a way to make the surviving node wait until the fenced node has
actually rebooted before taking over its resources?

Thanks in advance for any answers.
Best regards,
Luca


Re: [ClusterLabs] Antw: Retries before setting fail-count to INFINITY

2017-08-21 Thread Ken Gaillot
On Mon, 2017-08-21 at 15:39 +0200, Ulrich Windl wrote:
> >>> Vaibhaw Pandey  wrote on 21.08.2017 at 14:58 in
> message
> 

Re: [ClusterLabs] SLES11 SP4: Strange problem with "(crm configure) commit"

2017-08-21 Thread Kristoffer Grönlund
Ulrich Windl  writes:

> Hi! 
>
> I just had a strange problem: when trying to "clean up" the CIB configuration 
> (actually deleting unneeded "operations" lines), I failed to commit the change, 
> even though it verified OK:
>
> crm(live)configure# commit
> Call cib_apply_diff failed (-206): Application of an update diff failed
> ERROR: could not patch cib (rc=206)
> INFO: offending xml diff: 

It looks to me (from a cursory glance) like you may be hitting a bug
in the patch generation in pacemaker, but there isn't enough detail
here to say for sure.

Try running crmsh with the "-dR" command-line options so that it logs
the patch it tries to apply.
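
Something along these lines, after redoing the change that deletes the
"operations" lines (just a sketch; where the debug output ends up depends
on your crmsh/syslog setup):

crm -dR configure
crm(live)configure# commit

The patch crmsh tried to apply should then show up in the debug output
and/or syslog.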

Cheers,
Kristoffer

>
> In Syslog I see this:
> Aug 21 15:01:48 h02 cib[19397]:error: xml_apply_patchset_v2: Moved 
> meta_attributes.14926208 to position 1 instead of 2 (0xe3f0f0)
> Aug 21 15:01:48 h02 cib[19397]:error: xml_apply_patchset_v2: Moved 
> meta_attributes.9876096 to position 1 instead of 2 (0xe3c470)
> Aug 21 15:01:48 h02 cib[19397]:error: xml_apply_patchset_v2: Moved 
> utilization.10594784 to position 1 instead of 2 (0x96a2b0)
> Aug 21 15:01:48 h02 cib[19397]:error: xml_apply_patchset_v2: Moved 
> meta_attributes.11397008 to position 1 instead of 2 (0xacc5b0)
> Aug 21 15:01:48 h02 cib[19397]:  warning: cib_server_process_diff: Something 
> went wrong in compatibility mode, requesting full refresh
> Aug 21 15:01:48 h02 cib[19397]:  warning: cib_process_request: Completed 
> cib_apply_diff operation for section 'all': Application of an update diff 
> failed (rc=-206, origin=local/cibadmin/2, version=1.65.23)
>
> What could be causing this? I think I did the same change about three years 
> ago without a problem (with different software, of course).
>
> # rpm -q pacemaker corosync crmsh
> pacemaker-1.1.12-18.1
> corosync-1.4.7-0.23.5
> crmsh-2.1.2+git132.gbc9fde0-18.2
> (latest)
>
> Regards,
> Ulrich
>

-- 
// Kristoffer Grönlund
// kgronl...@suse.com



[ClusterLabs] Antw: Retries before setting fail-count to INFINITY

2017-08-21 Thread Ulrich Windl
>>> Vaibhaw Pandey  wrote on 21.08.2017 at 14:58 in
message

[ClusterLabs] Retries before setting fail-count to INFINITY

2017-08-21 Thread Vaibhaw Pandey
Version in use: Pacemaker 1.1 along with Corosync 1.4

Hello,
I am new to pacemaker and am trying to set up a MySQL master/slave cluster
with it. I have a question about the resource failure response that I
couldn't resolve from the documentation.

The pacemaker doc (
https://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_failure_response.html)
says clearly that:

"Normally, if a running resource fails, pacemaker will try to stop it and
start it again."

I was wondering whether there is a way to configure the number of times
pacemaker will attempt this stop/start sequence - we want to try to restart
the resource two or three times before it is stopped for good. Setting a
migration-threshold alone doesn't work in this case, because the moment the
first attempt to restart the resource fails, fail-count is set to INFINITY.
Our failure-timeout is left at the default (0).
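
For context, the relevant part of our configuration looks roughly like this
(a sketch, assuming the crm shell is in use; the primitive name, agent options
and values are approximated, not our exact setup):

crm configure primitive p_mysql ocf:heartbeat:mysql \
    op monitor interval=30s timeout=60s \
    meta migration-threshold=3 failure-timeout=0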

The reason we wish to do this is that, at times, the database is busy and
the monitor action fails; however, there is a good chance it would succeed
on a second or third attempt.

Is there a parameter in pacemaker that we can use to get this behavior, or
will this have to be coded in the resource agent?

Thanks,
Vaibhaw