On Sun, Jun 26, 2016 at 6:05 AM, Marcin Dulak <[email protected]> wrote:
> Hi,
>
> I'm trying to get familiar with STONITH Block Devices (SBD) on a 3-node
> CentOS7 cluster built in VirtualBox.
> The complete setup is available at
> https://github.com/marcindulak/vagrant-sbd-tutorial-centos7.git
> so hopefully with some help I'll be able to make it work.
>
> Question 1:
> The shared device /dev/sdb1 is VirtualBox's "shareable hard disk"
> https://www.virtualbox.org/manual/ch05.html#hdimagewrites
> will SBD fencing work with that type of storage?

Unknown.
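I haven't used SBD on VirtualBox shareable disks myself. As long as all three
VMs see the same block device and writes from one node are visible to the
others, the on-disk protocol shouldn't care, but that is an assumption on my
part. One way to sanity-check the messaging path yourself (device and node
names are the ones from your setup, and this assumes sbd on node-2 was started
with SBD_DEVICE pointing at the same disk):

# deliver a harmless "test" message to node-2's slot; sbd on node-2
# should log that it received it
sbd -d /dev/sdb1 message node-2 test
# the slot contents can be inspected from any node
sbd -d /dev/sdb1 list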
>
> I start the cluster using vagrant_1.8.1 and virtualbox-4.3 with:
> $ vagrant up  # takes ~15 minutes
>
> The setup brings up the nodes, installs the necessary packages, and prepares
> for the configuration of the pcs cluster.
> You can see which scripts the nodes execute at the bottom of the
> Vagrantfile.
> While there is 'yum -y install sbd' on CentOS7, the fence_sbd agent has not
> been packaged yet.

You're not supposed to use it.

> Therefore I rebuilt the Fedora 24 package using the latest
> https://github.com/ClusterLabs/fence-agents/archive/v4.0.22.tar.gz
> plus the update to fence_sbd from
> https://github.com/ClusterLabs/fence-agents/pull/73
>
> The configuration is inspired by
> https://www.novell.com/support/kb/doc.php?id=7009485 and
> https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_storage_protect_fencing.html
>
> Question 2:
> After reading http://blog.clusterlabs.org/blog/2015/sbd-fun-and-profit I
> expect that with just one stonith resource configured

There shouldn't be any stonith resources configured.

> a node will be fenced when I stop pacemaker and corosync with `pcs cluster
> stop node-1`, or just with `stonith_admin -F node-1`, but this is not the
> case.
>
> As can be seen below from uptime, node-1 is not shut down by `pcs cluster
> stop node-1` executed on itself.
> I found some discussions on [email protected] about whether a node
> running an SBD resource can fence itself,
> but the conclusion was not clear to me.

On RHEL and derivatives it can ONLY fence itself. The disk-based poison pill
isn't supported yet.

> Question 3:
> Neither is node-1 fenced by `stonith_admin -F node-1` executed on node-2,
> despite /var/log/messages on node-2 (the one currently running MyStonith)
> reporting:
> ...
> notice: Operation 'off' [3309] (call 2 from stonith_admin.3288) for host
> 'node-1' with device 'MyStonith' returned: 0 (OK)
> ...
> What is happening here?

Have you tried looking at the sbd logs? Is the watchdog device functioning
correctly?
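On the watchdog specifically: as far as I know VirtualBox does not emulate
watchdog hardware, so unless your Vagrant scripts load a software watchdog,
the /dev/watchdog that sbd relies on may not behave the way you expect.
Something along these lines should show what you actually have (softdog and
the paths below are the usual defaults, adjust to whatever your setup uses):

ls -l /dev/watchdog
wdctl /dev/watchdog     # from util-linux; may report the device busy while sbd holds it
lsmod | grep softdog    # the software watchdog module, if there is no real device
grep -i -e sbd -e watchdog /var/log/messages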
>
> Question 4 (for the future):
> Assuming the node-1 was fenced, what is the way of operating SBD?
> I see the sbd lists now:
> 0 node-3 clear
> 1 node-1 off node-2
> 2 node-2 clear
> How to clear the status of node-1?
>
> Question 5 (also for the future):
> While the relation 'stonith-timeout = Timeout (msgwait) + 20%' presented
> at
> https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_storage_protect_fencing.html
> is clearly described, I wonder about the relation of 'stonith-timeout'
> to other timeouts like the 'monitor interval=60s' reported by `pcs stonith
> show MyStonith`.
>
> Here is how I configure the cluster and test it. The run.sh script is
> attached.
>
> $ sh -x run01.sh 2>&1 | tee run01.txt
>
> with the result:
>
> $ cat run01.txt
>
> Each block below shows the executed ssh command and the result.
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs cluster auth -u hacluster -p password node-1
> node-2 node-3'
> node-1: Authorized
> node-3: Authorized
> node-2: Authorized
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs cluster setup --name mycluster node-1 node-2
> node-3'
> Shutting down pacemaker/corosync services...
> Redirecting to /bin/systemctl stop pacemaker.service
> Redirecting to /bin/systemctl stop corosync.service
> Killing any remaining services...
> Removing all cluster configuration files...
> node-1: Succeeded
> node-2: Succeeded
> node-3: Succeeded
> Synchronizing pcsd certificates on nodes node-1, node-2, node-3...
> node-1: Success
> node-3: Success
> node-2: Success
> Restaring pcsd on the nodes in order to reload the certificates...
> node-1: Success
> node-3: Success
> node-2: Success
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs cluster start --all'
> node-3: Starting Cluster...
> node-2: Starting Cluster...
> node-1: Starting Cluster...
>
> ############################
> ssh node-1 -c sudo su - -c 'corosync-cfgtool -s'
> Printing ring status.
> Local node ID 1
> RING ID 0
> id = 192.168.10.11
> status = ring 0 active with no faults
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs status corosync'
> Membership information
> ----------------------
> Nodeid Votes Name
> 1 1 node-1 (local)
> 2 1 node-2
> 3 1 node-3
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs status'
> Cluster name: mycluster
> WARNING: no stonith devices and stonith-enabled is not false
> Last updated: Sat Jun 25 15:40:51 2016 Last change: Sat Jun 25
> 15:40:33 2016 by hacluster via crmd on node-2
> Stack: corosync
> Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with
> quorum
> 3 nodes and 0 resources configured
> Online: [ node-1 node-2 node-3 ]
> Full list of resources:
> PCSD Status:
> node-1: Online
> node-2: Online
> node-3: Online
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled
>
> ############################
> ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 list'
> 0 node-3 clear
> 1 node-2 clear
> 2 node-1 clear
>
> ############################
> ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 dump'
> ==Dumping header on disk /dev/sdb1
> Header version : 2.1
> UUID : 79f28167-a207-4f2a-a723-aa1c00bf1dee
> Number of slots : 255
> Sector size : 512
> Timeout (watchdog) : 10
> Timeout (allocate) : 2
> Timeout (loop) : 1
> Timeout (msgwait) : 20
> ==Header on disk /dev/sdb1 is dumped
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs stonith list'
> fence_sbd - Fence agent for sbd
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs stonith create MyStonith fence_sbd
> devices=/dev/sdb1 power_timeout=21 action=off'
> ssh node-1 -c sudo su - -c 'pcs property set stonith-enabled=true'
> ssh node-1 -c sudo su - -c 'pcs property set stonith-timeout=24s'
> ssh node-1 -c sudo su - -c 'pcs property'
> Cluster Properties:
> cluster-infrastructure: corosync
> cluster-name: mycluster
> dc-version: 1.1.13-10.el7_2.2-44eb2dd
> have-watchdog: true
> stonith-enabled: true
> stonith-timeout: 24s
> stonith-watchdog-timeout: 10s
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs stonith show MyStonith'
> Resource: MyStonith (class=stonith type=fence_sbd)
> Attributes: devices=/dev/sdb1 power_timeout=21 action=off
> Operations: monitor interval=60s (MyStonith-monitor-interval-60s)
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs cluster stop node-1 '
> node-1: Stopping Cluster (pacemaker)...
> node-1: Stopping Cluster (corosync)...
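To expand on "you're not supposed to use it": on RHEL/CentOS the supported
mode right now is watchdog-only self-fencing, i.e. sbd running on every node
and no fence_sbd stonith resource at all. Roughly like this (a sketch from
memory, not a tested recipe; the 10s matches the stonith-watchdog-timeout you
already have, the sysconfig variable names are the ones the sbd package ships):

# on every node, point sbd at the watchdog in /etc/sysconfig/sbd,
# e.g. SBD_WATCHDOG_DEV=/dev/watchdog (SBD_DEVICE is optional here),
# then enable the service; it starts and stops with the cluster stack
systemctl enable sbd
pcs cluster stop --all && pcs cluster start --all
# drop the fence_sbd resource and let pacemaker rely on the watchdog
pcs stonith delete MyStonith
pcs property set stonith-watchdog-timeout=10s
pcs property set stonith-enabled=true

With that in place, a node that loses quorum or has a fence request pending
against it is expected to reset itself via the watchdog; nothing will power it
off from the outside.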
>
> ############################
> ssh node-2 -c sudo su - -c 'pcs status'
> Cluster name: mycluster
> Last updated: Sat Jun 25 15:42:29 2016 Last change: Sat Jun 25
> 15:41:09 2016 by root via cibadmin on node-1
> Stack: corosync
> Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with
> quorum
> 3 nodes and 1 resource configured
> Online: [ node-2 node-3 ]
> OFFLINE: [ node-1 ]
> Full list of resources:
> MyStonith (stonith:fence_sbd): Started node-2
> PCSD Status:
> node-1: Online
> node-2: Online
> node-3: Online
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled
>
> ############################
> ssh node-2 -c sudo su - -c 'stonith_admin -F node-1 '
>
> ############################
> ssh node-2 -c sudo su - -c 'grep stonith-ng /var/log/messages'
> Jun 25 15:40:11 localhost stonith-ng[3102]: notice: Additional logging
> available in /var/log/cluster/corosync.log
> Jun 25 15:40:11 localhost stonith-ng[3102]: notice: Connecting to cluster
> infrastructure: corosync
> Jun 25 15:40:11 localhost stonith-ng[3102]: notice: crm_update_peer_proc:
> Node node-2[2] - state is now member (was (null))
> Jun 25 15:40:12 localhost stonith-ng[3102]: notice: Watching for stonith
> topology changes
> Jun 25 15:40:12 localhost stonith-ng[3102]: notice: Added 'watchdog' to the
> device list (1 active devices)
> Jun 25 15:40:12 localhost stonith-ng[3102]: notice: crm_update_peer_proc:
> Node node-3[3] - state is now member (was (null))
> Jun 25 15:40:12 localhost stonith-ng[3102]: notice: crm_update_peer_proc:
> Node node-1[1] - state is now member (was (null))
> Jun 25 15:40:12 localhost stonith-ng[3102]: notice: New watchdog timeout
> 10s (was 0s)
> Jun 25 15:41:03 localhost stonith-ng[3102]: notice: Relying on watchdog
> integration for fencing
> Jun 25 15:41:04 localhost stonith-ng[3102]: notice: Added 'MyStonith' to
> the device list (2 active devices)
> Jun 25 15:41:54 localhost stonith-ng[3102]: notice: crm_update_peer_proc:
> Node node-1[1] - state is now lost (was member)
> Jun 25 15:41:54 localhost stonith-ng[3102]: notice: Removing node-1/1 from
> the membership list
> Jun 25 15:41:54 localhost stonith-ng[3102]: notice: Purged 1 peers with
> id=1 and/or uname=node-1 from the membership cache
> Jun 25 15:42:33 localhost stonith-ng[3102]: notice: Client
> stonith_admin.3288.eb400ac9 wants to fence (off) 'node-1' with device
> '(any)'
> Jun 25 15:42:33 localhost stonith-ng[3102]: notice: Initiating remote
> operation off for node-1: 848cd1e9-55e4-4abc-8d7a-3762eaaf9ab4 (0)
> Jun 25 15:42:33 localhost stonith-ng[3102]: notice: watchdog can not fence
> (off) node-1: static-list
> Jun 25 15:42:33 localhost stonith-ng[3102]: notice: MyStonith can fence
> (off) node-1: dynamic-list
> Jun 25 15:42:33 localhost stonith-ng[3102]: notice: watchdog can not fence
> (off) node-1: static-list
> Jun 25 15:42:54 localhost stonith-ng[3102]: notice: Operation 'off' [3309]
> (call 2 from stonith_admin.3288) for host 'node-1' with device 'MyStonith'
> returned: 0 (OK)
> Jun 25 15:42:54 localhost stonith-ng[3102]: notice: Operation off of node-1
> by node-2 for [email protected]: OK
> Jun 25 15:42:54 localhost stonith-ng[3102]: warning: new_event_notification
> (3102-3288-12): Broken pipe (32)
> Jun 25 15:42:54 localhost stonith-ng[3102]: warning: st_notify_fence
> notification of client stonith_admin.3288.eb400a failed: Broken pipe (-32)
>
> ############################
> ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 list'
> 0 node-3 clear
> 1 node-2 clear
> 2 node-1 off node-2
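On question 4: once node-1 has actually been reset or powered back on, its
slot can be cleared from any node that sees the disk, something like this
(from memory, so double-check against the sbd man page):

sbd -d /dev/sdb1 message node-1 clear
sbd -d /dev/sdb1 list    # node-1's slot should show 'clear' again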
>
> ############################
> ssh node-1 -c sudo su - -c 'uptime'
> 15:43:31 up 21 min, 2 users, load average: 0.25, 0.18, 0.11
>
> Cheers,
>
> Marcin
>
> _______________________________________________
> Users mailing list: [email protected]
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

_______________________________________________
Users mailing list: [email protected]
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
