Hi,

The problem was due to a bad stonith configuration. The configuration quoted below is an example of a working Active/Active NFS setup.
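For anyone hitting the same symptoms: the fix boiled down to giving the fence_vbox stonith device working access to the VirtualBox host. A minimal sketch of such a device created with pcs follows; the host address, user, key path and VM names are placeholders, and the exact parameter names (ipaddr/login vs. ip/username) depend on the fence-agents version, so check "pcs stonith describe fence_vbox" first. An existing device can be adjusted with "pcs stonith update" instead.

<adjust the values to your environment>
# pcs stonith create vbox-fencing fence_vbox \
    ipaddr=10.0.2.2 login=vboxadmin identity_file=/root/.ssh/id_rsa \
    pcmk_host_map="nfsnode1:nfsnode1-vm;nfsnode2:nfsnode2-vm" \
    op monitor interval=60s
# pcs property set stonith-enabled=true
<and verify that fencing really works before trusting the cluster>
# pcs stonith fence nfsnode2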
Regards,
Arek

2017-07-10 12:59 GMT+02:00 ArekW <[email protected]>:
> Hi,
> I've created a 2-node active-active HA cluster with an NFS resource. The
> resources are active on both nodes. The cluster passes a failover test with
> the pcs standby command but does not work when a "real" node shutdown occurs.
>
> Test scenario with cluster standby:
> - start the cluster
> - mount the nfs share on client1
> - start copying a file from client1 to the nfs share
> - during the copy put node1/node2 into standby mode (pcs cluster standby
>   nfsnode2)
> - the copy continues
> - unstandby node1/node2
> - the copy continues and the storage re-syncs (drbd)
> - the copy finishes with no errors
>
> I can standby and unstandby the cluster many times and it works.
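For reference, the standby scenario above boils down to roughly the following commands; the client mount point and the rsync line are taken from the test further below, and the nfs4 root mount of the ClusterIP 10.0.2.7 matches the client's mount output there:

<on client1>
# mount -t nfs4 10.0.2.7:/ /mnt/nfsshare
# rsync -a --bwlimit=2000 /root/testfile.dat /mnt/nfsshare/ &
<on either cluster node; the copy keeps running across standby/unstandby>
# pcs cluster standby nfsnode2
# pcs cluster unstandby nfsnode2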
> The problem begins when I do a "true" failover test by hard-shutting down one
> of the nodes. Test results:
> - start the cluster
> - mount the nfs share on client1
> - start copying a file from client1 to the nfs share
> - during the copy shut down node2 by stopping the node's virtual machine
>   (hard stop)
> - the system hangs:
>
> <Start copying a file at client1>
> # rsync -a --bwlimit=2000 /root/testfile.dat /mnt/nfsshare/
>
> <everything works ok. There is a temp file .testfile.dat.9780fH>
>
> [root@nfsnode1 nfs]# ls -lah
> total 9,8M
> drwxr-xr-x 2 root root 3,8K 07-10 11:07 .
> drwxr-xr-x 4 root root 3,8K 07-10 08:20 ..
> -rw-r--r-- 1 root root    9 07-10 08:20 client1.txt
> -rw-r----- 1 root root    0 07-10 11:07 .rmtab
> -rw------- 1 root root 9,8M 07-10 11:07 .testfile.dat.9780fH
>
> [root@nfsnode1 nfs]# pcs status
> Cluster name: nfscluster
> Stack: corosync
> Current DC: nfsnode2 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
> Last updated: Mon Jul 10 11:07:29 2017    Last change: Mon Jul 10 10:28:12 2017 by root via crm_attribute on nfsnode1
>
> 2 nodes and 15 resources configured
>
> Online: [ nfsnode1 nfsnode2 ]
>
> Full list of resources:
>
>  Master/Slave Set: StorageClone [Storage]
>      Masters: [ nfsnode1 nfsnode2 ]
>  Clone Set: dlm-clone [dlm]
>      Started: [ nfsnode1 nfsnode2 ]
>  vbox-fencing (stonith:fence_vbox): Started nfsnode1
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0 (ocf::heartbeat:IPaddr2): Started nfsnode2
>      ClusterIP:1 (ocf::heartbeat:IPaddr2): Started nfsnode1
>  Clone Set: StorageFS-clone [StorageFS]
>      Started: [ nfsnode1 nfsnode2 ]
>  Clone Set: WebSite-clone [WebSite]
>      Started: [ nfsnode1 nfsnode2 ]
>  Clone Set: nfs-group-clone [nfs-group]
>      Started: [ nfsnode1 nfsnode2 ]
>
> <Hard poweroff of vm machine: nfsnode2>
>
> [root@nfsnode1 nfs]# pcs status
> Cluster name: nfscluster
> Stack: corosync
> Current DC: nfsnode1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
> Last updated: Mon Jul 10 11:07:43 2017    Last change: Mon Jul 10 10:28:12 2017 by root via crm_attribute on nfsnode1
>
> 2 nodes and 15 resources configured
>
> Node nfsnode2: UNCLEAN (offline)
> Online: [ nfsnode1 ]
>
> Full list of resources:
>
>  Master/Slave Set: StorageClone [Storage]
>      Storage (ocf::linbit:drbd): Master nfsnode2 (UNCLEAN)
>      Masters: [ nfsnode1 ]
>  Clone Set: dlm-clone [dlm]
>      dlm (ocf::pacemaker:controld): Started nfsnode2 (UNCLEAN)
>      Started: [ nfsnode1 ]
>  vbox-fencing (stonith:fence_vbox): Started nfsnode1
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0 (ocf::heartbeat:IPaddr2): Started nfsnode2 (UNCLEAN)
>      ClusterIP:1 (ocf::heartbeat:IPaddr2): Started nfsnode1
>  Clone Set: StorageFS-clone [StorageFS]
>      StorageFS (ocf::heartbeat:Filesystem): Started nfsnode2 (UNCLEAN)
>      Started: [ nfsnode1 ]
>  Clone Set: WebSite-clone [WebSite]
>      WebSite (ocf::heartbeat:apache): Started nfsnode2 (UNCLEAN)
>      Started: [ nfsnode1 ]
>  Clone Set: nfs-group-clone [nfs-group]
>      Resource Group: nfs-group:1
>          nfs (ocf::heartbeat:nfsserver): Started nfsnode2 (UNCLEAN)
>          nfs-export (ocf::heartbeat:exportfs): Started nfsnode2 (UNCLEAN)
>      Started: [ nfsnode1 ]
>
> <ssh console hangs on client1>
> [root@nfsnode1 nfs]# ls -lah
> <nothing happens>
>
> <drbd status is ok in this situation>
> [root@nfsnode1 ~]# drbdadm status
> storage role:Primary
>   disk:UpToDate
>   nfsnode2 connection:Connecting
>
> <the nfs export is still active on node1>
> [root@nfsnode1 ~]# exportfs
> /mnt/drbd/nfs 10.0.2.0/255.255.255.0
>
> <After ssh to client1 the nfs mount is not accessible>
> login as: root
> [email protected]'s password:
> Last login: Mon Jul 10 07:48:17 2017 from 10.0.2.2
> # cd /mnt/
> # ls
> <console hangs>
>
> # mount
> 10.0.2.7:/ on /mnt/nfsshare type nfs4 (rw,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.2.20,local_lock=none,addr=10.0.2.7)
>
> <Power on vm machine nfsnode2>
> <After nfsnode2 boots, the console on nfsnode1 starts responding but the copying does not proceed>
> <The temp file is visible but not active>
> [root@nfsnode1 ~]# ls -lah
> total 9,8M
> drwxr-xr-x 2 root root 3,8K 07-10 11:07 .
> drwxr-xr-x 4 root root 3,8K 07-10 08:20 ..
> -rw-r--r-- 1 root root    9 07-10 08:20 client1.txt
> -rw-r----- 1 root root    0 07-10 11:16 .rmtab
> -rw------- 1 root root 9,8M 07-10 11:07 .testfile.dat.9780fH
>
> <Copying at client1 hangs>
>
> <Cluster status:>
> [root@nfsnode1 ~]# pcs status
> Cluster name: nfscluster
> Stack: corosync
> Current DC: nfsnode1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
> Last updated: Mon Jul 10 11:17:19 2017    Last change: Mon Jul 10 10:28:12 2017 by root via crm_attribute on nfsnode1
>
> 2 nodes and 15 resources configured
>
> Online: [ nfsnode1 nfsnode2 ]
>
> Full list of resources:
>
>  Master/Slave Set: StorageClone [Storage]
>      Masters: [ nfsnode1 ]
>      Stopped: [ nfsnode2 ]
>  Clone Set: dlm-clone [dlm]
>      Started: [ nfsnode1 nfsnode2 ]
>  vbox-fencing (stonith:fence_vbox): Started nfsnode1
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0 (ocf::heartbeat:IPaddr2): Stopped
>      ClusterIP:1 (ocf::heartbeat:IPaddr2): Started nfsnode1
>  Clone Set: StorageFS-clone [StorageFS]
>      Started: [ nfsnode1 ]
>      Stopped: [ nfsnode2 ]
>  Clone Set: WebSite-clone [WebSite]
>      Started: [ nfsnode1 ]
>      Stopped: [ nfsnode2 ]
>  Clone Set: nfs-group-clone [nfs-group]
>      Resource Group: nfs-group:0
>          nfs (ocf::heartbeat:nfsserver): Started nfsnode1
>          nfs-export (ocf::heartbeat:exportfs): FAILED nfsnode1
>      Stopped: [ nfsnode2 ]
>
> Failed Actions:
> * nfs-export_monitor_30000 on nfsnode1 'unknown error' (1): call=61, status=Timed Out, exitreason='none',
>     last-rc-change='Mon Jul 10 11:11:50 2017', queued=0ms, exec=0ms
> * vbox-fencing_monitor_60000 on nfsnode1 'unknown error' (1): call=22, status=Error, exitreason='none',
>     last-rc-change='Mon Jul 10 11:06:41 2017', queued=0ms, exec=11988ms
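The vbox-fencing monitor failure above is the real clue: with stonith broken, the cluster can never confirm that the UNCLEAN node is dead, and DLM/GFS2 on the surviving node stay frozen until fencing succeeds, which matches the hung ls and copy. The agent can be exercised outside the cluster with something like the following sketch; the options shown are the standard fence-agents ones, and the host, user, key and VM name are placeholders:

<run the agent by hand>
# fence_vbox --ip=10.0.2.2 --username=vboxadmin \
    --identity-file=/root/.ssh/id_rsa --plug=nfsnode2-vm --action=status
<or ask pacemaker itself to fence the node>
# stonith_admin --reboot nfsnode2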
> <Try to cleanup>
>
> # pcs resource cleanup
> # pcs status
> Cluster name: nfscluster
> Stack: corosync
> Current DC: nfsnode1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
> Last updated: Mon Jul 10 11:20:38 2017    Last change: Mon Jul 10 10:28:12 2017 by root via crm_attribute on nfsnode1
>
> 2 nodes and 15 resources configured
>
> Online: [ nfsnode1 nfsnode2 ]
>
> Full list of resources:
>
>  Master/Slave Set: StorageClone [Storage]
>      Masters: [ nfsnode1 ]
>      Stopped: [ nfsnode2 ]
>  Clone Set: dlm-clone [dlm]
>      Started: [ nfsnode1 nfsnode2 ]
>  vbox-fencing (stonith:fence_vbox): Stopped
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0 (ocf::heartbeat:IPaddr2): Stopped
>      ClusterIP:1 (ocf::heartbeat:IPaddr2): Stopped
>  Clone Set: StorageFS-clone [StorageFS]
>      Stopped: [ nfsnode1 nfsnode2 ]
>  Clone Set: WebSite-clone [WebSite]
>      Stopped: [ nfsnode1 nfsnode2 ]
>  Clone Set: nfs-group-clone [nfs-group]
>      Stopped: [ nfsnode1 nfsnode2 ]
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
> <Reboot of both nfsnode1 and nfsnode2>
> <After reboot:>
>
> # pcs status
> Cluster name: nfscluster
> Stack: corosync
> Current DC: nfsnode2 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
> Last updated: Mon Jul 10 11:24:10 2017    Last change: Mon Jul 10 10:28:12 2017 by root via crm_attribute on nfsnode1
>
> 2 nodes and 15 resources configured
>
> Online: [ nfsnode1 nfsnode2 ]
>
> Full list of resources:
>
>  Master/Slave Set: StorageClone [Storage]
>      Slaves: [ nfsnode2 ]
>      Stopped: [ nfsnode1 ]
>  Clone Set: dlm-clone [dlm]
>      Started: [ nfsnode1 nfsnode2 ]
>  vbox-fencing (stonith:fence_vbox): Stopped
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0 (ocf::heartbeat:IPaddr2): Stopped
>      ClusterIP:1 (ocf::heartbeat:IPaddr2): Stopped
>  Clone Set: StorageFS-clone [StorageFS]
>      Stopped: [ nfsnode1 nfsnode2 ]
>  Clone Set: WebSite-clone [WebSite]
>      Stopped: [ nfsnode1 nfsnode2 ]
>  Clone Set: nfs-group-clone [nfs-group]
>      Stopped: [ nfsnode1 nfsnode2 ]
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
> <Eventually the cluster was recovered after:>
> pcs cluster stop --all
> <Solve drbd split-brain>
> pcs cluster start --all
>
> client1 could not be rebooted with 'reboot' because of the hung mount (as I
> presume). It had to be hard-rebooted via the VirtualBox hypervisor.
> What's wrong with this configuration? I can send the CIB configuration if
> necessary.
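The "<Solve drbd split-brain>" step above is the usual manual DRBD recovery. A sketch only, assuming the resource name "storage" from the configuration below and that nfsnode2 is the node whose changes are to be thrown away (pick the split-brain victim deliberately):

<on the node whose data is to be discarded, here nfsnode2>
# drbdadm disconnect storage
# drbdadm secondary storage
# drbdadm connect --discard-my-data storage

<on the surviving node, nfsnode1, only if it reports StandAlone>
# drbdadm connect storage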
>
> ---------------
> Full cluster configuration (working state):
>
> # pcs status --full
> Cluster name: nfscluster
> Stack: corosync
> Current DC: nfsnode1 (1) (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
> Last updated: Mon Jul 10 12:44:03 2017    Last change: Mon Jul 10 11:37:13 2017 by root via crm_attribute on nfsnode1
>
> 2 nodes and 15 resources configured
>
> Online: [ nfsnode1 (1) nfsnode2 (2) ]
>
> Full list of resources:
>
>  Master/Slave Set: StorageClone [Storage]
>      Storage (ocf::linbit:drbd): Master nfsnode1
>      Storage (ocf::linbit:drbd): Master nfsnode2
>      Masters: [ nfsnode1 nfsnode2 ]
>  Clone Set: dlm-clone [dlm]
>      dlm (ocf::pacemaker:controld): Started nfsnode1
>      dlm (ocf::pacemaker:controld): Started nfsnode2
>      Started: [ nfsnode1 nfsnode2 ]
>  vbox-fencing (stonith:fence_vbox): Started nfsnode1
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0 (ocf::heartbeat:IPaddr2): Started nfsnode2
>      ClusterIP:1 (ocf::heartbeat:IPaddr2): Started nfsnode1
>  Clone Set: StorageFS-clone [StorageFS]
>      StorageFS (ocf::heartbeat:Filesystem): Started nfsnode1
>      StorageFS (ocf::heartbeat:Filesystem): Started nfsnode2
>      Started: [ nfsnode1 nfsnode2 ]
>  Clone Set: WebSite-clone [WebSite]
>      WebSite (ocf::heartbeat:apache): Started nfsnode1
>      WebSite (ocf::heartbeat:apache): Started nfsnode2
>      Started: [ nfsnode1 nfsnode2 ]
>  Clone Set: nfs-group-clone [nfs-group]
>      Resource Group: nfs-group:0
>          nfs (ocf::heartbeat:nfsserver): Started nfsnode1
>          nfs-export (ocf::heartbeat:exportfs): Started nfsnode1
>      Resource Group: nfs-group:1
>          nfs (ocf::heartbeat:nfsserver): Started nfsnode2
>          nfs-export (ocf::heartbeat:exportfs): Started nfsnode2
>      Started: [ nfsnode1 nfsnode2 ]
>
> Node Attributes:
> * Node nfsnode1 (1):
>     + master-Storage : 10000
> * Node nfsnode2 (2):
>     + master-Storage : 10000
>
> Migration Summary:
> * Node nfsnode1 (1):
> * Node nfsnode2 (2):
>
> PCSD Status:
>   nfsnode1: Online
>   nfsnode2: Online
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
> # pcs resource --full
>  Master: StorageClone
>   Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=2 clone-node-max=1
>   Resource: Storage (class=ocf provider=linbit type=drbd)
>    Attributes: drbd_resource=storage
>    Operations: start interval=0s timeout=240 (Storage-start-interval-0s)
>                promote interval=0s timeout=90 (Storage-promote-interval-0s)
>                demote interval=0s timeout=90 (Storage-demote-interval-0s)
>                stop interval=0s timeout=100 (Storage-stop-interval-0s)
>                monitor interval=60s (Storage-monitor-interval-60s)
>  Clone: dlm-clone
>   Meta Attrs: clone-max=2 clone-node-max=1
>   Resource: dlm (class=ocf provider=pacemaker type=controld)
>    Operations: start interval=0s timeout=90 (dlm-start-interval-0s)
>                stop interval=0s timeout=100 (dlm-stop-interval-0s)
>                monitor interval=60s (dlm-monitor-interval-60s)
>  Clone: ClusterIP-clone
>   Meta Attrs: clona-node-max=2 clone-max=2 globally-unique=true clone-node-max=2
>   Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>    Attributes: ip=10.0.2.7 cidr_netmask=32 clusterip_hash=sourceip
>    Meta Attrs: resource-stickiness=0
>    Operations: start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>                stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>                monitor interval=5s (ClusterIP-monitor-interval-5s)
>  Clone: StorageFS-clone
>   Resource: StorageFS (class=ocf provider=heartbeat type=Filesystem)
>    Attributes: device=/dev/drbd1 directory=/mnt/drbd fstype=gfs2
>    Operations: start interval=0s timeout=60 (StorageFS-start-interval-0s)
>                stop interval=0s timeout=60 (StorageFS-stop-interval-0s)
>                monitor interval=20 timeout=40 (StorageFS-monitor-interval-20)
>  Clone: WebSite-clone
>   Resource: WebSite (class=ocf provider=heartbeat type=apache)
>    Attributes: configfile=/etc/httpd/conf/httpd.conf statusurl=http://localhost/server-status
>    Operations: start interval=0s timeout=40s (WebSite-start-interval-0s)
>                stop interval=0s timeout=60s (WebSite-stop-interval-0s)
>                monitor interval=1min (WebSite-monitor-interval-1min)
>  Clone: nfs-group-clone
>   Meta Attrs: interleave=true
>   Group: nfs-group
>    Resource: nfs (class=ocf provider=heartbeat type=nfsserver)
>     Attributes: nfs_ip=10.0.2.7 nfs_no_notify=true
>     Operations: start interval=0s timeout=40 (nfs-start-interval-0s)
>                 stop interval=0s timeout=20s (nfs-stop-interval-0s)
>                 monitor interval=30s (nfs-monitor-interval-30s)
>    Resource: nfs-export (class=ocf provider=heartbeat type=exportfs)
>     Attributes: clientspec=10.0.2.0/255.255.255.0 options=rw,sync,no_root_squash directory=/mnt/drbd/nfs fsid=0
>     Operations: start interval=0s timeout=40 (nfs-export-start-interval-0s)
>                 stop interval=0s timeout=120 (nfs-export-stop-interval-0s)
>                 monitor interval=30s (nfs-export-monitor-interval-30s)
>
> # pcs constraint --full
> Location Constraints:
> Ordering Constraints:
>   start ClusterIP-clone then start WebSite-clone (kind:Mandatory) (id:order-ClusterIP-WebSite-mandatory)
>   promote StorageClone then start StorageFS-clone (kind:Mandatory) (id:order-StorageClone-StorageFS-mandatory)
>   start StorageFS-clone then start WebSite-clone (kind:Mandatory) (id:order-StorageFS-WebSite-mandatory)
>   start dlm-clone then start StorageFS-clone (kind:Mandatory) (id:order-dlm-clone-StorageFS-mandatory)
>   start StorageFS-clone then start nfs-group-clone (kind:Mandatory) (id:order-StorageFS-clone-nfs-group-clone-mandatory)
> Colocation Constraints:
>   WebSite-clone with ClusterIP-clone (score:INFINITY) (id:colocation-WebSite-ClusterIP-INFINITY)
>   StorageFS-clone with StorageClone (score:INFINITY) (with-rsc-role:Master) (id:colocation-StorageFS-StorageClone-INFINITY)
>   WebSite-clone with StorageFS-clone (score:INFINITY) (id:colocation-WebSite-StorageFS-INFINITY)
>   StorageFS-clone with dlm-clone (score:INFINITY) (id:colocation-StorageFS-dlm-clone-INFINITY)
>   StorageFS-clone with nfs-group-clone (score:INFINITY) (id:colocation-StorageFS-clone-nfs-group-clone-INFINITY)
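One thing worth noting about the dump above: "pcs resource --full" does not include the stonith device, so the part that was actually misconfigured is not visible here. It can be reviewed and sanity-checked separately with something like:

# pcs stonith show --full
# pcs property show stonith-enabled
<crm_verify reports configuration problems in the live CIB, if any>
# crm_verify -L -V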
_______________________________________________
Users mailing list: [email protected]
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
