Re: [ClusterLabs] Active-Active NFS cluster failover test - system hangs (VirtualBox)
I'm still having trouble with the 2-node active-active configuration for NFS. Standby and unstandby of either node seem to work fine, but NFS hangs every time a node's state changes. When both nodes are up and I run ls on client1, I get a directory listing. Sometimes when a node is put into standby, the ls on client1 is OK, but sometimes it hangs and takes about a minute to respond. It seems that after unstandby, the cluster stops nfs on the healthy node and then starts it over on both. That could leave client1 temporarily unable to reach the nfs export. Sometimes (not always) when a hang occurs there is a message in the logs: ERROR: nfs-mountd is not running. Maybe the problem is not caused by nfsserver itself but by some problem with ClusterIP. I've checked many configurations and still no luck.

- logs after unstandby node2:

Jul 14 09:49:21 nfsnode2 nfsserver(nfs)[9420]: INFO: Status: rpcbind
Jul 14 09:49:21 nfsnode2 nfsserver(nfs)[9420]: INFO: Status: nfs-mountd
Jul 14 09:49:21 nfsnode2 nfsserver(nfs)[9420]: ERROR: nfs-mountd is not running
Jul 14 09:49:21 nfsnode2 nfsserver(nfs)[9420]: INFO: Starting NFS server ...
Jul 14 09:49:21 nfsnode2 nfsserver(nfs)[9420]: INFO: Start: rpcbind i: 1
Jul 14 09:49:21 nfsnode2 nfsserver(nfs)[9420]: INFO: Start: v3locking: 0
Jul 14 09:49:22 nfsnode2 nfsserver(nfs)[9420]: INFO: Start: nfs-mountd i: 1
Jul 14 09:49:22 nfsnode2 nfsserver(nfs)[9420]: INFO: Start: nfs-idmapd i: 1
Jul 14 09:49:22 nfsnode2 nfsserver(nfs)[9420]: INFO: Start: rpc-statd i: 1
Jul 14 09:49:22 nfsnode2 nfsserver(nfs)[9420]: INFO: NFS server started

- logs after standby node2:

Jul 14 09:54:36 nfsnode2 nfsserver(nfs)[23284]: INFO: Stopping NFS server ...
Jul 14 09:54:36 nfsnode2 nfsserver(nfs)[23284]: INFO: Stop: threads
Jul 14 09:54:36 nfsnode2 nfsserver(nfs)[23284]: INFO: Stop: rpc-statd
Jul 14 09:54:36 nfsnode2 nfsserver(nfs)[23284]: INFO: Stop: nfs-idmapd
Jul 14 09:54:36 nfsnode2 nfsserver(nfs)[23284]: INFO: Stop: nfs-mountd
Jul 14 09:54:36 nfsnode2 nfsserver(nfs)[23284]: INFO: Stop: rpcbind
Jul 14 09:54:36 nfsnode2 nfsserver(nfs)[23284]: INFO: Stop: rpc-gssd
Jul 14 09:54:36 nfsnode2 nfsserver(nfs)[23284]: INFO: Stop: umount (1/10 attempts)
Jul 14 09:54:38 nfsnode2 nfsserver(nfs)[23284]: INFO: NFS server stopped

I don't see anything else in the logs (I could paste all of them, but it would be long).

# pcs resource --full
 Master: StorageClone
  Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=2 clone-node-max=1
  Resource: Storage (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=storage
   Operations: start interval=0s timeout=240 (Storage-start-interval-0s)
               promote interval=0s timeout=90 (Storage-promote-interval-0s)
               demote interval=0s timeout=90 (Storage-demote-interval-0s)
               stop interval=0s timeout=100 (Storage-stop-interval-0s)
               monitor interval=60s (Storage-monitor-interval-60s)
 Clone: ClusterIP-clone
  Meta Attrs: clone-max=2 globally-unique=true clone-node-max=2
  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.0.2.7 cidr_netmask=32 clusterip_hash=sourceip
   Meta Attrs: resource-stickiness=0
   Operations: start interval=0s timeout=20s (ClusterIP-start-interval-0s)
               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
               monitor interval=5s (ClusterIP-monitor-interval-5s)
 Clone: ping-clone
  Resource: ping (class=ocf provider=pacemaker type=ping)
   Attributes: host_list="nfsnode1 nfsnode2"
   Operations: start interval=0s timeout=60 (ping-start-interval-0s)
               stop interval=0s timeout=20 (ping-stop-interval-0s)
               monitor interval=10 timeout=60 (ping-monitor-interval-10)
 Clone: vbox-fencing-clone
  Resource: vbox-fencing (class=stonith type=fence_vbox)
   Attributes: ip=10.0.2.2 username=AW23321 identity_file=/root/.ssh/id_rsa host_os=windows vboxmanage_path="/cygdrive/c/Program\ Files/Oracle/VirtualBox/VBoxManage" pcmk_host_map=nfsnode1:centos1;nfsnode2:centos2 secure=true inet4_only=true login_timeout=30
   Operations: monitor interval=10 (vbox-fencing-monitor-interval-10)
 Clone: dlm-clone
  Meta Attrs: clone-max=2 clone-node-max=1 on-fail=fence
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: start interval=0s timeout=90 (dlm-start-interval-0s)
               stop interval=0s timeout=100 (dlm-stop-interval-0s)
               monitor interval=3s (dlm-monitor-interval-3s)
 Clone: StorageFS-clone
  Resource: StorageFS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/drbd1 directory=/mnt/drbd fstype=gfs2
   Operations: start interval=0s timeout=60 (StorageFS-start-interval-0s)
               stop interval=0s timeout=60 (StorageFS-stop-interval-0s)
               monitor interval=20 timeout=40 (StorageFS-monitor-interval-20)
 Clone: nfs-group-clone
  Meta Attrs: clone-max=2 clone-node-max=1 interleave=true
  Group: nfs-group
   Resource: nfs (class=ocf provider=heartbeat
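One possible explanation for the "stops nfs on the healthy node, then starts it over on both" behaviour described above (a guess, not confirmed in this thread): nfs-group-clone already has interleave=true, but the clones it depends on (dlm-clone, StorageFS-clone, ClusterIP-clone) do not. Without interleave, Pacemaker orders dependent clones cluster-wide, so one instance coming back can cascade restarts to the healthy node. A minimal sketch, assuming the resource names from the config above:

```shell
# Sketch only: set interleave=true on the dependency clones so each node's
# instances are ordered against the local copy, not the whole clone set.
pcs resource meta dlm-clone interleave=true
pcs resource meta StorageFS-clone interleave=true
pcs resource meta ClusterIP-clone interleave=true
```

After changing the meta attributes, a diagnostic such as `crm_simulate --live-check --show-scores` can show which stops/starts Pacemaker still plans for the next transition.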
Re: [ClusterLabs] Active-Active NFS cluster failover test - system hangs (VirtualBox)
Hi,

The problem was due to a bad stonith configuration. The config above is an example of a working Active/Active NFS configuration.

Regards,
Arek

2017-07-10 12:59 GMT+02:00 ArekW:
> [quoted original message snipped; see the full post below]
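The reply does not say what exactly was wrong in the stonith configuration, so the following is only a sketch of one common pitfall: in the config posted in this thread, pcmk_host_map contains a ';', which the shell treats as a command separator unless the value is quoted. Names and paths below are taken from that config:

```shell
# Sketch only: recreate the fence_vbox resource with the ';'-containing
# pcmk_host_map value quoted, so the shell passes it through intact.
pcs stonith create vbox-fencing fence_vbox \
    ip=10.0.2.2 username=AW23321 identity_file=/root/.ssh/id_rsa \
    host_os=windows \
    vboxmanage_path="/cygdrive/c/Program Files/Oracle/VirtualBox/VBoxManage" \
    pcmk_host_map="nfsnode1:centos1;nfsnode2:centos2" \
    secure=true inet4_only=true login_timeout=30

# Confirm the agent can actually power-cycle a node before trusting failover:
pcs stonith fence nfsnode2
```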
[ClusterLabs] Active-Active NFS cluster failover test - system hangs (VirtualBox)
Hi,

I've created a 2-node active-active HA cluster with an NFS resource. The resources are active on both nodes. The cluster passes a failover test with the pcs standby command, but does not work when a "real" node shutdown occurs.

Test scenario with cluster standby:
- start the cluster
- mount the nfs share on client1
- start copying a file from client1 to the nfs share
- during the copy, put node1/node2 into standby mode (pcs cluster standby nfsnode2)
- the copy continues
- unstandby node1/node2
- the copy continues and the storage re-syncs (drbd)
- the copy finishes with no errors

I can standby and unstandby the cluster many times and it works. The problem begins when I do a "true" failover test by hard-shutting down one of the nodes. Test results:
- start the cluster
- mount the nfs share on client1
- start copying a file from client1 to the nfs share
- during the copy, shut down node2 by stopping the node's virtual machine (hard stop)
- the system hangs:

# rsync -a --bwlimit=2000 /root/testfile.dat /mnt/nfsshare/

[root@nfsnode1 nfs]# ls -lah
total 9,8M
drwxr-xr-x 2 root root 3,8K 07-10 11:07 .
drwxr-xr-x 4 root root 3,8K 07-10 08:20 ..
-rw-r--r-- 1 root root    9 07-10 08:20 client1.txt
-rw-r--r-- 1 root root    0 07-10 11:07 .rmtab
-rw------- 1 root root 9,8M 07-10 11:07 .testfile.dat.9780fH

[root@nfsnode1 nfs]# pcs status
Cluster name: nfscluster
Stack: corosync
Current DC: nfsnode2 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
Last updated: Mon Jul 10 11:07:29 2017
Last change: Mon Jul 10 10:28:12 2017 by root via crm_attribute on nfsnode1

2 nodes and 15 resources configured

Online: [ nfsnode1 nfsnode2 ]

Full list of resources:

 Master/Slave Set: StorageClone [Storage]
     Masters: [ nfsnode1 nfsnode2 ]
 Clone Set: dlm-clone [dlm]
     Started: [ nfsnode1 nfsnode2 ]
 vbox-fencing (stonith:fence_vbox): Started nfsnode1
 Clone Set: ClusterIP-clone [ClusterIP] (unique)
     ClusterIP:0 (ocf::heartbeat:IPaddr2): Started nfsnode2
     ClusterIP:1 (ocf::heartbeat:IPaddr2): Started nfsnode1
 Clone Set: StorageFS-clone [StorageFS]
     Started: [ nfsnode1 nfsnode2 ]
 Clone Set: WebSite-clone [WebSite]
     Started: [ nfsnode1 nfsnode2 ]
 Clone Set: nfs-group-clone [nfs-group]
     Started: [ nfsnode1 nfsnode2 ]

[root@nfsnode1 nfs]# pcs status
Cluster name: nfscluster
Stack: corosync
Current DC: nfsnode1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
Last updated: Mon Jul 10 11:07:43 2017
Last change: Mon Jul 10 10:28:12 2017 by root via crm_attribute on nfsnode1

2 nodes and 15 resources configured

Node nfsnode2: UNCLEAN (offline)
Online: [ nfsnode1 ]

Full list of resources:

 Master/Slave Set: StorageClone [Storage]
     Storage (ocf::linbit:drbd): Master nfsnode2 (UNCLEAN)
     Masters: [ nfsnode1 ]
 Clone Set: dlm-clone [dlm]
     dlm (ocf::pacemaker:controld): Started nfsnode2 (UNCLEAN)
     Started: [ nfsnode1 ]
 vbox-fencing (stonith:fence_vbox): Started nfsnode1
 Clone Set: ClusterIP-clone [ClusterIP] (unique)
     ClusterIP:0 (ocf::heartbeat:IPaddr2): Started nfsnode2 (UNCLEAN)
     ClusterIP:1 (ocf::heartbeat:IPaddr2): Started nfsnode1
 Clone Set: StorageFS-clone [StorageFS]
     StorageFS (ocf::heartbeat:Filesystem): Started nfsnode2 (UNCLEAN)
     Started: [ nfsnode1 ]
 Clone Set: WebSite-clone [WebSite]
     WebSite (ocf::heartbeat:apache): Started nfsnode2 (UNCLEAN)
     Started: [ nfsnode1 ]
 Clone Set: nfs-group-clone [nfs-group]
     Resource Group: nfs-group:1
         nfs (ocf::heartbeat:nfsserver): Started nfsnode2 (UNCLEAN)
         nfs-export (ocf::heartbeat:exportfs): Started nfsnode2 (UNCLEAN)
     Started: [ nfsnode1 ]

[root@nfsnode1 nfs]# ls -lah

[root@nfsnode1 ~]# drbdadm status
storage role:Primary
  disk:UpToDate
  nfsnode2 connection:Connecting

[root@nfsnode1 ~]# exportfs
/mnt/drbd/nfs 10.0.2.0/255.255.255.0

login as: root
root@127.0.0.1's password:
Last login: Mon Jul 10 07:48:17 2017 from 10.0.2.2
# cd /mnt/
# ls

# mount
10.0.2.7:/ on /mnt/nfsshare type nfs4 (rw,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.2.20,local_lock=none,addr=10.0.2.7)

[root@nfsnode1 ~]# ls -lah
total 9,8M
drwxr-xr-x 2 root root 3,8K 07-10 11:07 .
drwxr-xr-x 4 root root 3,8K 07-10 08:20 ..
-rw-r--r-- 1 root root    9 07-10 08:20 client1.txt
-rw-r--r-- 1 root root    0 07-10 11:16 .rmtab
-rw------- 1 root root 9,8M 07-10 11:07 .testfile.dat.9780fH

[root@nfsnode1 ~]# pcs status
Cluster name: nfscluster
Stack: corosync
Current DC: nfsnode1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
Last updated: Mon Jul 10 11:17:19 2017
Last change: Mon Jul 10 10:28:12 2017 by root via crm_attribute on nfsnode1

2 nodes and 15 resources configured

Online: [ nfsnode1 nfsnode2 ]

Full list of resources:

 Master/Slave Set: StorageClone [Storage]
     Masters: [ nfsnode1 ]
     Stopped: [ nfsnode2 ]
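A note on the hang shown above (my reading of the output, not stated in the post): while nfsnode2 is UNCLEAN, GFS2/DLM freeze all I/O on the surviving node until the dead node has been successfully fenced, so ls blocking is expected if fencing never completes. A sketch for checking whether fencing itself works, run on the surviving node:

```shell
# Sketch only: ask the fencer to reboot the failed node directly; if this
# hangs or errors, the fence_vbox configuration is the thing to fix.
stonith_admin --reboot nfsnode2

# Inspect the fencing result and any fence_vbox errors in the system log:
grep -i -e stonith -e fence /var/log/messages | tail -n 20
```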