Re: [ClusterLabs] Active-Active NFS cluster failover test - system hangs (VirtualBox)

2017-07-14 Thread ArekW
I'm still having trouble with the 2-node active-active configuration for
NFS. Putting either node into standby and back seems to work fine, but NFS
hangs every time a node's state changes.

When both nodes are up and I run ls on client1, I get the directory listing.
Sometimes when a node is put into standby the ls on client1 is still OK, but
sometimes it hangs and takes about a minute to respond. It seems that after
an unstandby the cluster stops nfs on the healthy node and then starts it
again on both nodes. That could leave client1 temporarily unable to reach the
nfs export. Sometimes (not always) when a hang occurs there is a message in
the logs: ERROR: nfs-mountd is not running. Maybe the problem is not caused
by nfsserver itself but by some problem with ClusterIP. I've checked many
configurations and still have no luck.
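
One thing I still want to rule out is clone interleaving: as far as I
understand, a dependent clone that is not interleaved has to wait for (and
restart with) the whole underlying clone set when its membership changes,
which would match nfs being restarted on the healthy node after an unstandby.
For illustration only (clone name taken from the config below, nothing
applied yet):

# pcs resource show StorageFS-clone
# pcs resource meta StorageFS-clone interleave=true
(and similarly for the other clones in the ordering chain)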

- logs after unstandby node2:
Jul 14 09:49:21 nfsnode2 nfsserver(nfs)[9420]: INFO: Status: rpcbind
Jul 14 09:49:21 nfsnode2 nfsserver(nfs)[9420]: INFO: Status: nfs-mountd
Jul 14 09:49:21 nfsnode2 nfsserver(nfs)[9420]: ERROR: nfs-mountd is not running
Jul 14 09:49:21 nfsnode2 nfsserver(nfs)[9420]: INFO: Starting NFS server ...
Jul 14 09:49:21 nfsnode2 nfsserver(nfs)[9420]: INFO: Start: rpcbind i: 1
Jul 14 09:49:21 nfsnode2 nfsserver(nfs)[9420]: INFO: Start: v3locking: 0
Jul 14 09:49:22 nfsnode2 nfsserver(nfs)[9420]: INFO: Start: nfs-mountd i: 1
Jul 14 09:49:22 nfsnode2 nfsserver(nfs)[9420]: INFO: Start: nfs-idmapd i: 1
Jul 14 09:49:22 nfsnode2 nfsserver(nfs)[9420]: INFO: Start: rpc-statd i: 1
Jul 14 09:49:22 nfsnode2 nfsserver(nfs)[9420]: INFO: NFS server started
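
When that ERROR shows up, mountd can be checked directly on the node to see
whether it is really gone or just slow to re-register with rpcbind, e.g.:

# rpcinfo -p | grep mountd
# systemctl status nfs-mountd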

- logs after standby node2:
Jul 14 09:54:36 nfsnode2 nfsserver(nfs)[23284]: INFO: Stopping NFS server ...
Jul 14 09:54:36 nfsnode2 nfsserver(nfs)[23284]: INFO: Stop: threads
Jul 14 09:54:36 nfsnode2 nfsserver(nfs)[23284]: INFO: Stop: rpc-statd
Jul 14 09:54:36 nfsnode2 nfsserver(nfs)[23284]: INFO: Stop: nfs-idmapd
Jul 14 09:54:36 nfsnode2 nfsserver(nfs)[23284]: INFO: Stop: nfs-mountd
Jul 14 09:54:36 nfsnode2 nfsserver(nfs)[23284]: INFO: Stop: rpcbind
Jul 14 09:54:36 nfsnode2 nfsserver(nfs)[23284]: INFO: Stop: rpc-gssd
Jul 14 09:54:36 nfsnode2 nfsserver(nfs)[23284]: INFO: Stop: umount (1/10 attempts)
Jul 14 09:54:38 nfsnode2 nfsserver(nfs)[23284]: INFO: NFS server stopped

I don't see anything else relevant in the logs (I could paste them all, but
that would be long).
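
If a full report would help, I can collect one for the window around the
unstandby, e.g.:

# crm_report --from "2017-07-14 09:45" --to "2017-07-14 09:55" /tmp/unstandby-report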

# pcs resource --full
 Master: StorageClone
  Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=2
clone-node-max=1
  Resource: Storage (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=storage
   Operations: start interval=0s timeout=240 (Storage-start-interval-0s)
   promote interval=0s timeout=90 (Storage-promote-interval-0s)
   demote interval=0s timeout=90 (Storage-demote-interval-0s)
   stop interval=0s timeout=100 (Storage-stop-interval-0s)
   monitor interval=60s (Storage-monitor-interval-60s)
 Clone: ClusterIP-clone
  Meta Attrs: clone-max=2 globally-unique=true clone-node-max=2
  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.0.2.7 cidr_netmask=32 clusterip_hash=sourceip
   Meta Attrs: resource-stickiness=0
   Operations: start interval=0s timeout=20s (ClusterIP-start-interval-0s)
   stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
   monitor interval=5s (ClusterIP-monitor-interval-5s)
 Clone: ping-clone
  Resource: ping (class=ocf provider=pacemaker type=ping)
   Attributes: host_list="nfsnode1 nfsnode2"
   Operations: start interval=0s timeout=60 (ping-start-interval-0s)
   stop interval=0s timeout=20 (ping-stop-interval-0s)
   monitor interval=10 timeout=60 (ping-monitor-interval-10)
 Clone: vbox-fencing-clone
  Resource: vbox-fencing (class=stonith type=fence_vbox)
   Attributes: ip=10.0.2.2 username=AW23321
identity_file=/root/.ssh/id_rsa host_os=windows
vboxmanage_path="/cygdrive/c/Program\
Files/Oracle/VirtualBox/VBoxManage"
pcmk_host_map=nfsnode1:centos1;nfsnode2:centos2 secure=true
inet4_only=true login_timeout=30
   Operations: monitor interval=10 (vbox-fencing-monitor-interval-10)
 Clone: dlm-clone
  Meta Attrs: clone-max=2 clone-node-max=1 on-fail=fence
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: start interval=0s timeout=90 (dlm-start-interval-0s)
   stop interval=0s timeout=100 (dlm-stop-interval-0s)
   monitor interval=3s (dlm-monitor-interval-3s)
 Clone: StorageFS-clone
  Resource: StorageFS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/drbd1 directory=/mnt/drbd fstype=gfs2
   Operations: start interval=0s timeout=60 (StorageFS-start-interval-0s)
   stop interval=0s timeout=60 (StorageFS-stop-interval-0s)
   monitor interval=20 timeout=40 (StorageFS-monitor-interval-20)
 Clone: nfs-group-clone
  Meta Attrs: clone-max=2 clone-node-max=1 interleave=true
  Group: nfs-group
   Resource: nfs (class=ocf provider=heartbeat 
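
I can also paste the ordering/colocation constraints if that would help; the
full picture can be dumped with:

# pcs constraint show --full
# pcs config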

Re: [ClusterLabs] Active-Active NFS cluster failover test - system hangs (VirtualBox)

2017-07-12 Thread ArekW
Hi,
The problem was due to a bad stonith configuration. The config above is an
example of a working Active/Active NFS configuration.
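
A quick way to sanity-check the stonith settings after such a change, for
example:

# pcs stonith show --full
# pcs property list --all | grep -i stonith
# crm_verify -L -V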

Best regards,
Arek

[ClusterLabs] Active-Active NFS cluster failover test - system hangs (VirtualBox)

2017-07-10 Thread ArekW
Hi,
I've created a 2-node active-active HA cluster with an NFS resource. The
resources are active on both nodes. The cluster passes the failover test with
the pcs standby command, but does not survive a "real" node shutdown.

Test scenario with cluster standby:
- start the cluster
- mount the nfs share on client1
- start copying a file from client1 to the nfs share
- during the copy, put node1/node2 into standby mode (pcs cluster standby
nfsnode2; exact commands below)
- the copy continues
- unstandby node1/node2
- the copy continues and the storage re-syncs (drbd)
- the copy finishes with no errors
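
For reference, the standby part of the test is just the following, run from
either node; crm_mon -1 is used to wait until everything is running on the
remaining node, and drbdadm status shows the resync after the unstandby:

# pcs cluster standby nfsnode2
# crm_mon -1
# pcs cluster unstandby nfsnode2
# drbdadm status storage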

I can put the nodes into standby and back many times and it keeps working.
The problem begins when I do a "true" failover test by hard-shutting down one
of the nodes. Test results:
- start the cluster
- mount the nfs share on client1
- start copying a file from client1 to the nfs share
- during the copy, shut down node2 by stopping the node's virtual machine
(hard power-off)
- the system hangs:


# rsync -a --bwlimit=2000 /root/testfile.dat /mnt/nfsshare/



[root@nfsnode1 nfs]# ls -lah
razem 9,8M
drwxr-xr-x 2 root root 3,8K 07-10 11:07 .
drwxr-xr-x 4 root root 3,8K 07-10 08:20 ..
-rw-r--r-- 1 root root9 07-10 08:20 client1.txt
-rw-r- 1 root root0 07-10 11:07 .rmtab
-rw--- 1 root root 9,8M 07-10 11:07 .testfile.dat.9780fH

[root@nfsnode1 nfs]# pcs status
Cluster name: nfscluster
Stack: corosync
Current DC: nfsnode2 (version 1.1.15-11.el7_3.5-e174ec8) - partition with
quorum
Last updated: Mon Jul 10 11:07:29 2017  Last change: Mon Jul 10
10:28:12 2017 by root via crm_attribute on nfsnode1

2 nodes and 15 resources configured

Online: [ nfsnode1 nfsnode2 ]

Full list of resources:

 Master/Slave Set: StorageClone [Storage]
 Masters: [ nfsnode1 nfsnode2 ]
 Clone Set: dlm-clone [dlm]
 Started: [ nfsnode1 nfsnode2 ]
 vbox-fencing   (stonith:fence_vbox):   Started nfsnode1
 Clone Set: ClusterIP-clone [ClusterIP] (unique)
 ClusterIP:0(ocf::heartbeat:IPaddr2):   Started nfsnode2
 ClusterIP:1(ocf::heartbeat:IPaddr2):   Started nfsnode1
 Clone Set: StorageFS-clone [StorageFS]
 Started: [ nfsnode1 nfsnode2 ]
 Clone Set: WebSite-clone [WebSite]
 Started: [ nfsnode1 nfsnode2 ]
 Clone Set: nfs-group-clone [nfs-group]
 Started: [ nfsnode1 nfsnode2 ]



[root@nfsnode1 nfs]# pcs status
Cluster name: nfscluster
Stack: corosync
Current DC: nfsnode1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with
quorum
Last updated: Mon Jul 10 11:07:43 2017  Last change: Mon Jul 10
10:28:12 2017 by root via crm_attribute on nfsnode1

2 nodes and 15 resources configured

Node nfsnode2: UNCLEAN (offline)
Online: [ nfsnode1 ]

Full list of resources:

 Master/Slave Set: StorageClone [Storage]
 Storage(ocf::linbit:drbd): Master nfsnode2 (UNCLEAN)
 Masters: [ nfsnode1 ]
 Clone Set: dlm-clone [dlm]
 dlm(ocf::pacemaker:controld):  Started nfsnode2 (UNCLEAN)
 Started: [ nfsnode1 ]
 vbox-fencing   (stonith:fence_vbox):   Started nfsnode1
 Clone Set: ClusterIP-clone [ClusterIP] (unique)
 ClusterIP:0(ocf::heartbeat:IPaddr2):   Started nfsnode2
(UNCLEAN)
 ClusterIP:1(ocf::heartbeat:IPaddr2):   Started nfsnode1
 Clone Set: StorageFS-clone [StorageFS]
 StorageFS  (ocf::heartbeat:Filesystem):Started nfsnode2 (UNCLEAN)
 Started: [ nfsnode1 ]
 Clone Set: WebSite-clone [WebSite]
 WebSite(ocf::heartbeat:apache):Started nfsnode2 (UNCLEAN)
 Started: [ nfsnode1 ]
 Clone Set: nfs-group-clone [nfs-group]
 Resource Group: nfs-group:1
 nfs(ocf::heartbeat:nfsserver): Started nfsnode2 (UNCLEAN)
 nfs-export (ocf::heartbeat:exportfs):  Started nfsnode2
(UNCLEAN)
 Started: [ nfsnode1 ]
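
A node reported as UNCLEAN (offline) is one the cluster still has to fence
before it will recover the resources that were running there, so the stack
stays blocked until the fence succeeds. Whether a fence was attempted, and
how it ended, can be checked with e.g.:

# stonith_admin --history nfsnode2
# pcs stonith fence nfsnode2

If fence_vbox itself is the suspect it can also be run by hand; the long
options below are the generic fence-agent ones and may need adjusting against
fence_vbox --help:

# fence_vbox --ip=10.0.2.2 --username=AW23321 --ssh --identity-file=/root/.ssh/id_rsa --plug=centos2 --action=status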


[root@nfsnode1 nfs]# ls -lah



[root@nfsnode1 ~]# drbdadm status
storage role:Primary
  disk:UpToDate
  nfsnode2 connection:Connecting


[root@nfsnode1 ~]# exportfs
/mnt/drbd/nfs   10.0.2.0/255.255.255.0


login as: root
root@127.0.0.1's password:
Last login: Mon Jul 10 07:48:17 2017 from 10.0.2.2
# cd /mnt/
# ls


# mount
10.0.2.7:/ on /mnt/nfsshare type nfs4
(rw,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.2.20,local_lock=none,addr=10.0.2.7)
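
Since the share is mounted hard (see the options above), the client blocks
and retries rather than returning an error while the server side is
unreachable, which is consistent with the hang; the mount corresponds to
roughly:

# mount -t nfs4 -o hard,timeo=600,retrans=2 10.0.2.7:/ /mnt/nfsshare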




[root@nfsnode1 ~]# ls -lah
razem 9,8M
drwxr-xr-x 2 root root 3,8K 07-10 11:07 .
drwxr-xr-x 4 root root 3,8K 07-10 08:20 ..
-rw-r--r-- 1 root root9 07-10 08:20 client1.txt
-rw-r- 1 root root0 07-10 11:16 .rmtab
-rw--- 1 root root 9,8M 07-10 11:07 .testfile.dat.9780fH




[root@nfsnode1 ~]# pcs status
Cluster name: nfscluster
Stack: corosync
Current DC: nfsnode1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with
quorum
Last updated: Mon Jul 10 11:17:19 2017  Last change: Mon Jul 10
10:28:12 2017 by root via crm_attribute on nfsnode1

2 nodes and 15 resources configured

Online: [ nfsnode1 nfsnode2 ]

Full list of resources:

 Master/Slave Set: StorageClone [Storage]
 Masters: [ nfsnode1 ]
 Stopped: [ nfsnode2 ]