On 09/05/2016 05:16 AM, Pablo Pines Leon wrote:
> Hello,
>
> I implemented the suggested change in corosync, and I realized that
> "service pacemaker stop" on the master node works provided that I run
> crm_resource -P from another terminal right after it. The same goes for
> the failback case, i.e. bringing the failed node back into the cluster,
> which causes the IP resource and then the NFS exports to fail: if I run
> crm_resource -P twice after running "service pacemaker start" to bring
> the node back in, it will work.
>
> However, I see no reason why this is happening. If the failover works
> fine, why can there be any problem getting a node back into the cluster?
Looking at your config again, I see that only some of your resources have
monitor operations. All primitives should have monitors, except for
master/slave resources, which should have two monitors on the m/s
resource: one for the master role and one for the slave role (with
different intervals).

BTW, crm_resource -P is deprecated in favor of -C. Same thing, just renamed.

> Thanks and kind regards
>
> Pablo
> ________________________________________
> From: Pablo Pines Leon [pablo.pines.l...@cern.ch]
> Sent: 01 September 2016 09:49
> To: kgail...@redhat.com; Cluster Labs - All topics related to
> open-source clustering welcomed
> Subject: Re: [ClusterLabs] Service pacemaker start kills my cluster and
> other NFS HA issues
>
> Dear Ken,
>
> Thanks for your reply. That configuration works perfectly fine in
> Ubuntu; the problem is that in CentOS 7, for some reason, I am not even
> able to run "service pacemaker stop" on the node that is running as
> master (with the slave off too), because it will have some failed
> actions that don't make any sense:
>
> Migration Summary:
> * Node nfsha1:
>    res_exportfs_root: migration-threshold=1000000 fail-count=1
>        last-failure='Thu Sep 1 09:42:43 2016'
>    res_exportfs_export1: migration-threshold=1000000 fail-count=1000000
>        last-failure='Thu Sep 1 09:42:38 2016'
>
> Failed Actions:
> * res_exportfs_root_monitor_30000 on nfsha1 'not running' (7): call=79,
>   status=complete, exitreason='none',
>   last-rc-change='Thu Sep 1 09:42:43 2016', queued=0ms, exec=0ms
> * res_exportfs_export1_stop_0 on nfsha1 'unknown error' (1): call=88,
>   status=Timed Out, exitreason='none',
>   last-rc-change='Thu Sep 1 09:42:18 2016', queued=0ms, exec=20001ms
>
> So I am wondering what is different between the two OSes that causes
> this different outcome.
>
> Kind regards
>
> ________________________________________
> From: Ken Gaillot [kgail...@redhat.com]
> Sent: 31 August 2016 17:31
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Service pacemaker start kills my cluster and
> other NFS HA issues
>
> On 08/30/2016 10:49 AM, Pablo Pines Leon wrote:
>> Hello,
>>
>> I have set up a DRBD-Corosync-Pacemaker cluster following the
>> instructions from https://wiki.ubuntu.com/ClusterStack/Natty, adapting
>> them to CentOS 7 (e.g. using systemd). After testing it in Virtual
>
> There is a similar how-to specifically for CentOS 7:
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Clusters_from_Scratch/index.html
>
> I think if you compare your configs to that, you'll probably find the
> cause. I'm guessing the most important missing pieces are "two_node: 1"
> in corosync.conf, and fencing.
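For example, keeping the crm syntax used in the configuration quoted
further down this thread, per-role monitors on the DRBD master/slave
primitive could look like the sketch below. The 29s/31s intervals and the
extra monitor on res_ip are illustrative assumptions, not part of the
original config; the point is simply that every primitive gets a monitor,
and the m/s resource gets one per role, with different intervals:

    primitive res_drbd_export ocf:linbit:drbd \
            params drbd_resource=export \
            op monitor interval=29s role=Master \
            op monitor interval=31s role=Slave
    primitive res_ip IPaddr2 \
            params ip=*.46 cidr_netmask=24 nic=eno1 \
            op monitor interval=30s

Likewise, the "two_node: 1" suggestion maps onto the quorum section of the
corosync.conf quoted at the end of this thread, roughly like this
(corosync 2.x votequorum; with two_node set, wait_for_all is enabled
automatically and the explicit expected_votes: 2 becomes redundant given
the two-entry nodelist):

    quorum {
        provider: corosync_votequorum
        two_node: 1
    }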
>> Machines it seemed to be working fine, so it is now implemented on
>> physical machines, and I have noticed that the failover works fine as
>> long as I kill the master by pulling the AC cable, but not if I issue
>> the halt, reboot or shutdown commands, which puts the cluster into a
>> situation like this:
>>
>> Last updated: Tue Aug 30 16:55:58 2016
>> Last change: Tue Aug 23 11:49:43 2016 by hacluster via crmd on nfsha2
>> Stack: corosync
>> Current DC: nfsha2 (version 1.1.13-10.el7_2.4-44eb2dd) - partition with quorum
>> 2 nodes and 9 resources configured
>>
>> Online: [ nfsha1 nfsha2 ]
>>
>> Master/Slave Set: ms_drbd_export [res_drbd_export]
>>     Masters: [ nfsha2 ]
>>     Slaves: [ nfsha1 ]
>> Resource Group: rg_export
>>     res_fs                (ocf::heartbeat:Filesystem):  Started nfsha2
>>     res_exportfs_export1  (ocf::heartbeat:exportfs):    FAILED nfsha2 (unmanaged)
>>     res_ip                (ocf::heartbeat:IPaddr2):     Stopped
>> Clone Set: cl_nfsserver [res_nfsserver]
>>     Started: [ nfsha1 ]
>> Clone Set: cl_exportfs_root [res_exportfs_root]
>>     res_exportfs_root     (ocf::heartbeat:exportfs):    FAILED nfsha2
>>     Started: [ nfsha1 ]
>>
>> Migration Summary:
>> * Node 2:
>>    res_exportfs_export1: migration-threshold=1000000 fail-count=1000000
>>        last-failure='Tue Aug 30 16:55:50 2016'
>>    res_exportfs_root: migration-threshold=1000000 fail-count=1
>>        last-failure='Tue Aug 30 16:55:48 2016'
>> * Node 1:
>>
>> Failed Actions:
>> * res_exportfs_export1_stop_0 on nfsha2 'unknown error' (1): call=134,
>>   status=Timed Out, exitreason='none',
>>   last-rc-change='Tue Aug 30 16:55:30 2016', queued=0ms, exec=20001ms
>> * res_exportfs_root_monitor_30000 on nfsha2 'not running' (7): call=126,
>>   status=complete, exitreason='none',
>>   last-rc-change='Tue Aug 30 16:55:48 2016', queued=0ms, exec=0ms
>>
>> This of course blocks it, because the IP and the NFS exports are down.
>> It doesn't even recognize that the other node is down. I am then forced
>> to run "crm_resource -P" to get it back to a working state.
>>
>> Even when unplugging the master and booting it up again, trying to get
>> it back into the cluster by executing "service pacemaker start" on the
>> node that was unplugged will sometimes just cause the exportfs_root
>> resource on the slave to fail (but the service is still up):
>>
>> Master/Slave Set: ms_drbd_export [res_drbd_export]
>>     Masters: [ nfsha1 ]
>>     Slaves: [ nfsha2 ]
>> Resource Group: rg_export
>>     res_fs                (ocf::heartbeat:Filesystem):  Started nfsha1
>>     res_exportfs_export1  (ocf::heartbeat:exportfs):    Started nfsha1
>>     res_ip                (ocf::heartbeat:IPaddr2):     Started nfsha1
>> Clone Set: cl_nfsserver [res_nfsserver]
>>     Started: [ nfsha1 nfsha2 ]
>> Clone Set: cl_exportfs_root [res_exportfs_root]
>>     Started: [ nfsha1 nfsha2 ]
>>
>> Migration Summary:
>> * Node nfsha2:
>>    res_exportfs_root: migration-threshold=1000000 fail-count=1
>>        last-failure='Tue Aug 30 17:18:17 2016'
>> * Node nfsha1:
>>
>> Failed Actions:
>> * res_exportfs_root_monitor_30000 on nfsha2 'not running' (7): call=34,
>>   status=complete, exitreason='none',
>>   last-rc-change='Tue Aug 30 17:18:17 2016', queued=0ms, exec=33ms
>>
>> BTW, I notice that the node attributes are changed:
>>
>> Node Attributes:
>> * Node nfsha1:
>>     + master-res_drbd_export    : 10000
>> * Node nfsha2:
>>     + master-res_drbd_export    : 1000
>>
>> Usually both would have the same weight (10000), so running
>> "crm_resource -P" restores that.
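For reference, the same cleanup can be done with the non-deprecated form
mentioned at the top of this message; assuming the crm_resource that ships
with Pacemaker 1.1, for example (resource and node names taken from the
status output above):

    # Clean up (re-probe) all resources, like the old "crm_resource -P":
    crm_resource --cleanup
    # Or target a single failed resource on a single node:
    crm_resource --cleanup --resource res_exportfs_root --node nfsha2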
>> Some other times it will instead cause a service disruption:
>>
>> Online: [ nfsha1 nfsha2 ]
>>
>> Master/Slave Set: ms_drbd_export [res_drbd_export]
>>     Masters: [ nfsha2 ]
>>     Slaves: [ nfsha1 ]
>> Resource Group: rg_export
>>     res_fs                (ocf::heartbeat:Filesystem):  Started nfsha2
>>     res_exportfs_export1  (ocf::heartbeat:exportfs):    FAILED (unmanaged) [ nfsha2 nfsha1 ]
>>     res_ip                (ocf::heartbeat:IPaddr2):     Stopped
>> Clone Set: cl_nfsserver [res_nfsserver]
>>     Started: [ nfsha1 nfsha2 ]
>> Clone Set: cl_exportfs_root [res_exportfs_root]
>>     Started: [ nfsha1 nfsha2 ]
>>
>> Migration Summary:
>> * Node nfsha2:
>>    res_exportfs_export1: migration-threshold=1000000 fail-count=1000000
>>        last-failure='Tue Aug 30 17:31:01 2016'
>> * Node nfsha1:
>>    res_exportfs_export1: migration-threshold=1000000 fail-count=1000000
>>        last-failure='Tue Aug 30 17:31:01 2016'
>>    res_exportfs_root: migration-threshold=1000000 fail-count=1
>>        last-failure='Tue Aug 30 17:31:11 2016'
>>
>> Failed Actions:
>> * res_exportfs_export1_stop_0 on nfsha2 'unknown error' (1): call=86,
>>   status=Timed Out, exitreason='none',
>>   last-rc-change='Tue Aug 30 17:30:41 2016', queued=0ms, exec=20002ms
>> * res_exportfs_export1_stop_0 on nfsha1 'unknown error' (1): call=32,
>>   status=Timed Out, exitreason='none',
>>   last-rc-change='Tue Aug 30 17:30:41 2016', queued=0ms, exec=20002ms
>> * res_exportfs_root_monitor_30000 on nfsha1 'not running' (7): call=29,
>>   status=complete, exitreason='none',
>>   last-rc-change='Tue Aug 30 17:31:11 2016', queued=0ms, exec=0ms
>>
>> Then executing "crm_resource -P" brings it back to life, but if that
>> command is not executed the cluster remains blocked until around 10
>> minutes later, when it sometimes magically recovers (as if crm_resource
>> -P had been run automatically).
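An aside on the "around 10 minutes" self-recovery described above:
Pacemaker does re-run its scheduler on a timer (the cluster-recheck-interval
property, 15 minutes by default in the 1.1 series), but without a
failure-timeout set, fail-counts never expire on their own. If expiry is
wanted, failure-timeout is the supported way to get it; as a sketch against
the res_exportfs_export1 primitive from the configuration quoted below
(the 600s value is an illustrative assumption, not a recommendation):

    primitive res_exportfs_export1 exportfs \
            params fsid=1 directory="/mnt/export/export1" \
                options="rw,root_squash,mountpoint" \
                clientspec="*.0/255.255.255.0" \
                wait_for_leasetime_on_stop=true \
            op monitor interval=30s \
            meta target-role=Started failure-timeout=600s

Note that failure-timeout only clears the fail-count the next time the
recheck timer or some other cluster event fires; it does nothing about
whatever made the stop operation time out in the first place.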
>> In case it helps, the CRM configuration is this one:
>>
>> node 1: nfsha1
>> node 2: nfsha2 \
>>         attributes standby=off
>> primitive res_drbd_export ocf:linbit:drbd \
>>         params drbd_resource=export
>> primitive res_exportfs_export1 exportfs \
>>         params fsid=1 directory="/mnt/export/export1" \
>>             options="rw,root_squash,mountpoint" \
>>             clientspec="*.0/255.255.255.0" \
>>             wait_for_leasetime_on_stop=true \
>>         op monitor interval=30s \
>>         meta target-role=Started
>> primitive res_exportfs_root exportfs \
>>         params fsid=0 directory="/mnt/export" options="rw,crossmnt" \
>>             clientspec="*.0/255.255.255.0" \
>>         op monitor interval=30s \
>>         meta target-role=Started
>> primitive res_fs Filesystem \
>>         params device="/dev/drbd0" directory="/mnt/export" fstype=ext3 \
>>         meta target-role=Started
>> primitive res_ip IPaddr2 \
>>         params ip=*.46 cidr_netmask=24 nic=eno1
>> primitive res_nfsserver systemd:nfs-server \
>>         op monitor interval=30s
>> group rg_export res_fs res_exportfs_export1 res_ip
>> ms ms_drbd_export res_drbd_export \
>>         meta notify=true master-max=1 master-node-max=1 clone-max=2 \
>>             clone-node-max=1
>> clone cl_exportfs_root res_exportfs_root
>> clone cl_nfsserver res_nfsserver
>> colocation c_export_on_drbd inf: rg_export ms_drbd_export:Master
>> colocation c_nfs_on_root inf: rg_export cl_exportfs_root
>> order o_drbd_before_nfs inf: ms_drbd_export:promote rg_export:start
>> order o_root_before_nfs inf: cl_exportfs_root rg_export:start
>> property cib-bootstrap-options: \
>>         maintenance-mode=false \
>>         stonith-enabled=false \
>>         no-quorum-policy=ignore \
>>         have-watchdog=false \
>>         dc-version=1.1.13-10.el7_2.4-44eb2dd \
>>         cluster-infrastructure=corosync \
>>         cluster-name=nfsha
>>
>> And the corosync.conf:
>>
>> totem {
>>     version: 2
>>
>>     # Corosync itself works without a cluster name, but DLM needs one.
>>     # The cluster name is also written into the VG metadata of newly
>>     # created shared LVM volume groups, if lvmlockd uses DLM locking.
>>     # It is also used for computing mcastaddr, unless overridden below.
>>     cluster_name: nfsha
>>
>>     # How long before declaring a token lost (ms)
>>     token: 3000
>>
>>     # How many token retransmits before forming a new configuration
>>     token_retransmits_before_loss_const: 10
>>
>>     # Limit generated nodeids to 31-bits (positive signed integers)
>>     clear_node_high_bit: yes
>>
>>     # crypto_cipher and crypto_hash: used for mutual node authentication.
>>     # If you choose to enable this, then do remember to create a shared
>>     # secret with "corosync-keygen". Enabling crypto_cipher requires also
>>     # enabling crypto_hash; they should be used instead of the deprecated
>>     # secauth parameter.
>>     # Valid values for crypto_cipher are none (no encryption), aes256,
>>     # aes192, aes128 and 3des.
>>     crypto_cipher: none
>>
>>     # Valid values for crypto_hash are none (no authentication), md5,
>>     # sha1, sha256, sha384 and sha512.
>>     crypto_hash: none
>>
>>     # Optionally assign a fixed node id (integer)
>>     # nodeid: 1234
>>
>>     transport: udpu
>> }
>>
>> nodelist {
>>     node {
>>         ring0_addr: *.50
>>         nodeid: 1
>>     }
>>     node {
>>         ring0_addr: *.51
>>         nodeid: 2
>>     }
>> }
>>
>> logging {
>>     to_syslog: yes
>> }
>>
>> quorum {
>>     # Enable and configure quorum subsystem (default: off)
>>     # see also corosync.conf.5 and votequorum.5
>>     provider: corosync_votequorum
>>     expected_votes: 2
>> }
>>
>> So, as you can imagine, I am really puzzled about all this and would
>> certainly welcome any help about what might be wrong with the current
>> configuration.
>>
>> Thank you very much, kind regards
>>
>> Pablo

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org