On February 24, 2020 4:56:07 PM GMT+02:00, Luke Camilleri <[email protected]> wrote:
> Hello users, I would like to ask for assistance with the setup below,
> mainly regarding the fence monitor timeout:
>
> # pcs --version
> 0.9.167
>
> # pacemakerd --version
> Pacemaker 1.1.20-5.el7_7.2
>
> # corosync -v
> Corosync Cluster Engine, version '2.4.3'
> Copyright (c) 2006-2009 Red Hat, Inc.
>
> # cat /etc/redhat-release
> CentOS Linux release 7.7.1908 (Core)
>
> I have set up a two-node Axigen mail server cluster with one resource
> group (containing three resources) and two fence devices (one per node).
>
> The hosts file on the nodes is as follows:
>
> # KVM management nodes
> 10.1.4.31 zc-infra-mgmt-node-1
> 10.1.4.20 zc-infra-mgmt-node-2
>
> # Service network
> 10.1.4.22 zc-mail-1.domain.com zc-mail-1
> 10.1.4.23 zc-mail-2.domain.com zc-mail-2
>
> # High-availability network (cross-over link)
> 192.168.1.22 zc-mail-1-ha.domain.local zc-mail-1-ha
> 192.168.1.23 zc-mail-2-ha.domain.local zc-mail-2-ha
>
> The routable network is 10.1.4.0/24. A VIP of 10.1.4.24 is set up as
> part of the HA cluster resources (IPaddr2).
>
> The resources are configured as follows:
>
> # pcs resource show --full
>  Group: zc-mail-res-group
>   Resource: zc-mail-ha-Cfs (class=ocf provider=heartbeat type=Filesystem)
>    Attributes: device=10.1.3.11:6789,10.1.3.12:6789,10.1.3.13:6789:/
>      directory=/var/clusterfs/data/axigen fstype=ceph
>      options=name=email,secretfile=/etc/ceph/ceph.key
>      statusfile_prefix=ceph_fs_checks_
>    Operations: monitor interval=120s (zc-mail-ha-Cfs-monitor-interval-120s)
>                notify interval=0s timeout=120s (zc-mail-ha-Cfs-notify-interval-0s)
>                start interval=0s timeout=120s (zc-mail-ha-Cfs-start-interval-0s)
>                stop interval=0s timeout=120s (zc-mail-ha-Cfs-stop-interval-0s)
>   Resource: zc-mail-ha-vip (class=ocf provider=heartbeat type=IPaddr2)
>    Attributes: cidr_netmask=24 ip=10.1.4.24
>    Operations: monitor interval=120s (zc-mail-ha-vip-monitor-interval-120s)
>                start interval=0s timeout=120s (zc-mail-ha-vip-start-interval-0s)
>                stop interval=0s timeout=120s (zc-mail-ha-vip-stop-interval-0s)
>   Resource: zc-mail-ha-svc (class=lsb type=axigen)
>    Meta Attrs: is-managed=true target-role=Started
>    Operations: force-reload interval=0s timeout=60 (zc-mail-ha-svc-force-reload-interval-0s)
>                monitor interval=30s timeout=120s OCF_CHECK_LEVEL=20 (zc-mail-ha-svc-monitor-interval-30s)
>                restart interval=0s timeout=120s (zc-mail-ha-svc-restart-interval-0s)
>                start interval=0s timeout=120s (zc-mail-ha-svc-start-interval-0s)
>                stop interval=0s timeout=120s (zc-mail-ha-svc-stop-interval-0s)
>
> # pcs stonith show --full
>  Resource: fence_zc-mail-1_virsh (class=stonith type=fence_virsh)
>   Attributes: delay=0 identity_file=/home/lcami/.ssh/id_rsa
>     ipaddr=zc-infra-mgmt-node-1 login=lcami login_timeout=20
>     pcmk_host_check=static-list pcmk_host_list=zc-mail-1-ha
>     pcmk_host_map=zc-mail-1-ha:zc-infra-mgmt-node-1 port=Axigen-Mail-1
>     sudo=1
>   Operations: monitor interval=60s (fence_zc-mail-1_virsh-monitor-interval-60s)
>  Resource: fence_zc-mail-2_virsh (class=stonith type=fence_virsh)
>   Attributes: identity_file=/home/lcami/.ssh/id_rsa
>     ipaddr=zc-infra-mgmt-node-2 login=lcami login_timeout=20
>     pcmk_host_check=static-list pcmk_host_list=zc-mail-2-ha
>     pcmk_host_map=zc-mail-2-ha:zc-infra-mgmt-node-2 port=Axigen-Mail-2
>     sudo=1
>   Operations: monitor interval=60s (fence_zc-mail-2_virsh-monitor-interval-60s)
>
> Every couple of days I used to receive the following error:
>
> Feb 16 00:00:24 [2051] zc-mail-2.zylacloud.com stonith-ng: notice: operation_finished: fence_virsh_monitor_1:21995:stderr [ 2020-02-16 00:00:23,996 ERROR: Connection timed out ]
> Feb 16 00:00:24 [2051] zc-mail-2.zylacloud.com stonith-ng: notice: operation_finished: fence_virsh_monitor_1:21995:stderr [ ]
> Feb 16 00:00:24 [2051] zc-mail-2.zylacloud.com stonith-ng: notice: operation_finished: fence_virsh_monitor_1:21995:stderr [ ]
> Feb 16 00:00:24 [2051] zc-mail-2.zylacloud.com stonith-ng: warning: log_action: fence_virsh[21995] stderr: [ 2020-02-16 00:00:23,996 ERROR: Connection timed out ]
> Feb 16 00:00:24 [2051] zc-mail-2.zylacloud.com stonith-ng: warning: log_action: fence_virsh[21995] stderr: [ ]
> Feb 16 00:00:24 [2051] zc-mail-2.zylacloud.com stonith-ng: warning: log_action: fence_virsh[21995] stderr: [ ]
> Feb 16 00:00:24 [2051] zc-mail-2.zylacloud.com stonith-ng: notice: log_operation: Operation 'monitor' [21995] for device 'fence_zc-mail-1_virsh' returned: -62 (Timer expired)
> Feb 16 00:00:24 [2052] zc-mail-2.zylacloud.com lrmd: info: log_finished: finished - rsc:fence_zc-mail-1_virsh action:start call_id:85 exit-code:1 exec-time:5449ms queue-time:0ms
>
> I concluded this was a problem with the login timeout (which was
> 5 seconds). I therefore increased that timeout to 20 seconds, but
> the timeouts persisted:
>
> Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng: notice: operation_finished: fence_virsh_monitor_1:20006:stderr [ 2020-02-23 00:00:21,102 ERROR: Connection timed out ]
> Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng: notice: operation_finished: fence_virsh_monitor_1:20006:stderr [ ]
> Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng: notice: operation_finished: fence_virsh_monitor_1:20006:stderr [ ]
> Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng: warning: log_action: fence_virsh[20006] stderr: [ 2020-02-23 00:00:21,102 ERROR: Connection timed out ]
> Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng: warning: log_action: fence_virsh[20006] stderr: [ ]
> Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng: warning: log_action: fence_virsh[20006] stderr: [ ]
> Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng: notice: log_operation: Operation 'monitor' [20006] for device 'fence_zc-mail-1_virsh' returned: -62 (Timer expired)
> Feb 23 00:00:21 [24637] zc-mail-2.zylacloud.com crmd: error: process_lrm_event: Result of monitor operation for fence_zc-mail-1_virsh on zc-mail-2-ha: Timed Out | call=30 key=fence_zc-mail-1_virsh_monitor_60000 timeout=20000ms
>
> There are also constraints, shown below, so that each fencing "agent"
> runs on the node opposite the one it fences:
>
> # pcs constraint show --full
> Location Constraints:
>   Resource: fence_zc-mail-1_virsh
>     Enabled on: zc-mail-2-ha (score:INFINITY) (role: Started) (id:cli-prefer-fence_zc-mail-1_virsh)
>   Resource: fence_zc-mail-2_virsh
>     Enabled on: zc-mail-1-ha (score:INFINITY) (role: Started) (id:cli-prefer-fence_zc-mail-2_virsh)
> Ordering Constraints:
> Colocation Constraints:
> Ticket Constraints:
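Worth noting: the `timeout=20000ms` in the crmd error is Pacemaker's per-operation timeout on the monitor action (the 20s default, since no `timeout` is set on the stonith monitor operations above), which is separate from fence_virsh's `login_timeout` parameter. If the midnight slowness on the host cannot be eliminated, the operation timeout itself could be raised. A minimal sketch, assuming the resource names from the configuration above and an arbitrary 60s value:

```shell
# Raise Pacemaker's monitor operation timeout on both fence devices
# (fence_virsh's own login_timeout=20 attribute is left unchanged).
pcs stonith update fence_zc-mail-1_virsh op monitor interval=60s timeout=60s
pcs stonith update fence_zc-mail-2_virsh op monitor interval=60s timeout=60s
```

This only masks the symptom, of course; a fence monitor that takes close to a minute still points at a host-side problem.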
I notice that the issue happens at 00:00 on both days. Have you checked for a backup or other cron job that is 'overloading' the virtualization host? Anything in the libvirt logs or in the hosts' /var/log/messages?

Best Regards,
Strahil Nikolov

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
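Strahil's suggestion could be checked along these lines on each virtualization host (zc-infra-mgmt-node-1/2); a diagnostic sketch only, with the date taken from the second log excerpt:

```shell
# Look for jobs scheduled around midnight on the KVM host
cat /etc/crontab
ls /etc/cron.d/ /etc/cron.daily/
crontab -l

# Correlate with system and libvirtd activity in the minutes after 00:00
grep 'Feb 23 00:0' /var/log/messages
journalctl -u libvirtd --since "2020-02-23 00:00:00" --until "2020-02-23 00:05:00"
```

On CentOS 7 the nightly cron.daily run (logrotate, mlocate's updatedb, etc.) is a common source of short I/O spikes, so the cron directories are a reasonable first place to look.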
