Nice to know. Yet, if the monitoring of that fencing device failed, most probably vCenter was not responding or unreachable - that's why I suggested sbd.
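A rough sketch of that setup (untested; the shared-disk path and the 'sbdfence' resource name are placeholders, and disk-based sbd also needs a working watchdog on each node):

    # one-time: write sbd metadata to the shared vmdk (destroys its contents)
    sbd -d /dev/disk/by-id/scsi-SHARED_VMDK create

    # on both nodes: point sbd at the disk and enable the daemon
    #   /etc/sysconfig/sbd:  SBD_DEVICE="/dev/disk/by-id/scsi-SHARED_VMDK"
    systemctl enable sbd

    # poison-pill fencing through the shared disk
    pcs stonith create sbdfence fence_sbd \
        devices=/dev/disk/by-id/scsi-SHARED_VMDK

    # fencing order: try vmfence first, fall back to sbd
    pcs stonith level add 1 srv1 vmfence
    pcs stonith level add 2 srv1 sbdfence
    pcs stonith level add 1 srv2 vmfence
    pcs stonith level add 2 srv2 sbdfence

The sbd daemon has to be up before pacemaker starts, which is why enabling it needs the one-time cluster downtime mentioned below.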
Best Regards,
Strahil Nikolov

On June 18, 2020 at 18:24:48 GMT+03:00, Ken Gaillot <kgail...@redhat.com> wrote:
>Note that a failed start of a stonith device will not prevent the
>cluster from using that device for fencing. It just prevents the
>cluster from monitoring the device.
>
>On Thu, 2020-06-18 at 08:20 +0000, Strahil Nikolov wrote:
>> What about a second fencing mechanism?
>> You can add a shared (independent) vmdk as an sbd device. The
>> reconfiguration will require cluster downtime, but this is only
>> necessary once.
>> Once two fencing mechanisms are available, you can configure the
>> order easily.
>> Best Regards,
>> Strahil Nikolov
>>
>> On Thursday, June 18, 2020 at 10:29:22 GMT+3, Ulrich Windl <
>> ulrich.wi...@rz.uni-regensburg.de> wrote:
>>
>> Hi!
>>
>> I can't give much detailed advice, but I think any network service
>> should have a timeout of at least 30 seconds (you have
>> timeout=20000ms).
>>
>> And "after 1000000 failures" is symbolic, not literal: it means it
>> failed too often, so the cluster won't retry.
>>
>> Regards,
>> Ulrich
>>
>> > > > Howard <hmon...@gmail.com> wrote on 17.06.2020 at 21:05 in
>> > > > message
>> <2817_1592420740_5EEA6983_2817_3_1_CAO51vj6oXjfvhGQz7oOu=Pi+D_cKh5M1gfDL_2tAbKmwmq...@mail.gmail.com>:
>> > Hello, recently I received some really great advice from this
>> > community regarding changing the token timeout value in corosync.
>> > Thank you! Since then the cluster has been working perfectly with
>> > no errors in the log for more than a week.
>> >
>> > This morning I logged in to find a stopped stonith device. If I'm
>> > reading the log right, it looks like it failed 1 million times in
>> > ~20 seconds then gave up. If you wouldn't mind looking at the logs
>> > below, is there some way that I can make this more robust so that
>> > it can recover? I'll be investigating the reason for the timeout
>> > but would like to help the system recover on its own.
>> >
>> > Servers: RHEL 8.2
>> >
>> > Cluster name: cluster_pgperf2
>> > Stack: corosync
>> > Current DC: srv1 (version 2.0.2-3.el8_1.2-744a30d655) - partition with quorum
>> > Last updated: Wed Jun 17 11:47:42 2020
>> > Last change: Tue Jun 16 22:00:29 2020 by root via crm_attribute on srv1
>> >
>> > 2 nodes configured
>> > 4 resources configured
>> >
>> > Online: [ srv1 srv2 ]
>> >
>> > Full list of resources:
>> >
>> >  Clone Set: pgsqld-clone [pgsqld] (promotable)
>> >      Masters: [ srv1 ]
>> >      Slaves: [ srv2 ]
>> >  pgsql-master-ip  (ocf::heartbeat:IPaddr2):  Started srv1
>> >  vmfence  (stonith:fence_vmware_soap):  Stopped
>> >
>> > Failed Resource Actions:
>> > * vmfence_start_0 on srv2 'OCF_TIMEOUT' (198): call=19, status=Timed Out,
>> > exitreason='',
>> >     last-rc-change='Wed Jun 17 08:34:16 2020', queued=7ms, exec=20184ms
>> > * vmfence_start_0 on srv1 'OCF_TIMEOUT' (198): call=44, status=Timed Out,
>> > exitreason='',
>> >     last-rc-change='Wed Jun 17 08:33:55 2020', queued=0ms, exec=20008ms
>> >
>> > Daemon Status:
>> >   corosync: active/disabled
>> >   pacemaker: active/disabled
>> >   pcsd: active/enabled
>> >
>> > pcs resource config
>> >  Clone: pgsqld-clone
>> >   Meta Attrs: notify=true promotable=true
>> >   Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
>> >    Attributes: bindir=/usr/bin pgdata=/var/lib/pgsql/data
>> >    Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
>> >                methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
>> >                monitor interval=15s role=Master timeout=60s (pgsqld-monitor-interval-15s)
>> >                monitor interval=16s role=Slave timeout=60s (pgsqld-monitor-interval-16s)
>> >                notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
>> >                promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
>> >                reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
>> >                start interval=0s timeout=60s (pgsqld-start-interval-0s)
>> >                stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
>> >                monitor interval=60s timeout=60s (pgsqld-monitor-interval-60s)
>> >  Resource: pgsql-master-ip (class=ocf provider=heartbeat type=IPaddr2)
>> >   Attributes: cidr_netmask=24 ip=xxx.xxx.xxx.xxx
>> >   Operations: monitor interval=10s (pgsql-master-ip-monitor-interval-10s)
>> >               start interval=0s timeout=20s (pgsql-master-ip-start-interval-0s)
>> >               stop interval=0s timeout=20s (pgsql-master-ip-stop-interval-0s)
>> >
>> > pcs stonith config
>> >  Resource: vmfence (class=stonith type=fence_vmware_soap)
>> >   Attributes: ipaddr=xxx.xxx.xxx.xxx login=xxxx\xxxxxxxx
>> > passwd_script=xxxxxxxx pcmk_host_map=srv1:xxxxxxxxx;srv2:yyyyyyyyy
>> > ssl=1 ssl_insecure=1
>> >   Operations: monitor interval=60s (vmfence-monitor-interval-60s)
>> >
>> > pcs resource failcount show
>> > Failcounts for resource 'vmfence'
>> >   srv1: INFINITY
>> >   srv2: INFINITY
>> >
>> > Here are the versions installed:
>> > [postgres@srv1 cluster]$ rpm -qa|grep
>> > "pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"
>> > corosync-3.0.2-3.el8_1.1.x86_64
>> > corosync-qdevice-3.0.0-2.el8.x86_64
>> > corosync-qnetd-3.0.0-2.el8.x86_64
>> > corosynclib-3.0.2-3.el8_1.1.x86_64
>> > fence-agents-vmware-soap-4.2.1-41.el8.noarch
>> > pacemaker-2.0.2-3.el8_1.2.x86_64
>> > pacemaker-cli-2.0.2-3.el8_1.2.x86_64
>> > pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
>> > pacemaker-libs-2.0.2-3.el8_1.2.x86_64
>> > pacemaker-schemas-2.0.2-3.el8_1.2.noarch
>> > pcs-0.10.2-4.el8.x86_64
>> > resource-agents-paf-2.3.0-1.noarch
>> >
>> > Here are the errors and warnings from the pacemaker.log from the
>> > first warning until it gave up.
>> >
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1
>> > pacemaker-fenced [26722] (child_timeout_callback) warning:
>> > fence_vmware_soap_monitor_1 process (PID 43095) timed out
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1
>> > pacemaker-fenced [26722] (operation_finished) warning:
>> > fence_vmware_soap_monitor_1:43095 - timed out after 20000ms
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1
>> > pacemaker-controld [26726] (process_lrm_event) error: Result of
>> > monitor operation for vmfence on srv1: Timed Out | call=39
>> > key=vmfence_monitor_60000 timeout=20000ms
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1
>> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:
>> > Processing failed monitor of vmfence on srv1: OCF_TIMEOUT | rc=198
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1
>> > pacemaker-fenced [26722] (child_timeout_callback) warning:
>> > fence_vmware_soap_monitor_1 process (PID 43215) timed out
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1
>> > pacemaker-fenced [26722] (operation_finished) warning:
>> > fence_vmware_soap_monitor_1:43215 - timed out after 20000ms
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1
>> > pacemaker-controld [26726] (process_lrm_event) error: Result of
>> > start operation for vmfence on srv1: Timed Out | call=44
>> > key=vmfence_start_0 timeout=20000ms
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1
>> > pacemaker-controld [26726] (status_from_rc) warning: Action 39
>> > (vmfence_start_0) on srv1 failed (target: 0 vs. rc: 198): Error
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1
>> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:
>> > Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1
>> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:
>> > Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1
>> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:
>> > Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1
>> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:
>> > Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1
>> > pacemaker-schedulerd[26725] (check_migration_threshold) warning:
>> > Forcing vmfence away from srv1 after 1000000 failures (max=5)
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1
>> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:
>> > Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1
>> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:
>> > Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1
>> > pacemaker-schedulerd[26725] (check_migration_threshold) warning:
>> > Forcing vmfence away from srv1 after 1000000 failures (max=5)
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1
>> > pacemaker-controld [26726] (status_from_rc) warning: Action 38
>> > (vmfence_start_0) on srv2 failed (target: 0 vs. rc: 198): Error
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1
>> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:
>> > Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1
>> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:
>> > Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1
>> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:
>> > Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1
>> > pacemaker-schedulerd[26725] (check_migration_threshold) warning:
>> > Forcing vmfence away from srv1 after 1000000 failures (max=5)
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1
>> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:
>> > Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1
>> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:
>> > Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1
>> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:
>> > Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1
>> > pacemaker-schedulerd[26725] (check_migration_threshold) warning:
>> > Forcing vmfence away from srv1 after 1000000 failures (max=5)
>> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1
>> > pacemaker-schedulerd[26725] (check_migration_threshold) warning:
>> > Forcing vmfence away from srv2 after 1000000 failures (max=5)

>-- 
>Ken Gaillot <kgail...@redhat.com>
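Putting Ulrich's timeout advice and Ken's note together, something along these lines might let vmfence recover on its own (an untested sketch; the 60s and 600s values are illustrative, and the exact syntax should be checked against the pcs 0.10 on these servers):

    # give the fence agent more time than the 20s default, per Ulrich's advice
    pcs stonith update vmfence op monitor interval=60s timeout=60s \
        op start interval=0s timeout=60s

    # let fail counts expire so the cluster retries the start by itself
    # instead of staying at INFINITY forever
    pcs stonith update vmfence meta failure-timeout=600s

    # clear the current INFINITY fail counts right away
    pcs stonith cleanup vmfence

As Ken notes above, even while vmfence shows Stopped the cluster can still use it to fence; these commands only restore monitoring and automatic retries.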