Dear Reid,

Thanks for the reply. Yes, lots of pacemaker logs - I have included just over a minute of them below and 5m of them as an attached .log file. The same behaviour occurs for a period of roughly 6 minutes before the corosync shutdown happens and can be summarised like so:

  • Both cluster nodes (test cluster-c1 and testcluster-c2) are online
  • High CPU load is detected on the active cluster node (testcluster-c1)
  • High CPU load (presumably) leads to monitor operations on cluster resources timing out
  • Pacemaker attempts to recover/restart managed resources several times - unsuccessfully
  • Cluster processes start shutting down
  • Pacemaker is responsible for corosync shutting down (pcmkd_shutdown_corosync)      info: Asking Corosync to shut down)

This leads to several more questions…

  • Why does the corosync log say ’shutdown by sysadmin’ when the shutdown was triggered by pacemaker? Isn’t this misleading?
  • Why was there no transition to the other node?
  • What is the purpose of shutting down corosync? Isn’t this how the 2 cluster nodes communicate with each other? Since testcluster-c1 requests a shutdown - and this is acknowledged by testcluster-c2 - isn’t corosync then required for the transition to occur?

Any help is much appreciated!

Cheers

Alex

Attachment: pacemaker.log.gz
Description: GNU Zip compressed data


Apr 23 11:04:59.462 testcluster-c1 pacemaker-execd     [1295872] (async_action_complete)        warning: virtual_ip_monitor_30000[724966] timed out after 20000ms
Apr 23 11:05:20.190 testcluster-c1 pacemaker-execd     [1295872] (async_action_complete)        warning: PingChecks_monitor_10000[724970] timed out after 30000ms
Apr 23 11:05:24.754 testcluster-c1 pacemaker-controld  [1295875] (throttle_check_thresholds)    notice: High CPU load detected: 108.660004
Apr 23 11:05:29.558 testcluster-c1 pacemaker-controld  [1295875] (qb_ipcs_us_withdraw)  info: withdrawing server sockets
Apr 23 11:05:30.270 testcluster-c1 pacemaker-controld  [1295875] (pe_ipc_destroy)       info: Connection to the scheduler released
Apr 23 11:05:34.982 testcluster-c1 pacemaker-controld  [1295875] (tengine_stonith_connection_destroy)   info: Fencing daemon disconnected
Apr 23 11:05:35.350 testcluster-c1 pacemaker-controld  [1295875] (crmd_exit)    notice: Forcing immediate exit with status 100 (Fatal error occurred, will not respawn)
Apr 23 11:05:35.538 testcluster-c1 pacemaker-controld  [1295875] (crm_xml_cleanup)      info: Cleaning up memory from libxml2
Apr 23 11:05:35.850 testcluster-c1 pacemaker-controld  [1295875] (crm_exit)     info: Exiting pacemaker-controld | with status 100
Apr 23 11:05:38.630 testcluster-c1 pacemaker-execd     [1295872] (cancel_recurring_action)      info: Cancelling ocf operation virtual_ip_monitor_30000
Apr 23 11:05:38.630 testcluster-c1 pacemaker-execd     [1295872] (services_action_cancel)       info: Terminating in-flight op virtual_ip_monitor_30000[724993] early because it was cancelled
Apr 23 11:05:38.610 testcluster-c1 pacemakerd          [1295869] (pcmk_child_exit)      warning: Shutting cluster down because pacemaker-controld[1295875] had fatal failure
Apr 23 11:05:38.630 testcluster-c1 pacemakerd          [1295869] (pcmk_shutdown_worker)         notice: Shutting down Pacemaker
Apr 23 11:05:38.630 testcluster-c1 pacemakerd          [1295869] (stop_child)   notice: Stopping pacemaker-schedulerd | sent signal 15 to process 1295874
Apr 23 11:05:38.634 testcluster-c1 pacemaker-execd     [1295872] (async_action_complete)        info: virtual_ip_monitor_30000[724993] terminated with signal 9 (Killed)
Apr 23 11:05:38.634 testcluster-c1 pacemaker-execd     [1295872] (cancel_recurring_action)      info: Cancelling ocf operation virtual_ip_monitor_30000
Apr 23 11:05:38.650 testcluster-c1 pacemaker-execd     [1295872] (send_client_notify)   warning: Could not notify client crmd: Bad file descriptor | rc=9
Apr 23 11:05:38.650 testcluster-c1 pacemaker-execd     [1295872] (cancel_recurring_action)      info: Cancelling systemd operation docker-services_status_60000
Apr 23 11:05:38.650 testcluster-c1 pacemaker-execd     [1295872] (cancel_recurring_action)      info: Cancelling ocf operation PingChecks_monitor_10000
Apr 23 11:05:38.650 testcluster-c1 pacemaker-execd     [1295872] (services_action_cancel)       info: Terminating in-flight op PingChecks_monitor_10000[724994] early because it was cancelled
Apr 23 11:05:38.650 testcluster-c1 pacemaker-execd     [1295872] (async_action_complete)        info: PingChecks_monitor_10000[724994] terminated with signal 9 (Killed)
Apr 23 11:05:38.650 testcluster-c1 pacemaker-execd     [1295872] (cancel_recurring_action)      info: Cancelling ocf operation PingChecks_monitor_10000
Apr 23 11:05:38.654 testcluster-c1 pacemaker-execd     [1295872] (cancel_recurring_action)      info: Cancelling ocf operation DrbdFS_monitor_20000
Apr 23 11:05:38.654 testcluster-c1 pacemaker-execd     [1295872] (services_action_cancel)       info: Terminating in-flight op DrbdFS_monitor_20000[724990] early because it was cancelled
Apr 23 11:05:38.662 testcluster-c1 pacemaker-execd     [1295872] (async_action_complete)        info: DrbdFS_monitor_20000[724990] terminated with signal 9 (Killed)
Apr 23 11:05:38.662 testcluster-c1 pacemaker-execd     [1295872] (cancel_recurring_action)      info: Cancelling ocf operation DrbdFS_monitor_20000
Apr 23 11:05:38.802 testcluster-c1 pacemaker-schedulerd[1295874] (pcmk__set_instance_roles)     info: Promoting DrbdData:0 (Promoted testcluster-c1)
Apr 23 11:05:38.802 testcluster-c1 pacemaker-schedulerd[1295874] (pcmk__set_instance_roles)     info: DrbdData-clone: Promoted 1 instances of a possible 1
Apr 23 11:05:38.806 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: Diff: --- 0.30.149 2
Apr 23 11:05:38.806 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: Diff: +++ 0.30.150 c0591b944f7de4d8a5b8ce02555e8983
Apr 23 11:05:38.806 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: -- /cib/status/node_state[@id='1']/transient_attributes[@id='1']
Apr 23 11:05:38.806 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: +  /cib:  @num_updates=150
Apr 23 11:05:38.806 testcluster-c1 pacemaker-schedulerd[1295874] (RecurringOp)  info:  Start recurring monitor (20s) for DrbdFS on testcluster-c1
Apr 23 11:05:38.806 testcluster-c1 pacemaker-schedulerd[1295874] (rsc_action_default)   info: Leave   DrbdData:0        (Promoted testcluster-c1)
Apr 23 11:05:38.806 testcluster-c1 pacemaker-schedulerd[1295874] (rsc_action_default)   info: Leave   DrbdData:1        (Unpromoted testcluster-c2)
Apr 23 11:05:38.806 testcluster-c1 pacemaker-schedulerd[1295874] (log_list_item)        notice: Actions: Recover    DrbdFS              ( testcluster-c1 )
Apr 23 11:05:38.806 testcluster-c1 pacemaker-schedulerd[1295874] (log_list_item)        notice: Actions: Restart    docker-services     ( testcluster-c1 )  due to required DrbdFS start
Apr 23 11:05:38.806 testcluster-c1 pacemaker-schedulerd[1295874] (log_list_item)        notice: Actions: Restart    virtual_ip          ( testcluster-c1 )  due to required docker-services start
Apr 23 11:05:38.806 testcluster-c1 pacemaker-schedulerd[1295874] (rsc_action_default)   info: Leave   PingChecks:0      (Started testcluster-c1)
Apr 23 11:05:38.810 testcluster-c1 pacemaker-schedulerd[1295874] (rsc_action_default)   info: Leave   PingChecks:1      (Started testcluster-c2)
Apr 23 11:05:39.186 testcluster-c1 pacemaker-schedulerd[1295874] (handle_pecalc_op)     error: Couldn't send transition graph to peer, writing to /var/lib/pacemaker/pengine/pengine.graph.3KGsrP instead
Apr 23 11:05:39.442 testcluster-c1 pacemaker-schedulerd[1295874] (pcmk__log_transition_summary)         notice: Calculated transition 495, saving inputs in /var/lib/pacemaker/pengine/pe-input-28.bz2
Apr 23 11:05:39.570 testcluster-c1 pacemaker-schedulerd[1295874] (handle_pecalc_op)     info: Input has not changed since last time, not saving to disk
Apr 23 11:05:41.426 testcluster-c1 pacemaker-based     [1295870] (cib_process_request)  info: Completed cib_delete operation for section //node_state[@uname='testcluster-c1']/transient_attributes: OK (rc=0, origin=testcluster-c2/crmd/27, version=0.30.150)
Apr 23 11:05:41.998 testcluster-c1 pacemaker-based     [1295870] (cib_process_request)  info: Completed cib_modify operation for section cib: OK (rc=0, origin=testcluster-c2/crmd/29, version=0.30.150)
Apr 23 11:05:42.002 testcluster-c1 pacemaker-schedulerd[1295874] (unpack_config)        notice: On loss of quorum: Ignore
Apr 23 11:05:42.002 testcluster-c1 pacemaker-schedulerd[1295874] (determine_online_status)      info: Node testcluster-c1 is online
Apr 23 11:05:42.002 testcluster-c1 pacemaker-schedulerd[1295874] (determine_online_status)      info: Node testcluster-c2 is online
Apr 23 11:05:42.002 testcluster-c1 pacemaker-schedulerd[1295874] (pe_get_failcount)     info: virtual_ip has failed 1 times on testcluster-c1
Apr 23 11:05:42.002 testcluster-c1 pacemaker-schedulerd[1295874] (pe__clear_failcount)  notice: Clearing failure of virtual_ip on testcluster-c1 because it expired | virtual_ip_clear_failcount_0
Apr 23 11:05:42.002 testcluster-c1 pacemaker-schedulerd[1295874] (pe_get_failcount)     info: virtual_ip has failed 1 times on testcluster-c1
Apr 23 11:05:42.002 testcluster-c1 pacemaker-schedulerd[1295874] (pe__clear_failcount)  notice: Clearing failure of virtual_ip on testcluster-c1 because it expired | virtual_ip_clear_failcount_0
Apr 23 11:05:42.002 testcluster-c1 pacemaker-schedulerd[1295874] (pe_get_failcount)     info: virtual_ip has failed 1 times on testcluster-c1
Apr 23 11:05:42.002 testcluster-c1 pacemaker-schedulerd[1295874] (pe__clear_failcount)  notice: Clearing failure of virtual_ip on testcluster-c1 because it expired | virtual_ip_clear_failcount_0
Apr 23 11:05:42.002 testcluster-c1 pacemaker-schedulerd[1295874] (unpack_rsc_op)        notice: Rescheduling virtual_ip_monitor_30000 after failure expired on testcluster-c1 | actual=1 expected=0 magic=2:1;43:2:0:3f9900ac-185f-4dc1-972d-934d41ca6dc9
Apr 23 11:05:42.002 testcluster-c1 pacemaker-schedulerd[1295874] (pe_get_failcount)     info: PingChecks:0 has failed 1 times on testcluster-c1
Apr 23 11:05:42.002 testcluster-c1 pacemaker-schedulerd[1295874] (pe__clear_failcount)  notice: Clearing failure of PingChecks:0 on testcluster-c1 because it expired | PingChecks:0_clear_failcount_0
Apr 23 11:05:42.002 testcluster-c1 pacemaker-schedulerd[1295874] (pe_get_failcount)     info: PingChecks:0 has failed 1 times on testcluster-c1
Apr 23 11:05:42.002 testcluster-c1 pacemaker-schedulerd[1295874] (pe__clear_failcount)  notice: Clearing failure of PingChecks:0 on testcluster-c1 because it expired | PingChecks:0_clear_failcount_0
Apr 23 11:05:42.002 testcluster-c1 pacemaker-schedulerd[1295874] (pe_get_failcount)     info: PingChecks:0 has failed 1 times on testcluster-c1
Apr 23 11:05:42.002 testcluster-c1 pacemaker-schedulerd[1295874] (pe__clear_failcount)  notice: Clearing failure of PingChecks:0 on testcluster-c1 because it expired | PingChecks:0_clear_failcount_0
Apr 23 11:05:42.002 testcluster-c1 pacemaker-schedulerd[1295874] (unpack_rsc_op)        notice: Rescheduling PingChecks_monitor_10000 after failure expired on testcluster-c1 | actual=1 expected=0 magic=2:1;51:4:0:a085676a-5b2b-4928-83f2-c01315e5aec5
Apr 23 11:05:42.002 testcluster-c1 pacemaker-schedulerd[1295874] (unpack_rsc_op_failure)        warning: Unexpected result (error: Process did not exit within specified timeout) was recorded for monitor of DrbdFS on testcluster-c1 at Apr 23 11:02:43 2024 | rc=1 id=DrbdFS_last_failure
_0
Apr 23 11:05:43.334 testcluster-c1 pacemaker-based     [1295870] (cib_process_request)  info: Completed cib_modify operation for section crm_config: OK (rc=0, origin=testcluster-c2/crmd/31, version=0.30.150)
Apr 23 11:05:43.334 testcluster-c1 pacemaker-schedulerd[1295874] (pe_get_failcount)     info: DrbdFS has failed 1 times on testcluster-c1
Apr 23 11:05:43.334 testcluster-c1 pacemaker-schedulerd[1295874] (pcmk__threshold_reached)      info: DrbdFS can fail 999999 more times on testcluster-c1 before reaching migration threshold (1000000)
Apr 23 11:05:44.022 testcluster-c1 pacemaker-based     [1295870] (cib_process_request)  info: Completed cib_modify operation for section crm_config: OK (rc=0, origin=testcluster-c2/crmd/33, version=0.30.150)
Apr 23 11:05:46.818 testcluster-c1 pacemaker-based     [1295870] (cib_process_request)  info: Completed cib_modify operation for section crm_config: OK (rc=0, origin=testcluster-c2/crmd/35, version=0.30.150)
Apr 23 11:05:47.342 testcluster-c1 pacemaker-based     [1295870] (cib_process_replace)  info: Digest matched on replace from testcluster-c2: c0591b944f7de4d8a5b8ce02555e8983
Apr 23 11:05:47.414 testcluster-c1 pacemaker-based     [1295870] (cib_process_replace)  info: Replaced 0.30.150 with 0.30.150 from testcluster-c2
Apr 23 11:05:48.786 testcluster-c1 pacemaker-based     [1295870] (cib_process_request)  info: Completed cib_replace operation for section 'all': OK (rc=0, origin=testcluster-c2/crmd/38, version=0.30.150)
Apr 23 11:05:48.794 testcluster-c1 pacemaker-schedulerd[1295874] (pcmk__set_instance_roles)     info: Promoting DrbdData:0 (Promoted testcluster-c1)
Apr 23 11:05:48.794 testcluster-c1 pacemaker-schedulerd[1295874] (pcmk__set_instance_roles)     info: DrbdData-clone: Promoted 1 instances of a possible 1
Apr 23 11:05:48.794 testcluster-c1 pacemaker-schedulerd[1295874] (RecurringOp)  info:  Start recurring monitor (20s) for DrbdFS on testcluster-c1
Apr 23 11:05:48.798 testcluster-c1 pacemaker-schedulerd[1295874] (rsc_action_default)   info: Leave   DrbdData:0        (Promoted testcluster-c1)
Apr 23 11:05:48.798 testcluster-c1 pacemaker-schedulerd[1295874] (rsc_action_default)   info: Leave   DrbdData:1        (Unpromoted testcluster-c2)
Apr 23 11:05:48.798 testcluster-c1 pacemaker-schedulerd[1295874] (log_list_item)        notice: Actions: Recover    DrbdFS              ( testcluster-c1 )
Apr 23 11:05:48.798 testcluster-c1 pacemaker-schedulerd[1295874] (log_list_item)        notice: Actions: Restart    docker-services     ( testcluster-c1 )  due to required DrbdFS start
Apr 23 11:05:48.798 testcluster-c1 pacemaker-schedulerd[1295874] (log_list_item)        notice: Actions: Restart    virtual_ip          ( testcluster-c1 )  due to required docker-services start
Apr 23 11:05:48.798 testcluster-c1 pacemaker-schedulerd[1295874] (rsc_action_default)   info: Leave   PingChecks:0      (Started testcluster-c1)
Apr 23 11:05:48.798 testcluster-c1 pacemaker-schedulerd[1295874] (rsc_action_default)   info: Leave   PingChecks:1      (Started testcluster-c2)
Apr 23 11:05:48.802 testcluster-c1 pacemaker-schedulerd[1295874] (pcmk__log_transition_summary)         notice: Calculated transition 496, saving inputs in /var/lib/pacemaker/pengine/pe-input-29.bz2
Apr 23 11:05:48.806 testcluster-c1 pacemaker-based     [1295870] (cib_process_request)  info: Completed cib_modify operation for section nodes: OK (rc=0, origin=testcluster-c2/crmd/39, version=0.30.150)
Apr 23 11:05:48.818 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: Diff: --- 0.30.150 2
Apr 23 11:05:48.818 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: Diff: +++ 0.30.151 (null)
Apr 23 11:05:48.818 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: -- /cib/status/node_state[@id='2']/lrm[@id='2']
Apr 23 11:05:48.818 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: +  /cib:  @num_updates=151
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_process_request)  info: Completed cib_delete operation for section //node_state[@uname='testcluster-c2']/lrm: OK (rc=0, origin=testcluster-c2/crmd/40, version=0.30.151)
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: Diff: --- 0.30.151 2
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: Diff: +++ 0.30.152 (null)
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: +  /cib:  @num_updates=152
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: +  /cib/status/node_state[@id='2']:  @crm-debug-origin=do_lrm_query_internal
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++ /cib/status/node_state[@id='2']:  <lrm id="2"/>
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                     <lrm_resources>
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                       <lrm_resource id="virtual_ip" type="IPaddr2" class="ocf" provider="heartbeat">
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                         <lrm_rsc_op id="virtual_ip_last_0" operation_key="virtual_ip_monitor_0" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.
11.0" transition-key="5:2:7:3f9900ac-185f-4dc1-972d-934d41ca6dc9" transition-magic="0:7;5:2:7:3f9900ac-185f-4dc1-972d-934d41ca6dc9" exit-reason="" _on_node_="testcluster-c2" call-id="21" rc-code="7" op-status="0" interval="0" last-rc-change="1713422197" exec-time="89" queue-time="0" op
-digest="3a768873ca9f51044dbe8d63366fb7ce"/>
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                       </lrm_resource>
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                       <lrm_resource id="DrbdData" type="drbd" class="ocf" provider="linbit">
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                         <lrm_rsc_op id="DrbdData_last_0" operation_key="DrbdData_start_0" operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.11.0" tr
ansition-key="12:1:0:3f9900ac-185f-4dc1-972d-934d41ca6dc9" transition-magic="0:0;12:1:0:3f9900ac-185f-4dc1-972d-934d41ca6dc9" exit-reason="" _on_node_="testcluster-c2" call-id="11" rc-code="0" op-status="0" interval="0" last-rc-change="1713422196" exec-time="985" queue-time="0" op-dige
st="c0e018b73fdf522b6cdd355e125af15e" op-force-restart="  drbd_resource  " op-restart-digest="c0e018b73fdf522b6cdd355e125af15e"/>
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                         <lrm_rsc_op id="DrbdData_monitor_60000" operation_key="DrbdData_monitor_60000" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_
set="3.11.0" transition-key="13:2:0:3f9900ac-185f-4dc1-972d-934d41ca6dc9" transition-magic="0:0;13:2:0:3f9900ac-185f-4dc1-972d-934d41ca6dc9" exit-reason="" _on_node_="testcluster-c2" call-id="13" rc-code="0" op-status="0" interval="60000" last-rc-change="1713422197" exec-time="129" que
ue-time="0" op-digest="db2fb3216fa7207fdc0b2bf31bbd09c1"/>
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                       </lrm_resource>
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                       <lrm_resource id="docker-services" type="docker-services" class="systemd">
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                         <lrm_rsc_op id="docker-services_last_0" operation_key="docker-services_monitor_0" operation="monitor" crm-debug-origin="build_active_RAs" crm_featu
re_set="3.11.0" transition-key="4:2:7:3f9900ac-185f-4dc1-972d-934d41ca6dc9" transition-magic="0:7;4:2:7:3f9900ac-185f-4dc1-972d-934d41ca6dc9" exit-reason="" _on_node_="testcluster-c2" call-id="17" rc-code="7" op-status="0" interval="0" last-rc-change="1713422197" exec-time="6" queue-ti
me="0" op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                       </lrm_resource>
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                       <lrm_resource id="PingChecks" type="ping" class="ocf" provider="pacemaker">
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                         <lrm_rsc_op id="PingChecks_last_0" operation_key="PingChecks_start_0" operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.11.0
" transition-key="50:2:0:3f9900ac-185f-4dc1-972d-934d41ca6dc9" transition-magic="0:0;50:2:0:3f9900ac-185f-4dc1-972d-934d41ca6dc9" exit-reason="" _on_node_="testcluster-c2" call-id="27" rc-code="0" op-status="0" interval="0" last-rc-change="1713422197" exec-time="9196" queue-time="0" op
-digest="1ac9a59a0519ceb6f9188233ae443791" op-force-restart="  use_fping failure_score options host_list multiplier name pidfile  " op-restart-digest="e460710e452f12aebd0abcae1b26e5f9"/>
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                         <lrm_rsc_op id="PingChecks_monitor_10000" operation_key="PingChecks_monitor_10000" operation="monitor" crm-debug-origin="build_active_RAs" crm_feat
ure_set="3.11.0" transition-key="51:2:0:3f9900ac-185f-4dc1-972d-934d41ca6dc9" transition-magic="0:0;51:2:0:3f9900ac-185f-4dc1-972d-934d41ca6dc9" exit-reason="" _on_node_="testcluster-c2" call-id="28" rc-code="0" op-status="0" interval="10000" last-rc-change="1713422207" exec-time="9281
" queue-time="0" op-digest="56e4de959c35addb7fded1aeb1ae4e30"/>
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                       </lrm_resource>
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                       <lrm_resource id="DrbdFS" type="Filesystem" class="ocf" provider="heartbeat">
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                         <lrm_rsc_op id="DrbdFS_last_0" operation_key="DrbdFS_monitor_0" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.11.0" tr
ansition-key="4:1:7:3f9900ac-185f-4dc1-972d-934d41ca6dc9" transition-magic="0:7;4:1:7:3f9900ac-185f-4dc1-972d-934d41ca6dc9" exit-reason="" _on_node_="testcluster-c2" call-id="10" rc-code="7" op-status="0" interval="0" last-rc-change="1713422196" exec-time="121" queue-time="0" op-digest
="40b189eb6bf31d9b108771148daa913b"/>
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                       </lrm_resource>
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                     </lrm_resources>
Apr 23 11:05:48.822 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: ++                                   </lrm>
Apr 23 11:05:48.826 testcluster-c1 pacemaker-based     [1295870] (cib_process_request)  info: Completed cib_modify operation for section status: OK (rc=0, origin=testcluster-c2/crmd/41, version=0.30.152)
Apr 23 11:05:48.830 testcluster-c1 pacemaker-based     [1295870] (cib_process_request)  info: Completed cib_modify operation for section nodes: OK (rc=0, origin=testcluster-c2/crmd/44, version=0.30.152)
Apr 23 11:05:48.834 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: Diff: --- 0.30.152 2
Apr 23 11:05:48.834 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: Diff: +++ 0.30.153 (null)
Apr 23 11:05:48.834 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: +  /cib:  @num_updates=153
Apr 23 11:05:48.834 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: +  /cib/status/node_state[@id='1']:  @crmd=offline, @crm-debug-origin=do_state_transition, @join=down
Apr 23 11:05:48.834 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: +  /cib/status/node_state[@id='2']:  @crm-debug-origin=do_state_transition
Apr 23 11:05:48.834 testcluster-c1 pacemaker-based     [1295870] (cib_process_request)  info: Completed cib_modify operation for section status: OK (rc=0, origin=testcluster-c2/crmd/45, version=0.30.153)
Apr 23 11:05:49.346 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: Diff: --- 0.30.153 2
Apr 23 11:05:49.350 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: Diff: +++ 0.30.154 (null)
Apr 23 11:05:49.350 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: +  /cib:  @num_updates=154, @dc-uuid=2
Apr 23 11:05:50.922 testcluster-c1 pacemaker-based     [1295870] (cib_process_request)  info: Completed cib_modify operation for section cib: OK (rc=0, origin=testcluster-c2/crmd/46, version=0.30.154)
Apr 23 11:05:51.146 testcluster-c1 pacemaker-schedulerd[1295874] (unpack_config)        notice: On loss of quorum: Ignore
Apr 23 11:05:51.150 testcluster-c1 pacemaker-schedulerd[1295874] (determine_online_status)      info: Node testcluster-c1 is online
Apr 23 11:05:51.150 testcluster-c1 pacemaker-schedulerd[1295874] (determine_online_status)      info: Node testcluster-c2 is online
Apr 23 11:05:51.154 testcluster-c1 pacemaker-schedulerd[1295874] (pe_get_failcount)     info: virtual_ip has failed 1 times on testcluster-c1
Apr 23 11:05:51.154 testcluster-c1 pacemaker-schedulerd[1295874] (pe__clear_failcount)  notice: Clearing failure of virtual_ip on testcluster-c1 because it expired | virtual_ip_clear_failcount_0
Apr 23 11:05:51.154 testcluster-c1 pacemaker-schedulerd[1295874] (pe_get_failcount)     info: virtual_ip has failed 1 times on testcluster-c1
Apr 23 11:05:51.154 testcluster-c1 pacemaker-schedulerd[1295874] (pe__clear_failcount)  notice: Clearing failure of virtual_ip on testcluster-c1 because it expired | virtual_ip_clear_failcount_0
Apr 23 11:05:51.158 testcluster-c1 pacemaker-schedulerd[1295874] (pe_get_failcount)     info: virtual_ip has failed 1 times on testcluster-c1
Apr 23 11:05:51.158 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: Diff: --- 0.30.154 2
Apr 23 11:05:51.158 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: Diff: +++ 0.30.155 (null)
Apr 23 11:05:51.158 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: -- /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='DrbdData']/lrm_rsc_op[@id='DrbdData_monitor_60000']
Apr 23 11:05:51.158 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: +  /cib:  @num_updates=155
Apr 23 11:05:51.158 testcluster-c1 pacemaker-schedulerd[1295874] (pe__clear_failcount)  notice: Clearing failure of virtual_ip on testcluster-c1 because it expired | virtual_ip_clear_failcount_0
Apr 23 11:05:51.158 testcluster-c1 pacemaker-schedulerd[1295874] (unpack_rsc_op)        notice: Rescheduling virtual_ip_monitor_30000 after failure expired on testcluster-c1 | actual=1 expected=0 magic=2:1;43:2:0:3f9900ac-185f-4dc1-972d-934d41ca6dc9
Apr 23 11:05:51.162 testcluster-c1 pacemaker-schedulerd[1295874] (pe_get_failcount)     info: PingChecks:0 has failed 1 times on testcluster-c1
Apr 23 11:05:51.162 testcluster-c1 pacemaker-schedulerd[1295874] (pe__clear_failcount)  notice: Clearing failure of PingChecks:0 on testcluster-c1 because it expired | PingChecks:0_clear_failcount_0
Apr 23 11:05:51.162 testcluster-c1 pacemaker-based     [1295870] (cib_process_request)  info: Completed cib_delete operation for section status: OK (rc=0, origin=testcluster-c2/crmd/49, version=0.30.155)
Apr 23 11:05:51.162 testcluster-c1 pacemaker-schedulerd[1295874] (pe_get_failcount)     info: PingChecks:0 has failed 1 times on testcluster-c1
Apr 23 11:05:51.162 testcluster-c1 pacemaker-schedulerd[1295874] (pe__clear_failcount)  notice: Clearing failure of PingChecks:0 on testcluster-c1 because it expired | PingChecks:0_clear_failcount_0
Apr 23 11:05:51.162 testcluster-c1 pacemaker-schedulerd[1295874] (pe_get_failcount)     info: PingChecks:0 has failed 1 times on testcluster-c1
Apr 23 11:05:51.162 testcluster-c1 pacemaker-schedulerd[1295874] (pe__clear_failcount)  notice: Clearing failure of PingChecks:0 on testcluster-c1 because it expired | PingChecks:0_clear_failcount_0
Apr 23 11:05:51.162 testcluster-c1 pacemaker-schedulerd[1295874] (unpack_rsc_op)        notice: Rescheduling PingChecks_monitor_10000 after failure expired on testcluster-c1 | actual=1 expected=0 magic=2:1;51:4:0:a085676a-5b2b-4928-83f2-c01315e5aec5
Apr 23 11:05:51.162 testcluster-c1 pacemaker-schedulerd[1295874] (unpack_rsc_op_failure)        warning: Unexpected result (error: Process did not exit within specified timeout) was recorded for monitor of DrbdFS on testcluster-c1 at Apr 23 11:02:43 2024 | rc=1 id=DrbdFS_last_failure
_0
Apr 23 11:05:51.942 testcluster-c1 pacemaker-schedulerd[1295874] (pe_get_failcount)     info: DrbdFS has failed 1 times on testcluster-c1
Apr 23 11:05:51.942 testcluster-c1 pacemaker-schedulerd[1295874] (pcmk__threshold_reached)      info: DrbdFS can fail 999999 more times on testcluster-c1 before reaching migration threshold (1000000)
Apr 23 11:05:51.962 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: Diff: --- 0.30.155 2
Apr 23 11:05:51.962 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: Diff: +++ 0.30.156 (null)
Apr 23 11:05:51.962 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: +  /cib:  @num_updates=156
Apr 23 11:05:51.962 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: +  /cib/status/node_state[@id='2']:  @crm-debug-origin=do_update_resource
Apr 23 11:05:51.962 testcluster-c1 pacemaker-based     [1295870] (cib_perform_op)       info: +  /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='DrbdData']/lrm_rsc_op[@id='DrbdData_last_0']:  @operation_key=DrbdData_promote_0, @operation=promote, @crm-deb
ug-origin=do_update_resource, @transition-key=6:0:0:91e178a7-4098-4729-92b3-d7cf8fdc6e64, @transition-magic=-1:193;6:0:0:91e178a7-4098-4729-92b3-d7cf8fdc6e64, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1713863139, @exec-time=0
Apr 23 11:05:51.974 testcluster-c1 pacemaker-based     [1295870] (cib_process_request)  info: Completed cib_modify operation for section status: OK (rc=0, origin=testcluster-c2/crmd/50, version=0.30.156)
Apr 23 11:05:52.674 testcluster-c1 pacemaker-based     [1295870] (cib_process_ping)     info: Reporting our current digest to testcluster-c2: fddf1766600ffe74873bf40cd3a0cbcb for 0.30.156 (0x564d66568980 0)
Apr 23 11:05:53.074 testcluster-c1 pacemaker-schedulerd[1295874] (pcmk__set_instance_roles)     info: Promoting DrbdData:0 (Promoted testcluster-c1)
Apr 23 11:05:53.074 testcluster-c1 pacemaker-schedulerd[1295874] (pcmk__set_instance_roles)     info: DrbdData-clone: Promoted 1 instances of a possible 1
Apr 23 11:05:53.082 testcluster-c1 pacemaker-schedulerd[1295874] (RecurringOp)  info:  Start recurring monitor (20s) for DrbdFS on testcluster-c1
Apr 23 11:05:53.946 testcluster-c1 pacemaker-schedulerd[1295874] (rsc_action_default)   info: Leave   DrbdData:0        (Promoted testcluster-c1)
Apr 23 11:05:54.086 testcluster-c1 pacemaker-schedulerd[1295874] (rsc_action_default)   info: Leave   DrbdData:1        (Unpromoted testcluster-c2)
Apr 23 11:05:54.334 testcluster-c1 pacemaker-schedulerd[1295874] (log_list_item)        notice: Actions: Recover    DrbdFS              ( testcluster-c1 )
Apr 23 11:05:55.130 testcluster-c1 pacemaker-schedulerd[1295874] (log_list_item)        notice: Actions: Restart    docker-services     ( testcluster-c1 )  due to required DrbdFS start
Apr 23 11:05:55.906 testcluster-c1 pacemaker-schedulerd[1295874] (log_list_item)        notice: Actions: Restart    virtual_ip          ( testcluster-c1 )  due to required docker-services start
Apr 23 11:05:56.162 testcluster-c1 pacemaker-schedulerd[1295874] (rsc_action_default)   info: Leave   PingChecks:0      (Started testcluster-c1)
Apr 23 11:05:56.422 testcluster-c1 pacemaker-schedulerd[1295874] (rsc_action_default)   info: Leave   PingChecks:1      (Started testcluster-c2)
Apr 23 11:05:56.986 testcluster-c1 pacemaker-based     [1295870] (cib_file_backup)      info: Archived previous version as /var/lib/pacemaker/cib/cib-35.raw
Apr 23 11:05:57.446 testcluster-c1 pacemaker-based     [1295870] (cib_file_write_with_digest)   info: Wrote version 0.30.0 of the CIB to disk (digest: 6559c9de6a8bfa2291405312f21fe540)
Apr 23 11:05:57.590 testcluster-c1 pacemaker-based     [1295870] (cib_file_write_with_digest)   info: Reading cluster configuration file /var/lib/pacemaker/cib/cib.SOWrXW (digest: /var/lib/pacemaker/cib/cib.pDuq4R)
Apr 23 11:05:58.734 testcluster-c1 pacemaker-based     [1295870] (cib_process_ping)     info: Reporting our current digest to testcluster-c1: fddf1766600ffe74873bf40cd3a0cbcb for 0.30.156 (0x564d66568980 0)
Apr 23 11:05:59.138 testcluster-c1 pacemaker-schedulerd[1295874] (pcmk__log_transition_summary)         notice: Calculated transition 497, saving inputs in /var/lib/pacemaker/pengine/pe-input-30.bz2
Apr 23 11:05:59.166 testcluster-c1 pacemaker-schedulerd[1295874] (crm_signal_dispatch)  notice: Caught 'Terminated' signal | 15 (invoking handler)
Apr 23 11:05:59.174 testcluster-c1 pacemaker-schedulerd[1295874] (qb_ipcs_us_withdraw)  info: withdrawing server sockets
Apr 23 11:05:59.178 testcluster-c1 pacemaker-schedulerd[1295874] (crm_xml_cleanup)      info: Cleaning up memory from libxml2
Apr 23 11:05:59.198 testcluster-c1 pacemaker-schedulerd[1295874] (crm_exit)     info: Exiting pacemaker-schedulerd | with status 0
Apr 23 11:06:00.486 testcluster-c1 pacemakerd          [1295869] (pcmk_child_exit)      info: pacemaker-schedulerd[1295874] exited with status 0 (OK)
Apr 23 11:06:00.546 testcluster-c1 pacemaker-attrd     [1295873] (crm_signal_dispatch)  notice: Caught 'Terminated' signal | 15 (invoking handler)
Apr 23 11:06:00.550 testcluster-c1 pacemakerd          [1295869] (stop_child)   notice: Stopping pacemaker-attrd | sent signal 15 to process 1295873
Apr 23 11:06:00.570 testcluster-c1 pacemaker-attrd     [1295873] (main)         info: Shutting down attribute manager
Apr 23 11:06:00.582 testcluster-c1 pacemaker-attrd     [1295873] (qb_ipcs_us_withdraw)  info: withdrawing server sockets
Apr 23 11:06:01.078 testcluster-c1 pacemaker-attrd     [1295873] (attrd_cib_destroy_cb)         info: Connection disconnection complete
Apr 23 11:06:01.126 testcluster-c1 pacemaker-attrd     [1295873] (crm_xml_cleanup)      info: Cleaning up memory from libxml2
Apr 23 11:06:01.382 testcluster-c1 pacemaker-attrd     [1295873] (crm_exit)     info: Exiting pacemaker-attrd | with status 0
Apr 23 11:06:02.926 testcluster-c1 pacemakerd          [1295869] (pcmk_child_exit)      info: pacemaker-attrd[1295873] exited with status 0 (OK)
Apr 23 11:06:02.958 testcluster-c1 pacemakerd          [1295869] (stop_child)   notice: Stopping pacemaker-execd | sent signal 15 to process 1295872
Apr 23 11:06:03.006 testcluster-c1 pacemaker-execd     [1295872] (crm_signal_dispatch)  notice: Caught 'Terminated' signal | 15 (invoking handler)
Apr 23 11:06:03.010 testcluster-c1 pacemaker-execd     [1295872] (lrmd_exit)    info: Terminating with 0 clients
Apr 23 11:06:03.014 testcluster-c1 pacemaker-execd     [1295872] (qb_ipcs_us_withdraw)  info: withdrawing server sockets
Apr 23 11:06:03.022 testcluster-c1 pacemaker-execd     [1295872] (crm_xml_cleanup)      info: Cleaning up memory from libxml2
Apr 23 11:06:03.206 testcluster-c1 pacemaker-execd     [1295872] (crm_exit)     info: Exiting pacemaker-execd | with status 0
Apr 23 11:06:03.382 testcluster-c1 pacemaker-based     [1295870] (cib_process_request)  info: Completed cib_modify operation for section status: OK (rc=0, origin=testcluster-c2/attrd/5, version=0.30.156)
Apr 23 11:06:03.438 testcluster-c1 pacemaker-based     [1295870] (cib_process_request)  info: Completed cib_modify operation for section status: OK (rc=0, origin=testcluster-c2/attrd/6, version=0.30.156)
Apr 23 11:06:03.966 testcluster-c1 pacemakerd          [1295869] (pcmk_child_exit)      info: pacemaker-execd[1295872] exited with status 0 (OK)
Apr 23 11:06:03.970 testcluster-c1 pacemakerd          [1295869] (stop_child)   notice: Stopping pacemaker-fenced | sent signal 15 to process 1295871
Apr 23 11:06:03.978 testcluster-c1 pacemaker-fenced    [1295871] (crm_signal_dispatch)  notice: Caught 'Terminated' signal | 15 (invoking handler)
Apr 23 11:06:04.010 testcluster-c1 pacemaker-fenced    [1295871] (stonith_shutdown)     info: Terminating with 0 clients
Apr 23 11:06:04.630 testcluster-c1 pacemaker-fenced    [1295871] (cib_connection_destroy)       info: Connection to the CIB manager closed
Apr 23 11:06:04.702 testcluster-c1 pacemaker-fenced    [1295871] (qb_ipcs_us_withdraw)  info: withdrawing server sockets
Apr 23 11:06:04.778 testcluster-c1 pacemaker-fenced    [1295871] (crm_xml_cleanup)      info: Cleaning up memory from libxml2
Apr 23 11:06:04.842 testcluster-c1 pacemaker-fenced    [1295871] (crm_exit)     info: Exiting pacemaker-fenced | with status 0
Apr 23 11:06:06.258 testcluster-c1 pacemakerd          [1295869] (pcmk_child_exit)      info: pacemaker-fenced[1295871] exited with status 0 (OK)
Apr 23 11:06:06.266 testcluster-c1 pacemaker-based     [1295870] (crm_signal_dispatch)  notice: Caught 'Terminated' signal | 15 (invoking handler)
Apr 23 11:06:06.266 testcluster-c1 pacemakerd          [1295869] (stop_child)   notice: Stopping pacemaker-based | sent signal 15 to process 1295870
Apr 23 11:06:06.274 testcluster-c1 pacemaker-based     [1295870] (cib_shutdown)         info: Disconnected 0 clients
Apr 23 11:06:06.274 testcluster-c1 pacemaker-based     [1295870] (cib_shutdown)         info: All clients disconnected (0)
Apr 23 11:06:06.282 testcluster-c1 pacemaker-based     [1295870] (initiate_exit)        info: Sending disconnect notification to 2 peers...
Apr 23 11:06:06.334 testcluster-c1 pacemaker-based     [1295870] (cib_process_shutdown_req)     info: Peer testcluster-c1 is requesting to shut down
Apr 23 11:06:06.346 testcluster-c1 pacemaker-based     [1295870] (cib_process_shutdown_req)     info: Peer testcluster-c2 has acknowledged our shutdown request
Apr 23 11:06:06.346 testcluster-c1 pacemaker-based     [1295870] (terminate_cib)        info: cib_process_shutdown_req: Exiting from mainloop...
Apr 23 11:06:06.350 testcluster-c1 pacemaker-based     [1295870] (crm_cluster_disconnect)       info: Disconnecting from corosync cluster infrastructure
Apr 23 11:06:06.358 testcluster-c1 pacemaker-based     [1295870] (pcmk__corosync_disconnect)    notice: Disconnected from Corosync
Apr 23 11:06:06.366 testcluster-c1 pacemaker-based     [1295870] (crm_get_peer)         info: Created entry 48e34a90-0596-48f1-b2b2-171d1692aac5/0x564d66683e40 for node testcluster-c2/0 (1 total)
Apr 23 11:06:06.406 testcluster-c1 pacemaker-based     [1295870] (cib_peer_update_callback)     info: No more peers
Apr 23 11:06:06.422 testcluster-c1 pacemaker-based     [1295870] (terminate_cib)        info: cib_peer_update_callback: Exiting from mainloop...
Apr 23 11:06:06.426 testcluster-c1 pacemaker-based     [1295870] (crm_cluster_disconnect)       info: Disconnecting from corosync cluster infrastructure
Apr 23 11:06:06.426 testcluster-c1 pacemaker-based     [1295870] (cluster_disconnect_cpg)       info: No CPG connection
Apr 23 11:06:06.426 testcluster-c1 pacemaker-based     [1295870] (pcmk__corosync_disconnect)    notice: Disconnected from Corosync
Apr 23 11:06:06.426 testcluster-c1 pacemaker-based     [1295870] (qb_ipcs_us_withdraw)  info: withdrawing server sockets
Apr 23 11:06:06.442 testcluster-c1 pacemaker-based     [1295870] (qb_ipcs_us_withdraw)  info: withdrawing server sockets
Apr 23 11:06:06.442 testcluster-c1 pacemaker-based     [1295870] (qb_ipcs_us_withdraw)  info: withdrawing server sockets
Apr 23 11:06:06.458 testcluster-c1 pacemaker-based     [1295870] (crm_xml_cleanup)      info: Cleaning up memory from libxml2
Apr 23 11:06:09.682 testcluster-c1 pacemaker-based     [1295870] (crm_exit)     info: Exiting pacemaker-based | with status 0
Apr 23 11:06:10.450 testcluster-c1 pacemakerd          [1295869] (pcmk_child_exit)      info: pacemaker-based[1295870] exited with status 0 (OK)
Apr 23 11:06:10.458 testcluster-c1 pacemakerd          [1295869] (pcmk_shutdown_worker)         notice: Shutdown complete
Apr 23 11:06:10.458 testcluster-c1 pacemakerd          [1295869] (pcmk_shutdown_worker)         notice: Shutting down and staying down after fatal error
Apr 23 11:06:10.458 testcluster-c1 pacemakerd          [1295869] (pcmkd_shutdown_corosync)      info: Asking Corosync to shut down
Apr 23 11:06:10.530 testcluster-c1 pacemakerd          [1295869] (crm_xml_cleanup)      info: Cleaning up memory from libxml2
Apr 23 11:06:10.554 testcluster-c1 pacemakerd          [1295869] (crm_exit)     info: Exiting pacemakerd | with status 100

On 26.04.2024, at 05:54, Reid Wahl <nw...@redhat.com> wrote:

Any logs from Pacemaker?

On Thu, Apr 25, 2024 at 3:46 AM Alexander Eastwood via Users
<users@clusterlabs.org> wrote:

Hi all,

I’m trying to get a better understanding of why our cluster - or specifically corosync.service - entered a failed state. Here are all of the relevant corosync logs from this event, with the last line showing when I manually started the service again:

Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [CFG   ] Node 1 was shut down by sysadmin
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Unloading all Corosync service engines.
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync configuration map access
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync configuration service
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync profile loading service
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync resource monitoring service
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync watchdog service
Apr 23 11:06:11 [1295854] testcluster-c1 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Apr 23 11:06:11 [1295854] testcluster-c1 corosync warning [KNET  ] host: host: 1 has no active links
Apr 23 11:06:11 [1295854] testcluster-c1 corosync notice  [MAIN  ] Corosync Cluster Engine exiting normally
Apr 23 13:18:36 [796246] testcluster-c1 corosync notice  [MAIN  ] Corosync Cluster Engine 3.1.6 starting up

The first line suggests that a manual shutdown of one of the cluster nodes, however neither me nor any of my colleagues did this. The ‘sysadmin’ surely must mean a person logging on to the server and running some command, as opposed to a system process?

Then in the 3rd row from the bottom there is the warning “host: host: 1 has no active links” which is followed by “Corosync Cluster Engine exiting normally”. Does this mean that the reason for the Cluster Engine exiting is the fact that there are no active links?

Finally, I am considering adding a systemd override file for the corosync service with the following content:

[Service]
Restart=on-failure

Is there any reason not to do this? And, given that the process exited normally, would I need to use Restart=always instead?

Many thanks

Alex
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/



--
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker


_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Reply via email to