[ClusterLabs] clear_failcount operation times out, makes it impossible to use the cluster

Krzysztof Bodora Mon, 02 Jan 2023 03:16:41 -0800

Hello Clusterlabs,

I'm getting this error in the logs:

Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: error: print_synapse: [Action 7]: In-flight crm op ping_resource_clear_failcount_0 on swdal1-ISCSI01 (priority: 0, waiting: none)


My specifications:

OS: Debian 8
Pacemaker version: 1.1.12
Kernel version: 4.19.190

I'd like to know what can cause this error and how to prevent it in the future. I'm also currently unable to update to a newer version of pacemaker.

Here is some context for when it happens. It seems that the ping_resource resources are in the 'Restart' state:

Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: notice: LogActions: Restart ping_resource:0 (Started swdal1-ISCSI01)
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: notice: LogActions: Restart ping_resource:1 (Started swdal1-ISCSI02)


which causes pacemaker to try to clear the failcounts on those resources:

Dec 20 09:24:23 [57862] swdal1-ISCSI01 crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Dec 20 09:24:23 [57862] swdal1-ISCSI01 crmd: info: do_te_invoke: Processing graph 11 (ref=pe_calc-dc-1671528262-59) derived from /var/lib/pacemaker/pengine/pe-input-518.bz2
Dec 20 09:24:23 [57862] swdal1-ISCSI01 crmd: info: te_crm_command: Executing crm-event (7): clear_failcount on swdal1-ISCSI01
Dec 20 09:24:23 [57862] swdal1-ISCSI01 crmd: info: handle_failcount_op: Removing failcount for ping_resource
Dec 20 09:24:23 [57841] swdal1-ISCSI01 cib: info: cib_process_request: Forwarding cib_delete operation for section //node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm_rsc_op[@id='ping_resource_last_failure_0'] to master (origin=local/crmd/118)
Dec 20 09:24:23 [57841] swdal1-ISCSI01 cib: info: cib_process_request: Completed cib_delete operation for section //node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm_rsc_op[@id='ping_resource_last_failure_0']: OK (rc=0, origin=swdal1-ISCSI01/crmd/118, version=0.60.0)
Dec 20 09:24:28 [57841] swdal1-ISCSI01 cib: info: cib_process_ping: Reporting our current digest to swdal1-ISCSI01: ccf71244504d3deb02d0da64fa72cedc for 0.60.0 (0x55788a83c4b0 0)
Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: warning: action_timer_callback: Timer popped (timeout=20000, abort_level=0, complete=false)
Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: error: print_synapse: [Action 7]: In-flight crm op ping_resource_clear_failcount_0 on swdal1-ISCSI01 (priority: 0, waiting: none)
Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: notice: abort_transition_graph: Transition aborted: Action lost (source=action_timer_callback:772, 0)
Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: notice: run_graph: Transition 11 (Complete=1, Pending=0, Fired=0, Skipped=9, Incomplete=2, Source=/var/lib/pacemaker/pengine/pe-input-518.bz2): Stopped
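To make the timing in these entries easier to see, here is a minimal Python sketch that pairs each "Executing crm-event ... clear_failcount" entry with the next "Timer popped" entry and reports the seconds between them. The regular expression, the function names and the default year are my own assumptions (the year is inferred from the epoch value in the pe_calc ref); it is not a Pacemaker tool, only a rough log reader for entries shaped like the ones above.

```python
import re
from datetime import datetime

# Header of one detail-log entry, e.g.
# "Dec 20 09:24:23 [57862] swdal1-ISCSI01 crmd: info: te_crm_command: ..."
# The " *" after each colon tolerates entries where the spaces were lost.
ENTRY = re.compile(
    r"(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) \[\d+\] \S+ +(?P<daemon>\w+): *"
    r"(?P<level>\w+): *(?P<func>\w+): *(?P<msg>.*)"
)


def parse(line, year=2022):
    """Return (timestamp, function, message), or None if the line does not match.

    The log has no year; 2022 is an assumption passed in by the caller.
    """
    m = ENTRY.match(line)
    if m is None:
        return None
    ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
    return ts, m.group("func"), m.group("msg")


def failcount_timeouts(lines, year=2022):
    """Yield (started, popped, seconds) for each clear_failcount followed by a timer pop."""
    started = None
    for line in lines:
        entry = parse(line, year)
        if entry is None:
            continue
        ts, func, msg = entry
        if func == "te_crm_command" and "clear_failcount" in msg:
            started = ts
        elif func == "action_timer_callback" and started is not None:
            yield started, ts, (ts - started).total_seconds()
            started = None


# Two entries copied from the excerpt above.
SAMPLE = [
    "Dec 20 09:24:23 [57862] swdal1-ISCSI01 crmd: info: te_crm_command: "
    "Executing crm-event (7): clear_failcount on swdal1-ISCSI01",
    "Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: warning: action_timer_callback: "
    "Timer popped (timeout=20000, abort_level=0, complete=false)",
]

for started, popped, seconds in failcount_timeouts(SAMPLE):
    print(f"clear_failcount at {started:%H:%M:%S}, timer popped at {popped:%H:%M:%S} ({seconds:.0f}s)")
```

On the entries above this reports a gap of 80 seconds between starting the clear_failcount and the timer popping, even though the logged timeout is 20000.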

Clearing the failcount fails, so the whole transition is aborted. This makes it impossible to do anything in the cluster, for example move the Pool-0 resource, as that also triggers the clear_failcount operation, which fails and aborts the transition. For example:

Dec 20 09:35:04 [57851] swdal1-ISCSI01 pengine: info: RecurringOp: Start recurring monitor (5s) for ping_resource:0 on swdal1-ISCSI01
Dec 20 09:35:04 [57851] swdal1-ISCSI01 pengine: info: RecurringOp: Start recurring monitor (5s) for ping_resource:1 on swdal1-ISCSI02
Dec 20 09:35:04 [57851] swdal1-ISCSI01 pengine: info: RecurringOp: Start recurring monitor (10s) for Pool-0 on swdal1-ISCSI02
Dec 20 09:35:04 [57851] swdal1-ISCSI01 pengine: notice: LogActions: Restart ping_resource:0 (Started swdal1-ISCSI01)
Dec 20 09:35:04 [57851] swdal1-ISCSI01 pengine: notice: LogActions: Restart ping_resource:1 (Started swdal1-ISCSI02)
Dec 20 09:35:04 [57851] swdal1-ISCSI01 pengine: info: LogActions: Leave Pool-1 (Started swdal1-ISCSI01)
Dec 20 09:35:04 [57851] swdal1-ISCSI01 pengine: notice: LogActions: Move Pool-0 (Started swdal1-ISCSI01 -> swdal1-ISCSI02)
Dec 20 09:35:04 [57851] swdal1-ISCSI01 pengine: notice: process_pe_message: Calculated Transition 19: /var/lib/pacemaker/pengine/pe-input-519.bz2
Dec 20 09:35:04 [57862] swdal1-ISCSI01 crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Dec 20 09:35:04 [57862] swdal1-ISCSI01 crmd: info: do_te_invoke: Processing graph 19 (ref=pe_calc-dc-1671528904-75) derived from /var/lib/pacemaker/pengine/pe-input-519.bz2
Dec 20 09:35:04 [57862] swdal1-ISCSI01 crmd: info: te_crm_command: Executing crm-event (7): clear_failcount on swdal1-ISCSI01
Dec 20 09:35:04 [57862] swdal1-ISCSI01 crmd: info: handle_failcount_op: Removing failcount for ping_resource
Dec 20 09:35:04 [57841] swdal1-ISCSI01 cib: info: cib_process_request: Forwarding cib_delete operation for section //node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm_rsc_op[@id='ping_resource_last_failure_0'] to master (origin=local/crmd/134)
Dec 20 09:35:04 [57841] swdal1-ISCSI01 cib: info: cib_process_request: Completed cib_delete operation for section //node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm_rsc_op[@id='ping_resource_last_failure_0']: OK (rc=0, origin=swdal1-ISCSI01/crmd/134, version=0.61.0)
Dec 20 09:35:09 [57841] swdal1-ISCSI01 cib: info: cib_process_ping: Reporting our current digest to swdal1-ISCSI01: decc3ad1315820648f242167998a5880 for 0.61.0 (0x55788a8408e0 0)
Dec 20 09:36:24 [57862] swdal1-ISCSI01 crmd: warning: action_timer_callback: Timer popped (timeout=20000, abort_level=0, complete=false)
Dec 20 09:36:24 [57862] swdal1-ISCSI01 crmd: error: print_synapse: [Action 7]: In-flight crm op ping_resource_clear_failcount_0 on swdal1-ISCSI01 (priority: 0, waiting: none)
Dec 20 09:36:24 [57862] swdal1-ISCSI01 crmd: notice: abort_transition_graph: Transition aborted: Action lost (source=action_timer_callback:772, 0)
Dec 20 09:36:24 [57862] swdal1-ISCSI01 crmd: notice: run_graph: Transition 19 (Complete=1, Pending=0, Fired=0, Skipped=12, Incomplete=2, Source=/var/lib/pacemaker/pengine/pe-input-519.bz2): Stopped

As you can see, the 'stop' operation for resource Pool-0 did not even run, as the transition was stopped by the clear_failcount error. This error kept happening until we restarted pacemaker. Here is some more context from one of the times this error has happened:

Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: process_pe_message: Input has not changed since last time, not saving to disk
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: determine_online_status: Node swdal1-ISCSI01 is online
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: determine_online_status: Node swdal1-ISCSI02 is online
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: determine_op_status: Operation monitor found resource Pool-0 active on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: determine_op_status: Operation monitor found resource Pool-0 active on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: determine_op_status: Operation monitor found resource Pool-1 active on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: determine_op_status: Operation monitor found resource Pool-1 active on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: clone_print: Clone Set: ping_resource-clone [ping_resource]
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: short_print: Started: [ swdal1-ISCSI01 swdal1-ISCSI02 ]
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: native_print: Pool-1 (ocf::oe:zfs): Started swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: native_print: Pool-0 (ocf::oe:zfs): Started swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: get_failcount_full: ping_resource:0 has failed 8 times on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: common_apply_stickiness: ping_resource-clone can fail 999992 more times on swdal1-ISCSI01 before being forced off
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: get_failcount_full: ping_resource:1 has failed 8 times on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: common_apply_stickiness: ping_resource-clone can fail 999992 more times on swdal1-ISCSI01 before being forced off
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: check_action_definition: params:reload <parameters multiplier="1000" dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60 10.151.16.60" attempts="4" timeout="3"/>
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: check_action_definition: Parameters to ping_resource:0_start_0 on swdal1-ISCSI01 changed: was 57524cd0b7204dd60c127ba66fb83cd2 vs. now 1a37c0e0391890df8549f5fda647f4d9 (reload:3.0.9) 0:0;14:28:0:a0f1b96e-5089-4dad-9073-8c8feac4ea3a
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: get_failcount_full: ping_resource:0 has failed 8 times on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: check_action_definition: params:reload <parameters multiplier="1000" dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60 10.151.16.60" attempts="4" timeout="3" CRM_meta_timeout="15000"/>
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: check_action_definition: Parameters to ping_resource:0_monitor_5000 on swdal1-ISCSI01 changed: was f3b4adf4d46692f312296263faa50a75 vs. now c0d10fc8996c295dd1213d4ca058c0e7 (reload:3.0.9) 0:0;15:28:0:a0f1b96e-5089-4dad-9073-8c8feac4ea3a
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: get_failcount_full: ping_resource:0 has failed 8 times on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: check_action_definition: params:reload <parameters multiplier="1000" dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60 10.151.16.60" attempts="4" timeout="3"/>
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: check_action_definition: Parameters to ping_resource:1_start_0 on swdal1-ISCSI02 changed: was 57524cd0b7204dd60c127ba66fb83cd2 vs. now 1a37c0e0391890df8549f5fda647f4d9 (reload:3.0.9) 0:0;17:7:0:0ea53274-56ef-48f6-9de1-38d635fa2530
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: check_action_definition: params:reload <parameters multiplier="1000" dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60 10.151.16.60" attempts="4" timeout="3" CRM_meta_timeout="15000"/>
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: check_action_definition: Parameters to ping_resource:1_monitor_5000 on swdal1-ISCSI02 changed: was f3b4adf4d46692f312296263faa50a75 vs. now c0d10fc8996c295dd1213d4ca058c0e7 (reload:3.0.9) 0:0;18:7:0:0ea53274-56ef-48f6-9de1-38d635fa2530
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: RecurringOp: Start recurring monitor (5s) for ping_resource:0 on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: RecurringOp: Start recurring monitor (5s) for ping_resource:1 on swdal1-ISCSI02
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: notice: LogActions: Restart ping_resource:0 (Started swdal1-ISCSI01)
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: notice: LogActions: Restart ping_resource:1 (Started swdal1-ISCSI02)
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: LogActions: Leave Pool-1 (Started swdal1-ISCSI01)
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info: LogActions: Leave Pool-0 (Started swdal1-ISCSI01)
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: notice: process_pe_message: Calculated Transition 11: /var/lib/pacemaker/pengine/pe-input-518.bz2
Dec 20 09:24:23 [57862] swdal1-ISCSI01 crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Dec 20 09:24:23 [57862] swdal1-ISCSI01 crmd: info: do_te_invoke: Processing graph 11 (ref=pe_calc-dc-1671528262-59) derived from /var/lib/pacemaker/pengine/pe-input-518.bz2
Dec 20 09:24:23 [57862] swdal1-ISCSI01 crmd: info: te_crm_command: Executing crm-event (7): clear_failcount on swdal1-ISCSI01
Dec 20 09:24:23 [57862] swdal1-ISCSI01 crmd: info: handle_failcount_op: Removing failcount for ping_resource
Dec 20 09:24:23 [57841] swdal1-ISCSI01 cib: info: cib_process_request: Forwarding cib_delete operation for section //node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm_rsc_op[@id='ping_resource_last_failure_0'] to master (origin=local/crmd/118)
Dec 20 09:24:23 [57841] swdal1-ISCSI01 cib: info: cib_process_request: Completed cib_delete operation for section //node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm_rsc_op[@id='ping_resource_last_failure_0']: OK (rc=0, origin=swdal1-ISCSI01/crmd/118, version=0.60.0)
Dec 20 09:24:28 [57841] swdal1-ISCSI01 cib: info: cib_process_ping: Reporting our current digest to swdal1-ISCSI01: ccf71244504d3deb02d0da64fa72cedc for 0.60.0 (0x55788a83c4b0 0)
Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: warning: action_timer_callback: Timer popped (timeout=20000, abort_level=0, complete=false)
Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: error: print_synapse: [Action 7]: In-flight crm op ping_resource_clear_failcount_0 on swdal1-ISCSI01 (priority: 0, waiting: none)
Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: notice: abort_transition_graph: Transition aborted: Action lost (source=action_timer_callback:772, 0)
Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: notice: run_graph: Transition 11 (Complete=1, Pending=0, Fired=0, Skipped=9, Incomplete=2, Source=/var/lib/pacemaker/pengine/pe-input-518.bz2): Stopped
Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
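The run_graph entries show how much of each transition was actually carried out. As a rough sketch, assuming the summary keeps the "Transition N (key=value, ...): State" shape seen in these logs, the counters can be pulled into a dictionary like this. The regular expression and the function name parse_run_graph are hypothetical helpers of mine, not part of Pacemaker.

```python
import re

# Matches the run_graph summary, e.g.
# "Transition 19 (Complete=1, Pending=0, ..., Source=/path/file.bz2): Stopped"
RUN_GRAPH = re.compile(
    r"Transition (?P<id>\d+) \((?P<fields>[^)]*)\): *(?P<state>\w+)"
)


def parse_run_graph(message):
    """Return the transition id, final state and counters as a dict, or None."""
    m = RUN_GRAPH.search(message)
    if m is None:
        return None
    summary = {"transition": int(m.group("id")), "state": m.group("state")}
    for field in m.group("fields").split(","):
        key, _, value = field.strip().partition("=")
        # Counters become integers; the Source path stays a string.
        summary[key] = int(value) if value.isdigit() else value
    return summary


# Summary text copied from the Transition 19 entry earlier in this message.
line = ("Transition 19 (Complete=1, Pending=0, Fired=0, Skipped=12, Incomplete=2, "
        "Source=/var/lib/pacemaker/pengine/pe-input-519.bz2): Stopped")
summary = parse_run_graph(line)
print(summary["Complete"], "completed,", summary["Skipped"], "skipped,", summary["state"])
```

For Transition 19 this gives 1 completed action and 12 skipped, which matches the Pool-0 move never being started.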


I'd appreciate some information about this topic.

