It's an old installation. The error started appearing after one of the
nodes was disconnected and its OS was re-installed; the pacemaker
configuration was then restored from a backup (pcs cluster cib-push)
and the node rejoined the cluster. The failcount itself had been at 8
for some time before that, though. The configuration looks like this:
pcs config

Cluster Name:
Corosync Nodes:
 10.151.50.43 10.151.50.42
Pacemaker Nodes:
 swdal1-ISCSI01 swdal1-ISCSI02

Resources:
 Clone: ping_resource-clone
  Resource: ping_resource (class=ocf provider=pacemaker type=ping)
   Attributes: multiplier=1000 dampen=15s attempts=4 timeout=3 host_list="10.151.16.60 10.151.16.50 10.151.17.50 10.151.17.60"
   Operations: start interval=0s timeout=60 (ping_resource-start-interval-0s)
               stop interval=0s timeout=20 (ping_resource-stop-interval-0s)
               monitor interval=5s timeout=15s (ping_resource-monitor-interval-5s)
 Resource: Pool-1 (class=ocf provider=oe type=zfs)
  Attributes: pool_name=Pool-1 pool_id=9200090953161398950 encryption_password_hash=None
  Meta Attrs: failure-timeout=30 is-managed=True encryption_password_hash=None
  Operations: start interval=0s timeout=300 (Pool-1-start-interval-0s)
              stop interval=0s timeout=300 (Pool-1-stop-interval-0s)
              monitor interval=10 timeout=60 (Pool-1-monitor-interval-10)
 Resource: Pool-0 (class=ocf provider=oe type=zfs)
  Attributes: pool_name=Pool-0 pool_id=4165732781319344895 encryption_password_hash=None
  Meta Attrs: failure-timeout=30 is-managed=True encryption_password_hash=None
  Operations: start interval=0s timeout=300 (Pool-0-start-interval-0s)
              stop interval=0s timeout=300 (Pool-0-stop-interval-0s)
              monitor interval=10 timeout=60 (Pool-0-monitor-interval-10)

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: Pool-0
    Enabled on: swdal1-ISCSI01 (score:1) (id:location-Pool-0-swdal1-ISCSI01-1)
    Constraint: location-Pool-0
      Rule: score=-INFINITY boolean-op=or (id:location-Pool-0-rule)
        Expression: pingd lt 1 (id:location-Pool-0-rule-expr)
        Expression: not_defined pingd (id:location-Pool-0-rule-expr-1)
  Resource: Pool-1
    Enabled on: swdal1-ISCSI01 (score:1) (id:location-Pool-1-swdal1-ISCSI01-1)
    Constraint: location-Pool-1
      Rule: score=-INFINITY boolean-op=or (id:location-Pool-1-rule)
        Expression: pingd lt 1 (id:location-Pool-1-rule-expr)
        Expression: not_defined pingd (id:location-Pool-1-rule-expr-1)
Ordering Constraints:
Colocation Constraints:

Resources Defaults:
 resource-stickiness: 100000
Operations Defaults:
 record-pending: true

Cluster Properties:
 batch-limit: 1
 cluster-infrastructure: corosync
 cluster-recheck-interval: 180
 dc-version: 1.1.12-1.1.12+git+561c4cf
 no-quorum-policy: ignore
 stonith-enabled: false
 stop-orphan-resources: false
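
By the way, the stale failcount itself can be inspected and cleared by
hand. A minimal sketch, assuming the pcs 0.9 syntax that ships with
Debian 8 (resource and node names are the ones from the config above):

 # Show the stored failcounts for the clone's instances
 pcs resource failcount show ping_resource

 # Clear the resource's failure history (and with it the failcount)
 pcs resource cleanup ping_resource

 # Low-level equivalent of the cleanup, in case pcs misbehaves
 crm_resource --cleanup --resource ping_resource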
On 02.01.2023 at 13:12, Ulrich Windl wrote:
Hi!
I wonder: is this a new installation, or is it a new bug in an old installation?
In the first case I'd recommend starting with current software; in the
second, please describe what changed or what triggered the situation.
Also, please provide basic configuration data.
Regards,
Ulrich
Krzysztof Bodora <krzysztof.bod...@open-e.com> wrote on 02.01.2023 at 12:16
in message <37d86b8d-c59f-5fe3-cba1-41d2c84fc...@open-e.com>:
Hello Clusterlabs,
I'm getting this error in the logs:
Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: error:
print_synapse: [Action 7]: In-flight crm op
ping_resource_clear_failcount_0 on swdal1-ISCSI01 (priority: 0,
waiting: none)
My specifications:
OS: Debian 8
Pacemaker version: 1.1.12
Kernel version: 4.19.190
I'd like to know what can cause this error to happen and how to prevent
it in the future. I'm also currently unable to update to a newer version
of pacemaker.
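
In case it is useful for the analysis, the status-section entry that
the failing clear_failcount operation targets can be dumped straight
from the CIB. A sketch, using the xpath that appears in the cib_delete
log messages below:

 # Show the stored operation history for ping_resource on the affected node
 cibadmin --query --xpath \
   "//node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']"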
Here is some context for when it happens. It seems that the
ping_resource clone instances are scheduled for a 'Restart':
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: notice:
LogActions: Restart ping_resource:0 (Started swdal1-ISCSI01)
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: notice:
LogActions: Restart ping_resource:1 (Started swdal1-ISCSI02)
which causes pacemaker to try to clear the failcounts on those resources:
Dec 20 09:24:23 [57862] swdal1-ISCSI01 crmd: info:
do_state_transition: State transition S_POLICY_ENGINE ->
S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
origin=handle_response ]
Dec 20 09:24:23 [57862] swdal1-ISCSI01 crmd: info:
do_te_invoke: Processing graph 11 (ref=pe_calc-dc-1671528262-59)
derived from /var/lib/pacemaker/pengine/pe-input-518.bz2
Dec 20 09:24:23 [57862] swdal1-ISCSI01 crmd: info:
te_crm_command: Executing crm-event (7): clear_failcount on
swdal1-ISCSI01
Dec 20 09:24:23 [57862] swdal1-ISCSI01 crmd: info:
handle_failcount_op: Removing failcount for ping_resource
Dec 20 09:24:23 [57841] swdal1-ISCSI01 cib: info:
cib_process_request: Forwarding cib_delete operation for section
//node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm_rsc_op[@id='ping_resource_last_failure_0']
to master (origin=local/crmd/118)
Dec 20 09:24:23 [57841] swdal1-ISCSI01 cib: info:
cib_process_request: Completed cib_delete operation for section
//node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm_rsc_op[@id='ping_resource_last_failure_0']:
OK (rc=0, origin=swdal1-ISCSI01/crmd/118, version=0.60.0)
Dec 20 09:24:28 [57841] swdal1-ISCSI01 cib: info:
cib_process_ping: Reporting our current digest to swdal1-ISCSI01:
ccf71244504d3deb02d0da64fa72cedc for 0.60.0 (0x55788a83c4b0 0)
Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: warning:
action_timer_callback: Timer popped (timeout=20000, abort_level=0,
complete=false)
Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: error:
print_synapse: [Action 7]: In-flight crm op
ping_resource_clear_failcount_0 on swdal1-ISCSI01 (priority: 0,
waiting: none)
Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: notice:
abort_transition_graph: Transition aborted: Action lost
(source=action_timer_callback:772, 0)
Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: notice:
run_graph: Transition 11 (Complete=1, Pending=0, Fired=0,
Skipped=9, Incomplete=2,
Source=/var/lib/pacemaker/pengine/pe-input-518.bz2): Stopped
Clearing the failcount fails, so the whole transition is aborted. This
makes it impossible to do anything in the cluster, such as moving the
Pool-0 resource, because that also triggers the clear_failcount
operation, which fails and aborts the transition. For example:
Dec 20 09:35:04 [57851] swdal1-ISCSI01 pengine: info:
RecurringOp: Start recurring monitor (5s) for ping_resource:0 on
swdal1-ISCSI01
Dec 20 09:35:04 [57851] swdal1-ISCSI01 pengine: info:
RecurringOp: Start recurring monitor (5s) for ping_resource:1 on
swdal1-ISCSI02
Dec 20 09:35:04 [57851] swdal1-ISCSI01 pengine: info:
RecurringOp: Start recurring monitor (10s) for Pool-0 on
swdal1-ISCSI02
Dec 20 09:35:04 [57851] swdal1-ISCSI01 pengine: notice:
LogActions: Restart ping_resource:0 (Started swdal1-ISCSI01)
Dec 20 09:35:04 [57851] swdal1-ISCSI01 pengine: notice:
LogActions: Restart ping_resource:1 (Started swdal1-ISCSI02)
Dec 20 09:35:04 [57851] swdal1-ISCSI01 pengine: info:
LogActions: Leave Pool-1 (Started swdal1-ISCSI01)
Dec 20 09:35:04 [57851] swdal1-ISCSI01 pengine: notice:
LogActions: Move Pool-0 (Started swdal1-ISCSI01 ->
swdal1-ISCSI02)
Dec 20 09:35:04 [57851] swdal1-ISCSI01 pengine: notice:
process_pe_message: Calculated Transition 19:
/var/lib/pacemaker/pengine/pe-input-519.bz2
Dec 20 09:35:04 [57862] swdal1-ISCSI01 crmd: info:
do_state_transition: State transition S_POLICY_ENGINE ->
S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
origin=handle_response ]
Dec 20 09:35:04 [57862] swdal1-ISCSI01 crmd: info:
do_te_invoke: Processing graph 19 (ref=pe_calc-dc-1671528904-75)
derived from /var/lib/pacemaker/pengine/pe-input-519.bz2
Dec 20 09:35:04 [57862] swdal1-ISCSI01 crmd: info:
te_crm_command: Executing crm-event (7): clear_failcount on
swdal1-ISCSI01
Dec 20 09:35:04 [57862] swdal1-ISCSI01 crmd: info:
handle_failcount_op: Removing failcount for ping_resource
Dec 20 09:35:04 [57841] swdal1-ISCSI01 cib: info:
cib_process_request: Forwarding cib_delete operation for section
//node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm_rsc_op[@id='ping_resource_last_failure_0']
to master (origin=local/crmd/134)
Dec 20 09:35:04 [57841] swdal1-ISCSI01 cib: info:
cib_process_request: Completed cib_delete operation for section
//node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm_rsc_op[@id='ping_resource_last_failure_0']:
OK (rc=0, origin=swdal1-ISCSI01/crmd/134, version=0.61.0)
Dec 20 09:35:09 [57841] swdal1-ISCSI01 cib: info:
cib_process_ping: Reporting our current digest to swdal1-ISCSI01:
decc3ad1315820648f242167998a5880 for 0.61.0 (0x55788a8408e0 0)
Dec 20 09:36:24 [57862] swdal1-ISCSI01 crmd: warning:
action_timer_callback: Timer popped (timeout=20000, abort_level=0,
complete=false)
Dec 20 09:36:24 [57862] swdal1-ISCSI01 crmd: error:
print_synapse: [Action 7]: In-flight crm op
ping_resource_clear_failcount_0 on swdal1-ISCSI01 (priority: 0,
waiting: none)
Dec 20 09:36:24 [57862] swdal1-ISCSI01 crmd: notice:
abort_transition_graph: Transition aborted: Action lost
(source=action_timer_callback:772, 0)
Dec 20 09:36:24 [57862] swdal1-ISCSI01 crmd: notice:
run_graph: Transition 19 (Complete=1, Pending=0, Fired=0,
Skipped=12, Incomplete=2,
Source=/var/lib/pacemaker/pengine/pe-input-519.bz2): Stopped
As you can see, the 'stop' operation for the Pool-0 resource never even
ran, because the transition was stopped by the clear_failcount error.
The error kept recurring until we restarted pacemaker. Here is some
more context from one of the occurrences:
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
process_pe_message: Input has not changed since last time, not
saving to disk
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: notice:
unpack_config: On loss of CCM Quorum: Ignore
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
determine_online_status: Node swdal1-ISCSI01 is online
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
determine_online_status: Node swdal1-ISCSI02 is online
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
determine_op_status: Operation monitor found resource Pool-0
active on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
determine_op_status: Operation monitor found resource Pool-0
active on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
determine_op_status: Operation monitor found resource Pool-1
active on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
determine_op_status: Operation monitor found resource Pool-1
active on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
clone_print: Clone Set: ping_resource-clone [ping_resource]
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
short_print: Started: [ swdal1-ISCSI01 swdal1-ISCSI02 ]
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
native_print: Pool-1 (ocf::oe:zfs): Started swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
native_print: Pool-0 (ocf::oe:zfs): Started swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
get_failcount_full: ping_resource:0 has failed 8 times on
swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
common_apply_stickiness: ping_resource-clone can fail 999992 more
times on swdal1-ISCSI01 before being forced off
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
get_failcount_full: ping_resource:1 has failed 8 times on
swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
common_apply_stickiness: ping_resource-clone can fail 999992 more
times on swdal1-ISCSI01 before being forced off
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
check_action_definition: params:reload <parameters multiplier="1000"
dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60
10.151.16.60" attempts="4" timeout="3"/>
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
check_action_definition: Parameters to ping_resource:0_start_0 on
swdal1-ISCSI01 changed: was 57524cd0b7204dd60c127ba66fb83cd2 vs. now
1a37c0e0391890df8549f5fda647f4d9 (reload:3.0.9)
0:0;14:28:0:a0f1b96e-5089-4dad-9073-8c8feac4ea3a
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
get_failcount_full: ping_resource:0 has failed 8 times on
swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
check_action_definition: params:reload <parameters multiplier="1000"
dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60
10.151.16.60" attempts="4" timeout="3" CRM_meta_timeout="15000"/>
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
check_action_definition: Parameters to ping_resource:0_monitor_5000 on
swdal1-ISCSI01 changed: was f3b4adf4d46692f312296263faa50a75 vs. now
c0d10fc8996c295dd1213d4ca058c0e7 (reload:3.0.9)
0:0;15:28:0:a0f1b96e-5089-4dad-9073-8c8feac4ea3a
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
get_failcount_full: ping_resource:0 has failed 8 times on
swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
check_action_definition: params:reload <parameters multiplier="1000"
dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60
10.151.16.60" attempts="4" timeout="3"/>
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
check_action_definition: Parameters to ping_resource:1_start_0 on
swdal1-ISCSI02 changed: was 57524cd0b7204dd60c127ba66fb83cd2 vs. now
1a37c0e0391890df8549f5fda647f4d9 (reload:3.0.9)
0:0;17:7:0:0ea53274-56ef-48f6-9de1-38d635fa2530
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
check_action_definition: params:reload <parameters multiplier="1000"
dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60
10.151.16.60" attempts="4" timeout="3" CRM_meta_timeout="15000"/>
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
check_action_definition: Parameters to ping_resource:1_monitor_5000 on
swdal1-ISCSI02 changed: was f3b4adf4d46692f312296263faa50a75 vs. now
c0d10fc8996c295dd1213d4ca058c0e7 (reload:3.0.9)
0:0;18:7:0:0ea53274-56ef-48f6-9de1-38d635fa2530
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
RecurringOp: Start recurring monitor (5s) for ping_resource:0 on
swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
RecurringOp: Start recurring monitor (5s) for ping_resource:1 on
swdal1-ISCSI02
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: notice:
LogActions: Restart ping_resource:0 (Started swdal1-ISCSI01)
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: notice:
LogActions: Restart ping_resource:1 (Started swdal1-ISCSI02)
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
LogActions: Leave Pool-1 (Started swdal1-ISCSI01)
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: info:
LogActions: Leave Pool-0 (Started swdal1-ISCSI01)
Dec 20 09:24:23 [57851] swdal1-ISCSI01 pengine: notice:
process_pe_message: Calculated Transition 11:
/var/lib/pacemaker/pengine/pe-input-518.bz2
Dec 20 09:24:23 [57862] swdal1-ISCSI01 crmd: info:
do_state_transition: State transition S_POLICY_ENGINE ->
S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
origin=handle_response ]
Dec 20 09:24:23 [57862] swdal1-ISCSI01 crmd: info:
do_te_invoke: Processing graph 11 (ref=pe_calc-dc-1671528262-59)
derived from /var/lib/pacemaker/pengine/pe-input-518.bz2
Dec 20 09:24:23 [57862] swdal1-ISCSI01 crmd: info:
te_crm_command: Executing crm-event (7): clear_failcount on
swdal1-ISCSI01
Dec 20 09:24:23 [57862] swdal1-ISCSI01 crmd: info:
handle_failcount_op: Removing failcount for ping_resource
Dec 20 09:24:23 [57841] swdal1-ISCSI01 cib: info:
cib_process_request: Forwarding cib_delete operation for section
//node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm_rsc_op[@id='ping_resource_last_failure_0']
to master (origin=local/crmd/118)
Dec 20 09:24:23 [57841] swdal1-ISCSI01 cib: info:
cib_process_request: Completed cib_delete operation for section
//node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm_rsc_op[@id='ping_resource_last_failure_0']:
OK (rc=0, origin=swdal1-ISCSI01/crmd/118, version=0.60.0)
Dec 20 09:24:28 [57841] swdal1-ISCSI01 cib: info:
cib_process_ping: Reporting our current digest to swdal1-ISCSI01:
ccf71244504d3deb02d0da64fa72cedc for 0.60.0 (0x55788a83c4b0 0)
Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: warning:
action_timer_callback: Timer popped (timeout=20000, abort_level=0,
complete=false)
Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: error:
print_synapse: [Action 7]: In-flight crm op
ping_resource_clear_failcount_0 on swdal1-ISCSI01 (priority: 0,
waiting: none)
Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: notice:
abort_transition_graph: Transition aborted: Action lost
(source=action_timer_callback:772, 0)
Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: notice:
run_graph: Transition 11 (Complete=1, Pending=0, Fired=0,
Skipped=9, Incomplete=2,
Source=/var/lib/pacemaker/pengine/pe-input-518.bz2): Stopped
Dec 20 09:25:43 [57862] swdal1-ISCSI01 crmd: info:
do_state_transition: State transition S_TRANSITION_ENGINE ->
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
I'd appreciate some information about this topic.
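
P.S. If it helps, the aborted transitions can also be replayed offline
from the saved policy-engine inputs. A sketch, using one of the
pe-input files named in the logs above:

 # Re-run the scheduler calculation for the aborted transition
 crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-518.bz2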
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/