It's an old installation; the error started appearing when one of the nodes was disconnected and the OS was reinstalled, after which the Pacemaker configuration was restored from a backup (pcs cluster cib-push, sketched after the config below) and the node rejoined the cluster. The failcount itself had been at 8 for some time before that, though. The configuration looks like this:

pcs config

Cluster Name:
Corosync Nodes:
 10.151.50.43 10.151.50.42
Pacemaker Nodes:
 swdal1-ISCSI01 swdal1-ISCSI02

Resources:
 Clone: ping_resource-clone
  Resource: ping_resource (class=ocf provider=pacemaker type=ping)
   Attributes: multiplier=1000 dampen=15s attempts=4 timeout=3 host_list="10.151.16.60 10.151.16.50 10.151.17.50 10.151.17.60"
   Operations: start interval=0s timeout=60 (ping_resource-start-interval-0s)
               stop interval=0s timeout=20 (ping_resource-stop-interval-0s)
               monitor interval=5s timeout=15s (ping_resource-monitor-interval-5s)
 Resource: Pool-1 (class=ocf provider=oe type=zfs)
  Attributes: pool_name=Pool-1 pool_id=9200090953161398950 encryption_password_hash=None
  Meta Attrs: failure-timeout=30 is-managed=True encryption_password_hash=None
  Operations: start interval=0s timeout=300 (Pool-1-start-interval-0s)
              stop interval=0s timeout=300 (Pool-1-stop-interval-0s)
              monitor interval=10 timeout=60 (Pool-1-monitor-interval-10)
 Resource: Pool-0 (class=ocf provider=oe type=zfs)
  Attributes: pool_name=Pool-0 pool_id=4165732781319344895 encryption_password_hash=None
  Meta Attrs: failure-timeout=30 is-managed=True encryption_password_hash=None
  Operations: start interval=0s timeout=300 (Pool-0-start-interval-0s)
              stop interval=0s timeout=300 (Pool-0-stop-interval-0s)
              monitor interval=10 timeout=60 (Pool-0-monitor-interval-10)

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: Pool-0
    Enabled on: swdal1-ISCSI01 (score:1) (id:location-Pool-0-swdal1-ISCSI01-1)
    Constraint: location-Pool-0
      Rule: score=-INFINITY boolean-op=or (id:location-Pool-0-rule)
        Expression: pingd lt 1  (id:location-Pool-0-rule-expr)
        Expression: not_defined pingd (id:location-Pool-0-rule-expr-1)
  Resource: Pool-1
    Enabled on: swdal1-ISCSI01 (score:1) (id:location-Pool-1-swdal1-ISCSI01-1)
    Constraint: location-Pool-1
      Rule: score=-INFINITY boolean-op=or (id:location-Pool-1-rule)
        Expression: pingd lt 1  (id:location-Pool-1-rule-expr)
        Expression: not_defined pingd (id:location-Pool-1-rule-expr-1)
Ordering Constraints:
Colocation Constraints:

Resources Defaults:
 resource-stickiness: 100000
Operations Defaults:
 record-pending: true

Cluster Properties:
 batch-limit: 1
 cluster-infrastructure: corosync
 cluster-recheck-interval: 180
 dc-version: 1.1.12-1.1.12+git+561c4cf
 no-quorum-policy: ignore
 stonith-enabled: false
 stop-orphan-resources: false
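
For reference, the restore mentioned above was done with the standard pcs
CIB commands, roughly along the lines of the sketch below (the file name
here is just an illustration, not necessarily the exact one that was used):

# Dump the running CIB to a file; this is how such a backup is typically taken
pcs cluster cib cib-backup.xml

# After the reinstalled node rejoined, push the saved configuration back
pcs cluster cib-push cib-backup.xml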

On 02.01.2023 at 13:12, Ulrich Windl wrote:
Hi!

I wonder: Is this a new installation, or is it a new bug in an old installation?
In the first case I'd recommend starting with current software; in the
second case, please describe what changed or what triggered the
situation.
Also provide basic configuration data, please.

Regards,
Ulrich


Krzysztof Bodora <krzysztof.bod...@open-e.com> wrote on 02.01.2023 at 12:16
in message <37d86b8d-c59f-5fe3-cba1-41d2c84fc...@open-e.com>:
Hello Clusterlabs,

I'm getting this error in the logs:

Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:    error:
print_synapse:     [Action    7]: In-flight crm op
ping_resource_clear_failcount_0   on swdal1-ISCSI01 (priority: 0,
waiting: none)

My specifications:

OS: Debian 8
Pacemaker version: 1.1.12
Kernel version: 4.19.190

I'd like to know what can cause this error and how to prevent it in the
future. I'm also currently unable to update to a newer version of
Pacemaker.

Here is some context for when it happens. It seems that the
ping_resource clone instances are scheduled for a 'Restart':

Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:   notice:
LogActions:        Restart ping_resource:0 (Started swdal1-ISCSI01)
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:   notice:
LogActions:        Restart ping_resource:1 (Started swdal1-ISCSI02)

which causes pacemaker to try to clear the failcounts on those resources:

Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info:
do_state_transition:       State transition S_POLICY_ENGINE ->
S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
origin=handle_response ]
Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info:
do_te_invoke:      Processing graph 11 (ref=pe_calc-dc-1671528262-59)
derived from /var/lib/pacemaker/pengine/pe-input-518.bz2
Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info:
te_crm_command:    Executing crm-event (7): clear_failcount on
swdal1-ISCSI01
Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info:
handle_failcount_op:       Removing failcount for ping_resource
Dec 20 09:24:23 [57841] swdal1-ISCSI01        cib:     info:
cib_process_request:       Forwarding cib_delete operation for section
//node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm
_rsc_op[@id='ping_resource_last_failure_0']
to master (origin=local/crmd/118)
Dec 20 09:24:23 [57841] swdal1-ISCSI01        cib:     info:
cib_process_request:       Completed cib_delete operation for section
//node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm
_rsc_op[@id='ping_resource_last_failure_0']:
OK (rc=0, origin=swdal1-ISCSI01/crmd/118, version=0.60.0)
Dec 20 09:24:28 [57841] swdal1-ISCSI01        cib:     info:
cib_process_ping:  Reporting our current digest to swdal1-ISCSI01:
ccf71244504d3deb02d0da64fa72cedc for 0.60.0 (0x55788a83c4b0 0)
Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:  warning:
action_timer_callback:     Timer popped (timeout=20000, abort_level=0,
complete=false)
Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:    error:
print_synapse:     [Action    7]: In-flight crm op
ping_resource_clear_failcount_0   on swdal1-ISCSI01 (priority: 0,
waiting: none)
Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:   notice:
abort_transition_graph:    Transition aborted: Action lost
(source=action_timer_callback:772, 0)
Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:   notice:
run_graph:         Transition 11 (Complete=1, Pending=0, Fired=0,
Skipped=9, Incomplete=2,
Source=/var/lib/pacemaker/pengine/pe-input-518.bz2): Stopped

Clearing the failcount fails, so the whole transition is aborted. This
makes it impossible to do anything in the cluster, such as moving the
Pool-0 resource, because that also triggers the clear_failcount
operation, which fails and aborts the transition. For example:

Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:     info:
RecurringOp:        Start recurring monitor (5s) for ping_resource:0 on
swdal1-ISCSI01
Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:     info:
RecurringOp:        Start recurring monitor (5s) for ping_resource:1 on
swdal1-ISCSI02
Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:     info:
RecurringOp:        Start recurring monitor (10s) for Pool-0 on
swdal1-ISCSI02
Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:   notice:
LogActions:        Restart ping_resource:0 (Started swdal1-ISCSI01)
Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:   notice:
LogActions:        Restart ping_resource:1 (Started swdal1-ISCSI02)
Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:     info:
LogActions:        Leave   Pool-1  (Started swdal1-ISCSI01)
Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:   notice:
LogActions:        Move    Pool-0  (Started swdal1-ISCSI01 ->
swdal1-ISCSI02)
Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:   notice:
process_pe_message:        Calculated Transition 19:
/var/lib/pacemaker/pengine/pe-input-519.bz2
Dec 20 09:35:04 [57862] swdal1-ISCSI01       crmd:     info:
do_state_transition:       State transition S_POLICY_ENGINE ->
S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
origin=handle_response ]
Dec 20 09:35:04 [57862] swdal1-ISCSI01       crmd:     info:
do_te_invoke:      Processing graph 19 (ref=pe_calc-dc-1671528904-75)
derived from /var/lib/pacemaker/pengine/pe-input-519.bz2
Dec 20 09:35:04 [57862] swdal1-ISCSI01       crmd:     info:
te_crm_command:    Executing crm-event (7): clear_failcount on
swdal1-ISCSI01
Dec 20 09:35:04 [57862] swdal1-ISCSI01       crmd:     info:
handle_failcount_op:       Removing failcount for ping_resource
Dec 20 09:35:04 [57841] swdal1-ISCSI01        cib:     info:
cib_process_request:       Forwarding cib_delete operation for section
//node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm
_rsc_op[@id='ping_resource_last_failure_0']
to master (origin=local/crmd/134)
Dec 20 09:35:04 [57841] swdal1-ISCSI01        cib:    info:
cib_process_request:       Completed cib_delete operation for section
//node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm
_rsc_op[@id='ping_resource_last_failure_0']:
OK (rc=0, origin=swdal1-ISCSI01/crmd/134, version=0.61.0)
Dec 20 09:35:09 [57841] swdal1-ISCSI01        cib:     info:
cib_process_ping:  Reporting our current digest to swdal1-ISCSI01:
decc3ad1315820648f242167998a5880 for 0.61.0 (0x55788a8408e0 0)
Dec 20 09:36:24 [57862] swdal1-ISCSI01       crmd:  warning:
action_timer_callback:     Timer popped (timeout=20000, abort_level=0,
complete=false)
Dec 20 09:36:24 [57862] swdal1-ISCSI01       crmd:    error:
print_synapse:     [Action    7]: In-flight crm op
ping_resource_clear_failcount_0   on swdal1-ISCSI01 (priority: 0,
waiting: none)
Dec 20 09:36:24 [57862] swdal1-ISCSI01       crmd:   notice:
abort_transition_graph:    Transition aborted: Action lost
(source=action_timer_callback:772, 0)
Dec 20 09:36:24 [57862] swdal1-ISCSI01       crmd:   notice:
run_graph:         Transition 19 (Complete=1, Pending=0, Fired=0,
Skipped=12, Incomplete=2,
Source=/var/lib/pacemaker/pengine/pe-input-519.bz2): Stopped

As you can see, the 'stop' operation for resource Pool-0 did not even
run, because the transition was stopped by the clear_failcount error. This
error kept happening until we restarted Pacemaker. Here is some more
context from one of the times this error happened:

Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
process_pe_message:        Input has not changed since last time, not
saving to disk
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:   notice:
unpack_config:     On loss of CCM Quorum: Ignore
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
determine_online_status:   Node swdal1-ISCSI01 is online
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
determine_online_status:   Node swdal1-ISCSI02 is online
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
determine_op_status:       Operation monitor found resource Pool-0
active on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
determine_op_status:       Operation monitor found resource Pool-0
active on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
determine_op_status:       Operation monitor found resource Pool-1
active on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
determine_op_status:       Operation monitor found resource Pool-1
active on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
clone_print:        Clone Set: ping_resource-clone [ping_resource]
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
short_print:            Started: [ swdal1-ISCSI01 swdal1-ISCSI02 ]
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
native_print:      Pool-1  (ocf::oe:zfs):  Started swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
native_print:      Pool-0  (ocf::oe:zfs):  Started swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
get_failcount_full:        ping_resource:0 has failed 8 times on
swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
common_apply_stickiness:   ping_resource-clone can fail 999992 more
times on swdal1-ISCSI01 before being forced off
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
get_failcount_full:        ping_resource:1 has failed 8 times on
swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
common_apply_stickiness:   ping_resource-clone can fail 999992 more
times on swdal1-ISCSI01 before being forced off
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
check_action_definition:   params:reload   <parameters multiplier="1000"
dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60
10.151.16.60" attempts="4" timeout="3"/>
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
check_action_definition:   Parameters to ping_resource:0_start_0 on
swdal1-ISCSI01 changed: was 57524cd0b7204dd60c127ba66fb83cd2 vs. now
1a37c0e0391890df8549f5fda647f4d9 (reload:3.0.9)
0:0;14:28:0:a0f1b96e-5089-4dad-9073-8c8feac4ea3a
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
get_failcount_full:        ping_resource:0 has failed 8 times on
swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
check_action_definition:   params:reload   <parameters multiplier="1000"
dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60
10.151.16.60" attempts="4" timeout="3" CRM_meta_timeout="15000"/>
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
check_action_definition:   Parameters to ping_resource:0_monitor_5000 on
swdal1-ISCSI01 changed: was f3b4adf4d46692f312296263faa50a75 vs. now
c0d10fc8996c295dd1213d4ca058c0e7 (reload:3.0.9)
0:0;15:28:0:a0f1b96e-5089-4dad-9073-8c8feac4ea3a
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
get_failcount_full:        ping_resource:0 has failed 8 times on
swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
check_action_definition:   params:reload   <parameters multiplier="1000"
dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60
10.151.16.60" attempts="4" timeout="3"/>
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
check_action_definition:   Parameters to ping_resource:1_start_0 on
swdal1-ISCSI02 changed: was 57524cd0b7204dd60c127ba66fb83cd2 vs. now
1a37c0e0391890df8549f5fda647f4d9 (reload:3.0.9)
0:0;17:7:0:0ea53274-56ef-48f6-9de1-38d635fa2530
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
check_action_definition:   params:reload   <parameters multiplier="1000"
dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60
10.151.16.60" attempts="4" timeout="3" CRM_meta_timeout="15000"/>
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
check_action_definition:   Parameters to ping_resource:1_monitor_5000 on
swdal1-ISCSI02 changed: was f3b4adf4d46692f312296263faa50a75 vs. now
c0d10fc8996c295dd1213d4ca058c0e7 (reload:3.0.9)
0:0;18:7:0:0ea53274-56ef-48f6-9de1-38d635fa2530
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
RecurringOp:        Start recurring monitor (5s) for ping_resource:0 on
swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
RecurringOp:        Start recurring monitor (5s) for ping_resource:1 on
swdal1-ISCSI02
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:   notice:
LogActions:        Restart ping_resource:0 (Started swdal1-ISCSI01)
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:   notice:
LogActions:        Restart ping_resource:1 (Started swdal1-ISCSI02)
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
LogActions:        Leave   Pool-1  (Started swdal1-ISCSI01)
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info:
LogActions:        Leave   Pool-0  (Started swdal1-ISCSI01)
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:   notice:
process_pe_message:        Calculated Transition 11:
/var/lib/pacemaker/pengine/pe-input-518.bz2
Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info:
do_state_transition:       State transition S_POLICY_ENGINE ->
S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
origin=handle_response ]
Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info:
do_te_invoke:      Processing graph 11 (ref=pe_calc-dc-1671528262-59)
derived from /var/lib/pacemaker/pengine/pe-input-518.bz2
Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info:
te_crm_command:    Executing crm-event (7): clear_failcount on
swdal1-ISCSI01
Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info:
handle_failcount_op:       Removing failcount for ping_resource
Dec 20 09:24:23 [57841] swdal1-ISCSI01        cib:     info:
cib_process_request:       Forwarding cib_delete operation for section
//node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm
_rsc_op[@id='ping_resource_last_failure_0']
to master (origin=local/crmd/118)
Dec 20 09:24:23 [57841] swdal1-ISCSI01        cib:     info:
cib_process_request:       Completed cib_delete operation for section
//node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm
_rsc_op[@id='ping_resource_last_failure_0']:
OK (rc=0, origin=swdal1-ISCSI01/crmd/118, version=0.60.0)
Dec 20 09:24:28 [57841] swdal1-ISCSI01        cib:     info:
cib_process_ping:  Reporting our current digest to swdal1-ISCSI01:
ccf71244504d3deb02d0da64fa72cedc for 0.60.0 (0x55788a83c4b0 0)
Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:  warning:
action_timer_callback:     Timer popped (timeout=20000, abort_level=0,
complete=false)
Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:    error:
print_synapse:     [Action    7]: In-flight crm op
ping_resource_clear_failcount_0   on swdal1-ISCSI01 (priority: 0,
waiting: none)
Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:   notice:
abort_transition_graph:    Transition aborted: Action lost
(source=action_timer_callback:772, 0)
Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:   notice:
run_graph:         Transition 11 (Complete=1, Pending=0, Fired=0,
Skipped=9, Incomplete=2,
Source=/var/lib/pacemaker/pengine/pe-input-518.bz2): Stopped
Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:     info:
do_state_transition:       State transition S_TRANSITION_ENGINE ->
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
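
For reference, the failcount and failed-operation history that this
clear_failcount action is trying to remove can also be inspected and
cleared by hand; a minimal sketch, using the resource and node names from
the logs above, just as the manual equivalent of that step:

# Show the current failcount for ping_resource on each node
pcs resource failcount show ping_resource

# Clear the failcount and failed-operation history for the resource
pcs resource cleanup ping_resource

# Or, per node, with the lower-level tool
crm_resource --cleanup --resource ping_resource --node swdal1-ISCSI01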

I'd appreciate some information about this topic.

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

