Hello everyone,

I need your help to understand and debug a problem I'm facing in one of my
clusters.

I set up fencing as follows:

# pcs -f stonith_cfg stonith create fence_ao_pg01 fence_vmware_soap \
    ipaddr=<IP> ssl_insecure=1 login="<User>" passwd="<Passwd>" \
    pcmk_reboot_action=reboot pcmk_host_list="ao-pg01-p.axadmin.net" \
    power_wait=3 op monitor interval=60s
# pcs -f stonith_cfg stonith create fence_ao_pg02 fence_vmware_soap \
    ipaddr=<IP> ssl_insecure=1 login="<User>" passwd="<Passwd>" \
    pcmk_reboot_action=reboot pcmk_host_list="ao-pg02-p.axadmin.net" \
    power_wait=3 op monitor interval=60s

# pcs -f stonith_cfg constraint location fence_ao_pg01 avoids \
    ao-pg01-p.axadmin.net=INFINITY
# pcs -f stonith_cfg constraint location fence_ao_pg02 avoids \
    ao-pg02-p.axadmin.net=INFINITY

# pcs cluster cib-push stonith_cfg

pcs status shows everything OK for some time, and then it changes to:

[root@ao-pg01-p ~]# pcs status --full
Cluster name: ao_cl_p_01
Stack: corosync
Current DC: ao-pg01-p.axadmin.net (1) (version 1.1.19-8.el7_6.4-c3c624ea3d) - 
partition with quorum
Last updated: Tue May 21 12:18:46 2019
Last change: Fri May 17 18:54:32 2019 by hacluster via crmd on 
ao-pg01-p.axadmin.net

2 nodes configured
3 resources configured

Online: [ ao-pg01-p.axadmin.net (1) ao-pg02-p.axadmin.net (2) ]

Full list of resources:

 ao-cl-p-01-vip01    (ocf::heartbeat:IPaddr2):    Started ao-pg01-p.axadmin.net
 fence_ao_pg01    (stonith:fence_vmware_soap):    Stopped
 fence_ao_pg02    (stonith:fence_vmware_soap):    Stopped

Node Attributes:
* Node ao-pg01-p.axadmin.net (1):
* Node ao-pg02-p.axadmin.net (2):

Migration Summary:
* Node ao-pg02-p.axadmin.net (2):
   fence_ao_pg01: migration-threshold=1000000 fail-count=1000000 
last-failure='Sat May 18 00:22:22 2019'
* Node ao-pg01-p.axadmin.net (1):
   fence_ao_pg02: migration-threshold=1000000 fail-count=1000000 
last-failure='Fri May 17 20:52:53 2019'

Failed Actions:
* fence_ao_pg01_start_0 on ao-pg02-p.axadmin.net 'unknown error' (1): call=22, 
status=Timed Out, exitreason='',
    last-rc-change='Sat May 18 00:19:49 2019', queued=0ms, exec=20022ms
* fence_ao_pg02_start_0 on ao-pg01-p.axadmin.net 'unknown error' (1): call=84, 
status=Timed Out, exitreason='',
    last-rc-change='Fri May 17 20:52:33 2019', queued=0ms, exec=20032ms

PCSD Status:
  ao-pg02-p.axadmin.net: Online
  ao-pg01-p.axadmin.net: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled


From the output I can see a 'Timed Out' status, but I'd like to understand
whether this is a configuration issue or something else I'm not aware of.
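
To make the timing explicit, here is a minimal sketch that pulls the exec=
values out of the two Failed Actions entries and compares them with a 20000 ms
timeout. The sample lines are abridged copies of the pcs status output; the
20000 ms figure is taken from the timeout=20000ms shown for the monitor
operation in the log, and applying it to the start operations is an assumption,
since the status output does not print their timeout.

```shell
#!/bin/sh
# Abridged copies of the two "Failed Actions" entries from pcs status.
failed_actions='* fence_ao_pg01_start_0 on ao-pg02-p.axadmin.net: exec=20022ms
* fence_ao_pg02_start_0 on ao-pg01-p.axadmin.net: exec=20032ms'

# Assumption: the start operations used the same 20000 ms timeout that the
# log reports for the monitor operation.
timeout_ms=20000

# For each line, extract the digits between "exec=" and "ms" and report
# how far the run time exceeds the timeout.
report=$(printf '%s\n' "$failed_actions" | awk -v t="$timeout_ms" '
    match($0, /exec=[0-9]+ms/) {
        ms = substr($0, RSTART + 5, RLENGTH - 7)
        printf "%s ran %d ms, %d ms over the %d ms timeout\n", $2, ms, ms - t, t
    }')

printf '%s\n' "$report"
```

Both operations end only a few tens of milliseconds after the assumed limit,
which looks like an operation being stopped at the timeout rather than one
that failed quickly on its own.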

I'm attaching the part of the log that shows the problem on 17 May.

Regards,
Francisco Javier Lopez
IT System Engineer | Global IT


May 17 20:38:40 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
unpack_node_loop:  Node 1 is already processed
May 17 20:38:40 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
common_print:      ao-cl-p-01-vip01        (ocf::heartbeat:IPaddr2):       
Started ao-pg01-p.axadmin.net
May 17 20:38:40 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
common_print:      fence_ao_pg01   (stonith:fence_vmware_soap):    Started 
ao-pg02-p.axadmin.net
May 17 20:38:40 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
common_print:      fence_ao_pg02   (stonith:fence_vmware_soap):    Started 
ao-pg01-p.axadmin.net
May 17 20:38:40 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
pe_get_failcount:  fence_ao_pg02 has failed 12 times on ao-pg01-p.axadmin.net
May 17 20:38:40 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
check_migration_threshold: fence_ao_pg02 can fail 999988 more times on 
ao-pg01-p.axadmin.net before being forced off
...
...

May 17 20:52:33 [127215] ao-pg01-p.axadmin.net stonith-ng:     info: 
st_child_term:     Child 48496 timed out, sending SIGTERM
May 17 20:52:33 [127215] ao-pg01-p.axadmin.net stonith-ng:   notice: 
stonith_action_async_done: Child process 48496 performing action 'monitor' 
timed out with signal 15
May 17 20:52:33 [127215] ao-pg01-p.axadmin.net stonith-ng:   notice: 
log_operation:     Operation 'monitor' [48496] for device 'fence_ao_pg02' 
returned: -62 (Timer expired)
May 17 20:52:33 [127219] ao-pg01-p.axadmin.net       crmd:    error: 
process_lrm_event: Result of monitor operation for fence_ao_pg02 on 
ao-pg01-p.axadmin.net: Timed Out | call=81 key=fence_ao_pg02_monitor_60000 
timeout=20000ms
May 17 20:52:33 [127214] ao-pg01-p.axadmin.net        cib:     info: 
cib_process_request:       Forwarding cib_modify operation for section status 
to all (origin=local/crmd/210)
May 17 20:52:33 [127214] ao-pg01-p.axadmin.net        cib:     info: 
cib_perform_op:    Diff: --- 0.36.110 2
May 17 20:52:33 [127214] ao-pg01-p.axadmin.net        cib:     info: 
cib_perform_op:    Diff: +++ 0.36.111 (null)
May 17 20:52:33 [127214] ao-pg01-p.axadmin.net        cib:     info: 
cib_perform_op:    +  /cib:  @num_updates=111
May 17 20:52:33 [127214] ao-pg01-p.axadmin.net        cib:     info: 
cib_perform_op:    +  
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='fence_ao_pg02']/lrm_rsc_op[@id='fence_ao_pg02_last_failure_0']:
  @transition-key=3:41:0:5ee48ba5-e614-43dd-890f-3f930f78ce44, 
@transition-magic=2:1;3:41:0:5ee48ba5-e614-43dd-890f-3f930f78ce44, @call-id=81, 
@last-rc-change=1558119133, @exec-time=20042
May 17 20:52:33 [127214] ao-pg01-p.axadmin.net        cib:     info: 
cib_process_request:       Completed cib_modify operation for section status: 
OK (rc=0, origin=ao-pg01-p.axadmin.net/crmd/210, version=0.36.111)
May 17 20:52:33 [127215] ao-pg01-p.axadmin.net stonith-ng:     info: 
update_cib_stonith_devices_v2:     Updating device list from the cib: modify 
lrm_rsc_op[@id='fence_ao_pg02_last_failure_0']
May 17 20:52:33 [127215] ao-pg01-p.axadmin.net stonith-ng:     info: 
cib_devices_update:        Updating devices to version 0.36.111
May 17 20:52:33 [127219] ao-pg01-p.axadmin.net       crmd:     info: 
abort_transition_graph:    Transition aborted by operation 
fence_ao_pg02_monitor_60000 'modify' on ao-pg01-p.axadmin.net: Old event | 
magic=2:1;3:41:0:5ee48ba5-e614-43dd-890f-3f930f78ce44 cib=0.36.111 
source=process_graph_event:499 complete=true
May 17 20:52:33 [127219] ao-pg01-p.axadmin.net       crmd:     info: 
update_failcount:  Updating failcount for fence_ao_pg02 on 
ao-pg01-p.axadmin.net after failed monitor: rc=1 (update=value++, 
time=1558119153)
May 17 20:52:33 [127219] ao-pg01-p.axadmin.net       crmd:     info: 
process_graph_event:       Detected action (41.3) 
fence_ao_pg02_monitor_60000.81=unknown error: failed
...
...
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
determine_online_status_fencing:   Node ao-pg02-p.axadmin.net is active
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
determine_online_status:   Node ao-pg02-p.axadmin.net is online
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
determine_online_status_fencing:   Node ao-pg01-p.axadmin.net is active
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
determine_online_status:   Node ao-pg01-p.axadmin.net is online
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:  warning: 
unpack_rsc_op_failure:     Processing failed monitor of fence_ao_pg02 on 
ao-pg01-p.axadmin.net: unknown error | rc=1
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
unpack_node_loop:  Node 2 is already processed
May 17 20:52:33 [127219] ao-pg01-p.axadmin.net       crmd:     info: 
abort_transition_graph:    Transition aborted by 
status-1-fail-count-fence_ao_pg02.monitor_60000 doing modify 
fail-count-fence_ao_pg02#monitor_60000=13: Transient attribute change | 
cib=0.36.112 source=abort_unless_down:341 
path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-fail-count-fence_ao_pg02.monitor_60000']
 complete=true
...
....
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
common_print:      ao-cl-p-01-vip01        (ocf::heartbeat:IPaddr2):       
Started ao-pg01-p.axadmin.net
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
common_print:      fence_ao_pg01   (stonith:fence_vmware_soap):    Started 
ao-pg02-p.axadmin.net
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
common_print:      fence_ao_pg02   (stonith:fence_vmware_soap):    FAILED 
ao-pg01-p.axadmin.net
May 17 20:52:33 [127219] ao-pg01-p.axadmin.net       crmd:     info: 
abort_transition_graph:    Transition aborted by 
status-1-last-failure-fence_ao_pg02.monitor_60000 doing modify 
last-failure-fence_ao_pg02#monitor_60000=1558119153: Transient attribute change 
| cib=0.36.113 source=abort_unless_down:341 
path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-last-failure-fence_ao_pg02.monitor_60000']
 complete=true
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
pe_get_failcount:  fence_ao_pg02 has failed 12 times on ao-pg01-p.axadmin.net
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
check_migration_threshold: fence_ao_pg02 can fail 999988 more times on 
ao-pg01-p.axadmin.net before being forced off
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
RecurringOp:        Start recurring monitor (60s) for fence_ao_pg02 on 
ao-pg01-p.axadmin.net
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
LogActions:        Leave   ao-cl-p-01-vip01        (Started 
ao-pg01-p.axadmin.net)
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
LogActions:        Leave   fence_ao_pg01   (Started ao-pg02-p.axadmin.net)
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:   notice: LogAction: 
 * Recover    fence_ao_pg02        ( ao-pg01-p.axadmin.net )
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:   notice: 
process_pe_message:        Calculated transition 43, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-280.bz2
May 17 20:52:33 [127219] ao-pg01-p.axadmin.net       crmd:     info: 
handle_response:   pe_calc calculation pe_calc-dc-1558119153-115 is obsolete
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:   notice: 
unpack_config:     On loss of CCM Quorum: Ignore
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
determine_online_status_fencing:   Node ao-pg02-p.axadmin.net is active
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
determine_online_status:   Node ao-pg02-p.axadmin.net is online
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
determine_online_status_fencing:   Node ao-pg01-p.axadmin.net is active
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
determine_online_status:   Node ao-pg01-p.axadmin.net is online
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:  warning: 
unpack_rsc_op_failure:     Processing failed monitor of fence_ao_pg02 on 
ao-pg01-p.axadmin.net: unknown error | rc=1
...
...
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
common_print:      ao-cl-p-01-vip01        (ocf::heartbeat:IPaddr2):       
Started ao-pg01-p.axadmin.net
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
common_print:      fence_ao_pg01   (stonith:fence_vmware_soap):    Started 
ao-pg02-p.axadmin.net
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
common_print:      fence_ao_pg02   (stonith:fence_vmware_soap):    FAILED 
ao-pg01-p.axadmin.net
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
pe_get_failcount:  fence_ao_pg02 has failed 13 times on ao-pg01-p.axadmin.net
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
check_migration_threshold: fence_ao_pg02 can fail 999987 more times on 
ao-pg01-p.axadmin.net before being forced off
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
RecurringOp:        Start recurring monitor (60s) for fence_ao_pg02 on 
ao-pg01-p.axadmin.net
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
LogActions:        Leave   ao-cl-p-01-vip01        (Started 
ao-pg01-p.axadmin.net)
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:     info: 
LogActions:        Leave   fence_ao_pg01   (Started ao-pg02-p.axadmin.net)
May 17 20:52:33 [127218] ao-pg01-p.axadmin.net    pengine:   notice: LogAction: 
 * Recover    fence_ao_pg02        ( ao-pg01-p.axadmin.net )
...
...
May 17 20:52:33 [127219] ao-pg01-p.axadmin.net       crmd:     info: 
process_lrm_event: Result of monitor operation for fence_ao_pg02 on 
ao-pg01-p.axadmin.net: Cancelled | call=81 key=fence_ao_pg02_monitor_60000 
confirmed=true
May 17 20:52:33 [127214] ao-pg01-p.axadmin.net        cib:     info: 
cib_process_request:       Completed cib_modify operation for section status: 
OK (rc=0, origin=ao-pg01-p.axadmin.net/crmd/214, version=0.36.114)
May 17 20:52:33 [127215] ao-pg01-p.axadmin.net stonith-ng:     info: 
update_cib_stonith_devices_v2:     Updating device list from the cib: modify 
lrm_rsc_op[@id='fence_ao_pg02_last_0']
May 17 20:52:33 [127215] ao-pg01-p.axadmin.net stonith-ng:     info: 
cib_devices_update:        Updating devices to version 0.36.114
May 17 20:52:33 [127219] ao-pg01-p.axadmin.net       crmd:   notice: 
process_lrm_event: Result of stop operation for fence_ao_pg02 on 
ao-pg01-p.axadmin.net: 0 (ok) | call=83 key=fence_ao_pg02_stop_0 confirmed=true 
cib-update=215
May 17 20:52:33 [127215] ao-pg01-p.axadmin.net stonith-ng:   notice: 
unpack_config:     On loss of CCM Quorum: Ignore
May 17 20:52:33 [127214] ao-pg01-p.axadmin.net        cib:     info: 
cib_process_request:       Forwarding cib_modify operation for section status 
to all (origin=local/crmd/215)
May 17 20:52:33 [127215] ao-pg01-p.axadmin.net stonith-ng:     info: 
cib_device_update: Device fence_ao_pg01 has been disabled on 
ao-pg01-p.axadmin.net: score=-INFINITY
...
...
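
The log mixes wall-clock timestamps with epoch seconds
(@last-rc-change=1558119133 and time=1558119153). A small sketch for
converting them, assuming GNU date (the -d "@N" form), which is what an el7
node would normally have:

```shell
#!/bin/sh
# Epoch values copied from the log excerpt above.
last_rc_change=1558119133   # @last-rc-change on fence_ao_pg02_last_failure_0
failcount_time=1558119153   # time= in the update_failcount line

# Assumption: GNU date is available; -u prints UTC so the result does not
# depend on the local timezone of the machine running the script.
rc_change_utc=$(date -u -d "@$last_rc_change" '+%Y-%m-%d %H:%M:%S')
failcount_utc=$(date -u -d "@$failcount_time" '+%Y-%m-%d %H:%M:%S')
elapsed=$((failcount_time - last_rc_change))

echo "last-rc-change (UTC):   $rc_change_utc"
echo "failcount update (UTC): $failcount_utc"
echo "difference: ${elapsed}s"
```

The two values are 20 seconds apart, which lines up with timeout=20000ms and
@exec-time=20042 in the same log entries. The UTC result of 18:52:33 would
match the 20:52:33 log timestamps if the node's clock is on UTC+2, which is an
assumption about the node's timezone.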
