Hello. Please forgive the length of this email, but I wanted to provide as much detail as possible.
I'm trying to set up a two-node cluster for my service, and I have a problem with the scenario where the network between the nodes breaks and they can no longer see each other. This causes split-brain. I know the proper way to handle this would be to employ STONITH, but that is not feasible for me right now: I don't have the necessary hardware support, and I don't want to introduce another point of failure by adding shared-storage-based STONITH.

To work around the split-brain scenario I introduced pingd into my cluster, which in theory should do what I need: pingd pings a network device, so when the NIC is broken on one of my nodes, that node should not run the resources, because pingd fails there. The pingd resource is configured to update the node attribute 'pingd' (interval: 5s, dampen: 3s, multiplier: 1000). Based on the value of pingd I have a location constraint that sets the score to -INFINITY for the resource DimProdClusterIP when 'pingd' is not 1000. All other resources are colocated with DimProdClusterIP, and DimProdClusterIP must start before all other resources.

With that setup I would expect that when the resources are running on dimprod01 and I disconnect dimprod02 from the network, the resources will not start on dimprod02. Unfortunately, I see that after the token interval plus the consensus interval my resources are brought up on dimprod02 for a moment and then go down again. This is undesirable, as it causes DRBD split-brain inconsistency, and the cluster IP may also be taken over by the node that is down. I tried to debug it, but I can't figure out why it doesn't work. I would appreciate any help/pointers.
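For reference, the ping clone and the location rule were created with pcs commands roughly like the following (reconstructed from memory, so the exact option spelling may differ slightly from what I actually ran):

```shell
# Clone of ocf:pacemaker:ping on every node; it writes the 'pingd'
# node attribute (multiplier * number of reachable hosts in host_list).
pcs resource create ping ocf:pacemaker:ping \
    host_list=193.30.22.33 multiplier=1000 dampen=3s \
    op monitor interval=5s timeout=4s \
    clone

# Ban DimProdClusterIP (and, via colocation, everything else) from any
# node whose 'pingd' attribute is not 1000.
pcs constraint location DimProdClusterIP rule score=-INFINITY pingd ne 1000
```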
Below are some details of my setup and a snippet of the Pacemaker logs with comments.

Setup details:

pcs status:

Cluster name: dimprodcluster
Cluster Summary:
  * Stack: corosync
  * Current DC: dimprod02 (version 2.0.5-9.el8_4.1-ba59be7122) - partition with quorum
  * Last updated: Tue Aug 3 08:20:32 2021
  * Last change: Mon Aug 2 18:24:39 2021 by root via cibadmin on dimprod01
  * 2 nodes configured
  * 8 resource instances configured

Node List:
  * Online: [ dimprod01 dimprod02 ]

Full List of Resources:
  * DimProdClusterIP (ocf::heartbeat:IPaddr2): Started dimprod01
  * WyrDimProdServer (systemd:wyr-dim): Started dimprod01
  * Clone Set: WyrDimProdServerData-clone [WyrDimProdServerData] (promotable):
    * Masters: [ dimprod01 ]
    * Slaves: [ dimprod02 ]
  * WyrDimProdFS (ocf::heartbeat:Filesystem): Started dimprod01
  * DimTestClusterIP (ocf::heartbeat:IPaddr2): Started dimprod01
  * Clone Set: ping-clone [ping]:
    * Started: [ dimprod01 dimprod02 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

pcs constraint:

Location Constraints:
  Resource: DimProdClusterIP
    Constraint: location-DimProdClusterIP
      Rule: score=-INFINITY
        Expression: pingd ne 1000
Ordering Constraints:
  start DimProdClusterIP then promote WyrDimProdServerData-clone (kind:Mandatory)
  promote WyrDimProdServerData-clone then start WyrDimProdFS (kind:Mandatory)
  start WyrDimProdFS then start WyrDimProdServer (kind:Mandatory)
  start WyrDimProdServer then start DimTestClusterIP (kind:Mandatory)
Colocation Constraints:
  WyrDimProdServer with DimProdClusterIP (score:INFINITY)
  DimTestClusterIP with DimProdClusterIP (score:INFINITY)
  WyrDimProdServerData-clone with DimProdClusterIP (score:INFINITY) (with-rsc-role:Master)
  WyrDimProdFS with DimProdClusterIP (score:INFINITY)
Ticket Constraints:

pcs resource config ping:

Resource: ping (class=ocf provider=pacemaker type=ping)
  Attributes: dampen=3s host_list=193.30.22.33 multiplier=1000
  Operations: monitor interval=5s timeout=4s (ping-monitor-interval-5s)
              start interval=0s timeout=60s (ping-start-interval-0s)
              stop interval=0s timeout=5s (ping-stop-interval-0s)

cat /etc/corosync/corosync.conf:

totem {
    version: 2
    cluster_name: dimprodcluster
    transport: knet
    crypto_cipher: aes256
    crypto_hash: sha256
    token: 10000
    interface {
        knet_ping_interval: 1000
        knet_ping_timeout: 1000
    }
}
nodelist {
    node {
        ring0_addr: dimprod01
        name: dimprod01
        nodeid: 1
    }
    node {
        ring0_addr: dimprod02
        name: dimprod02
        nodeid: 2
    }
}
quorum {
    provider: corosync_votequorum
    two_node: 1
}
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
    debug: on
}

Logs:

When the network is connected, 'pingd' takes the value 1000:

Aug 03 08:23:01 dimprod02.my.clustertest.com pacemaker-attrd [2827046] (attrd_client_update) debug: Broadcasting pingd[dimprod02]=1000 (writer)
Aug 03 08:23:01 dimprod02.my.clustertest.com attrd_updater [3369856] (pcmk__node_attr_request) debug: Asked pacemaker-attrd to update pingd=1000 for dimprod02: OK (0)
Aug 03 08:23:01 dimprod02.my.clustertest.com attrd_updater [3369856] (crm_xml_cleanup) info: Cleaning up memory from libxml2
Aug 03 08:23:01 dimprod02.my.clustertest.com attrd_updater [3369856] (crm_exit) info: Exiting attrd_updater | with status 0

When the network is down, 'pingd' is updated to 0:

Aug 03 08:23:09 dimprod02.my.clustertest.com pacemaker-attrd [2827046] (attrd_client_update) debug: Broadcasting pingd[dimprod02]=0 (writer)
Aug 03 08:23:09 dimprod02.my.clustertest.com attrd_updater [3370053] (pcmk__node_attr_request) debug: Asked pacemaker-attrd to update pingd=0 for dimprod02: OK (0)
Aug 03 08:23:09 dimprod02.my.clustertest.com attrd_updater [3370053] (crm_xml_cleanup) info: Cleaning up memory from libxml2
Aug 03 08:23:09 dimprod02.my.clustertest.com attrd_updater [3370053] (crm_exit) info: Exiting attrd_updater | with status 0

And again:

Aug 03 08:23:17 dimprod02.my.clustertest.com pacemaker-attrd [2827046] (attrd_client_update) debug: Broadcasting pingd[dimprod02]=0 (writer)
Aug 03 08:23:17
dimprod02.my.clustertest.com attrd_updater [3370109] (pcmk__node_attr_request) debug: Asked pacemaker-attrd to update pingd=0 for dimprod02: OK (0)
Aug 03 08:23:17 dimprod02.my.clustertest.com attrd_updater [3370109] (crm_xml_cleanup) info: Cleaning up memory from libxml2
Aug 03 08:23:17 dimprod02.my.clustertest.com attrd_updater [3370109] (crm_exit) info: Exiting attrd_updater | with status 0

Then the node realizes it is no longer connected to the other node:

Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd [2827046] (pcmk_cpg_membership) info: Group attrd event 8: dimprod01 (node 1 pid 2118843) left via cluster exit
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based [2827043] (pcmk_cpg_membership) info: Group cib event 8: dimprod01 (node 1 pid 2118840) left via cluster exit
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd [2827046] (crm_update_peer_proc) info: pcmk_cpg_membership: Node dimprod01[1] - corosync-cpg is now offline
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based [2827043] (crm_update_peer_proc) info: pcmk_cpg_membership: Node dimprod01[1] - corosync-cpg is now offline
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (pcmk_cpg_membership) info: Group crmd event 8: dimprod01 (node 1 pid 2118845) left via cluster exit
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced [2827044] (pcmk_cpg_membership) info: Group stonith-ng event 8: dimprod01 (node 1 pid 2118841) left via cluster exit
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (crm_update_peer_proc) info: pcmk_cpg_membership: Node dimprod01[1] - corosync-cpg is now offline
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd [2827046] (crm_update_peer_state_iter) notice: Node dimprod01 state is now lost | nodeid=1 previous=member source=crm_update_peer_proc
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced [2827044] (crm_update_peer_proc) info: pcmk_cpg_membership: Node dimprod01[1] - corosync-cpg is now offline
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (peer_update_callback) info: Node dimprod01 is no longer a peer | DC=true old=0x4000000 new=0x0000000
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd [2827046] (attrd_peer_remove) notice: Removing all dimprod01 attributes for peer loss
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based [2827043] (crm_update_peer_state_iter) notice: Node dimprod01 state is now lost | nodeid=1 previous=member source=crm_update_peer_proc
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd [2827046] (attrd_peer_remove) debug: Removed #attrd-protocol[dimprod01] for peer loss
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based [2827043] (crm_reap_dead_member) info: Removing node with name dimprod01 and id 1 from membership cache
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd [2827046] (attrd_peer_remove) debug: Removed master-WyrDimProdServerData[dimprod01] for peer loss
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced [2827044] (crm_update_peer_state_iter) notice: Node dimprod01 state is now lost | nodeid=1 previous=member source=crm_update_peer_proc
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based [2827043] (reap_crm_member) notice: Purged 1 peer with id=1 and/or uname=dimprod01 from the membership cache
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd [2827046] (attrd_peer_remove) debug: Removed last-failure-WyrDimProdFS#start_0[dimprod01] for peer loss
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (controld_delete_node_state) info: Deleting transient attributes for node dimprod01 (via CIB call 466) | xpath=//node_state[@uname='dimprod01']/transient_attributes
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based [2827043] (pcmk_cpg_membership) info: Group cib event 8: dimprod02 (node 2 pid 2827043) is member
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd [2827046] (attrd_peer_remove) debug: Removed fail-count-WyrDimProdFS#start_0[dimprod01] for peer loss
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced [2827044] (st_peer_update_callback) debug: Broadcasting our uname because of node 1
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd [2827046] (attrd_peer_remove) debug: Removed pingd[dimprod01] for peer loss
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd [2827046] (crm_reap_dead_member) info: Removing node with name dimprod01 and id 1 from membership cache
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced [2827044] (crm_reap_dead_member) info: Removing node with name dimprod01 and id 1 from membership cache
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd [2827046] (reap_crm_member) notice: Purged 1 peer with id=1 and/or uname=dimprod01 from the membership cache
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd [2827046] (pcmk_cpg_membership) info: Group attrd event 8: dimprod02 (node 2 pid 2827046) is member
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (match_down_event) debug: No reason to expect node 1 to be down
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced [2827044] (reap_crm_member) notice: Purged 1 peer with id=1 and/or uname=dimprod01 from the membership cache
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced [2827044] (pcmk_cpg_membership) info: Group stonith-ng event 8: dimprod02 (node 2 pid 2827044) is member
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (peer_update_callback) warning: Stonith/shutdown of node dimprod01 was not expected
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (abort_transition_graph) info: Transition 99 aborted: Node failure | source=peer_update_callback:280 complete=true
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based [2827043] (cib_process_request) info: Forwarding cib_delete operation for section //node_state[@uname='dimprod01']/transient_attributes to all (origin=local/crmd/466)
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (pcmk_cpg_membership) info: Group crmd event 8: dimprod02 (node 2 pid 2827048) is member
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (pcmk__set_flags_as) debug: FSA action flags 0x2000000000000 (new_actions) for controller set by s_crmd_fsa:198
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (s_crmd_fsa) debug: Processing I_PE_CALC: [ state=S_IDLE cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based [2827043] (cib_process_request) info: Forwarding cib_modify operation for section status to all (origin=local/crmd/467)
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (do_state_transition) notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (pcmk__set_flags_as) debug: FSA action flags 0x00000020 (A_INTEGRATE_TIMER_STOP) for controller set by do_state_transition:559
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (pcmk__set_flags_as) debug: FSA action flags 0x00000080 (A_FINALIZE_TIMER_STOP) for controller set by do_state_transition:565
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (pcmk__set_flags_as) debug: FSA action flags 0x00000200 (A_DC_TIMER_STOP) for controller set by do_state_transition:569
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (check_join_counts) debug: Sole active cluster node is fully joined
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (pcmk__clear_flags_as) debug: FSA action flags 0x00000200 (an_action) for controller cleared by do_fsa_action:108
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (pcmk__clear_flags_as) debug: FSA action flags 0x00000020 (an_action) for controller cleared by do_fsa_action:108
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (pcmk__clear_flags_as) debug: FSA action flags 0x00000080 (an_action) for controller cleared by do_fsa_action:108
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (pcmk__clear_flags_as) debug: FSA action flags 0x2000000000000 (an_action) for controller cleared by do_fsa_action:108
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (do_pe_invoke) debug: Query 468: Requesting the current CIB: S_POLICY_ENGINE
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (pcmk_quorum_notification) info: Quorum retained | membership=1140 members=1
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd [2827046] (attrd_peer_update) notice: Setting pingd[dimprod02]: 1000 -> 0 | from dimprod02
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (pcmk_quorum_notification) debug: Member[0] 2
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (crm_update_peer_state_iter) notice: Node dimprod01 state is now lost | nodeid=1 previous=member source=crm_reap_unseen_nodes
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (peer_update_callback) info: Cluster node dimprod01 is now lost (was member)
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (match_down_event) debug: No reason to expect node 1 to be down
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld [2827048] (peer_update_callback) warning: Stonith/shutdown of node dimprod01 was not expected

Then the node decides to allocate the resources to itself and start them, ignoring the location constraint:

Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(unpack_config) debug: STONITH timeout: 60000
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (unpack_config) debug: STONITH of failed nodes is disabled
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (unpack_config) debug: Concurrent fencing is enabled
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (unpack_config) debug: Stop all active resources: false
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (unpack_config) debug: Cluster is symmetric - resources can run anywhere by default
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (unpack_config) debug: On loss of quorum: Stop ALL resources
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (unpack_config) debug: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (determine_online_status) info: Node dimprod02 is online
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (unpack_find_resource) debug: Internally renamed WyrDimProdServerData on dimprod02 to WyrDimProdServerData:0
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (unpack_find_resource) debug: Internally renamed ping on dimprod02 to ping:0
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (log_list_item) info: DimProdClusterIP (ocf::heartbeat:IPaddr2): Stopped
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (log_list_item) info: WyrDimProdServer (systemd:wyr-dim): Stopped
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (log_list_item) info: Clone Set: WyrDimProdServerData-clone [WyrDimProdServerData] (promotable): Slaves: [ dimprod02 ]
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (log_list_item) info: Clone Set: WyrDimProdServerData-clone [WyrDimProdServerData] (promotable): Stopped: [ dimprod01 ]
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (log_list_item) info: WyrDimProdFS (ocf::heartbeat:Filesystem): Stopped
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (log_list_item) info: DimTestClusterIP (ocf::heartbeat:IPaddr2): Stopped
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (log_list_item) info: Clone Set: ping-clone [ping]: Started: [ dimprod02 ]
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (log_list_item) info: Clone Set: ping-clone [ping]: Stopped: [ dimprod01 ]
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (common_apply_stickiness) debug: Resource WyrDimProdServerData:0: preferring current location (node=dimprod02, weight=100)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (common_apply_stickiness) debug: Resource ping:0: preferring current location (node=dimprod02, weight=100)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: Assigning dimprod02 to DimProdClusterIP
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: Assigning dimprod02 to WyrDimProdServer
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (distribute_children) debug: Allocating up to 2 WyrDimProdServerData-clone instances to a possible 1 nodes (at most 1 per host, 2 optimal)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: Assigning dimprod02 to WyrDimProdServerData:0
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: All nodes for resource WyrDimProdServerData:1 are unavailable, unclean or shutting down (dimprod01: 0, -1000000)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: Could not allocate a node for WyrDimProdServerData:1
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__native_allocate) info: Resource WyrDimProdServerData:1 cannot run anywhere
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (distribute_children) debug: Allocated 1 WyrDimProdServerData-clone instances of a possible 2
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__set_instance_roles) debug: WyrDimProdServerData:0 promotion score: 1000
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__set_instance_roles) info: Promoting WyrDimProdServerData:0 (Slave dimprod02)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__set_instance_roles) debug: WyrDimProdServerData:1 promotion score: 1000
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__set_instance_roles) info: WyrDimProdServerData-clone: Promoted 1 instances of a possible 1 to master
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: Assigning dimprod02 to WyrDimProdFS
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: Assigning dimprod02 to DimTestClusterIP
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (distribute_children) debug: Allocating up to 2 ping-clone instances to a possible 1 nodes (at most 1 per host, 2 optimal)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: Assigning dimprod02 to ping:0
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: All nodes for resource ping:1 are unavailable, unclean or shutting down (dimprod01: 0, -1000000)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: Could not allocate a node for ping:1
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__native_allocate) info: Resource ping:1 cannot run anywhere
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (distribute_children) debug: Allocated 1 ping-clone instances of a possible 2
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (RecurringOp) info: Start recurring monitor (30s) for DimProdClusterIP on dimprod02
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (RecurringOp) info: Start recurring monitor (60s) for WyrDimProdServer on dimprod02
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (RecurringOp) info: Cancelling action WyrDimProdServerData:0_monitor_60000 (Slave vs. Master)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (create_promotable_actions) debug: Creating actions for WyrDimProdServerData-clone
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (RecurringOp) info: Cancelling action WyrDimProdServerData:0_monitor_60000 (Slave vs. Master)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (RecurringOp) info: Start recurring monitor (20s) for WyrDimProdFS on dimprod02
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (RecurringOp) info: Start recurring monitor (30s) for DimTestClusterIP on dimprod02
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (LogAction) notice: * Start DimProdClusterIP ( dimprod02 )
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (LogAction) notice: * Start WyrDimProdServer ( dimprod02 )
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (LogAction) notice: * Promote WyrDimProdServerData:0 ( Slave -> Master dimprod02 )
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (LogActions) info: Leave WyrDimProdServerData:1 (Stopped)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (LogAction) notice: * Start WyrDimProdFS ( dimprod02 )
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (LogAction) notice: * Start DimTestClusterIP ( dimprod02 )
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (LogActions) info: Leave ping:0 (Started dimprod02)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (LogActions) info: Leave ping:1 (Stopped)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (action2xml) debug: Using anonymous clone name WyrDimProdServerData for WyrDimProdServerData:0 (aka. WyrDimProdServerData)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (action2xml) debug: Using anonymous clone name WyrDimProdServerData for WyrDimProdServerData:0 (aka. WyrDimProdServerData)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (action2xml) debug: Using anonymous clone name WyrDimProdServerData for WyrDimProdServerData:0 (aka.
WyrDimProdServerData)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (action2xml) debug: Using anonymous clone name WyrDimProdServerData for WyrDimProdServerData:0 (aka. WyrDimProdServerData)

Once the resources are started, the node decides that they cannot be allocated after all and stops them:

Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (determine_online_status) info: Node dimprod02 is online
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (unpack_find_resource) debug: Internally renamed WyrDimProdServerData on dimprod02 to WyrDimProdServerData:0
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (unpack_find_resource) debug: Internally renamed ping on dimprod02 to ping:0
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (log_list_item) info: DimProdClusterIP (ocf::heartbeat:IPaddr2): Started dimprod02
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (log_list_item) info: WyrDimProdServer (systemd:wyr-dim): Started dimprod02
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (log_list_item) info: Clone Set: WyrDimProdServerData-clone [WyrDimProdServerData] (promotable): Masters: [ dimprod02 ]
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (log_list_item) info: Clone Set: WyrDimProdServerData-clone [WyrDimProdServerData] (promotable): Stopped: [ dimprod01 ]
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (log_list_item) info: WyrDimProdFS (ocf::heartbeat:Filesystem): Started dimprod02
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (log_list_item) info: DimTestClusterIP (ocf::heartbeat:IPaddr2): Stopped
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (log_list_item) info: Clone Set: ping-clone [ping]: Started: [ dimprod02 ]
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (log_list_item) info: Clone Set: ping-clone [ping]: Stopped: [ dimprod01 ]
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (common_apply_stickiness) debug: Resource DimProdClusterIP: preferring current location (node=dimprod02, weight=100)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (common_apply_stickiness) debug: Resource WyrDimProdServer: preferring current location (node=dimprod02, weight=100)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (common_apply_stickiness) debug: Resource WyrDimProdServerData:0: preferring current location (node=dimprod02, weight=100)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (common_apply_stickiness) debug: Resource WyrDimProdFS: preferring current location (node=dimprod02, weight=100)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (common_apply_stickiness) debug: Resource ping:0: preferring current location (node=dimprod02, weight=100)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__native_merge_weights) info: DimProdClusterIP: Rolling back optional scores from WyrDimProdServerData-clone
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__native_merge_weights) info: DimProdClusterIP: Rolling back optional scores from WyrDimProdFS
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__native_merge_weights) info: DimProdClusterIP: Rolling back optional scores from WyrDimProdServer
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__native_merge_weights) info: DimProdClusterIP: Rolling back optional scores from DimTestClusterIP
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: All nodes for resource DimProdClusterIP are unavailable, unclean or shutting down (dimprod02: 1, -1000000)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: Could not allocate a node for DimProdClusterIP
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: Processing DimProdClusterIP_monitor_30000
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__native_allocate) info: Resource DimProdClusterIP cannot run anywhere
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: All nodes for resource WyrDimProdServer are unavailable, unclean or shutting down (dimprod02: 1, -1000000)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: Could not allocate a node for WyrDimProdServer
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__native_allocate) info: Resource WyrDimProdServer cannot run anywhere
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (distribute_children) debug: Allocating up to 2 WyrDimProdServerData-clone instances to a possible 1 nodes (at most 1 per host, 2 optimal)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: Assigning dimprod02 to WyrDimProdServerData:0
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: All nodes for resource WyrDimProdServerData:1 are unavailable, unclean or shutting down (dimprod01: 0, -1000000)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: Could not allocate a node for WyrDimProdServerData:1
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__native_allocate) info: Resource WyrDimProdServerData:1 cannot run anywhere
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (distribute_children) debug: Allocated 1 WyrDimProdServerData-clone instances of a possible 2
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (filter_colocation_constraint) error: WyrDimProdServerData:0 must be colocated with DimProdClusterIP but is not (dimprod02 vs. unallocated)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__set_instance_roles) debug: WyrDimProdServerData:0 promotion score: 10000
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__set_instance_roles) info: Promoting WyrDimProdServerData:0 (Master dimprod02)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__set_instance_roles) debug: WyrDimProdServerData:1 promotion score: 10000
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__set_instance_roles) info: WyrDimProdServerData-clone: Promoted 1 instances of a possible 1 to master
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: All nodes for resource WyrDimProdFS are unavailable, unclean or shutting down (dimprod01: 0, -1000000)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: Could not allocate a node for WyrDimProdFS
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: Processing WyrDimProdFS_monitor_20000
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__native_allocate) info: Resource WyrDimProdFS cannot run anywhere
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: All nodes for resource DimTestClusterIP are unavailable, unclean or shutting down (dimprod01: 0, -1000000)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: Could not allocate a node for DimTestClusterIP
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__native_allocate) info: Resource DimTestClusterIP cannot run anywhere
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (distribute_children) debug: Allocating up to 2 ping-clone instances to a possible 1 nodes (at most 1 per host, 2 optimal)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: Assigning dimprod02 to ping:0
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: All nodes for resource ping:1 are unavailable, unclean or shutting down (dimprod01: 0, -1000000)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (native_assign_node) debug: Could not allocate a node for ping:1
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (pcmk__native_allocate) info: Resource ping:1 cannot run anywhere
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (distribute_children) debug: Allocated 1 ping-clone instances of a possible 2
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (create_promotable_actions) debug: Creating actions for WyrDimProdServerData-clone
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (LogAction) notice: * Stop DimProdClusterIP ( dimprod02 ) due to node availability
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (LogAction) notice: * Stop WyrDimProdServer ( dimprod02 ) due to node availability
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (LogActions) info: Leave WyrDimProdServerData:0 (Master dimprod02)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (LogActions) info: Leave WyrDimProdServerData:1 (Stopped)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (LogAction) notice: * Stop WyrDimProdFS ( dimprod02 ) due to node availability
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (LogActions) info: Leave DimTestClusterIP (Stopped)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (LogActions) info: Leave ping:0 (Started dimprod02)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047] (LogActions) info: Leave ping:1 (Stopped)

So the final result is OK; I would just like to avoid the resources being started on the disconnected node at all. I have no idea how to debug this further, and I would appreciate any help. If it is helpful, I can provide the full debug log and more configuration details.

Regards,
Janusz
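In case it helps, this is how I have been checking the 'pingd' attribute by hand while reproducing the problem (assuming I am using these tools correctly; both commands should show the same transient, status-section value):

```shell
# Query the transient 'pingd' node attribute that the ping clone maintains;
# --lifetime reboot selects the status (transient) section of the CIB.
crm_attribute --query --name pingd --node dimprod02 --lifetime reboot

# Equivalent query via attrd_updater, which talks to pacemaker-attrd directly.
attrd_updater --query --name pingd --node dimprod02
```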
_______________________________________________ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/