Hi all,
I have a two node cluster with the following details:
- Ubuntu 10.04.4 LTS (I know its old…)
- corosync 1.2.0
- pacemaker 1.0.8+hg15494-2ubuntu2
Following configuration is applied to corosync:
totem {
version: 2
token: 3000
token_retransmits_before_loss_const: 10
join: 60
consensus: 5000
vsftype: none
max_messages: 20
clear_node_high_bit: yes
secauth: off
threads: 0
rrp_mode: none
cluster_name: firewall-ha
interface {
ringnumber: 0
bindnetaddr: 192.168.211.1
broadcast : yes
mcastport: 5405
ttl : 1
}
transport: udpu
}
nodelist {
node {
ring0_addr: 192.168.211.1
name: net1
nodeid: 1
}
node {
ring0_addr: 192.168.211.2
name: net2
nodeid: 2
}
}
quorum {
provider: corosync_votequorum
two_node: 1
}
amf {
mode: disabled
}
service {
ver: 0
name: pacemaker
}
aisexec {
user: root
group: root
}
logging {
fileline: off
to_stderr: yes
to_logfile: yes
to_syslog: yes
logfile: /var/log/corosync/corosync.log
syslog_facility: daemon
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
tags: enter|leave|trace1|trace2|trace3|trace4|trace6
}
}
Here is an output of crm status after starting coro sync on both nodes:
============
Last updated: Mon Jan 16 21:24:18 2017
Stack: openais
Current DC: net1 - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 2 expected votes
0 Resources configured.
============
Online: [ net1 net2 ]
Now if I kill net2 with:
killall -9 corosync
The primary host don’t « see » anything, the cluster still appear to be online
on net1:
============
Last updated: Mon Jan 16 21:25:25 2017
Stack: openais
Current DC: net1 - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 2 expected votes
0 Resources configured.
============
Online: [ net1 net2 ]
I just see this part in the logs:
Jan 16 21:35:21 corosync [TOTEM ] A processor failed, forming new configuration.
And then, when I start corosync on net2, cluster stays offline:
============
Last updated: Mon Jan 16 21:38:13 2017
Stack: openais
Current DC: NONE
2 Nodes configured, 2 expected votes
0 Resources configured.
============
OFFLINE: [ net1 net2 ]
I have to kill corosync on both nodes, and start on both node together to get
back online.
When the two nodes are up, I can see trafic with tcpdump:
21:41:49.653780 IP 192.168.211.1.5404 > 255.255.255.255.5405: UDP, length 82
21:41:49.678846 IP 192.168.211.1.5404 > 192.168.211.2.5405: UDP, length 70
21:41:49.680339 IP 192.168.211.2.5404 > 192.168.211.1.5405: UDP, length 70
21:41:49.889424 IP 192.168.211.1.5404 > 255.255.255.255.5405: UDP, length 82
21:41:49.910492 IP 192.168.211.1.5404 > 192.168.211.2.5405: UDP, length 70
21:41:49.911990 IP 192.168.211.2.5404 > 192.168.211.1.5405: UDP, length 70
Here is the output of the state of the ring on net1:
corosync-cfgtool -s
Printing ring status.
Local node ID 30648512
RING ID 0
id = 192.168.211.1
status = ring 0 active with no faults
And net2:
Printing ring status.
Local node ID 47425728
RING ID 0
id = 192.168.211.2
status = ring 0 active with no faults
Here is the log on net1 when I start the cluster on both nodes:
Jan 16 21:41:52 net1 crmd: [15288]: info: crm_timer_popped: Election Trigger
(I_DC_TIMEOUT) just popped!
Jan 16 21:41:52 net1 crmd: [15288]: WARN: do_log: FSA: Input I_DC_TIMEOUT from
crm_timer_popped() received in state S_PENDING
Jan 16 21:41:52 net1 crmd: [15288]: info: do_state_transition: State transition
S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED
origin=crm_timer_popped ]
Jan 16 21:41:52 net1 crmd: [15288]: info: do_state_transition: State transition
S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL
origin=do_election_check ]
Jan 16 21:41:52 net1 crmd: [15288]: info: do_te_control: Registering TE UUID:
53d7e000-3468-4548-b9f9-5bdb9ac9bfc7
Jan 16 21:41:52 net1 crmd: [15288]: WARN: cib_client_add_notify_callback:
Callback already present
Jan 16 21:41:52 net1 crmd: [15288]: info: set_graph_functions: Setting custom
graph functions
Jan 16 21:41:52 net1 crmd: [15288]: info: unpack_graph: Unpacked transition -1:
0 actions in 0 synapses
Jan 16 21:41:52 net1 crmd: [15288]: info: do_dc_takeover: Taking over DC status
for this partition
Jan 16 21:41:52 net1 cib: [15284]: info: cib_process_readwrite: We are now in
R/W mode
Jan 16 21:41:52 net1 cib: [15284]: info: cib_process_request: Operation
complete: op cib_master for section 'all' (origin=local/crmd/6,
version=0.51.0): ok (rc=0)
Jan 16 21:41:52 net1 cib: [15284]: info: cib_process_request: Operation
complete: op cib_modify for section cib (origin=local/crmd/7, version=0.51.0):
ok (rc=0)
Jan 16 21:41:53 net1 cib: [15284]: info: cib_process_request: Operation
complete: op cib_modify for section crm_config (origin=local/crmd/9,
version=0.51.0): ok (rc=0)
Jan 16 21:41:53 net1 crmd: [15288]: info: join_make_offer: Making join offers
based on membership 36
Jan 16 21:41:53 net1 crmd: [15288]: info: do_dc_join_offer_all: join-1: Waiting
on 2 outstanding join acks
Jan 16 21:41:53 net1 crmd: [15288]: info: ais_dispatch: Membership 36: quorum
retained
Jan 16 21:41:53 net1 cib: [15284]: info: cib_process_request: Operation
complete: op cib_modify for section crm_config (origin=local/crmd/11,
version=0.51.0): ok (rc=0)
Jan 16 21:41:53 net1 crmd: [15288]: info: crm_ais_dispatch: Setting expected
votes to 2
Jan 16 21:41:53 net1 crmd: [15288]: info: config_query_callback: Checking for
expired actions every 900000ms
Jan 16 21:41:53 net1 crmd: [15288]: info: config_query_callback: Sending
expected-votes=2 to corosync
Jan 16 21:41:53 net1 crmd: [15288]: info: update_dc: Set DC to net1 (3.0.1)
Jan 16 21:41:53 net1 crmd: [15288]: info: ais_dispatch: Membership 36: quorum
retained
Jan 16 21:41:53 net1 cib: [15284]: info: cib_process_request: Operation
complete: op cib_modify for section crm_config (origin=local/crmd/14,
version=0.51.0): ok (rc=0)
Jan 16 21:41:53 net1 crmd: [15288]: info: crm_ais_dispatch: Setting expected
votes to 2
Jan 16 21:41:53 net1 crmd: [15288]: info: te_connect_stonith: Attempting
connection to fencing daemon...
Jan 16 21:41:53 net1 cib: [15284]: info: cib_process_request: Operation
complete: op cib_modify for section crm_config (origin=local/crmd/17,
version=0.51.0): ok (rc=0)
Jan 16 21:41:54 net1 crmd: [15288]: info: te_connect_stonith: Connected
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition
S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL
origin=check_join_state ]
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: All 2 cluster
nodes responded to the join offer.
Jan 16 21:41:54 net1 crmd: [15288]: info: do_dc_join_finalize: join-1: Syncing
the CIB from net1 to the rest of the cluster
Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation
complete: op cib_sync for section 'all' (origin=local/crmd/18, version=0.51.0):
ok (rc=0)
Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation
complete: op cib_modify for section nodes (origin=local/crmd/19,
version=0.51.0): ok (rc=0)
Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation
complete: op cib_modify for section nodes (origin=local/crmd/20,
version=0.51.0): ok (rc=0)
Jan 16 21:41:54 net1 crmd: [15288]: info: update_attrd: Connecting to attrd...
Jan 16 21:41:54 net1 attrd: [15286]: info: find_hash_entry: Creating hash entry
for terminate
Jan 16 21:41:54 net1 attrd: [15286]: info: find_hash_entry: Creating hash entry
for shutdown
Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation
complete: op cib_delete for section
//node_state[@uname='net1']/transient_attributes (origin=local/crmd/21,
version=0.51.0): ok (rc=0)
Jan 16 21:41:54 net1 crmd: [15288]: info: erase_xpath_callback: Deletion of
"//node_state[@uname='net1']/transient_attributes": ok (rc=0)
Jan 16 21:41:54 net1 attrd: [15286]: info: crm_new_peer: Node net2 now has id:
47425728
Jan 16 21:41:54 net1 crmd: [15288]: info: do_dc_join_ack: join-1: Updating node
state to member for net2
Jan 16 21:41:54 net1 attrd: [15286]: info: crm_new_peer: Node 47425728 is now
known as net2
Jan 16 21:41:54 net1 crmd: [15288]: info: do_dc_join_ack: join-1: Updating node
state to member for net1
Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation
complete: op cib_delete for section
//node_state[@uname='net2']/transient_attributes (origin=net2/crmd/7,
version=0.51.0): ok (rc=0)
Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation
complete: op cib_delete for section //node_state[@uname='net2']/lrm
(origin=local/crmd/22, version=0.51.0): ok (rc=0)
Jan 16 21:41:54 net1 crmd: [15288]: info: erase_xpath_callback: Deletion of
"//node_state[@uname='net2']/lrm": ok (rc=0)
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition
S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL
origin=check_join_state ]
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: All 2 cluster
nodes are eligible to run resources.
Jan 16 21:41:54 net1 crmd: [15288]: info: do_dc_join_final: Ensuring DC, quorum
and node attributes are up-to-date
Jan 16 21:41:54 net1 crmd: [15288]: info: crm_update_quorum: Updating quorum
status to true (call=28)
Jan 16 21:41:54 net1 crmd: [15288]: info: abort_transition_graph:
do_te_invoke:191 - Triggered transition abort (complete=1) : Peer Cancelled
Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke: Query 29: Requesting
the current CIB: S_POLICY_ENGINE
Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation
complete: op cib_delete for section //node_state[@uname='net1']/lrm
(origin=local/crmd/24, version=0.51.1): ok (rc=0)
Jan 16 21:41:54 net1 crmd: [15288]: info: erase_xpath_callback: Deletion of
"//node_state[@uname='net1']/lrm": ok (rc=0)
Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation
complete: op cib_modify for section nodes (origin=local/crmd/26,
version=0.51.2): ok (rc=0)
Jan 16 21:41:54 net1 cib: [15284]: info: log_data_element: cib:diff: - <cib
admin_epoch="0" epoch="51" num_updates="2" />
Jan 16 21:41:54 net1 cib: [15284]: info: log_data_element: cib:diff: + <cib
dc-uuid="net1" admin_epoch="0" epoch="52" num_updates="1" />
Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation
complete: op cib_modify for section cib (origin=local/crmd/28, version=0.52.1):
ok (rc=0)
Jan 16 21:41:54 net1 crmd: [15288]: info: abort_transition_graph: need_abort:59
- Triggered transition abort (complete=1) : Non-status change
Jan 16 21:41:54 net1 crmd: [15288]: info: need_abort: Aborting on change to
admin_epoch
Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke: Query 30: Requesting
the current CIB: S_POLICY_ENGINE
Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke_callback: Invoking the
PE: query=30, ref=pe_calc-dc-1484599314-11, seq=36, quorate=1
Jan 16 21:41:54 net1 attrd: [15286]: info: attrd_local_callback: Sending full
refresh (origin=crmd)
Jan 16 21:41:54 net1 attrd: [15286]: info: attrd_trigger_update: Sending flush
op to all hosts for: shutdown (<null>)
Jan 16 21:41:54 net1 pengine: [15287]: notice: unpack_config: On loss of CCM
Quorum: Ignore
Jan 16 21:41:54 net1 attrd: [15286]: info: attrd_trigger_update: Sending flush
op to all hosts for: terminate (<null>)
Jan 16 21:41:54 net1 pengine: [15287]: info: unpack_config: Node scores: 'red'
= -INFINITY, 'yellow' = 0, 'green' = 0
Jan 16 21:41:54 net1 pengine: [15287]: info: determine_online_status: Node net2
is online
Jan 16 21:41:54 net1 pengine: [15287]: info: determine_online_status: Node net1
is online
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition
S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
origin=handle_response ]
Jan 16 21:41:54 net1 crmd: [15288]: info: unpack_graph: Unpacked transition 0:
2 actions in 2 synapses
Jan 16 21:41:54 net1 crmd: [15288]: info: do_te_invoke: Processing graph 0
(ref=pe_calc-dc-1484599314-11) derived from /var/lib/pengine/pe-input-713.bz2
Jan 16 21:41:54 net1 crmd: [15288]: info: te_rsc_command: Initiating action 2:
probe_complete probe_complete on net1 (local) - no waiting
Jan 16 21:41:54 net1 crmd: [15288]: info: te_rsc_command: Initiating action 3:
probe_complete probe_complete on net2 - no waiting
Jan 16 21:41:54 net1 crmd: [15288]: info: run_graph:
====================================================
Jan 16 21:41:54 net1 crmd: [15288]: notice: run_graph: Transition 0
(Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pengine/pe-input-713.bz2): Complete
Jan 16 21:41:54 net1 crmd: [15288]: info: te_graph_trigger: Transition 0 is now
complete
Jan 16 21:41:54 net1 crmd: [15288]: info: notify_crmd: Transition 0 status:
done - <null>
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL
origin=notify_crmd ]
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: Starting PEngine
Recheck Timer
Jan 16 21:41:54 net1 attrd: [15286]: info: find_hash_entry: Creating hash entry
for probe_complete
Jan 16 21:41:54 net1 attrd: [15286]: info: attrd_trigger_update: Sending flush
op to all hosts for: probe_complete (true)
Jan 16 21:41:54 net1 attrd: [15286]: info: attrd_perform_update: Sent update
10: probe_complete=true
Jan 16 21:41:54 net1 crmd: [15288]: info: abort_transition_graph:
te_update_diff:146 - Triggered transition abort (complete=1,
tag=transient_attributes, id=net1, magic=NA, cib=0.52.2) : Transient attribute:
update
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition
S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: All 2 cluster
nodes are eligible to run resources.
Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke: Query 31: Requesting
the current CIB: S_POLICY_ENGINE
Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke_callback: Invoking the
PE: query=31, ref=pe_calc-dc-1484599314-14, seq=36, quorate=1
Jan 16 21:41:54 net1 crmd: [15288]: info: abort_transition_graph:
te_update_diff:146 - Triggered transition abort (complete=1,
tag=transient_attributes, id=net2, magic=NA, cib=0.52.3) : Transient attribute:
update
Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke: Query 32: Requesting
the current CIB: S_POLICY_ENGINE
Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke_callback: Invoking the
PE: query=32, ref=pe_calc-dc-1484599314-15, seq=36, quorate=1
Jan 16 21:41:54 net1 pengine: [15287]: info: process_pe_message: Transition 0:
PEngine Input stored in: /var/lib/pengine/pe-input-713.bz2
Jan 16 21:41:54 net1 pengine: [15287]: notice: unpack_config: On loss of CCM
Quorum: Ignore
Jan 16 21:41:54 net1 pengine: [15287]: info: unpack_config: Node scores: 'red'
= -INFINITY, 'yellow' = 0, 'green' = 0
Jan 16 21:41:54 net1 pengine: [15287]: info: determine_online_status: Node net2
is online
Jan 16 21:41:54 net1 pengine: [15287]: info: determine_online_status: Node net1
is online
Jan 16 21:41:54 net1 crmd: [15288]: info: handle_response: pe_calc calculation
pe_calc-dc-1484599314-14 is obsolete
Jan 16 21:41:54 net1 cib: [15311]: info: write_cib_contents: Archived previous
version as /var/lib/heartbeat/crm/cib-52.raw
Jan 16 21:41:54 net1 pengine: [15287]: info: process_pe_message: Transition 1:
PEngine Input stored in: /var/lib/pengine/pe-input-714.bz2
Jan 16 21:41:54 net1 pengine: [15287]: notice: unpack_config: On loss of CCM
Quorum: Ignore
Jan 16 21:41:54 net1 pengine: [15287]: info: unpack_config: Node scores: 'red'
= -INFINITY, 'yellow' = 0, 'green' = 0
Jan 16 21:41:54 net1 pengine: [15287]: info: determine_online_status: Node net2
is online
Jan 16 21:41:54 net1 pengine: [15287]: info: determine_online_status: Node net1
is online
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition
S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
origin=handle_response ]
Jan 16 21:41:54 net1 crmd: [15288]: info: unpack_graph: Unpacked transition 2:
0 actions in 0 synapses
Jan 16 21:41:54 net1 crmd: [15288]: info: do_te_invoke: Processing graph 2
(ref=pe_calc-dc-1484599314-15) derived from /var/lib/pengine/pe-input-715.bz2
Jan 16 21:41:54 net1 crmd: [15288]: info: run_graph:
====================================================
Jan 16 21:41:54 net1 crmd: [15288]: notice: run_graph: Transition 2
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pengine/pe-input-715.bz2): Complete
Jan 16 21:41:54 net1 crmd: [15288]: info: te_graph_trigger: Transition 2 is now
complete
Jan 16 21:41:54 net1 crmd: [15288]: info: notify_crmd: Transition 2 status:
done - <null>
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL
origin=notify_crmd ]
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: Starting PEngine
Recheck Timer
Jan 16 21:41:54 net1 cib: [15311]: info: write_cib_contents: Wrote version
0.52.0 of the CIB to disk (digest: 44f7626e0420b36260b8c67e9e576a7e)
Jan 16 21:41:54 net1 pengine: [15287]: info: process_pe_message: Transition 2:
PEngine Input stored in: /var/lib/pengine/pe-input-715.bz2
Jan 16 21:41:54 net1 cib: [15311]: info: retrieveCib: Reading cluster
configuration from: /var/lib/heartbeat/crm/cib.ZpwuYr (digest:
/var/lib/heartbeat/crm/cib.zVG2jb)
Jan 16 21:50:50 net1 cib: [15284]: info: cib_stats: Processed 48 operations
(2708.00us average, 0% utilization) in the last 10min
Any help is appreciated !
Thanks
_______________________________________________
Users mailing list: [email protected]
http://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org