Hello Jan. Thank you very much for your help. It was definitely related to the way I was blocking the packets.
Since I'm running these tests with VMs, I just paused node1, and node2 got the IP after a few seconds.

About the issues in the documentation: I'll recreate the scenario from scratch to replicate the steps (since I want to automate them with Ansible), and I'll use this opportunity to check exactly what is missing and report it.

Have a great weekend.

Regards,

Marcelo H. Terres <mhter...@gmail.com>
https://www.mundoopensource.com.br
https://twitter.com/mhterres
https://linkedin.com/in/marceloterres


On Fri, 19 Mar 2021 at 08:25, Jan Friesse <jfrie...@redhat.com> wrote:
> Marcelo,
>
> > Hello.
> >
> > I have configured corosync with 2 nodes and added a qdevice to help with
> > the quorum.
> >
> > On node1 I added firewall rules to block connections from node2 and the
> > qdevice, trying to simulate a network issue.
>
> Just please make sure to block both incoming and also outgoing packets.
> Qdevice will handle blocking of just one direction well (because of TCP),
> and so will corosync 3.x with knet. But corosync 2.x has a big problem
> with "asymmetric" blocking. The config also suggests that multicast is
> used; in that case, please make sure to block multicast as well.
>
> > The problem I'm having is that on node1 I can see it dropping the
> > service (the IP), but on node2 it never gets the IP; it is like the
> > qdevice is not voting.
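A symmetric block on node1 along the lines Jan describes might look like this (a sketch only; the exact rules are assumptions, though the multicast address and port numbers come from the corosync.conf posted in this thread):

```shell
# Block cluster traffic in BOTH directions on node1 (run as root).
# X.X.X.3 = node2 and qdevice.domain.com = qnetd host, as named in this thread;
# 239.255.43.2/5405 = totem multicast/port from corosync.conf, 5403 = qnetd port.
iptables -A INPUT  -s X.X.X.3 -j DROP
iptables -A OUTPUT -d X.X.X.3 -j DROP
iptables -A INPUT  -s qdevice.domain.com -j DROP
iptables -A OUTPUT -d qdevice.domain.com -j DROP
iptables -A INPUT  -d 239.255.43.2 -j DROP   # incoming multicast
iptables -A OUTPUT -d 239.255.43.2 -j DROP   # outgoing multicast
```

Blocking only INPUT (or only OUTPUT) leaves TCP and corosync 2.x in exactly the "asymmetric" state Jan warns about.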
> >
> > This is my corosync.conf:
> >
> > totem {
> >     version: 2
> >     cluster_name: cluster1
> >     token: 3000
> >     token_retransmits_before_loss_const: 10
> >     clear_node_high_bit: yes
> >     crypto_cipher: none
> >     crypto_hash: none
> > }
> >
> > interface {
> >     ringnumber: 0
> >     bindnetaddr: X.X.X.X
> >     mcastaddr: 239.255.43.2
> >     mcastport: 5405
> >     ttl: 1
> > }
> >
> > nodelist {
> >     node {
> >         ring0_addr: X.X.X.2
> >         name: node1.domain.com
> >         nodeid: 2
> >     }
> >     node {
> >         ring0_addr: X.X.X.3
> >         name: node2.domain.com
> >         nodeid: 3
> >     }
> > }
> >
> > logging {
> >     to_logfile: yes
> >     logfile: /var/log/cluster/corosync.log
> >     to_syslog: yes
> > }
> >
> > #}
> >
> > quorum {
> >     provider: corosync_votequorum
> >     device {
> >         votes: 1
> >         model: net
> >         net {
> >             tls: off
> >             host: qdevice.domain.com
> >             algorithm: lms
> >         }
> >         heuristics {
> >             mode: on
> >             exec_ping: /usr/bin/ping -q -c 1 "qdevice.domain.com"
> >         }
> >     }
> > }
> >
> > I'm getting this on the qdevice host (before adding the firewall rules),
> > so it looks like the cluster is properly configured:
> >
> > pcs qdevice status net --full
>
> Correct. What is the status after blocking is enabled?
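For reference, the votequorum arithmetic implied by the config above can be sketched as follows (an illustration, assuming the usual majority rule and one vote per node):

```shell
# Two nodes at 1 vote each, plus "votes: 1" from the qdevice section
NODE_VOTES=2
QDEVICE_VOTES=1
EXPECTED=$((NODE_VOTES + QDEVICE_VOTES))  # expected_votes = 3
QUORUM=$((EXPECTED / 2 + 1))              # majority = 2
echo "expected=$EXPECTED quorum=$QUORUM"
# So a lone node (1 vote) plus the qdevice ACK (1 vote) should stay quorate,
# which is why node2 is expected to take over the IP.
```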
> > QNetd address:              *:5403
> > TLS:                        Supported (client certificate required)
> > Connected clients:          2
> > Connected clusters:         1
> > Maximum send/receive size:  32768/32768 bytes
> > Cluster "cluster1":
> >     Algorithm:          LMS
> >     Tie-breaker:        Node with lowest node ID
> >     Node ID 3:
> >         Client address:         ::ffff:X.X.X.3:59746
> >         HB interval:            8000ms
> >         Configured node list:   2, 3
> >         Ring ID:                2.95d
> >         Membership node list:   2, 3
> >         Heuristics:             Pass (membership: Pass, regular: Undefined)
> >         TLS active:             No
> >         Vote:                   ACK (ACK)
> >     Node ID 2:
> >         Client address:         ::ffff:X.X.X.2:33944
> >         HB interval:            8000ms
> >         Configured node list:   2, 3
> >         Ring ID:                2.95d
> >         Membership node list:   2, 3
> >         Heuristics:             Pass (membership: Pass, regular: Undefined)
> >         TLS active:             No
> >         Vote:                   ACK (ACK)
> >
> > These are partial logs on node2 after activating the firewall rules on
> > node1. These logs repeat all the time until I remove the firewall rules:
> >
> > Mar 18 12:48:56 [7202] node2.domain.com stonith-ng: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=16): Try again (6)
> > Mar 18 12:48:56 [7201] node2.domain.com cib: info: crm_cs_flush: Sent 0 CPG messages (2 remaining, last=87): Try again (6)
> > Mar 18 12:48:56 [7202] node2.domain.com stonith-ng: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=16): Try again (6)
> > Mar 18 12:48:56 [7185] node2.domain.com pacemakerd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=13): Try again (6)
> > [7177] node2.domain.com corosync info    [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
> > [7177] node2.domain.com corosync notice  [TOTEM ] A new membership (X.X.X.3:2469) was formed. Members
>
> ^^ This is weird.
> I'm pretty sure something is broken in the way the packets are blocked
> (or the log is incomplete).
>
> > [7177] node2.domain.com corosync warning [CPG   ] downlist left_list: 0 received
> > [7177] node2.domain.com corosync warning [TOTEM ] Discarding JOIN message during flush, nodeid=3
> > Mar 18 12:48:56 [7201] node2.domain.com cib: info: crm_cs_flush: Sent 0 CPG messages (2 remaining, last=87): Try again (6)
> > Mar 18 12:48:56 [7202] node2.domain.com stonith-ng: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=16): Try again (6)
> > Mar 18 12:48:56 [7185] node2.domain.com pacemakerd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=13): Try again (6)
> > Mar 18 12:48:56 [7201] node2.domain.com cib: info: crm_cs_flush: Sent 0 CPG messages (2 remaining, last=87): Try again (6)
> > Mar 18 12:48:56 [7185] node2.domain.com pacemakerd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=13): Try again (6)
>
> If it repeats over and over again, then it's 99.9% because of the way the
> packets are blocked.
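One way to check whether the block really is symmetric (a sketch; eth0 and the DROP rules are assumptions, the port numbers come from this thread's config):

```shell
# On node1: packet counters on BOTH the INPUT and OUTPUT DROP rules should
# keep growing while the test runs; a counter stuck at zero on one side
# means the block is asymmetric.
iptables -L INPUT  -v -n | grep DROP
iptables -L OUTPUT -v -n | grep DROP

# Any corosync/qnetd traffic still visible here is leaking past the block
# (5405 = totem mcastport, 5403 = qnetd port):
tcpdump -ni eth0 'port 5405 or port 5403'
```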
> > > > > Also on node2: > > > > pcs quorum status > > Error: Unable to get quorum status: Unable to get node address for nodeid > > 2: CS_ERR_NOT_EXIST > > > > And these are the logs on the qdevice host: > > > > Mar 18 12:48:50 debug algo-lms: membership list from node 3 partition > > (3.99d) > > Mar 18 12:48:50 debug algo-util: all_ring_ids_match: seen nodeid 2 > > (client 0x55a99ce070d0) ring_id (2.995) > > Mar 18 12:48:50 debug algo-util: nodeid 2 in our partition has > different > > ring_id (2.995) to us (3.99d) > > Mar 18 12:48:50 debug algo-lms: nodeid 3: ring ID (3.99d) not unique in > > this membership, waiting > > Mar 18 12:48:50 debug Algorithm result vote is Wait for reply > > Mar 18 12:48:52 debug algo-lms: Client 0x55a99cdfe590 (cluster > cluster1, > > node_id 3) Timer callback > > Mar 18 12:48:52 debug algo-util: all_ring_ids_match: seen nodeid 2 > > (client 0x55a99ce070d0) ring_id (2.995) > > Mar 18 12:48:52 debug algo-util: nodeid 2 in our partition has > different > > ring_id (2.995) to us (3.99d) > > Mar 18 12:48:52 debug algo-lms: nodeid 3: ring ID (3.99d) not unique in > > this membership, waiting > > Mar 18 12:48:52 debug Algorithm for client ::ffff:X.X.X.3:59762 decided > > to reschedule timer and not send vote with value Wait for reply > > Mar 18 12:48:53 debug Client closed connection > > Mar 18 12:48:53 debug Client ::ffff:X.X.X.2:33960 (init_received 1, > > cluster cluster1, node_id 2) disconnect > > Mar 18 12:48:53 debug algo-lms: Client 0x55a99ce070d0 (cluster > cluster1, > > node_id 2) disconnect > > Mar 18 12:48:53 info algo-lms: server going down 0 > > Mar 18 12:48:54 debug algo-lms: Client 0x55a99cdfe590 (cluster > cluster1, > > node_id 3) Timer callback > > Mar 18 12:48:54 debug algo-util: partition (3.99d) (0x55a99ce07780) > has 1 > > nodes > > Mar 18 12:48:54 debug algo-lms: Only 1 partition. 
This is votequorum's > > problem, not ours > > Mar 18 12:48:54 debug Algorithm for client ::ffff:X.X.X.3:59762 decided > > to not reschedule timer and send vote with value ACK > > Mar 18 12:48:54 debug Sending vote info to client ::ffff:X.X.X.3:59762 > > (cluster cluster1, node_id 3) > > Mar 18 12:48:54 debug msg seq num = 1 > > Mar 18 12:48:54 debug vote = ACK > > Mar 18 12:48:54 debug Client ::ffff:X.X.X.3:59762 (cluster cluster1, > > node_id 3) replied back to vote info message > > Mar 18 12:48:54 debug msg seq num = 1 > > Mar 18 12:48:54 debug algo-lms: Client 0x55a99cdfe590 (cluster > cluster1, > > node_id 3) replied back to vote info message > > Mar 18 12:48:54 debug Client ::ffff:X.X.X.3:59762 (cluster cluster1, > > node_id 3) sent membership node list. > > Mar 18 12:48:54 debug msg seq num = 8 > > Mar 18 12:48:54 debug ring id = (3.9a1) > > Mar 18 12:48:54 debug heuristics = Pass > > Mar 18 12:48:54 debug node list: > > Mar 18 12:48:54 debug node_id = 3, data_center_id = 0, node_state = > > not set > > Mar 18 12:48:54 debug > > Mar 18 12:48:54 debug algo-lms: membership list from node 3 partition > > (3.9a1) > > Mar 18 12:48:54 debug algo-util: partition (3.99d) (0x55a99ce073f0) > has 1 > > nodes > > Mar 18 12:48:54 debug algo-lms: Only 1 partition. This is votequorum's > > problem, not ours > > Mar 18 12:48:54 debug Algorithm result vote is ACK > > Mar 18 12:48:58 debug Client ::ffff:X.X.X.3:59762 (cluster cluster1, > > node_id 3) sent membership node list. > > Mar 18 12:48:58 debug msg seq num = 9 > > Mar 18 12:48:58 debug ring id = (3.9a5) > > Mar 18 12:48:58 debug heuristics = Pass > > Mar 18 12:48:58 debug node list: > > Mar 18 12:48:58 debug node_id = 3, data_center_id = 0, node_state = > > not set > > > > > > I'm running it on CentOS7 servers and tried to follow the RH7 official > > docs, but I found a few issues there, and a bug that they won't correct, > > What issues you've found? Could you please report them so doc team can > fix them? 
> > since there is a workaround. In the end, it looks like it is working
> > fine, except for this voting issue.
> >
> > After lots of time looking for answers on Google, I decided to send a
> > message here, and hopefully you can help me fix it (it is probably a
> > silly mistake).
>
> I would bet it's really the way the traffic is blocked.
>
> Regards,
>   Honza
>
> > Any help will be appreciated.
> >
> > Thank you.
> >
> > Marcelo H. Terres <mhter...@gmail.com>
> > https://www.mundoopensource.com.br
> > https://twitter.com/mhterres
> > https://linkedin.com/in/marceloterres
> >
> > _______________________________________________
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/