Hello Jan,

Thanks for your input. It turns out there was a typo in the configuration
file (totem instead of token) ...

It should be fine now.

Regards.
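For reference, a minimal sketch of the corrected totem section, assuming the
fix is simply renaming the mistyped "totem:" option to "token:" and leaving
everything else in the original corosync.conf unchanged:

    totem {
        version: 2
        secauth: off
        cluster_name: FRPLZABPXY
        transport: udpu
        # was "totem: 4000" (the typo); the option name is "token"
        token: 4000
        interface {
            ringnumber: 0
            bindnetaddr: 10.XX.YY.2
            broadcast: yes
            mcastport: 5405
        }
    }

The file has to be edited on both nodes and the running configuration
reloaded with "sudo corosync-cfgtool -R", as already done earlier in this
thread.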
-----Original Message-----
From: Jan Friesse <[email protected]>
Sent: Tuesday, 23 July 2019 16:08
To: Cluster Labs - All topics related to open-source clustering welcomed <[email protected]>; Jean-Jacques Pons <[email protected]>
Subject: Re: [ClusterLabs] 2 nodes split brain with token timeout

Jean-Jacques,

> Hello everyone,
>
> I'm having a stability issue with a 2-node active/passive HA infrastructure
> (Zabbix VMs in this case).
> The daily backup creates latency, slowing Corosync scheduling and triggering
> a token timeout. It frequently ends up in a split-brain situation, where the
> service is started on both nodes at the same time.
>
> I did increase the token timeout to 4000 by updating corosync.conf on both
> nodes, followed by the command "sudo corosync-cfgtool -R".
> But this doesn't reflect in the log message ...

Which message do you mean? The "not scheduled" one?

> 1st question: Why?

I'm almost sure it is reflected.

> 2nd question: I found a reference to increasing
> token_retransmits_before_loss_const. Should I? To which value?

Nope.

> Best regards.
>
> JJ
>
>
> NODE 2
> Jul 22 13:30:52 FRPLZABPXY02 corosync[11552]: [MAIN ] Corosync main process
> was not scheduled for 9902.1504 ms (threshold is 800.0000 ms). Consider token
> timeout increase.

The machine was not scheduled for 9 seconds, so a 4 second token timeout is
not enough.

Regards,
  Honza

> Jul 22 13:30:52 FRPLZABPXY02 corosync[11552]: [TOTEM ] A processor failed,
> forming new configuration.
> Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]: [TOTEM ] A new membership
> (10.XX.YY.1:5808) was formed. Members joined: 1 left: 1
> Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]: [TOTEM ] Failed to receive
> the leave message. failed: 1
> Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]: [QUORUM] Members[2]: 1 2
> Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]: [MAIN ] Completed service
> synchronization, ready to provide service.
>
>
> NODE 1
> Jul 22 13:30:55 FRPLZABPXY01 corosync[1110]: [TOTEM ] A processor failed,
> forming new configuration.
> Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]: [TOTEM ] A new membership
> (10.XX.YY.1:5804) was formed. Members left: 2
> Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]: [TOTEM ] Failed to receive the
> leave message. failed: 2
> Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]: [QUORUM] Members[1]: 1
> Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Jul 22 13:31:03 FRPLZABPXY01 corosync[1110]: [TOTEM ] A new membership
> (10.XX.YY.1:5808) was formed. Members joined: 2
> Jul 22 13:31:03 FRPLZABPXY01 corosync[1110]: [QUORUM] Members[2]: 1 2
> Jul 22 13:31:03 FRPLZABPXY01 corosync[1110]: [MAIN ] Completed service
> synchronization, ready to provide service.
>
>
> cat /etc/corosync/corosync.conf
> totem {
>     version: 2
>     secauth: off
>     cluster_name: FRPLZABPXY
>     transport: udpu
>     totem: 4000
>     interface {
>         ringnumber: 0
>         bindnetaddr: 10.XX.YY.2
>         broadcast: yes
>         mcastport: 5405
>     }
> }
> nodelist {
>     node {
>         ring0_addr: 10.XX.YY.1
>         name: FRPLZABPXY01
>         nodeid: 1
>     }
>
>     node {
>         ring0_addr: 10.XX.YY.2
>         name: FRPLZABPXY02
>         nodeid: 2
>     }
> }
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
> }
> logging {
>     to_logfile: yes
>     logfile: /var/log/cluster/corosync.log
>     to_syslog: yes
> }
>
>
> sudo corosync-cmapctl | grep -E "(.config.totem.|^totem.)"
> runtime.config.totem.consensus (u32) = 1200
> runtime.config.totem.downcheck (u32) = 1000
> runtime.config.totem.fail_recv_const (u32) = 2500
> runtime.config.totem.heartbeat_failures_allowed (u32) = 0
> runtime.config.totem.hold (u32) = 180
> runtime.config.totem.join (u32) = 50
> runtime.config.totem.max_messages (u32) = 17
> runtime.config.totem.max_network_delay (u32) = 50
> runtime.config.totem.merge (u32) = 200
> runtime.config.totem.miss_count_const (u32) = 5
> runtime.config.totem.rrp_autorecovery_check_timeout (u32) = 1000
> runtime.config.totem.rrp_problem_count_mcast_threshold (u32) = 100
> runtime.config.totem.rrp_problem_count_threshold (u32) = 10
> runtime.config.totem.rrp_problem_count_timeout (u32) = 2000
> runtime.config.totem.rrp_token_expired_timeout (u32) = 238
> runtime.config.totem.send_join (u32) = 0
> runtime.config.totem.seqno_unchanged_const (u32) = 30
> runtime.config.totem.token (u32) = 1000
> runtime.config.totem.token_retransmit (u32) = 238
> runtime.config.totem.token_retransmits_before_loss_const (u32) = 4
> runtime.config.totem.window_size (u32) = 50
> totem.cluster_name (str) = FRPLZABPXY
> totem.interface.0.bindnetaddr (str) = 10.XX.YY.2
> totem.interface.0.broadcast (str) = yes
> totem.interface.0.mcastport (u16) = 5405
> totem.secauth (str) = off
> totem.totem (str) = 4000
> totem.transport (str) = udpu
> totem.version (u32) = 2
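A note on reading the cmap dump above: the mistyped key is stored verbatim as
"totem.totem (str) = 4000", while the effective timeout stays at the 1000 ms
default ("runtime.config.totem.token (u32) = 1000"), which is why the increase
never appeared to take effect. After correcting the option on both nodes and
reloading, a check along these lines (using only commands already shown in
this thread; the expected output assumes the 4000 value from this
configuration) should confirm it:

    sudo corosync-cfgtool -R
    sudo corosync-cmapctl | grep "runtime.config.totem.token "
    runtime.config.totem.token (u32) = 4000

Keep in mind Honza's point above, though: the node was not scheduled for
roughly 9.9 seconds, so even a correctly applied 4000 ms token timeout would
not survive that backup window; the token timeout would have to exceed the
longest pause observed (here, something above 10000 ms, purely as an
illustrative figure).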
