Re: [ClusterLabs] 2 nodes split brain with token timeout

Jan Friesse Tue, 23 Jul 2019 07:08:59 -0700

Jean-Jacques,

Hello everyone,


I'm having a stability issue with a 2-node active/passive HA infrastructure 
(Zabbix VMs in this case).
The daily backup creates latency, slowing Corosync scheduling and triggering a 
token timeout. It frequently ends in split brain, where the service is 
started on both nodes at the same time.

I increased the token timeout to 4000 by updating corosync.conf on both nodes, 
followed by the command "sudo corosync-cfgtool -R".
But this isn't reflected in the log message ...


Which message do you mean? The "not scheduled" one?

1st question: Why?


I'm almost sure it is reflected.

2nd question: I found references to increasing 
token_retransmits_before_loss_const. Should I? To which value?


Nope.


Best regards.

JJ


NODE 2
Jul 22 13:30:52 FRPLZABPXY02 corosync[11552]:  [MAIN  ] Corosync main process 
was not scheduled for 9902.1504 ms (threshold is 800.0000 ms). Consider token 
timeout increase.

The machine was not scheduled for 9 seconds, so a 4-second token timeout is 
not enough.
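
The comparison behind that remark can be sketched in a few lines of Python. 
This is only an illustration, not a Corosync tool: the regular expression, the 
helper names and the simplified "pause shorter than token" rule are 
assumptions made for the example. So is the 0.8 factor, which assumes the 
logged threshold is 80% of the token timeout actually in effect.

```python
import re

# Log line copied from the NODE 2 excerpt, with the mail line-wrapping undone.
LOG_LINE = (
    "Jul 22 13:30:52 FRPLZABPXY02 corosync[11552]:  [MAIN  ] Corosync main "
    "process was not scheduled for 9902.1504 ms (threshold is 800.0000 ms). "
    "Consider token timeout increase."
)

# Hypothetical pattern for this one message format.
PAUSE_RE = re.compile(
    r"not scheduled for (?P<pause>[\d.]+) ms "
    r"\(threshold is (?P<threshold>[\d.]+) ms\)"
)


def parse_pause(line):
    """Return (pause_ms, threshold_ms) from a 'not scheduled' line, or None."""
    match = PAUSE_RE.search(line)
    if match is None:
        return None
    return float(match.group("pause")), float(match.group("threshold"))


def token_survives(pause_ms, token_ms):
    """Simplified rule (assumption): a pause shorter than the token timeout
    does not by itself cause a token loss."""
    return pause_ms < token_ms


pause_ms, threshold_ms = parse_pause(LOG_LINE)

# Assumption: threshold = 0.8 * effective token timeout.
implied_token_ms = round(threshold_ms / 0.8)

print(f"pause: {pause_ms} ms, threshold: {threshold_ms} ms")
print(f"token timeout implied by the threshold: {implied_token_ms} ms")
print(f"4000 ms token would survive this pause: {token_survives(pause_ms, 4000)}")
```

Under those assumptions, the pause is more than twice the intended 4000 ms, 
and the 800 ms threshold points to a token timeout of 1000 ms being in effect 
when the message was logged.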


Regards,
  Honza

Jul 22 13:30:52 FRPLZABPXY02 corosync[11552]:  [TOTEM ] A processor failed, 
forming new configuration.
Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]:  [TOTEM ] A new membership 
(10.XX.YY.1:5808) was formed. Members joined: 1 left: 1
Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]:  [TOTEM ] Failed to receive the 
leave message. failed: 1
Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]:  [QUORUM] Members[2]: 1 2
Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]:  [MAIN  ] Completed service 
synchronization, ready to provide service.


NODE 1
Jul 22 13:30:55 FRPLZABPXY01 corosync[1110]:  [TOTEM ] A processor failed, 
forming new configuration.
Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]:  [TOTEM ] A new membership 
(10.XX.YY.1:5804) was formed. Members left: 2
Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]:  [TOTEM ] Failed to receive the 
leave message. failed: 2
Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]:  [QUORUM] Members[1]: 1
Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]:  [MAIN  ] Completed service 
synchronization, ready to provide service.
Jul 22 13:31:03 FRPLZABPXY01 corosync[1110]:  [TOTEM ] A new membership 
(10.XX.YY.1:5808) was formed. Members joined: 2
Jul 22 13:31:03 FRPLZABPXY01 corosync[1110]:  [QUORUM] Members[2]: 1 2
Jul 22 13:31:03 FRPLZABPXY01 corosync[1110]:  [MAIN  ] Completed service 
synchronization, ready to provide service.


cat /etc/corosync/corosync.conf
totem {
     version: 2
     secauth: off
     cluster_name: FRPLZABPXY
     transport: udpu
     totem: 4000
     interface {
         ringnumber: 0
         bindnetaddr: 10.XX.YY.2
         broadcast: yes
         mcastport: 5405
     }
}
nodelist {
     node {
         ring0_addr: 10.XX.YY.1
         name: FRPLZABPXY01
         nodeid: 1
     }

     node {
         ring0_addr: 10.XX.YY.2
         name: FRPLZABPXY02
         nodeid: 2
     }
}
quorum {
     provider: corosync_votequorum
     two_node: 1
}
logging {
     to_logfile: yes
     logfile: /var/log/cluster/corosync.log
     to_syslog: yes
}


sudo corosync-cmapctl | grep -E "(.config.totem.|^totem.)"
runtime.config.totem.consensus (u32) = 1200
runtime.config.totem.downcheck (u32) = 1000
runtime.config.totem.fail_recv_const (u32) = 2500
runtime.config.totem.heartbeat_failures_allowed (u32) = 0
runtime.config.totem.hold (u32) = 180
runtime.config.totem.join (u32) = 50
runtime.config.totem.max_messages (u32) = 17
runtime.config.totem.max_network_delay (u32) = 50
runtime.config.totem.merge (u32) = 200
runtime.config.totem.miss_count_const (u32) = 5
runtime.config.totem.rrp_autorecovery_check_timeout (u32) = 1000
runtime.config.totem.rrp_problem_count_mcast_threshold (u32) = 100
runtime.config.totem.rrp_problem_count_threshold (u32) = 10
runtime.config.totem.rrp_problem_count_timeout (u32) = 2000
runtime.config.totem.rrp_token_expired_timeout (u32) = 238
runtime.config.totem.send_join (u32) = 0
runtime.config.totem.seqno_unchanged_const (u32) = 30
runtime.config.totem.token (u32) = 1000
runtime.config.totem.token_retransmit (u32) = 238
runtime.config.totem.token_retransmits_before_loss_const (u32) = 4
runtime.config.totem.window_size (u32) = 50
totem.cluster_name (str) = FRPLZABPXY
totem.interface.0.bindnetaddr (str) = 10.XX.YY.2
totem.interface.0.broadcast (str) = yes
totem.interface.0.mcastport (u16) = 5405
totem.secauth (str) = off
totem.totem (str) = 4000
totem.transport (str) = udpu
totem.version (u32) = 2
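
In the dump above, runtime.config.totem.token is still 1000, and the only 
place 4000 appears is a string key named totem.totem, which matches the 
"totem: 4000" line in the posted corosync.conf. The token timeout option in 
corosync.conf is spelled "token", so one possible reading is that the key was 
mistyped and the intended value never reached the runtime. The Python sketch 
below only shows how to spot that in the dump; the parser and its names are 
made up for the example, and it assumes that a correctly spelled option would 
show up as a totem.token key.

```python
import re

# A few lines copied from the corosync-cmapctl output above.
CMAP_OUTPUT = """\
runtime.config.totem.token (u32) = 1000
runtime.config.totem.token_retransmit (u32) = 238
totem.cluster_name (str) = FRPLZABPXY
totem.secauth (str) = off
totem.totem (str) = 4000
totem.transport (str) = udpu
totem.version (u32) = 2
"""

# Hypothetical pattern for "key (type) = value" lines.
ENTRY_RE = re.compile(r"^(?P<key>\S+) \((?P<type>\w+)\) = (?P<value>.*)$")


def parse_cmap(text):
    """Map each cmap key to a (type, value) tuple of strings."""
    entries = {}
    for line in text.splitlines():
        match = ENTRY_RE.match(line.strip())
        if match:
            entries[match.group("key")] = (
                match.group("type"),
                match.group("value"),
            )
    return entries


entries = parse_cmap(CMAP_OUTPUT)

# Value the running daemon reports for the token timeout.
runtime_token = int(entries["runtime.config.totem.token"][1])
# Key a correctly spelled option is assumed to create (absent here).
configured_token = entries.get("totem.token")
# Key created by the "totem: 4000" line in the posted config.
stray_key = entries.get("totem.totem")

print(f"runtime token timeout: {runtime_token} ms")
print(f"totem.token present in config keys: {configured_token is not None}")
print(f"totem.totem entry: {stray_key}")
```

If that reading is right, changing the line to "token: 4000" on both nodes 
and reloading should change runtime.config.totem.token, which is the value to 
check after "corosync-cfgtool -R".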



_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

