Token timeout -> network issue?
Just run a continuous ping (with timestamps) and log it to a file, from each
host to the other host and to the qdevice IP.
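For example, a timestamped ping log could be collected like this (the IP and
log path are placeholders; on Linux, iputils ping's -D flag prints an epoch
timestamp on each line, and the awk filter is a fallback for pings without -D):

```shell
# Continuous 1-second ping with epoch timestamps (iputils ping on Linux).
# 10.0.0.2 stands in for the other host's or the qdevice's address.
ping -D -i 1 10.0.0.2 >> /var/log/ping-peer.log 2>&1 &

# Fallback if your ping has no -D: prefix each line with a readable timestamp.
ping -i 1 10.0.0.2 | awk '{ "date +%FT%T" | getline d; close("date +%FT%T");
    print d, $0; fflush() }' >> /var/log/ping-peer.log &
```

Gaps or latency spikes in these logs around the fencing timestamps would
confirm (or rule out) a network problem between the sites.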
Best Regards,
Strahil Nikolov
On Thu, Feb 17, 2022 at 11:38, Sebastien BASTARD <[email protected]> wrote:

Hello Corosync team!
We currently have a Proxmox cluster with two servers (at different providers,
in different cities) and a third server, in our company, running a qdevice.
Schematic:

(A) Proxmox Server A (Provider One) ---------------- (B) Proxmox Server B (Provider Two)
          |                                                     |
          \-----------------------------------------------------/
                                   |
                  (C) Qdevice on Debian server (in the company)
Between the servers, we have approximately 50 ms of latency.
Servers A and B synchronize each virtual server every 5 minutes, so if one
server stops working, the other server can start the same virtual servers.
We don't need high availability: we can wait 5 minutes without services. After
that delay, the virtual machines must start on the other server if the first
one no longer works.
With the Corosync default configuration, fencing occurred randomly on the
servers (on average every 4-5 days), so we modified the configuration as
follows (our modification is the two token settings at the end of the totem
section):
logging {
  debug: off
  to_syslog: yes
}
nodelist {
  node {
    name: serverA
    nodeid: 1
    quorum_votes: 1
    ring0_addr: xx.xx.xx.xx
  }
  node {
    name: serverB
    nodeid: 3
    quorum_votes: 1
    ring0_addr: xx.xx.xx.xx
  }
}
quorum {
  device {
    model: net
    net {
      algorithm: ffsplit
      host: xx.xx.xx.xx
      tls: on
    }
    votes: 1
  }
  provider: corosync_votequorum
}
totem {
  cluster_name: cluster
  config_version: 24
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  token_retransmits_before_loss_const: 40
  token: 30000
}
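As a side note on the timeouts seen in the logs further down: when consensus is
not set in the totem section, corosync defaults it to 1.2 times the token
timeout, which is consistent with the "token timed out (30000ms), waiting
36000ms for consensus" lines. A quick sanity check (the corosync-cmapctl line
is commented out since it assumes a running cluster node):

```shell
# Default consensus timeout is 1.2 x token when not set in totem {}.
token=30000
consensus=$(( token * 12 / 10 ))
echo "consensus=${consensus}ms"   # 36000 ms, matching the log lines

# On a running node, the effective runtime values can be inspected with:
#   corosync-cmapctl | grep -E 'totem\.(token|consensus)'
```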
With this configuration, fencing still occurs, but on average every 15 days.
Our current problem is that when fencing occurs on one server, the other
server shows the same behaviour some minutes later ... every time.
I tested the cluster by cutting the power of server A, and everything worked:
server B started server A's virtual machines.
But in real life, when a server cannot talk to the other server, it seems that
both servers believe they are isolated from the other.
So, after a lot of tests, I don't know the best way to build a cluster that
works correctly.
Currently, the cluster stops working more often than the servers have a real
problem. Maybe my configuration is wrong, or something else?
So, I need your help =)
Here are the daemon logs from the reboot of server A (output of the command
<< cat /var/log/daemon.log | grep -E 'watchdog|corosync' >>):
...
Feb 16 09:55:00 serverA corosync[2762]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 16 09:55:00 serverA corosync[2762]: [KNET ] host: host: 3 has no active links
Feb 16 09:55:22 serverA corosync[2762]: [TOTEM ] Token has not been received in 22500 ms
Feb 16 09:55:30 serverA corosync[2762]: [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.
Feb 16 09:55:38 serverA corosync[2762]: [KNET ] rx: host: 3 link: 0 is up
Feb 16 09:55:38 serverA corosync[2762]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 16 09:55:55 serverA watchdog-mux[1890]: client watchdog expired - disable watchdog updates

Reboot....
Here are the daemon logs from the reboot of server B (output of the command
<< cat /var/log/daemon.log | grep -E 'watchdog|corosync' >>):
Feb 16 09:48:42 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is down
Feb 16 09:48:42 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:48:42 serverB corosync[2728]: [KNET ] host: host: 1 has no active links
Feb 16 09:48:57 serverB corosync[2728]: [KNET ] rx: host: 1 link: 0 is up
Feb 16 09:48:57 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:53:56 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is down
Feb 16 09:53:56 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:53:56 serverB corosync[2728]: [KNET ] host: host: 1 has no active links
Feb 16 09:54:12 serverB corosync[2728]: [KNET ] rx: host: 1 link: 0 is up
Feb 16 09:54:12 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:55:22 serverB corosync[2728]: [TOTEM ] Token has not been received in 22500 ms
Feb 16 09:55:30 serverB corosync[2728]: [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.
Feb 16 09:55:35 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is down
Feb 16 09:55:35 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:55:35 serverB corosync[2728]: [KNET ] host: host: 1 has no active links
Feb 16 09:55:55 serverB watchdog-mux[2280]: client watchdog expired - disable watchdog updates

Reboot
Do you have an idea why, when fencing occurs on one server, the other server
shows the same behaviour?
Thanks for your help.
Best regards.
Seb.
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/