Token timeout -> network issue?
Just run a continuous ping (with timestamps) and log it to a file, from each
host to the other host and to the qdevice IP.
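For example, a timestamped ping log could be collected like this (the IP and
log path are placeholders; on Linux, iputils ping's -D flag prints an epoch
timestamp on each line, and the awk filter is a fallback for pings without -D):

```shell
# Continuous 1-second ping with epoch timestamps (iputils ping on Linux).
# 10.0.0.2 stands in for the other host's or the qdevice's address.
ping -D -i 1 10.0.0.2 >> /var/log/ping-peer.log 2>&1 &

# Fallback if your ping has no -D: prefix each line with a readable timestamp.
ping -i 1 10.0.0.2 | awk '{ "date +%FT%T" | getline d; close("date +%FT%T");
    print d, $0; fflush() }' >> /var/log/ping-peer.log &
```

Gaps or latency spikes in these logs around the fencing timestamps would
confirm (or rule out) a network problem between the sites.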
Best Regards,
Strahil Nikolov
On Thu, Feb 17, 2022 at 11:38, Sebastien BASTARD <[email protected]> wrote:

Hello Corosync team!
We currently have a Proxmox cluster with two servers (at different providers,
in different cities) and a third server, in our company, running a qdevice.
Schematic:

(A) Proxmox Server A (Provider One) ---------------- (B) Proxmox Server B (Provider Two)
          |                                                     |
          \-----------------------------------------------------/
                                   |
                  (C) Qdevice on Debian server (in the company)
Between the servers, we have approximately 50 ms of latency.
Servers A and B synchronize each virtual server every 5 minutes, so if one
server stops working, the other server can start the same virtual servers.
We don't need high availability: we can wait 5 minutes without services. After
that delay, the virtual machines must start on the other server if the first
one no longer works.
With the Corosync default configuration, fencing occurred randomly on the
servers (on average every 4-5 days), so we modified the configuration as
follows (our modification is the two token settings at the end of the totem
section):
logging {
  debug: off
  to_syslog: yes
}
nodelist {
  node {
    name: serverA
    nodeid: 1
    quorum_votes: 1
    ring0_addr: xx.xx.xx.xx
  }
  node {
    name: serverB
    nodeid: 3
    quorum_votes: 1
    ring0_addr: xx.xx.xx.xx
  }
}
quorum {
  device {
    model: net
    net {
      algorithm: ffsplit
      host: xx.xx.xx.xx
      tls: on
    }
    votes: 1
  }
  provider: corosync_votequorum
}
totem {
  cluster_name: cluster
  config_version: 24
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  token_retransmits_before_loss_const: 40
  token: 30000
}
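As a side note on the timeouts seen in the logs further down: when consensus is
not set in the totem section, corosync defaults it to 1.2 times the token
timeout, which is consistent with the "token timed out (30000ms), waiting
36000ms for consensus" lines. A quick sanity check (the corosync-cmapctl line
is commented out since it assumes a running cluster node):

```shell
# Default consensus timeout is 1.2 x token when not set in totem {}.
token=30000
consensus=$(( token * 12 / 10 ))
echo "consensus=${consensus}ms"   # 36000 ms, matching the log lines

# On a running node, the effective runtime values can be inspected with:
#   corosync-cmapctl | grep -E 'totem\.(token|consensus)'
```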
With this configuration, fencing still occurs, but on average every 15 days.
Our current problem is that when fencing occurs on one server, the other
server shows the same behaviour some minutes later ... every time.
I tested the cluster by cutting the power of server A, and everything worked:
server B started server A's virtual machines.
But in real life, when a server cannot talk to the other server, it seems that
both servers believe they are isolated from the other.
So, after a lot of tests, I don't know the best way to build a cluster that
works correctly.
Currently, the cluster stops working more often than the servers have a real
problem. Maybe my configuration is wrong, or something else?
So, I need your help =)
Here are the daemon logs from the reboot of server A (output of the command
<< cat /var/log/daemon.log | grep -E 'watchdog|corosync' >>):
...
Feb 16 09:55:00 serverA corosync[2762]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 16 09:55:00 serverA corosync[2762]: [KNET ] host: host: 3 has no active links
Feb 16 09:55:22 serverA corosync[2762]: [TOTEM ] Token has not been received in 22500 ms
Feb 16 09:55:30 serverA corosync[2762]: [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.
Feb 16 09:55:38 serverA corosync[2762]: [KNET ] rx: host: 3 link: 0 is up
Feb 16 09:55:38 serverA corosync[2762]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 16 09:55:55 serverA watchdog-mux[1890]: client watchdog expired - disable watchdog updates

Reboot....
Here are the daemon logs from the reboot of server B (output of the command
<< cat /var/log/daemon.log | grep -E 'watchdog|corosync' >>):
Feb 16 09:48:42 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is down
Feb 16 09:48:42 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:48:42 serverB corosync[2728]: [KNET ] host: host: 1 has no active links
Feb 16 09:48:57 serverB corosync[2728]: [KNET ] rx: host: 1 link: 0 is up
Feb 16 09:48:57 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:53:56 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is down
Feb 16 09:53:56 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:53:56 serverB corosync[2728]: [KNET ] host: host: 1 has no active links
Feb 16 09:54:12 serverB corosync[2728]: [KNET ] rx: host: 1 link: 0 is up
Feb 16 09:54:12 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:55:22 serverB corosync[2728]: [TOTEM ] Token has not been received in 22500 ms
Feb 16 09:55:30 serverB corosync[2728]: [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.
Feb 16 09:55:35 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is down
Feb 16 09:55:35 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:55:35 serverB corosync[2728]: [KNET ] host: host: 1 has no active links
Feb 16 09:55:55 serverB watchdog-mux[2280]: client watchdog expired - disable watchdog updates

Reboot
Do you have an idea why, when fencing occurs on one server, the other server
shows the same behaviour?
Thanks for your help.
Best regards.
Seb.
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/