Thank you, Ulrich, for your script! I launched it, with a 10-second delay:
- on Server A, to ping Server B
- on Server B, to ping Server A
- on the QDevice, to ping Server A and Server B

I currently can't ping the QDevice from servers A and B, because it is behind a firewall which only authorizes port 5403.

Tomorrow, I will see the results.

Best regards.

On Thu, Feb 17, 2022 at 12:22, Ulrich Windl <[email protected]> wrote:

> Hi!
>
> It seems your problem is the network. Maybe check the connectivity between
> all nodes (and the quorum device).
> Some time ago I wrote a simple script that can log ups and downs (you'll
> have to adjust it for non-LAN traffic); maybe it helps:
> ----
> # Test Host Status (Up, Down) via ping (ICMP Echo)
> #$Id: up-down-test.sh,v 1.2 2018/03/07 15:17:32 windl Exp $
>
> # Written for SLES 11 SP3 by Ulrich Windl
> TESTHOST="${1:-localhost}"
> SDELAY="${2:-300}"
> IFACE_OPT="${3:+-I$3}"
> STATE=0
> WHEN=$(date +%s)
>
> # add a time stamp to the message and echo it
> log_time()
> {
>     typeset t="$1"; shift
>     echo "$@ $t ($(date -d@"$t" -u +%F_%T))"
> }
>
> trap 'log_time $(date +%s) "---EXIT"' EXIT
> log_time $(date +%s) "---START"
> while sleep "$SDELAY"
> do
>     if ping -c3 -i0.33 $IFACE_OPT -n -q "$TESTHOST" >/dev/null; then
>         _STATE=1
>     else
>         _STATE=0
>     fi
>     if [ "$STATE" -ne "$_STATE" ]; then
>         _WHEN=$(date +%s)
>         ((DELTA = _WHEN - WHEN))
>         log_time "$_WHEN" "$STATE ($DELTA) -> $_STATE"
>         STATE="$_STATE"
>         WHEN="$_WHEN"
>     fi
> done
> ----
>
> The script expects three parameters: host_to_test
> delay_between_checks_in_seconds [interface_to_use]
> Without parameters it checks localhost every 5 minutes.
>
> Obviously your cluster cannot have higher availability than your network.
> First you need to get an impression of how reliable your network is. Then,
> maybe, tune the cluster parameters.
>
> Regards,
> Ulrich
>
> >>> Sebastien BASTARD <[email protected]> wrote on 17.02.2022 at
> 10:37 in message
> <caajzqdwoubrbcea83yogi4s9mpppirexc5t3apfmtebsk+6...@mail.gmail.com>:
> > Hello Corosync team!
> >
> > We currently have a Proxmox cluster with 2 servers (at different providers
> > and in different cities) and another server, in our company, with a QDevice.
> >
> > Schematic:
> >
> > (A) Proxmox Server A (Provider One) ---------------------- (B) Proxmox Server B (Provider Two)
> >                  |                                                          |
> >                  \----------------------------------------------------------/
> >                                              |
> >                   (C) QDevice on a Debian server (in the company)
> >
> > Between each server, we have approximately 50 ms of latency.
> >
> > Between servers A and B, each virtual server is synchronized every 5
> > minutes, so if a server stops working, the second server starts the same
> > virtual server.
> >
> > We don't need High Availability. We can wait 5 minutes without services.
> > After this delay, the virtual machine must start on the other server if the
> > first server does not work anymore.
> >
> > With the corosync default configuration, fencing occurred on the servers
> > randomly (on average every 4-5 days), so we modified the configuration with
> > this (bold text is our modification):
> >
> > logging {
> >   debug: off
> >   to_syslog: yes
> > }
> >
> > nodelist {
> >   node {
> >     name: serverA
> >     nodeid: 1
> >     quorum_votes: 1
> >     ring0_addr: xx.xx.xx.xx
> >   }
> >   node {
> >     name: serverB
> >     nodeid: 3
> >     quorum_votes: 1
> >     ring0_addr: xx.xx.xx.xx
> >   }
> > }
> >
> > quorum {
> >   device {
> >     model: net
> >     net {
> >       algorithm: ffsplit
> >       host: xx.xx.xx.xx
> >       tls: on
> >     }
> >     votes: 1
> >   }
> >   provider: corosync_votequorum
> > }
> >
> > totem {
> >   cluster_name: cluster
> >   config_version: 24
> >   interface {
> >     linknumber: 0
> >   }
> >   ip_version: ipv4-6
> >   link_mode: passive
> >   secauth: on
> >   version: 2
> >   *token_retransmits_before_loss_const: 40*
> >   *token: 30000*
> > }
> >
> > With this configuration, the fencing of the servers continues, but with an
> > average of 15 days between occurrences.
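As a cross-check on the two highlighted totem values: per corosync.conf(5), when `consensus` is not set explicitly it defaults to 1.2 × token, which matches the "token timed out (30000ms), waiting 36000ms for consensus" lines in the logs further down. A minimal sketch of that arithmetic (variable names are illustrative, not corosync internals):

```shell
# consensus defaults to 1.2 * token when not set in corosync.conf
token=30000                     # totem token timeout in ms (from the config above)
consensus=$((token * 12 / 10))  # 1.2 * token, using integer arithmetic
echo "token=${token}ms consensus=${consensus}ms"
# -> token=30000ms consensus=36000ms
```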
> >
> > Our current problem is that when fencing occurs on one server, the second
> > server shows the same behaviour a few minutes later... and every time.
> >
> > I tested the cluster by cutting the power of server A, and everything
> > worked great: server B started the virtual machines of server A.
> >
> > But in real life, when a server can't talk to the other main server, it
> > seems that the two servers both believe they are isolated from the other.
> >
> > So, after a lot of tests, I don't know the best way to build a
> > cluster that works correctly.
> >
> > Currently, the cluster stops working more often than the servers have a
> > real problem.
> >
> > Maybe my configuration is not good, or something else?
> >
> > So, I need your help =)
> >
> > *Here are the kernel logs of the reboot of server A (result of the command
> > line << cat /var/log/daemon.log | grep -E 'watchdog|corosync' >>):*
> >
> > ...
> > Feb 16 09:55:00 serverA corosync[2762]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
> > Feb 16 09:55:00 serverA corosync[2762]: [KNET ] host: host: 3 has no active links
> > Feb 16 09:55:22 serverA corosync[2762]: [TOTEM ] Token has not been received in 22500 ms
> > Feb 16 09:55:30 serverA corosync[2762]: [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.
> > Feb 16 09:55:38 serverA corosync[2762]: [KNET ] rx: host: 3 link: 0 is up
> > Feb 16 09:55:38 serverA corosync[2762]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
> > Feb 16 09:55:55 serverA watchdog-mux[1890]: client watchdog expired - disable watchdog updates
> > *Reboot*
> > ...
> >
> > *Here are the kernel logs of the reboot of server B (result of the command
> > line << cat /var/log/daemon.log | grep -E 'watchdog|corosync' >>):*
> >
> > Feb 16 09:48:42 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is down
> > Feb 16 09:48:42 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> > Feb 16 09:48:42 serverB corosync[2728]: [KNET ] host: host: 1 has no active links
> > Feb 16 09:48:57 serverB corosync[2728]: [KNET ] rx: host: 1 link: 0 is up
> > Feb 16 09:48:57 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> > Feb 16 09:53:56 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is down
> > Feb 16 09:53:56 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> > Feb 16 09:53:56 serverB corosync[2728]: [KNET ] host: host: 1 has no active links
> > Feb 16 09:54:12 serverB corosync[2728]: [KNET ] rx: host: 1 link: 0 is up
> > Feb 16 09:54:12 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> > Feb 16 09:55:22 serverB corosync[2728]: [TOTEM ] Token has not been received in 22500 ms
> > Feb 16 09:55:30 serverB corosync[2728]: [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.
> > Feb 16 09:55:35 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is down
> > Feb 16 09:55:35 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> > Feb 16 09:55:35 serverB corosync[2728]: [KNET ] host: host: 1 has no active links
> > Feb 16 09:55:55 serverB watchdog-mux[2280]: client watchdog expired - disable watchdog updates
> > *Reboot*
> >
> > Do you have an idea why, when fencing occurs on one server, the other
> > server shows the same behavior?
> >
> > Thanks for your help.
> >
> > Best regards.
> >
> > Seb.
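One note on the connectivity testing described at the top of this reply: since the firewall in front of the QDevice only authorizes port 5403, the ICMP probe in Ulrich's script cannot reach it from servers A and B. A TCP connect test against the qnetd port could stand in for the ping there. A sketch, assuming bash (for its /dev/tcp redirection) and the coreutils timeout command; 5403 is the default corosync-qnetd port:

```shell
#!/bin/bash
# TCP "ping": print "up" if a TCP connection to host:port opens within
# 3 seconds, "down" otherwise. Hypothetical helper, not from the thread.
QHOST="${1:-localhost}"
QPORT="${2:-5403}"   # corosync-qnetd default TCP port
if timeout 3 bash -c "exec 3<>/dev/tcp/${QHOST}/${QPORT}" 2>/dev/null; then
    echo up
else
    echo down
fi
```

The if-condition could replace the ping call inside the while loop of the up/down script, so the same state-change logging works for the QDevice link.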
> > _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/

--
Sébastien BASTARD
R&D Engineer | Domalys • Créateurs d'autonomie
phone: +33 5 49 83 00 08 | site: www.domalys.com
email: [email protected]
address: 58 Rue du Vercors, 86240 Fontaine-Le-Comte
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
