On Fri, May 20, 2022 at 7:43 AM Ulrich Windl
<ulrich.wi...@rz.uni-regensburg.de> wrote:
>
> >>> Jan Friesse <jfrie...@redhat.com> wrote on 19.05.2022 at 14:55 in
> message <1abb8468-6619-329f-cb01-3f51112db...@redhat.com>:
> > Hi,
> >
> > On 19/05/2022 10:16, Leditzky, Fabian via Users wrote:
> >> Hello
> >>
> >> We have been dealing with our pacemaker/corosync clusters becoming
> >> unstable.
> >> The OS is Debian 10 and we use Debian packages for pacemaker and
> >> corosync, versions 3.0.1-5+deb10u1 and 3.0.1-2+deb10u1 respectively.
> >
> > Seems like the pcmk version is not so important for the behavior you've
> > described. Corosync 3.0.1 is super old, are you able to reproduce the
>
> I'm running corosync-2.4.5-12.7.1.x86_64 (SLES15 SP3) here ;-)
> Are you mixing up "super old" with "super buggy"?

Actually, 3.0.1 is older than 2.4.5, and on top of that 2.4.5 is the head
of a mature branch, while 3.0.1 is the beginning of a new branch that
brought substantial changes.

Klaus

> Regards,
> Ulrich
>
> > behavior with 3.1.6? What is the version of knet? There have been quite
> > a few fixes, so the latest one (1.23) is really recommended.
> >
> > You can try to compile it yourself, or use the proxmox repo
> > (http://download.proxmox.com/debian/pve/) which contains newer versions
> > of the packages.
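
As an aside, a rough sketch of checking the installed versions and, if
desired, pulling newer builds on Debian 10. The "buster" suite and
"pve-no-subscription" component names are assumptions about the Proxmox
repository layout, and the repository's signing key has to be installed
first; run as root:

    # show the packaged corosync and knet library versions
    dpkg -l corosync libknet1 | grep '^ii'
    corosync -v

    # switch to the Proxmox repository for newer packages
    echo 'deb http://download.proxmox.com/debian/pve buster pve-no-subscription' \
        > /etc/apt/sources.list.d/pve.list
    apt-get update && apt-get install corosync libknet1
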
> >
> >> We use knet over UDP transport.
> >>
> >> We run multiple 2-node and 4-8 node clusters, primarily managing VIP
> >> resources.
> >> The issue we experience presents itself as a spontaneous disagreement
> >> about the status of cluster members. In two-node clusters, each node
> >> spontaneously sees the other node as offline, despite network
> >> connectivity being OK.
> >> In larger clusters, the status can be inconsistent across the nodes.
> >> E.g.: node 1 sees 2 and 4 as offline, node 2 sees 1 and 4 as offline,
> >> while nodes 3 and 4 see every node as online.
> >
> > This really shouldn't happen.
> >
> >> The cluster becomes generally unresponsive to resource actions in this
> >> state.
> >
> > Expected
> >
> >> Thus far we have been unable to restore cluster health without
> >> restarting corosync.
> >>
> >> We are running packet captures 24/7 on the clusters and have custom
> >> tooling to detect lost UDP packets on knet ports. So far we have not
> >> seen significant packet loss trigger an event; at most we have seen a
> >> single UDP packet dropped some seconds before the cluster fails.
> >>
> >> However, even if the root cause is indeed a flaky network, we do not
> >> understand why the cluster cannot recover on its own in any way. The
> >> issues definitely persist beyond the presence of any intermittent
> >> network problem.
> >
> > Try a newer version. If the problem persists, it's a good idea to
> > monitor whether packets are really getting through. Corosync always
> > (at least) creates a single-node membership.
> >
> > Regards,
> >   Honza
> >
> >> We were able to artificially break clusters by inducing packet loss
> >> with an iptables rule.
> >> Dropping packets on a single node of an 8-node cluster can cause
> >> malfunctions on multiple other cluster nodes. The expected behavior
> >> would be to detect that the artificially broken node has failed while
> >> keeping the rest of the cluster stable.
> >> We were able to reproduce this also on Debian 11 with more recent
> >> corosync/pacemaker versions.
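
For reference, a minimal sketch of such a packet-loss rule. The port
assumes knet's default of 5405, eth0 matches the nic from the config
below, and the 20% drop probability is just an example value:

    # randomly drop ~20% of inbound knet traffic on the node under test
    iptables -A INPUT -i eth0 -p udp --dport 5405 \
        -m statistic --mode random --probability 0.2 -j DROP

    # remove the rule again once the test is done
    iptables -D INPUT -i eth0 -p udp --dport 5405 \
        -m statistic --mode random --probability 0.2 -j DROP
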
> >> Our configuration is basic; we do not significantly deviate from the
> >> defaults.
> >>
> >> We will be very grateful for any insights into this problem.
> >>
> >> Thanks,
> >> Fabian
> >>
> >> // corosync.conf
> >> totem {
> >>     version: 2
> >>     cluster_name: cluster01
> >>     crypto_cipher: aes256
> >>     crypto_hash: sha512
> >>     transport: knet
> >> }
> >> logging {
> >>     fileline: off
> >>     to_stderr: no
> >>     to_logfile: no
> >>     to_syslog: yes
> >>     debug: off
> >>     timestamp: on
> >>     logger_subsys {
> >>         subsys: QUORUM
> >>         debug: off
> >>     }
> >> }
> >> quorum {
> >>     provider: corosync_votequorum
> >>     two_node: 1
> >>     expected_votes: 2
> >> }
> >> nodelist {
> >>     node {
> >>         name: node01
> >>         nodeid: 01
> >>         ring0_addr: 10.0.0.10
> >>     }
> >>     node {
> >>         name: node02
> >>         nodeid: 02
> >>         ring0_addr: 10.0.0.11
> >>     }
> >> }
> >>
> >> // crm config show
> >> node 1: node01 \
> >>     attributes standby=off
> >> node 2: node02 \
> >>     attributes standby=off maintenance=off
> >> primitive IP-clusterC1 IPaddr2 \
> >>     params ip=10.0.0.20 nic=eth0 cidr_netmask=24 \
> >>     meta migration-threshold=2 target-role=Started is-managed=true \
> >>     op monitor interval=20 timeout=60 on-fail=restart
> >> primitive IP-clusterC2 IPaddr2 \
> >>     params ip=10.0.0.21 nic=eth0 cidr_netmask=24 \
> >>     meta migration-threshold=2 target-role=Started is-managed=true \
> >>     op monitor interval=20 timeout=60 on-fail=restart
> >> location STICKY-IP-clusterC1 IP-clusterC1 100: node01
> >> location STICKY-IP-clusterC2 IP-clusterC2 100: node02
> >> property cib-bootstrap-options: \
> >>     have-watchdog=false \
> >>     dc-version=2.0.1-9e909a5bdd \
> >>     cluster-infrastructure=corosync \
> >>     cluster-name=cluster01 \
> >>     stonith-enabled=no \
> >>     no-quorum-policy=ignore \
> >>     last-lrm-refresh=1632230917
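
One more thought: since the nodes disagree about membership, it may be
worth recording each node's own view whenever the cluster degrades.
A sketch using the standard tools:

    # per-link connectivity as knet sees it on this node
    corosync-cfgtool -s

    # quorum state and the current membership list
    corosync-quorumtool -s

    # pacemaker's one-shot view of node and resource status
    crm_mon -1

Comparing that output across nodes while the cluster is in the broken
state should show whether corosync itself has formed inconsistent
memberships or whether only pacemaker's view is stale.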