On Fri, May 20, 2022 at 7:43 AM Ulrich Windl
<ulrich.wi...@rz.uni-regensburg.de> wrote:
>
> >>> Jan Friesse <jfrie...@redhat.com> wrote on 19.05.2022 at 14:55 in
> message <1abb8468-6619-329f-cb01-3f51112db...@redhat.com>:
> > Hi,
> >
> > On 19/05/2022 10:16, Leditzky, Fabian via Users wrote:
> >> Hello
> >>
> >> We have been dealing with our pacemaker/corosync clusters becoming
> >> unstable.
> >> The OS is Debian 10 and we use Debian packages for pacemaker and
> >> corosync, versions 3.0.1-5+deb10u1 and 3.0.1-2+deb10u1 respectively.
> >
> > Seems like the pcmk version is not so important for the behavior you've
> > described. Corosync 3.0.1 is super old, are you able to reproduce the
>
> I'm running corosync-2.4.5-12.7.1.x86_64 (SLES15 SP3) here ;-)
> Are you mixing up "super old" with "super buggy"?

Actually, 3.0.1 is older than 2.4.5, and on top of that 2.4.5 is the head
of a mature branch, while 3.0.1 is the beginning of a new branch that
brought substantial changes.

Klaus

> Regards,
> Ulrich
>
> > behavior with 3.1.6? What is the version of knet? There have been quite
> > a few fixes, so the latest one (1.23) is really recommended.
> >
> > You can try to compile it yourself, or use the proxmox repo
> > (http://download.proxmox.com/debian/pve/) which contains newer versions
> > of the packages.
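
As an aside, a rough sketch of checking the installed versions and, if
desired, pulling newer builds on Debian 10. The "buster" suite and
"pve-no-subscription" component names are assumptions about the Proxmox
repository layout, and the repository's signing key has to be installed
first; run as root:

    # show the packaged corosync and knet library versions
    dpkg -l corosync libknet1 | grep '^ii'
    corosync -v

    # switch to the Proxmox repository for newer packages
    echo 'deb http://download.proxmox.com/debian/pve buster pve-no-subscription' \
        > /etc/apt/sources.list.d/pve.list
    apt-get update && apt-get install corosync libknet1
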
> >
> >> We use knet over UDP transport.
> >>
> >> We run multiple 2-node and 4-8 node clusters, primarily managing VIP
> >> resources.
> >> The issue we experience presents itself as a spontaneous disagreement
> >> about the status of cluster members. In two-node clusters, each node
> >> spontaneously sees the other node as offline, despite network
> >> connectivity being OK.
> >> In larger clusters, the status can be inconsistent across the nodes.
> >> E.g.: node 1 sees 2 and 4 as offline, node 2 sees 1 and 4 as offline,
> >> while nodes 3 and 4 see every node as online.
> >
> > This really shouldn't happen.
> >
> >> The cluster becomes generally unresponsive to resource actions in this
> >> state.
> >
> > Expected
> >
> >> Thus far we have been unable to restore cluster health without
> >> restarting corosync.
> >>
> >> We are running packet captures 24/7 on the clusters and have custom
> >> tooling to detect lost UDP packets on knet ports. So far we have not
> >> seen significant packet loss trigger an event; at most we have seen a
> >> single UDP packet dropped some seconds before the cluster fails.
> >>
> >> However, even if the root cause is indeed a flaky network, we do not
> >> understand why the cluster cannot recover on its own in any way. The
> >> issues definitely persist beyond the presence of any intermittent
> >> network problem.
> >
> > Try a newer version. If the problem persists, it's a good idea to
> > monitor whether packets are really getting through. Corosync always
> > (at least) creates a single-node membership.
> >
> > Regards,
> >   Honza
> >
> >> We were able to artificially break clusters by inducing packet loss
> >> with an iptables rule.
> >> Dropping packets on a single node of an 8-node cluster can cause
> >> malfunctions on multiple other cluster nodes. The expected behavior
> >> would be to detect that the artificially broken node has failed while
> >> keeping the rest of the cluster stable.
> >> We were able to reproduce this also on Debian 11 with more recent
> >> corosync/pacemaker versions.
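
For reference, a minimal sketch of such a packet-loss rule. The port
assumes knet's default of 5405, eth0 matches the nic from the config
below, and the 20% drop probability is just an example value:

    # randomly drop ~20% of inbound knet traffic on the node under test
    iptables -A INPUT -i eth0 -p udp --dport 5405 \
        -m statistic --mode random --probability 0.2 -j DROP

    # remove the rule again once the test is done
    iptables -D INPUT -i eth0 -p udp --dport 5405 \
        -m statistic --mode random --probability 0.2 -j DROP
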
> >> Our configuration is basic; we do not significantly deviate from the
> >> defaults.
> >>
> >> We will be very grateful for any insights into this problem.
> >>
> >> Thanks,
> >> Fabian
> >>
> >> // corosync.conf
> >> totem {
> >>     version: 2
> >>     cluster_name: cluster01
> >>     crypto_cipher: aes256
> >>     crypto_hash: sha512
> >>     transport: knet
> >> }
> >> logging {
> >>     fileline: off
> >>     to_stderr: no
> >>     to_logfile: no
> >>     to_syslog: yes
> >>     debug: off
> >>     timestamp: on
> >>     logger_subsys {
> >>         subsys: QUORUM
> >>         debug: off
> >>     }
> >> }
> >> quorum {
> >>     provider: corosync_votequorum
> >>     two_node: 1
> >>     expected_votes: 2
> >> }
> >> nodelist {
> >>     node {
> >>         name: node01
> >>         nodeid: 01
> >>         ring0_addr: 10.0.0.10
> >>     }
> >>     node {
> >>         name: node02
> >>         nodeid: 02
> >>         ring0_addr: 10.0.0.11
> >>     }
> >> }
> >>
> >> // crm config show
> >> node 1: node01 \
> >>     attributes standby=off
> >> node 2: node02 \
> >>     attributes standby=off maintenance=off
> >> primitive IP-clusterC1 IPaddr2 \
> >>     params ip=10.0.0.20 nic=eth0 cidr_netmask=24 \
> >>     meta migration-threshold=2 target-role=Started is-managed=true \
> >>     op monitor interval=20 timeout=60 on-fail=restart
> >> primitive IP-clusterC2 IPaddr2 \
> >>     params ip=10.0.0.21 nic=eth0 cidr_netmask=24 \
> >>     meta migration-threshold=2 target-role=Started is-managed=true \
> >>     op monitor interval=20 timeout=60 on-fail=restart
> >> location STICKY-IP-clusterC1 IP-clusterC1 100: node01
> >> location STICKY-IP-clusterC2 IP-clusterC2 100: node02
> >> property cib-bootstrap-options: \
> >>     have-watchdog=false \
> >>     dc-version=2.0.1-9e909a5bdd \
> >>     cluster-infrastructure=corosync \
> >>     cluster-name=cluster01 \
> >>     stonith-enabled=no \
> >>     no-quorum-policy=ignore \
> >>     last-lrm-refresh=1632230917
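
One more thought: since the nodes disagree about membership, it may be
worth recording each node's own view whenever the cluster degrades.
A sketch using the standard tools:

    # per-link connectivity as knet sees it on this node
    corosync-cfgtool -s

    # quorum state and the current membership list
    corosync-quorumtool -s

    # pacemaker's one-shot view of node and resource status
    crm_mon -1

Comparing that output across nodes while the cluster is in the broken
state should show whether corosync itself has formed inconsistent
memberships or whether only pacemaker's view is stale.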