Re: [ClusterLabs] Antw: [EXT] Re: Cluster unable to find back together

2022-05-23 Thread Klaus Wenninger
On Fri, May 20, 2022 at 7:43 AM Ulrich Windl
 wrote:
>
> >>> Jan Friesse wrote on 2022-05-19 at 14:55 in message
> <1abb8468-6619-329f-cb01-3f51112db...@redhat.com>:
> > Hi,
> >
> > On 19/05/2022 10:16, Leditzky, Fabian via Users wrote:
> >> Hello
> >>
> >> We have been dealing with our pacemaker/corosync clusters becoming
> >> unstable. The OS is Debian 10 and we use Debian packages for pacemaker
> >> and corosync, versions 3.0.1-5+deb10u1 and 3.0.1-2+deb10u1 respectively.
> >
> > Seems like the pcmk version is not so important for the behavior you've
> > described. Corosync 3.0.1 is super old; are you able to reproduce the
>
> I'm running corosync-2.4.5-12.7.1.x86_64 (SLES15 SP3) here ;-)
>
> Are you mixing up "super old" with "super buggy"?

Actually, 3.0.1 is older than 2.4.5, and on top of that, 2.4.5 is the head
of a mature branch while 3.0.1 is the beginning of a new branch that brought
substantial changes.

Klaus
>
> Regards,
> Ulrich
>
> > behavior with 3.1.6? What is the version of knet? There were quite a few
> > fixes, so the latest one (1.23) is really recommended.
> >
> > You can try to compile it yourself, or use the proxmox repo
> > (http://download.proxmox.com/debian/pve/), which contains newer versions
> > of the packages.
> >
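For reference, a quick way to check what is actually installed on Debian,
plus a sketch of pulling newer packages from the proxmox repo mentioned
above (the suite/component names here are assumptions; check the repo's
own documentation, and note that its signing key must also be installed):

# show the installed corosync and knet package versions
dpkg -l corosync libknet1

# sketch for Debian 10 ("buster"); verify suite/component before use
echo "deb http://download.proxmox.com/debian/pve buster pve-no-subscription" \
    > /etc/apt/sources.list.d/pve.list
apt-get update && apt-get install corosync libknet1
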
> >> We use knet over UDP transport.
> >>
> >> We run multiple 2-node and 4-8 node clusters, primarily managing VIP
> >> resources. The issue we experience presents itself as a spontaneous
> >> disagreement about the status of cluster members. In two-node clusters,
> >> each node spontaneously sees the other node as offline, despite network
> >> connectivity being OK. In larger clusters, the status can be
> >> inconsistent across the nodes. E.g.: node1 sees 2 and 4 as offline,
> >> node2 sees 1 and 4 as offline, while nodes 3 and 4 see every node as
> >> online.
> >
> > This really shouldn't happen.
> >
> >> The cluster becomes generally unresponsive to resource actions in this
> >> state.
> >
> > Expected
> >
> >> Thus far we have been unable to restore cluster health without
> >> restarting corosync.
> >>
> >> We are running packet captures 24/7 on the clusters and have custom
> >> tooling to detect lost UDP packets on knet ports. So far we have not
> >> seen significant packet loss trigger an event; at most we have seen a
> >> single UDP packet dropped some seconds before the cluster fails.
> >>
> >> However, even if the root cause is indeed a flaky network, we do not
> >> understand why the cluster cannot recover on its own in any way. The
> >> issues definitely persist beyond the presence of any intermittent
> >> network problem.
> >
> > Try a newer version. If the problem persists, it's a good idea to
> > monitor whether packets are really passing through. Corosync always
> > creates (at least) a single-node membership.
> >
> > Regards,
> >Honza
> >
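A minimal way to watch the knet traffic itself, assuming the default totem
port 5405 and that eth0 carries the ring (both are assumptions here):

# capture corosync/knet UDP packets on the default port
tcpdump -ni eth0 udp port 5405
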
> >>
> >> We were able to artificially break clusters by inducing packet loss
> >> with an iptables rule (see the sketch after this paragraph). Dropping
> >> packets on a single node of an 8-node cluster can cause malfunctions on
> >> multiple other cluster nodes. The expected behavior would be to detect
> >> that the artificially broken node has failed but to keep the rest of
> >> the cluster stable. We were able to reproduce this also on Debian 11
> >> with more recent corosync/pacemaker versions.
> >>
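The exact rule isn't shown above; a sketch of one way to induce partial
packet loss like this on a test node (the port and drop probability are
assumptions):

# randomly drop ~30% of inbound knet traffic on the default port 5405
iptables -A INPUT -p udp --dport 5405 \
    -m statistic --mode random --probability 0.3 -j DROP

# remove the rule again after the test (same match spec, -D instead of -A)
iptables -D INPUT -p udp --dport 5405 \
    -m statistic --mode random --probability 0.3 -j DROP
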
> >> Our configuration is basic; we do not significantly deviate from the
> >> defaults.
> >>
> >> We will be very grateful for any insights into this problem.
> >>
> >> Thanks,
> >> Fabian
> >>
> >> // corosync.conf
> >> totem {
> >>     version: 2
> >>     cluster_name: cluster01
> >>     crypto_cipher: aes256
> >>     crypto_hash: sha512
> >>     transport: knet
> >> }
> >> logging {
> >>     fileline: off
> >>     to_stderr: no
> >>     to_logfile: no
> >>     to_syslog: yes
> >>     debug: off
> >>     timestamp: on
> >>     logger_subsys {
> >>         subsys: QUORUM
> >>         debug: off
> >>     }
> >> }
> >> quorum {
> >>     provider: corosync_votequorum
> >>     two_node: 1
> >>     expected_votes: 2
> >> }
> >> nodelist {
> >>     node {
> >>         name: node01
> >>         nodeid: 01
> >>         ring0_addr: 10.0.0.10
> >>     }
> >>     node {
> >>         name: node02
> >>         nodeid: 02
> >>         ring0_addr: 10.0.0.11
> >>     }
> >> }
> >>
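When chasing membership flapping like this, corosync's own view of the
membership can be checked at runtime, and debug logging turned on in the
logging section above (a sketch; key names as in corosync 3.x, verify on
your build):

# corosync's current view of the membership (ip, join_count, status per node)
corosync-cmapctl | grep runtime.members

# in corosync.conf, for more detail while debugging:
#     debug: on
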
> >> // crm config show
> >> node 1: node01 \
> >>     attributes standby=off
> >> node 2: node02 \
> >>     attributes standby=off maintenance=off
> >> primitive IP-clusterC1 IPaddr2 \
> >>     params ip=10.0.0.20 nic=eth0 cidr_netmask=24 \
> >>     meta migration-threshold=2 target-role=Started is-managed=true \
> >>     op monitor interval=20 timeout=60 on-fail=restart
> >> primitive IP-clusterC2 IPaddr2 \
> >>     params ip=10.0.0.21 nic=eth0 cidr_netmask=24 \
> >>     meta migration-threshold=2 target-role=Started is-managed=true \
> >>     op monitor interval=20 timeout=60 on-fail=restart
> >> location STICKY-IP-clusterC1 IP-clusterC1 100: node01
> >> location STICKY-IP-clusterC2

[ClusterLabs] corosync-cfgtool -s shows all links not connected for one particular node

2022-05-23 Thread Dirk Gassen

Greetings,

I have a four-node cluster on Ubuntu Focal with the following versions:

libknet1: 1.15-1ubuntu1
corosync: 3.0.3-2ubuntu2.1
pacemaker: 2.0.3-3ubuntu4.3


Each node is connected to two networks:

testras1:
  eth0  10.1.8.24/26
  eth1  192.168.21.227/24
testras2:
  eth0  10.1.8.25/26
  eth1  192.168.21.119/24
testras3:
  eth0  10.1.8.66/26
  eth1  192.168.21.13/24
testras4:
  eth0  10.1.8.77/26
  eth1  192.168.21.19/24


The totem section of corosync.conf on all nodes:

totem {
    version: 2
    cluster_name: BERND-RAS
    # Disable encryption
    secauth: off
    interface {
        linknumber: 0
        #knet_transport: udp|sctp
        #knet_link_priority: 0
    }
    interface {
        linknumber: 1
        #knet_transport: udp|sctp
        #knet_link_priority: 1
    }
    transport: knet
}

and the nodelist section:
nodelist {
    node {
        ring0_addr: 192.168.21.227
        ring1_addr: 10.1.8.24
        nodeid: 2036952047
        name: testras1
    }
    node {
        ring0_addr: 192.168.21.119
        ring1_addr: 10.1.8.25
        nodeid: 2036951939
        name: testras2
    }
    node {
        ring0_addr: 192.168.21.13
        ring1_addr: 10.1.8.66
        nodeid: 1921682113
        name: testras3
    }
    node {
        ring0_addr: 192.168.21.19
        ring1_addr: 10.1.8.77
        nodeid: 1921682119
        name: testras4
    }
}


On all nodes crm_mon shows all four nodes online:

Node List:
  * Online: [ testras1 testras2 testras3 testras4 ]

and "corosync-cfgtool -s" shows the very same:

Printing link status.
Local node ID 2036952047
LINK ID 0
addr= 192.168.21.227
status:
nodeid 1921682113:  link enabled:1  link connected:1
nodeid 1921682119:  link enabled:1  link connected:1
nodeid 2036951939:  link enabled:1  link connected:1
nodeid 2036952047:  link enabled:1  link connected:1
LINK ID 1
addr= 10.1.8.24
status:
nodeid 1921682113:  link enabled:1  link connected:1
nodeid 1921682119:  link enabled:0  link connected:1
nodeid 2036951939:  link enabled:1  link connected:1
nodeid 2036952047:  link enabled:1  link connected:1



However, when I add a node that doesn't exist, that changes:

node {
    ring0_addr: 192.168.120.13
    ring1_addr: 10.1.8.99
    nodeid: 2036942833
    name: testras5
}

Now "corosync-cfgtool -s" shows:

Printing link status.
Local node ID 2036952047
LINK ID 0
addr= 192.168.21.227
status:
nodeid 1921682113:  link enabled:1  link connected:0
nodeid 1921682119:  link enabled:1  link connected:1
nodeid 2036942833:  link enabled:1  link connected:1
nodeid 2036951939:  link enabled:1  link connected:1
nodeid 2036952047:  link enabled:1  link connected:1
LINK ID 1
addr= 10.1.8.24
status:
nodeid 1921682113:  link enabled:1  link connected:0
nodeid 1921682119:  link enabled:1  link connected:1
nodeid 2036942833:  link enabled:0  link connected:1
nodeid 2036951939:  link enabled:1  link connected:1
nodeid 2036952047:  link enabled:1  link connected:1

while everything else stays the same.

Why would "link connected" show 0 for one of the existing nodes but not 
for the non-existing node (2036942833)? (All existing nodes can still 
see each other) What am I missing?
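
For cross-checking, the per-link state can also be read from corosync's
stats map (a sketch; the -m stats map and key layout are assumed from
corosync 3.x, so verify on your version):

# per-node, per-link connection state as corosync tracks it
corosync-cmapctl -m stats | grep -E 'link[01]\.connected'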


Dirk