Re: [ClusterLabs] Redundant ring not recovering after node is back

Jan Friesse Thu, 23 Aug 2018 23:22:07 -0700

I tried to install corosync 3.x and it works pretty well.


Cool

But when I install pacemaker, it installs previous version of corosync as
dependency and breaks all the setup.
Any suggestions?


I can see at least following "solutions":
- make proper Debian package
- install corosync 3 to /usr/local
- (ugly) install packaged corosync and reinstall by corosync 3 from source

Regards,
  Honza


2018-08-23 9:32 GMT+02:00 Jan Friesse <jfrie...@redhat.com>:

David,

BTW, where I can download Corosync 3.x?

I've only seen Corosync 2.99.3 Alpha4 at http://corosync.github.io/coro
sync/


Yes, that's Alpha 4 of Corosync 3.

2018-08-23 9:11 GMT+02:00 David Tolosa <david.tol...@upcnet.es>:

I'm currently using an Ubuntu 18.04 server configuration with netplan.


Here you have my current YAML configuration:

# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
    version: 2
    renderer: networkd
    ethernets:
      eno1:
        addresses: [192.168.0.1/24]
      enp4s0f0:
        addresses: [192.168.1.1/24]
      enp5s0f0:
        {}
    vlans:
      vlan.XXX:
        id: XXX
        link: enp5s0f0
        addresses: [ 10.1.128.5/29 ]
        gateway4: 10.1.128.1
        nameservers:
          addresses: [ 8.8.8.8, 8.8.4.4 ]
          search: [ foo.com, bar.com ]
      vlan.YYY:
        id: YYY
        link: enp5s0f0
        addresses: [ 10.1.128.5/29 ]


So, eno1 and enp4s0f0 are the two ethernet ports connected each other
with crossover cables to node2.
enp5s0f0 port is used to connect outside/services using vlans defined in
the same file.

In short, I'm using systemd-networkd default Ubuntu 18 server service for

Ok, so systemd-networkd is really doing ifdown and somebody actually tries
fix it and merge into upstream (sadly with not too much luck :( )

https://github.com/systemd/systemd/pull/7403


manage networks. Im not detecting any NetworkManager-config-server

package in my repository neither.

I'm not sure how it's called in Debian based distributions, but it's just
one small file in /etc, so you can extract it from RPM.

So the only solution that I have left, I suppose, is to test corosync 3.x

and see if it works better handling RRP.

You may also reconsider to try ether completely static network
configuration or NetworkManager + NetworkManager-config-server.


Corosync 3.x with knet will work for sure, but be prepared for quite a
long compile path, because you first have to compile knet and then
corosync. What may help you a bit is that we have a ubuntu 18.04 in our
jenkins, so it should be possible corosync build log
https://ci.kronosnet.org/view/corosync/job/corosync-build-al
l-voting/lastBuild/corosync-build-all-voting=ubuntu-18-04-lt
s-x86-64/consoleText, knet build log https://ci.kronosnet.org/view/
knet/job/knet-build-all-voting/lastBuild/knet-build-all-
voting=ubuntu-18-04-lts-x86-64/consoleText).

Also please consult http://people.redhat.com/ccaul
fie/docs/KnetCorosync.pdf about changes in corosync configuration.

Regards,
   Honza

Thank you for your quick response!

2018-08-23 8:40 GMT+02:00 Jan Friesse <jfrie...@redhat.com>:

David,


Hello,

Im getting crazy about this problem, that I expect to resolve here,
with
your help guys:

I have 2 nodes with Corosync redundant ring feature.

Each node has 2 similarly connected/configured NIC's. Both nodes are
connected each other by two crossover cables.

I believe this is root of the problem. Are you using NetworkManager? If
so, have you installed NetworkManager-config-server? If not, please
install
it and test again.


I configured both nodes with rrp mode passive. Everything is working

well
at this point, but when I shutdown 1 node to test failover, and this
node > returns to be online, corosync is marking the interface as
FAULTY
and rrp

I believe it's because with crossover cables configuration when other
side is shutdown, NetworkManager detects it and does ifdown of the
interface. And corosync is unable to handle ifdown properly. Ifdown is
bad
with single ring, but it's just killer with RRP (127.0.0.1 poisons every
node in the cluster).

fails to recover the initial state:


1. Initial scenario:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
           id      = 192.168.0.1
           status  = ring 0 active with no faults
RING ID 1
           id      = 192.168.1.1
           status  = ring 1 active with no faults


2. When I shutdown the node 2, all continues with no faults. Sometimes
the
ring ID's are bonding with 127.0.0.1 and then bond back to their
respective
heartbeat IP.

Again, result of ifdown.


3. When node 2 is back online:


# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
           id      = 192.168.0.1
           status  = ring 0 active with no faults
RING ID 1
           id      = 192.168.1.1
           status  = Marking ringid 1 interface 192.168.1.1 FAULTY


# service corosync status
● corosync.service - Corosync Cluster Engine
      Loaded: loaded (/lib/systemd/system/corosync.service; enabled;
vendor
preset: enabled)
      Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min
38s ago
        Docs: man:corosync
              man:corosync.conf
              man:corosync_overview
    Main PID: 1439 (corosync)
       Tasks: 2 (limit: 4915)
      CGroup: /system.slice/corosync.service
              └─1439 /usr/sbin/corosync -f


Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ]
The
network interface [192.168.0.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
[192.168.0.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ]
The
network interface [192.168.1.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
[192.168.1.1] is now up.
Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ]
A
new membership (192.168.0.1:601760) was formed. Members
Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
192.168.0.1:601760) was formed. Members
Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ]
A
new membership (192.168.0.1:601764) was formed. Members joined: 2
Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
192.168.0.1:601764) was formed. Members joined: 2
Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ]
Marking ringid 1 interface 192.168.1.1 FAULTY
Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1
interface
192.168.1.1 FAULTY


If I execute corosync-cfgtool, clears the faulty error but after some
seconds return to be FAULTY.
The only thing that it resolves the problem is to restart de service
with
service corosync restart.

Here you have some of my configuration settings on node 1 (I probed
already
to change rrp_mode):

*- corosync.conf*


totem {
           version: 2
           cluster_name: node
           token: 5000
           token_retransmits_before_loss_const: 10
           secauth: off
           threads: 0
           rrp_mode: passive
           nodeid: 1
           interface {
                   ringnumber: 0
                   bindnetaddr: 192.168.0.0
                   #mcastaddr: 226.94.1.1
                   mcastport: 5405
                   broadcast: yes
           }
           interface {
                   ringnumber: 1
                   bindnetaddr: 192.168.1.0
                   #mcastaddr: 226.94.1.2
                   mcastport: 5407
                   broadcast: yes
           }
}

logging {
           fileline: off
           to_stderr: yes
           to_syslog: yes
           to_logfile: yes
           logfile: /var/log/corosync/corosync.log
           debug: off
           timestamp: on
           logger_subsys {
                   subsys: AMF
                   debug: off
           }
}

amf {
           mode: disabled
}

quorum {
           provider: corosync_votequorum
           expected_votes: 2
}

nodelist {
           node {
                   nodeid: 1
                   ring0_addr: 192.168.0.1
                   ring1_addr: 192.168.1.1
           }

           node {
                   nodeid: 2
                   ring0_addr: 192.168.0.2
                   ring1_addr: 192.168.1.2
           }
}

aisexec {
           user: root
           group: root
}

service {
           name: pacemaker
           ver: 1
}



*- /etc/hosts*


127.0.0.1       localhost
10.4.172.5      node1.upc.edu node1
10.4.172.6      node2.upc.edu node2


So machines have 3 NICs? 2 for corosync/cluster traffic and one for

regular traffic/services/outside world?


Thank you for you help in advance!

To conclude:
- If you are using NetworkManager, try to install
NetworkManager-config-server, it will probably help
- If you are brave enough, try corosync 3.x (current Alpha4 is pretty
stable - actually some other projects gain this stability with SP1 :) )
that has no RRP but uses knet for support redundant links (up-to 8 links
can be configured) and doesn't have problems with ifdown.

Honza


_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc
/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


--
*David Tolosa Martínez*
Customer Support & Infrastructure
UPCnet - Edifici Vèrtex
Plaça d'Eusebi Güell, 6, 08034 Barcelona
Tel: 934054555

<https://www.upcnet.es>


_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Re: [ClusterLabs] Redundant ring not recovering after node is back

Reply via email to