[ClusterLabs] Corosync ring marked as FAULTY

Denis Gribkov Tue, 21 Feb 2017 09:31:07 -0800

Hi Everyone.

I have a 16-node asynchronous cluster configured with the Corosync redundant ring feature.

Each node has two similarly connected and configured NICs. One NIC is connected to the public network, the other to our private VLAN. When I checked the status of the Corosync rings, I found:


# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
        id      = 192.168.1.54
        status  = Marking ringid 0 interface 192.168.1.54 FAULTY
RING ID 1
        id      = 111.11.11.1
        status  = ring 1 active with no faults
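
To catch the moment a ring flips to FAULTY, the status output above can be filtered with a small shell helper. This is only a sketch: the function name `faulty_rings` is hypothetical, and it assumes the output keeps the `RING ID` / `status` layout shown above.

```shell
# faulty_rings: hypothetical helper, not part of Corosync.
# Reads `corosync-cfgtool -s` style output on stdin and prints the ID
# of every ring whose status line contains the word FAULTY.
faulty_rings() {
    awk '
        /^RING ID/           { ring = $3 }   # remember the current ring ID
        /status/ && /FAULTY/ { print ring }  # report it if marked FAULTY
    '
}
```

It could be used as `corosync-cfgtool -s | faulty_rings`, for example from a cron job or a `while` loop with `sleep`, to record when the state changes.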

After some digging, I found that if I re-enable the failed ring with the command:


# corosync-cfgtool -r

RING ID 0 is marked as "active" for a few minutes, but after that it is marked as faulty permanently.


The log has no useful information, just a single message:

corosync[21740]:   [TOTEM ] Marking ringid 0 interface 192.168.1.54 FAULTY

And there is no message like:

[TOTEM ] Automatically recovered ring 1
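
To pull just the ring state changes out of a busy log, a filter along these lines may help. It is a sketch under assumptions: the function name `ring_events` is hypothetical, and the pattern only covers the two message forms quoted above.

```shell
# ring_events: hypothetical helper, not part of Corosync.
# Reads log lines on stdin and keeps only TOTEM messages that mark a
# ring FAULTY or report that a ring was recovered.
ring_events() {
    grep -E '\[TOTEM *\].*(FAULTY|recovered ring)'
}
```

For example: `ring_events < /var/log/cluster/corosync.log`, using the logfile path from the configuration below.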


My corosync.conf looks like:

compatibility: whitetank

totem {
        version: 2
        secauth: on
        threads: 4
        rrp_mode: passive

        interface {

                member {
                        memberaddr: PRIVATE_IP_1
                }

...

                member {
                        memberaddr: PRIVATE_IP_16
                }

                ringnumber: 0
                bindnetaddr: PRIVATE_NET_ADDR
                mcastaddr: 226.0.0.1
                mcastport: 5505
                ttl: 1
        }

        interface {

                member {
                        memberaddr: PUBLIC_IP_1
                }
...

                member {
                        memberaddr: PUBLIC_IP_16
                }

                ringnumber: 1
                bindnetaddr: PUBLIC_NET_ADDR
                mcastaddr: 224.0.0.1
                mcastport: 5405
                ttl: 1
        }

        transport: udpu
}

logging {
        to_stderr: no
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        logfile_priority: info
        to_syslog: yes
        syslog_priority: warning
        debug: on
        timestamp: on
}

I tried changing rrp_mode and the mcastaddr/mcastport for ringnumber: 0, but the result was the same.

I checked multicast/unicast operability using the omping utility and didn't find any issues.


Also, no errors were found on the network equipment for our private VLAN.

Why did Corosync decide to permanently disable the second ring? How can I debug the issue?


Other properties:

Corosync Cluster Engine, version '1.4.7'

Pacemaker properties:
 cluster-infrastructure: cman
 cluster-recheck-interval: 5min
 dc-version: 1.1.14-8.el6-70404b0
 expected-quorum-votes: 3
 have-watchdog: false
 last-lrm-refresh: 1484068350
 maintenance-mode: false
 no-quorum-policy: ignore
 pe-error-series-max: 1000
 pe-input-series-max: 1000
 pe-warn-series-max: 1000
 stonith-action: reboot
 stonith-enabled: false
 symmetric-cluster: false

Thank you.

--
Regards Denis Gribkov


_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org