Public bug reported:

[Description]

Corosync aborts (SIGABRT) if it starts before the interface it has to
bind to is ready.

On boot, if no interface in the bindnetaddr range is up/configured,
corosync binds to lo (127.0.0.1). Once an applicable interface is up,
corosync crashes with the following error message:

corosync: votequorum.c:2019: message_handler_req_exec_votequorum_nodeinfo: Assertion `sender_node != NULL' failed.
Aborted (core dumped)

The last log entries show the node trying to join the cluster:

Dec 19 11:36:05 [22167] xenial-pacemaker corosync debug   [TOTEM ] totemsrp.c:2089 entering OPERATIONAL state.
Dec 19 11:36:05 [22167] xenial-pacemaker corosync notice  [TOTEM ] totemsrp.c:2095 A new membership (169.254.241.10:444) was formed. Members joined: 704573706

During the quorum calculation, corosync uses the auto-generated nodeid
(704573706) instead of the nodeid specified in the configuration file
(1). The assertion fails because the generated nodeid is not present in
the member list. Corosync should use the configured nodeid and continue
running after the interface comes up, as shown in a boot with the fix
applied:

Dec 19 11:50:56 [4824] xenial-corosync corosync notice  [TOTEM ] totemsrp.c:2095 A new membership (169.254.241.10:80) was formed. Members joined: 1
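
For reference, the generated nodeid seen in the failing log can be
reproduced from the ring0 address alone. The sketch below is only an
illustration: the function name is hypothetical, and the assumption that
corosync derives the value this way (the IPv4 address as a 32-bit
integer with the high bit cleared) is inferred from the numbers in this
report, not taken from the corosync source.

```python
import ipaddress


def generated_nodeid(ipv4, clear_high_bit=True):
    """Return an IPv4 address as a 32-bit integer.

    Hypothetical helper: assumes the auto-generated nodeid is the
    address in network byte order, optionally with bit 31 cleared.
    """
    value = int(ipaddress.IPv4Address(ipv4))
    if clear_high_bit:
        value &= 0x7FFFFFFF  # keep the low 31 bits only
    return value


# ring0 address of the affected node, taken from the report
print(generated_nodeid("169.254.241.10"))  # prints 704573706
```

The result matches the "Members joined" value in the failing log, which
suggests the configured nodeid (1) was ignored rather than corrupted.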

[Environment]

Ubuntu 16.04.3 (Xenial)

Packages:

ii  corosync                     2.3.5-3ubuntu1    amd64    cluster engine daemon and utilities
ii  libcorosync-common4:amd64    2.3.5-3ubuntu1    amd64    cluster engine common library


[Reproducer]

Config:

totem {
        version: 2

        transport: udpu

        crypto_cipher: none
        crypto_hash: none

        interface {
                ringnumber: 0
                member {
                        memberaddr: 169.254.241.10
                }
                member {
                        memberaddr: 169.254.241.20
                }
                bindnetaddr: 169.254.241.0
                mcastport: 5405
                ttl: 1
        }
}

quorum {
        provider: corosync_votequorum
        expected_votes: 2
}

nodelist {
        node {
                ring0_addr: 169.254.241.10
                nodeid: 1
        }
        node {
                ring0_addr: 169.254.241.20
                nodeid: 2
        }
}


1. Bring down the interface that holds 169.254.241.10 (ifdown).
2. Start corosync in the foreground (/usr/sbin/corosync -f).
3. Bring the interface back up (ifup).
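
To tell whether a run of the reproducer hit the bug, the membership log
lines can be checked against the nodeids from the nodelist above. This
is a minimal sketch, not part of corosync: the function names are
hypothetical, and the "Members joined:" wording is assumed to match the
log lines quoted in this report.

```python
import re

# nodeids from the nodelist section of the config in this report
CONFIGURED_NODEIDS = {1, 2}

# Matches "Members joined: <id> [<id> ...]"; tolerates a line wrap
# between "Members" and "joined:" as seen in the quoted logs.
JOINED_RE = re.compile(r"Members\s+joined:[ \t]*(\d+(?:[ \t]+\d+)*)")


def joined_nodeids(log_text):
    """Return every nodeid reported as joined, in log order."""
    ids = []
    for match in JOINED_RE.finditer(log_text):
        ids.extend(int(token) for token in match.group(1).split())
    return ids


def unexpected_nodeids(log_text, configured=CONFIGURED_NODEIDS):
    """Return joined nodeids that are not in the configured nodelist."""
    return [n for n in joined_nodeids(log_text) if n not in configured]


bad = ("totemsrp.c:2095 A new membership (169.254.241.10:444) was "
       "formed. Members \njoined: 704573706")
print(unexpected_nodeids(bad))  # prints [704573706]
```

An empty result means every joined member used a configured nodeid,
which is what the fixed boot in the description shows.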

[Fix]

The upstream commit
https://github.com/corosync/corosync/commit/aab55a004bb12ebe78db341dc56759dfe710c1b2
changes how the CMAP is populated and appears to fix this bug.

** Affects: corosync (Ubuntu)
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1739033

Title:
  Corosync: Assertion 'sender_node != NULL' failed when bind iface is
  ready after corosync boots
