######## Tests before the patch:

#
# NODE 1
#

--- MARKER --- ./failed-to-receive-crash.sh at 2014-05-09-17:33:04 --- MARKER 
--- 
May 09 17:33:04 corosync [MAIN]:  ] Corosync Cluster Engine ('1.4.2'): started and ready to provide service.
May 09 17:33:04 corosync [MAIN]:  ] Corosync built-in features: nss
May 09 17:33:04 corosync [MAIN]:  ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
May 09 17:33:04 corosync [TOTEM]: ] Initializing transport (UDP/IP Multicast).
May 09 17:33:04 corosync [TOTEM]: ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
May 09 17:33:04 corosync [TOTEM]: ] The network interface [192.168.168.1] is now up.
May 09 17:33:04 corosync [SERV]:  ] Service engine loaded: openais checkpoint service B.01.01
May 09 17:33:04 corosync [SERV]:  ] Service engine loaded: corosync extended virtual synchrony service
May 09 17:33:04 corosync [SERV]:  ] Service engine loaded: corosync configuration service
May 09 17:33:04 corosync [SERV]:  ] Service engine loaded: corosync cluster closed process group service v1.01
May 09 17:33:04 corosync [SERV]:  ] Service engine loaded: corosync cluster config database access v1.01
May 09 17:33:04 corosync [SERV]:  ] Service engine loaded: corosync profile loading service
May 09 17:33:04 corosync [SERV]:  ] Service engine loaded: corosync cluster quorum service v0.1
May 09 17:33:04 corosync [MAIN]:  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
May 09 17:33:04 corosync [TOTEM]: ] A processor joined or left the membership and a new membership was formed.
May 09 17:33:04 corosync [CPG]:   ] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:0 left:0)
May 09 17:33:04 corosync [MAIN]:  ] Completed service synchronization, ready to provide service.
May 09 17:33:05 corosync [TOTEM]: ] A processor joined or left the membership and a new membership was formed.
May 09 17:33:05 corosync [CPG]:   ] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:1 left:0)
May 09 17:33:05 corosync [MAIN]:  ] Completed service synchronization, ready to provide service.
May 09 17:33:10 corosync [TOTEM]: ] FAILED TO RECEIVE

# COROSYNC HAS DIED BEFORE TEST CASE TRIES TO STOP IT

root@precise-cluster-01:~# ps -ef | grep corosync
root      1414  1306  0 17:31 pts/0    00:00:00 tail -f /var/log/cluster/corosync.log
root      4712  1306  0 17:33 pts/0    00:00:00 grep --color=auto corosync

######## Tests after the patch:

May 11 22:27:48 corosync [MAIN]:  ] Corosync Cluster Engine ('1.4.2'): started and ready to provide service.
May 11 22:27:48 corosync [MAIN]:  ] Corosync built-in features: nss
May 11 22:27:48 corosync [MAIN]:  ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
May 11 22:27:48 corosync [TOTEM]: ] Initializing transport (UDP/IP Multicast).
May 11 22:27:48 corosync [TOTEM]: ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
May 11 22:27:48 corosync [TOTEM]: ] The network interface [192.168.168.1] is now up.
May 11 22:27:48 corosync [SERV]:  ] Service engine loaded: openais checkpoint service B.01.01
May 11 22:27:48 corosync [SERV]:  ] Service engine loaded: corosync extended virtual synchrony service
May 11 22:27:48 corosync [SERV]:  ] Service engine loaded: corosync configuration service
May 11 22:27:48 corosync [SERV]:  ] Service engine loaded: corosync cluster closed process group service v1.01
May 11 22:27:48 corosync [SERV]:  ] Service engine loaded: corosync cluster config database access v1.01
May 11 22:27:49 corosync [SERV]:  ] Service engine loaded: corosync profile loading service
May 11 22:27:49 corosync [SERV]:  ] Service engine loaded: corosync cluster quorum service v0.1
May 11 22:27:49 corosync [MAIN]:  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
May 11 22:27:49 corosync [TOTEM]: ] A processor joined or left the membership and a new membership was formed.
May 11 22:27:49 corosync [CPG]:   ] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:0 left:0)
May 11 22:27:49 corosync [MAIN]:  ] Completed service synchronization, ready to provide service.
May 11 22:27:49 corosync [TOTEM]: ] A processor joined or left the membership and a new membership was formed.
May 11 22:27:49 corosync [CPG]:   ] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:1 left:0)
May 11 22:27:49 corosync [MAIN]:  ] Completed service synchronization, ready to provide service.
May 11 22:27:54 corosync [TOTEM]: ] FAILED TO RECEIVE
May 11 22:27:55 corosync [TOTEM]: ] A processor joined or left the membership and a new membership was formed.
May 11 22:27:55 corosync [CPG]:   ] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:2 left:1)
May 11 22:27:55 corosync [MAIN]:  ] Completed service synchronization, ready to provide service.
May 11 22:27:57 corosync [TOTEM]: ] A processor joined or left the membership and a new membership was formed.
May 11 22:27:57 corosync [CPG]:   ] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:1 left:0)
May 11 22:27:57 corosync [MAIN]:  ] Completed service synchronization, ready to provide service.
May 11 22:27:59 corosync [TOTEM]: ] A processor joined or left the membership and a new membership was formed.
May 11 22:28:01 corosync [TOTEM]: ] FAILED TO RECEIVE

########

Unlike the first run, the corosync daemon stayed running, alternating between
a single node membership and a two node membership (when the connection was
restored and before it was broken again by the testcase). This is the expected
and correct behavior for corosync.
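
For context, the fix amounts to skipping totem's "at least one functional
member" assertion when the local node has flagged itself with failed_to_recv.
The snippet below is only a small self-contained model of that decision, not
the actual patch hunk; the names failed_to_recv and token_memb_entries are
borrowed from corosync's exec/totemsrp.c, but the structure and the sample
numbers are illustrative assumptions.

/*
 * Standalone model of the consensus check described above (NOT the real
 * corosync code); builds with "gcc -o model model.c".
 */
#include <assert.h>
#include <stdio.h>

struct instance_model {
        int proc_list_entries;    /* members currently known            */
        int failed_list_entries;  /* members marked failed (incl. self) */
        int failed_to_recv;       /* set after a "FAILED TO RECEIVE"    */
};

/* Members in the proc list that are not in the failed list. */
static int token_memb_entries(const struct instance_model *i)
{
        return i->proc_list_entries - i->failed_list_entries;
}

static int consensus_agreed(const struct instance_model *i)
{
        int agreed = 1; /* assume the ring agreed on the failure */

        if (agreed && i->failed_to_recv == 1) {
                /*
                 * The node marked itself as unable to receive, so the count
                 * of functional members may legitimately drop to zero.
                 * Returning here (instead of asserting) lets the daemon drop
                 * the failed members and re-form a single node membership.
                 */
                return agreed;
        }

        /* Pre-patch behaviour: with the values below, this aborts the daemon. */
        assert(token_memb_entries(i) >= 1);

        return agreed;
}

int main(void)
{
        struct instance_model node = {
                .proc_list_entries   = 2,
                .failed_list_entries = 2,  /* the peer and the node itself */
                .failed_to_recv      = 1,
        };

        printf("agreed=%d, daemon keeps running\n", consensus_agreed(&node));
        return 0;
}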

** Description changed:

- If node detects itself not able to receive message it asserts the number
- of failed members considering itself and dies.
+ [Impact]
  
- -> Testing bugfix. To be released soon.
+  * Under certain conditions the corosync daemon may quit if it detects itself
+    as not being able to receive messages. The logic asserts the existence of
+    at least one functional node, but the node is marking itself as a failed
+    node (not following the specification). It is safe not to assert this if
+    failed_to_recv is set.
+ 
+ [Test Case]
+ 
+  * Using the "corosync test suite" on the precise-test machine:
+ 
+    - Make sure ssh keys are set so precise-test can access precise-cluster-{01,02}.
+    - Make sure only failed-to-receive-crash.sh is executable in the "tests" dir.
+    - Make sure the precise-cluster-{01,02} nodes have the build dependencies for corosync installed.
+    - sudo ./run-tests.sh -c flatiron -n "precise-cluster-01 precise-cluster-02"
+    - Check the corosync log messages to see precise-cluster-01's corosync dying.
+ 
+ [Regression Potential]
+ 
+  * We no longer assert the existence of at least 1 node in the corosync
+    cluster. Since there is always 1 node in the cluster (the node itself), it
+    is very unlikely this change alters corosync's membership logic. If it
+    does, corosync is likely to recover from the error and reestablish a new
+    membership (with 1 or more nodes).
+ 
+ [Other Info]
+ 
+  * n/a

-- 
You received this bug notification because you are a member of Ubuntu
High Availability Team, which is subscribed to corosync in Ubuntu.
https://bugs.launchpad.net/bugs/1318441

Title:
  Precise corosync dies if failed_to_recv is set

Status in “corosync” package in Ubuntu:
  In Progress

Bug description:
  [Impact]

   * Under certain conditions the *precise* corosync daemon may quit if it
     detects itself as not being able to receive messages. The logic asserts
     the existence of at least one functional node, but the node is marking
     itself as a failed node (not following the specification). It is safe
     not to assert this if failed_to_recv is set.

  [Test Case]

   * Using the "corosync test suite" on the precise-test machine:

     - Make sure ssh keys are set so precise-test can access precise-cluster-{01,02}.
     - Make sure only failed-to-receive-crash.sh is executable in the "tests" dir.
     - Make sure the precise-cluster-{01,02} nodes have the build dependencies for corosync installed.
     - sudo ./run-tests.sh -c flatiron -n "precise-cluster-01 precise-cluster-02"
     - Check the corosync log messages to see precise-cluster-01's corosync dying.

  [Regression Potential]

   * We no longer assert the existence of at least 1 node in the corosync
     cluster. Since there is always 1 node in the cluster (the node itself),
     it is very unlikely this change alters corosync's membership logic. If
     it does, corosync is likely to recover from the error and reestablish a
     new membership (with 1 or more nodes).

  [Other Info]

   * n/a

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1318441/+subscriptions

_______________________________________________
Mailing list: https://launchpad.net/~ubuntu-ha
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~ubuntu-ha
More help   : https://help.launchpad.net/ListHelp
