For now, I have tested the following scenarios:
- 4 nodes
- stonith-enabled=true
- no-quorum-policy=stop

AND

- 2 nodes only
- stonith-enabled=true
- no-quorum-policy=ignore

I ran the test case (from the bug description) for hours and could not reproduce the crash, although I do get the following (expected) messages from time to time:

Jan 19 16:52:23 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 12418 to record non-fatal assert at logging.c:63 : Source ID 73 was not found when attempting to remove it
Jan 19 16:52:23 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 12439 to record non-fatal assert at logging.c:63 : Source ID 74 was not found when attempting to remove it
Jan 19 16:52:38 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 13413 to record non-fatal assert at logging.c:63 : Source ID 76 was not found when attempting to remove it
Jan 19 16:52:38 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 13429 to record non-fatal assert at logging.c:63 : Source ID 77 was not found when attempting to remove it
Jan 19 16:52:52 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 14411 to record non-fatal assert at logging.c:63 : Source ID 79 was not found when attempting to remove it
Jan 19 16:52:52 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 14423 to record non-fatal assert at logging.c:63 : Source ID 80 was not found when attempting to remove it
Jan 19 16:53:07 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 15410 to record non-fatal assert at logging.c:63 : Source ID 82 was not found when attempting to remove it
Jan 19 16:53:07 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 15427 to record non-fatal assert at logging.c:63 : Source ID 83 was not found when attempting to remove it
Jan 19 16:53:21 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 16409 to record non-fatal assert at logging.c:63 : Source ID 85 was not found when attempting to remove it
Jan 19 16:53:21 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 16425 to record non-fatal assert at logging.c:63 : Source ID 86 was not found when attempting to remove it
Jan 19 16:53:35 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 17408 to record non-fatal assert at logging.c:63 : Source ID 88 was not found when attempting to remove it
Jan 19 16:53:35 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 17420 to record non-fatal assert at logging.c:63 : Source ID 89 was not found when attempting to remove it
Jan 19 16:53:50 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 18409 to record non-fatal assert at logging.c:63 : Source ID 91 was not found when attempting to remove it
Jan 19 16:53:50 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 18422 to record non-fatal assert at logging.c:63 : Source ID 92 was not found when attempting to remove it
Jan 19 16:54:04 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 19410 to record non-fatal assert at logging.c:63 : Source ID 94 was not found when attempting to remove it
Jan 19 16:54:04 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 19426 to record non-fatal assert at logging.c:63 : Source ID 95 was not found when attempting to remove it

I am waiting on Peter to provide his crash and core dump for the analysis.

Thank you

Rafael Tinoco

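The "Source ID ... was not found when attempting to remove it" messages above are what one would expect when the same event source ID is removed twice. As a rough illustration only, the sketch below uses a hypothetical in-memory table (source_add, source_remove and remove_once are made-up names, not glib or Pacemaker functions) to show the caller-side guard of clearing the stored ID after the first removal. It is not the upstream fix.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_SOURCES 8

/* Hypothetical stand-in for an event loop's source table (not glib's API).
 * A slot holding 0 is free; 0 is never a valid source ID here. */
static unsigned int registered[MAX_SOURCES];

/* Register an ID; returns false for ID 0 or when the table is full. */
static bool source_add(unsigned int id)
{
    if (id == 0) {
        return false;
    }
    for (int i = 0; i < MAX_SOURCES; i++) {
        if (registered[i] == 0) {
            registered[i] = id;
            return true;
        }
    }
    return false;
}

/* Remove an ID; complains, like the log lines above, when it is unknown. */
static bool source_remove(unsigned int id)
{
    if (id != 0) {
        for (int i = 0; i < MAX_SOURCES; i++) {
            if (registered[i] == id) {
                registered[i] = 0;
                return true;
            }
        }
    }
    fprintf(stderr, "Source ID %u was not found when attempting to remove it\n", id);
    return false;
}

/* Caller-side guard: forget the ID after the first removal attempt,
 * so a repeated call becomes a silent no-op. */
static bool remove_once(unsigned int *id)
{
    if (*id == 0) {
        return false;
    }
    bool removed = source_remove(*id);
    *id = 0;
    return removed;
}
```

With this guard, only the first remove_once() call on a given ID variable reaches the table; calling source_remove() directly a second time is what produces the warning.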
https://bugs.launchpad.net/bugs/1368737

Title:
  Pacemaker can seg fault on crm node online/standby

Status in pacemaker package in Ubuntu:
  Fix Released
Status in pacemaker source package in Trusty:
  Fix Committed
Status in pacemaker source package in Utopic:
  Fix Committed
Status in pacemaker source package in Vivid:
  Fix Released

Bug description:

  [IMPACT]

  - Pacemaker seg faults on repeated crm node online/standby because:
    - Newer glib versions use a hash table to find GSources
    - glib can raise an assertion when the same source is removed more than once

  [TEST CASE]

  - Using the same configuration as the attached cib.xml:

    #!/bin/bash

    while true; do
        crm node standby clustertrusty01
        sleep 7
        crm node online clustertrusty01
        sleep 7
        crm node standby clustertrusty02
        sleep 7
        crm node online clustertrusty02
        sleep 7
        crm node standby clustertrusty03
        sleep 7
        crm node online clustertrusty03
        sleep 7
    done

  [REGRESSION POTENTIAL]

  - Based on upstream commit 568e41d
  - Test case ran for more than 7 hours with no problems

  [OTHER INFO]

  The following situation was brought to my attention:

  """
  [Issue]

  lrmd process crashed when repeating "crm node standby" and "crm node online"

  ----------------
  # grep pacemakerd ha-log.k1pm101 | grep core
  Aug 27 17:47:06 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 49275 (lrmd) dumped core
  Aug 27 17:47:06 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=49275, core=1)
  Aug 27 18:27:14 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 1471 (lrmd) dumped core
  Aug 27 18:27:14 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=1471, core=1)
  Aug 27 18:56:41 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 35771 (lrmd) dumped core
  Aug 27 18:56:41 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35771, core=1)
  Aug 27 19:44:09 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 60709 (lrmd) dumped core
  Aug 27 19:44:09 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=60709, core=1)
  Aug 27 20:00:53 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 35838 (lrmd) dumped core
  Aug 27 20:00:53 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35838, core=1)
  Aug 27 21:33:52 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 49249 (lrmd) dumped core
  Aug 27 21:33:52 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=49249, core=1)
  Aug 27 22:01:16 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 65358 (lrmd) dumped core
  Aug 27 22:01:16 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=65358, core=1)
  Aug 27 22:28:02 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 22693 (lrmd) dumped core
  Aug 27 22:28:02 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=22693, core=1)
  ----------------

  ----------------
  # grep pacemakerd ha-log.k1pm102 | grep core
  Aug 27 15:32:48 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 5812 (lrmd) dumped core
  Aug 27 15:32:48 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=5812, core=1)
  Aug 27 15:52:52 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 35781 (lrmd) dumped core
  Aug 27 15:52:52 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35781, core=1)
  Aug 27 16:02:54 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 51984 (lrmd) dumped core
  Aug 27 16:02:54 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=51984, core=1)
  """

  Analyzing the core file with dbgsyms, I could see that:

    #0 0x00007f7184a45983 in services_action_sync (op=0x7f7185b605d0) at services.c:434
    434 crm_trace(" > stdout: %s", op->stdout_data);

  is responsible for the core.

  I've checked the upstream code and there are 2 important commits that could be cherry-picked to fix this behavior:

    commit f2a637cc553cb7aec59bdcf05c5e1d077173419f
    Author: Andrew Beekhof
    Date: Fri Sep 20 12:20:36 2013 +1000

        Fix: services: Prevent use-of-NULL when executing service actions

    commit 11473a5a8c88eb17d5e8d6cd1d99dc497e817aac
    Author: Gao,Yan
    Date: Sun Sep 29 12:40:18 2013 +0800

        Fix: services: Fix the executing of synchronous actions

  The core can be caused by things such as this missing code:

    if (op == NULL) {
        crm_trace("No operation to execute");
        return FALSE;
    }

  at the beginning of the "services_action_sync(svc_action_t * op)" function. The handling is further improved by commit 11473a5.
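
To make the NULL guard described above concrete, here is a minimal standalone sketch. It assumes a trimmed-down, hypothetical stand-in for svc_action_t (action_sketch_t) and a made-up function name (action_sync_sketch), with fprintf in place of crm_trace. It only illustrates the early-return pattern and is not the actual upstream patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, trimmed-down stand-in for Pacemaker's svc_action_t. */
typedef struct {
    const char *stdout_data;
} action_sketch_t;

/* Return early when there is no operation, instead of dereferencing NULL. */
static bool action_sync_sketch(const action_sketch_t *op)
{
    if (op == NULL) {
        fprintf(stderr, "No operation to execute\n");
        return false;
    }

    /* Passing a NULL pointer to %s is undefined behaviour,
     * so substitute a placeholder when there is no output. */
    fprintf(stderr, " > stdout: %s\n",
            op->stdout_data != NULL ? op->stdout_data : "(none)");
    return true;
}
```

Calling action_sync_sketch(NULL) returns false without touching the pointer, which is the behavior the missing check would provide at the top of the real function.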

