Okay,
So the cherry-pick (for version
trusty_pacemaker_1.1.10+git20130802-1ubuntu2.2, based on an upstream
commit) looks correct, since it makes lrmd (services, services_linux)
avoid re-arming a timer whose source was already removed from the glib
main loop context:
example:
+ if (op->opaque->repeat_timer) {
+ g_source_remove(op->opaque->repeat_timer);
++ op->opaque->repeat_timer = 0;
etc...
This actually solved the lrmd crashes I was getting with the test case
(described in this bug's summary).
===
Explanation:
g_source_remove ->
http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022690.html
libglib2 changes ->
http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022699.html
===
Analyzing your crash file (from stonith and not lrm), it looks like we
have the following scenario:
==============
exited = child_waitpid(child, WNOHANG);
|_> child->callback(child, child->pid, core, signo, exitcode);
|_> stonith_action_async_done (stack shows: stonith_action_destroy())
<----> calls g_source_remove twice
|_> stonith_action_clear_tracking_data(action);
|_> g_source_remove(action->timer_sigterm);
|_> g_critical ("Source ID %u was not found when attempting to
remove it", tag);
WHERE
==============
The child here is the "monitor" (0x7f1f63a08b70 "monitor"): /usr/sbin/fence_legacy,
a "Helper that presents a RHCS-style interface for Linux-HA stonith plugins".
This is the script responsible for monitoring a stonith resource, and it
has returned (triggering the monitor
callback) with the following data on it:
------ data (begin) ------
agent=fence_legacy
action=monitor
plugin=external/ssh
hostlist=kjpnode2
timeout=20
async=1
tries=1
remaining_timeout=20
timer_sigterm=13
timer_sigkill=14
max_retries=2
pid=1464
rc=0 (RETURN CODE)
string buffer: "Performing: stonith -t external/ssh -S\nsuccess: 0\n"
------ data (end) ------
Note: this means that fence_legacy returned, after checking that
st_kjpnode2 was OK, and its cleanup operation (callback) caused
the problem we faced.
As soon as it dies, the callback for this process is called:
if (child->callback) {
child->callback(child, child->pid, core, signo, exitcode);
In our case, callback is:
0x7f1f6189cec0 <stonith_action_async_done> which
calls
0x7f1f6189af10 <stonith_action_destroy> and
then
0x7f1f6189ae60 <stonith_action_clear_tracking_data>, generating the
2nd removal (g_source_remove).
With this 2nd call to g_source_remove, after the glib2.0 change explained
before this comment, we get:
g_critical ("Source ID %u was not found when attempting to remove it",
tag);
and this generates the crash (since g_log is called with a critical
log level, causing crm_abort to be called).
POSSIBLE CAUSE:
==============
Under <stonith_action_async_done> we have:
stonith_action_t *action = 0x7f1f639f5b50.
if (action->timer_sigterm > 0) {
g_source_remove(action->timer_sigterm);
}
if (action->timer_sigkill > 0) {
g_source_remove(action->timer_sigkill);
}
Under <stonith_action_destroy> we have stonith_action_t *action =
0x7f1f639f5b50.
and a call to: stonith_action_clear_tracking_data(action);
Under stonith_action_clear_tracking_data(stonith_action_t * action) we
have AGAIN:
stonith_action_t *action = 0x7f1f639f5b50.
if (action->timer_sigterm > 0) {
g_source_remove(action->timer_sigterm);
action->timer_sigterm = 0;
}
if (action->timer_sigkill > 0) {
g_source_remove(action->timer_sigkill);
action->timer_sigkill = 0;
}
This logic probably triggered the same problem the cherry-pick addressed
for lrmd,
but now for stonith (calling g_source_remove twice for the same source
after the glib2.0 change).
I'll discuss this with upstream and try to confirm that this is what is
happening.
Thank you
Rafael Tinoco
--
You received this bug notification because you are a member of Ubuntu
High Availability Team, which is subscribed to pacemaker in Ubuntu.
https://bugs.launchpad.net/bugs/1368737
Title:
Pacemaker can seg fault on crm node online/standby
Status in pacemaker package in Ubuntu:
Fix Released
Status in pacemaker source package in Trusty:
Fix Committed
Status in pacemaker source package in Utopic:
Fix Committed
Status in pacemaker source package in Vivid:
Fix Released
Bug description:
[IMPACT]
- Pacemaker seg faults on repeated crm node online/standby because:
- Newer glib versions use a hash table to find GSources
- Glib can now raise a critical assertion when the same source is removed multiple times
[TEST CASE]
- Using the same configuration as the attached cib.xml:
#!/bin/bash
while true; do
crm node standby clustertrusty01
sleep 7
crm node online clustertrusty01
sleep 7
crm node standby clustertrusty02
sleep 7
crm node online clustertrusty02
sleep 7
crm node standby clustertrusty03
sleep 7
crm node online clustertrusty03
sleep 7
done
[REGRESSION POTENTIAL]
- Based on upstream commit 568e41d
- Test case ran for more than 7 hours with no problems
[OTHER INFO]
It was brought to my attention the following situation:
"""
[Issue]
lrmd process crashed when repeating "crm node standby" and "crm node
online"
----------------
# grep pacemakerd ha-log.k1pm101 | grep core
Aug 27 17:47:06 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 49275 (lrmd) dumped core
Aug 27 17:47:06 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=49275, core=1)
Aug 27 18:27:14 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 1471 (lrmd) dumped core
Aug 27 18:27:14 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=1471, core=1)
Aug 27 18:56:41 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 35771 (lrmd) dumped core
Aug 27 18:56:41 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=35771, core=1)
Aug 27 19:44:09 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 60709 (lrmd) dumped core
Aug 27 19:44:09 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=60709, core=1)
Aug 27 20:00:53 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 35838 (lrmd) dumped core
Aug 27 20:00:53 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=35838, core=1)
Aug 27 21:33:52 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 49249 (lrmd) dumped core
Aug 27 21:33:52 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=49249, core=1)
Aug 27 22:01:16 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 65358 (lrmd) dumped core
Aug 27 22:01:16 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=65358, core=1)
Aug 27 22:28:02 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 22693 (lrmd) dumped core
Aug 27 22:28:02 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=22693, core=1)
----------------
----------------
# grep pacemakerd ha-log.k1pm102 | grep core
Aug 27 15:32:48 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed
process 5812 (lrmd) dumped core
Aug 27 15:32:48 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=5812, core=1)
Aug 27 15:52:52 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed
process 35781 (lrmd) dumped core
Aug 27 15:52:52 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=35781, core=1)
Aug 27 16:02:54 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed
process 51984 (lrmd) dumped core
Aug 27 16:02:54 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=51984, core=1)
"""
Analyzing the core file with dbgsyms, I could see that:
#0 0x00007f7184a45983 in services_action_sync (op=0x7f7185b605d0) at
services.c:434
434 crm_trace(" > stdout: %s", op->stdout_data);
Is responsible for the core.
I've checked upstream code and there might be 2 important commits that
could be cherry-picked to fix this behavior:
commit f2a637cc553cb7aec59bdcf05c5e1d077173419f
Author: Andrew Beekhof <[email protected]>
Date: Fri Sep 20 12:20:36 2013 +1000
Fix: services: Prevent use-of-NULL when executing service actions
commit 11473a5a8c88eb17d5e8d6cd1d99dc497e817aac
Author: Gao,Yan <[email protected]>
Date: Sun Sep 29 12:40:18 2013 +0800
Fix: services: Fix the executing of synchronous actions
The core can be caused by a missing NULL check such as this:
if (op == NULL) {
crm_trace("No operation to execute");
return FALSE;
}
at the beginning of the "services_action_sync(svc_action_t * op)"
function, further improved by commit 11473a5.