Hi Rafael,

The PPA solved the stonith crashes on my test system. I still get an
lrmd crash (core attached) corresponding to the crm_abort error:

Jan 23 08:57:30 kjpnode1 lrmd[1363]:    error: crm_abort: crm_glib_handler: Forked child 5341 to record non-fatal assert at logging.c:63 : Source ID 849 was not found when attempting to remove it

I may get the chance to test it on the production system next week.

Peter

** Attachment added: "_usr_lib_pacemaker_lrmd.0.crash"
   
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1412962/+attachment/4304112/+files/_usr_lib_pacemaker_lrmd.0.crash

-- 
You received this bug notification because you are a member of Ubuntu
High Availability Team, which is subscribed to pacemaker in Ubuntu.
https://bugs.launchpad.net/bugs/1412962

Title:
  Pacemaker (stonith) can seg fault in Trusty and Utopic after following
  message: Source ID XX was not found when attempting to remove it

Status in pacemaker package in Ubuntu:
  In Progress

Bug description:
  It was brought to my attention that pacemaker (stonith) can segfault
  under some conditions. The problem came up while I was working on the
  following bug:

  https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/

  The problem is described here:

  https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/34
  https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/35
  https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/36
  https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/37
  https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/38

  And a possible explanation here:

  https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/39
  https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/40

  (Copy and pasting here):

  So the cherry-pick (for version
  trusty_pacemaker_1.1.10+git20130802-1ubuntu2.2, based on an upstream
  commit) seems ok, since it keeps lrmd (services, services_linux) from
  touching a repeat timer again once its source has already been removed
  from the glib main loop context:

  example:

  + if (op->opaque->repeat_timer) {
  + g_source_remove(op->opaque->repeat_timer);
  ++ op->opaque->repeat_timer = 0;

  etc...

  This actually solved lrmd crashes I was getting with the testcase
  (explained inside this bug summary).

  ===
  Explanation:
  g_source_remove -> http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022690.html
  libglib2 changes -> http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022699.html
  ===

  Analyzing your crash file (from stonith and not lrm), it looks like we
  have the following scenario:

  ==============

  exited = child_waitpid(child, WNOHANG);
  |_> child->callback(child, child->pid, core, signo, exitcode);
      |_> stonith_action_async_done (stack shows: stonith_action_destroy()) <----> calls g_source_remove 2 times
          |_> stonith_action_clear_tracking_data(action);
              |_> g_source_remove(action->timer_sigterm);
                  |_> g_critical ("Source ID %u was not found when attempting to remove it", tag);

  WHERE
  ==============

  Child here is the "monitor" (0x7f1f63a08b70 "monitor"): /usr/sbin/fence_legacy
  "Helper that presents a RHCS-style interface for Linux-HA stonith plugins"

  This is the script responsible for monitoring a stonith resource. It
  has returned (triggering the monitor callback) with the following data:

  ------ data (begin) ------
  agent=fence_legacy
  action=monitor
  plugin=external/ssh
  hostlist=kjpnode2
  timeout=20
  async=1
  tries=1
  remaining_timeout=20
  timer_sigterm=13
  timer_sigkill=14
  max_retries=2
  pid=1464
  rc=0 (RETURN CODE)
  string buffer: "Performing: stonith -t external/ssh -S\nsuccess: 0\n"
  ------ data (end) ------

  OBS: This means that fence_legacy returned, after checking that
  st_kjpnode2 was ok, and its cleanup operation (callback) caused
  the problem we faced.

  As soon as it dies, the callback for this process is called:

      if (child->callback) {
          child->callback(child, child->pid, core, signo, exitcode);
      }

  In our case, callback is:

  0x7f1f6189cec0 <stonith_action_async_done> which calls
  0x7f1f6189af10 <stonith_action_destroy> and then
  0x7f1f6189ae60 <stonith_action_clear_tracking_data>, generating the 2nd removal (g_source_remove)

  With the 2nd call to g_source_remove, after the glib2.0 change
  explained before this comment, we get a

  g_critical ("Source ID %u was not found when attempting to remove it",
  tag);

  and this generates the crash (since g_log is called with a critical
  log_level, causing crm_abort to be called).

  POSSIBLE CAUSE:
  ==============

  Under <stonith_action_async_done> we have:

  stonith_action_t *action = 0x7f1f639f5b50.

      if (action->timer_sigterm > 0) {
          g_source_remove(action->timer_sigterm);
      }
      if (action->timer_sigkill > 0) {
          g_source_remove(action->timer_sigkill);
      }

  Under <stonith_action_destroy> we have stonith_action_t *action = 0x7f1f639f5b50
  and a call to: stonith_action_clear_tracking_data(action);

  Under stonith_action_clear_tracking_data(stonith_action_t * action) we
  have AGAIN:

  stonith_action_t *action = 0x7f1f639f5b50.

      if (action->timer_sigterm > 0) {
          g_source_remove(action->timer_sigterm);
          action->timer_sigterm = 0;
      }
      if (action->timer_sigkill > 0) {
          g_source_remove(action->timer_sigkill);
          action->timer_sigkill = 0;
      }

  This logic probably triggered the same problem the cherry-pick
  addressed for lrmd, but now for stonith (calling g_source_remove 2
  times for the same source after glib2.0 was changed).
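  The double removal described above can be reproduced in miniature. The
  sketch below is only an illustration under stated assumptions: it does
  not use glib or pacemaker, and every name in it (fake_source_add,
  fake_source_remove, fake_action_t, fake_async_done,
  fake_clear_tracking_data, run_scenario) is hypothetical. It models the
  main loop as a table of live source IDs and counts how often a removal
  misses, with and without resetting the stored ID after removal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the glib main loop: one slot per source ID. */
#define MAX_SOURCES 32
static bool source_alive[MAX_SOURCES];
static int not_found_count;    /* number of removals that missed */

/* Register a fake source and return its ID (0 means "no source"). */
static unsigned int fake_source_add(void)
{
    for (unsigned int id = 1; id < MAX_SOURCES; id++) {
        if (!source_alive[id]) {
            source_alive[id] = true;
            return id;
        }
    }
    return 0;
}

/* Remove a fake source; complain if the ID is not registered. */
static bool fake_source_remove(unsigned int id)
{
    if (id == 0 || id >= MAX_SOURCES || !source_alive[id]) {
        not_found_count++;
        fprintf(stderr, "Source ID %u was not found when attempting to remove it\n", id);
        return false;
    }
    source_alive[id] = false;
    return true;
}

/* Assumption: only the two timer fields matter for this illustration. */
typedef struct {
    unsigned int timer_sigterm;
    unsigned int timer_sigkill;
} fake_action_t;

/* First removal site; reset_ids selects the unpatched or patched behaviour. */
static void fake_async_done(fake_action_t *action, bool reset_ids)
{
    if (action->timer_sigterm > 0) {
        fake_source_remove(action->timer_sigterm);
        if (reset_ids) {
            action->timer_sigterm = 0;
        }
    }
    if (action->timer_sigkill > 0) {
        fake_source_remove(action->timer_sigkill);
        if (reset_ids) {
            action->timer_sigkill = 0;
        }
    }
}

/* Second removal site; it always resets the IDs after removing them. */
static void fake_clear_tracking_data(fake_action_t *action)
{
    if (action->timer_sigterm > 0) {
        fake_source_remove(action->timer_sigterm);
        action->timer_sigterm = 0;
    }
    if (action->timer_sigkill > 0) {
        fake_source_remove(action->timer_sigkill);
        action->timer_sigkill = 0;
    }
}

/* Run both removal sites in order and return the number of missed removals. */
static int run_scenario(bool reset_ids)
{
    not_found_count = 0;
    fake_action_t action = { fake_source_add(), fake_source_add() };
    assert(action.timer_sigterm != 0 && action.timer_sigkill != 0);

    fake_async_done(&action, reset_ids);
    fake_clear_tracking_data(&action);
    return not_found_count;
}
```

  Calling run_scenario(false) models the behaviour without the reset, so
  the second removal site sees stale IDs and both removals miss.
  run_scenario(true) models the reset-after-remove pattern, where the
  second site finds the IDs already cleared and skips them.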

  ##############

  commit 0326f05c9e26f39a394fa30830e31a76306f49c7
  Author: Andrew Beekhof <[email protected]>
  Date: Thu Aug 7 13:49:24 2014 +1000

      Fix: stonith-ng: Reset mainloop source IDs after removing them

  diff --git a/lib/fencing/st_client.c b/lib/fencing/st_client.c
  index 64bd8f3..2837682 100644
  --- a/lib/fencing/st_client.c
  +++ b/lib/fencing/st_client.c
  @@ -663,9 +663,11 @@ stonith_action_async_done(mainloop_child_t * p, pid_t pid, int core, int signo,

       if (action->timer_sigterm > 0) {
           g_source_remove(action->timer_sigterm);
  +        action->timer_sigterm = 0;
       }
       if (action->timer_sigkill > 0) {
           g_source_remove(action->timer_sigkill);
  +        action->timer_sigkill = 0;
       }

       if (action->last_timeout_signo) {

  ##############

  The change above is in <stonith_action_async_done>.

  I will provide you with a hotfix containing this fix and ask for feedback.

