Analyzing the stacktrace for stonithd:

(gdb) bt
#0  0x00007fed094febb9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007fed09501fc8 in __GI_abort () at abort.c:89
#2  0x00007fed0a15a6c9 in crm_abort (file=0x7fed0a17e4bb "logging.c",
    function=0x7fed0a17f790 <__PRETTY_FUNCTION__.22958> "crm_glib_handler", line=63,
    assert_condition=0x7fed0af9f2c0 "Source ID 21 was not found when attempting to remove it",
    do_core=<optimized out>, do_fork=<optimized out>) at utils.c:1118
#3  0x00007fed0920fae1 in g_logv () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#4  0x00007fed0920fd72 in g_log () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#5  0x00007fed09207c5c in g_source_remove () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#6  0x00007fed09d23ef5 in stonith_action_clear_tracking_data (action=action@entry=0x7fed0afc6b00)
    at st_client.c:536
#7  0x00007fed09d23f2d in stonith_action_destroy (action=0x7fed0afc6b00) at st_client.c:557
#8  0x00007fed0a172cd9 in child_waitpid (child=child@entry=0x7fed0afded70, flags=flags@entry=1)
    at mainloop.c:948
#9  0x00007fed0a172fce in child_death_dispatch (signal=<optimized out>) at mainloop.c:962
#10 0x00007fed0a171de7 in crm_signal_dispatch (source=0x7fed0afb0920, callback=<optimized out>,
    userdata=<optimized out>) at mainloop.c:275
#11 0x00007fed09208e04 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#12 0x00007fed09209048 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#13 0x00007fed0920930a in g_main_loop_run () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#14 0x00007fed0a5bd2a9 in main (argc=<optimized out>, argv=<optimized out>) at main.c:1136

The stack trace shows the following abort path:

crm_glib_handler -> crm_abort -> abort

I found one upstream fix, discussed on the pacemaker mailing list,
that addresses exactly this problem:

http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022690.html

It explains that this change in glib:

https://github.com/GNOME/glib/commit/393503ba5bdc7c09cd46b716aaf3d2c63a6c7f9c
(first seen in version 2.39.91; the Trusty version is 2.40.2-0ubuntu1)

caused g_source_remove() (frame #5 in the stack trace, part of libglib) to
misbehave: glib now finds sources with a hash table lookup instead of an
iterator, and the lookup returns NULL if the source was already destroyed.

corosync reports the following errors on these occasions:

"""
lrmd[1632]:    error: crm_abort: crm_glib_handler: Forked child 1840 to record non-fatal assert at logging.c:73 : Source ID 51 was not found when attempting to remove it
lrmd[1632]:    crit: crm_glib_handler: GLib: Source ID 51 was not found when attempting to remove it
"""

This happens because the same source (here, a timer) is being removed twice,
which newer glib versions no longer tolerate.
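
A minimal sketch of the usual guard against this kind of double removal
follows. It is not the upstream patch: the source table and every `sketch_`
name are hypothetical stand-ins for glib's internals, so that the example
compiles without glib. The idea is only to remember the source ID, remove it
once, and reset it to 0 so that later cleanup calls become no-ops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for glib's source table: slot i is true while
 * source ID i is attached. This is NOT glib code. */
#define MAX_SOURCES 64
static bool attached[MAX_SOURCES];
static unsigned warnings;   /* counts "not found" complaints */

/* Attach a source and return its ID (IDs start at 1; 0 means "none"). */
static unsigned sketch_source_add(void)
{
    for (unsigned id = 1; id < MAX_SOURCES; id++) {
        if (!attached[id]) {
            attached[id] = true;
            return id;
        }
    }
    return 0;
}

/* Mimics the newer behaviour: an unknown ID is reported, not ignored. */
static bool sketch_source_remove(unsigned id)
{
    if (id == 0 || id >= MAX_SOURCES || !attached[id]) {
        fprintf(stderr,
                "Source ID %u was not found when attempting to remove it\n",
                id);
        warnings++;
        return false;
    }
    attached[id] = false;
    return true;
}

/* Hypothetical action object that owns one timer source. */
struct sketch_action {
    unsigned timer_id;   /* 0 once the timer has been removed */
};

/* Guarded cleanup: safe to call any number of times. */
static void sketch_action_clear(struct sketch_action *action)
{
    if (action->timer_id != 0) {
        sketch_source_remove(action->timer_id);
        action->timer_id = 0;   /* forget the ID so it is never reused */
    }
}
```

With real glib the same shape would wrap g_source_remove(); calling the
cleanup twice then triggers no warning, whereas removing a stale ID directly
does.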

The following upstream fix handles this problem:

From 568e41db929a34106c8c2ff7c48716ab5c13ef49 Mon Sep 17 00:00:00 2001
From: Andrew Beekhof <[email protected]>
Date: Mon, 13 Oct 2014 13:30:58 +1100
Subject: [PATCH] Fix: lrmd: Prevent glib assert triggered by timers being removed from mainloop more than once

I will provide a PPA with this fix soon so that I can get user and community
feedback on the resolution.

Thank you

Rafael Tinoco

-- 
You received this bug notification because you are a member of Ubuntu
High Availability Team, which is subscribed to pacemaker in Ubuntu.
https://bugs.launchpad.net/bugs/1368737

Title:
  Pacemaker can seg fault on crm node online/standby

Status in “pacemaker” package in Ubuntu:
  Confirmed

Bug description:
  The following situation was brought to my attention:

  """
  [Issue] 

  lrmd process crashed when repeating "crm node standby" and "crm node
  online"

  ---------------- 
  # grep pacemakerd ha-log.k1pm101 | grep core 
  Aug 27 17:47:06 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 49275 (lrmd) dumped core
  Aug 27 17:47:06 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=49275, core=1)
  Aug 27 18:27:14 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 1471 (lrmd) dumped core
  Aug 27 18:27:14 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=1471, core=1)
  Aug 27 18:56:41 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 35771 (lrmd) dumped core
  Aug 27 18:56:41 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35771, core=1)
  Aug 27 19:44:09 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 60709 (lrmd) dumped core
  Aug 27 19:44:09 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=60709, core=1)
  Aug 27 20:00:53 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 35838 (lrmd) dumped core
  Aug 27 20:00:53 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35838, core=1)
  Aug 27 21:33:52 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 49249 (lrmd) dumped core
  Aug 27 21:33:52 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=49249, core=1)
  Aug 27 22:01:16 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 65358 (lrmd) dumped core
  Aug 27 22:01:16 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=65358, core=1)
  Aug 27 22:28:02 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 22693 (lrmd) dumped core
  Aug 27 22:28:02 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=22693, core=1)
  ---------------- 

  ---------------- 
  # grep pacemakerd ha-log.k1pm102 | grep core 
  Aug 27 15:32:48 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 5812 (lrmd) dumped core
  Aug 27 15:32:48 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=5812, core=1)
  Aug 27 15:52:52 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 35781 (lrmd) dumped core
  Aug 27 15:52:52 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35781, core=1)
  Aug 27 16:02:54 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 51984 (lrmd) dumped core
  Aug 27 16:02:54 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=51984, core=1)
  """

  Analyzing the core file with debug symbols (dbgsyms), I could see that
  the following frame is responsible for the core:

  #0  0x00007f7184a45983 in services_action_sync (op=0x7f7185b605d0) at services.c:434
  434           crm_trace(" >  stdout: %s", op->stdout_data);

  I've checked the upstream code, and there are 2 important commits that
  might be cherry-picked to fix this behavior:

  commit f2a637cc553cb7aec59bdcf05c5e1d077173419f
  Author: Andrew Beekhof <[email protected]>
  Date:   Fri Sep 20 12:20:36 2013 +1000

      Fix: services: Prevent use-of-NULL when executing service actions
        
  commit 11473a5a8c88eb17d5e8d6cd1d99dc497e817aac
  Author: Gao,Yan <[email protected]>
  Date:   Sun Sep 29 12:40:18 2013 +0800

      Fix: services: Fix the executing of synchronous actions

  The core can be caused by things such as the following check being
  missing at the beginning of the "services_action_sync(svc_action_t * op)"
  function:

  if (op == NULL) {
      crm_trace("No operation to execute");
      return FALSE;
  }

  Commit 11473a5 improves on this further.
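
As a rough, self-contained illustration of why such a check matters, the
sketch below uses a hypothetical cut-down structure in place of
svc_action_t and plain fprintf() in place of crm_trace(). It is not the
pacemaker code; it only shows the early return that keeps a NULL operation
from being dereferenced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, cut-down stand-in for svc_action_t. */
typedef struct {
    const char *stdout_data;
} sketch_svc_action_t;

/* Returns false instead of dereferencing a NULL operation. */
static bool sketch_action_sync(const sketch_svc_action_t *op)
{
    if (op == NULL) {
        fprintf(stderr, "No operation to execute\n");
        return false;
    }
    /* Passing NULL to %s is undefined behaviour, so substitute a marker. */
    fprintf(stderr, " >  stdout: %s\n",
            op->stdout_data ? op->stdout_data : "(null)");
    return true;
}
```

Without the first check, a call with op == NULL would reach the
op->stdout_data access and fault.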

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/+subscriptions

