Okay,
So the cherry-pick (for version
trusty_pacemaker_1.1.10+git20130802-1ubuntu2.2, based on an upstream
commit) looks correct, since it makes lrmd (services, services_linux)
avoid re-arming a timer whose source was already removed from the glib
main loop context:
example:
+ if (op->opaque->repeat_timer) {
+ g_source_remove(op->opaque->repeat_timer);
++ op->opaque->repeat_timer = 0;
etc...
This actually solved the lrmd crashes I was getting with the test case
(described in this bug's summary).
===
Explanation:
g_source_remove ->
http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022690.html
libglib2 changes ->
http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022699.html
===
Analyzing your crash file (from stonith and not lrm), it looks like we
have the following scenario:
==============
exited = child_waitpid(child, WNOHANG);
|_> child->callback(child, child->pid, core, signo, exitcode);
|_> stonith_action_async_done (stack shows: stonith_action_destroy())
<----> calls g_source_remove twice
|_> stonith_action_clear_tracking_data(action);
|_> g_source_remove(action->timer_sigterm);
|_> g_critical ("Source ID %u was not found when attempting to
remove it", tag);
WHERE
==============
The child here is the "monitor" (0x7f1f63a08b70 "monitor"): /usr/sbin/fence_legacy,
a "Helper that presents a RHCS-style interface for Linux-HA stonith plugins".
This is the script responsible for monitoring a stonith resource, and it
has returned (triggering the monitor
callback) with the following data on it:
------ data (begin) ------
agent=fence_legacy
action=monitor
plugin=external/ssh
hostlist=kjpnode2
timeout=20
async=1
tries=1
remaining_timeout=20
timer_sigterm=13
timer_sigkill=14
max_retries=2
pid=1464
rc=0 (RETURN CODE)
string buffer: "Performing: stonith -t external/ssh -S\nsuccess: 0\n"
------ data (end) ------
Note: this means that fence_legacy returned, after checking that
st_kjpnode2 was OK, and its cleanup operation (callback) caused
the problem we faced.
As soon as it dies, the callback for this process is called:
if (child->callback) {
child->callback(child, child->pid, core, signo, exitcode);
In our case, callback is:
0x7f1f6189cec0 <stonith_action_async_done> which
calls
0x7f1f6189af10 <stonith_action_destroy> and
then
0x7f1f6189ae60 <stonith_action_clear_tracking_data>, generating the
2nd removal (g_source_remove).
With this 2nd call to g_source_remove, after the glib2.0 change explained
before this comment, we get:
g_critical ("Source ID %u was not found when attempting to remove it",
tag);
and this generates the crash (since g_log is called with a critical
log level, causing crm_abort to be called).
POSSIBLE CAUSE:
==============
Under <stonith_action_async_done> we have:
stonith_action_t *action = 0x7f1f639f5b50.
if (action->timer_sigterm > 0) {
g_source_remove(action->timer_sigterm);
}
if (action->timer_sigkill > 0) {
g_source_remove(action->timer_sigkill);
}
Under <stonith_action_destroy> we have stonith_action_t *action =
0x7f1f639f5b50.
and a call to: stonith_action_clear_tracking_data(action);
Under stonith_action_clear_tracking_data(stonith_action_t * action) we
have AGAIN:
stonith_action_t *action = 0x7f1f639f5b50.
if (action->timer_sigterm > 0) {
g_source_remove(action->timer_sigterm);
action->timer_sigterm = 0;
}
if (action->timer_sigkill > 0) {
g_source_remove(action->timer_sigkill);
action->timer_sigkill = 0;
}
This logic probably triggered the same problem the cherry-pick addressed
for lrmd,
but now for stonith (calling g_source_remove twice for the same source
after the glib2.0 change).
I'll discuss this with upstream and try to confirm that this is what is
happening.
Thank you
Rafael Tinoco
--
You received this bug notification because you are a member of Ubuntu
High Availability Team, which is subscribed to pacemaker in Ubuntu.
https://bugs.launchpad.net/bugs/1368737
Title:
Pacemaker can seg fault on crm node online/standby
Status in pacemaker package in Ubuntu:
Fix Released
Status in pacemaker source package in Trusty:
Fix Committed
Status in pacemaker source package in Utopic:
Fix Committed
Status in pacemaker source package in Vivid:
Fix Released
Bug description:
[IMPACT]
- Pacemaker seg faults on repeated crm node online/standby because:
- Newer glib versions use a hash table to find GSources
- Glib can now raise a critical assertion when the same source is removed multiple times
[TEST CASE]
- Using the same configuration as the attached cib.xml:
#!/bin/bash
while true; do
crm node standby clustertrusty01
sleep 7
crm node online clustertrusty01
sleep 7
crm node standby clustertrusty02
sleep 7
crm node online clustertrusty02
sleep 7
crm node standby clustertrusty03
sleep 7
crm node online clustertrusty03
sleep 7
done
[REGRESSION POTENTIAL]
- Based on upstream commit 568e41d
- Test case ran for more than 7 hours with no problems
[OTHER INFO]
It was brought to my attention the following situation:
"""
[Issue]
lrmd process crashed when repeating "crm node standby" and "crm node
online"
----------------
# grep pacemakerd ha-log.k1pm101 | grep core
Aug 27 17:47:06 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 49275 (lrmd) dumped core
Aug 27 17:47:06 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=49275, core=1)
Aug 27 18:27:14 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 1471 (lrmd) dumped core
Aug 27 18:27:14 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=1471, core=1)
Aug 27 18:56:41 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 35771 (lrmd) dumped core
Aug 27 18:56:41 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=35771, core=1)
Aug 27 19:44:09 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 60709 (lrmd) dumped core
Aug 27 19:44:09 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=60709, core=1)
Aug 27 20:00:53 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 35838 (lrmd) dumped core
Aug 27 20:00:53 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=35838, core=1)
Aug 27 21:33:52 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 49249 (lrmd) dumped core
Aug 27 21:33:52 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=49249, core=1)
Aug 27 22:01:16 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 65358 (lrmd) dumped core
Aug 27 22:01:16 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=65358, core=1)
Aug 27 22:28:02 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 22693 (lrmd) dumped core
Aug 27 22:28:02 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=22693, core=1)
----------------
----------------
# grep pacemakerd ha-log.k1pm102 | grep core
Aug 27 15:32:48 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed
process 5812 (lrmd) dumped core
Aug 27 15:32:48 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=5812, core=1)
Aug 27 15:52:52 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed
process 35781 (lrmd) dumped core
Aug 27 15:52:52 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=35781, core=1)
Aug 27 16:02:54 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed
process 51984 (lrmd) dumped core
Aug 27 16:02:54 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=51984, core=1)
"""
Analyzing the core file with dbgsyms, I could see that:
#0 0x00007f7184a45983 in services_action_sync (op=0x7f7185b605d0) at
services.c:434
434 crm_trace(" > stdout: %s", op->stdout_data);
Is responsible for the core.
I've checked upstream code and there might be 2 important commits that
could be cherry-picked to fix this behavior:
commit f2a637cc553cb7aec59bdcf05c5e1d077173419f
Author: Andrew Beekhof <[email protected]>
Date: Fri Sep 20 12:20:36 2013 +1000
Fix: services: Prevent use-of-NULL when executing service actions
commit 11473a5a8c88eb17d5e8d6cd1d99dc497e817aac
Author: Gao,Yan <[email protected]>
Date: Sun Sep 29 12:40:18 2013 +0800
Fix: services: Fix the executing of synchronous actions
The core can be caused by a missing NULL check such as this:
if (op == NULL) {
crm_trace("No operation to execute");
return FALSE;
}
at the beginning of the "services_action_sync(svc_action_t * op)"
function, further improved by commit 11473a5.