Attaching fix for Utopic.

** Patch added: "utopic_pacemaker_1.1.10+git20130802-4ubuntu3.2.debdiff"
   https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1412962/+attachment/4306495/+files/utopic_pacemaker_1.1.10%2Bgit20130802-4ubuntu3.2.debdiff

-- 
You received this bug notification because you are a member of Ubuntu
High Availability Team, which is subscribed to pacemaker in Ubuntu.
https://bugs.launchpad.net/bugs/1412962

Title:
  Pacemaker (stonith) can seg fault in Trusty and Utopic after following
  message: Source ID XX was not found when attempting to remove it

Status in pacemaker package in Ubuntu:
  In Progress

Bug description:

It was brought to my attention that pacemaker (stonith) could seg fault
under some conditions. The problem came up while I was working on the
following bug:

https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/

You can see the problem here:

https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/34
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/35
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/36
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/37
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/38

And a possible explanation here:

https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/39
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/40

(Copying and pasting here):

The cherry-pick (for version
trusty_pacemaker_1.1.10+git20130802-1ubuntu2.2, based on an upstream
commit) seems OK, since it keeps lrmd (services, services_linux) from
removing a repeat timer again when its source was already removed from
the glib main loop context. Example:

    +    if (op->opaque->repeat_timer) {
    +        g_source_remove(op->opaque->repeat_timer);
    ++       op->opaque->repeat_timer = 0;
    etc...

This actually solved the lrmd crashes I was getting with the test case
(explained in this bug's summary).

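The remove-and-reset idiom from that cherry-pick can be sketched in isolation as below. This is only an illustration under loud assumptions: `mock_source_add`, `mock_source_remove`, `timer_clear`, `source_active` and `critical_count` are hypothetical names, and the small table merely stands in for the glib main context so the example builds without glib. It is not pacemaker or glib code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the glib main context: one slot per source ID. */
#define MAX_SOURCES 32
static bool source_active[MAX_SOURCES];
static unsigned critical_count;   /* counts simulated "Source ID not found" messages */

/* Mock of registering a timer: returns a non-zero ID, or 0 if the table is full. */
static unsigned mock_source_add(void)
{
    for (unsigned id = 1; id < MAX_SOURCES; id++) {
        if (!source_active[id]) {
            source_active[id] = true;
            return id;
        }
    }
    return 0;
}

/* Mock of g_source_remove(): removing an unknown ID is reported as a critical. */
static bool mock_source_remove(unsigned id)
{
    if (id == 0 || id >= MAX_SOURCES || !source_active[id]) {
        critical_count++;
        return false;
    }
    source_active[id] = false;
    return true;
}

/* Remove-and-reset: once the ID is zeroed, later calls are no-ops. */
static void timer_clear(unsigned *id)
{
    assert(id != NULL);
    if (*id > 0) {
        mock_source_remove(*id);
        *id = 0;
    }
}
```

Calling `timer_clear()` twice on the same field reaches `mock_source_remove()` only once, so no "not found" message is counted. That is the property the `repeat_timer = 0` line adds.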
=== Explanation:

g_source_remove  -> http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022690.html
libglib2 changes -> http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022699.html

===

Analyzing your crash file (from stonith and not lrmd), it looks like we
have the following scenario:

==============

exited = child_waitpid(child, WNOHANG);
|_> child->callback(child, child->pid, core, signo, exitcode);
    |_> stonith_action_async_done (stack shows: stonith_action_destroy())   <----> calls g_source_remove 2 times
        |_> stonith_action_clear_tracking_data(action);
            |_> g_source_remove(action->timer_sigterm);
                |_> g_critical ("Source ID %u was not found when attempting to remove it", tag);

WHERE
==============

Child here is the "monitor" (0x7f1f63a08b70 "monitor"):

    /usr/sbin/fence_legacy
    "Helper that presents a RHCS-style interface for Linux-HA stonith plugins"

This is the script responsible for monitoring a stonith resource, and it
has returned (triggering the monitor callback) with the following data:

------ data (begin) ------
agent=fence_legacy
action=monitor
plugin=external/ssh
hostlist=kjpnode2
timeout=20
async=1
tries=1
remaining_timeout=20
timer_sigterm=13
timer_sigkill=14
max_retries=2
pid=1464
rc=0 (RETURN CODE)
string buffer: "Performing: stonith -t external/ssh -S\nsuccess: 0\n"
------ data (end) ------

OBS: This means that fence_legacy returned after checking that
st_kjpnode2 was OK, and its cleanup operation (callback) caused the
problem we faced.

As soon as it dies, the callback for this process is called:

    if (child->callback) {
        child->callback(child, child->pid, core, signo, exitcode);

In our case, the callback is:

    0x7f1f6189cec0 <stonith_action_async_done>

which calls

    0x7f1f6189af10 <stonith_action_destroy>

and then

    0x7f1f6189ae60 <stonith_action_clear_tracking_data>

generating the 2nd removal (g_source_remove). With the 2nd call to
g_source_remove, after the glib2.0 change explained before this comment,
we get a

    g_critical ("Source ID %u was not found when attempting to remove it", tag);

and this generates the crash (since g_log is called with a critical
log_level, causing crm_abort to be called).

POSSIBLE CAUSE:
==============

Under <stonith_action_async_done> we have:

    stonith_action_t *action = 0x7f1f639f5b50.

    if (action->timer_sigterm > 0) {
        g_source_remove(action->timer_sigterm);
    }
    if (action->timer_sigkill > 0) {
        g_source_remove(action->timer_sigkill);
    }

Under <stonith_action_destroy> we have

    stonith_action_t *action = 0x7f1f639f5b50.

and a call to:

    stonith_action_clear_tracking_data(action);

Under stonith_action_clear_tracking_data(stonith_action_t * action) we
have AGAIN:

    stonith_action_t *action = 0x7f1f639f5b50.

    if (action->timer_sigterm > 0) {
        g_source_remove(action->timer_sigterm);
        action->timer_sigterm = 0;
    }
    if (action->timer_sigkill > 0) {
        g_source_remove(action->timer_sigkill);
        action->timer_sigkill = 0;
    }

This logic probably triggered the same problem the cherry-pick addressed
for lrmd, but now for stonith (calling g_source_remove 2 times for the
same source after glib2.0 was changed).

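The double removal described above can be reproduced in miniature without glib or pacemaker. The sketch below rests on stated assumptions: `action_t`, `source_add`, `source_remove`, `clear_tracking_data`, `async_done`, `run_once`, `live` and `criticals` are hypothetical names that only mirror the shape of the functions quoted above, and a small table stands in for the glib main context. The `reset_ids` flag switches between the behaviour quoted above (IDs left in place by the "done" handler) and resetting each ID right after removal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SOURCES 32
static bool live[MAX_SOURCES];   /* hypothetical table of registered source IDs */
static unsigned criticals;       /* simulated "Source ID ... not found" messages */

/* Mock of registering a timer source; returns a non-zero ID or 0 when full. */
static unsigned source_add(void)
{
    for (unsigned id = 1; id < MAX_SOURCES; id++) {
        if (!live[id]) {
            live[id] = true;
            return id;
        }
    }
    return 0;
}

/* Mock of g_source_remove(): an ID that is no longer registered is a critical. */
static void source_remove(unsigned id)
{
    if (id == 0 || id >= MAX_SOURCES || !live[id]) {
        criticals++;
        return;
    }
    live[id] = false;
}

typedef struct {
    unsigned timer_sigterm;
    unsigned timer_sigkill;
} action_t;

/* Mirrors the shape of stonith_action_clear_tracking_data(): remove and reset. */
static void clear_tracking_data(action_t *a)
{
    if (a->timer_sigterm > 0) {
        source_remove(a->timer_sigterm);
        a->timer_sigterm = 0;
    }
    if (a->timer_sigkill > 0) {
        source_remove(a->timer_sigkill);
        a->timer_sigkill = 0;
    }
}

/* Mirrors the shape of stonith_action_async_done() followed by the destroy path. */
static void async_done(action_t *a, bool reset_ids)
{
    assert(a != NULL);
    if (a->timer_sigterm > 0) {
        source_remove(a->timer_sigterm);
        if (reset_ids) a->timer_sigterm = 0;
    }
    if (a->timer_sigkill > 0) {
        source_remove(a->timer_sigkill);
        if (reset_ids) a->timer_sigkill = 0;
    }
    clear_tracking_data(a);   /* stands in for stonith_action_destroy() */
}

/* Runs one simulated monitor completion; returns how many criticals it caused. */
static unsigned run_once(bool reset_ids)
{
    action_t a;
    unsigned before = criticals;

    a.timer_sigterm = source_add();
    a.timer_sigkill = source_add();
    async_done(&a, reset_ids);
    return criticals - before;
}
```

In this mock, `run_once(false)` counts one "not found" message per timer, while `run_once(true)` counts none. In the real daemon the first such message would already lead to crm_abort, as described above, so only one would ever be seen.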
##############

commit 0326f05c9e26f39a394fa30830e31a76306f49c7
Author: Andrew Beekhof <[email protected]>
Date:   Thu Aug 7 13:49:24 2014 +1000

    Fix: stonith-ng: Reset mainloop source IDs after removing them

diff --git a/lib/fencing/st_client.c b/lib/fencing/st_client.c
index 64bd8f3..2837682 100644
--- a/lib/fencing/st_client.c
+++ b/lib/fencing/st_client.c
@@ -663,9 +663,11 @@ stonith_action_async_done(mainloop_child_t * p, pid_t pid, int core, int signo,
     if (action->timer_sigterm > 0) {
         g_source_remove(action->timer_sigterm);
+        action->timer_sigterm = 0;
     }
     if (action->timer_sigkill > 0) {
         g_source_remove(action->timer_sigkill);
+        action->timer_sigkill = 0;
     }
     if (action->last_timeout_signo) {

##############

The change above is under <stonith_action_async_done>.

I will provide you with a hotfix containing this fix and ask for
feedback.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1412962/+subscriptions

_______________________________________________
Mailing list: https://launchpad.net/~ubuntu-ha
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~ubuntu-ha
More help   : https://help.launchpad.net/ListHelp

