It was discovered that corosync can crash itself under a rare circumstance: the corosync executable is run while a daemon instance is already around (this does not apply to corosync running without any backgrounding, i.e. launched with the "-f" switch).
Such a circumstance can be provoked unattended by a third party, including the "corosync -v" probe triggered internally by pcs (since 9e19af58 ~ 0.9.145), which is what makes the root cause of such an inflicted crash rather difficult to guess and analyze (the other reason, based on the few observed cases, may be that the core dump tends to go missing, if it is produced at all, because fencing follows).

The problem comes from the fact that corosync is arranged such that logging is set up very early, even before the main control flow of the program starts. Part of this early setup is also starting the "blackbox" recording, which uses an mmap'd file stored in /dev/shm whose name, moreover, varies only in the PID it contains. Once corosync performs the fork so as to detach itself from the environment that started it, that PID is free to be reused. And against all odds, when a fresh new corosync process happens to get that very PID, it happily mangles the file underneath the former daemon one, leading to crashes indicated by SIGBUS, rarely also SIGFPE.

* * *

There are two quick mitigation techniques that can be readily applied:

1. make the on-PATH corosync executable rather a "careful" wrapper:

     cp -a /sbin/corosync /sbin/corosync.orig
     # the version string is captured once, right here, and baked into the wrapper
     > /sbin/corosync cat <<EOF
     #!/bin/sh
     test "\$1" != -v || { echo "$(/sbin/corosync.orig -v)"; exit 0; }
     exec /sbin/corosync.orig "\$@"
     EOF

   (when using SELinux, check the function and possibly fix the contexts on these files)

2. extend the PID space so as to push its wrap-around (a precondition for reproducing the issue) further into the future (hence making the critical moments less frequent and lowering the overall probability), for instance with the Linux kernel:

     echo 4194303 > /proc/sys/kernel/pid_max

* * *

To claim this problem is fixed, all three mentioned components will have to do their part to limit the problem in the future:

- corosync (do something new after fork?)
- libqb (be more careful about the crashing condition?)
- pcs (either find a different way to check "is-old-stack", or double-check that the probe's PID doesn't happen to hit the one baked into existing files in /dev/shm?)

so as to cover the-counterpart-not-up2date cases; this will also likely lead to augmenting and/or overloading the semantics of libqb's API.

All is being worked on, stay tuned.

-- 
Jan (Poki)
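
Regarding the pcs-side idea of double-checking the probe's PID against /dev/shm, a minimal sketch of such a check might look as follows. Note that both the helper name shm_pid_collides and the "qb-*-<PID>-*" file name pattern are assumptions made for illustration only (the text above only says that the PID is part of the file name), so the pattern would need to be verified against the libqb version actually in use.

```shell
#!/bin/sh
# Hypothetical helper (not part of corosync, libqb or pcs):
# succeed if PID ($2) shows up as a dash-delimited field in the name
# of any "qb-" prefixed file under directory $1 (default: /dev/shm).
# ASSUMPTION: blackbox files are named like qb-<name>-<PID>-<suffix>.
shm_pid_collides() {
    dir=${1:-/dev/shm}
    pid=$2
    for f in "$dir"/qb-*-"$pid"-*; do
        # an unmatched glob stays literal, hence the existence test
        if [ -e "$f" ]; then
            return 0
        fi
    done
    return 1
}
```

Since the check is only meaningful for the PID the probe actually runs under, a caller would have to test the PID of the very process that subsequently exec()s corosync, rather than that of some unrelated shell.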
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org