It was discovered that corosync can crash itself under a rare
circumstance: when the corosync executable is run while a daemon
instance is already running (this does not apply to corosync running
without any backgrounding, i.e. launched with the "-f" switch).

Such a circumstance can be provoked unattended by a third party,
including the "corosync -v" probe triggered internally by pcs (since
9e19af58 ~ 0.9.145), which is what makes the root cause of such an
inflicted crash somewhat difficult to guess and analyze (the other
reason may be that the core dump, if produced at all, tends to get
lost because fencing kicks in, based on the few observed cases).

The problem comes from the fact that corosync is arranged such that
the logging is set up very early, even before the main control flow
of the program starts.  Part of this early setup is also starting the
"blackbox" recording, which uses an mmap'd file stored in /dev/shm
whose name, moreover, varies only on the PID that is part of it -- and
once corosync performs the fork so as to detach itself from the
environment that started it, that PID is free to be reused.  When,
against all odds, a fresh new corosync process gets that very PID, it
happily mangles the file underneath the former daemon, leading to
crashes indicated by SIGBUS, rarely also SIGFPE.
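
To see which PIDs are currently baked into blackbox files, something
along these lines may help.  It is only a sketch: the file name pattern
(qb-corosync-<PID>-blackbox-*) is an assumption about how libqb names
these files and may differ between versions, and blackbox_pid is a
made-up helper name.  The script only reads, it does not modify anything.

```shell
#!/bin/sh
# Assumption: blackbox files are named qb-corosync-<PID>-blackbox-{header,data};
# adjust the patterns if your libqb/corosync version names them differently.

# Hypothetical helper: print the PID embedded in a blackbox file name.
blackbox_pid() {
    base=${1##*/}                     # strip the directory part
    case $base in
        qb-corosync-*-blackbox-*) ;;  # looks like a blackbox file
        *) return 1 ;;
    esac
    pid=${base#qb-corosync-}          # drop the fixed prefix
    pid=${pid%%-*}                    # keep everything up to the next dash
    case $pid in
        ''|*[!0-9]*) return 1 ;;      # not a plain number
    esac
    printf '%s\n' "$pid"
}

# List every blackbox file together with the PID baked into its name.
for f in /dev/shm/qb-corosync-*-blackbox-*; do
    [ -e "$f" ] || continue           # glob did not match anything
    printf '%s\t%s\n' "$(blackbox_pid "$f")" "$f"
done
```

Per the description above, the PID printed is that of the process from
before the fork, not that of the running daemon.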

* * *

There are two quick mitigation techniques that can be readily applied:

1. turn the on-PATH corosync executable into a "careful" wrapper:

    cp -a /sbin/corosync /sbin/corosync.orig
    > /sbin/corosync cat <<EOF
    #!/bin/sh
    test "\$1" != -v || { echo "$(/sbin/corosync.orig -v)"; exit 0; }
    exec /sbin/corosync.orig "\$@"
    EOF

    (when using SELinux, check that the wrapper works and possibly
    fix the contexts on these files)

2. extend the PID space so as to push its wrap-around (a precondition
    for reproducing the issue) further into the future (hence making
    the critical moments less frequent and lowering the overall
    probability), for instance with the Linux kernel:

    echo 4194303 > /proc/sys/kernel/pid_max
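
Before and after applying the second mitigation, the current limit can
be checked roughly as follows.  This is a sketch only: pid_max_status is
a made-up helper, its optional second argument exists merely so that it
can be tried against a plain file, and the sysctl.d drop-in mentioned in
the comments assumes a distribution that reads /etc/sysctl.d.

```shell
#!/bin/sh
# Hypothetical helper: report whether pid_max is below a wanted value.
# Returns 0 if the limit is high enough, 1 if it is lower, 2 if unreadable.
pid_max_status() {
    wanted=$1
    file=${2:-/proc/sys/kernel/pid_max}   # Linux procfs knob by default
    current=$(cat "$file") || return 2
    if [ "$current" -lt "$wanted" ]; then
        echo "pid_max is $current, below $wanted"
        return 1
    fi
    echo "pid_max is $current, at least $wanted"
}

pid_max_status 4194303 || true

# As root, the value can be raised for the running kernel with:
#   sysctl -w kernel.pid_max=4194303
# and kept across reboots with a drop-in file (the file name is arbitrary,
# and /etc/sysctl.d being honoured is an assumption about the system):
#   echo 'kernel.pid_max = 4194303' > /etc/sysctl.d/90-pid-max.conf
```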

* * *

To claim this problem is fixed, at least all three mentioned components
will have to do their part to limit the problem in the future:

- corosync (do something new after fork?)

Patch proposal:

https://github.com/corosync/corosync/pull/308

Also, the problem is really very rare, and reproducing it is quite hard.


- libqb (be more careful about the crashing condition?)

- pcs (either find a different way to check "is-old-stack", or double
   check if the probe's PID doesn't happen to hit the one baked in
   existing files in /dev/shm?)

so as to cover the cases where a counterpart is not up to date; this
will also likely lead to augmenting and/or overloading the semantics
of libqb's API.
All is being worked on, stay tuned.
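
For illustration, the "double check" idea from the pcs item could look
roughly like the sketch below.  It is not what pcs does or will do:
pid_is_clear is a made-up helper, and the file name pattern
(qb-corosync-<PID>-blackbox-*) is an assumption about how the blackbox
files are named.  It relies on the fact that "exec" keeps the PID of the
calling shell, so $$ is the PID the probe would run under.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if no blackbox file carries the given PID.
# Assumption: files are named qb-corosync-<PID>-blackbox-* under /dev/shm.
pid_is_clear() {
    pid=$1
    dir=${2:-/dev/shm}
    for f in "$dir"/qb-corosync-"$pid"-blackbox-*; do
        [ -e "$f" ] && return 1       # a file with this PID already exists
    done
    return 0
}

# "exec" would keep the PID of this shell, so check $$ before probing.
if pid_is_clear "$$"; then
    echo "PID $$ is clear, the probe could be exec'd from here"
    # e.g.: exec corosync -v
else
    echo "PID $$ matches an existing blackbox file, skipping the probe" >&2
fi
```

When the check fails, simply retrying from a new process (hence with a
new PID) would be one way out.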



_______________________________________________
Users mailing list: [email protected]
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


