10.02.2016 12:31, Vladislav Bogdanov wrote:
10.02.2016 11:38, Ulrich Windl wrote:
Vladislav Bogdanov <bub...@hoster-ok.com> wrote on 10.02.2016 at
05:39 in
message <6e479808-6362-4932-b2c6-348c7efc4...@hoster-ok.com>:

[...]
Well, I'd reword. Generally, RA should not exit with error if validation
fails on stop.
Is that better?
[...]

As we have different error codes, what type of error?

Any that makes Pacemaker think the resource stop operation failed.
OCF_ERR_* in particular.

If Pacemaker gets an error on start, it will run stop with the same
set of parameters anyway. It will then get an error again if the first
one came from validation and the RA does not differentiate validation
for start and stop. Then circular fencing across the whole cluster is
triggered for no reason.

Of course, for safety, the RA could save its state if start was
successful and skip validation on stop only if that state is not found.
Otherwise a removed binary or config file would result in the resource
running on several nodes.

Well, this all seems too complicated to turn into some general
algorithm ;)

Well, after some thinking, I've come up with an approach which sounds both elegant and safe enough to me and my colleagues. Please look at the following excerpt (part of a hypothetical RA, placed before the main 'case'):

-----
VALIDATION_FAILURE_FLAG="${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.invalid"

case "${__OCF_ACTION}" in
    meta-data)
        meta_data
        exit $OCF_SUCCESS
        ;;
    usage|help)
        usage
        exit $OCF_SUCCESS
        ;;
    start)
        validate
        ret=$?
        if [ ${ret} -ne $OCF_SUCCESS ] ; then
            touch "${VALIDATION_FAILURE_FLAG}"
            exit ${ret}
        fi
        ;;
    stop)
        validate
        ret=$?
        if [ ${ret} -ne $OCF_SUCCESS ] ; then
            if [ -f "${VALIDATION_FAILURE_FLAG}" ] ; then
                rm -f "${VALIDATION_FAILURE_FLAG}"
                exit $OCF_SUCCESS
            else
                exit ${ret}
            fi
        fi
        ;;
    *) # monitor | notify | reload | etc
        validate
        ret=$?
        if [ ${ret} -ne $OCF_SUCCESS ] ; then
            if ocf_is_probe ; then
                exit $OCF_NOT_RUNNING
            fi
            exit ${ret}  # not $?, which is 0 after the 'if' above
        fi
        ;;
esac
-----

The above assumes that the validation function does not call exit (and thus uses have_binary instead of check_binary, etc.) but returns an error code.
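
For illustration, a validation function in that style might look like the minimal sketch below. The parameter names OCF_RESKEY_binary and OCF_RESKEY_config are hypothetical, and the fallback definitions at the top (including the simplified have_binary stand-in) are there only so the snippet runs outside a cluster node; a real RA would source ocf-shellfuncs instead.

```shell
# Fallbacks so the sketch runs stand-alone (assumption: a real RA gets
# these values and have_binary by sourcing ocf-shellfuncs).
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_INSTALLED:=5}"
: "${OCF_ERR_CONFIGURED:=6}"
if ! command -v have_binary >/dev/null 2>&1 ; then
    # Simplified stand-in: succeeds if the command can be found.
    have_binary() { command -v "$1" >/dev/null 2>&1 ; }
fi

# Returns an OCF error code instead of calling exit, so the caller can
# decide what to do depending on the action being run.
# OCF_RESKEY_binary and OCF_RESKEY_config are hypothetical parameters.
validate() {
    if [ -z "${OCF_RESKEY_binary}" ] || [ -z "${OCF_RESKEY_config}" ] ; then
        return $OCF_ERR_CONFIGURED
    fi
    if ! have_binary "${OCF_RESKEY_binary}" ; then
        return $OCF_ERR_INSTALLED
    fi
    if [ ! -r "${OCF_RESKEY_config}" ] ; then
        return $OCF_ERR_INSTALLED
    fi
    return $OCF_SUCCESS
}
```

Which error code to return for a missing config file is a judgment call; OCF_ERR_INSTALLED is used here only as an example.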

The main difference from the current ocf_rarun implementation is that changes to the machine environment (deleted binaries, configs, etc.) still result in a stop failure (and thus fencing) if those changes were made after successful validation on resource start.
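
To make the two cases concrete, here is a small stand-alone sketch of just the stop decision. The helper name stop_validation_rc and the demo flag path are hypothetical; the value 5 stands for OCF_ERR_INSTALLED.

```shell
# Hypothetical helper: given the validation result ($1) and the path of
# the validation-failure flag ($2), return the code 'stop' should exit
# with when validation did not pass (0 means "carry on / report success").
stop_validation_rc() {
    ret=$1
    flag=$2
    if [ "${ret}" -eq 0 ] ; then
        return 0            # validation passed, proceed with a normal stop
    fi
    if [ -f "${flag}" ] ; then
        rm -f "${flag}"     # start failed validation too: nothing is running
        return 0
    fi
    return "${ret}"         # environment broke after a good start: fail stop
}

demo_dir=$(mktemp -d)
flag="${demo_dir}/demo.invalid"

# Case 1: start failed validation and left the flag behind.
touch "${flag}"
rc_after_failed_start=0
stop_validation_rc 5 "${flag}" || rc_after_failed_start=$?

# Case 2: no flag, i.e. validation passed on start and broke later.
rc_after_good_start=0
stop_validation_rc 5 "${flag}" || rc_after_good_start=$?

echo "stop after failed start: ${rc_after_failed_start}"   # prints 0
echo "stop after good start:   ${rc_after_good_start}"     # prints 5
rm -rf "${demo_dir}"
```

In the first case stop reports success and no fencing is needed; in the second the validation error is passed through, so the stop fails.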

I plan to test this approach extensively in my RAs shortly.

Comments are welcome.

Best,
Vladislav





Regards,
Ulrich



_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


