Re: [ClusterLabs] Q: ordering for a monitoring op only?

2018-08-21 Thread Ryan Thomas
You could accomplish this by creating a custom RA which normally acts as a
pass-through and calls the "real" RA.  However, it intercepts "monitor"
actions, checks NFS, and if NFS is down it returns success; otherwise it
passes the monitor action through to the real RA.  If NFS fails while the
monitor action is in flight, the custom RA can intercept the failure, check
whether NFS is down, and if so change the failure to a success.
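
A minimal sketch of that wrapper idea, assuming a hypothetical real-agent path and NFS mountpoint (both would come from your deployment); the bounded `stat` keeps the wrapper itself from hanging on a dead mount:

```shell
#!/bin/sh
# Hypothetical pass-through wrapper: delegates every action to the
# "real" agent, but reports success for monitor while NFS is down, so
# the cluster does not start recovery for collateral NFS damage.
REAL_RA="${REAL_RA:-/usr/lib/ocf/resource.d/vendor/real-agent}"  # assumption
NFS_MOUNT="${NFS_MOUNT:-/mnt/nfs}"                               # assumption

nfs_is_up() {
    # Bounded filesystem probe: a hung NFS mount cannot hang the wrapper.
    timeout 5 stat -f "$NFS_MOUNT" >/dev/null 2>&1
}

wrapper() {
    action="$1"
    shift
    if [ "$action" = "monitor" ] && ! nfs_is_up; then
        return 0   # OCF_SUCCESS: do not call the real RA, it would hang
    fi
    rc=0
    "$REAL_RA" "$action" "$@" || rc=$?
    # In-flight case: NFS died while the real monitor was running.
    if [ "$rc" -ne 0 ] && [ "$action" = "monitor" ] && ! nfs_is_up; then
        rc=0
    fi
    return "$rc"
}

# A real script would end with:  wrapper "$@"
```

Note the pre-check deliberately skips the real monitor when NFS is already known to be down, since calling it would hang until the operation timeout.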

On Mon, Aug 20, 2018 at 3:51 AM Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> Hi!
>
> I wonder whether it's possible to run a monitoring op only if some
> specific resource is up.
> Background: We have some resource that runs fine without NFS, but the
> start, stop and monitor operations will just hang if NFS is down. In effect
> the monitor operation will time out, the cluster will try to recover,
> calling the stop operation, which in turn will time out, making things
> worse (i.e.: causing a node fence).
>
> So my idea was to pause the monitoring operation while NFS is down (NFS
> itself is controlled by the cluster and should recover "rather soon" TM).
>
> Is that possible?
> And before you ask: No, I have not written that RA that has the problem; a
> multi-million-dollar company wrote it. (Years before, I had written a monitor
> for an HP-UX cluster that did not have this problem, even though the
> configuration files were read from NFS. It's not magic: just periodically
> copy them to shared memory, and read the config from shared memory.)
>
> Regards,
> Ulrich
>
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Q: ordering for a monitoring op only?

2018-08-20 Thread Jan Pokorný
On 20/08/18 10:51 +0200, Ulrich Windl wrote:
> I wonder whether it's possible to run a monitoring op only if some
> specific resource is up.
> Background: We have some resource that runs fine without NFS, but
> the start, stop and monitor operations will just hang if NFS is
> down. In effect the monitor operation will time out, the cluster
> will try to recover, calling the stop operation, which in turn will
> time out, making things worse (i.e.: causing a node fence).
> 
> So my idea was to pause the monitoring operation while NFS is down
> (NFS itself is controlled by the cluster and should recover "rather
> soon" TM).
> 
> Is that possible?
> And before you ask: No, I have not written that RA that has the
> problem; a multi-million-dollar company wrote it. (Years before,
> I had written a monitor for an HP-UX cluster that did not have this
> problem, even though the configuration files were read from NFS.
> It's not magic: just periodically copy them to shared memory, and
> read the config from shared memory.)

Sorry for stating the likely obvious; in a similar spirit, if the agent
at hand allows configuring the config location, you can synchronize
the shared copy into offline node-local mirrors, e.g. using csync2.
The problem then boils down to whether the "cluster-approved,
synchronized and fresh" version is what actually gets used.
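
For illustration, a minimal csync2 group for such node-local mirroring might look like this (hostnames, key location and the config directory are placeholders, not taken from the thread):

```
# Hypothetical /etc/csync2.cfg fragment
group app_config {
    host node1;
    host node2;
    key /etc/csync2/csync2.key;
    include /etc/myapp/;
    auto younger;
}
```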

It doesn't look like there's any silver bullet; any attempt to work
around the "holistic integrity" (which is, on its own, the native
approach with Pacemaker; anything else is swimming against the stream)
may bite you and affect HA at some possibly unanticipated point.

If you don't want to, or cannot, meddle with the resource agents (wrap
the call-outs, etc.), your best bet is to ask the respective
author/vendor to honour OCF_CHECK_LEVEL [1] properly in the "monitor"
action, meaning that no file-based traversal (which could get stuck on
an NFS access) is attempted by default (level "0"), though it could be
with a level of "10" or more; and then do not artificially set it to
higher levels in your configuration (or conditionalize it similarly to
what Ken suggested).  Admittedly, this won't fix the "stop" issues,
for instance.

[1] 
https://github.com/ClusterLabs/OCF-spec/blob/42697cc9fd716173c7da6fa67148dd579282da96/ra/1.0/resource-agent-api.md#parameters-specific-to-the-monitor-action
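
For illustration, a monitor action that honours OCF_CHECK_LEVEL in that way might look like this (the daemon pattern and config path are assumptions):

```shell
# Hypothetical sketch of a monitor action honouring OCF_CHECK_LEVEL.
# The default level 0 performs only a process check and never touches
# the (possibly NFS-backed) filesystem; the deeper check runs only on
# explicit request.
DAEMON_PATTERN="${DAEMON_PATTERN:-myapp-daemon}"   # assumption
CONFIG_FILE="${CONFIG_FILE:-/mnt/nfs/myapp.conf}"  # assumption

monitor() {
    # Cheap liveness check first -- no filesystem access involved.
    pgrep -f "$DAEMON_PATTERN" >/dev/null || return 7   # OCF_NOT_RUNNING

    case "${OCF_CHECK_LEVEL:-0}" in
        0)
            return 0 ;;                          # OCF_SUCCESS, cheap path
        *)
            # Level 10+: deeper check that may block on a dead NFS mount.
            [ -r "$CONFIG_FILE" ] || return 1    # OCF_ERR_GENERIC
            return 0 ;;
    esac
}
```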

-- 
Nazdar,
Jan (Poki)




Re: [ClusterLabs] Q: ordering for a monitoring op only?

2018-08-20 Thread Ken Gaillot
On Mon, 2018-08-20 at 10:51 +0200, Ulrich Windl wrote:
> Hi!
> 
> I wonder whether it's possible to run a monitoring op only if some
> specific resource is up.
> Background: We have some resource that runs fine without NFS, but the
> start, stop and monitor operations will just hang if NFS is down. In
> effect the monitor operation will time out, the cluster will try to
> recover, calling the stop operation, which in turn will time out,
> making things worse (i.e.: causing a node fence).
> 
> So my idea was to pause the monitoring operation while NFS is down
> (NFS itself is controlled by the cluster and should recover "rather
> soon" TM).
> 
> Is that possible?

A possible mitigation would be to set on-fail=block on the dependent
resource monitor, so if NFS is down, the monitor will still time out,
but the cluster will not try to stop it. Of course then you lose the
ability to automatically recover from an actual resource failure.
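
In crm shell syntax, for instance (resource name, agent and timings are made up for illustration):

```
primitive vendor-app ocf:vendor:agent \
    op monitor interval=30s timeout=60s on-fail=block
```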

The only other thing I can think of probably wouldn't be reliable: you
could put the NFS resource in a group with an ocf:pacemaker:attribute
resource. That way, whenever NFS is started, a node attribute will be
set, and whenever NFS is stopped, the attribute will be unset. Then,
you can set a rule using that attribute. For example you could make the
dependent resource's is-managed property depend on the node attribute
value. The reason I think it wouldn't be reliable is that if NFS
failed, there would be some time before the cluster stopped the NFS
resource and updated the node attribute, and the dependent resource
monitor could run during that time. But it would at least diminish the
problem space.
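
A sketch of that group in crm shell syntax (all names are hypothetical; ocf:pacemaker:attribute sets the node attribute named by its name parameter to active_value while running and to inactive_value when stopped):

```
primitive p-nfs ocf:heartbeat:nfsserver
primitive p-nfs-attr ocf:pacemaker:attribute \
    params name=nfs-up active_value=1 inactive_value=0
group g-nfs p-nfs p-nfs-attr
```

The dependent resource's is-managed meta-attribute would then live in a meta_attributes block carrying a rule on the nfs-up node attribute, so the resource is unmanaged wherever the attribute reports NFS as down.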

Probably any dynamic solution would have a similar race condition --
the NFS will be failed in reality for some amount of time before the
cluster detects the failure, so the cluster could never prevent the
monitor from running during that window.

> And before you ask: No, I have not written that RA that has the
> problem; a multi-million-dollar company wrote it. (Years before,
> I had written a monitor for an HP-UX cluster that did not have this
> problem, even though the configuration files were read from NFS.
> It's not magic: just periodically copy them to shared memory, and
> read the config from shared memory.)
> 
> Regards,
> Ulrich
-- 
Ken Gaillot 


Re: [ClusterLabs] Q: ordering for a monitoring op only?

2018-08-20 Thread Kristoffer Grönlund
On Mon, 2018-08-20 at 10:51 +0200, Ulrich Windl wrote:
> Hi!
> 
> I wonder whether it's possible to run a monitoring op only if some
> specific resource is up.
> Background: We have some resource that runs fine without NFS, but the
> start, stop and monitor operations will just hang if NFS is down. In
> effect the monitor operation will time out, the cluster will try to
> recover, calling the stop operation, which in turn will time out,
> making things worse (i.e.: causing a node fence).
> 
> So my idea was to pause the monitoring operation while NFS is down
> (NFS itself is controlled by the cluster and should recover "rather
> soon" TM).
> 
> Is that possible?

It would be a lot better to fix the problem in the RA that causes it
to fail when NFS is down, I would think?

> And before you ask: No, I have not written that RA that has the
> problem; a multi-million-dollar company wrote it. (Years before,
> I had written a monitor for an HP-UX cluster that did not have this
> problem, even though the configuration files were read from NFS.
> It's not magic: just periodically copy them to shared memory, and
> read the config from shared memory.)
> 
> Regards,
> Ulrich
-- 

Cheers,
Kristoffer



[ClusterLabs] Q: ordering for a monitoring op only?

2018-08-20 Thread Ulrich Windl
Hi!

I wonder whether it's possible to run a monitoring op only if some specific 
resource is up.
Background: We have some resource that runs fine without NFS, but the start, 
stop and monitor operations will just hang if NFS is down. In effect the 
monitor operation will time out, the cluster will try to recover, calling the 
stop operation, which in turn will time out, making things worse (i.e.: causing 
a node fence).

So my idea was to pause the monitoring operation while NFS is down (NFS itself
is controlled by the cluster and should recover "rather soon" TM).

Is that possible?
And before you ask: No, I have not written the RA that has the problem; a
multi-million-dollar company wrote it. (Years before, I had written a monitor
for an HP-UX cluster that did not have this problem, even though the
configuration files were read from NFS. It's not magic: just periodically copy
them to shared memory, and read the config from shared memory.)
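
The trick described above can be sketched in a few lines of shell (paths are illustrative): a periodic job refreshes a tmpfs copy of the NFS-hosted config, and consumers only ever read the local copy, so they never block on a dead NFS server.

```shell
# Hypothetical refresher, run periodically (cron, systemd timer, ...).
NFS_CONF="${NFS_CONF:-/mnt/nfs/myapp.conf}"       # assumption
LOCAL_CONF="${LOCAL_CONF:-/dev/shm/myapp.conf}"   # assumption

refresh_config() {
    # Bounded copy; on failure (NFS hung or gone) keep the old local copy.
    if timeout 5 cp "$NFS_CONF" "$LOCAL_CONF.tmp" 2>/dev/null; then
        mv "$LOCAL_CONF.tmp" "$LOCAL_CONF"        # atomic replace
    fi
}

read_config() {
    # Consumers read only the tmpfs copy -- never the NFS path directly.
    cat "$LOCAL_CONF"
}
```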

Regards,
Ulrich

