Re: [ClusterLabs] Q: ordering for a monitoring op only?
You could accomplish this by creating a custom RA which normally acts as a pass-through and calls the "real" RA. However, it intercepts "monitor" actions, checks NFS, and if NFS is down it returns success; otherwise it passes the monitor action through to the real RA. If NFS fails while a monitor action is in flight, the custom RA can intercept the failure, check whether NFS is down, and if so change the failure to a success.

On Mon, Aug 20, 2018 at 3:51 AM Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
> Hi!
>
> I wonder whether it's possible to run a monitoring op only if some
> specific resource is up.
> Background: We have some resource that runs fine without NFS, but the
> start, stop and monitor operations will just hang if NFS is down. In
> effect the monitor operation will time out, the cluster will try to
> recover, calling the stop operation, which in turn will time out,
> making things worse (i.e.: causing a node fence).
>
> So my idea was to pause the monitoring operation while NFS is down (NFS
> itself is controlled by the cluster and should recover "rather soon" TM).
>
> Is that possible?
> And before you ask: No, I have not written the RA that has the problem; a
> multi-million-dollar company wrote it. (Years before, I had written a
> monitor for HP-UX's cluster that did not have this problem, even though
> the configuration files were read from NFS. It's not magic: just
> periodically copy the files to shared memory, and read the config from
> shared memory.)
> Regards,
> Ulrich

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
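[Editor's note] The pass-through idea above could be sketched roughly as follows. The real-RA path and the NFS mount point are hypothetical placeholders, and a complete RA would also have to implement meta-data, validate-all, and so on; this only shows the interception logic:

```shell
#!/bin/sh
# Sketch of a pass-through wrapper RA. REAL_RA and NFS_MOUNT are
# hypothetical placeholders, not the vendor's actual paths.
REAL_RA="${REAL_RA:-/usr/lib/ocf/resource.d/vendor/bigapp}"
NFS_MOUNT="${NFS_MOUNT:-/srv/nfs/appconfig}"

nfs_is_down() {
    # stat through a hard timeout so a hung NFS mount cannot block the RA
    ! timeout 5 stat "$NFS_MOUNT" >/dev/null 2>&1
}

wrapper_dispatch() {
    action="$1"; shift
    case "$action" in
    monitor)
        if nfs_is_down; then
            return 0    # OCF_SUCCESS: mask the result while NFS is down
        fi
        "$REAL_RA" monitor "$@"
        ;;
    *)
        # everything else (start, stop, meta-data, ...) passes through
        "$REAL_RA" "$action" "$@"
        ;;
    esac
}

# The real script would end with:  wrapper_dispatch "$@"; exit $?
```

Note this only helps when the NFS outage is visible before or after the vendor RA runs; a monitor already hung inside the vendor code still has to time out.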
Re: [ClusterLabs] Q: ordering for a monitoring op only?
On 20/08/18 10:51 +0200, Ulrich Windl wrote:
> I wonder whether it's possible to run a monitoring op only if some
> specific resource is up.
> Background: We have some resource that runs fine without NFS, but
> the start, stop and monitor operations will just hang if NFS is
> down. In effect the monitor operation will time out, the cluster
> will try to recover, calling the stop operation, which in turn will
> time out, making things worse (i.e.: causing a node fence).
>
> So my idea was to pause the monitoring operation while NFS is down
> (NFS itself is controlled by the cluster and should recover "rather
> soon" TM).
>
> Is that possible?
> And before you ask: No, I have not written the RA that has the
> problem; a multi-million-dollar company wrote it. (Years before, I had
> written a monitor for HP-UX's cluster that did not have this problem,
> even though the configuration files were read from NFS. It's not
> magic: just periodically copy the files to shared memory, and read the
> config from shared memory.)

Sorry for stating the likely obvious; in a similar spirit, if the agent at hand allows configuring the config location, you can synchronize the shared copy into offline node-local mirrors, e.g. using csync2. The problem then boils down to whether the "cluster-approved, synchronized and fresh" version is what actually gets used.

It doesn't look like there's any silver bullet. Any attempt to work around "holistic integrity" (on its own the native approach with Pacemaker; anything else is swimming against the stream) may bite you/affect HA at some possibly unanticipated point.

If you don't want to, or cannot, mangle the resource agents (wrap call-outs, etc.), your best bet is to ask the respective author/vendor to honour OCF_CHECK_LEVEL [1] in the "monitor" action properly, meaning that no file-based traversal (possibly getting stuck on NFS access) would be attempted by default (level "0"; it could be at a level of "10" or more), and not to set it artificially to higher levels in your configuration (or to conditionalize it similarly to what Ken suggested). Apparently, this won't fix "stop" issues, for instance.

[1] https://github.com/ClusterLabs/OCF-spec/blob/42697cc9fd716173c7da6fa67148dd579282da96/ra/1.0/resource-agent-api.md#parameters-specific-to-the-monitor-action

--
Nazdar, Jan (Poki)
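[Editor's note] The csync2 route might look roughly like this; the hostnames, key file, and config path below are invented for the example, and each node would run `csync2 -x` periodically (cron or a systemd timer) to propagate changes:

```
# /etc/csync2.cfg sketch -- hostnames, key and path are placeholders
group appconf {
    host node1 node2;
    key /etc/csync2.key_appconf;
    include /var/lib/bigapp/config;
    auto younger;   # on conflict, the more recently changed copy wins
}
```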
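[Editor's note] On the RA side, honouring OCF_CHECK_LEVEL as suggested might look like this sketch. The exit-code values and both check helpers are stand-ins for what `ocf-shellfuncs` and the real agent would provide; the point is that the default depth (0) never touches the filesystem:

```shell
# Stand-ins for values normally provided by ocf-shellfuncs
OCF_SUCCESS=0; OCF_ERR_GENERIC=1; OCF_NOT_RUNNING=7

check_process_running() {       # hypothetical: e.g. kill -0 against a PID file
    return 0
}
check_config_files_readable() { # hypothetical: touches files, so may hit NFS
    timeout 5 test -r /srv/nfs/appconfig/app.conf   # hypothetical path
}

bigapp_monitor() {
    depth="${OCF_CHECK_LEVEL:-0}"
    # level 0 (the default): process check only -- no filesystem access,
    # so it cannot hang on a dead NFS mount
    check_process_running || return "$OCF_NOT_RUNNING"
    if [ "$depth" -ge 10 ]; then
        # deeper levels may touch files (and thus NFS), only on request
        check_config_files_readable || return "$OCF_ERR_GENERIC"
    fi
    return "$OCF_SUCCESS"
}
```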
Re: [ClusterLabs] Q: ordering for a monitoring op only?
On Mon, 2018-08-20 at 10:51 +0200, Ulrich Windl wrote:
> Hi!
>
> I wonder whether it's possible to run a monitoring op only if some
> specific resource is up.
> Background: We have some resource that runs fine without NFS, but the
> start, stop and monitor operations will just hang if NFS is down. In
> effect the monitor operation will time out, the cluster will try to
> recover, calling the stop operation, which in turn will time out,
> making things worse (i.e.: causing a node fence).
>
> So my idea was to pause the monitoring operation while NFS is down
> (NFS itself is controlled by the cluster and should recover "rather
> soon" TM).
>
> Is that possible?

A possible mitigation would be to set on-fail=block on the dependent resource's monitor: if NFS is down, the monitor will still time out, but the cluster will not try to stop the resource. Of course, you then lose the ability to recover automatically from an actual resource failure.

The only other thing I can think of probably wouldn't be reliable: you could put the NFS resource in a group with an ocf:pacemaker:attribute resource. That way, whenever NFS is started, a node attribute will be set, and whenever NFS is stopped, the attribute will be unset. You can then write a rule using that attribute; for example, you could make the dependent resource's is-managed property depend on the node attribute's value.

The reason I think it wouldn't be reliable is that if NFS failed, some time would pass before the cluster stopped the NFS resource and updated the node attribute, and the dependent resource's monitor could run during that time. But it would at least diminish the problem space. Probably any dynamic solution has a similar race condition: NFS will have actually failed for some amount of time before the cluster detects the failure, so the cluster can never completely prevent the monitor from running during that window.
> And before you ask: No, I have not written the RA that has the
> problem; a multi-million-dollar company wrote it. (Years before, I had
> written a monitor for HP-UX's cluster that did not have this problem,
> even though the configuration files were read from NFS. It's not
> magic: just periodically copy the files to shared memory, and read the
> config from shared memory.)
>
> Regards,
> Ulrich

--
Ken Gaillot
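[Editor's note] In crm shell syntax, Ken's two suggestions might look roughly like the following; resource and attribute names are invented for the example, and the is-managed rule itself would go into a rule-based meta_attributes set in the CIB:

```
# (1) a monitor timeout on the dependent resource blocks instead of
#     triggering stop/recovery
primitive p-bigapp ocf:vendor:bigapp \
    op monitor interval=30s timeout=60s on-fail=block

# (2) a node attribute ("nfs-up") that tracks whether NFS is running,
#     set/unset as the group starts and stops
primitive p-nfs-up ocf:pacemaker:attribute \
    params name=nfs-up active_value=1 inactive_value=0
group g-nfs p-nfs-server p-nfs-up
```

As Ken says, a rule keyed on `nfs-up` can then gate is-managed on the dependent resource, with the caveat that the attribute lags behind a real NFS failure.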
Re: [ClusterLabs] Q: ordering for a monitoring op only?
On Mon, 2018-08-20 at 10:51 +0200, Ulrich Windl wrote:
> Hi!
>
> I wonder whether it's possible to run a monitoring op only if some
> specific resource is up.
> Background: We have some resource that runs fine without NFS, but the
> start, stop and monitor operations will just hang if NFS is down. In
> effect the monitor operation will time out, the cluster will try to
> recover, calling the stop operation, which in turn will time out,
> making things worse (i.e.: causing a node fence).
>
> So my idea was to pause the monitoring operation while NFS is down
> (NFS itself is controlled by the cluster and should recover "rather
> soon" TM).
>
> Is that possible?

It would be a lot better to fix the problem in the RA which causes it to fail when NFS is down, I would think?

> And before you ask: No, I have not written the RA that has the
> problem; a multi-million-dollar company wrote it. (Years before, I had
> written a monitor for HP-UX's cluster that did not have this problem,
> even though the configuration files were read from NFS. It's not
> magic: just periodically copy the files to shared memory, and read the
> config from shared memory.)
>
> Regards,
> Ulrich

--
Cheers,
Kristoffer
[ClusterLabs] Q: ordering for a monitoring op only?
Hi!

I wonder whether it's possible to run a monitoring op only if some specific resource is up.

Background: We have some resource that runs fine without NFS, but the start, stop and monitor operations will just hang if NFS is down. In effect the monitor operation will time out, the cluster will try to recover, calling the stop operation, which in turn will time out, making things worse (i.e.: causing a node fence).

So my idea was to pause the monitoring operation while NFS is down (NFS itself is controlled by the cluster and should recover "rather soon" TM).

Is that possible?

And before you ask: No, I have not written the RA that has the problem; a multi-million-dollar company wrote it. (Years before, I had written a monitor for HP-UX's cluster that did not have this problem, even though the configuration files were read from NFS. It's not magic: just periodically copy the files to shared memory, and read the config from shared memory.)

Regards,
Ulrich