>>> Dejan Muhamedagic <[email protected]> wrote on 26.06.2015 at 10:20 in
message <[email protected]>:

[...]
>> First, I think if a resource cannot be started on a node it's better to
>> return OCF_ERR_INSTALLED rather than OCF_NOT_RUNNING, because it does not
>> make any sense to try to start the resource on that particular node. Then,
>> how would you
>
> It is not always that simple. The part deemed not installed
> at probe time may reappear later. For instance, some
> deployments have software on an NFS mount (I can recall that it
> was the case with SAP) and that NFS mount may not be available at
> the time.

Well, if you use NFS client-mounted filesystems in resources, the NFS client 
should be started before the cluster. Once (hard-)mounted, the filesystems 
should be there, even if the server fails. If the server is unavailable at 
mount time and you are using background mounts, you may be right (the 
mountpoints might appear later).
If you are providing and using NFS in the same cluster (maybe especially when 
providing /home), things may become tricky...

[A very annoying thing with SAP is the extensive use of NFS; if there is an 
NFS problem, the RAs time out and the cluster thinks the system is not 
running. To make things worse, a restart of the service will just hang (while 
waiting for NFS access), and when the stop times out, the cluster schedules 
node fencing (killing even more resources)...]
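
One way to at least take the fencing out of that chain (at the price of the 
cluster just sitting and waiting for an admin) might be "on-fail=block" on 
the stop operation. A crmsh sketch, with resource name and timeouts made up 
(and params omitted):

    # on stop timeout, block the resource instead of fencing the node:
    crm configure primitive p_sap ocf:heartbeat:SAPInstance \
        op monitor interval=120 timeout=300 \
        op stop interval=0 timeout=600 on-fail=block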

Still I think it's better to return "not installed" if it is clear the 
resource won't start (or stop), and to "reprobe" later once things have 
changed, rather than to report "not running" and cause multiple start 
attempts that will surely fail. Opinions may vary; this is mine...
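
With Pacemaker's tools such a reprobe can be requested once the missing piece 
is back, e.g. (node and resource names made up):

    # forget probe results and probe again on one node:
    crm_resource --reprobe --node node1
    # or just clear the state of a single failed resource:
    crm_resource --cleanup --resource p_sap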


> 
> So, it's safer to return OCF_NOT_RUNNING, and that is what quite a
> few RAs do.

I don't know whether it is "safer", but it's simpler for sure.
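
An agent can actually tell a probe from a regular monitor, so it could report 
both ways. A minimal sketch (hypothetical "myapp" agent, binary path made 
up), using the ocf-shellfuncs helpers:

    : ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
    . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

    MYAPP_BIN="/sapmnt/bin/myapp"  # on NFS, may be absent at probe time

    myapp_monitor() {
        if [ ! -x "$MYAPP_BIN" ]; then
            # probe (monitor with interval=0): the mount may appear
            # later, so report "not running" instead of a hard error
            ocf_is_probe && return $OCF_NOT_RUNNING
            # regular monitor: a missing binary is an installation problem
            return $OCF_ERR_INSTALLED
        fi
        pgrep -f "$MYAPP_BIN" >/dev/null && return $OCF_SUCCESS
        return $OCF_NOT_RUNNING
    }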

[I once wrote a monitor for SAP that carefully avoided accessing NFS directly 
(it used asynchronous sub-processes to read from NFS into shared memory). It 
reported three states (this was a different cluster system): not running, 
unknown, running.
Especially on slow systems that was important, because a transition from "not 
running" to "unknown" might mean "starting", while a transition from 
"running" to "unknown" might mean "stopping". In the "unknown" case my agent 
simply kept rechecking until some timeout... Definitely this was not simple, 
but being the result of several years of evolution (and system failures), it 
was quite "safe".]
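
The asynchronous part could be sketched roughly like this in shell (not the 
original code; paths made up, tmpfs standing in for shared memory):

    nfs_state_async() {
        state=/dev/shm/myapp.state
        # only this child may block on a dead NFS server; the rename keeps
        # the last good state readable while the read hangs:
        ( cat /sapmnt/status >"$state.new" && mv "$state.new" "$state" ) &
        child=$!
        ( sleep 5; kill $child 2>/dev/null ) &  # watchdog: give up after 5s
        watchdog=$!
        wait $child; rc=$?
        kill $watchdog 2>/dev/null
        # non-zero (killed or failed) means "unknown"; keep the old state
        return $rc
    }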

Regards,
Ulrich