On Tue, 2019-07-23 at 07:57 +0200, Ulrich Windl wrote:
> > > > Ken Gaillot <[email protected]> wrote on 22.07.2019 at
> > > > 18:14 in message
> > <[email protected]>:
> > On Mon, 2019-07-22 at 15:45 +0200, Ulrich Windl wrote:
> > > Hi!
> > >
> > > My RA actually sends OCF_ERR_ARGS if checking the args detects a
> > > problem. But as the error can sometimes be resolved without
> > > changing the args (e.g. providing some resource by other means),
> > > I suspect CRM does not handle that properly, even after a
> > > resource cleanup.
> > >
> > > My RA logs any parameter check, and I can see that no parameter
> > > check is being performed...
> > >
> > > I also noticed that the "invalid parameter" persists on a node
> > > even after restarting pacemaker on that node.
> >
> > Pacemaker treats OCF_ERR_ARGS as a "hard" failure, meaning it won't
> > be retried on the same node. But it should attempt to start on any
> > other eligible nodes.
>
> This makes _some_ sense: If the parameters are unacceptable
> (OCF_ERR_ARGS) it really makes no sense to retry (like having
> specified a host name that does not exist).
> However there are _two_ events that may change the state:
>
> 1) If the parameters (e.g. hostname) are changed
>
> 2) If the configuration outside the cluster was changed (e.g. making
> the hostname valid now)
>
> In the light of 2) I don't really see why a resource cleanup really
> does not reset the error condition. That is really unexpected.
>
> >
> > The failure should be cleared by either cleanup or pacemaker
> > restart.
>
> According to my impression a cleanup did not change the condition but
> a cluster node restart did.

If a cleanup doesn't take care of it, something's going wrong.

> > That's the mystery here. I can't even imagine how it would be
> > possible to survive a pacemaker restart -- are you sure it wasn't
> > simply a new attempt getting the same result?
>
> According to the logs of my RA there were fewer parameter checks than
> expected, and the only explanation to me was that the result was
> cached somewhere.
>
> > >
> > > So:
> > > # crm_resource -r prm_idredir_test -VV start
> > > warning: unpack_rsc_op_failure: Processing failed start of
> > > prm_idredir_test on h02: invalid parameter | rc=2
> > >
> > > (Start was not even tried)
> > >
> > > Eventually I was able to start the resource. Some other process
> > > had a socket address in use that my resource needed...
> >
> > Since you control the RA, you might want to set exit reasons, which
> > will be shown in the status display (the exitreason='' in your
> > output below). There's an ocf_exit_reason convenience function, e.g.
> >
> >   ocf_exit_reason "Some other process has the socket address in use"
> >   exit $OCF_ERR_ARGS
>
> Oh, this must be rather new ;-)
>
> Since when is that available?
>
> Regards,
> Ulrich

If you consider 2014 new :)

Of course it always takes a little longer to find its way into
distributions.
-- 
Ken Gaillot <[email protected]>
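
For anyone wiring the exit-reason suggestion into their own agent, here is a minimal sketch of how it might look. It is not taken from Ulrich's RA: validate_socket_free, socket_in_use and the SOCKET_BUSY variable are hypothetical names used only for illustration. A real agent gets ocf_exit_reason and the OCF_* return codes from ocf-shellfuncs; the fallbacks below are stand-ins so the fragment can run by itself.

```shell
#!/bin/sh
# Sketch only. In a real agent, ocf-shellfuncs provides these values and
# ocf_exit_reason; the defaults below are used only when they are missing.
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_ARGS:=2}"    # matches "invalid parameter | rc=2" in the log above

if ! command -v ocf_exit_reason >/dev/null 2>&1; then
    # Stand-in for the real helper: just print the reason to stderr.
    ocf_exit_reason() {
        echo "exit reason: $*" >&2
    }
fi

# Hypothetical placeholder check. A real agent would probe the actual
# socket address here; this version reads a variable so the sketch stays
# runnable anywhere.
socket_in_use() {
    [ "${SOCKET_BUSY:-no}" = "yes" ]
}

# Hypothetical validation step: record a human-readable reason before
# returning the error code, so the status display can show it.
validate_socket_free() {
    if socket_in_use; then
        ocf_exit_reason "Some other process has the socket address in use"
        return "$OCF_ERR_ARGS"
    fi
    return "$OCF_SUCCESS"
}
```

In the agent's start or validate action this would be called as something like `validate_socket_free || exit $?`, so the reason is recorded just before the agent exits with the error code.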
